Rationale 1: "Preferred form of modification" is not satisfied.
---------------------------------------------------------------

Without the original training data or training software, the kinds of possible modifications are very limited. Take LLMs: fine-tuning a pre-trained LLM through LoRA typically does not require the original training data or training software, but fine-tuning is not the only way to modify a model. For example, when one needs to change the tokenizer (e.g., to add support for a new language), the context window size, or the position encoding, or to improve the model architecture, the AI model alone is not enough.

Treating "fine-tuning" (or other kinds of secondary development) as the only "preferred form of modification" effectively excludes the minority, namely the power users who are able to understand, modify, maintain, improve, and iterate on the AI model at a deeper or even fundamental level. Thus, the "preferred form of modification" is not satisfied by the AI model file alone (without the original training data or training software).

This also connects to the "freedom to change and improve" the AI model. Without the original training data or training software, the ways to change and improve the AI model are very limited.

Rationale 2: Training data and program are the "Source code" (DFSG #2).
-----------------------------------------------------------------------

If we treat emacs.c as the input, gcc as the processing software, and the emacs ELF binary executable as the output, then emacs.c is the source code: it is the "preferred form of modification" of the emacs ELF binary. Likewise, if we treat the training data as the input, the training software as the processing software, and the trained AI model as the output, then the training data is the "source code" of the AI model, and the training data plus the training software is the "preferred form of modification" of the AI model.

Moreover, if a user would like to study and edit the "source code" of an AI model the way the original author does, that "source code" is the training data and the training software, not the AI model itself (a pile of matrices and vectors).

Rationale 3: Reproducibility is not satisfied.
----------------------------------------------

It is impossible to reproduce the original author's work (the pre-trained AI model) without the original training data or training software. Here "reproduce" means to produce an AI model whose performance and behavior are very similar or identical to those of the original author's released AI model.

The definition of "reproducibility" can be ambiguous. Collecting alternative training data and writing new training software based on the information provided by the author of the pre-trained AI model is sometimes called "reproducing a work" in some contexts, but it is in fact an imitation of the original work that creates a new work, rather than a reproduction of the original work.

Rationale 4: Safety, Security, Bias, and Ethics Issues.
-------------------------------------------------------

Without the original training data or training software, the security patching mechanism is limited to a binary diff on the AI model file, or to simply replacing the old AI model with a brand new one. Nobody except the original author can understand such a security update.
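As a concrete illustration, the most a downstream distributor can do with such an update is to diff the two weight files tensor by tensor. The following is a minimal Python sketch (the checkpoint filenames are hypothetical; it assumes the safetensors and PyTorch libraries): the diff shows what changed, not why.

    import torch
    from safetensors.torch import load_file

    # Hypothetical filenames: the originally released weights and the
    # "security-patched" weights later shipped by the upstream author.
    old = load_file("model-v1.safetensors")
    new = load_file("model-v2.safetensors")

    for name in sorted(set(old) | set(new)):
        if name not in old or name not in new:
            print(f"{name}: tensor added or removed")
        elif old[name].shape != new[name].shape:
            print(f"{name}: shape changed "
                  f"{tuple(old[name].shape)} -> {tuple(new[name].shape)}")
        elif not torch.equal(old[name], new[name]):
            delta = (new[name].float() - old[name].float()).abs().mean().item()
            print(f"{name}: values changed, mean |delta| = {delta:.6g}")

    # The "patch" is visible only as changed numbers.  Without the training
    # data and training software, there is no way to audit what behavior
    # the update actually fixes or introduces.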
If we encounter a safety/bias/ethics issue where the AI model produces content that is harmful to society, such as discrimination against a certain group of people or against a certain type of endeavor, patching will be needed -- but patching at that fundamental level can only be done by the original author, not by downstream distributors. For security issues (e.g., when an AI model takes a role in making decisions that can lead to real-world impact and hence to security risks), there is not yet a CVE (Common Vulnerabilities and Exposures) system for AI models. When we do face such security issues, security patching of these AI models at the fundamental level can, again, only be done by the original author, not by downstream distributors.

Rationale 5: The freedom to study is broken.
--------------------------------------------

Take LLMs: without the original training data, it is impossible to study whether the AI model leverages GPL-licensed data, or even to verify whether the model is trained on legal data at all. It is likewise impossible to study how the AI model's outputs are affected by GPL-licensed data, for example whether the model will copy GPL-licensed data verbatim in its outputs without citing the source or providing the license information (a minimal check of this kind is sketched at the end of this rationale). If this kind of "study", particularly where it involves GPL-licensed data, seems too harsh a requirement, we may need to revisit the definition of "study". That said, as the MSFT/NYT case is not yet settled, we should put the "fair use" issue aside for now. At least, "the freedom to verify the license of training data" does not rely on the "fair use" issue.

When copyrighted data unfortunately ends up in the training data by accident, directly removing that portion of the training data is effective for avoiding legal risks, but it is challenging to remove its influence from the AI model (a pile of vectors and matrices) directly and cleanly. This again goes back to the "preferred form of modification" issue.
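To make the verbatim-copying point above concrete: if (and only if) the original training corpus were available, checking whether a model output reproduces GPL-licensed training text verbatim could be as simple as the following Python sketch (the corpus layout, file names, and marker string are assumptions, not a real dataset). Without the corpus, even this basic form of study is impossible.

    from pathlib import Path

    GPL_MARKER = "GNU General Public License"

    def find_verbatim_sources(model_output: str, corpus_dir: str, chunk_len: int = 200):
        """Yield (path, is_gpl) for corpus files containing a long verbatim
        chunk of the model output."""
        chunk = model_output[:chunk_len]
        if not chunk:
            return
        for path in Path(corpus_dir).rglob("*.txt"):  # hypothetical corpus layout
            text = path.read_text(errors="ignore")
            if chunk in text:
                yield path, GPL_MARKER in text

    # Hypothetical usage: `output` stands in for text generated by the model.
    output = "text generated by the AI model goes here"
    for path, is_gpl in find_verbatim_sources(output, "training-corpus/"):
        print(path, "(GPL-licensed source)" if is_gpl else "")

A real study would of course need fuzzier matching and proper license detection, but the point stands: without the training data, such a study cannot even begin.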