In the rapidly evolving landscape of artificial intelligence (AI), large language models (LLMs) have become indispensable tools for various applications, from natural language processing to content generation.
However, as organizations explore the integration of LLMs for commercial purposes, it's crucial to dive into the legal landscape that governs these advanced technologies. This includes multifaceted aspects such as copyrights, licensing, data privacy, sourcing, liabilities and broader AI transparency and ethics concerns.
Amid the escalating demand for sophisticated LLMs, the choice of license emerges as a critical factor in shaping accessibility, collaboration and overall impact, especially in commercial contexts. Before an organization acquires or deploys an LLM, a comprehensive exploration of the legal complexities surrounding its use is essential.
In this blog post, we'll explore the advantages and considerations of using open source licenses for different large language models for commercial purposes. We will examine license models and look more closely at the Vicuna and Llama 2 cases, to help you find the open-source license best suited to you.
Large language models can be broadly categorized into two main types: General LLMs and Custom LLMs.
General Large Language Models (LLMs) encompass models designed to execute a diverse array of language-related functions without being specifically adapted to a particular domain or application. These general LLMs, exemplified by OpenAI's GPT (Generative Pre-trained Transformer) models, undergo training on extensive datasets that capture a broad spectrum of language patterns and topics. Their versatility allows for application in tasks like text generation, language translation, summarization and more, all without the need for fine-tuning tailored to specific use cases.
In contrast, Custom Large Language Models (LLMs) are models that have undergone additional training or fine-tuning for specific applications or domains. Organizations or researchers working with custom LLMs often take a pre-existing general LLM and refine it on a dataset pertinent to a particular industry, field or application. This fine-tuning process enhances the model's performance for targeted tasks, optimizing custom LLMs for a narrower set of language functions relevant to a specific context. Consequently, this specialization makes them more effective in those specific domains.
LLMs operate within two predominant models: proprietary and open source.
Proprietary LLMs are owned by companies, require a license for usage and often come with restrictions described in the terms and conditions. Users usually have to pay a license fee and are prohibited from sharing or distributing the software or its outputs without authorization.
Open source LLMs are freely accessible to anyone, allowing improvement, modification and distribution without stringent limitations.
It is up to the company implementing an LLM to choose which direction to take.
With regard to open source models, two license models can apply: copyleft licenses and permissive licenses. Below we distinguish between the two.
Concerning copyleft licensing, there are various legal aspects that both the creators and users of copyleft-licensed works should be mindful of. Here are some legal considerations to bear in mind:
License Compatibility: It is crucial to verify that your copyleft license aligns with any other licenses applicable to your work. Certain licenses may be incompatible with each other, leading to legal complications if you attempt to merge works licensed under different terms.
Viral Effect: The viral effect of copyleft licenses dictates that any work derived from a copyleft-licensed work must also be licensed under the same copyleft terms. This can be a significant consideration for both creators and users, because it affects how the work can be used and distributed.
International Considerations: Copyleft licensing is a global phenomenon, and it is crucial to comprehend how the chosen license will be interpreted and enforced in various jurisdictions worldwide. Different countries may have distinct legal requirements and interpretations of copyleft licenses, necessitating thorough research before selecting a license.
Numerous copyleft licenses are available, such as the GNU General Public License (GPL) and the Creative Commons ShareAlike license. While these licenses come with distinct terms and conditions, they fundamentally revolve around a common principle: companies that use or modify a copyleft-licensed work are obligated to distribute their derived work under the identical license terms. Please note that what constitutes a "derived work" must be interpreted in light of the specific open source license. "Derived work" (or the term used in the copyleft license) is not necessarily as limited in scope as "derived work" would be under copyright law.
Some copyleft licenses define "derived work" as the entire product in which the open source component is used, in addition to material based on the original component. This is referred to as the so-called "strong" copyleft effect.
The intent behind incorporating copyleft clauses is to preserve the freedom granted by the open-source license for any "derived work." The underlying principle is the promotion of collective contributions to a growing repository of source code that remains open and accessible to anyone for use, commercial exploitation and further enhancement. By contrast, commercial developers typically aim to maintain the confidentiality of their entire source code to deter plagiarism and other infringements. Additionally, they often prefer licensing their products under a stringent proprietary license of their choosing. Such licenses typically only provide the right to use the product for the licensee's internal purposes, without permitting commercialization, modifications or further development. Essentially, commercial developers seek to preserve the exclusive rights granted by copyright law.
When copyleft clauses come into play, developers of "derived work" are unable to dictate the terms and conditions for licensing the "derived work." Consequently, the copyleft effect is often deemed commercially unprofitable when an open-source component forms a "derived work."
In the context of LLMs, an example of a copyleft license is GPL 3.0. The GPL 3.0 requires that any derivative works of the software be licensed under the same license. This means that if you distribute a project that incorporates GPL-3.0-licensed software, your project must also be licensed under GPL-3.0.
Using permissive open-source components typically presents fewer challenges compared to copyleft ones, as permissive licenses generally impose less strict obligations. Common permissive licenses include Apache 2.0, MIT and various BSD licenses. In general, permissive licenses grant users the right to use, copy, modify and distribute copies of the licensed source code component.
Developers can take permissive-licensed software, make it their own through changes or additions, and either keep the new version to themselves or share it if they choose to. This is a major advantage if you're looking to create proprietary software that you can sell and keep secret from competitors, and it is one of the main reasons why permissive licenses are popular.
However, these licenses often make such rights contingent on providing licensing information to the company's own licensees, including attributions of copyright owners and disclaimers. Consequently, failure to comply with this requirement could potentially render open-source license grants invalid. It's important to note that this risk applies to all open-source licenses, not just the permissive ones. Infringement of intellectual property rights may occur if open source is used without full compliance with the respective license terms.
Apache 2.0: This license requires license notifications and copyright notices on the distributed code and/or as a notice in the software. However, derivative works, larger projects or modifications are permitted to carry different licensing terms when distributed and are not required to provide source code.
MIT: This one bears the name of the famous university where it originated and is short, clear and easy to understand. It allows anyone to do whatever they wish with the original code, as long as the original copyright and license notice is included in the distributed source code or software.
Moreover, not all open-source licenses can be seamlessly combined with components licensed under other open-source licenses. For instance, it is generally assumed that a component licensed under the permissive MIT license can be integrated into a larger work licensed under the copyleft GPL license. Conversely, a component licensed under the GPL license may not be integrated into a larger piece of work intended to be licensed under the MIT license.
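The one-directional nature of this compatibility can be sketched as a small lookup. This is only an illustration under stated assumptions: the table `CAN_BE_INCLUDED_IN` and the helper names are hypothetical, the table covers just the licenses discussed in this post, and it is a simplification rather than legal advice.

```python
# Hypothetical, simplified table: for each component license, the set of
# project licenses under which a larger work containing that component
# could be distributed. Illustration only, not a complete matrix.
CAN_BE_INCLUDED_IN = {
    "MIT": {"MIT", "GPL-3.0"},             # permissive code can go into a GPL work
    "Apache-2.0": {"Apache-2.0", "GPL-3.0"},
    "GPL-3.0": {"GPL-3.0"},                # copyleft code cannot go into an MIT work
}


def can_include(component_license: str, project_license: str) -> bool:
    """Return True if, per the simplified table, a component under
    component_license may be part of a work under project_license."""
    return project_license in CAN_BE_INCLUDED_IN.get(component_license, set())


def incompatible_components(project_license: str, components: dict) -> list:
    """List the names of components whose license does not fit the project."""
    return [
        name
        for name, component_license in components.items()
        if not can_include(component_license, project_license)
    ]


if __name__ == "__main__":
    deps = {"tokenizer-lib": "MIT", "eval-toolkit": "GPL-3.0"}  # made-up names
    print(incompatible_components("MIT", deps))
```

Unknown licenses fall through to "not compatible", which is the cautious default: anything missing from the table should go to a human reviewer.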
A list of LLMs with open source licenses can be found on GitHub: https://github.com/eugeneyan/open-llms
The evolving landscape introduces innovative licensing approaches, such as the RAIL license, which combines an open access approach with behavioral restrictions. This nuanced copyright license aims to enforce responsible AI use, introducing usage-based restrictions for models like OPT, Stable Diffusion and BLOOM.
This license has certain use-based restrictions. For example, the model cannot be used for anything that violates laws and regulations, exploits or harms minors, or discriminates against or harms “individuals or groups based on social behavior or known or predicted personal or personality characteristics”. For more information, see https://www.licenses.ai/
Some models under this license are: OPT, Stable Diffusion and BLOOM
BLOOM is an open-access multilingual language model available for commercial use under the bigscience-bloom-rail-1.0 license, with restrictions on providing medical advice and interpreting medical results.
Vicuna for research purposes
Vicuna is an open-source chatbot trained by fine-tuning LLaMA. The Vicuna model card shows an Apache 2.0 license, which permits commercial use. However, the LLaMA weights are not licensed for commercial use. Real-world cases such as Vicuna highlight the complexity of licensing LLMs commercially: despite the Apache 2.0 license, the restrictions on the underlying LLaMA weights confine Vicuna to research purposes.
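The Vicuna case boils down to a simple rule: a model is only commercially usable if every layer it is built on permits commercial use. A minimal sketch of that rule follows. The `Artifact` type and helper names are hypothetical, and the license values simply restate what this post says about Vicuna and LLaMA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """One layer of a model stack (hypothetical type, for illustration)."""
    name: str
    license: str
    commercial_use_allowed: bool


def commercially_usable(layers) -> bool:
    """A stack is usable commercially only if every layer allows it."""
    return all(layer.commercial_use_allowed for layer in layers)


def blocking_layers(layers) -> list:
    """Names of the layers that prevent commercial use."""
    return [layer.name for layer in layers if not layer.commercial_use_allowed]


# Values restate the text above: Apache 2.0 on the model card,
# but the underlying LLaMA weights are not available commercially.
vicuna_stack = [
    Artifact("Vicuna fine-tune", "Apache-2.0", True),
    Artifact("LLaMA base weights", "non-commercial", False),
]

if __name__ == "__main__":
    print(commercially_usable(vicuna_stack))
    print(blocking_layers(vicuna_stack))
```

The same check extends to training data and any adapter weights: add them as further layers and the most restrictive one decides the outcome.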
Llama 2 with additional commercial restrictions
Under the Llama 2 license terms, if the total monthly active users of the products or services offered by or on behalf of the licensee, or the licensee's affiliates, exceeded 700 million in the preceding calendar month, the licensee must request a license from Meta. Meta may grant such a license at its sole discretion, and the licensee is not permitted to exercise any rights under the agreement unless and until Meta expressly grants them. This runs counter to open-source principles.
Secondly, concerning the weights: Meta does not publicly disclose the weights. To obtain a copy of the weights from Meta, you need to submit an application. Furthermore, these weights cannot be utilized for training any Language Model (LM) except Llama 2, unless you obtain explicit written approval from Meta.
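The monthly-active-user condition described above is easy to encode as a compliance check. The sketch below assumes only the 700 million figure quoted in this post. The constant and function names are hypothetical, and the license text itself remains the authoritative source for how users are counted.

```python
# Threshold quoted in the text above; names are hypothetical.
LLAMA2_MAU_THRESHOLD = 700_000_000


def needs_separate_meta_license(monthly_active_users: int) -> bool:
    """Return True if the user count for the preceding calendar month
    surpasses the threshold, meaning a license must be requested from Meta."""
    return monthly_active_users > LLAMA2_MAU_THRESHOLD


if __name__ == "__main__":
    # The count should cover the licensee and its affiliates combined.
    print(needs_separate_meta_license(12_000_000))
```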
LLM licensing should be part of a risk assessment and subject to due diligence and/or a data governance policy.
This assessment should be revisited to account for the specific risks that arise when a business develops an LLM or incorporates one into the technology it uses to conduct business or provide products or services to customers. Questions should also be asked about the sources from which the data was taken, the licensing arrangements that attach to that data and to the LLM, and the methods used to source the data. Sometimes the platforms from which the data was obtained allow, and even encourage, public access.
The legal terms governing the LLMs and the data used to train them should be thoroughly reviewed. This review is particularly important in reducing the risk that the permissions granted by a data provider or platform owner do not cover the intended use, or that the data is used in a way the terms specifically prohibit, such as for training LLMs.
For financial institutions (for example, those regulated in the UK by the Prudential Regulation Authority (PRA)) it is also a regulatory matter. The PRA has said they must ensure that they “obtain appropriate assurance and documentation from third parties on the provenance or lineage of the data to be satisfied that it has been collected and processed in line with applicable legal and regulatory requirements”.
In other cases, robust contractual protections will need to be put in place, and internal governance structures, policies, processes and controls will be necessary to take advantage of the huge potential LLMs have to transform business.
Copyleft License Caution: Careful consideration is essential when opting for a copyleft license because of the restrictions described in the copyleft section above.
Copyleft vs. Permissive: Generally, copyleft licenses impose more restrictions and possibly offer less liability compared to permissive licenses. When prioritizing code reusability and shareability, a moderately permissive license is often the preferable choice.
GPL Versions and Compatibility: The GPL license exists in two main versions—GPLv2 and GPLv3. Noteworthy differences in GPLv3 address issues not covered in GPLv2, such as patents, and enhance compatibility with other open-source licenses like the Apache License 2.0. It's crucial to recognize that GPLv2 and GPLv3 are not compatible with each other.
Advantages of MIT Licenses: MIT licenses enjoy widespread use, boasting recognition and common understanding. Software licensed under MIT entails no restrictions on redistribution or monetization, making it appealing for various applications. Additionally, MIT licenses are compatible with many other open-source licenses, enabling the use of MIT-licensed code in projects employing different licenses.
The commercial deployment of LLMs demands a nuanced understanding of licensing terms. Organizations must carefully evaluate the conditions set by providers, ensuring compliance and staying informed about evolving requirements. Rapid developments, such as open-source alternatives to initially restricted models, underscore the need for continuous monitoring in this dynamic landscape.
Finding a balance between technical advantages and legal complexities is imperative for responsible and effective implementation. A thorough understanding of licensing models, coupled with vigilant monitoring, is the key to unlocking the transformative potential of LLMs in the ever-evolving landscape of artificial intelligence.