Artificial intelligence (AI) has evolved from a mere buzzword into a tangible advancement with an expansive and rapidly growing market. Businesses are embracing AI to gain a competitive edge, from implementing basic AI chatbots to employing sophisticated tools such as ChatGPT, Bard, Perplexity, or Bing.
While the excitement around AI development is palpable, legal concerns often take a backseat. This article sheds light on one critical issue that companies may be overlooking: the potential risk of copyright infringement during the AI training process.
Unraveling Copyright Infringement in AI Training
To understand the crux of the matter, it helps to think of training an AI system as akin to nurturing a robotic child. The AI does not inherently develop its own intelligence; rather, it requires a substantial amount of data for training. The quality and quantity of that data directly influence the AI's capabilities, which is why data is often described as the new oil: it is the fuel that powers the AI revolution.
However, in the pursuit of high-quality data, companies may inadvertently source information from copyrighted material, or simply scrape copyright-protected data from the internet, to train their AI. Although this practice is extremely common, it could expose them to copyright infringement lawsuits. Concerns therefore arise about how AI is trained, with a specific focus on the origin of the training data.
What constitutes copyrighted material, and how is it protected? Without delving too far into technicalities, copyright protection hinges on a fundamental principle: anyone who, by his or her skill and labor, creates an original work of whatever character enjoys, for a limited period of time, copyright protection in respect of that work, and can control (among other things) its circulation, reproduction, distribution and communication.
Several copyright infringement suits have already been filed against AI creators. In one notable case, The New York Times sued Microsoft and OpenAI, alleging that the companies used the newspaper’s articles without permission to train their AI. Separately, a group of U.S. authors, including Pulitzer Prize winner Michael Chabon, sued OpenAI, alleging that it reproduced their copyrighted works without permission to train ChatGPT.
Best Practices for AI Training
Copyright infringement generally occurs when copyright-protected material is used without the permission of the copyright owner. Copyright law does, however, provide exceptions and defenses. The main question we aim to explore is whether training AI on copyrighted data constitutes copyright infringement, or whether those defenses and exceptions permit companies to use copyrighted data for AI training.
In the United States, one common defense against allegations of copyright infringement in AI training is fair use. Malaysia has a comparable but not identical concept, grounded in fair dealing principles. As AI copyright infringement cases in the US are still in their early stages, their outcomes are likely to shape how courts assess whether using copyrighted content to train AI models constitutes fair use or fair dealing.
For companies, particularly in Malaysia, the question is how to navigate the complexities of training AI while minimizing the risk of copyright infringement. The answer lies in a cautious approach and careful consideration of the following five factors:
1. Source of Data:
Understand the origin of the data to ensure it does not infringe upon subsisting copyrights or that usage rights have been secured.
2. Purpose and Character of Use:
Determine whether the use of the data is for commercial or nonprofit purposes, as this can impact the fair use argument.
3. Nature of Copyrighted Work:
Assess the nature of the copyrighted work being used and its relevance to the AI training process.
4. Amount and Substantiality of Portion Used:
Consider the quantity and significance of the copyrighted content used in relation to the overall work.
5. Effect on the Potential Market:
Evaluate how the use of copyrighted material may impact the market for the original work.
Given the legal uncertainties and potential risks, companies are advised to adopt a cautious approach when training their AI models. The safest route is to use proprietary data whenever possible. Where this is not feasible because of the sheer volume of data required, purchasing data specifically licensed for AI training purposes is a viable alternative.
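For teams that want to put the five factors above into practice, one option is to record them for each dataset before it enters a training pipeline. The following is a minimal sketch in Python, not a legal test: the names `Rights`, `DatasetRecord` and `needs_legal_review` are hypothetical, and the single rule it applies (flag anything whose rights are not clearly secured) is an illustrative assumption rather than a statement of what the law requires.

```python
from dataclasses import dataclass
from enum import Enum


class Rights(Enum):
    """Hypothetical categories for how usage rights were obtained."""
    PROPRIETARY = "proprietary"        # created and owned by the company
    LICENSED_FOR_AI = "licensed"       # licence expressly covers AI training
    UNKNOWN = "unknown"                # scraped, or origin not documented


@dataclass
class DatasetRecord:
    """One entry per dataset, mirroring the five factors in the article."""
    name: str
    source: str               # 1. where the data came from
    rights: Rights            # 1. whether usage rights have been secured
    commercial_use: bool      # 2. purpose and character of use
    nature_of_work: str       # 3. e.g. "news articles", "fiction"
    portion_used: str         # 4. e.g. "full text", "excerpts"
    market_effect_note: str   # 5. expected effect on the original's market


def needs_legal_review(record: DatasetRecord) -> bool:
    """Illustrative rule only: flag datasets whose rights are not clearly secured."""
    return record.rights is Rights.UNKNOWN


# Hypothetical example entries
internal_logs = DatasetRecord(
    name="support-tickets",
    source="internal helpdesk system",
    rights=Rights.PROPRIETARY,
    commercial_use=True,
    nature_of_work="customer support text",
    portion_used="full text",
    market_effect_note="none identified",
)
scraped_news = DatasetRecord(
    name="news-crawl",
    source="public websites",
    rights=Rights.UNKNOWN,
    commercial_use=True,
    nature_of_work="news articles",
    portion_used="full text",
    market_effect_note="may substitute for the original articles",
)

flagged = [r.name for r in (internal_logs, scraped_news) if needs_legal_review(r)]
```

A record like this does not settle whether a use is fair use or fair dealing; it simply keeps the relevant facts in one place so that they can be put in front of legal counsel.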
Conclusion
In the face of the ever-expanding AI landscape, companies must stay vigilant. Where uncertainty persists about potential copyright infringement during AI training, companies should seek the advice of an intellectual property lawyer well-versed in AI to reduce their exposure to substantial lawsuits.
As the intersection of AI and copyright law continues to evolve, companies must tread carefully. By understanding the nuances of fair use and fair dealing, diligently evaluating data sources, and seeking legal counsel when in doubt, they can navigate the intricacies of AI training while minimizing the risk of copyright infringement and legal repercussions.
If you seek further information on intellectual property, AI, copyright, or any related matters discussed in this article, we invite you to contact our team of experts. We are eager to work with you on transactional and dispute issues within the realms of AI and IP law.
About the authors
Ong Johnson
Partner
Head of Technology Practice Group
Transactions and Dispute Resolution, Technology, Media & Telecommunications, Intellectual Property, Fintech, Privacy and Cybersecurity
Halim Hong & Quek
Lo Khai Yi
Partner
Co-Head of Technology Practice Group
Technology, Media & Telecommunications, Intellectual Property, Corporate/M&A, Projects and Infrastructure, Privacy and Cybersecurity
Halim Hong & Quek