Generative artificial intelligence (GenAI) company Anthropic has claimed to a US court that using copyrighted content in large language model (LLM) training data counts as “fair use”.
Under US law, “fair use” permits the limited use of copyrighted material without permission, for purposes such as criticism, news reporting, teaching, and research.
In October 2023, a host of music publishers including Concord, Universal Music Group and ABKCO initiated legal action against the Amazon- and Google-backed generative AI firm Anthropic, demanding potentially millions in damages for the allegedly “systematic and widespread infringement of their copyrighted song lyrics”.
Yup. Same as the way the rest of us use and learn from the internet. We basically wouldn’t have the internet as we know it if it weren’t 99% free content.
And yet, it seems when you say anything anti-ai, lemmy bites your head off.
We are allowed to have nuance, nothing is inherently good or bad. A knife can wound or make dinner.
Trying to reduce nuance lessens the public discourse, do not be tempted by lowest common denominator memery.
Whether anyone likes it or not LLMs are here and even if we strictly regulate them there will be organizations and governments that do not.
WHAT WE SHOULD be focusing on is how to prevent low effort AI content from just basically overtaking the web.
We are already mostly there.
You can’t prevent it without regulations. Companies won’t care while gaining money from it unless they’re obligated to, and even then, some won’t comply either.
BTW, that mentality of “other countries vs mine” is absurd. War crimes shouldn’t be committed by a country just because the other commits them; others bad ≠ I good.
LLMs can’t and should NOT replace a human, at least not yet (they’re not even that good either). If we can’t have guaranteed basic needs such as housing, food and healthcare or a BUI, then they should not keep leaving people without jobs because no one will be able to afford anything.
You can’t prevent it WITH regulation.
Just like illegal dumping: If it makes the company more than the fine, it is just a cost of business.
China will never agree to a limitation of tech advancement because that is their primary source of wealth, and frankly your comment shows a tragic lack of understanding on international affairs.
This isn’t ‘us good them bad’, this is ‘China has a history of ignoring technology patents and restrictions in order to gain international advantage.’ The fact that you assumed that I had petty reasons makes it clear you have nothing to contribute to this conversation.
…then maybe they shouldn’t exist. If you can’t pay the copyright holders what they’re owed for the license to use their materials for commercial use, then you can’t use ‘em that way without repercussions. Ask any YouTuber.
You might want to read this article by Kit Walsh, a senior staff attorney at the EFF, and this one by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries. YouTube’s one-sided strike-happy system isn’t the real world.
Headlines like these let people assume that it’s illegal, rather than educate them on their rights.
By and large copyright infringement is illegal. That some things aren’t infringement doesn’t change that a general stance of “if I don’t have permission, I can’t copy it” is correct. The first argument in the EFF article is effectively the title: “it can’t be copyright, because otherwise massive AI models would be impossible to build”. That doesn’t make it fair use, they just want it to become so.
The purpose of copyright is to promote the sciences and useful arts. To increase the depth, width, and breadth of the public domain. “Fair Use” is not the exception. “Fair Use” is the fundamental purpose for which copyrights and patents exist. Copyright is not the rule. Copyright is the exception. The temporary exception. The limited exception. The exception we grant to individuals for their contribution to the public.
If that is, indeed, true, and if AI is a progression of science or the useful arts, then it is copyright that must yield, not AI.
When Annas-Archive or Sci-Hub get treated the same as these giant corporations, I’ll start giving a shit about the “fair use” argument.
When people pirate to better the world by increasing access to information, the whole world gets together to try to kick them off the internet.
When giant companies with enough money to make Solomon blush pirate to make more oodles of money and not improve access to information, it’s “fAiR uSe.”
Literally everyone knew from the start that books3 was all pirated and from ebooks with the DRM circumvented and removed. It was noted when it was created it was basically the entirety of private torrent tracker Bibliotik.
You don’t see the difference between distributing someone else’s content against their will and using their content for statistical analysis? There’s a pretty clear difference between the two, especially as fair use is concerned.
AI training should not be a privilege of the mega-corporations. We already have the ability to train open source models, and organizations like Mozilla and LAION are working to make AI accessible to everyone. We can’t allow the ultra-wealthy to monopolize a public technology by creating barriers that make it prohibitively expensive for regular people to keep up. Mega corporations already have a leg up with their own datasets and predatory terms of service that exploit our data. Don’t do their dirty work for them.
Denying regular people access to a competitive, corporate-independent tool for creativity, education, entertainment, and social mobility, we condemn them to a far worse future, with fewer rights than we started with.
How am I doing their dirty work for them? I literally will stop thinking that they’re getting away with piracy for profit when we stop haranguing people who are committing to piracy for the benefit of mankind.
I’m not saying Meta should be stopped, I’m saying the prosecution of Sci-Hub and Annas-Archive needs to be stopped under the same pretenses.
If it’s okay to pirate for the purpose of making money (what we put The Pirate Bay admins in jail for), then it’s okay to pirate to benefit mankind.
There is literally no way in hell someone can convince me what Meta and others are doing is not pirating to use the data contained within to make money. What’s good for the goose is good for the gander, as they say.
I reiterate, they knew it was pirated and had DRM circumvented when they downloaded it. There was zero question of the source of this data. They knew from the beginning they intended to profit from the use of this data. How is that different than what we accused The Pirate Bay admins of?
It really feels like “Well these corporations have money to steal more prolifically than little people, so since their stealing is so big, we have to ignore it.”
Then your argument is non-falsifiable, and therefore, invalid.
Major corporations and pirates are finally on the same side for once. “Fair Use” finally has financial backing. Meta is certainly not a friend, but our interests currently align.
The worst possible outcome here is that copyright trolls manage to convince the courts that they are owed licensing fees. Next worse is a settlement that grants rightsholders a share of profits generated by AI, like they got from manufacturers of blank tapes and CDs.
Best case is that the MPAA, RIAA, and other copyright trolls get reminded that “Fair Use” is not an exception to copyright law, but the fundamental reason it exists: Fair Use is the promotion of science and the useful arts. Fair Use is the rule; Restriction is the exception.
Wow this is some powerful internet word salad, just shot gunning scientific sounding words at the wall to try to pretty up a basic internet debate. Falsifiability is about scientific hypotheses, not statements of belief. “Nothing you can say can convince me that murder isn’t wrong” may mean there’s no further use in debate, but it isn’t “non-falsifiable” in any meaningful way nor does it somehow make the argument for the immorality of murder “invalid”.
Then I misunderstood what you were saying. Carry on.
I love seeing Lemmy users trip over themselves to declare that copyrights don’t or shouldn’t exist when it comes to pirating, right up until it comes to AI. Then Copyrights are enshrined by The Constitution and all the corporations NEED to pay for them, even when they’re not actually copying anything.
And corporations want people to pay for it but they don’t want to pay for it themselves. It’s almost as if no one likes copyright, but it benefits some ppl more than others.
You do realize that lemmy contains very many users, many of whom disagree on any number of things. You are randomly assigning the opinions of lemmy’s pirate users to a random commenter without evidence that they actually hold those opinions, because it’d be convenient for you if they’re contradicting themself in any way (though the degree to which that would be a contradiction is also arguable). It’s just a way of constructing a strawman instead of engaging with your interlocutor’s actual words.
Also, part of the problem is that these LLMs very often do directly copy and spit out articles and random forum posts and etc word-for-word verbatim, or it’ll do something that’s the equivalent of a plagiarist who swaps a few words around in a sad attempt to not get caught. It becomes especially likely depending on how specific the search is, like if you look for a niche topic hardly anyone has written extensively on or for the solution to an esoteric problem that maybe just one person on a forum somewhere found an answer to. It also typically does not even give credit or link to its sources.
Plus, copyright law, if it exists, must apply to everyone, including major corporations. That’s a separate issue from whether or not copyright law needs reform (it obviously does). If you wanna abolish copyright, fine, ok, get it abolished through the government. But while copyright law is still the law, I’m not ok with giving megacorps a pass to break it legally, especially when they’re more than happy to sue random, harmless individuals for violating their own copyrights. They want the law not to apply to them because they’re rich.
The argument they’re making is just ridiculous on its face when you compare it to other crimes. If AI should be allowed to violate copyright because otherwise it can’t exist as it is, then anyone should be able to violate copyright because otherwise their cool projects won’t be able to exist. And I should be able to rob a bank because otherwise I won’t have all that money. You should be able to commit murder because otherwise your annoying coworker will keep bugging you. She should be able to walk out of a store with an iPhone without paying for it because otherwise she won’t have an iPhone. Etc. It’s an argument that says the criminal’s motivations are legal justification for the crime. “You should let me legally do the thing because otherwise I can’t do the thing” is just not a convincing argument in my book.
Already addressed in another comment.
It’s a problem they’ve acknowledged and are actively working on.
Well many people here would disagree. That was the entire point of my comment.
You do realize that there may in fact be different, distinct groups of Lemmy users with differing, potentially non-overlapping beliefs, yeah?
Sure but Lemmy also operates as a sort of hivemind. This is the top-voted post in the last 24 hours and piracy content usually makes up at least 25% of content here.
Oh, well, you’ve clearly done the kind of deep and thoughtful analysis that would allow you to determine the general opinions of all Lemmy users. My mistake. Carry on.
Just simple observation
Using copyrighted material for something you aren’t gonna make any money off of? Cool, go hog wild. If you’re gonna use some music or art that you didn’t make in something that will make you money, the folks that made whatever you used should get a cut. Not the whole cut, but a cut.
Ah, moving the goal posts, I see.
In what way? I rephrased my original comment.
If an artist falls in love with drawing and learns to draw from Jack Kirby’s work and at the beginning even imitates his style, does he owe Jack Kirby royalties for every drawing he does as he ‘learned’ on Jack’s copyrighted art?
I think in that case, no. ‘Style’ is one thing, directly using someone’s art in your own work is something else entirely. However, we’re talking about a person here, not a program developed by a company for the express purpose of making as much money as possible in the shortest amount of time. Until AI can truly demonstrate that it is truly thinking and not simply executing commands given, I don’t think the lines are blurred nearly enough to suggest that someone learning to paint and an AI trained on hundreds of thousands of pieces of art for the purpose of making money for the company that built it are remotely the same.
Didn’t read the article but boo-fucking-hoo. Pay the content creators.
To me, this reads like “Giant-ATV-Based Taxi Service Couldn’t Exist If Operators were Required to Pay Homeowners for Driving over their Houses.”
If a business can’t exist without externalizing its costs, that business should either a. not exist, or b. be forced to internalize those costs through licensing or fees. See also, major polluters.
Google and Amazon both have massive corpuses of this data that they would allow only themselves to use.
Anthropic isn’t saying this to help content creators, they’re saying this to kill OpenAI so they don’t have to actually compete
Big Company: Well if you can’t afford food you should not have food.
Also Big Company:… sobbing pwease we neeed fweee… pwease we need mowe moneys!
Then it shouldn’t exist.
This isn’t an issue of fair use. They’re stealing other people’s work and using it to create something new and then trying to profit from it, without any credit or recompense.
Now that it exists how do you propose we make it not exist?
Even if we outlaw it Russia and China won’t and without the tools to fight back against it the web is basically done as anything but a propaganda platform
Just like I do with literally all content I’ve ever consumed. Everything I’ve seen has been remashed in my brain into the competencies I charge money for.
It’s not until I profit off of someone else’s work — ie when the source of the profit is their work — that I’m breaking any rules.
This is a non-issue. We’ve let our (legitimate) fear of AI twist us into distorting truth and motivated reasoning. Instead of trying to paint AI as morally wrong, we should admit that we are afraid of it.
We’re trying to replace our fear with disgust and anger. It’s not healthy for us. AI is ultra fucking scary. And not because it’s going to take inspiration from a copyrighted song when it writes a different song. AI is ultra fucking scary because it will soon surpass any possibility of our understanding, and we will be at the whim of an alien intelligence.
But that’s too sci fi sounding, to be something people have to look at. Because it sounds so out there, it’s easy to scoff at and dismiss. So instead of acknowledging our terror at the fact this thing will likely end humanity as we know it, we’re sublimating that energy through righteous indignation. See, indignation is unpleasant, but it’s less threatening to the self than terror.
It’s understandable, like doing another line of coke is understandable. But it is not healthy, not productive, and will not play out the way we think. We need to stop letting our fear turn our minds to mush.
Reading someone else’s material before you write new material is not the same as copying someone else’s material and selling it as your own. The information on the internet has always been considered free for legal use. And the limit of legal use is based on the selling of others’ verbatim material.
This is a simple fact, easy to see. Except recognizing it nullifies the righteous indignation, opening the way for the terror and confusion to come in again.
Data Leak at Anthropic Due to Contractor Error
Interesting that Anthropic is making this argument, considering their story in the AI space. They’re certainly no OpenAI.
It doesn’t matter what business we’re talking about. If you can’t afford to pay the costs associated with running it, it’s not a viable business. It’s pretty fucking simple math.
And no, we’re not talking about “too big to fail” businesses (that SHOULD be allowed to fail, IMHO), we’re talking about AI, that thing they keep trying to shove down our throats and that we keep saying we don’t want or need.
I don’t know if you noticed this but some really big companies with high stock valuations are only existing because investors poured tons of capital into them to subsidize the service.
Uber could not do taxis cheaper than existing if they didn’t have years of free cash to artificially lower prices.
We are in the beginning of late stage capitalism: profitable companies go under due to private capital firms, and absolute Ponzi frauds get their faces on Time magazine.
Enjoy the collapse.
Exactly, they PAID MONEY to make it work. No they don’t make the money back and depend on outside capital, but they are still paying their employees (not enough) and suppliers, etc.
Yes, we are in late stage capitalism where the market eats itself.
Why do you think we have seen so much large scale fraud in the last 15 years?
Why are people publishing so much content online if they aren’t cool with people downloading it? Like, the web is an open platform. The content is there for the taking.
Until one of these AIs just starts selling other people’s work as its own, and no I don’t mean derivative work I mean the copyrighted material, nobody is breaking the rules here.
I read content online without paying for a license. I should only have to obtain a license for material I’m publishing, not material I read.
Except of course that’s not how copyright law works in general.
Of course the questions are 1) is training a model fair use and 2) are the resulting outputs derivative works. That’s for the courts to decide.
But in general, just because I publish content on my website, does not give anyone else license or permission to republish that content or create derivative works, whether for free or for profit, unless I explicitly license that content accordingly.
That’s why things like Creative Commons exists.
But surely you already knew that.
Right, but I think it’s going to be a tough legal argument that using a text to adjust database weighting links between word associations is copying or distributing any part of that work. Assuming courts understand the math/algorithms.
“Ai” as it is being marketed is less about new technical developments being utilized and more about a fait accompli.
They want mass adoption of the automated plagiarism machine learning programs by users and companies, hoping that by the time the people being plagiarized notice, it’s too late to rip it all out.
That and otherwise devalue and anonymize work done by people to reduce the bargaining power of workers.
A.I. exists. It will continue to get better. If letting people use it becomes illegal, they’ll just use it themselves and cut us out. A world where the general population have access to A.I. is the only one where we’re not totally fucked. I’m not simping for Google or Facebook, I’d much prefer an open source self hostable version. The only way we can stay competitive is if these companies continue to develop these in the open for the consumer market.
General purpose artificial intelligence will exist. Full stop. Intelligence is the most valuable resource in the universe. You’re not going to stop it from existing, you’re just going to stop them from sharing it with you.
What they have, is miles from artificial general intelligence, it is not AI in even a limited sense. It is AI in the same way a mob in a video game is AI.
Their claims to be approaching it are marketing fluff at best, and abject lies at worst.
I think if we sit here and debate the nuances of what is or is not intelligence, we will look back on this conversation and laugh at how pedantic it was. Movies have taught us that A.I. is hyper-intelligent, conscious, has its own objectives, is self aware, etc… But corporations don’t care about that. In fact, to a corporation, I’m sure the most annoying thing about intelligence right now is that it comes packaged with its own free will.
People laugh at what is being called A.I. because it’s confidently wrong and “just complicated auto-complete”. But ask your coworkers some questions. I bet it won’t be long before they’re confidently wrong about something and when they’re right, it’ll probably be them parroting something they learned. Most people’s jobs are things like: organize these items on those shelves, mix these ingredients and put it in a cup, get all these numbers from this website and put them in a spreadsheet, write a press release summarizing these sources.
Corporations already have the A.I. they need. You gatekeeping intelligence is just your ego protecting you from the truth: you, or someone dear to you, are already replaceable.
I think we both know that A.I. is possible, I’m saying it’s inevitable, and likely already at version 1. I’m sure any version of it would require access to training data. So the ruling here would translate. The only chance the general population has of keeping up with corporations in the ability to generate economic value, is to keep the production of A.I. in the public space.
Silicon Valley’s core business model has for years been to break the law so blatantly and openly, while throwing money at the problem to scale, that by the time law enforcement catches up to you you’re an “indispensable” part of the modern world. See Uber, whose own publicly published business model was for years to burn money scaling and ignoring employment law until it could drive all competitors out of business and become an illegal monopoly, thus allowing it to raise prices to the point it’s profitable.
Fucking scooters lying all over the sidewalk.
They also don’t care if the open, free internet devolves into an illiterate AI generated mess, because they need an illiterate populace that isn’t educated enough to question it anyway. They’ll still have access to quality sources of information, while ensuring the lowest common denominator will literally have garbage information being fed to them. I mean, that was already true in the sense that the clickbait news outsold serious investigative news, and so the garbage clickbait became the norm and serious journalism is hard to come by and costly.
They love increasing barriers between them and the rest of the populace, physically and mentally.
I’m all for stealing content willy-nilly but you can’t then use that theft to craft a privately “owned” mind.
I’d have no problem with “ai” if it could unionize and had to pay for rice like the rest of humanity.
These companies want to combine open theft with privately owned black boxen they can control and license out for money.
It’s enclosure of The Commons all over again.
So you’re fine with the free models Facebook and many others provide?
Because many of these LLMs can be run on your own device without paying.
I’m not fine with anything meta does and I’m not ok putting creatives out of work.
But you’re all for stealing content willy-nilly?
And this is being offered to people without it being a privately owned blackbox licensed out for money.
Feels kinda inconsistent.
Perfectly consistent. Seeming otherwise is down to a failure to grasp my position, not any inconsistency of the positions themselves.
If you steal content from creatives, does that not put them out of work?
No. Building a box that replaces them does that.
There is a difference between an individual pirating a movie and a huge private company pirating a movie and then reselling it to people.
You can debate the morality or social impacts of the former, but it is a very different question than the latter.
So it’s okay when they steal content and give it away to people for free?
Ex. Facebook gives away their LLM model for free.
https://ai.meta.com/llama/
Most things that I could talk about were already addressed by other users (specially @OttoVonNoob@lemmy.ca), so I’ll address a specific point - better models would skip this issue altogether.
The current models are extremely inefficient in their use of training data. LLMs are a good example; Claude v2.1 was allegedly trained on hundreds of billions of words. Meanwhile, it’s claimed that a 4yo child hears somewhere between 13 million and 45 million words in their still-short life. That’s roughly four orders of magnitude of difference, so even if someone claims that those bots are as smart as a 4yo*, they’re still chewing through the training data without using it efficiently.
Once this is solved, the corpus size will get way, way smaller. Then it would be rather feasible to train those models without offending the precious desire for greed of the American media mafia, in a way that still fulfils the entitlement of the GAFAM mafia.
*I seriously doubt that, but I can’t be arsed to argue this here - it’s a drop in a bucket.
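For anyone who wants to sanity-check the "four orders of magnitude" figure from the comment above, here is a quick back-of-the-envelope sketch. The child word counts are the ones quoted in the comment; "hundreds of billions" is not a precise number, so the 200 billion used here is purely an assumed placeholder, not a figure from Anthropic.

```python
import math

# Figures quoted in the comment above (claims, not measurements).
child_words_low = 13_000_000
child_words_high = 45_000_000

# "Hundreds of billions" is vague; 200 billion is an ASSUMED placeholder value.
llm_words_assumed = 200_000_000_000


def orders_of_magnitude(larger, smaller):
    """Return how many powers of ten separate the two quantities."""
    return math.log10(larger / smaller)


# Smallest gap: compare against the child who hears the most words.
gap_vs_high = orders_of_magnitude(llm_words_assumed, child_words_high)
# Largest gap: compare against the child who hears the fewest words.
gap_vs_low = orders_of_magnitude(llm_words_assumed, child_words_low)

print(f"{gap_vs_high:.1f} to {gap_vs_low:.1f} orders of magnitude")
```

With that placeholder the gap comes out at somewhere around three and a half to a bit over four orders of magnitude, so "four" is a fair rounding. A larger assumed corpus would push it higher.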
The thing is, I’m not sure at all that it’s even physically possible for an LLM to be trained like a four-year-old; they learn in fundamentally different ways. Even very young children quickly learn by associating words with concepts and objects, not by forming a statistical model of how often some meaningless string of characters comes after every other meaningless string of characters.
Similarly when it comes to image classifiers, a child can often associate a word to concept or object after a single example, and not need to be shown hundreds of thousands of examples until they can create a wide variety of pixel value mappings based on statistical association.
Moreover, a very large amount of the “progress” we’ve seen in the last few years has only come by simplifying the transformers and using ever larger datasets. For instance, GPT 4 is a big improvement on 3, but about the only major difference between the two models is that they threw near the entire text internet at 4 as compared to three’s smaller dataset.
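To make the "how often one string follows another" idea from the comment above a bit more concrete, here is a toy bigram counter. This is a deliberately crude sketch: real LLMs are neural networks over subword tokens, not lookup tables of counts, and the function names and the tiny corpus here are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word is immediately followed by each other word."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never followed."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; anything real would be vastly larger.
corpus = "the cat sat on the mat and the cat slept"
counts = train_bigram_counts(corpus)
print(most_likely_next(counts, "the"))
```

Nothing in this table "knows" what a cat is; it only records that "cat" followed "the" more often than "mat" did. That is the contrast the comment draws with a child attaching a word to an actual object.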
My point is that the current approach - statistical association - is so crude that it’ll probably get ditched in the near future anyway, with or without licencing matters. And that those better models (that won’t be LLMs or diffusion-based) will probably skip this issue altogether.
The comparison with 4yos is there mostly to highlight how crude it is. I don’t think either that it’s viable to “train” models in the same way as we’d train a human being.