Meta Admits Use of 'Pirated' Book Dataset to Train AI * TorrentFreak
torrentfreak.com
Meta admits in court that it used portions of the Books3 dataset to train its Llama models. This dataset includes many pirated books.

@rufus@discuss.tchncs.de
English
30
10M

AI is just too much of a hype. Every company invests millions into AI and all new products need to “have AI”. And then everybody also needs to file lawsuits. I mean rightly so if Meta just pirated the books, but that’s not a problem with AI, but plain old piracy.

I was pretty sure OpenAI or Meta didn’t license gigabytes of books correctly for use in their commercial products. Nice that Meta has now admitted to it. I hope their “Fair Use” argument works and in the future we can all “train AI” with our “research dataset” of 40GB of ebooks. Maybe I’m even going to buy another hard disk and see if I can train an AI on 6 TB of TV series, all the Marvel movies and a broad MP3 collection.

Btw, there was no denying it anyway. Meta wrote a scientific paper about their LLaMA model in March of last year, and they clearly listed all of their sources, including Books3. Other companies aren’t that transparent, and even less so as of today.

Fucking capitalists

Metal Zealot
English
4
10M

In the age of the internet, nothing is truly yours.

Just look at NFTs.

@maynarkh@feddit.nl
English
12
10M

How are NFTs relevant?

@fiah@discuss.tchncs.de
English
5
10M

they aren’t, except perhaps as a counterexample of some dubious sort

They were supposedly anchors to claim ownership of things in the real world.

CC BY-NC-SA 4.0

buckykat [none/use name]
English
3
10M

Marking all your comments CC BY-NC-SA is a good bit.

The point of NFTs (beyond the pyramid scheme) was to enforce artificial digital scarcity at the individual level

@SomeGuy69@lemmy.world
English
2
10M

They sold snake oil, nothing else.

The Snark Urge
English
5
10M

They’re fancy receipts, and if people thought of them as just that, it might be a technology with some limited non-monetary uses. But the crypto grift was too strong.

ohno my copyright!!! How will the publisher megacorps now make a record quarter??? Think of the shareholders!

That’s not the takeaway you should be having here. It’s that a megacorp felt that they should be allowed to create new content from someone else’s work, both without their permission and without paying.

ok, fair; but do consider the context that the models are open weight. You can download them and use them for free.

There is a slight catch, though, which I’m very annoyed at: it’s not actually Apache. It’s this weird license where you can use the model commercially up until you have 700M monthly users, at which point you have to request a custom license from Meta. ok, I kinda understand them not wanting companies like ByteDance or Google using their models just like that, but Mistral has their models on Apache-2.0 open weight, so the context should definitely be reconsidered, especially for Llama 3.

It’s kind of a thing right now: publishers don’t want models trained on their books “because it breaks copyright”, even though the model doesn’t actually remember copyrighted passages from the book. Many arguments hinge on the publishers being mad that you can prompt the model to repeat a copyrighted passage, which it can do. IMO this is a bullshit reason.

anyway, will be an interesting two years as (hopefully) copyright will get turned inside out :)

I really have to thank you for an educated response

@trebuchet@lemmy.ml
English
-10
10M

Lemmy sure loves copyright and intellectual property once you change who the pirate is.

@cecilkorik@lemmy.ca
English
30
10M

Almost like the context matters and the world isn’t entirely made up of black and white binary choices because we’re not robots or computers and discrete logic does not apply to human moral arguments.

@Steve@startrek.website
English
2
10M

Preposterous

@trebuchet@lemmy.ml
English
-5
10M

Conveniently, these moral arguments that are freed from the confines of discrete logic also allow people on /c/piracy to ignore the rules when justifying their own piracy, and still condemn others they already happen to dislike when they do piracy.

sour
6
10M

because company and individual are same

@trebuchet@lemmy.ml
English
1
10M

So IP law for individuals = bad, but IP law for corporations = good is the general argument here?

Is there a principled basis for this argument?

It seems like a lot of artists, like musicians or novelists, rely almost entirely on earnings from selling their works to individuals. Wouldn’t a legal regime like the one you’re advocating basically make producing art for real people a lot less lucrative by comparison, and drive those artists into making corporate art and marketing materials?

sour
2
10M

does only selling to individual prevent company from pirating

@eskimofry@lemmy.world
English
6
10M

That’s like saying everyone should let people enjoy their kinks, and then you come in and say "aha, then pedophilia is allowed, ya?"

FaceDeer
-3
10M

The current top whipping boy is AI, apparently. “AI must be bad” is the highest level assumption, so apparently even in this piracy community that overrides the usual “copyright must be bad” assumption.

Or is it actually “Meta must be bad?” I’ve lost track of who the Five Minutes Hate is supposed to be directed at lately.

AdmiralShat
English
5
10M

You have a very small pool of thinking capacity

FaceDeer
-4
10M

I’ve lost track because I don’t care who the whipping boy is supposed to be. I form my own opinions.

AdmiralShat
English
2
10M

Wow, lol, that one went way over your head.

I called you stupid because of what you said. There is no universal whipping boy. You also struggle with reading comprehension, pretty severely.

I always find it so weird how the people who scream “I FORM MY OWN OPINIONS” are usually the dumbest, with the least formed opinions. You need to use that as a buffer because you don’t have a thought-out opinion but you’re afraid of not being a part of the conversation.

sour
1
10M

facebook is bad

@eskimofry@lemmy.world
English
1
10M

Ralph Waldo Emerson:

"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." His point was that only small-minded men refused to rethink their prior beliefs.

@trebuchet@lemmy.ml
English
-3
10M

So what you’re saying is this episode has caused you/others here on /c/piracy to rethink your prior beliefs, and now you see some value in the copyright legal regime?

@eskimofry@lemmy.world
English
0
10M

Not really. We believe in what we believe. You’re the goblin who sticks to consistency.

Piracy is a service problem. If you want it to disappear, corporate greed has got to disappear.

I do wonder how it shakes out. If the case establishes that a license must be acquired to use copyrighted material, then maybe the license I’m setting on my comments might land commercial AI companies in hot water too, which I’d love. Open-source AI models FTW

CC BY-NC-SA 4.0

@jarfil@beehaw.org
English
9
10M

That license would require the AI model to only output content under the same license. Not sure if you realize, but commercial use is part of the Open Source Definition:

https://opensource.org/osd/

Your content would just get filtered out from any training dataset.

As for going against commercial companies… maybe you are a lawyer, otherwise good luck paying the fees.

@Banzai51@midwest.social
English
1
10M

This is the least shocking revelation.

@Whom@midwest.social
English
88
10M

That’s fine, just let the rest of us do the same.

@jaden@lemmy.zip
English
-2
10M

Yeah, too much of this thread is so hypocritical. Either copying stuff should be free or it shouldn’t.

Fushuan [he/him]
English
6
10M

Actually, I’d prefer that piracy by individual users be considered fair use, but piracy by corporations not be. So them pirating is not fine, but us pirating should be.

@MonkderZweite@feddit.ch
English
26
10M

Welp, whole trained dataset got DMCAed, right? And a nonsensical fine, right?

@bartolomeo@suppo.fi
English
72
10M

“We didn’t do it, and if we did it was fair use, and if it wasn’t progress will be hampered if rules and regulations are too strict.”

archomrade [he/him]
English
48
10M

Nationalize AI or tax it to fund UBI, and none of this is an issue.

Armok: God of Blood
English
13
10M

Best idea I’ve heard in a year. Automation should benefit humanity as a whole.

@dumpsterlid@lemmy.world
English
-8
10M

What a bunch of losers, thinking they are making the future…… by stealing from as many artists as they can? How do you convince yourself you are doing the right thing when what you are doing is scaling up the theft of art from small artists to a tech company sized operation?

And how much oxygen has been wasted over the years by music companies pushing the narrative that “stealing” from artists with torrenting is wrong? This is so much worse than stealing (and a million times worse than torrenting) though because the point of the theft is to destroy the livelihood of the artist who was stolen from and turn their art into a cheap commodity that can be sold as a service with the artist seeing none of the monetary or cultural reward for their work.

@Kissaki@feddit.de
English
8
10M

Did you just make a contradictory argument for both sides?

Is your distinction that piracy by individuals gives cultural recognition while that of corporations doesn’t?

If you think piracy is warranted, at the cost of artists/creators, how is a generalized AI that makes it available and more accessible as a cultural abstracted good different?

@dumpsterlid@lemmy.world
English
0
10M

Because I don’t see a strong argument for piracy coming at a direct, immutable cost to artists. I also don’t see a strong argument that piracy reduces the chance fans will pay for art when the art is made decently easy to purchase and is being sold at a reasonable price. Of course there are complexities to this discussion but ultimately when you compare it to massive corporations wholesale stealing massive amounts of works of art with the specific intention of undercutting and destroying the value of said art by attempting to commodify it I think the difference is pretty clear. One of these things is a morally arguable choice by one individual, the other is class warfare by the rich.

Joe Schmo torrents an album from a band they like; maybe they buy the album in the future, or go to one of the band’s concerts and buy merch. Joe Schmo hasn’t mined some economic gain out of a band and then moved on; Joe Schmo has become more of a committed fan because they love the album. Meta steals from a band so that they can create an algorithm that produces knockoff versions of the band’s music, which Meta can sell to, say, a company making a commercial that wants music in that style but would prefer not to pay an actual human artist an actual fair price for the music. These are not the same.

(AI doesn’t create convincing fake songs yet necessarily, but you get my point as it applies to other art that AI can create convincing examples of, books and writing being a prime example)

nevernevermore
3
10M

I’m going to imagine it’s because that cultural abstracted good is then put behind a paywall, which OP will then also pirate, thus fulfilling the prophecy.

FaceDeer
3
10M

What a bunch of losers, thinking they are making the future…… by stealing from as many artists as they can?

Are you aware of which community this is posted in?

Meta stealing intellectual property and utilizing it for corporate gain is not the same as normal users pirating content. They are so far apart that it warrants its own discussion and cannot be lumped in together.

@dumpsterlid@lemmy.world
English
1
10M

I didn’t realize at first, my bad. I realize that makes a lot of my post redundant but I think my point still stands.

So much hypocrisy that a massive corporation can actually steal like this and it is more socially acceptable than torrenting.

And that’s the issue I have in particular. It’s a double standard and, not only that, they’re using it to generate money for their own tools.

It’s not the same as some kid pirating Photoshop to play around with, or a couple who are curious about GoT and want to watch it without paying HBO.

This is a separate issue, and I hate that this place is so Reddit-like that trying to talk about it gets “hurrr dur I guess you’re mad because AI and meta are just the current hate train circle jerk hurrr i form my own opinions hurr”

Like, no, I’m upset because this is a whole new topic of piracy use.

@j4k3@lemmy.world
English
-1
10M

I’m not upset, because I think it is totally irrelevant: training AI does not reproduce any works, and it is no different from a person who reads or sees said works and then talks about them or creates in their style.

At the core, this amounts to thought policing as the final distilled issue if this is given legal precedent. It would be a massive regression of fundamental human rights with terrible long-term implications. This is no different from how allowing companies to own your data and manipulate you has directly led to a massive regression of human rights over the last 25 years. Reacting like foolish Luddites to a massive change that seems novel in the moment will have far-reaching consequences that most people lack the fundamental logic skills to put together in their minds.

In practice, offline AI is like having most of the knowledge of the internet readily available for your own private use in a way that is custom tailored to each individual. I’m actually running large models on my own computer daily. This is not hypothetical, or hyperbole; this is empirical.

@howrar@lemmy.ca
English
8
10M

I’m pretty sure “admits” implies an attempt to hide it. They’ve explicitly said in the model’s initial publication that the training set includes Books3.

Nope. Yer can feck off Zuck! Yer ain’t comin’ aboard my ship! 🏴‍☠️
