Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.

Spent many years on Reddit and then some time on kbin.social.

  • 0 Posts
  • 70 Comments
Joined 6M ago
Cake day: Mar 03, 2024


Yes, that would also be statistical correlations to an AI model. The specific kind of information they’re being trained on doesn’t affect the underlying mechanism of model training.


Looking forward to the “Waymo robotaxis become silent killers stalking the night” headlines once the fix is implemented.


I run tabletop roleplaying adventures and LLMs have proven to be great “brainstorming buddies” when planning them out. I bounce ideas back and forth, flesh them out collaboratively, and have the LLM speak “in character” to give me ideas for what the NPCs would do.

They’re not quite up to running the adventure themselves yet, but it’s an awesome support tool.


It’s impossible to run an AI company “ethically” because “ethics” are such a wibbly-wobbly and subjective thing, and because there are people who simply wish to use it as a weapon on one side of a debate or the other. I’ve seen goalposts shift around quite a lot in arguments over “ethical” AI.


A surprising number of “file formats” these days are really just zip files with a standard for the filenames and folders contained within. There’s likely a ton of wonderful secrets like these to be found in the collective dataspace of humanity.
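To illustrate the point, here is a minimal Python sketch that checks whether a blob of bytes is really a zip archive and, if so, lists what is inside. Formats such as .docx, .epub, and .jar are well-known examples of this pattern. The function names and the demo container layout (`mimetype`, `content/document.xml`) are made up for illustration and are not taken from any particular format specification.

```python
import io
import zipfile


def list_container_entries(data: bytes) -> list[str]:
    """Return the entry names if `data` is a zip archive, else an empty list."""
    buffer = io.BytesIO(data)
    # is_zipfile accepts a file-like object and looks for the zip end-of-central-directory record.
    if not zipfile.is_zipfile(buffer):
        return []
    with zipfile.ZipFile(buffer) as archive:
        return archive.namelist()


def build_demo_container() -> bytes:
    """Build a tiny, hypothetical 'file format' that is just a zip with agreed-upon filenames."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/x-demo")
        archive.writestr("content/document.xml", "<doc/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(list_container_entries(build_demo_container()))
    print(list_container_entries(b"just some plain text"))
```

Pointing the same function at the bytes of a real document file should show whether it follows this convention, though what the entries mean depends entirely on the format in question.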


Shush, this is an opportunity for people to dump on Microsoft, if you take it from them they’ll turn on you.


Especially because seeing the same information in different contexts helps mapping the links between the different contexts and helps dispel incorrect assumptions.

Yes, but this is exactly the point of deduplication - you don’t want identical inputs, you want variety. If you want the AI to understand the concept of cats you don’t keep showing it the same picture of a cat over and over, all that tells it is that you want exactly that picture. You show it a whole bunch of different pictures whose only commonality is that there’s a cat in it, and then the AI can figure out what “cat” means.
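As a rough sketch of what the simplest form of deduplication looks like, the Python below drops byte-identical samples by hashing each one and keeping only the first occurrence. This is an assumption-laden illustration, not how any particular AI trainer does it: real pipelines typically also need near-duplicate detection (resized or re-encoded copies, for instance), which this does not attempt. The function name and sample data are hypothetical.

```python
import hashlib


def deduplicate(samples: list[bytes]) -> list[bytes]:
    """Keep the first occurrence of each byte-identical sample, preserving order."""
    seen: set[str] = set()
    unique: list[bytes] = []
    for sample in samples:
        # Identical content always produces the same digest, so repeats are skipped.
        digest = hashlib.sha256(sample).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


if __name__ == "__main__":
    # Stand-ins for image files: two distinct "cat pictures" and one exact repeat.
    dataset = [b"cat-photo-A", b"cat-photo-B", b"cat-photo-A"]
    print(len(deduplicate(dataset)))
```

The variety stays in the data set (both distinct pictures survive); only the exact repeat is removed.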

They need to fundamentally change big parts of how learning happens and how the algorithm learns to fix this conflict.

Why do you think this?


There actually isn’t a downside to de-duplicating data sets, overfitting is simply a flaw. Generative models aren’t supposed to “memorize” stuff - if you really want a copy of an existing picture there are far easier and more reliable ways to accomplish that than giant GPU server farms. These models don’t derive any benefit from drilling on the same subset of data over and over. It makes them less creative.

I want to normalize the notion that copyright isn’t an all-powerful fundamental law of physics like so many people seem to assume these days, and if I can get big companies like Meta to throw their resources behind me in that argument then all the better.


Remember when piracy communities thought that the media companies were wrong to sue switch manufacturers because of that?

It baffles me that there’s such an anti-AI sentiment going around that it would cause even folks here to go “you know, maybe those litigious copyright cartels had the right idea after all.”

We should be cheering that we’ve got Meta on the side of fair use for once.

look up sample recover attacks.

Look up “overfitting.” It’s a flaw in generative AI training that modern AI trainers have done a great deal to resolve, and even in the cases of overfitting it’s not all of the training data that gets “memorized.” Only the stuff that got hammered into the AI thousands of times in error.


Training an AI does not involve copying anything so why would you think that fair use is even a factor here? It’s outside of copyright altogether. You can’t copyright concepts.

Downloading pirated books to your computer does involve copyright violation, sure, but it’s a violation by the uploader. And look at what community we’re in, are we going to get all high and mighty about that?


Does that mean that a country that imports 100% of the oil it burns should be counted as having no emissions?


What did I say that implied that? I’m pointing out a contradiction in kilgore’s comment, I’m not adding anything of my own here.


Their distribution of books is completely legal.

Corporations just have more money to warp the laws in their favour.

You just contradicted yourself in two sentences.


But I think the law is pretty clear, and a precedent calling their use case fair use would be mind blowing. You need new, much more common sense IP legislation that redefines consumer rights in a digital world.

Indeed. I’m a big supporter of IA’s mission, and I’m a big supporter of piracy (copyright has gone insane over the years), but this outcome was obvious from the moment IA did this and it was a mistake for them to fight this fight. They should focus on preservation. Let the EFF handle the lawsuits, and let Library Genesis handle the illegal distribution of books. Everyone focus on what they’re best at.


They’re appealing the decision so there’s still opportunity for IA to throw good money after bad on this.


But when you die and an AI company contacts all your grieving friends and family to offer them access to an AI based on you (for a low, low fee!)

You can stop right there, you’re just imagining a scenario that suits your prejudices. Of all the applications for AI that I can imagine that would be better served by a model that is entirely under my control this would be the top of the list.

With that out of the way the rest of your rhetorical questions are moot.


Even with that, being absolutist about this sort of thing is wrong. People undergoing surgery have spent time on heart/lung machines that breathe for them. People sometimes fast for good reasons, or get IV fluids or nutrients provided to them. You don’t see protestors outside of hospitals decrying how humans aren’t meant to be kept alive with such things, though, at least not in most cases (as always there are exceptions, the Terri Schiavo case for example).

If I want to create an AI substitute for myself it is not anyone’s right to tell me I can’t because they don’t think I was meant to do that.


I don’t believe humans are “meant” to do anything. We are a result of evolution, not intentional design. So I believe humans should do whatever they personally want to do in a situation like this.

If you have a loved one who does this and you don’t feel comfortable interacting with their AI version, then don’t interact with their AI version. That’s on you. But don’t belittle them for having preferences different from your own. Different people want different things and deal with death in different ways.


If you don’t want to do it then don’t do it. Can we stop trying to tell everyone else they have to have the same values as you?


If their goal is to prevent AI trainers from scraping their art then an open federated platform is the opposite of what they want.


It also has an expensive back end and no plans for any kind of monetization, so it’s dead in the water from that side too. The moment they’re successful they’re broke.


If they feel less need to add proper alt-text because peoples’ browsers are doing a better job anyway, I don’t see why that’s a problem. The end result is better alt text.


I would expect it’d be not too hard to expand the context fed into the AI from just the pixels to including adjacent text as well. Multimodal AIs can accept both kinds of input. Might as well start with the basics though.


The Fediverse doesn’t have any defenses against AI impersonators though, aside from irrelevance. If it gets big the same incentives will come into play.


You don’t think LLMs are being trained off of this content too? Nobody needs to bother “announcing a deal” for it, it’s being freely broadcast.


Sure, but having a smoking section in Tim Hortons isn’t going to change that. I’d think it’d make it more likely for smokers to throw their butts out in a manner that can be properly disposed of, rather than making them smoke outside.

Note that I’m not advocating smoking or smoking sections. Smoking is awful on many levels and I’d rather see it go away entirely. I’m just pointing out that it’s ridiculous to say having a smoking section in Tim Hortons is going to have a significant impact on the environment. Jumping to “this is going to fuck the planet” is crying wolf. It’s going to result in people either getting sick of environmentalism or, more subtly problematic, thinking they’re making a difference when they’re not. The plastic straw ban, for example. Plastic straws were never a major contributor to ocean plastic waste. By far the largest contributor to ocean plastic waste is discarded fishing equipment, but I don’t see any campaigning to reduce seafood consumption. People banned straws instead and then thought they’d accomplished something.


Even this post showing it off feels like satire, how exactly does a room with cigarette smoke in it “fuck the planet?” Do people think cigarette smoke emissions are remotely relevant on a global scale? Poe’s law is strong in this one.



But you’re claiming that there’s already no ladder. Your previous paragraph was about how nobody but the big players can actually start from scratch.

Adding cost only makes the threshold higher. The opposite of the way things should be going.

All this aside from the conceptual flaws of such legislation. You’d be effectively outlawing people from analyzing data that’s publicly available

How? This is a copyright suit.

Yes, and I’m saying that it shouldn’t be. Analyzing data isn’t covered by copyright, only copying data is covered by copyright. Training an AI on data isn’t copying it. Copyright should have no hold here.

Like I said in my last comment, the gathering of the data isn’t in contention. That’s still perfectly legal and anyone can do it. The suit is about the use of that data in a paid product.

That’s the opposite of what copyright is for, though. Copyright is all about who can copy the data. One could try to sue some of these training operations for having made unauthorized copies of stuff, such as the situation with BookCorpus (a collection of ebooks that many LLMs have trained on that is basically pirated). But even in that case the thing that is a copyright violation is not the training of the LLM itself, it’s the distribution of BookCorpus. And one detail of piracy that the big copyright holders don’t like to talk about is that generally speaking downloading pirated material isn’t the illegal part, it’s uploading it, so even there an LLM trainer might be able to use BookCorpus. It’s whoever it is that gave them the copy of BookCorpus that’s in trouble.

Once you have a copy of some data, even if it’s copyrighted, there’s no further restriction on what you can do with that data in the privacy of your own home. You can read it. You can mulch it up and make paper mache sculptures out of it. You can search-and-replace the main character’s name with your own, and insert paragraphs with creepy stuff. Copyright is only concerned with you distributing copies of it. LLM training is not doing that.

If you want to expand copyright in such a way that rights-holders can tell you what analysis you can and cannot subject their works to, that’s a completely new thing and it’s going down a really weird and dark path for IP.


They’re the ones training “base” models. There are a lot of smaller base models floating around these days with open weights that individuals can fine-tune, but they can’t start from scratch.

What legislation like this would do is essentially let the biggest players pull the ladders up behind them - they’ve got their big models trained already, but nobody else will be able to afford to follow in their footsteps. The big established players will be locked in at the top by legal fiat.

All this aside from the conceptual flaws of such legislation. You’d be effectively outlawing people from analyzing data that’s publicly available to anyone with eyes. There’s no basic difference between training an LLM off of a website and indexing it for a search engine, for example. Both of them look at public data and build up a model based on an analysis of it. Neither makes a copy of the data itself, so existing copyright laws don’t prohibit it. People arguing for outlawing LLM training are arguing to dramatically expand the concept of copyright in a dangerous new direction it’s never covered before.


I don’t think you’re familiar with the sort of resources necessary to train a useful LLM up from scratch. Individuals won’t have access to that for personal use.


You realize that if cases like this are won then only the “giant fucking corporations” are going to be able to afford the datasets to train AI with?


As I said, the “traditional” CDLs were also in a legal grey area. But once the publishers are suing IA for going full Library Genesis anyway, why not also include those?

I went back to one of the older articles I could find on this subject, from before the lawsuit was filed. Some particularly-relevant excerpts:

Until this week, the Open Library only allowed people to “check out” as many copies as the library owned. If you wanted to read a book but all copies were already checked out by other patrons, you had to join a waiting list for that book—just like you would at a physical library.

Of course, such restrictions are artificial when you’re distributing digital files. Earlier this week, with libraries closing around the world, the Internet Archive announced a major change: it is temporarily getting rid of these waiting lists.

James Grimmelmann, a legal scholar at Cornell University, told Ars that the legal status of this kind of lending is far from clear—even if a library limits its lending to the number of books it has in stock. He wasn’t able to name any legal cases involving people “lending” digital copies of books the way the Internet Archive was doing.

The legal basis for the Open Library’s lending program may be even shakier now that the Internet Archive has removed limits on the number of books people can borrow. The benefits of this expanded lending during a pandemic are obvious. But it’s not clear if that makes a difference under copyright law. “There is no specific pandemic exception” in copyright law, Grimmelmann told Ars.

Ironically the FAQ that Internet Archive put online has been taken down, but I found it in their Wayback Machine. It says:

The library will have suspended waitlists through June 30, 2020, or the end of the US national emergency, whichever is later. After that, waitlists will be dramatically reduced to their normal capacity, which is based on the number of physical copies in Open Libraries.

So it seems pretty clear to me that by “suspending waitlists” it means that they’re going to “lend” more copies simultaneously than they actually have.

The Internet Archive had been poking a bear with a stick for years and the bear had been grumbling but not otherwise responding. So they decided to try giving it a whack across the nose with the stick instead. Normally I’d just sigh and shake my head at their stupidity, but they’re carrying a precious cargo on their back while they’re needlessly provoking that bear, and now they’re screaming “oh no my precious cargo! Help me!” while the bear has a firm grip on their leg. That makes me extra frustrated and angry at them for doing this.

I’m not siding with the bear here, I should be very clear. The publishers are awful, the whole concept of copyright has become corrupt and broken, and so on and so forth. But the Internet Archive isn’t supposed to be fighting this fight. They were supposed to be protecting that precious cargo, and provoking the bear is the opposite of doing that.


The emergency library followed the same legal framework that ebook lending follows at local libraries.

No, it did not. From the Wikipedia article:

On March 24, 2020, as a result of shutdowns caused by the COVID-19 pandemic, the Internet Archive opened the National Emergency Library, removing the waitlists used in Open Library and expanding access to these books for all readers.

Emphasis added. They took the limits off.

What the libraries do is already in a legal grey area, the publishers just don’t go after it because it’s more trouble than it’s worth and would bring bad press. Like how most rightsholders ignore fanfiction. But the IA went way beyond that and smacked them in the face.

Don’t blame IA for fulfilling their mission to make knowledge free.

Their mission is archiving the Internet, a mission that they are putting at risk with this stunt.


I expect them to not provoke the $200 billion lawsuit in the first place. They should never have done the “Emergency Library”, it was an obvious boneheaded decision.

Then, once they had done it and the inevitable lawsuit came down on them, they should have tried to settle the lawsuit. Not fight to the bitter end, not double down. They’re only making it worse for themselves. It’s not simply losing the lawsuit that could destroy them, it’s refusing to negotiate.


Oh, for crying out loud, Internet Archive. This is not the fight you should be fighting.

The Internet Archive is the steward of an incredibly valuable repository of archived information. Much of what it’s got squirrelled away is likely unique, irreplaceable historical records of things that have otherwise been lost. And they’re risking all of that in this quixotic battle to share books that are widely available anyway and not at all at risk.

“Lending” out those books in the way that they did was blatant copyright violation, spitting directly into the eye of publishers known to be litigious and vindictive. All to fight for a point that’s not part of their mandate, which is archiving the Internet. They’re going to lose and it’s going to hurt them badly.

Each copy can only be loaned to one person at a time, to mimic the lending attributes of physical books.

Internet Archive believes that its approach falls under fair use but publishers Hachette, HarperCollins, John Wiley, and Penguin Random House disagree. They filed a lawsuit in 2020 equating IA’s controlled digital lending operation to copyright infringement.

That is not what the lawsuit was about, Internet Archive. If you’re going to fight this fight then be honest about what exactly you’re fighting for. The lawsuit in 2020 was not about one-person-at-a-time lending, it was about your “COVID Emergency Library” where you removed all restrictions and let people download books freely.

I strongly believe that copyright has gone berserk over the decades and grown like an uncontrolled weed, harming the intellectual commons for the sake of megacorporations’ profits. I’m a subscriber on this piracy community, after all. I believe in the position that Internet Archive is fighting for here, despite all the downvotes I’m surely about to be hammered with. But they shouldn’t be the ones fighting it. Let someone else take this one on. Sci-Hub or Library Genesis, maybe.



Over the past month I feel like all I’ve been doing is writing tech design documents for systems I don’t actually know anything about because I haven’t had the opportunity to go in and do anything with them.

Fortunately I’ve finally managed to reach the point where everyone agrees that we should just start implementing the basics and see how that goes rather than try to plan it all out ahead of time since we’re surely going to have to throw out the later plans once we see what we’re actually dealing with.


Ah, we’re doing one of those full circle things. I actually remember the time when AOL was “the internet.”