• 0 Posts
  • 14 Comments
Joined 1Y ago
Cake day: Jun 07, 2023


Hello, kids! Pirates are very bad! Never use qBittorrent to download copyrighted material, and certainly do NOT connect it to a VPN to avoid getting caught. You should also NEVER download illegal material over an HTTPS connection, because it is fully encrypted and you won’t get caught!



For the love of God please stop posting the same story about AI model collapse. This paper has been out since May, has been discussed multiple times, and the scenario it presents is highly unrealistic.

Training on the whole internet is known to produce shit model output; humans have to curate their own high-quality datasets to feed these models if they want high-quality results. That is why we have techniques like fine-tuning, LoRAs, and RLHF, as well as countless curated datasets to feed to models.

Yes, if a model for some reason were trained on the internet for several iterations, it would collapse and produce garbage. But the current frontier approach is for strong LLMs (e.g. GPT-4) to produce high-quality synthetic datasets and for new LLMs to train on those. This has been shown to work with Phi-1 (really good at writing Python code, trained on textbook-quality content generated with GPT-3.5) and Orca/OpenOrca (a GPT-3.5-level model trained on millions of examples from GPT-4 and GPT-3.5). Additionally, GPT-4 has itself likely been trained on synthetic data, and future iterations will train on more and more of it.

Notably, by selecting a narrow range of outputs, instead of the whole range, we are able to avoid model collapse and in fact produce even better outputs.
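A minimal sketch of that “narrow range” idea, with a stand-in generator and scorer in place of a real teacher LLM and quality filter (all names, prompts, and scores here are made up for illustration):

```python
import random

random.seed(0)

def generate_candidates(prompt, n=8):
    """Stand-in for sampling n completions from a teacher LLM
    (e.g. GPT-4 in the Phi-1/Orca setups described above)."""
    return [f"{prompt} -> answer v{i} (score={random.random():.2f})"
            for i in range(n)]

def quality_score(candidate):
    """Stand-in for a quality filter: a verifier model, unit tests
    for code, or heuristics. Here we just parse the fake score."""
    return float(candidate.split("score=")[1].rstrip(")"))

def build_dataset(prompts, keep_top=2):
    """Keep only the narrow top slice of outputs per prompt,
    instead of the teacher's whole output distribution."""
    dataset = []
    for p in prompts:
        ranked = sorted(generate_candidates(p), key=quality_score, reverse=True)
        dataset.extend(ranked[:keep_top])
    return dataset

data = build_dataset(["Write a Python sort", "Explain recursion"])
print(len(data))  # 2 prompts x top 2 candidates = 4 examples
```

The point is the selection step: the student never sees the teacher’s low-quality tail, which is exactly the opposite of naively training on whatever the previous model emits.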


“We have no moat, and neither does OpenAI” is the leaked document you’re talking about.

It’s a pretty interesting read. Time will tell if it’s right, but given the speed of advancements being stacked on top of each other in the open source community, I think it could be. If open source figures out scalable distributed training, I think it’s Joever for AI companies.


I don’t know what type of chatbots these companies are using, but I’ve literally never had a good experience with them, which makes no sense considering how advanced even something like OpenOrca 13B (GPT-3.5 level) is, a model that can run on a single graphics card in some company server room. Most of the ones I’ve talked to are from some random AI startup and have cookie-cutter, preprogrammed text responses that feel less like LLMs and more like a flow chart with a rudimentary classifier to select an appropriate response. We have LLMs that can handle the more complex human tasks of figuring out problems and suggesting solutions, and that can query a company database to respond correctly, but we don’t use them.
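A toy version of the flow-chart-and-classifier pattern described above, just to show how shallow it is compared to an LLM. The intents, keywords, and canned replies are all made up:

```python
# A rudimentary keyword classifier selecting from canned responses,
# i.e. the "flow chart" support bot, not an LLM.
CANNED = {
    "billing":  "Please check the Billing tab for your invoices.",
    "password": "Use the 'Forgot password' link on the login page.",
    "fallback": "Sorry, I didn't understand. Connecting you to an agent...",
}

KEYWORDS = {
    "billing":  {"invoice", "charge", "billing", "refund"},
    "password": {"password", "login", "locked", "reset"},
}

def classify(message):
    """First intent whose keyword set overlaps the message wins."""
    words = set(message.lower().split())
    for intent, kws in KEYWORDS.items():
        if words & kws:
            return intent
    return "fallback"

def reply(message):
    return CANNED[classify(message)]

print(reply("I was double charged on my invoice"))
print(reply("My order arrived damaged"))  # no keywords match: fallback
```

Anything outside the hand-written keyword lists falls straight through to the fallback, which is exactly the experience being complained about.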


This makes sense for any other company, but OpenAI is still technically a non-profit in control of the OpenAI corporation, the part that is actually a business and can raise capital. Considering Altman claims literal trillions in wealth would be generated by future GPT versions, I don’t think OpenAI the non-profit would ever sell the company part for a measly few billion.


Lmao Twitter is not that hard to create. Literally look at the Mastodon code base and “transform” it and you’re already most of the way there.


I really hate the state of the Supreme Court atm. Looking back, it wasn’t a legitimate institution from the beginning, but the current 6-3 court shows how flawed it is, being out of line with public opinion in loads of different cases and effectively legislating from the bench via judicial review.

The only reason it has gotten this bad, though, is because Congress has abdicated its responsibilities as a legislative body and left more and more to executive orders and court decisions. The entire debate around the Dobbs decision could have been avoided if Dems had codified abortion into law, and this one could have been avoided too if our Congress actually went to work legislating a solution to the ongoing student loan and college affordability crisis.

I think we need Supreme Court reform. I’m particularly partial to the idea of a rotating bench pulled randomly from the lower courts each term, with each party in Congress getting a certain number of strikes to remove people they don’t want, similar to the way jurors are selected. I also think the people should be able to overrule the court via referendum, because ultimately we should decide what the constitution says.

I just can’t see this happening though, at least for multiple decades until the younger people today get into political power.


FediSearch I guess is similar to your idea, though I think the goal would be to make a new and open search index specifically containing fediverse websites instead of just using Google. I also feel like the formatting should be more like Lemmy, with the particular post title and short description showing instead of the generic search UI.

The idea of a fediverse search is really cool though. If things like news and academic papers ever got their own fediverse-connected service, I could see a FediSearch being a great alternative to the AI sludge of Google.
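The core of such an index could be surprisingly small. A hypothetical sketch, with made-up instance names and post titles: a tiny inverted index over fediverse posts, returning Lemmy-style results (URL plus title) instead of a generic search page.

```python
from collections import defaultdict

# Made-up corpus of fediverse posts: (url, title) pairs.
posts = [
    ("lemmy.ml/post/1",    "Model collapse paper discussion"),
    ("lemmy.world/post/2", "Supreme court reform megathread"),
    ("beehaw.org/post/3",  "Open source model collapse, debunked?"),
]

# Inverted index: token -> set of post URLs containing it.
index = defaultdict(set)
for url, title in posts:
    for token in title.lower().split():
        index[token.strip("?,.")].add(url)

def search(query):
    """AND-match all query tokens against the index."""
    tokens = query.lower().split()
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)

print(search("model collapse"))
```

A real FediSearch would crawl instances via their public APIs and rank results, but the data structure underneath would look a lot like this.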


I don’t really think compulsory voting would be that beneficial for Democrats. Yes, it might boost them a few points across the board, but my intuition is that the general public leans toward Democrats while being more socially conservative than you see in online spaces. 2020 is probably the best example: super high turnout, yet Dems still squeaked by with only a +4 advantage instead of the +10 predicted by looking at far more politically engaged voters.


The comment is trying to point out, albeit obtusely, that Democrats have also funded crazy people on the opposite end of the political spectrum. In 2022 the Democrats funded far-right candidates in hopes they would win the primary and be an easy victory royale for the Dem candidate. The comment is analogizing these two things, which is fair because it is a similar political strategy.


This isn’t an actual problem. Can you train on post-ChatGPT internet text? No, but you can train on the pre-ChatGPT common crawls, the millions of conversations people have with the models and on audio, video and images. As we improve training techniques and model architectures, we will need even less of this data to train even more performant models.


The one SIMPLE trick crypto bros HATE:

Blockchain -> “Distributed Ledger”
NFT -> “Unique Identifier”

Like and share with your friends


There are some in the research community who agree with your take: The Curse of Recursion: Training on Generated Data Makes Models Forget

The long and short of that paper is that LLMs are inherently biased toward likely responses. The more their training set is LLM-generated, and thus contains that bias, the less the LLM will be able to produce unlikely responses, degrading model quality over successive generations.
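You can see the mechanism in a toy simulation (vocabulary and numbers are made up): each “generation” is just the empirical distribution of samples drawn from the previous one, so rare tokens that fail to get sampled are gone for good.

```python
import random

random.seed(42)

vocab = list(range(100))
# Zipf-like true distribution: a few likely tokens, a long tail.
weights = [1.0 / (rank + 1) for rank in vocab]

def next_generation(weights, n_samples=500):
    """The next 'model' is just the empirical frequencies of samples
    drawn from the previous one. A token never sampled drops to zero
    probability and can never come back."""
    samples = random.choices(vocab, weights=weights, k=n_samples)
    counts = [samples.count(tok) for tok in vocab]
    return [c / n_samples for c in counts]

support_sizes = []
for gen in range(10):
    support_sizes.append(sum(w > 0 for w in weights))
    weights = next_generation(weights)

print(support_sizes)  # support shrinks as tail tokens vanish
```

The number of tokens with nonzero probability only ever shrinks: that is the “forgetting the tails” the paper describes, in its simplest possible form.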

However, I tend to think this viewpoint is missing something important. Can you train a new LLM on today’s internet? Probably not, at least not without some heavy cleaning. Can you train a multimodal model on video, audio, the chat logs of people talking to it, and even other, better LLMs? Yes, and you will get a much higher quality model and likely won’t see the model collapse implied by the paper.

This is more or less what OpenAI has done. The conversations with its 100M+ users are saved and used to further train the AI. Their latest GPT-4 is also trained on video and image recognition, and they have been exploring ways for LLMs to help train new ones, especially to aid in the alignment of these models.

Another recent example is Orca, a fine-tune of the open source LLaMA model, trained with GPT-3.5 and GPT-4 as teachers; it retains ~90% of GPT-3.5’s performance despite using roughly a factor of 10 fewer parameters.
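The teacher-student setup can be sketched in miniature. This is not Orca’s actual training code; both “models” here are one-parameter logistic classifiers and all numbers are made up, but the structure is the same: the student fits the teacher’s soft outputs instead of ground-truth labels.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Pretend this teacher was expensive to train (the GPT-4 role).
TEACHER_W, TEACHER_B = 3.0, -1.5

def teacher(x):
    """Soft label in (0, 1), not a hard 0/1 answer."""
    return sigmoid(TEACHER_W * x + TEACHER_B)

# Distillation dataset: inputs plus the teacher's soft outputs
# (the "millions of examples from GPT-4 and GPT-3.5").
xs = [random.uniform(-2, 2) for _ in range(200)]
soft_labels = [teacher(x) for x in xs]

# Student starts from scratch and does gradient descent on the
# cross-entropy between its output and the teacher's.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    for x, t in zip(xs, soft_labels):
        p = sigmoid(w * x + b)
        w -= lr * (p - t) * x / len(xs)
        b -= lr * (p - t) / len(xs)

err = max(abs(sigmoid(w * x + b) - teacher(x)) for x in xs)
print(f"max disagreement with teacher: {err:.3f}")
```

The student ends up closely imitating the teacher on the training range, which is the whole trick: the expensive model’s behavior is compressed into a much smaller one.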