LLMs generate output one token at a time. Each token comes with a confidence level from the model, indicating how sure it is that this is the only possible token to continue the sequence. A model is only 100% confident in its output if it reproduces a training text verbatim. With any temperature above 0, it veers off the 100% confidence path, which lets it leverage the concept associations it formed during training and makes its output more useful.
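To make the temperature point concrete, here is a minimal sketch of picking one token from a model's raw scores. The 4-token vocabulary and the score values are made up for illustration, and `softmax` and `sample_token` are hypothetical helper names, not any particular library's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Return (token_index, probability the model assigned to that token)."""
    if temperature == 0:
        # Greedy decoding: always take the top-scoring token.
        probs = softmax(logits)
        idx = max(range(len(logits)), key=lambda i: logits[i])
        return idx, probs[idx]
    probs = softmax(logits, temperature)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs[idx]

# Made-up scores for a 4-token vocabulary (assumption, not real model output).
logits = [4.0, 2.5, 1.0, -1.0]
greedy_idx, greedy_prob = sample_token(logits, temperature=0)
sampled_idx, sampled_prob = sample_token(logits, temperature=1.5, rng=random.Random(0))
```

At temperature 0 the top token always wins. At higher temperatures the lower-ranked tokens get a real chance of being picked, and the probability returned alongside the index is the per-token confidence being discussed here.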
For every generated text, you could get a confidence heat map, then ask the model to refine the sections that don’t meet a desired level of confidence. The parts where a model makes stuff up, or hallucinates, are especially likely to be token sequences with much lower confidence than the rest.
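A rough sketch of the "find the cold spots on the heat map" step could look like the following. It assumes you already have one confidence value per generated token. The function name, the 0.5 threshold, and the example tokens and numbers are all hypothetical.

```python
def low_confidence_spans(tokens, confidences, threshold=0.5, min_len=1):
    """Return (start, end, text) for each run of tokens below threshold.

    `end` is exclusive. Runs shorter than `min_len` tokens are ignored.
    """
    if len(tokens) != len(confidences):
        raise ValueError("need exactly one confidence value per token")
    spans = []
    start = None
    for i, conf in enumerate(confidences):
        if conf < threshold:
            if start is None:
                start = i  # a low-confidence run begins here
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i, "".join(tokens[start:i])))
            start = None
    # Close a run that extends to the end of the text.
    if start is not None and len(tokens) - start >= min_len:
        spans.append((start, len(tokens), "".join(tokens[start:])))
    return spans

# Made-up tokens and confidence values, for illustration only.
tokens = ["The", " capital", " of", " France", " is", " Lyon", "."]
confidences = [0.97, 0.97, 0.99, 0.95, 0.98, 0.31, 0.90]
flagged = low_confidence_spans(tokens, confidences)
```

The flagged spans are what you would hand back to the model, or to an external source, for another pass. Picking a good threshold is the hard part and would need tuning per model.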
Running a model several times, focusing on the sections with lower confidence, and getting additional data from other sources like the internet or some niche expert system, could eliminate many of the nonsense sections… and I have a reasonable suspicion that Google’s Gemini does exactly that, refining each output with 4 additional iterations instead of blindly spitting out the first one.
I guess that makes sense, but I wonder if it would be hard to get clean data out of the per-token confidence values. The LLM could be hallucinating, or it could just be generating bad grammar. It seems like it’s hard enough already to get LLMs to distinguish between “killing processes” and murder, but maybe there could be some novel training and inference techniques that come up.
An LLM has… let’s say two core components: a tokenizer and a neural network. The neural network’s output is an array of activation levels for a series of neurons, each neuron representing one token. A confidence of 100% would mean a 100% activation of a single neuron/token, and 0% for all the rest. That is a highly unlikely scenario for a neural network, except when it got overfitted to a single pattern during training and is getting fed the same pattern again. What is more usual is some value between 0% and 100% for each neuron, with a few neurons showing higher levels of activation, and the LLM… usually picks the highest, but maybe sometimes the second or a lower-ranked one.
The confidence can be calculated by comparing the level of the chosen token’s neuron to all the other output neurons. A naive version could be level/sum(levels). A somewhat more advanced one could be level²/sum(levels²).
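Those two formulas, written out as a quick sketch. It assumes the activation levels are non-negative and not all zero. Real models typically emit raw scores (logits) that go through a softmax first, so treat `levels` here as already-normalized-ish activations. The function names and the example values are made up.

```python
def naive_confidence(levels, chosen):
    """level / sum(levels) for the chosen token."""
    return levels[chosen] / sum(levels)

def squared_confidence(levels, chosen):
    """level^2 / sum(levels^2): rewards a clear winner more strongly."""
    return levels[chosen] ** 2 / sum(l * l for l in levels)

# Hypothetical activation levels for a 3-token vocabulary.
peaked = [0.6, 0.3, 0.1]
flat = [0.25, 0.25, 0.25, 0.25]

naive_peaked = naive_confidence(peaked, 0)      # 0.6 / 1.0
squared_peaked = squared_confidence(peaked, 0)  # 0.36 / 0.46
naive_flat = naive_confidence(flat, 0)          # 0.25 / 1.0
squared_flat = squared_confidence(flat, 0)      # 0.0625 / 0.25
```

For the peaked case the squared variant scores the winner higher than the naive one (about 0.78 versus 0.6), while for a completely flat distribution both give the same 0.25.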
Hallucinations are theoretically possible at a high confidence, but usually happen at lower confidence levels where there are many tokens with a similar confidence.
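One way to put a number on "many tokens with a similar confidence" is the entropy of the output distribution. This is a swapped-in measure, not something from the formulas above, and the function name and example distributions are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Entropy scaled to [0, 1]: 0 = one token has all the mass, 1 = uniform."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)

# Made-up distributions over a 4-token vocabulary.
sure = [0.97, 0.01, 0.01, 0.01]    # one clear winner
unsure = [0.28, 0.26, 0.24, 0.22]  # several near-equal candidates

sure_score = normalized_entropy(sure)
unsure_score = normalized_entropy(unsure)
```

A value near 1 means the model was close to a coin toss between several tokens, which is the situation where a made-up continuation is more likely to slip in.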
It doesn’t look like anything to me… I mean, that could be either part of the guardrails, or a lack of context. A “killing process” is murder, outside the programming/sysadmin context. Current LLMs are still not great at handling different semantic contexts for the same token, and particularly bad at mixing different contexts throughout a single text.
My personal “Turing” test for an LLM is being able to write a sentence that could be interpreted in 3 or more ways. For a human, 2 meanings is a somewhat easy task, a double entendre. Starting at 3 and 4, it becomes a feat. Most LLMs are still at 1, and sometimes struggle even with that.
For example, Gemini says:
It can do paragraphs, though:
…which is pretty neat, but paragraphs have “more degrees of flexibility”, so pulling it off in a single sentence is way harder.