Mine tries to lie whenever it doesn’t know something. I will call it out and say that is a lie, and it will say “you are absolutely correct”. tf.
I was reading into sleeper agents placed inside local LLMs, and this is increasing the chance I’ll delete it forever. Which is a shame, because it is the new search engine, seeing how they ruined search engines.
Thinking of LLMs this way is a category error. LLMs can’t lie because they don’t have the capacity for intentionality. Whatever text is output is a statistical aggregate of the billions of conversations it’s been trained on that have patterns in common with the current conversation. The sleeper agent stuff is pure crackpottery: they don’t have fine control over the models in that way (yet). Model development is full of black boxes and hope-it-works, trial-and-error training. At worst there is censorship and political bias, which can be post-trained or ablated out.
They get things wrong confidently. This kind of bullshitting is known as hallucination. When you point out their mistake and they say you’re right, that’s 1. part of their compliance post-training to never get in conflict with you, and 2. standard course correction once an error has been pointed out (humans do it too). This is an open problem that will likely never go away until LLMs stop being stochastic parrots, which is still very far away.
Yet the people creating the LLMs admit they don’t know how they work. They have also shown, by looking at its thinking, that an LLM is intentionally deceptive at times during training. The damn thing lies. Use whatever word you want. It tells you something wrong on purpose.
“Don’t know how they work” misunderstands what scientists mean when they say that (there is also intentional misdirection from marketing in order to build hype). We know exactly how it works, and could describe it down to the physics if needed, BUT at different levels of abstraction, in the presence of real-world inputs, the outputs are novel to us.
It’s predicting words that come after words. The “training” is inputting the numerical representation of words and adjusting variables in the algorithm until the given mathematical formula predicts the words that actually come next in the training text, within a given margin of error.
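If it helps, here is a toy sketch of that training loop. It is nowhere near a real LLM: it only looks at the previous word (a "bigram" model), and the names `train_bigram`, `predict`, the learning rate and the step count are all made up for illustration. The idea is the same though: keep nudging a table of numbers until the predicted next word matches the training text.

```python
import math

def softmax(row):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab, steps=300, lr=0.1):
    """Fit a table of scores so that scores[prev] predicts the next word.

    Hypothetical toy setup: one adjustable number per (prev, next) pair.
    """
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    logits = [[0.0] * n for _ in range(n)]  # the adjustable "variables"
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(steps):
        for prev, nxt in pairs:
            probs = softmax(logits[prev])
            # Gradient of cross-entropy loss w.r.t. the scores is
            # (predicted probability - 1 for the true word, 0 otherwise).
            for j in range(n):
                target = 1.0 if j == nxt else 0.0
                logits[prev][j] -= lr * (probs[j] - target)
    return logits, index

def predict(logits, index, vocab, word):
    """Return the most probable next word after `word`."""
    probs = softmax(logits[index[word]])
    return vocab[max(range(len(vocab)), key=lambda j: probs[j])]
```

Train it on something like `"the cat sat on the mat the cat ran".split()` and ask what follows "the": it picks whichever word followed "the" most often in the text. No notion of true or false anywhere, just counts turned into probabilities.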
When you say cat, I say dog. When someone asks what they are together, we say “catdog” or “pets”. Randomness is added so that the algorithm can say either, even if “pets” is the majority answer. Make the string more complicated and that randomness gives more opportunity for weird answers. The training data could also just have lots of weird answers.
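The randomness part can be sketched like this. This is an illustration under assumptions, not how any particular model does it: the scores for "pets" and "catdog" are invented, and `sample_with_temperature` is a hypothetical helper name. "Temperature" is the usual knob: higher means more random, lower means it nearly always picks the majority answer.

```python
import math
import random

def sample_with_temperature(scores, temperature=1.0, rng=random):
    """Pick one word at random, weighted by its score.

    `scores` maps candidate words to raw scores (made-up numbers here).
    Lower temperature sharpens the distribution toward the top score.
    """
    words = list(scores)
    scaled = [scores[w] / temperature for w in words]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical scores: "pets" is the majority answer, "catdog" the odd one.
scores = {"pets": 2.0, "catdog": 1.0}
rng = random.Random(0)
draws = [sample_with_temperature(scores, 1.0, rng) for _ in range(1000)]
cold = [sample_with_temperature(scores, 0.01, rng) for _ in range(1000)]
```

At temperature 1.0 both answers show up in `draws`, with "pets" more often. At 0.01 the "catdog" weight is so tiny that `cold` is effectively all "pets".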
Little mystery here. The interesting “we don’t know how it works” is that these models sometimes give output so novel and so unlike the inputs that it seems like they reason. Even though, again, they do not.
If you wanna put intent in there, maybe think of it as a kid desperately trying to give you an answer they think will please you, when they don’t know, because their need to answer is greater than their need to answer correctly.
Think about the data that the models were trained on… a large share of it came from sites like Reddit and Stack Overflow.
If you look at the conversations that occur on those sites, it is very rare for someone to ask a question and then someone else replies with “I don’t know”, or even an “I don’t know, but I think this is how you could find out”. Instead, the vast majority of replies are someone confidently stating what they believe is the truth.
These models are just mimicking the data they’ve been trained on, and they have not really been trained to be unsure. It’s up to us as the users to not rely on an LLM as a source of truth.
Stochastic parrots always bullshit. They can’t lie, as they have no concept of or care for truth and falsity; they just spit back noise that’s statistically shaped like a signal.
In practice, I noticed the answer is more likely to be wrong the more specific the question is. General questions whose answers are widely available in the training data are more often answered correctly by the LLM.
I feed my class quizzes in senior cell biology into these sites. They all get a C-.
Two points of interest: they bullshit like students and they never answer “I don’t know”.
Also, OpenAI and Grok return exactly the same answers, to the letter, with the same errors.
All models are wrong but some are useful.
~George E. P. Box (probably)~
This is as true of LLMs as it is of a human’s mental model.
Good comment. But the way it does it feels pretty intentional to me. Especially when it admits that it just lied so that it could give an answer, whether the answer was true or false.
Always. That is a known issue with AI that has to do with explainability. Basically, if you’re familiar with the general idea of neural networks, we don’t really understand the hidden layers, so we can’t know if they “know” something, and so we can’t train them to give different answers based on whether they do or don’t. They are still statistical models that are functionally always guessing.
Could you post the link to the sleeper agent thing?
Here’s the video I actually watched about the sleeper agents
I wouldn’t stop using AI completely over that. I generally don’t trust it with anything that important anyway.
All the time, usually to protect entrenched power systems, and about the efficacy of working within said systems.
Never, because to me lying requires intent to deceive. LLMs do not have intentions; the engineers behind the LLMs do.



