- cross-posted to:
- [email protected]
- [email protected]
Good article. I’ve heard of that pattern a lot, the “not x, but y”. Didn’t know “RLVR” is the probable culprit or that it’s even a thing. I was assuming RLHF.
RLVR intervenes by having the model solve math problems by writing their way to a solution, reproducing the language we would use when thinking out loud about how to solve it. When the model arrives at the correct answer, the language it used most often to get there is then emphasized in the finished model. This is (partly) what the industry calls reasoning.
Sounds kinda bizarre as a methodology, but I guess it works, even if with side effects.
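For anyone who wants the quoted description in concrete form, here is a toy sketch of the idea, not how any lab actually implements it. Everything in it is an assumption for illustration: the names `verifiable_reward` and `reinforced_phrases` are hypothetical, the "reasoning traces" are made-up strings, and real RLVR updates model weights with a policy-gradient method rather than counting words. It only shows the part the quote describes: an answer gets checked automatically, and the language in the traces that reached a correct answer is what gets emphasized.

```python
from collections import Counter


def verifiable_reward(final_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the known-correct one, else 0.0.

    Hypothetical checker: real verifiers are more forgiving about formatting.
    """
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


def reinforced_phrases(samples, ground_truth):
    """Count the words used in traces whose final answer was verified correct.

    This stands in for "the language it used to get there is emphasized";
    it is a counting toy, not a training algorithm.
    """
    counts = Counter()
    for trace, answer in samples:
        if verifiable_reward(answer, ground_truth) > 0:
            counts.update(trace.lower().split())
    return counts


# Made-up (trace, final answer) pairs for a problem whose answer is 14.
samples = [
    ("wait, not 12 but 14", "14"),
    ("it is 12", "12"),
    ("not 13 but 14", "14"),
]

counts = reinforced_phrases(samples, "14")
print(counts.most_common(3))
```

In this toy, the words from the two traces that ended on 14 get counted and the trace that ended on 12 contributes nothing, which is roughly the mechanism by which a turn of phrase like "not x, but y" could end up over-represented.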
Defining reasoning the way it has been used in LLMs assumes that the point of asking a question is to get an answer, that answers can be verified, and that nothing is lost in immediate closure. This has real effects on writing, and the openness to doubt is something we lose in the rapid prototyping of thought that occurs with a language model. Ambiguity, doubt, and uncertainty matter more to some ways of thinking than any immediate answer. The inner life grows in the spaces between the industrial complexes that harness every remnant of our externalized thought.
This really makes me wonder. Because in my time using Deepseek, I’ve noticed it tends to have this tone of rigid certainty to things (and I usually have thinking on). I don’t think this is at all unique to Deepseek, though, and I’ve heard of LLMs in general having problems of being confidently wrong for a while. And the way training is being done for reasoning, it may help them reason out of gaslighting you when they’re properly corrected, but I’m not sure it escapes the tone of rigidity. In fact, it may make it worse.
Solidifying the sense of a conversation feeling more like a barely restrained know-it-all trying to be polite about knowing everything, than a casual conversation about ideas.
I’ve had experience with humans who are too confident about virtually everything and they can get annoying fast. You want somebody confident when you’re hearing a pilot over the speaker on a plane, but not when shooting the shit and thinking things through aloud. I’m glad Deepseek shows the thinking though cause a) it amuses me seeing it be like “write a response that is empathic and respectful” and shit, and b) it takes the edge off when the actual response might be annoying in tone otherwise.
I’ve noticed this pattern as well. It’s kind of funny how you end up with these unintended side effects that surface from trying to solve a particular problem and then start affecting how the models operate stylistically.
Deepseek v3.2, though short-lived from December 2025 to April 2026, was amazing at discovering user intent. I’ve talked about user intent a bit before but it’s very difficult to figure out, even for humans and ‘traditional’ algorithms. User intent is basically asking: why is the user doing what they’re doing?
Sometimes you can discover it easily. If someone’s search history contains “baby crib facebook marketplace”, “how to determine baby shoe size”, “female names for babies 2026”, “male names for babies 2026” you can reasonably assume they’re expecting a baby and that’s why they were making these queries. Further, you could more or less reasonably infer it’s their first baby (since they’re looking for shoe sizes) and they didn’t want to be told the sex. But that’s a bit more tenuous.
V3.2 was short-lived but really nailed this in a way I hadn’t seen before. It instantly understood your intent without you even having to explain it at all. Intent can be very, very minute. Sometimes the user doesn’t even fully know their intent (they start off thinking “let’s just see what I find for fun” and it ends up becoming an actual insight they later use). The example I gave above is an easy one. But here’s a more difficult one:
“how to stop dog barking at night” “local residential noise ordinance laws [City]” “best soundproofing curtains for windows”
You have two possibilities: a new pet owner worried that their dog is going to get in trouble with the city/neighbors, or someone that’s annoyed with a neighborhood dog. How do you cut between the two if that’s all that you have to go on?
You could rephrase this as a prompt for an LLM: “Deepseek, how do you stop a dog barking at night?? I can’t sleep anymore!”
It will have to decide if it’s your dog or someone else’s dog, and good AI can absolutely figure this out from just this prompt. We might think this is simple, but many still fail and assume it’s your dog you’re talking about, without offering advice for both situations.
3.2 may not have been the most up to date model, but you could send it anything without actually asking a question at the end, and it would know what to say, because it was able to understand your intent much better - why you were prompting it and what you were actually looking for. The kind of stuff it can sniff out from the subtext without you having to tell it goes a long way to make these models actually intelligent.
You can see this “drop” in intelligence with v4: it needs to think much longer (reminiscent of r1, the very first thinking model Deepseek made available). I took an old, detailed prompt I’d sent 3.2 and sent it verbatim to the new v4, and it thought for 13 seconds versus the original 4 seconds in 3.2. In other words, 3.2 didn’t need to verbally, outwardly think as much to answer the user correctly - much of it happened inside the model weights already. I find v4 is too… “yes. I confirm you just told me this. Nice job on telling me!” when v3.2 was here to work haha.
I really hope they deliver with v4.1 because the drop in quality was very clear 😢 I now use Kimi (when it’s not telling me the servers are saturated) or Qwen studio (chat.qwen.ai) more and more - their flagship models are a bit better than current DS…
PS: I also found LLMs get better in general if you validate what they say and you encourage them haha. At this point I just discard the hallucinations in my head and ignore them when sending the next prompt.
That is really interesting, I didn’t use Deepseek enough to tell the difference much (I think I used it a little bit before April 2026 but not much). But it’s sad to hear it got worse. I do remember us discussing the sycophantic stuff and you mentioning it had gotten worse on that.
Sorta funny (to me anyway) story about that: at one point recently I prompted it in a way where I was kinda like, okay, I really want to avoid dogma on x subject and just brainstorm. And it listened, but I swear it did it in this overly enthusiastic way lol, like “yeah, screw that dogma stuff” (not in such casual language, but those vibes kinda). Like it’s trying too hard to inhabit extremes and losing openness in the process? I don’t know how else to put it. Like as it relates to the OP article, when humans discuss things, they can be very floaty about it (when not getting into an argument). Meandering around, unsure of themselves, and in older, smaller models, I think this was part of the charm of them; although they’d be inaccurate a lot, they’d also have more of that floaty uncertain human-like quality of a person who is a bit disoriented with the world sometimes and is trying to process it all.
But perhaps in the pursuit of accuracy, they seem to have hammered that out of models somewhat.
I am curious to try Kimi or Qwen though, I’ll give that a try at some point and see how it goes.
PS: I also found LLMs get better in general if you validate what they say and you encourage them haha. At this point I just discard the hallucinations in my head and ignore them when sending the next prompt.
Oh that’s a good reminder. I do remember hearing that some models do better when saying “please” so that makes sense more generally. I wonder why, maybe some side effect of RLHF or the other thing, RLVR.
RLVR is done because you have to pay people for RLHF.



