Comments (57)
- Aurornis: I still can't believe that people take Caveman seriously. It's a funny joke, but saving a couple hundred tokens in the final output is negligible, especially when coding, where it's common to go through hundreds of thousands of tokens in a session. You also have to consider the additional tokens consumed by the skill itself (acknowledging that output tokens are billed at a different rate). I got a kick out of it when it was released, but now that I'm seeing it repeated as a useful optimization, it's apparent how much cargo culting is going on in this space.
- encody"...the value isn't compression. It's structure.""...that consistency is real value.""A few findings...are worth flagging here."I know this smell. I'm not sure if this is AI or merely the natural result of overwhelming immersion in AI output that is "backpropagating" its way into organic communication.On a completely related note, I've been enjoying classic fiction a lot more recently. Moby Dick is actually pretty funny.
- max-t-dev: Author here. Caveman is a popular Claude Code plugin that compresses Claude's responses via a custom skill with intensity modes. I wanted to know whether it actually beats the simplest possible alternative: prepending "be brief." to prompts. 24 prompts, 5 arms, judged by a separate Claude against per-prompt rubrics covering required facts, required terms, and dangerous wrong claims to avoid. 120 scored responses, 100% key-point coverage across every arm, zero must_avoid triggers. Headline: "be brief." matched Caveman on tokens (419 vs 401-449) and quality (0.985 vs 0.970-0.976). Caveman has real value beyond compression: consistent output structure, intensity modes, the Auto-Clarity safety escape. But the compression itself isn't the differentiator I expected. The harness is open source and strategy-agnostic if anyone wants to add an arm: https://github.com/max-taylor/cc-compression-bench. Happy to answer questions about the methodology, the per-category variance findings, or the bits I cut from the writeup.
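For anyone wondering what "adding an arm" might involve, here is a minimal sketch, assuming an arm is just a function that rewrites the prompt before it is sent to the model. The class names and interface below are hypothetical and may not match the actual cc-compression-bench code:

```python
# Hypothetical sketch of a strategy "arm" for a prompt-compression benchmark.
# Illustrative only: the real cc-compression-bench repo may define arms differently.

class Arm:
    """An arm rewrites each benchmark prompt before it is sent to the model."""
    name = "control"

    def transform(self, prompt: str) -> str:
        return prompt  # the control arm sends the prompt unchanged


class BeBriefArm(Arm):
    """The baseline from the writeup: prepend the two-word instruction."""
    name = "be-brief"

    def transform(self, prompt: str) -> str:
        return "be brief. " + prompt
```

Under this assumption, the separate judge model would then score each arm's responses against the same per-prompt rubric (required facts, required terms, must_avoid claims) described above.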
- BewareTheYiga: Caveman made me laugh, and that, in theory, should count for something.
- 0xbadcafebee: I tell chats to "be brief" all the time when they're being too verbose, but I never thought to put it in coding-agent instructions. Thanks for the benchmark! I wonder how one would put this in AGENTS.md so that it makes sense as a general instruction?
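One possible phrasing, as a hypothetical AGENTS.md excerpt. The exact wording is an assumption on my part, not something from the benchmark, and is untested:

```markdown
## Response style

Be brief. Prefer short, direct answers. Skip preamble, restated
requirements, and summaries of code that did not change.
```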
- mattas: It's interesting. On one hand, the labs say they can't keep up with demand for tokens. On the other hand, there's an entire ecosystem built around figuring out which magic words will make LLMs output fewer tokens.
- avaer: Thanks for the research! Though I feel like industry veterans (especially those working with LLMs) reached this conclusion without having to write a single prompt. Even ignoring the technical merits of these kinds of hacks, if you think you've outwitted billions of dollars of statistics with a prompt, you're probably wrong at this point. What I find most interesting is the popularity of these snake oils, especially the ones that are easy to install and never checked. The tech moves so fast, and the research is so scarce and poor-quality, that the bullshit asymmetry principle wins and people buy into these cargo cults. Maybe we need a plugin that checks whether a new plugin/prompting technique/LLM lifehack is BS.
- refactor_master: Can someone give me a sound argument for how this can be true when these things supposedly hold: (1) LLMs scale with the amount of data on a subject, and (2) even frontier labs have a hard time gauging exactly how well models perform, across quite rigorous test suites covering all aspects? Yet a low-data "niche language" (what is the volume of literature written in Caveman?) supposedly performs equally well, when this anecdotally doesn't hold for e.g. niche programming languages, "proven" by a handful of completely arbitrarily designed tests. We've barely convinced ourselves that LLMs actually increase measurable industry productivity, instead of us just spending time sending slop to each other.
- 0-_-0: How about Caveman + "be brief."?
- brcmthrowaway: Stop using an LLM to write blog posts.
- ramesh31: Caveman sounds clever if you have no idea how LLM reasoning works. Talking through a problem out loud, in depth, is a critical part of how things like Claude Code even get to a result. Those aren't "wasted tokens"; they're an integral part of how the LLM reaches a conclusion and completes its chain of reasoning.
- lofaszvanitt: Caveman is useless for me. We are in the year 2026; computers are here to serve me and bring me comfort. Caveman is a caveman; it speaks like an idiot, and I don't want to interact with an idiot. It's irritating and, as the article states, an overhyped turd. It's the same idiocy that permeates EVs: you buy an expensive car to get from A to B that's also supposed to offer comfort, and when I have to think about whether or not to use the seat heating, I'm out of my comfort zone. So no, fuck Caveman, and I don't fucking care about the burned tokens. "Be brief." is easy: no setup needed, no mindless mumbo-jumbo extension with its 325 dependencies.
- deadbabe: I wish they would change the name to caveperson.
- numpad0: Is caveman speech brief, or is it just more consistent with the Chinese language? Chinese famously lacks ALL inflections, conjugations, and anything else that modifies the spelling of words.