Stop Burning Your Context Window – How We Cut MCP Output by 98% in Claude Code

<- Back

Stop Burning Your Context Window – How We Cut MCP Output by 98% in Claude Code

mksglu

Comments (24)

nr378
Nice work.It strikes me there's more low hanging fruit to pluck re. context window management. Backtracking strikes me as another promising direction to avoid context bloat and compaction (i.e. when a model takes a few attempts to do the right thing, once it's done the right thing, prune the failed attempts out of the context).
mksglu
Author here. I shared the GitHub repo a few days ago (https://news.ycombinator.com/item?id=47148025) and got great feedback. This is the writeup explaining the architecture.The core idea: every MCP tool call dumps raw data into your 200K context window. Context Mode spawns isolated subprocesses — only stdout enters context. No LLM calls, purely algorithmic: SQLite FTS5 with BM25 ranking and Porter stemming.Since the last post we've seen 228 stars and some real-world usage data. The biggest surprise was how much subagent routing matters — auto-upgrading Bash subagents to general-purpose so they can use batch_execute instead of flooding context with raw output.Source: https://github.com/mksglu/claude-context-mode Happy to answer any architecture questions.
esafak
If this breaks the cache it is penny wise, pound foolish; cached full queries have more information and are cheap. The article does not mention caching; does anyone know?I just enable fat MCP servers as needed, and try to use skills instead.
giancarlostoro
This sounds a little bit like rkt? Which trims output from other CLI applications like git, find and the most common tools used by Claude. This looks like it goes a little further which is interesting.I see some of these AI companies adopting some of these ideas sooner or later. Trim the tokens locally to save on token usage.https://github.com/rtk-ai/rtk
buremba
AFAIK Claude Code doesn't inject all the MCP output into the context. It limits 25k tokens and uses bash pipe operators to read the full output. That's at least what I see in the latest version.
unxmaal
I did this accidentally while porting Go to IRIX: https://github.com/unxmaal/mogrix/blob/main/tools/knowledge-...
specialp
Do you need 80+ tools in context? Even if reduced, why not use sub agents for areas of focus? Context is gold and the more you put into it unrelated to the problem at hand the worse your outcome is. Even if you don't hit the limit of the window. Would be like compressing data to read into a string limit rather than just chunking the data
ZeroGravitas
I've seen a few projects like this. Shouldn't they in theory make the llms "smarter" by not polluting the context? Have any benchmarks shown this effect?
mvkel
Excited to try this. Is this not in effect a kind of "pre-compaction," deciding ahead of time what's relevant? Are there edge cases where it is unaware of, say, a utility function that it coincidentally picks up when it just dumps everything?
agrippanux
I am a happy user of this and have recommended my team also install it. It’s made a sizable reduction in my token use.
formvoltron
this is going to crash the AI economy. nvda down 20 percent monday. lol
SignalStackDev
[dead]
aplomb1026
[dead]
jamiecode
The 98% reduction is the real story here, but the systemic problem you're solving is even bigger than individual tool calls blowing up context. When you're orchestrating multi-step workflows, each tool output becomes part of the conversation state that carries forward to the next step. A Playwright snapshot at step 1 is 56 KB. It still counts at step 3 when you've moved on to something completely different.The subprocess isolation is smart - stdout-only is the right constraint. I've been running multi-agent workflows where the cost of tool output accumulation forces you to make bad decisions: either summarise outputs manually (defeating the purpose of tool calls), truncate logs (information loss), or cap the workflow depth. None of them good.The search ranking piece is worth noting. Most people just grep logs or dump chunks and let the LLM sort it out. BM25 + FTS5 means you're pre-filtering at index time, not letting the model do relevance ranking on the full noise. That's the difference between usable and unusable context at scale.Only question: how does credential passthrough work with MCP's protocol boundaries? If gh/aws/gcloud run in the subprocess, how does the auth state persist between tool calls, or does each call reinit?