Comments (93)
- GodelNumbering: Interesting things Dirac does:
  1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token) - sketched below
  2. Utilizes the language's AST to decide what to fetch into context, entirely avoiding large code file reads
  3. Batches all operations: does a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here: https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)
  4. Allows the model to execute code to analyze things on the fly, so the model can simply write a bash/python/perl script to accomplish things where appropriate
  5. A lot of context curation and opportunistic context updates, i.e. putting into context anything you are certain the model would ask for next
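A minimal sketch of the hash-anchored idea from point 1, assuming a simple per-line content hash; the `anchor`/`apply_edit` names are illustrative, not Dirac's actual implementation (the linked post describes the real scheme):

```python
import hashlib

def anchor(line: str, width: int = 6) -> str:
    """Short content hash used to address a line instead of a line number."""
    return hashlib.sha256(line.encode()).hexdigest()[:width]

def render_with_anchors(text: str) -> str:
    """What the model sees: each line prefixed with its hash anchor."""
    return "\n".join(f"{anchor(l)}|{l}" for l in text.splitlines())

def apply_edit(text: str, target: str, replacement: str) -> str:
    """Replace the line whose anchor matches; fail loudly on a stale anchor."""
    lines = text.splitlines()
    hits = [i for i, l in enumerate(lines) if anchor(l) == target]
    if len(hits) != 1:
        raise ValueError("anchor is stale or ambiguous; re-read the file")
    lines[hits[0]] = replacement
    return "\n".join(lines)
```

The appeal over raw line numbers is that an anchor survives unrelated edits elsewhere in the file and fails loudly when the target line has changed, rather than silently patching the wrong spot.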
- mdasen: It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses. Is there a leaderboard out there comparing harness results using the same models?
- adyavanapalli: I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From what I've done with pi so far, the extension API is quite extensive. Hash-anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project; I'll be checking it out later. Cheers!
- deaux:
  1. Would be good to benchmark at least one other model from a different family to see if it indeed generalizes. Minimax 2.7 seems a good candidate to keep it affordable. Until then we can't really tell if it's just overfit on Gemini 3 Flash.
  2. Until then, your landing page needs to mention that all the numbers are from running on Gemini 3 Flash. Currently there's no mention of Gemini at all.
  3. Assuming that cheaper also means faster in this case, where the model is equal? If so, why not add time until completion of the tasks to the benchmarks to highlight another advantage. If it's the opposite and it takes longer (seems unlikely), it would be transparent to note this.
  4. Would be good to note whether or not it supports skills, (nested) AGENTS.md, MCP and so on, for people considering migrating.
- adyavanapalli: I had a chance to look at this and noticed you were sending telemetry to an endpoint you control: https://dirac.run/v1/event. It doesn't seem like you're sending anything obviously sensitive or doing anything in bad faith (though I do see API errors being sent, which could potentially leak sensitive info), but you've got to admit that's scary, seeing as you're the sole dev for this. Plus, it's opt-out too. Sorry, it's a no-go for me.
- avereveard"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness has a smaller bench delta than swapping the harness around the model. the cheating-agents post you linked is the same observation through a different lens, the harness is what's being measured, the model is just the substrate.that said context management seem to be solving today model problems, more than being an universal property, and will probably be obsoleted a few model generations down the road, as tool obsoleted RAG context injection from question embeddings.
- kha1n3vol3: I am using Dirac with Kimi 2.6 for refactoring a Rust codebase. I have a Clean Architecture design which is being reinforced. The scope of work is laid out in a Beads epic with sub-issues. The planning was done with gpt5.5, and gpt5.5 is checking that the work is complete. I have found that Dirac is more productive on large codebase refactoring than OpenCode, which actually trashed the .rs file, forcing me to revert the code.
- gobdovan: Very interesting, especially the harness point: how much of performance is in the wrapper tools. (When I almost run out of credits, I change my model to a smaller one and try to give it more structured prompts; very often gpt-5.4-mini with structure works better than gpt-5.4 with vibes.) This inspired me to start a "skill distillery" [0] where I take good agent workflow ideas and turn them into small, inspectable/installable skills. The first one is dirac-workflow, based on Dirac's structural code workflow. It's not a Dirac clone though: it has no runtime, persistent AST index, hash-anchor editing engine, or benchmark harness. Just a small AST helper and the workflow discipline as a portable skill. I also dogfooded it on the Dirac repo itself and included a short report. Would appreciate feedback from the original author on whether the prompts and tools [1] are representative.
  [0] https://github.com/ouatu-ro/skill-distillery
  [1] https://github.com/ouatu-ro/skill-distillery/blob/main/skill...
- bryanhogan: If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?
- sally_glance: Great job and congrats! Working on my own harness has been one of my favorite side projects in the past couple of weeks; of course I never finish anything... But I'm very interested in your experience with the following:
  1. Context management - specifically pruning old tool call responses, truncation of tool output, and automatic compaction (a rough sketch follows below). Those have worked pretty great for me; the benefits of greatly reducing context seem to outweigh the gains from "remembering" everything. I always leave short summaries, though.
  2. "Subagents" - my latest attempts revolve around not exposing any tools to the main agent at all, except for a run_agent tool where the subagent has access to the classic search/execute/fetch tools. My theory is that if subagents return concise summaries, this automatically keeps the parent agent's context clean for much longer. Still experimenting, though; writing prompts for subagents may also be too far outside of the current training sets.
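A rough sketch of the pruning idea from point 1, assuming an OpenAI-style message list with string contents; `keep_last` and the stub text are made-up knobs, not any particular harness's API:

```python
def prune_tool_results(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Keep the most recent tool results verbatim; stub out older ones,
    leaving a short marker so the model knows something was there."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_prune = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    out = []
    for i, m in enumerate(messages):
        if i in to_prune:
            m = {**m, "content": f"[tool output pruned; was {len(m['content'])} chars]"}
        out.append(m)
    return out
```

Run over the transcript before each model call, this keeps old tool noise from accumulating while the stubs preserve the shape of the conversation.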
- anandkrshnn: Really impressive results. The point about the harness mattering more than the model is spot on; we've seen similar patterns in our own work. One thing that stood out to me is your use of hash-anchored edits + AST-based context selection. We're building something in a similar direction with the Sovereign AI Stack, but with a stronger focus on governance and verification. Curious: did you run into issues with context drift when using AST queries on very large codebases? We found that combining them with incremental symbol DB updates helped a lot. Congrats on the results!
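For what it's worth, a toy sketch of that incremental symbol DB idea: re-parse only files whose content hash changed. Python's ast module stands in for whatever parser a real harness would use, and the `index`/`refresh` names are hypothetical:

```python
import ast
import hashlib
import pathlib

# path -> (content hash, symbol names)
index: dict[str, tuple[str, list[str]]] = {}

def refresh(path: pathlib.Path) -> None:
    """Update the symbol index for one file, skipping unchanged files."""
    src = path.read_text()
    digest = hashlib.sha256(src.encode()).hexdigest()
    if index.get(str(path), ("", []))[0] == digest:
        return  # content unchanged: no parse needed
    tree = ast.parse(src)
    symbols = [node.name for node in ast.walk(tree)
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
    index[str(path)] = (digest, symbols)
```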
- Mashimo: Interesting. Would love a comparison to pi.dev (not Ohmypi). How does this perform in day-to-day coding tasks, outside of benchmarks?
- deviation: Nice work. I adopted this for use with my workplace's LLM proxy, with a few small changes to the API/config files. Works flawlessly.
- martinald: Very interesting! I've often thought static analysis could really help agents (I wrote this last summer: https://martinalderson.com/posts/claude-code-static-analysis...), but despite being hyped about LSPs in Claude Code, it turned out to be very underwhelming (for many of the reasons they can be annoying in a "real" IDE, i.e. static analysis starts firing mid-edit and complaining, and cached analysis gets stuck). Curious to know if this has been an issue with your AST approach on larger projects? The hash-based line numbering is very interesting too (though I see far, far fewer editing errors on Opus 4.5+). I've often thought that even if model progress stopped today, we'd still have _years_ of improvements through harness iteration.
- nzoschke: I haven't had great experiences with Gemini for coding yet. I'm doing reasonably simple full-stack Go apps. Tried Gemini CLI, Antigravity, Pi. The problems I've experienced: it's less adept at picking the right bash commands to build and test the Go app, and doesn't follow idiomatic Go or codebase patterns for changes. A skill hasn't helped much. Will need to try this and OpenCode next.
- davidkunz: I would like some of that functionality to be extracted into CLI tools. Then every coding agent could use it.
- 2001zhaozhao: Very cool and interesting direction. I'm interested to see how easy it is to extend the harness's language support.
- gchamonlive: Hey there! Thanks for the project! I was intrigued by the claims, so I wanted to test it myself. First I (vibe)made an AUR package I could use to install it from git source, from master: https://aur.archlinux.org/packages/dirac-cli-git. Then I went in to see what's what, but there isn't support for gemini-cli login, and importing from OpenCode doesn't work, failing with the message "Something went wrong. Could not read API keys from OpenCode config.". `dirac auth --verbose` doesn't seem to do anything. Sorry for reporting it here, but it seems that GitHub is throwing a tantrum again and your issues page has been knocked out. It was able to log in with my OpenAI sub though, so let's see how that goes.
  EDIT: Heads up, only gpt-5.4 seems to work; gpt-5.4-pro and gpt-5.5-2026-04-23 both throw API error 400, maybe by no fault of your own. OpenAI has been deliberately hindering third-party agents lately, as oh-my-pi ceased to work last week with all gpt models, either throwing an error or having a ludicrously low API rate limit.
- nthypes: Couldn't OpenCode achieve the same by just developing this as a feature or plugin? Like anchored edits?
- dur-randir: How do I connect it to a local llama.cpp instance?
- blueTiger33: Starred it, will try it later. One question though, to make it simpler for me: in what tasks does this shine, and how does it improve the score? I already use some skills to cut down CC costs, like caveman, rtk cli and a few others. Just want to understand.
- Aeroi: The harness definitely makes a difference for the benchmarks. I ran my agent Camera Search against a few benchmarks and was able to beat Opus 4.7. I created a real-world benchmark for mining, oil & gas, construction etc. called FieldOps-bench, and it basically proves that vertical agents with specialized harnesses, tools, and systems still outperform SOTA models alone.
- redrove: I keep trying to use dirac-cli with Codex and it won't work: "Error: Codex API error: Codex API request failed: 400". Any ideas?
- michelhabib: Wow, looks very good. I'm wondering if you do any optimizations for the CLI in general, since you're not using MCP. I'm building my own CLI for AI agents, and was always concerned with context rot.
- scoopdewoop: The hash-anchor edit guy! Sincerely, great idea; I used it in my own toy harness to good effect. I just checked this out, never having tried it before, and it's great! Clearly a well-iterated design with good choices made. It is so refreshing to see real FOSS and not a grift. A simple OpenRouter API key, and I'm going. This is what I'm using from now on. You are doing the best work in this space.
- snqb: How well does it do on frontier models like Opus 4.6?
- npodbielski: Ha! I had an idea to do something like that myself over the weekend, after trying Junie and Mistral to write some tests for my personal project. That took literally hours, because the Qwen 3.5 I am using locally can take 10 minutes on a 10k prompt. Which should not be the case if the agent would ask really simple questions like:
  - what tool do you need?
  - what would the parameters for the tool be?
  - what method do you want to read?
  instead of sending a few kilobytes of build output and waiting for a response. Oh well... good thing someone already did that!
- aetherspawn: Sorry, I couldn't really figure out if this is a harness, a fine-tuned model, or both. Can we use Qwen with this, for example? Is the performance expected to be better in that case?
- neonstatic: I am a bit confused. What languages does it help with? You mention AST manipulation, so I am assuming it's not universally applicable, e.g. to Rust?
- DeathArrow: Right now, the top two harnesses on Terminal Bench are Codex and Forge Code. I wonder how Dirac compares to them. Forge Code is awesome, and I plan to test Dirac too.
- nthypes: No CLI? Only a VSCode extension?