Comments (7)
- SwellJoe: I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performed poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use. https://swelljoe.com/post/will-it-mythos/
- Balinares: I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.
- nzach: Instead of training the model to directly answer questions, we trained the model to always write and execute the code that would solve the question? If that is the case, isn't this just a fancy way to perform prompt optimization?