Research-Driven Agents: When an agent reads before it codes

hopechong

Comments (46)

simlevesque
I've been making skills from arXiv papers for a while. I have one for multi-object tracking, for example. It has a SKILL.md describing all the important papers (over 30) on the subject and a folder with each paper's full content as reStructuredText.
To feed arXiv papers to LLMs, I found that RST gives the best token count/fidelity ratio. Markdown lacks precision. LaTeX is too verbose. I have a script with the papers' URLs, names and dates that downloads the LaTeX zips from arXiv, extracts them, transforms them to RST and then adds them to the right folder. Then I ask an LLM to make a summary from the full text, then I give other LLMs the full paper again with the summary and ask them to improve on it and proofread it. While this goes on I read the papers myself, and at the end I read the summaries and, if I approve them, I add them to the skill. I also add, for each paper, info on how well the algorithms described do in common benchmarks.
I highly recommend doing something similar if you're working in a cutting-edge domain. Also, I'd like to know if anyone has recommendations to improve what I do.
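A minimal sketch of the bookkeeping such a script might do, in Python. Everything here is an assumption made for illustration: the `Paper` record, the folder layout, the use of pandoc for the LaTeX to RST step, and the `https://arxiv.org/e-print/<id>` source URL pattern are not taken from the comment, and the arXiv id shown is a placeholder. The functions only build the URL, target path and command; nothing is downloaded or executed.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class Paper:
    """Hypothetical manifest entry: one row per paper in the skill."""
    arxiv_id: str  # placeholder ids only; fill in real ones yourself
    name: str
    date: str      # ISO date, used as a sortable filename prefix


def source_url(paper: Paper) -> str:
    # Assumed URL pattern for the LaTeX source bundle of a paper.
    return f"https://arxiv.org/e-print/{paper.arxiv_id}"


def rst_path(skill_dir: Path, paper: Paper) -> Path:
    # One RST file per paper, named "<date>-<slug>.rst" under papers/.
    slug = "-".join(paper.name.lower().split())
    return skill_dir / "papers" / f"{paper.date}-{slug}.rst"


def pandoc_command(tex_file: Path, out_file: Path) -> List[str]:
    # Assumes pandoc is the converter; the comment does not name a tool.
    return ["pandoc", "--from", "latex", "--to", "rst",
            "--output", str(out_file), str(tex_file)]


example = Paper(arxiv_id="0000.00000", name="Example Tracker", date="2024-01-01")
print(source_url(example))
print(rst_path(Path("skill"), example))
print(pandoc_command(Path("main.tex"), rst_path(Path("skill"), example)))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later without shell quoting issues.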
throwdbaaway
Very nice TG improvement from Flash Attention KQ fusion. Is it something that was already done in ik_llama.cpp? If not, then it will be a welcome addition for hybrid CPU/GPU inference.
dataviz1000
Sorry to spam, I'm working on this also from a different angle. Hopefully sharing adds to the conversation.
First, about the loop: Claude's (coding agent) context and attention are big enough to self-reflect. Agent Tuning shows a technique that not only demonstrates this but also offers a way to quantify it. [0] The difference is that autoresearch's val_bpb measures what the agent built; Agent Tuning's p̂ measures the agent itself.
> Claude's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context.
Second, doing research and finding academic work to add to context helps. Here is an example of an implementation that creates trading strategies by reading research and recreating them in creative new ways. [1]
The biggest problem is that coding agents don't "fail fast and loud". They fail deceptively.
[0] https://github.com/adam-s/agent-tuning
[1] https://github.com/adam-s/alphadidactic
lmeyerov
I've found value in architectural research before R&D-tier projects like big changes to gfql, our OSS GPU Cypher implementation. It ends up multistage:
- deep research for papers, projects, etc. I prefer ChatGPT Pro Deep Research here, as it can quickly survey hundreds of sources for overall relevance
- deep dives into specific papers and projects, where an AI coding agent downloads relevant papers and projects for local analysis loops, performs technical breakdowns into essentially a markdown wiki, and then reduces over all of them into a findings report. Claude Code is a bit nicer here because it supports parallel subagents well.
- iterative design phase where the agent iterates between the papers, repos and our own project to refine suggestions and ideas
Fundamentally, this is exciting but also limiting: it's an example of 'Software Collapse', where we get to ensure best practices and good ideas from relevant communities, but the LLM is not doing the creativity here, just mashing up and helping pick.
Tools to automate this stuff seem nice. I'd expect it to be trained into the agents soon, as it's not far from their existing capabilities already. E.g., 'iteratively optimize function foobar, prefer GPU literature for how.'
jbergqvist
When I want to solve a new problem with an agent, I always ask it to search broadly for prior work in the given area online, and then analyze if we can build our solution using it as inspiration.
I see it as the solution being out there in “idea space”, and by having the agent search beforehand we can more efficiently explore this space before converging on the final solution.
ctoth
I've been very interested in this recently. I'm pretty sure that every project should have a ./papers directory of annotated papers in it like I do in Qlatt[0].
Literally every project. If it's something that's been done a million times then that means it has good literature on it? If not, then even more important to find related stuff! And not just crunchy CS stuff like databases or compilers or whatever. Are you creating a UI? There's probably been great UI research you can base it on! Will this game loop be fun in the game you're building? There's probably been research about it!
[0]: https://github.com/ctoth/Qlatt/blob/master/papers/
love2read
It sounds very silly, but it seems like they need to add a phase before research that finds a profiler and runs it, rather than just guessing which optimizations may be beneficial.
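As a rough illustration of a "profile first" phase, here is a minimal Python sketch using the standard library's `cProfile` and `pstats`. It is not what the article's agent does; the helper name `hottest_functions` and the toy workload are made up for this example. It ranks functions by self time so the optimization effort can start from measurements.

```python
import cProfile
import pstats


def hottest_functions(workload, top_n=5):
    """Run workload under cProfile and return (name, self_time) pairs,
    highest self time first."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        workload()
    finally:
        profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    ranked = sorted(stats.stats.items(), key=lambda item: item[1][2], reverse=True)
    return [(func[2], timing[2]) for func, timing in ranked[:top_n]]


def slow_part():
    # Deliberately heavy loop so it dominates the profile.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def fast_part():
    return 1


def toy_workload():
    slow_part()
    fast_part()


for name, seconds in hottest_functions(toy_workload):
    print(f"{name}: {seconds:.4f}s")
```

The ranked list could then be pasted into the agent's context before it proposes any optimization.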
KingOfCoders
I use #PPPCDC for prompting: plan, plan, plan, then verify with: Compare the plan to the existing Code. Reread and compare the plan to the Docs. Fix the areas you're not Confident about.
maCDzP
I have an ML project. I usually set up a team of agents, where I have a leader, archivist, research assistant, researcher, developer and tester. The team generates hypotheses based on papers, tests them, and iterates on that. Everything is documented using a lab notebook. It burns tokens, but I have found some promising strategies that I am testing.
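One possible shape for such a lab notebook is an append-only JSON Lines file that every agent role writes to. The Python sketch below is an assumption for illustration only: the field names, the role strings and the file name are hypothetical, and the comment does not say how the notebook is actually stored.

```python
import json
import tempfile
from pathlib import Path


def record_entry(notebook: Path, role: str, hypothesis: str, outcome: str) -> dict:
    """Append one experiment record to the notebook (one JSON object per line)."""
    entry = {"role": role, "hypothesis": hypothesis, "outcome": outcome}
    with notebook.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_entries(notebook: Path) -> list:
    """Return all records in the order they were written."""
    if not notebook.exists():
        return []
    with notebook.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Usage with a throwaway directory; the entries are made-up placeholders.
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "lab_notebook.jsonl"
    record_entry(notebook, "researcher", "hypothesis A from paper X", "untested")
    record_entry(notebook, "tester", "hypothesis A from paper X", "rejected")
    for entry in read_entries(notebook):
        print(entry["role"], "->", entry["outcome"])
```

Appending rather than rewriting keeps the history intact, so the leader or archivist role can review rejected hypotheses as well as promising ones.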
hungryhobbit
I think anyone who uses Claude knows that it works smarter when you have it make a plan first, and ask it to research the existing code as much as possible first ... so the results in this article don't surprise me at all.
However, I'd be curious to hear back from others who have tried adding the shell script (at the end of the article) to their flow: does it (really) improve Claude?
prats226
A good experiment would be to also try giving it access to latency traces so it can identify issues. With coding agents, giving access to observability tools often improves coding/debugging ability for me.
kaycebasques
Gemini has a Deep Research API: https://ai.google.dev/gemini-api/docs/deep-research
hopechong
Coding agents that read papers before writing code find optimizations that code-only agents miss.
We added a literature review phase to Karpathy’s autoresearch loop and pointed it at llama.cpp. The agent autonomously read arXiv papers, studied competing forks and spun up VMs to run parallel experiments.
austinbaggio
The research step makes sense. I can also confirm that running multiple agents with diverse strategies compounds results more quickly than a single agent.
tomi_dev
This is interesting. Do you see a noticeable difference in output quality when the agent reads context first vs going straight into generation? Feels like most tools skip that step.
outside1234
A research step (gather insights from across the codebase and internet for how to accomplish the next step), a planning step (how should I sequence implementation given that research), an implementation step, and a verification step (code review of the implementation) is a super effective workflow for me.
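The four steps above can be sketched as a tiny sequential loop in Python. This is an illustrative assumption, not the commenter's tooling: `run_workflow`, the `agent` callable and `echo_agent` are hypothetical names, and a real setup would replace `echo_agent` with a call into whatever coding agent is in use.

```python
from typing import Callable, Dict

# Stage order taken from the comment: research, plan, implement, verify.
STAGES = ("research", "plan", "implement", "verify")


def run_workflow(
    task: str,
    agent: Callable[[str, str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Run each stage in order, handing every stage the outputs of earlier ones."""
    outputs: Dict[str, str] = {}
    for stage in STAGES:
        # Pass a copy so a stage cannot mutate earlier results.
        outputs[stage] = agent(stage, task, dict(outputs))
    return outputs


def echo_agent(stage: str, task: str, earlier: Dict[str, str]) -> str:
    # Stand-in for a real agent call: just reports what it was given.
    return f"{stage} for {task!r} using {sorted(earlier)}"


for stage, text in run_workflow("speed up parser", echo_agent).items():
    print(f"[{stage}] {text}")
```

Keeping each stage's output separate makes it possible to rerun only the verification step when a review fails.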
doctorpangloss
The skypilot devs need to focus on decoupling their offering, so that their very valuable "find the cheapest cloud" functionality isn't married to a glitchy reinvention of Kubernetes JobSet and MLflow
phendrenad2
This is obvious, right? If you want to build a Facebook clone, you wouldn't tell the agent "build Facebook". You would provide it with a description of every page on Facebook, behaviors, interactions, UI, etc.