Why eval startups fail (2025)

jxmorris12

Comments (52)

arjie
It's also that the price-value frontier is different for different use-cases. For many of the things I was doing, I could make harness improvements that let DS V4 Flash catch up in performance with GPT-5.5 or Claude Sonnet, but that's just because of the use-case. And if I'm being honest, this kind of eval doesn't need someone else. Claude and I can build a framework on a per-task basis.
michaelbuckbee
I built a simple (free) eval tool for my own uses (GitHub Gists + model outputs) after not being able to find a suitable one on the market.
The market's being split into:
1. Longitudinal LLM observability tooling
Most eval startups have gone down the route of something more like being an observability platform for LLM inference. They want to be in your stack and running the inference to collect data on its performance. They collect things like how often a model returns JSON that's out of spec or returns values that aren't expected, as well as general timing and cost info.
2. Safety limiting / pentesting
Say you're doing something in the medical field or that's sensitive in some way and you want to figure out what model has the best outputs for your task that won't fly off the guardrails.
3. Simple cost + performance + quality swapping
This is what my tool does: it basically lets you test if you _really_ need to be running that frontier model in a loop across a million records or if you'd be better off with an older model or something else.
https://evvl.ai/
Example eval: https://giyd8stidy.evvl.io
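For illustration only, here is a minimal Python sketch of the "out of spec JSON" and cost comparison ideas in the comment above. It is not how evvl.ai or any observability vendor works; the model names, prices, and recorded outputs are made-up placeholders, and `ModelRun`, `is_in_spec`, and `summarize` are hypothetical helpers.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Recorded outputs for one model (placeholder data, not real measurements)."""
    name: str
    cost_per_call_usd: float
    outputs: list[str]


def is_in_spec(raw: str, required_keys: set[str]) -> bool:
    """True if raw parses as a JSON object that contains every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def summarize(run: ModelRun, required_keys: set[str]) -> dict:
    """Return the in-spec rate and total cost for one model's recorded outputs."""
    ok = sum(is_in_spec(output, required_keys) for output in run.outputs)
    return {
        "model": run.name,
        "in_spec_rate": ok / len(run.outputs),
        "total_cost_usd": run.cost_per_call_usd * len(run.outputs),
    }


# Hypothetical recorded outputs; in practice these would come from real calls.
runs = [
    ModelRun("frontier-model", 0.02,
             ['{"label": "spam", "score": 0.9}', '{"label": "ham", "score": 0.1}']),
    ModelRun("older-model", 0.002,
             ['{"label": "spam", "score": 0.8}', "label: ham"]),
]

report = [summarize(run, {"label", "score"}) for run in runs]
for row in report:
    print(row)
```

Whether the cheaper model is "good enough" then depends on how much an out-of-spec response costs you downstream, which is a per-task judgment rather than something the numbers decide on their own.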
dbish
"eval startups have a hard time finding customers, because clients have to be technical developers who want to build with APIs, but also not technical enough to run their own evals"To add to this, they have to be developers who aren't already using a fullstack observability solution, since it's fairly straight forward to add the eval startup featureset to an existing observability solution, and easier (plus cost effective) to just keep it all in one place.
paddy_m
I have written a couple of eval harnesses to see how well LLMs drive software I have written. Basically, I have data analysis software that I need LLMs to write code for. The code is complex, and I want to shape my APIs so that LLMs do a better job of quickly getting to the right answer. So I test different prompting and API surfaces. It's really easy to make quick gains this way and save your users from bugs. In this paradigm, I'm explicitly not testing different models, and I'm very interested to see how lesser models do with my software. Also, for this type of testing, using the open-weight models makes it faster, cheaper, and more reliable to test vs. frontier models, because I can trust that kimi-2.5-a-bunch-of-specs is going to behave more consistently than whatever tweaks Claude is making to Sonnet this week. API and prompting improvements seem to carry across the different models for gross improvements.
I haven't looked that hard, but I can't find articles about this type of eval testing. Curious to hear if others have approached writing APIs in this way.
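A rough Python sketch of the kind of harness described in the comment above: hold the model fixed, vary the prompt/API documentation, and compare pass rates. Everything here is an assumption for illustration. `fake_model` is a deterministic stand-in for a real LLM call, and `group_mean` is a hypothetical library function, not the commenter's software.

```python
def group_mean(rows, key, value):
    """Hypothetical library function whose API surface is under test."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: it only uses the right API if the prompt documents it."""
    if "group_mean(" in prompt:
        return "result = group_mean(rows, 'team', 'score')"
    return "result = aggregate(rows, 'team')"  # guesses a function that does not exist


ROWS = [{"team": "a", "score": 1}, {"team": "a", "score": 3}, {"team": "b", "score": 5}]
EXPECTED = {"a": 2.0, "b": 5.0}


def passes(code: str) -> bool:
    """Run generated code in a scratch namespace and compare its result.

    exec() on model output is unsafe outside a sandbox; this is a sketch only.
    """
    namespace = {"rows": ROWS, "group_mean": group_mean}
    try:
        exec(code, namespace)
    except Exception:
        return False
    return namespace.get("result") == EXPECTED


def pass_rate(model, prompt: str, trials: int = 5) -> float:
    """Fraction of trials in which the generated code produced the expected result."""
    return sum(passes(model(prompt)) for _ in range(trials)) / trials


# Two prompt variants: one bare, one that documents the API surface.
VARIANTS = {
    "bare": "Compute the mean score per team.",
    "documented": "Compute the mean score per team. Available: group_mean(rows, key, value).",
}

scores = {name: pass_rate(fake_model, prompt) for name, prompt in VARIANTS.items()}
print(scores)
```

With a real model the trials would differ from run to run, so the pass rate per variant, rather than any single run, is the number to compare.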
theteapot
What's an eval?
alexhans
The way "eval startup" is defined here is very specific and doesn't cover successful eval framework/SaaS vendors like Arize, Promptfoo, deepeval, etc.
The author does have a point about generic benchmarks not being super valuable for companies. But evals should be seen as verifying design/behaviour constraints, and they can greatly aid product building, golden dataset creation, and good software practices.
It's just that the aim should be "how to generate your own good evals, even if it's hard" and not so much "here's some generic evals about models".
PashaGo
Unfortunately, model quality is not the only criterion for users, and often not even the most important one. Adoption is also driven by marketing, UX, integrations, pricing, ecosystem, and a lot of other non-benchmark factors.
Also, model providers have little interest in having their models compared head-to-head under identical conditions. And “Model A is better than Model B” is almost meaningless by itself. Better for what task? With what prompt? What inputs? What budget? What failure tolerance?
It would be nice to have a place where users could run their own benchmarks, define evaluation criteria for their actual use cases, and make those runs verifiable by others.
0xWTF
I can see where Goodhart's Law applies to psychology and economics, pretty much any man-made domain without IDLH (immediate-danger-to-life-and-health) outcomes. But I think it's going to be hard to Goodhart a lot of medical AI safety. Biology doesn't give a shit.
However, identifying the right metrics and having the necessary test sets will, at times, be challenging.
nilirl
Maybe it's not that valuable? No snark, but how much confidence do these evals provide?
david_shi
> I believe eval startups can work when they're targeting safety benchmarks specifically.
Are there any examples of successful startups doing this?
dippogriff
The current way benchmarks are done and are accepted by the community makes for really uninspired work. Until we're willing to break out of this rigid evaluation format prone to crazy overfitting and gaming, talent will move elsewhere. It is kind of a chicken and egg problem though.
jampekka
I think there's gonna be (or perhaps already is) a huge demand for evaling individual systems. Many countries are starting to adopt some criteria for LLM usage for public use, and I doubt govs are gonna develop in-house knowhow for this. These will likely form some kinds of "independent auditor" models, as the system provider has too strong conflicts of interest.
It's probably not gonna be exactly glorious work, but designing expert eval settings and collecting and crunching the data for quality assurance and control is going to be needed.
999900000999
I’m convinced the only way to make a startup work, with a few exceptions, is to give away your product or sell below cost.
For years upon years, until you get bought out. Then it’s someone else’s problem. Or you IPO and bring in new management to figure out how to make money.
VCs don’t see 20x exits happening for eval companies, so they have trouble with the "losing money for years" step.
GL26
The problem with evals is that the information doesn't change fast enough to make you need the latest model performance benchmarks. Bloomberg succeeded because it sells info that expires in the next hour.
torginus
Imo it's very simple - AI is a big function inverter. If you have a better cost function than frontier labs, as in, you are better at judging model output quality, then you can use that cost function to RL the next generation of models.
Therefore your knowledge is better used in training than letting users be slightly better at the token casino. Which is mentioned in this post as well: eval startup people either go to work at frontier labs or finetune startups.
jdw64
If you look at the history of software engineering, the ones that made the most money were usually not the companies that built the applications themselves, but the ones that built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.
Personally, I agree with the Goodhart problem, but isn't the reason eval startups fail that they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'
So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The issue involves various semantic criticisms, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.
PaulHoule
Worked or tried to work for a few places that ended eval work in the 2010s for previous-gen systems. Most didn’t pay for it. Thanks to all the ones that didn’t, I didn’t dare try selling it to the one that would have.
h1fra
Evals are glorified integration tests. Would you invest in an integration test startup? Absolutely not. I don't get why we are making all of this fuss around evals.
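To make the comparison in the comment above concrete, here is a minimal Python sketch of an eval written in the shape of an integration test. The one practical difference it shows is that the assertion is on an aggregate score against a threshold rather than on every case passing. `stub_model`, the golden cases, and the 0.6 threshold are all made up for illustration.

```python
# Hypothetical golden set: (input, expected output) pairs.
GOLDEN = [("2 + 2", "4"), ("capital of France", "Paris"), ("3 * 3", "9")]


def stub_model(question: str) -> str:
    """Stand-in for a model call; deliberately gets one answer wrong."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return answers[question]


def accuracy(model, cases) -> float:
    """Fraction of golden cases where the model output matches exactly."""
    hits = sum(model(question).strip() == expected for question, expected in cases)
    return hits / len(cases)


def test_accuracy_meets_threshold(threshold: float = 0.6) -> None:
    """Fails like any other test, but on an aggregate score, not a single case."""
    score = accuracy(stub_model, GOLDEN)
    assert score >= threshold, f"accuracy {score:.2f} is below {threshold}"


test_accuracy_meets_threshold()
print(f"accuracy: {accuracy(stub_model, GOLDEN):.2f}")
```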
bitlad
Everything eventually fails. Nothing is constant, not even evals.
wseqyrku
> Not enough eval customers
Aha.
coldtea
Because they operate on untrusted input
redwood
I found this pretty hard to read, as the author has a very specific understanding of what an eval startup is, but it is only implied rather than explicitly described. I would have thought they were referring to the companies that provide a technology platform for doing evals in an AI application context, for example companies like Comet/Opik and Braintrust.
But it sounds like the author does not mean those companies at all, since those are actually important in enabling the very Venn diagram he/she describes.
From what I can tell, the author is referring to something more like a public benchmark report provider. If so, then yes, that's a relatively small total addressable market no matter how you look at it.