Comments (137)
- angry_octet: Great guidance hidden in here for making it expensive for agents to navigate your website: move elements on screen as the mouse moves, force natural mouse movement to make the UI work, change the button labels in the JS to be randomly named every visit, force scrolling to the bottom of the screen to check for hidden extra tasks... Hang on, that sounds like common corporate SaaS apps.
- merlindru: I'm building something that fixes this exact problem [1]. The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility. The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through the CLI: `invoke chrome pinTab`. Why accessibility? Well, it turns out that it's just a good DOM in general: it's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful. [1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
- Worf: Is it possible to ask the vision agent to "map" the UI and expose it to another agent as a set of interfaces that resemble an API? From what I understand, the vision agent currently has to both know that "next page" shows more results and that it needs to get more results in the first place. If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and another agent is then given that description, would the second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time? With an example UI I made up, the description (an API-like interface definition) could be something like: Get all reviews: go to each page and click "show full review" for every review summary on that page. Go to each page: start at page 1 (the default in the Reviews tab) and keep clicking the "next" button until it is no longer available (you've reached the last page). So the second agent can skip some thinking about how to navigate because it already has that skill, and the first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment. Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
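The explore-once / execute-many split this commenter describes can be sketched minimally in Python. Everything here (`UISkill`, `explore_reviews_ui`, `build_prompt`) is a hypothetical illustration of the idea, not any real framework:

```python
from dataclasses import dataclass, field

@dataclass
class UISkill:
    """A repeatable navigation recipe produced once by an explorer agent."""
    name: str
    steps: list = field(default_factory=list)  # ordered, human-readable instructions

# Phase 1: the explorer agent maps the UI once. Hard-coded here; in practice
# this would be the output of the exploring agent running in a test environment.
def explore_reviews_ui() -> UISkill:
    return UISkill(
        name="get_all_reviews",
        steps=[
            "Open the Reviews tab (starts at page 1)",
            "Click 'show full review' for every summary on the page",
            "Click 'next' and repeat until 'next' is no longer available",
        ],
    )

# Phase 2: the second agent receives only the distilled skill, not the raw UI,
# so it spends no tokens rediscovering navigation.
def build_prompt(skill: UISkill) -> str:
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(skill.steps))
    return f"Task: {skill.name}\nFollow these steps exactly:\n{numbered}"

print(build_prompt(explore_reviews_ui()))
```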
- jacktu: Totally agree. I've been building an AI visual tool recently and experimented with both approaches. The latency and cost of generic "agentic" browser use are absolute dealbreakers for real-time consumer apps right now. Structured APIs (even just chained LLM calls with strict JSON schemas) are not only 40x cheaper; more importantly, they are deterministic enough to actually build a stable product on top of. Computer use is an amazing demo, but structured APIs are what pay the server bills.
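The "strict JSON schemas" part of this comment can be sketched as a small stdlib-only validator: parse the model's reply and reject anything that doesn't match an expected shape, instead of guessing. The schema and field names are made up for illustration; a real setup would more likely use a library such as `jsonschema` or Pydantic:

```python
import json

# Hypothetical schema for an extraction call: every field required and typed.
SCHEMA = {"title": str, "price": float, "in_stock": bool}

def validate(raw: str, schema: dict) -> dict:
    """Parse a model reply and reject anything that doesn't match the schema.

    Determinism comes from refusing malformed output rather than guessing."""
    data = json.loads(raw)
    for key, typ in schema.items():
        if key not in data or not isinstance(data[key], typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    if set(data) - set(schema):
        raise ValueError("unexpected extra fields")
    return data

ok = validate('{"title": "Mug", "price": 9.5, "in_stock": true}', SCHEMA)
```

In a chained-call pipeline, a `ValueError` here would typically trigger one retry with the error message appended to the prompt.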
- rgilliotte: Many people are working on that :-) Apps written now will have MCP servers / AI compatibility when relevant. The issue that still needs solving is how to make LLMs interact with everything we already have and use (efficiently, not with screenshot, read, screenshot, ...). Most of the time that means reverse engineering, either the app itself or the APIs it uses. From GitHub (not my projects): https://github.com/SimoneAvogadro/android-reverse-engineerin... => reverse engineer Android app APIs from APKs; https://github.com/HKUDS/CLI-Anything => convert open-source GUI apps to CLIs; https://github.com/kalil0321/reverse-api-engineer => API reverse engineering from traffic (Claude skills). My take at the same issue (very young project): also API reverse engineering from traffic captures, with a focus on mobile apps, safety & community MCP generation: https://getspectral.sh / https://github.com/spectral-mcp/spectral
- rahulyc: All the websites currently blocking Claude Code or other AI agents are fighting a losing battle. Computer use is in its early stages, and the thing preventing mass adoption seems to be the number of tokens it takes. Agents can fumble through 10 CLI commands that don't work before finding the right one and we barely notice. Visual agents (browser use / computer use, etc.) eventually fumble onto the right thing too, but we don't have the patience to wait 20 minutes for a button click. As tokens get cheaper and faster, we probably get models that can use a UI just as natively as a CLI.
- orliesaurus: Computer use? Or browser use? IMHO that's a big difference. The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1]? I would just run that, get all the API calls in a nice format, and then replay them over and over to do things in succession. In the new world we have access to openapi.json and whatnot, but in the world where things were built pre-OpenAPI, pre-specs, and pre-best-practices... I am not so sure! (and a lot of the world lives there). Alas, this works for a good chunk of things, but not everything, which is why the other technology exists. [1] https://stoplight.io/open-source/prism
- antves: I think one main point is that not all "computer use" is the same; the harness and agentic experience matter a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience. In particular, the vision-based approach used in the evaluation has clear efficiency limitations by its nature (small observation window, heterogeneous modality). At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier, and that's why small models can do it - which is another dimension that must be considered. We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be.
- janalsncm: Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds. The only reason you wouldn't choose an API is if it wasn't viable.
- johnsmith1840: Text-based web browsing? Would love the comparison there. Tons of systems have a DOM translation layer; I'm building around this with the concept of turning a webpage into text for an agent to use directly. I actually had to move away from Haiku not because of accuracy problems but because it operated the browser too fast for a human to follow what it was doing. The real loss here is bespoke webapps like Figma or Google Docs, where it's near impossible to see what they are doing via the DOM. To me the browser is a translation layer. Working on the browser directly, while hard, enables big advantages in compatibility. The only thing I miss as of now (it's on the todo list) is OCR of the images in the browser into text - but an API would need to do that anyway to work. The main loss in my view of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI, that's it. Computer use to me is the promise of being able to replicate end-to-end actions a human does. An API can do that in theory, but the data to do that is also near impossible to collect properly.
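The "DOM translation layer" idea - collapsing a page into plain text an agent can read directly - can be sketched with the standard library's `html.parser` alone (real systems like Readability-style extractors do far more, so treat this as a toy):

```python
from html.parser import HTMLParser

class TextLayer(HTMLParser):
    """Collapse a page's DOM into plain text chunks an agent can read."""
    SKIP = {"script", "style"}  # invisible content an agent never needs

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text runs.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_to_text(html: str) -> str:
    parser = TextLayer()
    parser.feed(html)
    return "\n".join(parser.chunks)

print(page_to_text("<html><script>x()</script><h1>Reviews</h1><p>Great mug.</p></html>"))
# → Reviews
# → Great mug.
```

As the commenter notes, this is exactly where canvas-rendered apps like Figma fall through: there is no meaningful text in the DOM to collapse.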
- aurareturn: In an agentic world, the OS needs to be completely rethought. For example, every single app function should be exposable via an API while remaining human friendly. I think OpenAI designing their own phone is the next logical step. I hope they succeed, which should bring major competition to Apple and Android.
- etothet: Vision has a long way to go. I remember trying an early version of AWS's Nova Act and laughed at how slow it was - and a few months later it hadn't really seemed to improve much. Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done. Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach. A big part of the challenge is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
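The "artificial waits for certain elements" pattern amounts to polling with a deadline rather than a fixed sleep. A generic sketch (the `driver.find` call in the usage comment is a hypothetical stand-in for whatever automation layer is in use):

```python
import time

def wait_for(predicate, timeout: float = 10.0, interval: float = 0.25):
    """Poll until predicate() returns something truthy, or raise on timeout.

    This makes the wait explicit and bounded instead of a guessed sleep()."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition never became true within the timeout")

# Hypothetical usage against a browser driver:
#   element = wait_for(lambda: driver.find("#cart-total"))
```

Modern browser-automation libraries (e.g. Playwright) build this auto-waiting in, which is part of why DOM-driven agents can avoid the fixed-sleep tax.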
- brikym: It would be great if institutions like banks provided proper APIs.
- _boffin_: What I don't understand about "computer use" is why they're not just grabbing the window handles and storing them to determine what should be clicked, after the first few iterations with a specific application. If a new case / path / whatever is found, drop back to screen grabbing and bounding boxes, then figure out which handles are there and store those too. Idk, not really thought out too much, but it has to be better.
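The cache-with-fallback scheme this comment gestures at can be sketched as follows. `find_by_vision` is a hypothetical stand-in for the expensive screenshot-plus-bounding-box path; the "handle" could be an OS window handle or accessibility node ID in a real system:

```python
class HandleCache:
    """Remember where a control was found; fall back to vision only on a miss."""

    def __init__(self, find_by_vision):
        self._find_by_vision = find_by_vision  # slow path: screenshot + boxes
        self._handles = {}                     # (app, control) -> handle

    def resolve(self, app: str, control: str):
        key = (app, control)
        if key not in self._handles:
            # Cache miss: pay the vision cost once, then reuse the handle.
            self._handles[key] = self._find_by_vision(app, control)
        return self._handles[key]

    def invalidate(self, app: str, control: str):
        # Call when a cached handle stops working (UI changed, new path found),
        # so the next resolve() drops back to the vision path.
        self._handles.pop((app, control), None)
```

The catch, which is probably why products don't lead with this, is detecting staleness: UIs change between visits, so every cached handle needs a cheap validity check before it can be trusted.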
- sheepscreek: This tracks - it has been my experience exactly. Not to mention there isn't a particularly significant gain in accuracy or speed. As things stand, to me it is the worst of both worlds: expensive and inaccurate.
- sarmike31: Just wondering: RPA companies like UiPath are dead in the water, right?
- ai_fry_ur_brain: It's funny watching the slow mean reversion back to more deterministic tooling.
- svnt: > This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything. > To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own. This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here? Do you know what the vision model was trained on? Often people see "vision model" and think "human-level GUI navigator," when AFAIK the latter has yet to be built.
- Havoc: Isn't it possible to somehow wire this into the window manager - Wayland or whatever - and have it speak the native window language rather than crunch the pixels? At least for the majority. I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency.
- sudb: I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. Vercel's agent-browser, the relatively new dev-browser [1], etc.). There are use cases where the vision agent is the more obvious, or only, choice though - e.g. proprietary/locked-down desktop apps that lack an automation layer. 1. https://github.com/SawyerHood/dev-browser
- cjbarber: I think of computer use as like last-mile delivery; APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.
- 2001zhaozhao: I have only found computer use useful for GUI app local debugging. Presumably it will also be useful for getting around protections on external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind. I don't think any new app should ever be specifically designed for AI to interact with it through computer use.
- rootcage: The best use cases I've seen for computer/browser use are legacy SaaS/software. For example, hotels use archaic Property Management Systems (PMS) and are required by corporate to use and pay for them. These companies can barely keep the product alive; they definitely aren't incentivized to maintain an API. In such a case a browser use agent seems to be the best (only) way.
- arjunchint: The hard part about the web is that APIs just aren't available, even if the website owner wants them exposed (big if). I embedded a Google Calendar widget on my Book a Demo page; I don't know the API, and Google doesn't expose/maintain one either. What we are doing at Retriever AI is instead reverse engineering the website's APIs on the fly and calling them directly from within the webpage, so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
- overgard: I've been thinking recently about things I'd want an agent for. The problem is, everything I think of requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.). All the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of: Taxes - it needs a lot of sensitive information to get W-2s, and since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild. A background check for a new job - it took me 3 hrs to fill one out (mostly because the website was THAT bad), and being myself I was already making mistakes, forgetting things like move-in dates from 10 years ago and searching my email for random documents. No way I'm trusting an agent with this. Setting up an LLC - nope nope nope. There's a lot of annoying work involved, but I'm not trusting an LLM to do it. Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things an LLM can't be trusted with.
- mrcwinn: We need a superset of HTML that is designed for agents. I'm not sure it's quite as simple as "just make everything an API."
- gowld: Confusing title? "Computer use" here is actually "browser vision"?
- dist-epoch: It doesn't matter. Electron uses 10x more RAM than regular apps, but it's so convenient. Python is 100x slower than C; it's in the top 3 languages now. Worse but more convenient always wins.
- RobRivera: UX feedback. Me: hmm, this title confuses and infuriates Rob. [Clicks link] Me: sees same title, repeats feelings of confusion and infuriation. [Scrolls article down on my smartphone] Me: sees JPG with the same title, repeats feelings of confusion and infuriation. [Closes tab] [Continues living rest of my life] I hope this feedback is well received and understood.
- moralestapia: This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.
- creatonez: Browser agents / vision agents are a menace, and ISPs should outright ban subscribers who run them on the public internet.
- zephen: I find this extremely surprising. When you think of everything it takes for an AI to use what the article calls a "vision agent," it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
- ipunchghosts: I have a similar finding for a website I made that collates college-town bar specials and live music. Using agents with vision models works, but it's not as straightforward as one would initially think. You can check out the results here: https://www.nittanynights.com
- sanderjd: Only 45x?
- taormina: The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.
- deafpolygon: This is missing the point that AI training probably cost boatloads more to get here.
- theabhinavdas: For now.
- faangguyindia: I saw Codex screenshotting, then clicking around. I just stopped it and never used that again. Using CLI tools is much faster and more token-efficient. I developed ten apps in the last two months; one reached 10,000+ monthly active users. I ask Codex to generate SVGs line by line and backtrack-edit, ask it to use Inkscape to generate icons, etc. I developed all this on a $20 Codex sub.
- bottlepalm: There's no way this is true. I would argue in some cases computer use is less expensive. First, for APIs that don't even exist, it's a non-starter. Second, most APIs are not designed for agents and are verbose as hell - returning the entire DTO and tons of unnecessary properties burns tokens. Third, computer use is not as token-hungry as you think it is: a single screenshot may be just 1,000 tokens. It's actually competitive and beats API workflows in many cases.
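The verbose-DTO point is easy to sanity-check with the common rough heuristic of ~4 characters per token (a real tokenizer will differ, and the ~1,000-token screenshot figure is the comment's own assumption, not a measured value):

```python
import json

SCREENSHOT_TOKENS = 1000  # the per-screenshot figure the comment assumes

def rough_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

# A hypothetical verbose DTO: the agent wanted one field, the API returned 41.
dto = {f"field_{i}": "some unrequested value" for i in range(40)}
dto["order_status"] = "shipped"

print(rough_tokens(json.dumps(dto)), "estimated tokens for the DTO vs",
      SCREENSHOT_TOKENS, "for one screenshot")
```

Whether the API or the screenshot wins depends entirely on the payload: a padded DTO can approach or exceed a screenshot's budget, while a tight, field-filtered response is orders of magnitude smaller - which is arguably the article's point.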
- 0xWTF: To make this concrete: Akasa uses computer vision to read medical records in place of medical coders, because there aren't enough medical coders to get all the billing right, and medical systems leave something like $1T a year on the table. The EHRs could give companies like Akasa API access so Akasa could just run NLP, but the EHR vendors don't grant third parties API access for various reasons. So instead Akasa gets a seat license for each medical system it services, uses computer vision to read the screen (a cadre of Akasa medical coders reviews errors to stay up to date with unannounced changes from the EHR vendors), and then runs NLP to figure out which CPT codes to assign, so a bill can actually be put in and sent to the payer and the hospitals can stay afloat. So this 45x delta is how much more the medical systems pay Akasa because Epic won't work with Akasa. This is but one example of why US medical bills are outrageously high.