Beyond Clicking and Shell Commands: API-Native Computer Control

Carson Wu

An AI agent can draft an email, summarize a repository, or propose edits. The more difficult question is what happens next: how should it operate an application?

GUI automation and shell access are two practical answers. I use both, but I have also been experimenting with another option: give the agent a small application API and let it write a short JavaScript program for each task.

I call this API-native computer control. The name is more ambitious than the current implementation. I do not know whether it is the best general interface for agents, but it seems promising when an application already has structured data, domain rules, and a meaningful API.

When tool calls become the control loop

Tool calling often exposes actions such as:

    create_rectangle(...)
    move_object(...)
    set_fill_color(...)
    delete_object(...)

This works well for a few independent actions. It becomes less convenient when a task needs iteration or branching. Imagine a slide editor, canvas, or UI board where the user asks:

Find every text object smaller than 12 px, increase it to 12 px, and move any overflow into a new text box below the original.

If the model calls one tool for every object, every observation and action may require another inference step. A short program keeps the ordinary computation local:

    const objects = canvas.listObjects({ type: "text" });

    for (const object of objects) {
      // Text that is already large enough needs no change.
      if (object.fontSize >= 12) continue;

      const result = canvas.updateText(object.id, { fontSize: 12 });

      // Move any text that no longer fits into a new box below the original.
      if (result.overflowText) {
        canvas.createText({
          content: result.overflowText,
          x: object.x,
          y: object.y + object.height + 8,
          fontSize: 12,
        });
      }
    }

The model still chooses the operation, but loops, conditions, and intermediate values do not each need another model turn. Generated code is not a replacement for tools; it is a way to compose approved tools.
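To make the "one program, many tool calls" point concrete, here is a minimal sketch that runs the loop above against an in-memory stand-in. Everything in it is an assumption for illustration: `createCanvas`, `runProgram`, the `maxChars` field (a crude substitute for real text layout), and the call counter are hypothetical, not part of any real editor. Note that `new Function` only controls which names are passed in; it is not a security sandbox.

```javascript
// Hypothetical in-memory canvas standing in for a real editor API.
function createCanvas(initialObjects) {
  const objects = new Map(initialObjects.map((o) => [o.id, { ...o }]));
  let nextId = objects.size + 1;
  let calls = 0; // number of API calls made by the generated program

  return {
    listObjects({ type }) {
      calls += 1;
      // Return copies so the program cannot mutate state except through the API.
      return [...objects.values()]
        .filter((o) => o.type === type)
        .map((o) => ({ ...o }));
    },
    updateText(id, patch) {
      calls += 1;
      const object = objects.get(id);
      if (!object) throw new Error(`Unknown object: ${id}`);
      Object.assign(object, patch);
      // Assumed overflow rule: anything past `maxChars` no longer fits.
      const overflowText = object.content.slice(object.maxChars);
      object.content = object.content.slice(0, object.maxChars);
      return { overflowText };
    },
    createText(props) {
      calls += 1;
      const id = `text-${nextId++}`;
      objects.set(id, { id, type: "text", ...props });
      return id;
    },
    snapshot: () => [...objects.values()].map((o) => ({ ...o })),
    callCount: () => calls,
  };
}

// The program text a model might return in a single turn.
const generatedProgram = `
  const objects = canvas.listObjects({ type: "text" });
  for (const object of objects) {
    if (object.fontSize >= 12) continue;
    const result = canvas.updateText(object.id, { fontSize: 12 });
    if (result.overflowText) {
      canvas.createText({
        content: result.overflowText,
        x: object.x,
        y: object.y + object.height + 8,
        fontSize: 12,
      });
    }
  }
`;

// Only `canvas` is handed to the program by name. This is NOT isolation:
// globals remain reachable, so a real host still needs a proper sandbox.
function runProgram(source, canvas) {
  return new Function("canvas", `"use strict";\n${source}`)(canvas);
}

const canvas = createCanvas([
  { id: "text-1", type: "text", content: "Quarterly results", fontSize: 10, x: 40, y: 40, height: 20, maxChars: 9 },
  { id: "text-2", type: "text", content: "Title", fontSize: 24, x: 40, y: 120, height: 30, maxChars: 20 },
]);

runProgram(generatedProgram, canvas);
console.log(canvas.callCount(), canvas.snapshot());
```

In this run the program makes three API calls (one list, one update, one create) inside a single generated program, which is the saving the section describes.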

GUI control recovers semantics from pixels

Graphical interfaces are excellent for people. We can scan a canvas, recognize an icon, and point at an object without naming every part of the scene.

For an agent, the same action often takes a longer translation path:

    flowchart LR
      subgraph GUI["GUI control"]
        A[Render] --> B[Interpret pixels] --> C[Find control] --> D[Click or type] --> E[Check result]
      end
      subgraph API["Semantic API control"]
        F[Read state] --> G[Call named action] --> H[Verify result]
      end
      E --> S[(Application state)]
      H --> S
      classDef gui fill:#fff0df,stroke:#c76b16,color:#17202a
      classDef api fill:#e8f0fb,stroke:#2563a7,color:#17202a
      class A,B,C,D,E gui
      class F,G,H api

GUI control is indispensable when no structured interface exists. Vision also remains important when appearance is the result, such as editing slides, graphics, or a web page. My concern is using pixels and coordinates as the primary control plane when the application already knows the content, bounds, transform, and identity of each object.

A useful split is to use vision to understand and judge the output, while using semantic operations to change application state when available.
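The semantic path (read state, call a named action, verify the result) can be sketched in a few lines. The `createBoard` factory and its `getObject` and `moveObject` methods are hypothetical names chosen for this example, not a real application's API; a visual check of the rendered output would still sit on top of this.

```javascript
// Hypothetical board API; names and object shapes are assumptions.
function createBoard(objects) {
  const byId = new Map(objects.map((o) => [o.id, { ...o }]));

  function lookup(id) {
    const object = byId.get(id);
    if (!object) throw new Error(`Unknown object: ${id}`);
    return object;
  }

  return {
    // Read state: returns a copy so callers cannot mutate the board directly.
    getObject: (id) => ({ ...lookup(id) }),
    // Named action: the only way to change an object's position.
    moveObject(id, { x, y }) {
      const object = lookup(id);
      object.x = x;
      object.y = y;
    },
  };
}

// Read state, call the named action, then verify by reading state again.
function moveAndVerify(board, id, target) {
  const before = board.getObject(id);
  board.moveObject(id, target);
  const after = board.getObject(id);
  const ok = after.x === target.x && after.y === target.y;
  return { before, after, ok };
}

const board = createBoard([{ id: "logo", x: 10, y: 10, width: 64, height: 64 }]);
const report = moveAndVerify(board, "logo", { x: 200, y: 48 });
console.log(report.ok, report.after);
```

No coordinates are inferred from pixels here: the object is addressed by identity, and verification compares structured state rather than a screenshot.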

Why the shell works so well

The shell removes much of the visual interpretation work. It is textual, scriptable, composable, and supported by decades of public examples. Models have seen large amounts of Bash, git, ffmpeg, and similar tools during training. Some of that operational experience is encoded in the model’s learned parameters, so common commands can feel almost native.

That is a substantial advantage:

    ffmpeg -i input.mov -vf "scale=1920:-2,fps=30" -c:v libx264 -crf 20 output.mp4

A model familiar with ffmpeg may produce this without first studying a manual. The shell is therefore hard to beat for developer environments and established utilities.

The tradeoff is authority and precision. A process launcher plus filesystem access is broader than a narrowly scoped application operation:

    video.resize({
      assetId,
      width: 1920,
      frameRate: 30,
      destination: approvedOutput,
    });

CLI syntax also depends on conventions that are not fully captured by a machine-checkable schema.

Containers and operating-system sandboxes still matter. A narrow API does not replace them, but it can reduce the authority placed inside the boundary. For application-level automation, that may also allow a lighter runtime than a general shell environment.
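One way to picture "narrow authority plus a checkable schema" is a wrapper that validates every argument and only accepts pre-approved destinations. This is a sketch under stated assumptions: `createVideoApi`, `resizeSchema`, the numeric limits, and the `enqueue` callback are all hypothetical, and the sketch only records a job rather than touching any real video.

```javascript
// Hypothetical argument schema for video.resize; the limits are arbitrary.
const resizeSchema = {
  assetId: (v) => typeof v === "string" && v.length > 0,
  width: (v) => Number.isInteger(v) && v > 0 && v <= 7680,
  frameRate: (v) => Number.isInteger(v) && v > 0 && v <= 120,
  destination: (v) => typeof v === "string" && v.length > 0,
};

function createVideoApi({ approvedOutputs, enqueue }) {
  const approved = new Set(approvedOutputs);

  return {
    resize(args) {
      // Reject anything the schema does not name.
      for (const key of Object.keys(args)) {
        if (!Object.hasOwn(resizeSchema, key)) {
          throw new TypeError(`Unexpected argument: ${key}`);
        }
      }
      // Every named argument must be present and valid.
      for (const [key, isValid] of Object.entries(resizeSchema)) {
        if (!isValid(args[key])) {
          throw new TypeError(`Invalid or missing argument: ${key}`);
        }
      }
      // Authority check: the caller cannot choose arbitrary paths.
      if (!approved.has(args.destination)) {
        throw new Error(`Destination not approved: ${args.destination}`);
      }
      return enqueue({ op: "resize", ...args });
    },
  };
}

const jobs = [];
const video = createVideoApi({
  approvedOutputs: ["exports/output.mp4"],
  enqueue: (job) => jobs.push(job), // stand-in for real work; returns queue length
});

video.resize({ assetId: "clip-1", width: 1920, frameRate: 30, destination: "exports/output.mp4" });

let rejected = null;
try {
  video.resize({ assetId: "clip-1", width: 1920, frameRate: 30, destination: "/etc/passwd" });
} catch (error) {
  rejected = error.message;
}
console.log(jobs.length, rejected);
```

The second call fails before any work is queued, which is the kind of rejection a general process launcher cannot make from the command string alone.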

From fixed handlers to generated programs

In a React application, a button usually invokes code that a developer prepared in advance:

    function AlignButton({ selectedIds }) {
      const onClick = () => {
        const objects = selectedIds.map((id) => editor.getObject(id));
        // The leftmost x among the selected objects becomes the alignment target.
        const left = Math.min(...objects.map((object) => object.x));

        for (const object of objects) {
          editor.updateTransform(object.id, { x: left });
        }

        // Record the whole alignment as one history entry.
        editor.commitHistory("Align left");
      };

      return <button onClick={onClick}>Align left</button>;
    }

The button is a human-friendly handle for a predefined code path. This is reliable, but fixed: the developer must anticipate the...
