Screen Control APIs

Programmatic interfaces for executing agent-driven actions on graphical user interfaces.

Computer Use Abstraction

Modern AI agents aren't limited to text terminals. Langvision provides a standard set of APIs that let agents interact with virtual desktops, browsers, and mobile emulators.

Action Primitives

The API exposes several fundamental computer-use actions that agents can emit:

• click(x, y): Simulates a left mouse click at the given coordinates.
• type(text): Injects keyboard events for the specified text string.
• scroll(direction, amount): Executes mouse wheel scroll events.
• drag(startX, startY, endX, endY): Clicks, holds, and moves the cursor before releasing.
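
One way to picture these primitives is as small typed records that an agent emits and a runtime serializes. The sketch below is illustrative only: the class names (Click, TypeText, Scroll, Drag), the to_payload helper, and the dictionary layout are assumptions made for this example and are not taken from the Langvision API. The accepted values for direction are likewise not specified here, so the sketch leaves that field as a free-form string.

```python
from dataclasses import dataclass, asdict
from typing import Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    direction: str  # assumed free-form, e.g. "up" or "down"
    amount: int


@dataclass(frozen=True)
class Drag:
    start_x: int
    start_y: int
    end_x: int
    end_y: int


Action = Union[Click, TypeText, Scroll, Drag]

# Hypothetical mapping from record type to the action name listed above.
_ACTION_NAMES = {Click: "click", TypeText: "type", Scroll: "scroll", Drag: "drag"}


def to_payload(action: Action) -> dict:
    """Serialize an action record into a plain dict (assumed wire format)."""
    return {"action": _ACTION_NAMES[type(action)], **asdict(action)}
```

For example, to_payload(Click(120, 340)) produces a dictionary whose "action" key is "click" and whose "x" and "y" keys hold the coordinates. Keeping each primitive as an immutable record makes emitted actions easy to log and replay.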

How Agents Decide

A standard screen control loop looks like this:

1. The Langvision client takes a screenshot and sends it to the VLM (Vision-Language Model).
2. The VLM processes the image, identifies the target element (e.g., 'Submit Button'), and outputs the precise (x, y) coordinates.
3. The agent invokes the click tool with those coordinates.
4. The system executes the click, takes a new screenshot, and repeats the loop until the task is complete.
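
The loop above can be sketched as a short function. This is a minimal sketch under stated assumptions, not the Langvision client: take_screenshot, locate_target, and click are hypothetical callables supplied by the caller, the convention that the model returns None once the task is complete is an assumption, and the max_steps cap is added here as a safety limit that the steps above do not mention.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def run_click_loop(
    take_screenshot: Callable[[], bytes],
    locate_target: Callable[[bytes, str], Optional[Point]],
    click: Callable[[int, int], None],
    task: str,
    max_steps: int = 10,
) -> int:
    """Repeat screenshot -> locate -> click; return the number of clicks made.

    All three callables are placeholders for whatever the real client,
    VLM, and click tool provide.
    """
    clicks = 0
    for _ in range(max_steps):
        image = take_screenshot()            # step 1: capture the screen
        target = locate_target(image, task)  # step 2: VLM returns (x, y) or None
        if target is None:                   # assumed "task complete" signal
            break
        x, y = target
        click(x, y)                          # steps 3-4: execute the click
        clicks += 1                          # the next iteration re-captures the screen
    return clicks
```

Passing the three operations in as callables keeps the loop testable with stubs, for instance a locate_target that returns a fixed sequence of coordinates followed by None.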