Empower your agents to 'see' and interact with the visual world using multi-modal capabilities.
Vision-Language Models
Multi-Modal
What is Langvision?
Langvision is the multi-modal subsystem of the Langtrain ecosystem. It abstracts the complexity of integrating Vision-Language Models (VLMs) like LLaVA, Qwen-VL, and Pixtral into your agentic workflows.
Core Capabilities
•UI Understanding: Langvision models are specifically fine-tuned on web and desktop interfaces. They can parse bounding boxes, identify clickable elements, and read text natively from screenshots.
•Visual QA: Pass images along with text prompts to ask complex questions about graphs, diagrams, or real-world photographs.
•Continuous Streaming: For robotics or screen-recording applications, Langvision can process video frame streams in near real-time using optimized context caching.
Integration in Studio
Using Langvision in Langtrain Studio is as simple as dropping a 'Vision Node' onto your canvas. When an agent requires visual context to complete a task, it can trigger the Vision Node to request a screenshot, parse the current state, and make an informed decision.