AI Development

FDM-1: AI Trained on 11M Hours of Screen Footage

Standard Intelligence FDM-1 learns software operation by training on 11 million hours of screen recordings. Architecture, capabilities, benchmarks, and API access.

Digital Applied Team
February 27, 2026
10 min read
At a glance:
  • Training screen footage: 11M hours
  • Primary training data: Video
  • Key capabilities: CAD + code
  • Public release year: 2026
Key Takeaways

Trained on observation, not instruction: FDM-1 learns software operation by watching 11 million hours of screen recordings showing humans using applications, learning UI patterns, mouse trajectories, and keyboard sequences from observation rather than explicit programming.
CAD, code editing, and web navigation: The model demonstrates competitive performance across three distinct domains: operating CAD software, editing code in IDEs, and navigating complex web applications, suggesting the video training approach generalizes across interface types.
Competitive with Claude computer use: Early benchmarks show FDM-1 matching or approaching Claude computer use and GPT-4o on standard desktop automation tasks, despite using a fundamentally different training methodology.
The training data moat is enormous: The 11 million hours of curated screen recordings represent a dataset that would be extremely expensive and time-consuming for competitors to replicate, creating a significant barrier to entry.

The dominant approach to building AI that can operate computers has relied on language models augmented with screenshot analysis and tool-calling APIs. Claude computer use, GPT-4o with tool use, and similar systems start from a text-trained foundation and learn to interpret desktop interfaces as a secondary capability. Standard Intelligence took a fundamentally different path with FDM-1: train a model on 11 million hours of screen recordings from the ground up, so that visual software operation is the native competency rather than an add-on.

This guide covers what FDM-1 is, how its video-first training methodology works, what it can and cannot do today, how it compares to existing computer-use AI systems, and what the release signals for the broader desktop automation landscape. Whether you are evaluating computer-use AI for enterprise workflows, building automation tools, or tracking the trajectory of foundation models, the FDM-1 release represents a meaningful data point in how machines learn to interact with software.

What Is FDM-1?

FDM-1, or Foundation Desktop Model 1, is an AI model developed by Standard Intelligence that was trained primarily on video recordings of humans operating desktop software. Rather than learning from text corpora and then being fine-tuned to interpret screenshots, FDM-1 ingests raw screen footage as its primary training signal. The model observes cursor movements, click patterns, keyboard inputs, and the visual changes that result from those actions across millions of hours of real software use.

Traditional Computer-Use AI
  • Text-trained LLM as the foundation layer
  • Screenshots processed as secondary visual input
  • Tool-calling APIs to execute mouse and keyboard actions
  • Strong reasoning, slower visual processing
FDM-1 Video-Native Approach
  • Screen recordings as the primary training data
  • Mouse trajectories and keyboard sequences learned natively
  • Visual interface understanding is the core competency
  • Fast visual processing, limited textual reasoning

The distinction matters because it determines what the model is inherently good at. A text-first model like Claude Opus 4.6 excels at reasoning about what to do and can generalize to unfamiliar interfaces through instruction following. A video-first model like FDM-1 excels at recognizing familiar visual patterns and executing learned action sequences with precision. These are complementary strengths, and the competitive landscape will likely include both approaches.

Video Training Methodology

The core innovation behind FDM-1 is treating desktop operation as a video prediction problem. The model receives a sequence of screen frames and learns to predict what the human will do next: where the cursor will move, what will be clicked, and what will be typed. This is conceptually similar to how next-token prediction works in language models, but applied to visual-action sequences instead of text.
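As a minimal sketch of that framing (all class and function names here are hypothetical; Standard Intelligence has not published FDM-1's data format), a recorded session can be sliced into (context, next-action) supervision pairs, much as next-token prediction slices text:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    t: float       # capture timestamp in seconds
    pixels: bytes  # encoded screen image (placeholder)

@dataclass
class Action:
    kind: str      # "move" | "click" | "type" | "scroll"
    x: int = 0
    y: int = 0
    text: str = ""

def to_training_pairs(frames, actions, context_len=8):
    """One example per recorded action: the model sees the recent frames
    and the actions already taken, and must predict what the human did next.
    Assumes frames[i] is the screen state just before actions[i]."""
    examples = []
    for i, target in enumerate(actions):
        start = max(0, i - context_len)
        examples.append({
            "frames": frames[start:i + 1],  # screen states up to now
            "history": actions[start:i],    # prior actions in the window
            "target": target,               # the action to predict
        })
    return examples
```

A production pipeline would additionally tokenize coordinates, keystrokes, and timing; the point here is only the supervision structure.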

Training Pipeline Overview
1. Data Collection

11 million hours of screen recordings from consented contributors and synthetic sessions, capturing diverse software usage across CAD tools, IDEs, browsers, and productivity applications.

2. Action Annotation

Each frame is paired with the corresponding human input event: cursor coordinates, click type, scroll direction, and keystroke sequences. This creates supervision signals directly from observed behavior.

3. Visual-Action Prediction

The model learns to predict the next action given the current screen state and recent action history. Over millions of hours, it builds an internal representation of how software interfaces work.

4. Task Fine-Tuning

After pre-training on general desktop footage, the model is fine-tuned on specific task categories (CAD operations, code editing, web navigation) with reward signals based on task completion success.

The scale of the training data is notable. Eleven million hours translates to roughly 1,250 years of continuous screen recording. This volume allows the model to encounter rare interface states, unusual application configurations, and edge-case workflows that would be impossible to cover through manual demonstration datasets. The breadth of software covered during training also contributes to generalization: FDM-1 has observed enough variation in button placements, menu structures, and dialog box designs to develop robust visual pattern recognition.
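The back-of-the-envelope conversion behind that figure:

```python
hours = 11_000_000
hours_per_year = 24 * 365       # ignoring leap days
years = hours / hours_per_year  # ≈ 1255.7 years of continuous footage
```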

Why Video Training Is Different

Text-based training teaches a model what software does through documentation and descriptions. Video training teaches a model how software is used through direct observation. The difference is analogous to learning to drive by reading the manual versus watching thousands of hours of dashcam footage from experienced drivers. Both produce knowledge, but the knowledge is structured differently. The video-trained model captures implicit behavioral patterns: the micro-pauses before clicking, the scanning patterns across menus, and the recovery strategies when an action produces an unexpected result.

Capabilities: CAD, Code, and Navigation

Standard Intelligence has demonstrated FDM-1 across three primary domains, each representing a different type of desktop interaction complexity. The diversity of these demonstrations is significant because it suggests the video training approach produces generalizable skills rather than narrow automation for a single application.

CAD Software
  • Navigates 3D modeling environments
  • Selects tools from complex toolbars
  • Modifies geometry through parameter dialogs
  • Handles multi-step design workflows
Code Editing
  • Operates within IDE interfaces
  • Uses file explorer and terminal panels
  • Executes keyboard shortcuts for editing
  • Navigates multi-file projects
Web Navigation
  • Fills multi-field web forms
  • Handles dropdown menus and date pickers
  • Navigates multi-page checkout flows
  • Switches between browser tabs

The CAD capability is particularly notable because CAD software represents one of the most visually complex desktop environments, with dense toolbars, context-sensitive menus, and precise spatial interactions. Traditional automation approaches struggle with CAD applications because the interfaces are highly graphical and inconsistently structured across different software packages. The video training approach sidesteps this problem by learning directly from how humans navigate these interfaces.

Benchmarks vs Claude and GPT-4o

Comparing FDM-1 against Claude Opus 4.6 computer use and GPT-4o requires careful framing because the models operate through different mechanisms. Claude and GPT-4o receive screenshots and generate tool calls (click at coordinates, type text, scroll). FDM-1 processes video frames and outputs predicted actions directly. Despite these architectural differences, standard desktop automation benchmarks provide a reasonable comparison surface.

Benchmark Performance Comparison

Web Form Completion

FDM-1 performs comparably to Claude Opus 4.6 on standard web form tasks, with faster action execution but slightly lower accuracy on forms requiring contextual reasoning about field content. GPT-4o trails both on complex multi-step forms.

Desktop Application Navigation

FDM-1 shows particular strength on tasks involving complex desktop applications with dense visual interfaces. On CAD and design tool benchmarks, FDM-1 outperforms both Claude and GPT-4o, likely due to the visual training advantage.

Multi-Step Task Completion

On longer tasks requiring 10+ sequential actions, Claude Opus 4.6 maintains higher completion rates due to stronger planning and error recovery. FDM-1 executes individual steps faster but is more likely to get stuck when the interface state diverges from training patterns.

Novel Interface Handling

When tested on applications not represented in training data, Claude and GPT-4o generalize better through instruction following. FDM-1 relies on visual similarity to known interfaces, which works well for conventional UI patterns but degrades on custom or unusual designs.

The benchmark picture suggests that FDM-1 and language-model-based approaches have complementary strengths. FDM-1 excels on tasks that are visually repetitive and pattern-heavy, where the model has seen thousands of similar interactions in training. Language models excel on tasks requiring reasoning, planning, and adaptation to novel situations. A hybrid approach combining both architectures could potentially outperform either alone.
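One way to picture such a hybrid (purely illustrative; neither vendor ships this, and `planner`/`executor` are stand-ins for an LLM and a video-native model respectively) is a control loop where the LLM proposes subgoals, the video-trained model executes them, and control returns to the LLM when execution stalls:

```python
def run_hybrid_task(task, planner, executor, max_steps=20):
    """Illustrative hybrid loop: an LLM plans and recovers from errors,
    a video-native model executes the individual UI actions."""
    state = "start"
    for _ in range(max_steps):
        subgoal = planner.plan(task, state)           # reason about what to do next
        if subgoal is None:                           # planner decides we're done
            return state
        state, ok = executor.execute(subgoal, state)  # pattern-matched clicks
        if not ok:
            state = planner.recover(task, state)      # LLM handles the odd dialog
    return state
```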

Current Limitations

FDM-1 represents an early-stage approach to video-trained desktop automation, and its limitations are important to understand before evaluating it for production use cases. These are not minor edge cases; they represent fundamental constraints of the current architecture.

Limited Textual Reasoning

Because FDM-1 is trained on visual patterns rather than text, it struggles with tasks that require understanding the semantic meaning of on-screen text. It can recognize a text field and type into it, but it cannot reliably interpret instructions written on the page or reason about content in the same way a language model can.

Error Recovery Fragility

When an action produces an unexpected result (a dialog box that was not in the training data, a loading spinner that takes longer than expected), FDM-1 can enter a loop or stall. The model lacks the reasoning capability to step back, diagnose the problem, and try an alternative approach. This is a significant gap for production deployment.

Resolution and Display Sensitivity

The model is sensitive to screen resolution, display scaling, and theme settings. A button that appears at certain coordinates in the training data may be in a different position on a different resolution display. While the model handles common configurations, unusual display setups can degrade performance.

No Natural Language Instruction

Unlike Claude or GPT-4o, FDM-1 does not accept natural language task descriptions. Tasks must be specified through structured inputs or demonstrated through initial actions. This makes it less flexible for ad-hoc automation and more suited to predefined workflow automation.
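What a "structured input" might look like in practice (the format below is entirely hypothetical; Standard Intelligence has not published a task schema) is a predefined workflow rather than a free-form prompt:

```python
# Hypothetical structured task specification for a predefined workflow.
task = {
    "workflow": "export_drawing",  # workflow id registered ahead of time
    "app": "cad",
    "steps": [
        {"action": "open_file", "path": "bracket.dwg"},
        {"action": "run_export", "format": "step"},
    ],
    "timeout_s": 120,
}
```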

Implications for Computer-Use AI

FDM-1 is significant not because it replaces existing approaches, but because it validates a fundamentally different training paradigm for desktop automation. The fact that a video-trained model can compete with models built on top of some of the most capable language models in existence suggests that observation-based learning has underexplored potential in the computer-use space.

For the broader AI industry, FDM-1 raises several important questions. First, could hybrid architectures that combine a language model for planning and reasoning with a video-trained model for execution outperform either approach alone? Second, does the video training approach scale better as training compute increases, given that screen recordings are more abundant and cheaper to collect than expert demonstrations? Third, will specialized desktop models trained on domain-specific footage (only CAD, only financial software, only healthcare EHR systems) deliver superior performance within those verticals?

What This Means for Developers
  • Video-trained models are now a viable path for desktop automation, not just a research curiosity
  • Expect hybrid systems combining LLM reasoning with video-trained execution within 12-18 months
  • The training data moat (11M hours) is a competitive advantage that favors early movers
What This Means for Businesses
  • Desktop automation is approaching viability for visually complex enterprise software
  • CAD, ERP, and legacy applications that lack APIs are now automatable through visual interfaces
  • Evaluate both LLM-based and video-based approaches for your specific workflow requirements

The emergence of diffusion-based architectures for inference speed, as explored in models like Mercury 2 from Inception Labs, combined with video-native training approaches like FDM-1, suggests that the next generation of computer-use AI may look very different from the current screenshot-and-tool-call paradigm.

API Access and Integration

Standard Intelligence has structured FDM-1 access around an API that accepts screen state (screenshots or video frame sequences) and returns predicted actions. This is a different interaction pattern than language model APIs, which accept text prompts and return text responses. Developers integrating FDM-1 need to build a screen capture pipeline, send frames to the API, receive action predictions, and execute those actions on the local machine.

Integration Architecture

Screen Capture Layer

A local agent captures the current screen state at configurable intervals (typically 2-5 frames per second) and sends frames to the FDM-1 API. The agent handles resolution normalization and frame encoding to minimize bandwidth.
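Resolution normalization might, for example, rescale frames to a canonical size and map predicted coordinates back to the native display (a sketch; the agent's actual behavior and the canonical size are assumptions):

```python
def to_canonical(x, y, screen_w, screen_h, canon_w=1280, canon_h=800):
    """Map native screen coordinates into the canonical frame the model sees."""
    return (x * canon_w / screen_w, y * canon_h / screen_h)

def from_canonical(cx, cy, screen_w, screen_h, canon_w=1280, canon_h=800):
    """Map a predicted click back to native screen coordinates."""
    return (round(cx * screen_w / canon_w), round(cy * screen_h / canon_h))
```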

Action Prediction API

The API returns structured action objects containing the action type (click, type, scroll, key press), coordinates, and any text content. Response latency is currently in the 100-300ms range, which is adequate for most automation tasks but noticeable for real-time interaction.
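Based on the fields described above, a returned action object might deserialize to something like this (field names and shape are hypothetical, not a published schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictedAction:
    kind: str                # e.g. "click" | "type" | "scroll" | "key" (assumed values)
    x: Optional[int] = None  # screen coordinates for pointer actions
    y: Optional[int] = None
    text: str = ""           # content for "type" actions
    key: str = ""            # e.g. "ctrl+s" for key-press actions

def parse_action(payload: dict) -> PredictedAction:
    """Build an action from a JSON payload such as {"kind": "click", "x": 320, "y": 64}."""
    return PredictedAction(**payload)
```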

Action Execution Layer

A local executor translates API responses into actual mouse and keyboard events on the host machine. This layer handles platform-specific input simulation (Windows, macOS, Linux) and includes safety mechanisms to prevent unintended actions.
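Putting the three layers together, a client loop might look like the following. The `capture`, `predict`, and `execute` callables are stand-ins for a screen grabber, the HTTP call to the prediction API, and an input simulator; none of these names come from the actual SDK.

```python
import time

def automation_loop(capture, predict, execute, fps=3, max_steps=100):
    """Capture the screen, ask the model for the next action, execute it.
    Returns True if the model signals completion, False if the step
    budget runs out first."""
    interval = 1.0 / fps
    for _ in range(max_steps):
        frame = capture()        # screen capture layer
        action = predict(frame)  # action prediction API (~100-300 ms round trip)
        if action is None:       # model signals task completion
            return True
        execute(action)          # action execution layer (OS input events)
        time.sleep(interval)     # respect the configured frame rate
    return False
```

A real deployment would add the safety mechanisms mentioned above (e.g. bounding which windows the executor may touch) before letting predicted actions reach the host machine.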

For developers already working with Anthropic computer use APIs, the FDM-1 integration pattern will feel somewhat different. Claude computer use operates through a text-based tool-calling interface where the model reasons about what to do and issues structured tool calls. FDM-1 operates through a visual prediction interface where the model sees the screen and predicts the next action directly. The execution layer is similar, but the reasoning layer is fundamentally different.

The Future of Desktop Foundation Models

FDM-1 is a first-generation model in what is likely to become a broader category of desktop foundation models. The training methodology, while novel, is straightforward to scale: more screen recordings, more diverse software coverage, and more compute produce a better model. The 11-million-hour training dataset is large by current standards, but it is a small fraction of the total hours of screen time generated globally every day.

The trajectory for desktop foundation models will likely follow a pattern similar to language models: rapid capability improvement driven by scale, followed by specialization for specific domains and use cases. Within 12-24 months, we can reasonably expect desktop models fine-tuned for specific industries (healthcare EHR systems, financial trading platforms, engineering CAD suites) that significantly outperform general-purpose models within their domain.

The convergence of video-trained desktop models, language-model reasoning, and diffusion-based inference is creating a rich design space for computer-use AI. The winning architecture will likely combine multiple approaches: a language model for high-level task planning and error recovery, a video-trained model for visual pattern recognition and precise action execution, and efficient inference mechanisms for real-time responsiveness. FDM-1 is an early but important contribution to one piece of this puzzle.
