Artificial Intelligence · March 5, 2026

Si Inc’s FDM-1: The Computer Action Model Trained on 11 Million Hours of Video

On February 23, 2026, a small team working out of a South Park office in San Francisco pulled back the curtain on something the AI community has been chasing for years: a foundation model that can actually use a computer. Not by parsing screenshots or following scripted workflows, but by watching – and learning from – 11 million hours of screen recordings accumulated over two decades of internet history.

Si Inc’s FDM-1 – short for Forward Dynamics Model – represents a fundamental departure from how computer-use agents have been built. Where previous approaches relied on expensive contractor-labeled screenshots and narrow reinforcement learning environments, FDM-1 trains directly on video at 30 frames per second, predicting the next mouse movement or key press the way a language model predicts the next word. The result is a model that can extrude gears in Blender, fuzz banking applications for bugs, and even drive a car through San Francisco streets using arrow keys – all from a single pretrained foundation.

What makes this possible isn’t just the sheer volume of data. It’s a novel video encoding approach that compresses nearly 2 hours of 30 FPS video into just 1 million tokens – a 50x improvement over the previous state-of-the-art and 100x more efficient than OpenAI’s encoder. That efficiency gap is what separates a model that can process a few seconds of context from one that can reason over hours of continuous computer interaction.

Why 11 Million Hours Changes Everything

Computer action modeling has been stuck in a data desert. The largest open dataset of labeled computer-use video clocked in at under 20 hours of 30 FPS footage. That’s roughly equivalent to trying to build GPT-3 from a single book. The bottleneck was always the same: labeling screen recordings with precise actions – every mouse delta, every key press – required human contractors, and that was prohibitively expensive at scale.

Si Inc’s 11-million-hour corpus obliterates this constraint. The dataset spans film editing sessions, coding livestreams, video game playthroughs, and countless other forms of screen-captured activity that have piled up across the internet over the past twenty years. None of it was originally created for AI training. All of it turned out to be exactly what a general computer action model needed.

Metric                       FDM-1                     Previous best
Video dataset size           11 million hours          Less than 20 hours
Token efficiency (30 FPS)    ~2 hours per 1M tokens    ~1 minute per 1M tokens
Context capability           Multi-hour video          Seconds of screenshots
Data source                  Internet-scale video      Contractor-labeled recordings

This leap of more than five orders of magnitude in dataset scale – 11 million hours versus fewer than 20 – is what shifts computer action modeling from a data-constrained regime to a compute-constrained one, the same transition that unlocked rapid progress in language models years ago.

The Three-Stage Training Recipe

Raw video is useless without action labels. You can’t train a model to predict key presses if you don’t know which keys were pressed. Si Inc solved this with an elegant bootstrapping pipeline that unfolds in three distinct stages.

  1. Train an Inverse Dynamics Model (IDM) on 40,000 hours of contractor-labeled screen recordings. This model learns to predict what action was taken between any two consecutive frames – if a “K” appears on screen, it infers the K key was pressed. For longer-range dependencies like a Cmd+V following a much earlier Cmd+C, minutes of video history provide sufficient context for accurate labeling.
  2. Label the full corpus. The trained IDM is then unleashed on all 11 million hours of unlabeled internet video, automatically annotating every frame with predicted mouse movements and key presses. This bypasses the need for human annotation entirely on the vast majority of the data.
  3. Train the Forward Dynamics Model (FDM-1) autoregressively on the IDM-labeled videos. The model learns next-action prediction – given a stream of video frames and past actions, what key press or mouse delta comes next? The output token space covers all possible computer actions.
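The shape of this bootstrap can be sketched with a toy stand-in. In the sketch below, "frames" are integers and actions shift them by -1, 0, or +1, so the inverse dynamics model is trivially exact; everything here is illustrative, not Si Inc's actual code or API.

```python
from collections import Counter

# Toy world: frames are integers; the action between two frames is the delta.
ACTIONS = {-1: "left", 0: "noop", +1: "right"}

# Stage 1 stand-in: an inverse dynamics model. In this toy world the action
# between consecutive frames is fully determined, so no training is needed.
def idm(frame_a, frame_b):
    return ACTIONS[frame_b - frame_a]

# Stage 2: pseudo-label a large unlabeled corpus of frame sequences.
def label_corpus(sequences):
    labeled = []
    for seq in sequences:
        actions = [idm(a, b) for a, b in zip(seq, seq[1:])]
        labeled.append((seq, actions))
    return labeled

# Stage 3: train a forward model on the pseudo-labels – here a bigram
# "given the last action, predict the next one" stand-in for the
# autoregressive next-action objective.
def train_fdm(labeled):
    counts = {}
    for _, actions in labeled:
        for prev, nxt in zip(actions, actions[1:]):
            counts.setdefault(prev, Counter())[nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

corpus = [[0, 1, 2, 3, 3], [5, 6, 7, 8, 8]]   # drift right, then pause
fdm = train_fdm(label_corpus(corpus))
print(fdm["right"])   # → right (the toy model learns that motion persists)
```

The real pipeline replaces each stage with a learned neural model and an internet-scale corpus, but the data flow – small labeled set, mass pseudo-labeling, autoregressive training – is the same.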

This approach draws direct inspiration from OpenAI’s Video PreTraining (VPT) paper, which used a similar IDM-bootstrapping technique to build a Minecraft agent. But VPT operated with only six seconds of context and was specific to a single game. FDM-1 extends the same principle to hours of context across every type of computer activity imaginable.

A Video Encoder That Actually Scales

The unsung hero of FDM-1 is its video encoder. Existing vision-language models burn through tokens at an extraordinary rate when processing video – roughly 1 million tokens for a single minute of 30 FPS footage. At that rate, processing hours of video is computationally impossible.

Si Inc’s encoder compresses nearly 2 hours of 30 FPS video into 1 million tokens. That’s not an incremental improvement. It’s a categorical change in what’s feasible. Within a 200,000-token context window, the Si Inc tokenizer fits dramatically more frames than any competitor: Gemini manages approximately 775 frames, ChatGPT’s computer-use vision handles around 240, Claude about 162, and NVIDIA’s Cosmos encoder roughly 49.
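The gap is easy to sanity-check with back-of-envelope arithmetic on the figures quoted above (all numbers approximate and rounded):

```python
# Token budget arithmetic for 30 FPS screen video, using the article's
# approximate figures.
FPS = 30
CONTEXT = 200_000  # token window used for the comparison

# Si Inc: ~2 hours of 30 FPS video per 1M tokens.
si_frames_per_1m = 2 * 3600 * FPS                    # 216,000 frames
si_tokens_per_frame = 1_000_000 / si_frames_per_1m   # ~4.6 tokens/frame

# Typical existing VLM: ~1 minute of 30 FPS video per 1M tokens.
vlm_frames_per_1m = 60 * FPS                         # 1,800 frames
vlm_tokens_per_frame = 1_000_000 / vlm_frames_per_1m # ~556 tokens/frame

print(round(CONTEXT / si_tokens_per_frame))   # → 43200 frames in context
print(round(CONTEXT / vlm_tokens_per_frame))  # → 360 frames in context
```

At roughly 4.6 tokens per frame, a 200,000-token window holds about 24 minutes of continuous 30 FPS video – versus seconds for encoders spending hundreds of tokens per frame.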

The key innovation is a masked compression objective that avoids the fixed-size embedding trap. Screen recordings have wildly variable information density – a cursor drifting across a blank desktop carries almost no information, while scrolling through dense code carries enormous amounts. The encoder adapts to this variance rather than forcing a one-size-fits-all compression ratio.
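Si Inc has not published the details of its objective, but the general idea of variable-rate allocation can be sketched as follows: budget tokens per segment in proportion to how much the frames actually change, so idle video costs almost nothing. This is an assumption about the approach, not the actual encoder.

```python
import numpy as np

# Hypothetical sketch: allocate a token budget across frame transitions in
# proportion to an information proxy (mean absolute pixel change), with a
# floor of one token per transition.
def allocate_tokens(frames, budget):
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    diffs = diffs + 1e-8                        # avoid all-zero weights
    share = diffs / diffs.sum()
    return np.maximum(1, np.round(share * budget)).astype(int)

rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 8, 8)), 10, axis=0)   # idle desktop
busy = rng.random((10, 8, 8))                           # dense scrolling
frames = np.concatenate([static, busy])

tokens = allocate_tokens(frames, budget=100)
print(tokens[:9].sum() < tokens[-9:].sum())   # → True: busy video costs more
```

A fixed-ratio encoder would spend the same tokens on both halves; the adaptive version spends its budget almost entirely on the half where pixels actually change.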

What FDM-1 Can Actually Do

Three demonstrations showcase the model’s breadth.

CAD Modeling in Blender

FDM-1 executes continuous mouse movements to complete multi-step CAD operations – selecting faces on an n-gon, extruding them, and shaping a gear. The team creates OS checkpoints at each successful operation (extrude, select, etc.), which unlocks test-time compute for complex sequences. This is precisely the kind of long-horizon, spatially precise task that screenshot-based agents cannot handle.

Autonomous Driving via Arrow Keys

After fine-tuning on less than 1 hour of collected driving data, FDM-1 uses key presses to navigate turns around a block in San Francisco. The team forked openpilot’s joystick mode to control the vehicle and built a web interface displaying live video feeds alongside steering angle, brake, and acceleration data. The model executes turns and corrects back to straight-line steering. Critically, fine-tuning FDM-1 substantially outperforms training from scratch, demonstrating that computer-use pretraining transfers meaningfully to real-world physical control.

Automated UI Testing

FDM-1 proves unusually capable at GUI fuzzing – finding bugs that require deep exploration of state trees or unusual interaction patterns. In a demonstration with a mock banking application, the model discovered a bug where a “Submit Wire Transfer” button remained clickable immediately after a transfer completed, allowing account balances to go negative. Random walks and random key presses can’t find bugs like this because they don’t emulate realistic human behavior.
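The class of bug described can be reproduced in a few lines with a mock form (this is an illustration of the bug pattern, not the actual demo application):

```python
# Minimal mock of the "double-submit" bug class: the handler never disables
# itself and never checks the balance, so a second click goes negative.
class WireTransferForm:
    def __init__(self, balance):
        self.balance = balance
        self.submitted = False

    def submit(self, amount):
        # Bug: no guard on self.submitted, no balance check.
        self.balance -= amount
        self.submitted = True

form = WireTransferForm(balance=100)
form.submit(80)
form.submit(80)        # the button should be disabled here, but isn't
print(form.balance)    # → -60
```

Surfacing this requires an agent that completes a realistic transfer flow and then clicks the stale button again – exactly the plausible-but-unusual sequence a model trained on human screen activity produces.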

Infrastructure at an Unusual Scale

Training and evaluating a model like FDM-1 demands infrastructure that doesn’t exist off the shelf. Si Inc built an evaluation system supporting over 1 million rollouts per hour across 80,000 forking Ubuntu virtual machines, each configured with 1 vCPU and 8 GB of RAM.

A single H100 GPU controls 42 VMs in parallel. Round-trip latency from screen capture to action execution is just 11 milliseconds – achieved through GPU-VM colocation, low-latency VNC connections, and custom Rust bindings. This latency matters enormously: the model was trained on video where actions happen without network delay, so any lag during inference creates a distribution mismatch that degrades performance.

Infrastructure metric        Value
Concurrent evaluation VMs    80,000 Ubuntu instances
Rollouts per hour            Over 1 million
VMs per H100 GPU             42
Screen-to-action latency     11 ms round-trip
VM specs                     1 vCPU, 8 GB RAM each
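A few derived numbers fall out of these figures with simple arithmetic on the quoted values:

```python
# Derived figures from the quoted infrastructure numbers.
VMS = 80_000
ROLLOUTS_PER_HOUR = 1_000_000
VMS_PER_H100 = 42
FPS = 30

print(ROLLOUTS_PER_HOUR / VMS)   # → 12.5 rollouts per VM per hour
print(-(-VMS // VMS_PER_H100))   # → 1905 H100s implied to drive the fleet
print(1000 / FPS)                # ~33.3 ms frame interval at 30 FPS, so an
                                 # 11 ms round-trip fits inside one frame
```

The last line is why the latency engineering matters: at 30 FPS a new frame arrives every ~33 ms, so an 11 ms screen-to-action loop keeps inference inside the cadence the model saw during training.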

The forking VM architecture also enables a form of test-time compute for computer use. By checkpointing successful states and branching from them, the system can explore multiple action paths simultaneously – similar to how tree search works in game-playing AI, but applied to arbitrary software environments.
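The checkpoint-and-branch pattern can be sketched with a toy forkable environment, where a deep copy stands in for a VM fork (all names here are illustrative, not Si Inc's system):

```python
import copy

# Toy forkable environment: state is an integer, actions shift it.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action

    def checkpoint(self):
        # Stand-in for forking a VM from a saved state.
        return copy.deepcopy(self)

def best_of_n(env, candidate_plans, goal):
    # Branch from one checkpoint per candidate action sequence and keep
    # the plan whose branch ends closest to the goal state.
    results = []
    for plan in candidate_plans:
        branch = env.checkpoint()
        for action in plan:
            branch.step(action)
        results.append((abs(goal - branch.state), plan))
    return min(results)[1]

env = ToyEnv()
print(best_of_n(env, [[1, 1], [1, -1], [-1, -1]], goal=2))  # → [1, 1]
```

Because every branch starts from the same checkpoint, the original state is never consumed – the same property that lets the real system retry from a successful Blender operation rather than replaying the whole session.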

Early Benchmarks and Scaling Behavior

FDM-1 achieves 50% accuracy on a key-press prediction task – choosing among no action, left, or right – after pretraining alone. This outperforms baselines that use only the video encoder, without the internet video data. More importantly, the model shows steeper scaling trends than baselines, suggesting that throwing more compute at the problem will yield continued improvements rather than hitting diminishing returns.

The team hasn’t yet published head-to-head benchmark comparisons against models like Anthropic’s Claude computer use or other VLM-based agents. When asked about this on Hacker News, the response focused on the model’s unique capabilities – multi-hour context, 30 FPS operation, and diverse task generalization – rather than narrow benchmark scores. The implication is clear: FDM-1 operates in a different regime than existing computer-use agents, making direct comparison on current benchmarks somewhat beside the point.

From Data-Constrained to Compute-Constrained

The deeper significance of FDM-1 isn’t any single demo. It’s the proof that computer action modeling can follow the same scaling trajectory that transformed language models from curiosities into the backbone of modern AI.

Building GPT-3 required an internet-scale text corpus. Building a general computer agent, it turns out, requires an internet-scale video corpus. The 11-million-hour dataset – spanning coding, design, gaming, and everything in between – is that corpus. The IDM-bootstrapping pipeline is the mechanism that makes it usable. And the 50x-efficient video encoder is what makes training on it computationally tractable.

Si Inc’s founder noted on Hacker News that the team plans to scale further over the coming months, with updates expected within one to two months. The technical community’s response has been notably enthusiastic, with one commenter describing it as “the first time I’ve seen overwhelming praise on HN” and another expressing surprise at how well the video encoder performs in practice.

What remains to be seen is how FDM-1 performs as it scales into more complex, longer-horizon tasks – the kind of multi-hour engineering sessions, financial analyses, and research workflows that the team envisions as the model’s ultimate domain. But the foundation is laid. Computer action modeling has had its internet moment, and the constraint has officially shifted from data to compute.
