Artificial Intelligence · March 28, 2026

Meta’s TRIBE v2 Can Predict Your Brain Activity From Any Sight or Sound

A new AI model can predict what your brain is doing – without scanning it. Meta’s Fundamental AI Research team has released TRIBE v2, a tri-modal foundation model that takes video, audio, and text as input and outputs a detailed map of predicted brain activity across nearly 30,000 cortical and subcortical locations. That is roughly a 30-fold increase in spatial resolution over its predecessor, and the model generalizes to entirely new subjects, languages, and experimental tasks without retraining.

The implications are significant. Functional MRI experiments are expensive, slow, and require new recordings for every new condition. TRIBE v2 creates what researchers call a “digital twin” of neural activity – a computational stand-in that lets scientists run thousands of virtual experiments at a fraction of the cost of physical scans. In controlled tests, the model’s predictions of group-averaged brain responses were often more accurate than any single individual’s actual fMRI recording.

Announced on March 26, 2026, TRIBE v2 arrives alongside its full codebase, model weights, and an interactive demo, all released under a CC BY-NC license for the global research community.

From Boutique Models to a Brain Foundation Model

Traditional brain encoding models in cognitive neuroscience have been narrow by design. A typical model might be trained to predict how one person’s visual cortex responds to static images – and that’s it. Changing the stimulus type, the sensory modality, or even the individual subject meant starting over. This fragmentation has prevented researchers from building a unified picture of how the brain integrates multisensory information in real time.

TRIBE v2 breaks this pattern entirely. Rather than targeting a single modality for a single person, it functions as a foundation model – trained on naturalistic stimuli like movies and podcasts that reflect the messy, multi-modal nature of everyday perception. The model handles vision via video, audition via audio, and language via text simultaneously, capturing how these sensory streams converge within the human cortex. Its predecessor, TRIBE v1, was trained on just four subjects and predicted roughly 1,000 voxels – yet it still won the Algonauts 2025 competition against 263 teams. TRIBE v2 scales that to 20,484 cortical vertices on the fsaverage5 surface plus 8,802 subcortical voxels – nearly 30,000 locations in total.

How the Architecture Works

TRIBE v2 doesn’t learn to see or hear from scratch. Instead, it leverages three frozen, pre-trained foundation models as feature extractors, one each for the video, audio, and text channels.

These embeddings are compressed to a shared dimension of D=384 per modality and concatenated into a combined 1,152-dimensional time series. This multi-modal sequence feeds into a Transformer encoder with 8 layers and 8 attention heads, operating over a 100-second window. The Transformer’s output is then decimated to match fMRI’s 1 Hz temporal resolution and passed through a subject-specific prediction block that maps the latent representations onto the full brain geometry.

| Component | Function | Output |
| --- | --- | --- |
| Feature extractors (frozen) | Extract latent representations from video, audio, and text | Modality-specific embeddings at D=384 |
| Temporal Transformer | Fuses multi-modal features across a 100-second window | 1 Hz fMRI-aligned sequences (D=1,152) |
| Subject block | Maps latents to individual brain geometry | 20,484 cortical vertices + 8,802 subcortical voxels |
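The pipeline above can be sketched in a few lines of NumPy. Everything here is a stand-in – random weights replace the frozen extractors, and the 8-layer Transformer is mocked as an identity map – so only the tensor dimensions reflect the article’s description; the feature frame rate and per-backbone sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the article; T_FEAT (feature frame rate) is assumed.
D = 384                       # per-modality embedding size
T_FEAT = 200                  # feature frames over a 100 s window (2 Hz, illustrative)
T_FMRI = 100                  # fMRI samples at 1 Hz over the same window
N_TARGETS = 20_484 + 8_802    # cortical vertices + subcortical voxels = 29,286

def project(x, d_out):
    """Stand-in linear projection to the shared dimension D."""
    w = rng.normal(scale=0.01, size=(x.shape[-1], d_out))
    return x @ w

# Frozen feature extractors are mocked as precomputed embeddings
# (backbone widths are arbitrary placeholders).
video = rng.normal(size=(T_FEAT, 1024))
audio = rng.normal(size=(T_FEAT, 768))
text = rng.normal(size=(T_FEAT, 2048))

# 1. Compress each modality to D=384 and concatenate -> 1,152 dims.
fused = np.concatenate([project(m, D) for m in (video, audio, text)], axis=-1)

# 2. The 8-layer / 8-head Transformer is mocked as identity here;
#    only the shapes matter for this sketch.
encoded = fused

# 3. Decimate to fMRI's 1 Hz sampling by averaging adjacent frames.
encoded = encoded.reshape(T_FMRI, T_FEAT // T_FMRI, 3 * D).mean(axis=1)

# 4. A subject-specific head maps latents onto the full brain geometry.
subject_head = rng.normal(scale=0.01, size=(3 * D, N_TARGETS))
prediction = encoded @ subject_head
print(prediction.shape)  # (100, 29286): 100 s of activity at 29,286 locations
```

The subject head is the only per-person component; swapping it while keeping everything upstream fixed is what makes the cross-subject transfer described later cheap.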

Training Data, Scale, and Performance

Data scarcity has long been a bottleneck in neuroimaging research. TRIBE v2 addresses this with a strategy borrowed from large language models: scale aggressively and let performance follow.

The model was trained on 451.6 hours of fMRI data from 25 subjects across four naturalistic studies involving movies, podcasts, and silent videos. Evaluation spanned a much broader collection – 1,117.7 hours from 720 subjects, including high-resolution 7 Tesla data from the Human Connectome Project (HCP). On that HCP 7T dataset, TRIBE v2’s zero-shot predictions achieved a group correlation (R_group) near 0.4, which is twice the median individual subject’s predictivity of the group average. Put simply, the model’s prediction of what a typical brain does was more representative than most actual individual brain scans.
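The evaluation logic behind that claim can be illustrated with synthetic data: correlate both the model’s prediction and each individual’s recording against the group average. All numbers below are made up; the point is only that a low-noise estimate of the shared signal can out-correlate any single noisy recording.

```python
import numpy as np

rng = np.random.default_rng(1)

def pearson(a, b):
    """Pearson correlation between two 1-D time series."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic setup: a shared stimulus-driven signal plus per-subject noise.
T, n_subjects = 300, 20
signal = rng.normal(size=T)
subjects = signal + rng.normal(scale=1.5, size=(n_subjects, T))
group_avg = subjects.mean(axis=0)

# A hypothetical model prediction that tracks the shared signal closely.
model_pred = signal + rng.normal(scale=0.3, size=T)

r_group = pearson(model_pred, group_avg)
r_individuals = [pearson(s, group_avg) for s in subjects]
print(r_group, float(np.median(r_individuals)))
```

With these (arbitrary) noise levels, the model’s correlation with the group average exceeds the median individual’s, mirroring the qualitative result reported for the HCP 7T data.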

Accuracy scales log-linearly with training data volume, and no plateau has been observed. This mirrors the scaling laws seen in large language models and suggests that TRIBE v2 will continue improving as fMRI repositories expand. Compared to optimized linear models – the previous gold standard for voxel-wise encoding – TRIBE v2 showed significant improvements across every evaluation dataset.
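“Log-linear scaling” means accuracy grows as a + b·log(hours). A minimal sketch of fitting and extrapolating that law, using made-up accuracy numbers (only the 451.6-hour figure comes from the article):

```python
import numpy as np

# Synthetic (hours, accuracy) points following an assumed log-linear law.
hours = np.array([10, 30, 100, 300, 451.6])
acc = 0.05 + 0.055 * np.log(hours)  # hypothetical trend with no plateau

# Fit accuracy = a + b * log(hours) by least squares.
b, a = np.polyfit(np.log(hours), acc, deg=1)

# Extrapolate: what the fitted law predicts at 1,000 training hours.
pred_1000h = a + b * np.log(1000.0)
print(round(float(pred_1000h), 3))  # ~0.43 under these invented coefficients
```

The practical implication is that each doubling of training data buys a roughly constant accuracy increment, which is why growing public fMRI repositories matter so much here.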

Zero-Shot Generalization and Fine-Tuning

One of TRIBE v2’s most striking capabilities is zero-shot generalization. Using a group-averaged “unseen subject” layer, the model can predict brain responses for entirely new individuals, languages, and experimental conditions without any retraining. This alone could transform how neuroimaging studies are designed – researchers can pre-screen hypotheses computationally before committing to expensive scanner time.

When a small amount of subject-specific data is available – as little as one hour of fMRI recording – fine-tuning just the subject block for a single epoch yields two- to four-fold improvements over linear models trained from scratch. The feature extractors remain frozen throughout; there’s no need to retrain the massive upstream models.
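A dependency-light analogue of “fine-tune only the subject block”: with the upstream features frozen, fitting the per-subject head reduces to a regression on about an hour of paired data. The sketch below uses closed-form ridge regression as a stand-in for the paper’s single-epoch gradient update; all sizes and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Frozen fused features for ~1 hour of recording (3,600 samples at 1 Hz)
# and a handful of target brain locations (tiny sizes for illustration).
T, D, V = 3600, 64, 50
X = rng.normal(size=(T, D))                              # frozen upstream features
true_head = rng.normal(size=(D, V))
Y = X @ true_head + rng.normal(scale=0.5, size=(T, V))   # the subject's fMRI

# "Fine-tune" only the subject head: closed-form ridge regression,
# leaving the feature extractors untouched.
lam = 1.0
head = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# Held-out check on fresh data from the same subject.
X_test = rng.normal(size=(500, D))
Y_test = X_test @ true_head + rng.normal(scale=0.5, size=(500, V))
pred = X_test @ head
r = float(np.corrcoef(pred.ravel(), Y_test.ravel())[0, 1])
print(round(r, 2))
```

Because only the D×V head is estimated, an hour of data is plenty; the expensive representations are reused as-is.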

Replicating Decades of Neuroscience on a Computer

The real test of any brain model isn’t just correlation numbers – it’s whether it can recover what scientists already know to be true. TRIBE v2 was evaluated against well-established experimental paradigms from the Individual Brain Charting (IBC) dataset, and the results are remarkably consistent with decades of empirical research.

Visual Experiments

When presented with images of faces, places, bodies, and characters, TRIBE v2 correctly pinpointed the fusiform face area (FFA) for face processing and the parahippocampal place area (PPA) for scene recognition – two of the most well-characterized functional regions in the human visual system.

Language Experiments

The model successfully localized the temporo-parietal junction (TPJ) for emotional processing and Broca’s area for syntactic processing. It also reproduced the expected stronger left-hemisphere activation for complete sentences compared to scrambled word lists – a classic neurolinguistic finding.

Network Discovery

Applying Independent Component Analysis to the model’s final layer revealed five functional networks that map onto well-known brain systems: primary auditory cortex, the language network, motion recognition areas, the default mode network, and the visual system. These emerged naturally from the model’s learned representations without explicit supervision.
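The network-discovery step amounts to unmixing the model’s final-layer time courses with ICA. A toy version, assuming scikit-learn’s FastICA and synthetic “latents” built by mixing five hidden sources (standing in for the five networks):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)

# Synthetic final-layer activity: 5 hypothetical network time courses
# mixed into a higher-dimensional representation.
T, D, n_networks = 400, 64, 5
sources = rng.laplace(size=(T, n_networks))  # non-Gaussian, as ICA assumes
mixing = rng.normal(size=(n_networks, D))
layer_activity = sources @ mixing

# Unmix with FastICA; each recovered component is a candidate network.
ica = FastICA(n_components=n_networks, random_state=0, max_iter=1000)
components = ica.fit_transform(layer_activity)  # (T, 5) network time courses
print(components.shape)
```

In the real analysis, each recovered component’s spatial weights are projected back onto the cortex and compared against known systems such as the default mode network.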

Multi-Modal Fusion and Sensory Mapping

By selectively disabling individual input channels, researchers can use TRIBE v2 to map which sensory modality drives activity in specific brain regions. The results align with established neuroscience: audio input best predicts activity near the auditory cortex, video maps to the visual cortex, and text activates language areas and portions of the frontal lobe.

The real gains emerge in regions where the brain integrates multiple senses. At the temporal-parietal-occipital junction – a key multi-sensory integration zone – feeding all three channels simultaneously boosted prediction accuracy by up to 50% compared to any single channel alone. This capacity to model sensory convergence is one of TRIBE v2’s most distinctive contributions, offering a fine-grained topographic view of how the brain blends sight, sound, and meaning.
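The channel-silencing experiment can be mimicked in miniature: predict a single synthetic “voxel” from each modality’s features alone, then from all three together, and compare. The voxel here is constructed by hand to be audio-driven, so the winner is baked in; only the ablation mechanic reflects the article.

```python
import numpy as np

rng = np.random.default_rng(4)

D, T = 16, 500
# Fused features: one D-dim slice per modality.
feats = {m: rng.normal(size=(T, D)) for m in ("video", "audio", "text")}

# A toy "auditory-cortex" voxel driven mostly by audio, slightly by video.
voxel = (1.0 * feats["audio"][:, 0] + 0.2 * feats["video"][:, 0]
         + rng.normal(scale=0.3, size=T))

def predict(active):
    """Least-squares prediction of the voxel from the active modalities only
    (a stand-in for silencing input channels of the full model)."""
    X = np.concatenate([feats[m] for m in active], axis=1)
    w, *_ = np.linalg.lstsq(X, voxel, rcond=None)
    return float(np.corrcoef(X @ w, voxel)[0, 1])

scores = {m: predict([m]) for m in feats}
scores["all"] = predict(list(feats))
best_single = max(("video", "audio", "text"), key=scores.get)
print(best_single, {k: round(v, 2) for k, v in scores.items()})
```

Running the same comparison per brain location is what yields the sensory map described above, with the single-modality winner labeling each region and the all-channels gain highlighting integration zones.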

Limitations Worth Understanding

TRIBE v2 is powerful, but its boundaries are important to acknowledge. The model captures passive perception – how the brain responds to the sights, sounds, and language it receives – rather than active processes such as decision-making or motor control, and its CC BY-NC license restricts it to non-commercial research use.

The research team has indicated that future versions will prioritize expanding to additional sensory modalities and modeling active brain processes beyond passive perception.

What This Means for the Future of Brain Research

TRIBE v2 represents a fundamental shift in how neuroscience experiments can be conducted. The traditional pipeline – design a study, recruit subjects, book scanner time, collect data, analyze results – is slow and expensive. A single fMRI session can cost hundreds of dollars per hour, and every new experimental condition requires new recordings. TRIBE v2 offers a computational shortcut: test hypotheses virtually, identify the most promising experimental designs, and only then move to physical scans for validation.

The model’s log-linear scaling behavior is particularly encouraging. As open fMRI repositories like the Human Connectome Project and Individual Brain Charting continue to grow, TRIBE v2 and its successors should become progressively more accurate without any architectural changes – just more data. The research team recommends training on “deep” datasets (few subjects recorded for many hours) for robustness and evaluating on “wide” datasets (many subjects) for generalizability.

This isn’t mind-reading. TRIBE v2 predicts how the brain encodes sensory input, not what someone is privately thinking. But it is a significant step toward building a unified computational framework for understanding human cognition – one that could accelerate treatments for neurological disorders, improve brain-computer interfaces, and feed insights back into the design of better AI systems. The code, weights, paper, and demo are all publicly available, putting this tool in the hands of any researcher ready to use it.
