Building AI that sees, hears, and touches — with joint perception, reusable memory, and coherent control across every modality.
Four core capabilities that together form a complete multimodal intelligence.
Large-scale audio-video corpus with time-aligned tracks, plus a tactile dataset of full-hand force maps synchronized to egocentric video: a data moat no public dataset can match.
NeurIPS-accepted work that synthesizes high-quality short-form outputs from raw footage. Clips become reusable experiences the system can reference, recombine, and adapt at scale.
A proprietary architecture for simultaneous audio-video editing — treating modalities as one coupled stream so edits stay temporally and semantically aligned across channels.
Models trained on binaural and spatial audio, enabling physically grounded 3D soundscapes that strengthen the system's sense of where events occur in the world.
A unified system that sees, hears, and feels the world