AI & ML Advanced By Samson Tanimawo, PhD Published Dec 24, 2026 5 min read

Robotics Foundation Models

Vision-language-action models combine perception, language, and motor control. The recent wave of models (RT-2, OpenVLA, Octo, π0) is the foundation-model moment for robotics.

The VLA idea

Vision-Language-Action (VLA) models are foundation models for robotics. They take vision (camera feed) and language (instruction) as input; they output robot actions (joint commands, gripper control). The framing parallels vision-language models, but with action as the third modality. The bet: foundation models that work across robot tasks, learning from massive embodied-AI datasets.
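To make the input/output framing concrete, here is a minimal interface sketch. Every name in it (`Observation`, `Action`, `VLAPolicy`, `ZeroPolicy`, `run_step`) is hypothetical and chosen for illustration; it is not the API of any released VLA model, and real models define their own observation and action formats.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    image: List[List[List[int]]]  # H x W x 3 camera frame (assumed layout)
    instruction: str              # natural-language command


@dataclass
class Action:
    joint_deltas: List[float]  # one commanded change per joint
    gripper_closed: bool       # True means close the gripper


class VLAPolicy(Protocol):
    """Anything that maps (vision, language) to an action."""

    def act(self, obs: Observation) -> Action: ...


class ZeroPolicy:
    """Placeholder policy: holds still and keeps the gripper open."""

    def __init__(self, num_joints: int) -> None:
        self.num_joints = num_joints

    def act(self, obs: Observation) -> Action:
        return Action(joint_deltas=[0.0] * self.num_joints, gripper_closed=False)


def run_step(policy: VLAPolicy, obs: Observation) -> Action:
    # One control step: observation in, action out.
    return policy.act(obs)
```

A trained model would replace `ZeroPolicy`; the point of the sketch is only that the same `act` call serves every task, with the instruction string selecting the behaviour.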

The motivation. Traditional robotics required per-task programming or per-task learning. Each new task was a new project. VLA models aim to be reusable: one model, many tasks. The economics shift from "build a robot for task X" to "deploy a VLA model and instruct it for task X".

The current state. As of 2026, VLA models work for narrow, well-controlled tasks: pick-and-place in known environments, simple manipulation, structured demonstrations. They struggle with novel objects, unstructured environments, long-horizon tasks. The technology is real but young.

The honest framing. VLA applies foundation-model aspirations to robotics. Foundation models for text needed massive data and compute that took years to accumulate, and robotics data is much harder to collect. The "GPT moment" for robotics is 3-7 years away by best estimates.

The 2026 wave

Notable VLA models, and the constraints that shape them:

The OpenVLA case. Released 2024. Open weights. Strong baseline for many manipulation tasks. The default model academic and small-industrial teams use. Demonstrates the VLA pattern at moderate scale.

The Pi case. Series of foundation models from Physical Intelligence. Trained on large-scale demonstration data. Industrial pilots in 2025-2026. Whether the company succeeds or another approach wins is open; the technical direction is broadly representative of the field.

The scale gap. VLA models are trained on billions of frames, whereas language models are trained on trillions of tokens, so the robotics data scale is much smaller. Predictions about VLA capability scaling assume the data gap closes; if it doesn't, VLA progress will be slower than language model progress.

The hardware constraint. Robotics compute is constrained: on-robot inference must be fast and power-efficient. VLA models run on robot-attached compute, not in the cloud. Hardware-software co-design therefore matters more for robotics than for cloud-served language models.

The data problem

Language models are trained on trillions of tokens (the public web). Robotics has nothing comparable. Demonstration data (humans teleoperating robots through tasks) is expensive to collect. Each minute of data requires a human in the loop. Data scarcity is the dominant constraint on VLA progress.

The teleoperation cost. Teleoperation costs $50-200 per hour, depending on task complexity and operator skill. Collecting millions of hours of data (the kind of scale language models enjoy) would therefore cost on the order of hundreds of millions of dollars and take multi-year collection efforts.
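The arithmetic is easy to check with a short sketch. The $50-200 hourly rates come from the text; the function name and the example hour counts are illustrative assumptions.

```python
def teleop_cost_range(hours: float, low_rate: float = 50.0, high_rate: float = 200.0):
    """Return (low, high) total cost in dollars for a number of teleoperation hours."""
    return hours * low_rate, hours * high_rate


# Example hour counts are arbitrary; the rates are the article's $50-200 range.
for hours in (10_000, 1_000_000):
    low, high = teleop_cost_range(hours)
    print(f"{hours:,} hours: ${low:,.0f} to ${high:,.0f}")
```

At one million hours the range is $50 million to $200 million, before counting robot hardware, facilities, or data cleaning.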

The simulation alternative. Simulated environments produce unlimited data cheaply. Sim-to-real transfer is a real challenge: policies trained in simulation often fail on real hardware. Recent advances (domain randomisation, large-scale sim) narrow the gap, but pure-sim VLA still trails real-data VLA.
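Domain randomisation, mentioned above, means varying the simulator's physical and visual parameters every episode so the policy cannot overfit to one rendering of the world. The sketch below shows only the sampling step; the parameter names and ranges are illustrative assumptions, not values from any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    light_intensity: float    # relative brightness of the scene
    friction: float           # surface friction coefficient
    object_mass_kg: float     # mass of the manipulated object
    camera_jitter_deg: float  # small rotation applied to the camera


def sample_sim_params(rng: random.Random) -> SimParams:
    # Ranges are illustrative assumptions, not tuned values.
    return SimParams(
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.4, 1.0),
        object_mass_kg=rng.uniform(0.05, 0.5),
        camera_jitter_deg=rng.uniform(-5.0, 5.0),
    )


# Draw a fresh configuration for each simulated episode.
rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(3)]
for params in episodes:
    print(params)
```

Each sampled `SimParams` would configure one training episode, so the policy sees many variations of lighting, friction, and mass rather than a single fixed scene.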

The video-only alternative. Internet videos contain billions of human-action examples. Models that learn from video without action labels are a research direction. Promising but not yet matching action-labelled data quality.

The crowd-sourcing direction. Companies pay users to teleoperate robots in their homes (tidy your kitchen, fold laundry). Distributed data collection at lower per-hour cost. Privacy and quality concerns; commercial scale is emerging.

The data-flywheel theory. Once VLA models are deployed, they collect their own data: each task execution produces labelled data (action, outcome). The flywheel accelerates over time. Whether the flywheel produces useful data quickly enough to outpace research lag is open.

Realistic capabilities

What VLA models can do reliably in 2026: pick up labelled objects from known positions; place them in known target zones; follow simple language instructions for these tasks; generalise modestly to similar (not identical) objects. What they can't do: handle novel environments without retraining; perform long-horizon tasks; recover from significant disruptions; safely operate around humans without engineered safeguards.

The pick-and-place reality. This is the most-tested VLA capability. It works on factory lines and in warehouses, where environments are controlled. Generalisation to new objects is partial: adding a previously unseen object often requires fine-tuning or substantial additional demonstrations.

The long-horizon limitation. Errors compound in tasks longer than 30-60 seconds. The model loses track, or makes a mistake that later steps build on. Solutions include task decomposition (breaking the task into shorter primitives), human supervision, and hierarchical planning. None fully solves long-horizon tasks yet.
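A back-of-envelope model shows why errors compound. Assume, purely for illustration, that a task is a chain of steps that each succeed independently with the same probability; real failures are correlated, so treat this as intuition rather than a prediction. The 99% per-step figure is an assumed example, not a measured rate.

```python
def task_success_probability(per_step_success: float, num_steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step_success ** num_steps


# 0.99 per step is an assumed example value.
for steps in (10, 50, 100):
    p = task_success_probability(0.99, steps)
    print(f"{steps} steps at 99% per step: {p:.1%} overall")
```

Under these assumptions a policy that is 99% reliable per step completes a 100-step task only about 37% of the time, which is why decomposition into short, checkable primitives helps.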

The novel-environment limitation. VLA models trained on environment X work poorly in environment Y. New lighting, new layouts, new clutter all cause failures. Industrial deployment requires controlled environments; "drop the robot in any room and have it work" is years out.

The human-safety limitation. VLA models acting near humans need explicit safety layers: speed limits, collision avoidance, emergency stops. The model's actions can't be trusted to be human-safe by themselves; engineered safety is mandatory.
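One way to picture an engineered safety layer is as a fixed filter between the model and the motors. The sketch below is a simplified illustration, not a certified safety system: the names (`SafetyLimits`, `safe_command`) and the threshold values in the usage note are hypothetical, and a real deployment would rely on dedicated safety-rated hardware and standards-based risk assessment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SafetyLimits:
    max_joint_speed: float       # rad/s, hypothetical limit
    min_human_distance_m: float  # stop if a human is closer than this


def safe_command(
    model_speeds: List[float],
    human_distance_m: float,
    estop_pressed: bool,
    limits: SafetyLimits,
) -> List[float]:
    """Filter model-proposed joint speeds through fixed safety rules."""
    # Emergency stop and proximity checks override the model entirely.
    if estop_pressed or human_distance_m < limits.min_human_distance_m:
        return [0.0] * len(model_speeds)
    # Otherwise clamp each joint speed into the allowed band.
    cap = limits.max_joint_speed
    return [max(-cap, min(cap, s)) for s in model_speeds]
```

With `SafetyLimits(max_joint_speed=1.0, min_human_distance_m=0.5)`, a proposed command of `[2.0, -3.0, 0.2]` is clamped to `[1.0, -1.0, 0.2]` when no one is nearby and replaced by all zeros when a person is within half a metre. The key design choice is that the filter does not depend on the model being right.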

The "gradual progress" framing. VLA capabilities improve year over year. Each year unlocks new feasible deployments. The pace is slower than language model capability increase; the trajectory is real.

Common antipatterns

Treating VLA as a drop-in replacement for traditional robotics. It's a complement, not a replacement. Traditional methods still win for many constrained tasks.

Promising "general-purpose home robot" timelines. The hype-vs-reality gap is large here. Be honest about what's near-term.

Skipping safety engineering on the assumption "the model is safe". The model isn't safe; engineered safety layers are mandatory.

Pure-sim training without sim-to-real validation. Sim-trained policies often fail on real hardware. Always validate on physical hardware before deployment.

What to do this week

Three moves. (1) If you're researching robotics, evaluate OpenVLA on a baseline task. Hands-on experience builds intuition that papers don't. (2) For commercial robotics applications, scope tasks to the constrained-environment, short-horizon, well-defined-objects regime. That's where VLA earns its keep today. (3) Map the safety architecture explicitly: what stops the robot from harming humans if the model errs? Without explicit safety, robotics deployments aren't responsible.