RoboEvalfor researchers & developers

Test your model against real-world workflows.

RoboEval evaluates your VLA, video-action, or world model in high-fidelity simulation, then on real robots — hundreds of episodes under real lighting, real placement error, real failure. With diagnostics for exactly where it breaks.

RoboEval evaluation platform interface

Benchmarks don't predict reliability.

95% → 60%simulation score to production success rate
5–10%accuracy lost to a single warehouse lighting change
0-shotwhere open VLAs that top leaderboards collapse on real-world tasks

A model that scores 95% in simulation drops to 60% in production. A lighting change costs 5–10% accuracy. Open-weight VLAs that match proprietary models on public leaderboards collapse on zero-shot real-world tasks.

Two phases. Sim to real.

Every model runs the same pipeline — the one that separates simulation performance from real-world reliability.

Phase 1

Simulation

Your model runs hundreds of episodes across scenarios in high-fidelity physics simulation, scored on task completion, safety, and efficiency. Simulation is fast, repeatable, and cheap, so we test broadly and filter to the top performers without tying up a single robot.

The scenes aren’t toy environments. SceneForge reconstructs real deployment sites from video into photoreal, physics-ready simulation — so a model is tested against the place it will actually run.

Phase 2

Real-world

Top performers move to physical robots in our lab, under the conditions that actually break models: variable lighting, imperfect object placement, environmental noise. Multiple robot form factors and built-in failure diagnostics that surface when and why a model breaks down — not just whether it passed.

What you get.

01

Per-episode video

Watch every rollout. Replay the exact episode a model failed.

02

Failure diagnostics

When and why a model breaks — not just pass/fail.

03

Multi-scene evaluation

One model, many scenes, per-scene success rates.

04

A/B comparison

Two models, side by side, identical conditions.

05

Public leaderboard

Ranked on real-world-predictive evals.

06

Open by default

Bring a model from the open ML ecosystem — OpenVLA, π0, GR00T, RT-2, and more.

The first benchmark

WarehouseBench

WarehouseBench is our first evaluation suite — bi-manual robot arms running packing, kitting, and order fulfillment, the workflows real warehouses run. v1 scores the metrics that decide whether a deployment works, not just whether a task completed.

01

Success rate

did the task complete.

02

Containment accuracy

did items land where they belong.

03

Closure quality

is the package sealed correctly.

04

Post-action stability

does it stay put after the arm lets go.

Next: Physical Tool Use — extending the evaluation surface to adjacent deployment categories.

The leaderboard.

A public ranking of frontier open models on evaluations built to predict real-world reliability — not simulation scores. See how the field actually stacks up. Then put your model on the board.

RankModelReal-world score
Submissions open — be among the first models on the board.

How does your model compare?

RoboEval is in private access. Sign in with GitHub, request access, and we’ll get you onboarded.

Built for the open ML ecosystem.