Whitepaper · v0.1 · April 2026
Learned simulators for evaluating
general robot policies
Fern is building scalable evaluation and reinforcement-learning environments for robotics. We train high-fidelity, action-conditioned world models from real robot data, so any policy can be benchmarked and improved without ever running on physical hardware.
The problem with evaluating robot policies
General robot policies are improving fast — but the way we measure them hasn’t. The dominant evaluation methodology is still “put it on a real robot, run a few rollouts, eyeball the results.” That has three problems.
- It doesn’t scale. Each evaluation costs operator time, hardware wear, and resets between trials. Comparing two checkpoints meaningfully takes hours, not seconds.
- It isn’t reproducible. Lighting, object placements, and contact dynamics drift between sessions, so two research groups can’t compare numbers directly.
- It isn’t safe to optimize against. Reinforcement learning needs millions of rollouts. Doing those on hardware is slow, expensive, and dangerous.
The same problems were solved for game-playing agents and language models with simulators and benchmark suites. Robotics doesn’t have either yet — not because nobody tried, but because hand-built physics simulators don’t cover the full visual and contact distribution of the real world. Sim2real is hard for a reason.
A simulator learned end-to-end from real data
Instead of writing a simulator, we learn one. Given a starting image and a stream of actions, our model rolls out future frames that match what would have happened on the real robot. The physics isn’t hand-coded — every contact, every shadow, every cable and gripper finger is something the model has seen during training.
- High-fidelity images. The simulator outputs 256×256 RGB frames at the same effective rate as the real teleop trajectories that trained it.
- Action-conditioned. Drive it with the same end-effector commands you’d send to a real bimanual setup; the rollout responds to your input frame by frame.
- No real robot needed. Once the model is trained, every researcher gets the same physics — no shipping hardware, no calibration drift, no operator queue.
- Apples-to-apples comparison. Two policies, one identical environment, one identical starting state. We plan to host these benchmarks publicly so the field can move from anecdotal to quantitative comparisons.
What it looks like
Below: a clip from the validation set. The left half is the real robot recording (ground truth). The right half is the model’s rollout, conditioned on the same starting frame and the same action stream, from an episode never seen during training. The model never gets to peek at the ground-truth pixels; everything on the right is generated.
bimanual rope · validation episode br_0000
The two streams stay aligned through the entire 20-second episode — the model has learned the rope’s contact dynamics, the arms’ reachable workspace, and the gripper’s closing geometry from teleoperation alone.
How the model is trained
Training data comes from the bimanual teleoperation dataset released by Wang et al.’s Interactive World Simulator project — four task families (push-T, rope routing, mug grasping, pile sweeping) collected on a bimanual teleop rig. We share that project’s core motivation — learn physics from real interaction data so policies can be evaluated and trained without burning hardware — but use a fundamentally different, more scalable architecture: a single diffusion-forcing transformer that handles all four task families through one unified action head at 256×256 RGB output, in place of one specialist model per task at lower resolution.
Concretely, the simulator denoises next-frame latents in the VAE’s latent space, conditioned on the past frames and the action history. Headline numbers from the current checkpoint:
| Component | Configuration |
| --- | --- |
| Backbone | DiT3D · 12 layers · 768 hidden · 12 heads · RoPE-3D |
| Patch size | 2 × 2 latent patches · 4-channel SD-VAE latents |
| Diffusion | Cosine continuous schedule · v-prediction · sigmoid loss weighting · 1000 train timesteps |
| Sampling | DDIM · 20 steps at inference |
| Resolution | 256 × 256 RGB output · 32 × 32 × 4 latents |
| Sequence | 16-frame context window · frame-skip 2 (5 Hz effective) |
| Action conditioning | 8-D end-effector deltas + grippers · 2 arms × 4 dims |
| Optimizer | AdamW · lr 7.5e-5 · betas (0.9, 0.99) · grad-clip 1.0 |
| Training | bf16-mixed precision · EMA 0.9999 · 50 epochs · batch 24 |
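To make these numbers concrete, below is a minimal sketch of one simulator step under the configuration above: a plain DDIM loop with v-prediction on the cosine schedule, matching the table’s 20 inference steps and 32 × 32 × 4 latents. The `model` and `vae` objects and their call signatures are illustrative placeholders, not our actual API.

```python
import torch

# Hypothetical placeholders, for illustration only:
#   model(x, context_latents, actions, t) -> v-prediction from the DiT3D backbone
#   vae.decode(latents)                   -> 256x256 RGB frame

@torch.no_grad()
def rollout_step(model, vae, context_latents, actions, num_steps=20):
    """Generate the next frame: denoise a 32x32x4 latent with DDIM,
    conditioned on the past frames and the action history.

    context_latents: (B, 16, 4, 32, 32) -- VAE latents of the 16-frame context
    actions:         (B, 16, 8)         -- 8-D end-effector deltas + grippers
    """
    B = context_latents.shape[0]
    x = torch.randn(B, 4, 32, 32)  # start from pure noise

    # Uniformly spaced DDIM timesteps on the continuous [0, 1] schedule.
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_prev = ts[i], ts[i + 1]
        # Cosine schedule: alpha = cos(pi*t/2), sigma = sin(pi*t/2).
        a, s = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
        a_p, s_p = torch.cos(t_prev * torch.pi / 2), torch.sin(t_prev * torch.pi / 2)

        v = model(x, context_latents, actions, t.expand(B))
        x0_hat = a * x - s * v            # clean-latent estimate from v-prediction
        eps_hat = s * x + a * v           # noise estimate from v-prediction
        x = a_p * x0_hat + s_p * eps_hat  # deterministic DDIM update

    return vae.decode(x)  # next 256x256 RGB frame
```

One generated frame is one full 20-step denoise, which is where the per-frame latency discussed below comes from.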
The action vector is the same one you’d send to a real bimanual setup: 3-D end-effector translation deltas plus a continuous gripper command, per arm. We pad smaller scenes (push-T, single-arm grasp) with zeros so all four task families share one head, which lets the model transfer contact priors between them — pushing in push-T helps it learn arm-rope contacts, and vice versa.
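As a concrete illustration of that layout, each arm occupies a fixed 4-D slice of the vector and absent arms stay zero. The helper below is hypothetical, written only to show the packing convention.

```python
import numpy as np

def pack_action(left=None, right=None):
    """Pack per-arm commands into the shared 8-D action vector.

    Each arm is [dx, dy, dz, gripper]: a 3-D end-effector translation
    delta plus a continuous gripper command. Unused arms stay zero, so
    all four task families share one action head.
    """
    action = np.zeros(8, dtype=np.float32)
    if left is not None:
        action[0:4] = left
    if right is not None:
        action[4:8] = right
    return action

# Bimanual rope routing: both arms move, grippers held closed.
a_rope = pack_action(left=[0.01, 0.0, -0.005, 1.0],
                     right=[-0.01, 0.0, 0.0, 1.0])

# Push-T drives a single arm; the other half of the vector is zero padding.
a_pusht = pack_action(left=[0.0, 0.02, 0.0, 0.0])
```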
Inference runs at about 3.5 frames per second on a single modern GPU, with end-to-end latency under 300 ms per generated frame. That’s fast enough that a researcher can drive the simulator interactively from their laptop, the way you’d drive a video game — which is exactly what the demo below does.
Try it yourself
The simulator below is the same model running live on a single cloud GPU. Click Start the demo, then steer the bimanual setup with the keyboard. Every frame you see is generated on demand from the actions you send — there’s no recorded video being replayed.
This demo is a proof of concept. It’s pinned to a single multi-task checkpoint trained on a public dataset, served just to prove that the architecture and the live-driving loop work end-to-end in a browser.
Note: the GPU is shared. If the canvas takes a moment to come live, it’s likely already in use by someone else.
Click the canvas, then drive the rope.
What’s next
What we’re actively working on:
- Architecture R&D. We’re iterating on the model architecture to shrink the sim-to-real gap, extend long-horizon stability, cut per-frame latency, and sharpen contact physics.
- First-party data collection. We’re recording our own bimanual teleoperation in-house on a growing fleet of robots — expanding the base model’s coverage of grippers, contact regimes, and scene diversity well beyond what any single open dataset provides.
- Custom world models for customers. Most robotics companies already have terabytes of teleoperation data sitting in cold storage from training their own policies. We repurpose that data to fit a world model to their specific embodiments and tasks, so they can evaluate and RL-train their checkpoints against their physics, not a generic one.
- Public benchmarks. A growing catalog of evaluation tasks — manipulation, navigation, mobile manipulation — hosted on this site. Submit a policy, get a leaderboard placement, see exactly where it succeeds and fails.
- RL environments. The same world models exposed as Gym-style environments for offline + online RL. Train against learned physics, deploy to real robots without ever burning hardware time on bad policies.
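To give a flavor of the interface we have in mind for that last item, here is a minimal Gym-style sketch. Only the `gymnasium.Env` contract is standard; `world_model` and `evaluator` are placeholders for the trained simulator and a task-specific scorer.

```python
import numpy as np
import gymnasium as gym

class LearnedSimEnv(gym.Env):
    """Gym-style environment backed by a learned world model (sketch).

    `world_model` stands in for the trained simulator: given the current
    context, it generates the next RGB frame from an action. Reward and
    termination come from a task-specific evaluator, stubbed out here.
    """

    def __init__(self, world_model, evaluator, horizon=100):
        self.model, self.evaluator, self.horizon = world_model, evaluator, horizon
        # 8-D action: per arm, a 3-D end-effector delta plus a gripper command.
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        frame = self.model.reset()  # sample or load an initial frame
        return frame, {}

    def step(self, action):
        frame = self.model.rollout_step(action)  # one generated frame
        reward = self.evaluator.score(frame)     # task-specific scoring
        self.t += 1
        terminated = self.evaluator.done(frame)
        truncated = self.t >= self.horizon
        return frame, reward, terminated, truncated, {}
```

A policy then interacts with this exactly as with any other Gym environment, calling `env.reset()` once and `env.step(action)` in a loop, with every observation generated by the world model rather than rendered by hand-written physics.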
We’re building this for three kinds of teams. Policy developers who want their checkpoints evaluated end-to-end on a managed cloud platform, without standing up their own robot fleet. RL researchers who want Gym-style environments backed by learned physics, so training runs don’t need real hardware in the loop. And robotics companies who want a custom world model fitted to their own embodiments and existing teleop data — so evaluation and RL training happen in their physics, not a generic one. If any of those describe you, reach out at founders@fern.bot.