Simulation for Evaluating Robot Policies Should Be
Asynchronous and Real-Time


Anonymous Author(s)
In submission to
Conference on Robot Learning, 2026
Overview video (3min)

Simulation is an indispensable tool for evaluating robot algorithms in a safe and scalable manner. However, current simulation benchmarks introduce an artificial sim-to-real gap by running robot policies synchronously — pausing the simulation to let the policy “think” for as long as it needs before executing actions. But real world physics does not wait for policy inference. Synchronous evaluation provides an unfair advantage to heavyweight policies with large backbones or iterative denoising that may be performant, but are too slow in the real world. We introduce AsyncRoboSim: a protocol to make any robotic simulation benchmark asynchronous by running the simulator and policy in parallel threads. The simulator runs in real time and communicates with the policy via shared inter-process queues for observations and actions. Importantly, the simulation rate is decoupled from the speed at which the policy consumes observations and produces actions. We instantiate our approach in robosuite, the underlying simulation framework for many robotic benchmarks that can automatically inherit our asynchronous implementation. We show that the performance of standard manipulation tasks in AsyncRoboSim is significantly more correlated with real-world performance than conventional synchronous simulation. Realtime, asynchronous simulation naturally reflects inference latency in task performance metrics and also aids debugging.

Method: Asynchronous and real-time simulation
(a) Conventional synchronous simulation alternates between a policy inference step and a simulation step in a single process where the simulator is frozen during policy inference. (b) Asynchronous simulation runs the policy and simulator in separate parallel processes that communicate via shared observation and action queues. The simulator pulls actions and produces observations at fixed rates. Importantly, it synchronizes physics to match the real-time wall clock. The policy may pull observations and push actions at any rate it chooses. By decoupling the policy and simulator in parallel processes and communicating asynchronously via shared queues, the policy latency does not affect simulation. This allows us to fairly evaluate the effect of inference latency on policy performance, similar to how latency affects the policy’s behavior in the real world.
Synchronous vs. asynchronous simulation

A diffusion policy predicting action chunks of length 16 simulated in conventional synchronous simulation (left) and real-time, asynchronous simulation (right). The policy visibly “jitters” due to inference latency at chunk boundaries where denoising occurs. This is visible in asynchronous simulation but the synchronous simulation is deceptively smooth.

Left — Synchronous: the simulator alternates between policy inference and physics steps in a single process, freezing physics while the policy is “thinking.” Latency is hidden, so the motion looks deceptively smooth in simulation no matter how slow the policy is.

Right — Asynchronous: the policy and simulator run as separate parallel processes that continuously communicate and interact with each other via shared observation and action queues. The simulator advances physics in real time and never waits for the policy — if the next action is not ready, the simulator keeps executing the most recent one. This exposes the true inference latency of the diffusion policy that visibly “jitters” at chunk boundaries where denoising occurs, mirroring how latency manifests on a real robot.

Varying the "real-time rate"

Can Task
RTR = 0.33
RTR = 1
RTR = 3
Square Task
RTR = 0.33
RTR = 1
RTR = 3

The real-time rate (RTR) is a tunable knob that sets the speed of the simulation relative to the wall clock, letting us evaluate a policy under different hardware and compute budgets without changing the policy. RTR = 1 (center) runs physics at true real time, our default. RTR > 1 (right) advances physics faster than real time, emulating scarcer compute — e.g. an edge device or a heavily loaded workstation — so the policy’s latency is effectively amplified. RTR < 1 (left) advances physics slower than real time, emulating more abundant compute that the policy can keep up with more easily. All clips show asynchronous simulation with action chunk = 16. Increasing the RTR (faster simulation) lowers task success and widens the gap between synchronous and asynchronous simulation, since the policy has relatively less time to react.

Real-world twin as ground truth for evaluation

Can-sim
Can-real
Square-sim
Square-real
Can — Diffusion policy (chunk=16)
Square — Diffusion policy (chunk=16)

We reproduce the Can and Square tasks from Robomimic on a real Franka Research 3 robot, using the same objects and 3D-printed tools as their simulated counterparts. Can is a pick-and-place task requiring robust object transport across spatial and height variation, while Square is a contact-rich precision task that inserts a square nut onto a peg with tight geometric tolerance. Both demos run a diffusion policy predicting action chunks of 16 (chunk = 16). These real-world rollouts are the ground-truth that we compare against. We find that performance in asynchronous simulation is significantly more correlated with this real-world performance than conventional synchronous simulation.

Result 1: Correlation between real-world and simulation
Linear correlation between simulation and real-world success rates across all evaluated policies under synchronous and asynchronous conditions. Each point above is a single robot policy, with its real-world success rate on the \(x\)-axis and its simulated success rate on the \(y\)-axis (orange = Can, blue = Square).
Sim-to-real correlation metrics by task and sync/async simulation.
Task Condition Pearson r Spearman ρ Pairwise Acc.
Can Synchronous sim 0.477 0.485 0.724
Asynchronous sim 0.735 0.664 0.812
Improvement (sync → async) +0.258 +0.179 +0.088
Square Synchronous sim 0.831 0.782 0.824
Asynchronous sim 0.894 0.975 0.971
Improvement (sync → async) +0.063 +0.193 +0.147

We quantify how well simulation predicts the real world using three metrics: Pearson r (linear correlation, sensitive to the magnitude of differences), Spearman ρ (rank correlation), and Pairwise Accuracy (the fraction of policy pairs whose real-world ordering simulation gets right). Across both tasks and all three metrics, asynchronous simulation correlates more strongly with real-world performance than conventional synchronous simulation — for example, Pearson r rises from 0.477 to 0.735 on Can and from 0.831 to 0.894 on Square. In other words, asynchronous simulation is a better predictor of which policy will actually perform better on the real robot.

Result 2: Correlation between latency and performance drop
We measure how much a policy’s inference latency hurts its performance when we move from synchronous to asynchronous evaluation, across all five Robomimic tasks (Lift, Can, Square, Transport, ToolHang) on the left and the LIBERO benchmark on the right. Each point is a single task–policy pair (colored by task); the \(x\)-axis is the policy’s inference latency and the \(y\)-axis is the drop in success rate going from synchronous to asynchronous simulation, with per-task log-linear regression fits (slopes annotated in the legend). The higher a policy’s latency, the larger the drop i.e. the worse is the overestimation of the policy's performance by conventional synchronous simulation. Low-latency MLP/GMM policies (≈11–22 ms) are barely affected, whereas iterative-denoising diffusion policies degrade sharply — e.g. on Can, diffusion policy with chunk = 16 falls from 98% success under synchronous evaluation to 42% under asynchronous evaluation. Synchronous evaluation hides this latency cost.
Result 3: Correlation between RTR and performance drop
We study how the real-time rate (RTR) — the speed of the simulator relative to policy inference — affects performance on the Robomimic Square task across a range of robot policies. Synchronous simulation is the limiting case as RTR \(\to 0\), where the simulator waits arbitrarily long for inference to complete. Left: success rate falls as RTR increases (faster simulation), and the drop is steeper for higher-latency policies — low-latency BC variants stay roughly stable while the high-latency diffusion policy degrades sharply. Right: the synchronous-to-asynchronous performance drop increases with policy latency, as shown in Result 2. However, this is amplified at higher RTRs. Increasing RTR thus emulates scarcer compute and exposes a latency–reactivity tradeoff that synchronous evaluation systematically hides.