feat: sim2real domain randomization + reward fixes for rotary cartpole

Close the sim2real gap for the Furuta pendulum (swings up but can't
balance on hardware). Root causes were (a) no domain randomization, so
the policy overfit one deterministic sim instance, and (b) reward design
flaws that produced degenerate policies.

Domain randomization (runner-level, backend-agnostic):
- BaseRunner: domain_rand config; per-env action-delay buffer (latency),
  Gaussian qpos/qvel sensor noise (drawn every step), and per-env
  dynamics-scale sampling (friction/damping/torque, resampled per episode).
- privileged_obs/privileged_dim expose normalized DR factors (mu) for RMA.
- step() now uses clean state for reward/termination, noisy state for the
  observation the policy sees.
- MuJoCoRunner: applies per-env friction/damping/torque scales.
- robot.py: compute_motor_force gains friction/damping scale args.
- Configs: DR blocks for mujoco (full) and mjx (delay+noise); clean
  defaults for mujoco_single/serial; noise/delay anchored to recordings.
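
A minimal sketch of the delay-buffer and clean/noisy split described above,
on a toy 1-D integrator rather than the real runner. The class name, the
zero-filled buffer, and the inclusive delay bounds are assumptions made for
illustration; they are not taken from BaseRunner.

```python
import random
from collections import deque


class DelayedNoisyEnv:
    """Toy integrator with action latency and sensor noise (illustrative only)."""

    def __init__(self, delay_range=(0, 2), qpos_noise_std=0.01, seed=0):
        self.rng = random.Random(seed)
        self.delay_range = delay_range
        self.qpos_noise_std = qpos_noise_std
        self.reset()

    def reset(self):
        # Resample the latency once per episode (inclusive bounds are an assumption)
        # and pre-fill the buffer with zero actions.
        self.delay = self.rng.randint(*self.delay_range)
        self.buffer = deque([0.0] * self.delay)
        self.qpos = 0.0
        return self.qpos

    def step(self, action):
        # Queue the new action and apply the one issued `delay` steps ago.
        self.buffer.append(action)
        applied = self.buffer.popleft()
        self.qpos += applied  # stand-in for the physics step

        # Reward and termination read the clean state ...
        reward = -abs(self.qpos)
        done = abs(self.qpos) > 10.0
        # ... while the policy only sees a noisy measurement, redrawn every step.
        obs = self.qpos + self.rng.gauss(0.0, self.qpos_noise_std)
        return obs, reward, done, applied
```

With a fixed two-step delay, the first two calls to step() apply the
zero-filled actions and the third applies the action issued on the first call.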

Reward fixes (rotary_cartpole):
- Shift upright reward to [0,1] (was [-1,1]) + alive_bonus, so surviving
  always beats ending early (kills the "suicide into the limit" policy).
- Add balance_bonus * upright * stillness so reward requires upright AND
  near-zero pendulum velocity (kills the "spin in full loops" policy).
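
A rough sketch of how the two reward changes combine. The parameter names and
defaults come from the rotary_cartpole config in this commit; the cosine
upright term, the angle convention (0 = upright), and the exponential
stillness term are assumptions, and the motor/action penalties are left out.

```python
import math


def rotary_cartpole_reward(pendulum_angle, pendulum_vel,
                           reward_upright_scale=1.0, alive_bonus=0.25,
                           balance_bonus=2.0, balance_vel_scale=0.5):
    """Per-step reward without the regularisation penalties (sketch)."""
    # Upright term shifted from [-1, 1] to [0, 1]; angle 0 is assumed upright.
    upright = 0.5 * (1.0 + math.cos(pendulum_angle))
    # Stillness lies in (0, 1] and decays with pendulum speed (assumed form).
    stillness = math.exp(-balance_vel_scale * abs(pendulum_vel))
    return (reward_upright_scale * upright
            + alive_bonus
            + balance_bonus * upright * stillness)
```

Under these assumptions every step pays at least alive_bonus, so ending the
episode early never scores higher than surviving, and a pendulum passing
through upright at high speed earns almost none of the balance bonus.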

Deploy:
- eval.py load_policy reconstructs the history/adaptation encoder
  (auto-detects its dim from the checkpoint) so DR+embedding policies load.
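
One way such auto-detection can work is to read the encoder sizes off the
weight shapes stored in the checkpoint. The sketch below is hypothetical: the
"history_encoder." key prefix and the helper name are invented for
illustration, nested lists stand in for tensors, and it is not the code in
eval.py.

```python
def infer_encoder_dims(state_dict, prefix="history_encoder."):
    """Return (input_dim, embedding_dim) of the encoder, or None if absent.

    Assumes linear-layer weights are stored as (out_features, in_features)
    and that the mapping preserves layer order.
    """
    weights = [value for key, value in state_dict.items()
               if key.startswith(prefix) and key.endswith(".weight")]
    if not weights:
        return None  # checkpoint was trained without a history encoder
    first, last = weights[0], weights[-1]
    return len(first[0]), len(last)


# Usage with a fake checkpoint: 50 inputs -> 64 hidden -> 32-dim embedding.
fake_checkpoint = {
    "policy.0.weight": [[0.0] * 4 for _ in range(2)],
    "history_encoder.0.weight": [[0.0] * 50 for _ in range(64)],
    "history_encoder.2.weight": [[0.0] * 64 for _ in range(32)],
}
dims = infer_encoder_dims(fake_checkpoint)
```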

Fixes:
- MuJoCoRunner._sim_reset referenced self._env (typo for self.env), which
  broke every rotary-cartpole reset; it now uses self.env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 20:48:25 +02:00
parent 8cc84d6a21
commit b37cd26690
22 changed files with 1219 additions and 781 deletions


@@ -1,12 +1,18 @@
max_steps: 1000
robot_path: assets/rotary_cartpole
reward_upright_scale: 1.0
alive_bonus: 0.25 # per-step survival bonus (living must beat dying)
balance_bonus: 2.0 # extra reward for upright AND still (beats spinning)
balance_vel_scale: 0.5 # how fast the balance bonus decays with pendulum speed
# ── Regularisation penalties (prevent fast spinning) ─────────────────
motor_vel_penalty: 0.01 # penalise high motor angular velocity
motor_angle_penalty: 0.05 # penalise deviation from centre
action_penalty: 0.05 # penalise large actions (energy cost)
# ── Initial state randomisation ──────────────────────────────────────
pendulum_init_range_deg: 180.0 # pendulum starts in [-180°, +180°]
# ── Software safety limit (env-level, always applied) ────────────────
motor_angle_limit_deg: 90.0 # terminate episode if motor exceeds ±90°
@@ -16,4 +22,5 @@ hpo:
motor_vel_penalty: {min: 0.001, max: 0.1}
motor_angle_penalty: {min: 0.01, max: 0.2}
action_penalty: {min: 0.01, max: 0.2}
pendulum_init_range_deg: {min: 30.0, max: 180.0}
max_steps: {values: [500, 1000, 2000]}


@@ -3,3 +3,15 @@ device: auto # auto = cuda if available, else cpu
dt: 0.002
substeps: 10
history_length: 10 # RMA-style: 10-step window of (obs, action) pairs
rma_mode: "none" # "none" | "teacher" | "deploy"
# ── Domain randomization (sim-to-real) ──────────────────────────────
# NOTE: action-delay and sensor-noise are applied for MJX, but the
# per-env dynamics *scales* (friction/damping/torque) are NOT yet wired
# into the JIT step — use runner=mujoco for scale randomization, or keep
# this block to delay+noise only on MJX.
domain_rand:
qpos_noise_std: 0.01 # rad — encoder angle noise
qvel_noise_std: 0.5 # rad/s — velocity-estimate noise (measured)
action_delay_steps: [0, 2] # control-step latency (0-40 ms)
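
The latency range quoted for action_delay_steps can be checked from dt and
substeps. This sketch assumes one control step spans `substeps` physics steps
of length `dt`, and that both bounds of the delay range are reachable; the
helper name is made up for illustration.

```python
def control_latency_ms(dt, substeps, delay_steps):
    """Convert a (min, max) delay in control steps to milliseconds."""
    control_period_ms = dt * substeps * 1000.0  # 0.002 s * 10 = 20 ms per control step
    return tuple(round(n * control_period_ms, 6) for n in delay_steps)


latency = control_latency_ms(dt=0.002, substeps=10, delay_steps=(0, 2))
```

With the values in this config the range works out to 0 to 40 ms, in line
with the 50 Hz control loop used on hardware.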


@@ -2,13 +2,17 @@ num_envs: 64
device: auto # auto = cuda if available, else cpu
dt: 0.002
substeps: 10
-history_length: 10 # RMA-style: 10-step window of (obs, action) pairs
+history_length: 10 # must match training.history_length (DR + embedding)
# ── Sim2real: domain randomization ───────────────────────────────
rma_mode: "none" # "none" | "teacher" | "deploy"
# ── Domain randomization (sim-to-real) ──────────────────────────────
# Noise/delay levels anchored to the real recordings (~50 Hz, ~0.5 rad/s
# velocity noise, ≤1-step latency). Set domain_rand: {} to disable.
domain_rand:
mass_frac: 0.15 # ±15% body mass randomization
friction_frac: 0.3 # ±30% joint friction
damping_frac: 0.3 # ±30% joint damping
armature_frac: 0.2 # ±20% reflected rotor inertia
gear_frac: 0.15 # ±15% actuator gear ratio
com_offset: 0.005 # ±5mm center-of-mass shift
qpos_noise_std: 0.01 # rad — encoder angle noise
qvel_noise_std: 0.5 # rad/s — velocity-estimate noise (measured)
action_delay_steps: [0, 2] # control-step latency (0-40 ms)
friction_scale: [0.6, 1.6] # Coulomb-friction multiplier
damping_scale: [0.6, 1.6] # viscous-damping multiplier
torque_scale: [0.85, 1.15] # motor-constant / battery-voltage variation
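
A small sketch of per-env scale sampling from the ranges above, plus a
normalized vector of the kind that could be exposed as privileged
observations. Uniform sampling and the [-1, 1] normalization are assumptions;
the function names are hypothetical and do not mirror the runner's API.

```python
import random

# Ranges copied from the domain_rand block above.
DR_RANGES = {
    "friction_scale": (0.6, 1.6),
    "damping_scale": (0.6, 1.6),
    "torque_scale": (0.85, 1.15),
}


def sample_scales(num_envs, ranges=DR_RANGES, seed=0):
    """Draw one set of dynamics multipliers per environment (uniform, assumed)."""
    rng = random.Random(seed)
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
            for _ in range(num_envs)]


def normalize(scales, ranges=DR_RANGES):
    """Map each multiplier to [-1, 1] relative to its sampling range."""
    return [2.0 * (scales[name] - lo) / (hi - lo) - 1.0
            for name, (lo, hi) in ranges.items()]


per_env = sample_scales(num_envs=64)
privileged = [normalize(s) for s in per_env]  # one 3-element vector per env
```

Calling sample_scales again at each episode reset would give the
per-episode resampling mentioned in the commit message.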


@@ -6,3 +6,12 @@ device: cpu
dt: 0.002
substeps: 10
history_length: 10
rma_mode: "none" # "none" | "teacher" | "deploy"
# Clean by default (deterministic eval). Confirming-experiment example —
# re-eval an existing checkpoint in sim with a fixed 1-step action delay:
# mjpython scripts/eval.py env=rotary_cartpole runner=mujoco_single \
# checkpoint=runs/.../agent_XXXX.pt \
# '++runner.domain_rand.action_delay_steps=[1,1]'
domain_rand: {}


@@ -9,3 +9,5 @@ baud: 115200
dt: 0.02 # control loop period (50 Hz, matches training)
no_data_timeout: 2.0 # seconds of silence before declaring disconnect
history_length: 10 # must match training runner
rma_mode: "none" # "none" | "teacher" | "deploy"


@@ -1,19 +1,19 @@
-hidden_sizes: [128, 128]
+hidden_sizes: [256, 256]
total_timesteps: 5000000
-rollout_steps: 1024
-learning_epochs: 4
-mini_batches: 4
+rollout_steps: 2048
+learning_epochs: 10
+mini_batches: 8
discount_factor: 0.99
gae_lambda: 0.95
learning_rate: 0.0003
clip_ratio: 0.2
value_loss_scale: 0.5
-entropy_loss_scale: 0.05
+entropy_loss_scale: 0.01
log_interval: 1000
checkpoint_interval: 50000
-initial_log_std: 0.5
-min_log_std: -2.0
+initial_log_std: -0.5
+min_log_std: -4.0
max_log_std: 2.0
record_video_every: 10000
@@ -22,6 +22,10 @@ record_video_every: 10000
history_length: 10 # temporal window (must match runner)
embedding_dim: 32 # history encoder output dimension
# RMA (Rapid Motor Adaptation)
rma_mode: "none" # "none" | "teacher" | "deploy"
latent_dim: 8 # env encoder / adaptation latent dimension
# ClearML remote execution (GPU worker)
remote: false
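
To see what the log-std settings in the training config mean in action
units, the sketch below converts them to standard deviations. It reads each
pair of repeated keys as old value followed by new value, and it assumes the
trainer clamps log-std to [min_log_std, max_log_std] before exponentiating;
the helper names are illustrative.

```python
import math


def clamp_log_std(log_std, min_log_std=-4.0, max_log_std=2.0):
    """Clip a log standard deviation to the configured bounds."""
    return max(min_log_std, min(max_log_std, log_std))


def action_std(log_std, min_log_std=-4.0, max_log_std=2.0):
    """Standard deviation of the Gaussian policy after clamping."""
    return math.exp(clamp_log_std(log_std, min_log_std, max_log_std))


old_initial_std = action_std(0.5, min_log_std=-2.0)   # about 1.65
new_initial_std = action_std(-0.5)                    # about 0.61
old_floor_std = action_std(-10.0, min_log_std=-2.0)   # about 0.135
new_floor_std = action_std(-10.0)                     # about 0.018
```

Under that reading, the policy starts with less exploration noise and is
allowed to become far more deterministic late in training.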