Files
RL-Sim-Framework/configs/training/ppo.yaml
Victor Mylle b37cd26690 feat: sim2real domain randomization + reward fixes for rotary cartpole
Close the sim2real gap for the Furuta pendulum (swings up but can't
balance on hardware). Root causes were (a) no domain randomization, so
the policy overfit one deterministic sim instance, and (b) reward design
flaws that produced degenerate policies.

Domain randomization (runner-level, backend-agnostic):
- BaseRunner: domain_rand config; per-env action-delay buffer (latency),
  Gaussian qpos/qvel sensor noise, per-env dynamics-scale sampling
  (friction/damping/torque), resampled per episode. Sensor noise per step.
- privileged_obs/privileged_dim expose normalized DR factors (mu) for RMA.
- step() now uses clean state for reward/termination, noisy state for the
  observation the policy sees.
- MuJoCoRunner: applies per-env friction/damping/torque scales.
- robot.py: compute_motor_force gains friction/damping scale args.
- Configs: DR blocks for mujoco (full) and mjx (delay+noise); clean
  defaults for mujoco_single/serial; noise/delay anchored to recordings.

Reward fixes (rotary_cartpole):
- Shift upright reward to [0,1] (was [-1,1]) + alive_bonus, so surviving
  always beats ending early (kills the "suicide into the limit" policy).
- Add balance_bonus * upright * stillness so reward requires upright AND
  near-zero pendulum velocity (kills the "spin in full loops" policy).

Deploy:
- eval.py load_policy reconstructs the history/adaptation encoder
  (auto-detects its dim from the checkpoint) so DR+embedding policies load.

Fixes:
- MuJoCoRunner._sim_reset referenced self._env (typo) -> self.env, which
  was breaking every rotary-cartpole reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 20:48:25 +02:00

43 lines
1.3 KiB
YAML

hidden_sizes: [256, 256]
total_timesteps: 5000000
rollout_steps: 2048
learning_epochs: 10
mini_batches: 8
discount_factor: 0.99
gae_lambda: 0.95
learning_rate: 0.0003
clip_ratio: 0.2
value_loss_scale: 0.5
entropy_loss_scale: 0.01
log_interval: 1000
checkpoint_interval: 50000
initial_log_std: -0.5
min_log_std: -4.0
max_log_std: 2.0
record_video_every: 10000
# RMA-style history encoder
history_length: 10 # temporal window (must match runner)
embedding_dim: 32 # history encoder output dimension
# RMA (Rapid Motor Adaptation)
rma_mode: "none" # "none" | "teacher" | "deploy"
latent_dim: 8 # env encoder / adaptation latent dimension
# ClearML remote execution (GPU worker)
remote: false
# ── HPO search ranges ────────────────────────────────────────────────
# Read by scripts/hpo.py — ignored by TrainerConfig during training.
hpo:
learning_rate: {min: 0.00005, max: 0.001}
clip_ratio: {min: 0.1, max: 0.3}
discount_factor: {min: 0.98, max: 0.999}
gae_lambda: {min: 0.9, max: 0.99}
entropy_loss_scale: {min: 0.0001, max: 0.1}
value_loss_scale: {min: 0.1, max: 1.0}
learning_epochs: {min: 2, max: 8, type: int}
mini_batches: {values: [2, 4, 8, 16]}