RL-Framework

A small, fast RL framework for training sim2real policies on a 3D-printed rotary (Furuta) cartpole — built to scale from a laptop CPU to a GPU worker (ClearML) without code changes, and to grow into more robots and simulators.

Architecture

Three orthogonal pieces, composed by Hydra config groups:

| Piece | Role | Implementations |
| --- | --- | --- |
| Env (`src/envs/`) | Task logic: obs / reward / termination / init distribution. Pure torch, batched, backend-agnostic. | `rotary_cartpole` |
| Runner (`src/runners/`) | Physics + sim2real plumbing (DR, sensor noise, action delay, history buffer). | `mujoco` (CPU), `mjx` (GPU/JAX), `serial` (real ESP32 robot) |
| Trainer (`src/training/`) | skrl PPO + shared MLP with optional history encoder. | `ppo`, `ppo_mjx`, `ppo_single`, `ppo_real` |

The robot itself is described once in assets/<robot>/robot.yaml (URDF + identified motor model) and shared by training, sysid and deployment. The motor model (bias → deadzone → gear compensation, Coulomb + Stribeck friction, viscous damping, first-order lag) is implemented in src/core/robot.py and mirrored exactly in the MJX JIT step (src/runners/mjx.py).
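
To make the motor-model chain concrete, here is a minimal scalar sketch of one control step. It is an illustration only, not the code in src/core/robot.py: every parameter name and default value is hypothetical, "gear compensation" is assumed to be a plain gain, and the first-order lag is assumed to act on the drive torque before friction is subtracted.

```python
import math
from dataclasses import dataclass


@dataclass
class MotorParams:
    # All fields and defaults are hypothetical placeholders, not identified values.
    bias: float = 0.0          # constant offset added to the command
    deadzone: float = 0.05     # |command| at or below this yields no torque
    gear: float = 1.0          # command-to-torque gain (assumed meaning of "gear compensation")
    coulomb: float = 0.02      # Coulomb friction torque
    stiction: float = 0.03     # breakaway friction torque at zero velocity
    stribeck_vel: float = 0.1  # velocity scale of the Stribeck decay
    viscous: float = 0.001     # viscous damping coefficient
    tau: float = 0.02          # first-order lag time constant [s]


def motor_torque(cmd, vel, prev_drive, p, dt):
    """Return (net_torque, drive_torque) for one control step."""
    # 1. Bias, then deadzone: small commands vanish, larger ones are shifted.
    u = cmd + p.bias
    if abs(u) <= p.deadzone:
        u = 0.0
    else:
        u -= math.copysign(p.deadzone, u)

    # 2. Gain, then a discrete first-order lag towards the target torque.
    target = p.gear * u
    alpha = dt / (p.tau + dt)
    drive = prev_drive + alpha * (target - prev_drive)

    # 3. Friction: Stribeck curve between stiction and Coulomb levels,
    #    smoothed around zero velocity with tanh, plus viscous damping.
    level = p.coulomb + (p.stiction - p.coulomb) * math.exp(-(vel / p.stribeck_vel) ** 2)
    friction = level * math.tanh(vel / 1e-3) + p.viscous * vel

    return drive - friction, drive
```

The returned drive torque is fed back as prev_drive on the next step, which is what gives the lag its memory. A batched torch or JAX version would replace the branch with a masked expression, but the order of operations would stay the same under these assumptions.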

Train

# CPU (64 parallel MuJoCo envs)
python scripts/train.py env=rotary_cartpole runner=mujoco training=ppo

# GPU (1024 MJX envs) — local
python scripts/train.py env=rotary_cartpole runner=mjx training=ppo_mjx

# GPU — remote on ClearML gpu-queue
python scripts/train.py env=rotary_cartpole runner=mjx training=ppo_mjx training.remote=true

Videos and scalars stream to ClearML. Checkpoints land in runs/.

Sim2real recipe

1. Capture real trajectories: python -m src.sysid.capture (writes .npz to assets/<robot>/recordings/).
2. Identify physics: python -m src.sysid.optimize --robot-path assets/rotary_cartpole --recording <capture>.npz. CMA-ES fits inertials/joint dynamics against the recording (motor model is locked from the unified sysid). Writes sysid_result.json + robot_tuned.yaml + *_tuned.urdf.
3. Validate the fit: python -m src.sysid.visualize, then copy robot_tuned.yaml → robot.yaml.
4. Train with DR + history: the runner randomizes friction/damping/torque scales, sensor noise and action latency per episode (configs/runner/mjx.yaml: domain_rand), and appends a 10-step (obs, action) history to the observation so the policy can implicitly identify the current dynamics (history_length).
5. Deploy: mjpython scripts/eval.py env=rotary_cartpole runner=serial checkpoint=runs/<run>/checkpoints/agent_X.pt
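
The "DR + history" step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the runner's actual code: the range values, DR_RANGES, sample_dynamics and HistoryBuffer are hypothetical names, the observation size of 4 is made up, and the real implementation is batched.

```python
import random
from collections import deque

# Hypothetical ranges; the real ones live under domain_rand in the runner config.
DR_RANGES = {
    "friction_scale": (0.7, 1.3),
    "damping_scale": (0.7, 1.3),
    "torque_scale": (0.8, 1.2),
}


def sample_dynamics(rng):
    """Draw one set of scale factors, held fixed for a whole episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in DR_RANGES.items()}


class HistoryBuffer:
    """Keeps the last `length` (obs, action) pairs for one environment."""

    def __init__(self, obs_dim, act_dim, length=10):
        self.pair_dim = obs_dim + act_dim
        self.length = length
        self.buf = deque(maxlen=length)
        self.reset()

    def reset(self):
        # Zero-fill so the policy input has a constant size from step one.
        self.buf.clear()
        for _ in range(self.length):
            self.buf.append([0.0] * self.pair_dim)

    def push(self, obs, action):
        # The deque drops the oldest pair automatically once it is full.
        self.buf.append(list(obs) + list(action))

    def augment(self, obs):
        """Current observation followed by the flattened history, oldest first."""
        flat = [x for pair in self.buf for x in pair]
        return list(obs) + flat


# Usage: new episode, new dynamics, fresh history.
rng = random.Random(0)
dynamics = sample_dynamics(rng)
history = HistoryBuffer(obs_dim=4, act_dim=1, length=10)
policy_input = history.augment([0.0, 0.0, 0.0, 0.0])
print(len(policy_input))  # 4 + 10 * (4 + 1) = 54
```

Because the sampled scales are never shown to the policy, the only way it can adapt is by reading the recent (obs, action) pairs, which is the implicit identification the recipe relies on.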

Other tools

mjpython scripts/viz.py env=rotary_cartpole              # keyboard-drive the sim
mjpython scripts/viz.py runner=serial                    # digital twin of the real robot
python scripts/hpo.py env=rotary_cartpole training=ppo_single   # ClearML + SMAC3 HPO
pytest tests/                                            # unit tests

Adding a robot / simulator

Robot: drop assets/<name>/ (URDF + robot.yaml), subclass BaseEnv (obs/reward/termination/initial_state_ranges), register in src/core/registry.py, add configs/env/<name>.yaml.
Simulator: subclass BaseRunner and implement _sim_initialize, _sim_step and _sim_reset (full-batch return). DR, history and the env-side logic come for free. Register the new runner in RUNNER_REGISTRY in scripts/train.py.
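
As a rough picture of what a new simulator backend involves, the sketch below uses a stand-in BaseRunner so that it runs on its own. The real BaseRunner's constructor and hook signatures are not documented here and may differ. PointMassRunner, its state layout and the dict shape of RUNNER_REGISTRY are all hypothetical.

```python
class BaseRunner:
    """Hypothetical stand-in for the framework's BaseRunner."""

    def __init__(self, num_envs):
        self.num_envs = num_envs
        self.state = self._sim_initialize()

    def step(self, actions):
        # In the framework, shared plumbing (DR, noise, delay, history)
        # would wrap this call; here it only forwards to the backend hook.
        self.state = self._sim_step(self.state, actions)
        return self.state

    def reset(self):
        self.state = self._sim_reset()
        return self.state

    # Backend hooks that a new simulator has to provide.
    def _sim_initialize(self):
        raise NotImplementedError

    def _sim_step(self, state, actions):
        raise NotImplementedError

    def _sim_reset(self):
        raise NotImplementedError


class PointMassRunner(BaseRunner):
    """Toy backend: one 1-D point mass per env, explicit Euler integration."""

    dt = 0.01

    def _sim_initialize(self):
        return [(0.0, 0.0)] * self.num_envs  # (position, velocity) per env

    def _sim_step(self, state, actions):
        dt = self.dt
        return [(x + v * dt, v + a * dt) for (x, v), a in zip(state, actions)]

    def _sim_reset(self):
        # Full-batch return: one fresh state for every env.
        return [(0.0, 0.0)] * self.num_envs


# Hypothetical shape of the registry: a name-to-class mapping.
RUNNER_REGISTRY = {"point_mass": PointMassRunner}

runner = RUNNER_REGISTRY["point_mass"](num_envs=2)
print(runner.step([1.0, -1.0]))
```

The point of the split is that the subclass only knows about raw physics state, so everything layered on top in the base class applies to any backend that fills in the three hooks.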
