# RL-Framework

A small, fast RL framework for training sim2real policies on a 3D-printed
rotary (Furuta) cartpole — built to scale from a laptop CPU to a GPU worker
(ClearML) without code changes, and to grow into more robots and simulators.

## Architecture

Three orthogonal pieces, composed by Hydra config groups:

| Piece | Role | Implementations |
|---|---|---|
| **Env** (`src/envs/`) | Task logic: obs / reward / termination / init distribution. Pure torch, batched, backend-agnostic. | `rotary_cartpole` |
| **Runner** (`src/runners/`) | Physics + sim2real plumbing (DR, sensor noise, action delay, history buffer). | `mujoco` (CPU), `mjx` (GPU/JAX), `serial` (real ESP32 robot) |
| **Trainer** (`src/training/`) | skrl PPO + shared MLP with optional history encoder. | `ppo`, `ppo_mjx`, `ppo_single`, `ppo_real` |

The robot itself is described once in `assets/<robot>/robot.yaml`
(URDF + identified motor model) and shared by **training, sysid and
deployment**. The motor model (bias → deadzone → gear compensation,
Coulomb + Stribeck friction, viscous damping, first-order lag) is
implemented in `src/core/robot.py` and mirrored exactly in the MJX JIT
step (`src/runners/mjx.py`).
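
To make the stages of that motor model concrete, here is a minimal scalar sketch of the chain named above. It is not the code in `src/core/robot.py`: the `MotorParams` fields, the default values, the direction-dependent gain used for "gear compensation", and the `tanh` smoothing of the friction sign are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MotorParams:
    # Hypothetical placeholder values, not identified parameters.
    bias: float = 0.02           # constant command offset
    deadzone: float = 0.05       # commands below this magnitude produce no torque
    gear_gain_pos: float = 1.0   # assumed direction-dependent gain ("gear compensation")
    gear_gain_neg: float = 0.9
    coulomb: float = 0.01        # Coulomb friction level
    stribeck: float = 0.005      # extra friction near zero velocity
    stribeck_vel: float = 0.1    # velocity scale of the Stribeck term
    viscous: float = 0.002       # viscous damping coefficient
    lag_tau: float = 0.03        # first-order lag time constant [s]


def shape_command(u: float, p: MotorParams) -> float:
    """bias -> deadzone -> direction-dependent gain."""
    u = u + p.bias
    if abs(u) <= p.deadzone:
        return 0.0
    u = math.copysign(abs(u) - p.deadzone, u)
    gain = p.gear_gain_pos if u > 0 else p.gear_gain_neg
    return gain * u


def friction_torque(omega: float, p: MotorParams) -> float:
    """Coulomb + Stribeck + viscous friction, opposing the velocity omega."""
    stribeck = p.stribeck * math.exp(-((omega / p.stribeck_vel) ** 2))
    smooth_sign = math.tanh(omega / 1e-3)  # smooth stand-in for sign(omega)
    return -(p.coulomb + stribeck) * smooth_sign - p.viscous * omega


def lag_step(state: float, target: float, dt: float, p: MotorParams) -> float:
    """One implicit-Euler step of a first-order lag towards `target`."""
    alpha = dt / (p.lag_tau + dt)
    return state + alpha * (target - state)
```

In a simulation step, one would typically call `shape_command`, pass its result through `lag_step`, and add `friction_torque` at the joint. The exact ordering and units in the framework may differ.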

## Train

```bash
# CPU (64 parallel MuJoCo envs)
python scripts/train.py env=rotary_cartpole runner=mujoco training=ppo

# GPU (1024 MJX envs) — local
python scripts/train.py env=rotary_cartpole runner=mjx training=ppo_mjx

# GPU — remote on ClearML gpu-queue
python scripts/train.py env=rotary_cartpole runner=mjx training=ppo_mjx training.remote=true
```

Videos and scalars stream to ClearML. Checkpoints land in `runs/`.

## Sim2real recipe

1. **Capture** real trajectories: `python -m src.sysid.capture` (writes `.npz` to `assets/<robot>/recordings/`).
2. **Identify** physics: `python -m src.sysid.optimize --robot-path assets/rotary_cartpole --recording <capture>.npz`.
   CMA-ES fits the inertial parameters and joint dynamics against the recording, while the motor model stays locked at the values from the unified sysid. Writes `sysid_result.json`, `robot_tuned.yaml` and `*_tuned.urdf`.
3. **Validate** the fit: `python -m src.sysid.visualize`, then copy `robot_tuned.yaml` → `robot.yaml`.
4. **Train with DR + history**: the runner randomizes friction, damping and torque scales, sensor noise and action latency per episode (`configs/runner/mjx.yaml: domain_rand`). It also appends a 10-step (obs, action) history to the observation (`history_length`) so the policy can implicitly identify the current dynamics.
5. **Deploy**: `mjpython scripts/eval.py env=rotary_cartpole runner=serial checkpoint=runs/<run>/checkpoints/agent_X.pt`
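
Step 4 combines two mechanisms: per-episode sampling of randomized dynamics and an observation augmented with recent (obs, action) pairs. The sketch below shows both for a single environment in plain Python. The names `sample_domain_rand` and `HistoryBuffer`, the range keys, and the zero-padding at reset are hypothetical; the framework's batched implementation may look quite different.

```python
import random
from collections import deque

HISTORY_LENGTH = 10  # the 10-step history described in step 4


def sample_domain_rand(rng, ranges):
    """Draw one value per randomized quantity; call once at each episode reset."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


class HistoryBuffer:
    """Keeps the last `length` (obs, action) pairs and appends them to the observation."""

    def __init__(self, obs_dim, act_dim, length=HISTORY_LENGTH):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.buf = deque(maxlen=length)
        self.reset()

    def reset(self):
        # Assumption: the history is zero-padded at the start of an episode.
        self.buf.clear()
        for _ in range(self.buf.maxlen):
            self.buf.append([0.0] * (self.obs_dim + self.act_dim))

    def push(self, obs, action):
        # The deque drops the oldest pair automatically once it is full.
        self.buf.append(list(obs) + list(action))

    def augment(self, obs):
        """Return the current obs followed by the flattened history, oldest first."""
        flat = [x for step in self.buf for x in step]
        return list(obs) + flat


# Usage: hypothetical scale ranges, sampled once per episode.
rng = random.Random(0)
episode_params = sample_domain_rand(
    rng,
    {"friction_scale": (0.8, 1.2), "damping_scale": (0.8, 1.2), "torque_scale": (0.9, 1.1)},
)
```

With a 4-dimensional observation and a 1-dimensional action (illustrative sizes), the augmented observation has 4 + 10 × 5 = 54 entries.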

## Other tools

```bash
mjpython scripts/viz.py env=rotary_cartpole              # keyboard-drive the sim
mjpython scripts/viz.py runner=serial                    # digital twin of the real robot
python scripts/hpo.py env=rotary_cartpole training=ppo_single   # ClearML + SMAC3 HPO
pytest tests/                                            # unit tests
```

## Adding a robot / simulator

- **Robot**: drop `assets/<name>/` (URDF + `robot.yaml`), subclass `BaseEnv`
  (obs/reward/termination/`initial_state_ranges`), register in `src/core/registry.py`,
  add `configs/env/<name>.yaml`.
- **Simulator**: subclass `BaseRunner` and implement `_sim_initialize`,
  `_sim_step`, `_sim_reset` (full-batch return) — DR, history and the
  env-side logic come for free. Register in `scripts/train.py: RUNNER_REGISTRY`.
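
As a rough illustration of the simulator hook points, the sketch below defines a stand-in `BaseRunner` and a toy point-mass backend that implements the three methods. The stand-in base class, the method signatures, the `num_envs` argument and the reading of "full-batch return" (reset returns the state of every env, not only the ones that were reset) are assumptions; check the real `BaseRunner` before copying this.

```python
from abc import ABC, abstractmethod


class BaseRunner(ABC):
    """Stand-in for the framework's base class; the real signatures may differ."""

    def __init__(self, num_envs):
        self.num_envs = num_envs
        self._sim_initialize()

    @abstractmethod
    def _sim_initialize(self):
        """Allocate simulator state for all envs."""

    @abstractmethod
    def _sim_step(self, actions):
        """Advance every env by one step and return the full-batch state."""

    @abstractmethod
    def _sim_reset(self, env_ids):
        """Reset the given envs and return the full-batch state."""


class PointMassRunner(BaseRunner):
    """Toy backend: one 1-D point mass per env, driven by a force action."""

    DT = 0.01  # integration step [s], illustrative

    def _sim_initialize(self):
        self.pos = [0.0] * self.num_envs
        self.vel = [0.0] * self.num_envs

    def _sim_step(self, actions):
        # Semi-implicit Euler: update velocity first, then position.
        for i, a in enumerate(actions):
            self.vel[i] += a * self.DT
            self.pos[i] += self.vel[i] * self.DT
        return self._state()

    def _sim_reset(self, env_ids):
        for i in env_ids:
            self.pos[i] = 0.0
            self.vel[i] = 0.0
        # Return the state of all envs, including those that were not reset.
        return self._state()

    def _state(self):
        return [(p, v) for p, v in zip(self.pos, self.vel)]
```

A new backend would then be registered under a name in `RUNNER_REGISTRY` so that `runner=<name>` selects it.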