- Add --cdc_debug flag to enable verbose bucket-by-bucket output
- When debug=False (default): Show tqdm progress bar, concise logging
- When debug=True: Show detailed bucket information, no progress bar
- Improves user experience during CDC cache generation
Fixes shape mismatch bug in multi-subset training where CDC preprocessing
and training used different index calculations, causing wrong CDC data to
be loaded for samples.
Changes:
- CDC cache now stores/loads data using image_key strings instead of integer indices
- Training passes image_key list instead of computed integer indices
- All CDC lookups use stable image_key identifiers
- Improved device compatibility check (handles "cuda" vs "cuda:0")
- Updated all 30 CDC tests to use image_key-based access
Root cause: Preprocessing used cumulative dataset indices while training
used sorted keys, resulting in mismatched lookups during shuffled multi-subset
training.
- Treat cuda and cuda:0 as compatible devices
- Only warn on actual device mismatches (cuda vs cpu)
- Eliminates warning spam during multi-subset training
- Check that noise and CDC matrices are on same device
- Automatically transfer noise if device mismatch detected
- Warn user when device transfer occurs
- Add tests to verify device handling
- Track warned samples in global set to prevent log spam
- Each sample only warned once per training session
- Prevents thousands of duplicate warnings during training
- Add tests to verify throttling behavior
- Remove @torch.no_grad() decorator from compute_sigma_t_x()
- Gradients now properly flow through CDC transformation during training
- Add comprehensive gradient flow tests for fast/slow paths and fallback
- All 25 CDC tests passing
- Create apply_cdc_noise_transformation() for better modularity
- Implement fast path for batch processing when all shapes match
- Implement slow path for per-sample processing on shape mismatch
- Clone noise tensors in fallback path for gradient consistency
- Cache all shapes during GammaBDataset initialization
- Eliminates file I/O on every training step (9.5M accesses/sec)
- Reduces get_shape() from file operation to dict lookup
- Memory overhead: ~126 bytes/sample (~12.6 MB per 100k images)