Commits


ashbhandare authored and GitHub committed 7cebf76a46b
Improve checkpointing for Zero stage 1 (#5478)

* Initial running changes
* Checkpointing aggregation changes
* compare with older version
* initial cleanup
* Add zero test, minor fix
* Fix zero test, transform, formatting
* Review comments
* add more unit tests
* review comments
* Try fix CI
* Add additional check on just aggregation code
* Try fix ckpt gen
* Add pregenerated ckpt for CI, enable zero test in e2e
* Moving test to nightly, removing ckpt files
* Add tests to dist GPU CI
* Fix dist test
* Review comments
* Fix test