Comparing changes

* Add TFV1TrainLoopMonitorCallback.
* Remove the parameter num_workers in TFV1TrainLoopMonitorCallback.is_critical_pod
* Update comments.
* Merge two if conditions into one.
* Composite TFV1PSStrategyTrainLoopMonitorCallback if _is_tfv1_ps_strategy_training is True.
* Set the exit_code according to the success parameter in Master.request_stop.
* Add message content in request_stop.
* Organize the master exit logic.
* Resolve test failure.
* Resolve typo.

* Update the version releasing doc. (#2474)
  * Update the release step: cherry-pick the fix commit from develop to release branch
  * Fix typo
  * Update the versioning doc.
* Add cluster_spec_json in EXCLUDE_PRINT_ARGS (#2479)
* Implement the fail fast mechanism of master. (#2480)
  * Add TFV1TrainLoopMonitorCallback.
  * Remove the parameter num_workers in TFV1TrainLoopMonitorCallback.is_critical_pod
  * Update comments.
  * Merge two if conditions into one.
  * Composite TFV1PSStrategyTrainLoopMonitorCallback if _is_tfv1_ps_strategy_training is True.
  * Set the exit_code according to the success parameter in Master.request_stop.
  * Add message content in request_stop.
  * Organize the master exit logic.
  * Resolve test failure.
  * Resolve typo.
* add pod status change log (#2483)
* Fix model cuda (#2484)
* Relaunch worker on failure (#2485)
  * relaunch worker on failure
  * only relaunch in PS strategy
* Create an ElasticImageFolder for PyTorch. (#2486)
  * Develop an image folder dataset for PyTorch
  * Add docstring
* Check whether to register hooks according to HOROVOD_ELASTIC (#2487)
  * Check whether to register hooks according to HOROVOD_ELASTIC
  * Register hooks
* Remove "elasticdl-" prefix to ps/worker pod name (#2489)
  * remove prefix to ps/worker pod name
  * fix tests
  * fix black

Co-authored-by: brightcoder01 <55301748+brightcoder01@users.noreply.github.com>
Co-authored-by: Qinlong Wang <WangQL1201@outlook.com>

* Update the version releasing doc. (#2474)
  * Update the release step: cherry-pick the fix commit from develop to release branch
  * Fix typo
  * Update the versioning doc.
* Add cluster_spec_json in EXCLUDE_PRINT_ARGS (#2479)
* Implement the fail fast mechanism of master. (#2480)
  * Add TFV1TrainLoopMonitorCallback.
  * Remove the parameter num_workers in TFV1TrainLoopMonitorCallback.is_critical_pod
  * Update comments.
  * Merge two if conditions into one.
  * Composite TFV1PSStrategyTrainLoopMonitorCallback if _is_tfv1_ps_strategy_training is True.
  * Set the exit_code according to the success parameter in Master.request_stop.
  * Add message content in request_stop.
  * Organize the master exit logic.
  * Resolve test failure.
  * Resolve typo.
* add pod status change log (#2483)
* Fix model cuda (#2484)
* Relaunch worker on failure (#2485)
  * relaunch worker on failure
  * only relaunch in PS strategy
* Create an ElasticImageFolder for PyTorch. (#2486)
  * Develop an image folder dataset for PyTorch
  * Add docstring
* Check whether to register hooks according to HOROVOD_ELASTIC (#2487)
  * Check whether to register hooks according to HOROVOD_ELASTIC
  * Register hooks
* Remove "elasticdl-" prefix to ps/worker pod name (#2489)
  * remove prefix to ps/worker pod name
  * fix tests
  * fix black
* Develop an API to get training epoch (#2488)
  * Check whether to register hooks according to HOROVOD_ELASTIC
  * Develop an API to get training epoch
  * Register hooks
  * Add unittest
  * Fic by comments
  * Fix unittest

Co-authored-by: brightcoder01 <55301748+brightcoder01@users.noreply.github.com>
Co-authored-by: HT <tenn_2001c@yahoo.com>

Commits on Jan 8, 2021

Commits on Jan 13, 2021

Commits on Jan 14, 2021

Commits on Jan 22, 2021
