Monitoring Training

Monitoring Horovod training with TensorBoard works the same way as it does for regular TensorFlow or PyTorch workloads. Your training script generates the TensorBoard logs and saves them to the directory referenced by the OCI__SYNC_DIR environment variable. With SYNC_ARTIFACTS=1, these TensorBoard logs are periodically synchronized with the configured object storage bucket.
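For example, a Keras training script might point its TensorBoard callback at that directory. This is a minimal sketch; the tensorboard sub-directory name and the local fallback path are arbitrary choices for illustration, not values required by the runtime.

import os
import tensorflow as tf

# OCI__SYNC_DIR is set by the job runtime; fall back to a local path when testing outside the job.
log_dir = os.path.join(os.environ.get("OCI__SYNC_DIR", "/tmp/oci_sync"), "tensorboard")

# Write TensorBoard event files into the synced directory so they are uploaded to the bucket.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
# model.fit(train_dataset, callbacks=[tensorboard_callback, ...])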

Please refer to Saving Artifacts to Object Storage Buckets.

Aggregating metrics:

In a distributed setup, metrics (loss, accuracy, etc.) need to be aggregated across all the workers. For TensorFlow, Horovod provides the MetricAverageCallback callback, which should be added to the model training step, as shown in the sketch below. For PyTorch, refer to this PyTorch example.
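A minimal sketch for TensorFlow Keras; the model and dataset definitions are omitted, and only the callback wiring is shown.

import horovod.tensorflow.keras as hvd

hvd.init()

callbacks = [
    # Broadcast the initial variable state from rank 0 to all other workers.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Average metrics (loss, accuracy, ...) across all workers at the end of each epoch.
    hvd.callbacks.MetricAverageCallback(),
]

# model.fit(train_dataset, epochs=..., callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)

For PyTorch, the same effect is typically achieved by averaging each metric tensor across workers with hvd.allreduce() before logging it.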

Using TensorBoard Logs:

TensorBoard can be set up on a local machine and pointed at the object storage bucket, which enables live monitoring of the TensorBoard logs.

OCIFS_IAM_TYPE=api_key tensorboard --logdir oci://<bucket_name>/path/to/logs

Note: The logs take a few minutes to initially appear on the TensorBoard dashboard.

Horovod Timelines:

Horovod also provides Timelines, which give a snapshot of the training activities. Timeline files can optionally be generated by setting the following environment variable in the workload YAML.

config:
    env:
      - name: ENABLE_TIMELINE  # Disabled by default (0).
        value: 1

Note: Generating Timelines adds overhead to the training execution time.