Monitoring Horovod training using TensorBoard is similar to how it is usually done for TensorFlow
or PyTorch workloads. Your training script generates the TensorBoard logs and saves them to
the directory referenced by the OCI__SYNC_DIR environment variable. With
SYNC_ARTIFACTS=1, these TensorBoard logs are
periodically synchronized with the configured object storage bucket.
Please refer to Saving Artifacts to Object Storage Buckets.
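For example, a training script can resolve its log directory from OCI__SYNC_DIR before creating a TensorBoard writer. This is a minimal stdlib sketch; the local fallback path is an assumption for runs outside a job:

```python
import os

# OCI__SYNC_DIR is set by the job runtime; the fallback path below is an
# assumption for local runs outside a job.
sync_dir = os.environ.get("OCI__SYNC_DIR", "/tmp/oci_sync_dir")
log_dir = os.path.join(sync_dir, "tensorboard")
os.makedirs(log_dir, exist_ok=True)

# Point your framework's TensorBoard writer at log_dir, e.g.:
#   tf.keras.callbacks.TensorBoard(log_dir=log_dir)          # TensorFlow
#   torch.utils.tensorboard.SummaryWriter(log_dir=log_dir)   # PyTorch
```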
In a distributed setup, metrics (loss, accuracy, etc.) need to be aggregated from all the workers. Horovod provides the MetricAverageCallback callback (for TensorFlow), which should be added to the model training step. For PyTorch, refer to this PyTorch example.
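Conceptually, metric averaging is an allreduce that replaces each worker's metric with the mean across all workers at the end of an epoch. The sketch below simulates that aggregation without Horovod (the worker values are hypothetical); with Horovod and Keras you would instead add hvd.callbacks.MetricAverageCallback() to the model's callback list:

```python
def average_metrics(per_worker_metrics):
    """Average each metric across workers, mimicking the allreduce-mean
    that Horovod's MetricAverageCallback performs at the end of an epoch."""
    n = len(per_worker_metrics)
    return {
        key: sum(m[key] for m in per_worker_metrics) / n
        for key in per_worker_metrics[0]
    }

# Hypothetical epoch-end metrics reported by two workers:
workers = [
    {"loss": 0.42, "accuracy": 0.88},
    {"loss": 0.38, "accuracy": 0.90},
]
averaged = average_metrics(workers)
```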
Using TensorBoard Logs:
TensorBoard can be set up on a local machine and pointed to object storage. This enables live monitoring of the TensorBoard logs as they are synchronized.
OCIFS_IAM_TYPE=api_key tensorboard --logdir oci://<bucket_name>/path/to/logs
Note: The logs take some initial time (a few minutes) to appear on the TensorBoard dashboard.
Horovod also provides Timelines, which provide a snapshot of the training activities. Timeline files can be optionally generated with the following environment variable (part of the workload YAML).
config:
  env:
    - name: ENABLE_TIMELINE # Disabled by default (0).
      value: 1
Note: Generating Timelines increases training execution time.