ads.opctl.distributed.common package¶
Submodules¶
ads.opctl.distributed.common.abstract_cluster_provider module¶
- class ads.opctl.distributed.common.abstract_cluster_provider.ClusterProvider(mode, ephemeral=True, life_span='0h', work_dir='')[source]¶
Bases: object
Provides the contract for implementing a framework-specific cluster life cycle manager.
The main node is identified by the environment variable MAIN. The worker node is identified by the environment variable WORKER.
The worker and main nodes coordinate cluster configuration through the directory provided via WORK_DIR. The main node creates config.json in WORK_DIR. The worker nodes poll for config.json and export the configuration as environment variables.
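The coordination described above can be sketched with the standard library alone. This is a minimal illustration, not the ADS implementation: it assumes a local directory stands in for WORK_DIR, and the function names (write_config, wait_for_config, export_to_env) and the configuration keys are hypothetical.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def write_config(work_dir: str, config: dict) -> Path:
    """Main-node side: write config.json into the shared work directory."""
    path = Path(work_dir) / "config.json"
    # Write to a temporary file first, then rename, so that a polling
    # worker never reads a half-written file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(config))
    os.replace(tmp, path)
    return path


def wait_for_config(work_dir: str, timeout: float = 30.0, interval: float = 0.5) -> dict:
    """Worker side: poll until config.json appears, then return its contents."""
    path = Path(work_dir) / "config.json"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if path.exists():
            return json.loads(path.read_text())
        time.sleep(interval)
    raise TimeoutError(f"{path} did not appear within {timeout} seconds")


def export_to_env(config: dict) -> None:
    """Export every key/value pair as an environment variable."""
    for key, value in config.items():
        # Environment variables are always strings.
        os.environ[str(key)] = str(value)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        # Hypothetical keys; the real configuration keys are framework specific.
        write_config(work_dir, {"SKETCH_MAIN_IP": "10.0.0.5", "SKETCH_MAIN_PORT": 12345})
        export_to_env(wait_for_config(work_dir, timeout=5))
        print(os.environ["SKETCH_MAIN_IP"], os.environ["SKETCH_MAIN_PORT"])
```

In a real cluster the main and worker nodes run as separate processes, and WORK_DIR would typically point at shared storage rather than a temporary local directory.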
- DEFAULT_CODE_DIR = '/code'¶
- SYNC_SCRIPT_PATH = '/etc/datascience/sync.sh'¶
- basic_configuration() dict [source]¶
Prepares the basic set of configuration, which is framework independent. This configuration is later decorated by the configuration method of framework-specific implementations.
- configuration(conf={}) dict [source]¶
Provides the configuration information of the cluster.
- conf:
Contains generic information about the cluster, generated using basic_configuration, e.g. the IP address of the main process.
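The relationship between basic_configuration and configuration can be illustrated with stand-in classes. This is a sketch only: SketchProvider and SketchSchedulerProvider are hypothetical and do not inherit from the real ClusterProvider, the keys are invented, and the way ADS itself chains the two calls is an assumption.

```python
class SketchProvider:
    """Stand-in for a cluster provider; not the real ClusterProvider."""

    def __init__(self, main_ip: str, port: int):
        self.main_ip = main_ip
        self.port = port

    def basic_configuration(self) -> dict:
        # Framework-independent values (hypothetical keys).
        return {"MAIN_IP": self.main_ip, "MAIN_PORT": self.port}

    def configuration(self, conf=None) -> dict:
        # Default behavior: return a copy with no framework-specific additions.
        # Using None instead of {} avoids sharing one mutable default dict.
        return dict(conf or {})


class SketchSchedulerProvider(SketchProvider):
    """A framework-specific implementation that decorates the basic config."""

    def configuration(self, conf=None) -> dict:
        conf = dict(conf or {})
        # Add a framework-specific entry derived from the generic ones.
        conf["SCHEDULER_ADDRESS"] = f"{conf['MAIN_IP']}:{conf['MAIN_PORT']}"
        return conf


provider = SketchSchedulerProvider("10.0.0.5", 8786)
config = provider.configuration(provider.basic_configuration())
print(config)
```

The subclass copies the incoming dictionary before adding to it, so the generic configuration passed in is left unchanged.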
- execution_failed()[source]¶
Invoked when code submitted to an ephemeral cluster fails. Calling this method sets the cluster to a tearable state.
- export_config_files()[source]¶
By default, exports only the configuration generated by the main node. This behavior can be overridden.
- export_configuration(files)[source]¶
Reads the configuration from the files in the files array and exports it as environment variables.
- fetch_all_worker_info()[source]¶
Fetches all the worker configs. In some clusters, the main node requires information about all worker IPs a priori. This method may be useful in such situations.
- classmethod find_self_ip(authinfo)[source]¶
Identifies the IP address by finding which of the host IPs falls within the CIDR block of the subnet associated with the JOB_OCID.
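The matching step can be sketched with Python's ipaddress module. This is an illustration under stated assumptions, not the library's code: the real method takes authinfo and looks up the subnet from the JOB_OCID, whereas the hypothetical helper find_ip_in_subnet below receives the host IPs and the CIDR block directly.

```python
import ipaddress
from typing import Iterable, Optional


def find_ip_in_subnet(host_ips: Iterable[str], cidr_block: str) -> Optional[str]:
    """Return the first host IP that lies inside cidr_block, or None."""
    # strict=False tolerates a CIDR string whose host bits are set.
    network = ipaddress.ip_network(cidr_block, strict=False)
    for ip in host_ips:
        # Membership is False when the address and network versions differ.
        if ipaddress.ip_address(ip) in network:
            return ip
    return None


# Hypothetical addresses: loopback, a subnet address, and a bridge address.
candidates = ["127.0.0.1", "10.0.1.17", "172.17.0.1"]
print(find_ip_in_subnet(candidates, "10.0.1.0/24"))
```

A host usually has several addresses (loopback, container bridges, the subnet interface), which is why the address is selected by subnet membership rather than by taking the first one found.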
- setup_configuration(config: dict | None = None)[source]¶
Writes the configuration information to the location provided by work_dir.
- config:
Dictionary containing the configuration information that needs to be shared with the workers. If config is None, this method calls self.configuration and saves the resulting configuration.
- work_dir:
Can be any valid URI supported by fsspec.
- start_ps()[source]¶
Implement this to start the ps process, e.g. tf-parameter-server for TensorFlow.
- property stop_filesystem¶
ads.opctl.distributed.common.abstract_framework_spec_builder module¶
- class ads.opctl.distributed.common.abstract_framework_spec_builder.AbstractFrameworkSpecBuilder(config)[source]¶
Bases: object
Provides the contract for implementing a framework-specific cluster spec builder.
In the example of jobs, this class handles adding framework-specific environment variables to the job definition.
NOTE: This class is not invoked while the cluster is running. It is invoked only after a call to ads opctl.