ads.opctl.distributed.common package#
Submodules#
ads.opctl.distributed.common.abstract_cluster_provider module#
- class ads.opctl.distributed.common.abstract_cluster_provider.ClusterProvider(mode, ephemeral=True, life_span='0h', work_dir='')[source]#
Bases: object
Provides the contract for implementing a framework-specific cluster life cycle manager.
The main node is identified by the environment variable MAIN. The worker node is identified by the environment variable WORKER.
The worker and main nodes coordinate cluster configuration using the directory provided via WORK_DIR. The main node creates config.json in WORK_DIR. The worker nodes poll config.json and export the configuration as environment variables.
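The coordination described above can be pictured with a minimal, standard-library sketch. It is not the ADS implementation: the helper names `write_config` and `poll_config`, the `MAIN_IP` key, and the use of a local temporary directory in place of WORK_DIR are all assumptions made for illustration.

```python
import json
import os
import tempfile
import time


def write_config(work_dir, config):
    """Main-node side (hypothetical helper): write config.json into the shared directory."""
    path = os.path.join(work_dir, "config.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(config, f)
    # Rename into place so a polling worker never reads a half-written file.
    os.replace(tmp_path, path)
    return path


def poll_config(work_dir, timeout=5.0, interval=0.1):
    """Worker side (hypothetical helper): wait for config.json, then return its contents."""
    path = os.path.join(work_dir, "config.json")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        time.sleep(interval)
    raise TimeoutError(f"{path} did not appear within {timeout} seconds")


# A local temporary directory stands in for WORK_DIR in this sketch.
work_dir = tempfile.mkdtemp()
write_config(work_dir, {"MAIN_IP": "10.0.0.5"})  # illustrative key and value
worker_view = poll_config(work_dir)
```

Writing to a temporary name and renaming is one way to keep pollers from seeing partial content; whether the real provider does this is not stated in this page.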
- DEFAULT_CODE_DIR = '/code'#
- SYNC_SCRIPT_PATH = '/etc/datascience/sync.sh'#
- basic_configuration() → dict [source]#
Prepares a basic set of configuration that is framework independent. This configuration is decorated later by the configuration method implemented by framework-specific implementations.
- configuration(conf={}) dict [source]#
Provides the configuration information of the cluster.
- conf:
Contains generic information about the cluster, generated using basic_configuration, e.g. the IP address of the main process.
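The relationship between basic_configuration and configuration might look like the sketch below. `SketchProvider` is a hypothetical stand-in, not a subclass of the ADS ClusterProvider, and the keys `MAIN_IP` and `SCHEDULER_PORT` are invented for illustration.

```python
class SketchProvider:
    """Hypothetical stand-in that mimics the two-step configuration pattern."""

    def basic_configuration(self):
        # Framework-independent values (the key name is illustrative only).
        return {"MAIN_IP": "10.0.0.5"}

    def configuration(self, conf=None):
        # Copy the generic values, then add framework-specific ones.
        config = dict(conf or {})
        config["SCHEDULER_PORT"] = "8786"  # illustrative framework-specific entry
        return config


provider = SketchProvider()
full_config = provider.configuration(provider.basic_configuration())
```

The sketch defaults `conf` to None and copies the input, which avoids sharing one mutable default dictionary across calls.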
- execution_failed()[source]#
Invoked when code submitted to an ephemeral cluster fails. Calling this method marks the cluster as tearable (ready to be torn down).
- export_config_files()[source]#
By default only exports configuration generated by main. This behavior can be overridden.
- export_configuration(files)[source]#
Reads the configuration in the files array and exports it to environment variables.
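As a rough illustration of exporting file-based configuration to the environment, the sketch below assumes each file is a flat JSON object. The helper name `export_json_files` and the `SKETCH_*` keys are hypothetical, and the actual file format handled by export_configuration is not specified on this page.

```python
import json
import os
import tempfile


def export_json_files(files):
    """Hypothetical helper: export top-level keys of each JSON file as environment variables."""
    exported = {}
    for path in files:
        with open(path) as f:
            data = json.load(f)
        for key, value in data.items():
            # Environment variable values must be strings.
            os.environ[key] = str(value)
            exported[key] = str(value)
    return exported


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w") as f:
        json.dump({"SKETCH_MAIN_IP": "10.0.0.5", "SKETCH_PORT": 8786}, f)
    exported = export_json_files([config_path])
```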
- fetch_all_worker_info()[source]#
Fetches all the worker configs. In some clusters the main node requires the IP addresses of all workers a priori. This method may be useful in such situations.
- classmethod find_self_ip(authinfo)[source]#
Identifies the node's IP address by finding which of the host IPs falls within the CIDR block of the subnet associated with the JOB_OCID.
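The CIDR matching step can be sketched with Python's standard `ipaddress` module. The helper `pick_ip_in_subnet` is hypothetical; it takes the host IPs and the subnet CIDR as plain arguments, whereas the real method receives authinfo and is described as resolving the subnet from the JOB_OCID, which this sketch does not attempt.

```python
import ipaddress


def pick_ip_in_subnet(host_ips, subnet_cidr):
    """Hypothetical helper: return the first host IP inside the subnet, or None."""
    network = ipaddress.ip_network(subnet_cidr)
    for ip in host_ips:
        if ipaddress.ip_address(ip) in network:
            return ip
    return None


# Illustrative addresses: loopback, a container bridge address, and a subnet address.
chosen = pick_ip_in_subnet(["127.0.0.1", "172.17.0.2", "10.0.1.23"], "10.0.1.0/24")
```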
- setup_configuration(config: dict | None = None)[source]#
Writes the configuration information into the location provided by work_dir.
- config:
Dictionary containing configuration information that needs to be shared with the workers. If config is None, this method calls self.configuration and saves the resulting configuration.
- work_dir:
Can be any valid URI supported by fsspec.
- start_ps()[source]#
Implement this to start the ps process, e.g. tf-parameter-server for TensorFlow.
- property stop_filesystem#
ads.opctl.distributed.common.abstract_framework_spec_builder module#
- class ads.opctl.distributed.common.abstract_framework_spec_builder.AbstractFrameworkSpecBuilder(config)[source]#
Bases: object
Provides the contract for implementing a framework-specific cluster spec builder.
For jobs, for example, this class adds framework-specific environment variables to the job definition.
NOTE: This class is not invoked while the cluster is running. It is invoked only after a call to ads opctl.