ads.data_labeling.reader package
Submodules
ads.data_labeling.reader.dataset_reader module
The module containing classes to read labeled datasets. It supports reading labeled datasets from exports or from the cloud.
Classes
- LabeledDatasetReader
The LabeledDatasetReader class to read a labeled dataset.
- ExportReader
The ExportReader class to read a labeled dataset from the export.
- DLSDatasetReader
The DLSDatasetReader class to read a labeled dataset from the cloud.
Examples
>>> from ads.common import auth as authutil
>>> from ads.data_labeling import LabeledDatasetReader
>>> ds_reader = LabeledDatasetReader.from_export(
... path="oci://bucket_name@namespace/dataset_metadata.jsonl",
... auth=authutil.api_keys(),
... materialize=True
... )
>>> ds_reader.info()
------------------------------------------------------------------------
annotation_type SINGLE_LABEL
compartment_id TEST_COMPARTMENT
dataset_id TEST_DATASET
dataset_name test_dataset_name
dataset_type TEXT
labels ['yes', 'no']
records_path path/to/records
source_path path/to/dataset
>>> ds_reader.read()
Path Content Annotations
-----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
2 path/to/the/content/file3 file content no
>>> next(ds_reader.read(iterator=True))
("path/to/the/content/file1", "file content", "yes")
>>> next(ds_reader.read(iterator=True, chunksize=2))
[("path/to/the/content/file1", "file content", "yes"),
("path/to/the/content/file2", "file content", "no")]
>>> next(ds_reader.read(chunksize=2))
Path Content Annotations
----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
>>> ds_reader = LabeledDatasetReader.from_DLS(
... dataset_id="dataset_OCID",
... compartment_id="compartment_OCID",
... auth=authutil.api_keys(),
... materialize=True
... )
- class ads.data_labeling.reader.dataset_reader.DLSDatasetReader(dataset_id: str, compartment_id: str, auth: Dict, encoding='utf-8', materialize: bool = False, include_unlabeled: bool = False)
Bases:
Reader
The DLSDatasetReader class to read labeled dataset from the cloud.
Initializes the DLS dataset reader instance.
- Parameters:
dataset_id (str) – The dataset OCID.
compartment_id (str) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files. The encoding is used to extract the metadata information of the labeled dataset and also to extract the content of the text dataset records.
materialize ((bool, optional). Defaults to False.) – Whether the content of dataset files should be loaded/materialized or not. By default the content will not be materialized.
include_unlabeled ((bool, optional). Defaults to False.) – Whether to load the unlabeled records or not.
- Raises:
ValueError – When dataset_id is empty or not a string.
TypeError – When dataset_id is not a string.
- info() -> Metadata
Gets the labeled dataset metadata.
- Returns:
The labeled dataset metadata.
- Return type:
Metadata
- read(format: str | None = None) -> Generator[Tuple, Any, Any]
Reads the labeled dataset records.
- Parameters:
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
- Returns:
The labeled dataset records.
- Return type:
Generator[Tuple, Any, Any]
- class ads.data_labeling.reader.dataset_reader.ExportReader(path: str, auth: Dict | None = None, encoding='utf-8', materialize: bool = False, include_unlabeled: bool = False)
Bases:
Reader
The ExportReader class to read labeled dataset from the export.
Initializes the labeled dataset export reader instance.
- Parameters:
path (str) – The metadata file path, can be either local or object storage path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files. The encoding is used to extract the metadata information of the labeled dataset and also to extract the content of the text dataset records.
materialize ((bool, optional). Defaults to False.) – Whether the content of dataset files should be loaded/materialized or not. By default the content will not be materialized.
include_unlabeled ((bool, optional). Defaults to False.) – Whether to load the unlabeled records or not.
- Raises:
ValueError – When path is empty or not a string.
TypeError – When path is not a string.
- info() -> Metadata
Gets the labeled dataset metadata.
- Returns:
The labeled dataset metadata.
- Return type:
Metadata
- read(format: str | None = None) -> Generator[Tuple, Any, Any]
Reads the labeled dataset records.
- Parameters:
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
- Returns:
The labeled dataset records.
- Return type:
Generator[Tuple, Any, Any]
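As an illustration of how the default and “spacy” annotation formats can differ for Entity Extraction data, here is a minimal sketch. The `EntitySpan` dataclass, its `label`, `offset`, and `length` fields, and the `to_spacy_tuples` helper are hypothetical stand-ins, not the library's NERItem definition. The (start, end, label) layout follows the common spaCy entity-offset convention and is an assumption about what the List[Tuple] output contains.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EntitySpan:
    """Hypothetical stand-in for an entity annotation (not the ADS NERItem)."""

    label: str
    offset: int  # index of the first character of the entity
    length: int  # number of characters in the entity


def to_spacy_tuples(spans: List[EntitySpan]) -> List[Tuple[int, int, str]]:
    """Convert spans to (start, end, label) tuples, with an exclusive end index."""
    return [(s.offset, s.offset + s.length, s.label) for s in spans]


text = "Oracle is based in Austin"
spans = [EntitySpan("ORG", 0, 6), EntitySpan("CITY", 19, 6)]
spacy_annotations = to_spacy_tuples(spans)
print(spacy_annotations)
```

Slicing the text with each (start, end) pair recovers the annotated entity, which is a quick way to sanity-check converted offsets.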
- class ads.data_labeling.reader.dataset_reader.LabeledDatasetReader(reader: Reader)
Bases:
object
The labeled dataset reader class.
- read(self, iterator: bool = False) -> Generator[Any, Any, Any] | pd.DataFrame
Reads the labeled dataset.
- from_export(cls, path: str, auth: Dict = None, encoding='utf-8', materialize: bool = False) -> 'LabeledDatasetReader'
Constructs a Labeled Dataset Reader instance.
Initializes the labeled dataset reader instance.
- Parameters:
reader (Reader) – The Reader instance which reads and extracts the labeled dataset.
- classmethod from_DLS(dataset_id: str, compartment_id: str | None = None, auth: dict | None = None, encoding: str = 'utf-8', materialize: bool = False, include_unlabeled: bool = False) -> LabeledDatasetReader
Constructs a Labeled Dataset Reader instance.
- Parameters:
dataset_id (str) – The dataset OCID.
compartment_id (str. Defaults to the compartment_id from the env variable.) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
materialize ((bool, optional). Defaults to False.) – Whether to load the content of the dataset files. When False, the file path to the content is returned instead. By default the content is not loaded.
- Returns:
The LabeledDatasetReader instance.
- Return type:
LabeledDatasetReader
- classmethod from_export(path: str, auth: dict | None = None, encoding: str = 'utf-8', materialize: bool = False, include_unlabeled: bool = False) -> LabeledDatasetReader
Constructs a Labeled Dataset Reader instance.
- Parameters:
path (str) – The metadata file path, can be either local or object storage path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
materialize ((bool, optional). Defaults to False.) – Whether to load the content of the dataset files. When False, the file path to the content is returned instead. By default the content is not loaded.
- Returns:
The LabeledDatasetReader instance.
- Return type:
LabeledDatasetReader
- info() -> Serializable
Gets the labeled dataset metadata.
- Returns:
The labeled dataset metadata.
- Return type:
Serializable
- read(iterator: bool = False, format: str | None = None, chunksize: int | None = None) -> Generator[Any, Any, Any] | DataFrame
Reads the labeled dataset records.
- Parameters:
iterator ((bool, optional). Defaults to False.) – True if the result should be represented as a Generator. False if the result should be represented as a Pandas DataFrame.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” or “yolo”.
chunksize ((int, optional). Defaults to None.) – The number of records that should be read in one iteration. The result will be returned in a generator format.
- Returns:
pd.DataFrame if iterator and chunksize are not specified. Generator[pd.DataFrame] if iterator is False and chunksize is specified. Generator[List[Tuple[str, str, Any]]] if iterator is True and chunksize is specified. Generator[Tuple[str, str, Any]] if iterator is True and chunksize is not specified.
- Return type:
Union[Generator[Tuple[str, str, Any], Any, Any], Generator[List[Tuple[str, str, Any]], Any, Any], Generator[pd.DataFrame, Any, Any], pd.DataFrame]
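The two generator return shapes (single records versus chunks of records) can be sketched with plain Python. This is an illustrative stand-in under stated assumptions, not the library's implementation: `read_records` and the in-memory `RECORDS` list are hypothetical, and the DataFrame cases are omitted so the sketch needs no third-party packages.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple, Union

Record = Tuple[str, str, str]  # (path, content, annotation)

# Hypothetical in-memory records mirroring the example output above.
RECORDS: List[Record] = [
    ("path/to/the/content/file1", "file content", "yes"),
    ("path/to/the/content/file2", "file content", "no"),
    ("path/to/the/content/file3", "file content", "no"),
]


def read_records(
    records: Iterable[Record], chunksize: Optional[int] = None
) -> Iterator[Union[Record, List[Record]]]:
    """Yield single records, or lists of up to `chunksize` records."""
    if chunksize is None:
        yield from records
        return
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    it = iter(records)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


first_record = next(read_records(RECORDS))
chunks = list(read_records(RECORDS, chunksize=2))
print(first_record)
print(chunks[0])
```

With three records and a chunk size of 2, the final chunk holds the single remaining record, so callers should not assume every chunk is full.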
ads.data_labeling.reader.dls_record_reader module
- class ads.data_labeling.reader.dls_record_reader.DLSRecordReader(dataset_id: str, compartment_id: str, auth: dict | None = None)
Bases:
Reader
DLS Record Reader Class that reads records from the cloud into ADS format.
Initiates a DLSRecordReader instance.
- Parameters:
dataset_id (str) – The dataset OCID.
compartment_id (str) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- class ads.data_labeling.reader.dls_record_reader.OCIRecordSummary(record: RecordSummary | None = None, annotation: List[AnnotationSummary] | None = None)
Bases:
object
The class that represents a labeled record in ADS format.
- record
OCI RecordSummary.
- Type:
RecordSummary
- annotations
List of OCI AnnotationSummary.
- Type:
List[AnnotationSummary]
- record: RecordSummary = None
ads.data_labeling.reader.export_record_reader module
- class ads.data_labeling.reader.export_record_reader.ExportRecordReader(path: str, auth: Dict | None = None, encoding='utf-8', includes_metadata: bool = False)
Bases:
JsonlReader
The ExportRecordReader class to read labeled dataset records from the export.
Initiates an ExportRecordReader instance.
- Parameters:
path (str) – object storage path or local path for a file.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding of files. Only used for “TEXT” dataset.
includes_metadata ((bool, optional). Defaults to False.) – Determines whether the export file includes metadata or not.
Examples
>>> from ads.data_labeling.reader.export_record_reader import ExportRecordReader
>>> path = "your/path/to/jsonl/file.jsonl"
>>> from ads.common import auth as authutil
>>> reader = ExportRecordReader(path=path, auth=authutil.api_keys(), encoding="utf-8")
>>> next(reader.read())
ads.data_labeling.reader.jsonl_reader module
- class ads.data_labeling.reader.jsonl_reader.JsonlReader(path: str, auth: Dict | None = None, encoding='utf-8')
Bases:
Reader
JsonlReader class which reads the file.
Initiates a JsonlReader object.
- Parameters:
path (str) – object storage path or local path for a file.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding of files. Only used for “TEXT” dataset.
Examples
>>> from ads.data_labeling.reader.jsonl_reader import JsonlReader
>>> path = "your/path/to/jsonl/file.jsonl"
>>> from ads.common import auth as authutil
>>> reader = JsonlReader(path=path, auth=authutil.api_keys(), encoding="utf-8")
>>> next(reader.read())
- read(skip: int | None = None) -> Generator[Dict, Any, Any]
Reads and yields the content of the file.
- Parameters:
skip ((int, optional). Defaults to None.) – The number of records that should be skipped.
- Returns:
The content of the file.
- Return type:
Generator[Dict, Any, Any]
- Raises:
ValueError – If skip is specified and is not a positive integer.
FileNotFoundError – When the file is not found.
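The behavior described for read (lazily yielding one dictionary per JSON line, with an optional number of leading records skipped) can be sketched for a local file using only the standard library. This is a hedged illustration, not the library's code: `read_jsonl` is a hypothetical helper, it handles local paths only (no object storage and no auth), and it counts skipped records by line.

```python
import json
import os
import tempfile
from typing import Any, Dict, Generator, Optional


def read_jsonl(
    path: str, skip: Optional[int] = None, encoding: str = "utf-8"
) -> Generator[Dict[str, Any], None, None]:
    """Yield one dict per non-empty line, skipping the first `skip` lines."""
    if skip is not None and (not isinstance(skip, int) or skip <= 0):
        raise ValueError("skip must be a positive integer")
    # open() raises FileNotFoundError if the path does not exist.
    with open(path, encoding=encoding) as f:
        for line_number, line in enumerate(f):
            if skip and line_number < skip:
                continue
            if line.strip():
                yield json.loads(line)


# Write a small sample file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    all_rows = list(read_jsonl(sample_path))
    rows_after_skip = list(read_jsonl(sample_path, skip=1))

print(all_rows)
print(rows_after_skip)
```

Because the helper is a generator, the validation and file errors in this sketch surface on the first call to next() rather than when the generator is created.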
ads.data_labeling.reader.metadata_reader module
- class ads.data_labeling.reader.metadata_reader.DLSMetadataReader(dataset_id: str, compartment_id: str, auth: dict)
Bases:
Reader
DLSMetadataReader class which reads the metadata jsonl file from the cloud.
Initializes the DLS metadata reader instance.
- Parameters:
dataset_id (str) – The dataset OCID.
compartment_id (str) – The compartment OCID of the dataset.
auth (dict) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Raises:
ValueError – When dataset_id is empty or not a string.
TypeError – When dataset_id is not a string.
- read() -> Metadata
Reads the content from the metadata file.
- Returns:
The metadata of the labeled dataset.
- Return type:
Metadata
- Raises:
DatasetNotFoundError – If the dataset is not found.
ReadDatasetError – If any error occurred while attempting to read the dataset.
- exception ads.data_labeling.reader.metadata_reader.DatasetNotFoundError(id: str)
Bases:
Exception
- exception ads.data_labeling.reader.metadata_reader.EmptyMetadata
Bases:
Exception
Empty Metadata.
- class ads.data_labeling.reader.metadata_reader.ExportMetadataReader(path: str, auth: Dict | None = None, encoding='utf-8')
Bases:
JsonlReader
ExportMetadataReader class which reads the metadata jsonl file from local/object storage path.
Initiates a JsonlReader object.
- Parameters:
path (str) – object storage path or local path for a file.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding of files. Only used for “TEXT” dataset.
Examples
>>> from ads.data_labeling.reader.jsonl_reader import JsonlReader
>>> path = "your/path/to/jsonl/file.jsonl"
>>> from ads.common import auth as authutil
>>> reader = JsonlReader(path=path, auth=authutil.api_keys(), encoding="utf-8")
>>> next(reader.read())
- class ads.data_labeling.reader.metadata_reader.MetadataReader(reader: Reader)
Bases:
object
MetadataReader class which reads and extracts the labeled dataset metadata.
Examples
>>> from ads.data_labeling import MetadataReader
>>> import oci
>>> import os
>>> from ads.common import auth as authutil
>>> reader = MetadataReader.from_export_file("metadata_export_file_path",
... auth=authutil.api_keys())
>>> reader.read()
Initiate a MetadataReader instance.
- Parameters:
reader (Reader) – Reader instance which reads and extracts the labeled dataset metadata.
- classmethod from_DLS(dataset_id: str, compartment_id: str | None = None, auth: dict | None = None) -> MetadataReader
Constructs a MetadataReader instance.
- Parameters:
dataset_id (str) – The dataset OCID.
compartment_id ((str, optional). Default None) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Returns:
The MetadataReader instance whose reader is a DLSMetadataReader instance.
- Return type:
MetadataReader
- classmethod from_export_file(path: str, auth: Dict | None = None) -> MetadataReader
Constructs a MetadataReader instance.
- Parameters:
path (str) – metadata file path, can be either local or object storage path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Returns:
The MetadataReader instance whose reader is an ExportMetadataReader instance.
- Return type:
MetadataReader
ads.data_labeling.reader.record_reader module
- class ads.data_labeling.reader.record_reader.RecordReader(reader: Reader, parser: Parser, loader: Loader | None = None, include_unlabeled: bool = False, encoding: str = 'utf-8', materialize: bool = False)
Bases:
object
The RecordReader class consists of a parser, a reader, and a loader. The reader reads the content from the record file. The parser parses the label for each record. The loader loads the content of the file path in that record.
Examples
>>> import os
>>> import oci
>>> from ads.data_labeling import RecordReader
>>> from ads.common import auth as authutil
>>> file_path = "/path/to/your_record.jsonl"
>>> dataset_type = "IMAGE"
>>> annotation_type = "BOUNDING_BOX"
>>> record_reader = RecordReader.from_export_file(file_path, dataset_type, annotation_type, "image_file_path", authutil.api_keys())
>>> next(record_reader.read())
Initiates a RecordReader instance.
- Parameters:
reader (Reader) – Reader instance to read content from the record file.
parser (Parser) – Parser instance to parse the labels from record file.
loader (Loader. Defaults to None.) – Loader instance to load the content from the file path in the record.
materialize (bool, optional. Defaults to False.) – Whether to materialize the content using loader.
include_unlabeled ((bool, optional). Default to False.) – Whether to load the unlabeled records or not.
encoding (str, optional) – Encoding for text files. Used only to extract the content of the text dataset contents.
- Raises:
ValueError – If the record reader or the record parser is not specified, or if the loader is not specified when materialize is True.
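The division of labor between the reader, parser, and loader can be sketched as a small composition. Everything here is hypothetical and only illustrates the pattern: `SketchRecordReader`, the callables passed to it, and the `{"path": ..., "label": ...}` record layout are assumptions, not the library's classes or the export schema. Returning None as the content when materialize is False is also a choice made for this sketch.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple

Record = Dict[str, Any]  # hypothetical layout: {"path": str, "label": str or None}


class SketchRecordReader:
    """Illustrative composition of a reader, a parser, and an optional loader."""

    def __init__(
        self,
        reader: Optional[Callable[[], Iterable[Record]]],
        parser: Optional[Callable[[Record], Optional[str]]],
        loader: Optional[Callable[[str], str]] = None,
        include_unlabeled: bool = False,
        materialize: bool = False,
    ) -> None:
        if reader is None or parser is None:
            raise ValueError("Both the reader and the parser must be specified.")
        if materialize and loader is None:
            raise ValueError("A loader is required when materialize is True.")
        self.reader = reader
        self.parser = parser
        self.loader = loader
        self.include_unlabeled = include_unlabeled
        self.materialize = materialize

    def read(self) -> Iterator[Tuple[str, Optional[str], Optional[str]]]:
        """Yield (path, content, label) for each record."""
        for record in self.reader():
            label = self.parser(record)
            if label is None and not self.include_unlabeled:
                continue  # drop unlabeled records unless explicitly requested
            path = record["path"]
            content = self.loader(path) if self.materialize else None
            yield path, content, label


records = [
    {"path": "file1.txt", "label": "yes"},
    {"path": "file2.txt", "label": None},
]
fake_files = {"file1.txt": "first file content", "file2.txt": "second file content"}

sketch = SketchRecordReader(
    reader=lambda: records,
    parser=lambda record: record["label"],
    loader=fake_files.__getitem__,
    materialize=True,
)
labeled = list(sketch.read())
print(labeled)
```

Keeping the three roles as separate objects means the same parsing logic can be reused whether records come from an export file or from the cloud.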
- classmethod from_DLS(dataset_id: str, dataset_type: str, annotation_type: str, dataset_source_path: str, compartment_id: str | None = None, auth: Dict | None = None, include_unlabeled: bool = False, encoding: str = 'utf-8', materialize: bool = False, format: str | None = None, categories: List[str] | None = None) -> RecordReader
Constructs Record Reader instance.
- Parameters:
dataset_id (str) – The dataset OCID.
dataset_type (str) – Dataset type. Currently supports TEXT, IMAGE and DOCUMENT.
annotation_type (str) – Annotation Type. Currently TEXT supports SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION. IMAGE supports SINGLE_LABEL, MULTI_LABEL and BOUNDING_BOX. DOCUMENT supports SINGLE_LABEL and MULTI_LABEL.
dataset_source_path (str) – Dataset source path.
compartment_id ((str, optional). Defaults to None.) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
materialize ((bool, optional). Defaults to False.) – Whether to load the content of the dataset files. When False, the file path to the content is returned instead. By default the content is not loaded.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns:
The RecordReader instance.
- Return type:
RecordReader
- classmethod from_export_file(path: str, dataset_type: str, annotation_type: str, dataset_source_path: str, auth: Dict | None = None, include_unlabeled: bool = False, encoding: str = 'utf-8', materialize: bool = False, format: str | None = None, categories: List[str] | None = None, includes_metadata=False) -> RecordReader
Initiates a RecordReader instance.
- Parameters:
path (str) – Record file path.
dataset_type (str) – Dataset type. Currently supports TEXT, IMAGE and DOCUMENT.
annotation_type (str) – Annotation Type. Currently TEXT supports SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION. IMAGE supports SINGLE_LABEL, MULTI_LABEL and BOUNDING_BOX. DOCUMENT supports SINGLE_LABEL and MULTI_LABEL.
dataset_source_path (str) – Dataset source path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
include_unlabeled ((bool, optional). Default to False.) – Whether to load the unlabeled records or not.
encoding ((str, optional). Defaults to "utf-8".) – Encoding for text files. Used only to extract the content of the text dataset contents.
materialize ((bool, optional). Defaults to False.) – Whether to materialize the content by loader.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
includes_metadata ((bool, optional). Defaults to False.) – Determines whether the export file includes metadata or not.
- Returns:
A RecordReader instance.
- Return type:
RecordReader
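To illustrate why the order of categories matters for a “yolo”-style annotation, here is a minimal sketch. The `to_yolo_tuple` helper, the corner-coordinate input, and the (class index, x center, y center, width, height) layout are assumptions based on the widely used YOLO label convention. They are not a confirmed description of this library's BoundingBoxItem or of its exact output.

```python
from typing import List, Tuple


def to_yolo_tuple(
    label: str,
    top_left: Tuple[float, float],
    bottom_right: Tuple[float, float],
    categories: List[str],
) -> Tuple[int, float, float, float, float]:
    """Return (class_index, x_center, y_center, width, height).

    Coordinates are assumed to be normalized to the range [0, 1].
    """
    x_min, y_min = top_left
    x_max, y_max = bottom_right
    # The position of the label in `categories` becomes the class index.
    class_index = categories.index(label)
    width = x_max - x_min
    height = y_max - y_min
    return (class_index, x_min + width / 2, y_min + height / 2, width, height)


categories = ["cat", "dog", "horse"]
box = to_yolo_tuple("dog", (0.25, 0.25), (0.75, 0.5), categories)
print(box)
```

Reordering `categories` changes the class index assigned to each label, so the same list order should be used consistently for training and inference.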