ads.data_labeling package
Submodules
ads.data_labeling.interface.loader module
ads.data_labeling.interface.parser module
ads.data_labeling.interface.reader module
ads.data_labeling.boundingbox module
- class ads.data_labeling.boundingbox.BoundingBoxItem(top_left: Tuple[float, float], bottom_left: Tuple[float, float], bottom_right: Tuple[float, float], top_right: Tuple[float, float], labels: List[str] = <factory>)
Bases:
object
BoundingBoxItem class representing bounding box label.
- labels
List of labels for this bounding box.
- Type
List[str]
- top_left
Top left corner of this bounding box.
- Type
Tuple[float, float]
- bottom_left
Bottom left corner of this bounding box.
- Type
Tuple[float, float]
- bottom_right
Bottom right corner of this bounding box.
- Type
Tuple[float, float]
- top_right
Top right corner of this bounding box.
- Type
Tuple[float, float]
Examples
>>> item = BoundingBoxItem(
...     labels=['cat', 'dog'],
...     bottom_left=(0.2, 0.4),
...     top_left=(0.2, 0.2),
...     top_right=(0.8, 0.2),
...     bottom_right=(0.8, 0.4))
>>> item.to_yolo(categories=['cat', 'dog', 'horse'])
- bottom_left: Tuple[float, float]
- bottom_right: Tuple[float, float]
- classmethod from_yolo(bbox: List[Tuple], categories: Optional[List[str]] = None) BoundingBoxItem
Converts the YOLO formatted annotations to BoundingBoxItem.
- Parameters
bbox (List[Tuple]) – The list of bounding box annotations in YOLO format. Example: [(0, 0.511560675, 0.50234826, 0.47013485, 0.57803468)]
categories (List[str]) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
The BoundingBoxItem.
- Return type
BoundingBoxItem
- Raises
TypeError – When categories list has a wrong format.
- labels: List[str]
- to_yolo(categories: List[str]) List[Tuple[int, float, float, float, float]]
Converts BoundingBoxItem to the YOLO format.
- Parameters
categories (List[str]) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
The list of YOLO formatted bounding boxes.
- Return type
List[Tuple[int, float, float, float, float]]
- Raises
ValueError – When the categories list is not provided, or when it does not match the labels.
TypeError – When categories list has a wrong format.
- top_left: Tuple[float, float]
- top_right: Tuple[float, float]
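The conversion described for to_yolo can be pictured with a small standalone sketch. It assumes the common YOLO convention of (class index, x center, y center, width, height) in normalized coordinates, which matches the shape of the from_yolo example above. The helper name corners_to_yolo and its error messages are hypothetical and are not part of ads.

```python
from typing import List, Tuple

YoloBox = Tuple[int, float, float, float, float]


def corners_to_yolo(
    top_left: Tuple[float, float],
    bottom_right: Tuple[float, float],
    labels: List[str],
    categories: List[str],
) -> List[YoloBox]:
    """Hypothetical helper: one YOLO tuple per label of an axis-aligned box."""
    if not categories:
        raise ValueError("The categories list must be provided.")
    if not all(isinstance(c, str) for c in categories):
        raise TypeError("The categories list must contain only strings.")
    missing = [label for label in labels if label not in categories]
    if missing:
        raise ValueError(f"Labels not found in categories: {missing}")

    (x_min, y_min), (x_max, y_max) = top_left, bottom_right
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    width = x_max - x_min
    height = y_max - y_min
    # The class index is the position of the label in the categories list.
    return [
        (categories.index(label), x_center, y_center, width, height)
        for label in labels
    ]


boxes = corners_to_yolo(
    top_left=(0.2, 0.2),
    bottom_right=(0.8, 0.4),
    labels=["cat", "dog"],
    categories=["cat", "dog", "horse"],
)
for box in boxes:
    print(box)
```

Because a box may carry several labels, the sketch emits one tuple per label. This is one plausible reading of why the documented return type is a list.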
- class ads.data_labeling.boundingbox.BoundingBoxItems(items: List[BoundingBoxItem] = <factory>)
Bases:
object
BoundingBoxItems class which consists of a list of BoundingBoxItem.
- items
List of BoundingBoxItem.
- Type
List[BoundingBoxItem]
Examples
>>> item = BoundingBoxItem(
...     labels=['cat', 'dog'],
...     bottom_left=(0.2, 0.4),
...     top_left=(0.2, 0.2),
...     top_right=(0.8, 0.2),
...     bottom_right=(0.8, 0.4))
>>> items = BoundingBoxItems(items=[item])
>>> items.to_yolo(categories=['cat', 'dog', 'horse'])
- items: List[BoundingBoxItem]
- to_yolo(categories: List[str]) List[Tuple[int, float, float, float, float]]
Converts BoundingBoxItems to the YOLO format.
- Parameters
categories (List[str]) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
The list of YOLO formatted bounding boxes.
- Return type
List[Tuple[int, float, float, float, float]]
- Raises
ValueError – When the categories list is not provided, or when it does not match the labels.
TypeError – When categories list has a wrong format.
ads.data_labeling.constants module
- class ads.data_labeling.constants.AnnotationType
Bases:
object
AnnotationType class which contains all the annotation types that data labeling service supports.
- BOUNDING_BOX = 'BOUNDING_BOX'
- ENTITY_EXTRACTION = 'ENTITY_EXTRACTION'
- MULTI_LABEL = 'MULTI_LABEL'
- SINGLE_LABEL = 'SINGLE_LABEL'
ads.data_labeling.data_labeling_service module
- class ads.data_labeling.data_labeling_service.DataLabeling(compartment_id: Optional[str] = None, dls_cp_client_auth: Optional[dict] = None, dls_dp_client_auth: Optional[dict] = None)
Bases:
OCIWorkRequestMixin
Class for the data labeling service. Integrates the data labeling service APIs.
Examples
>>> import ads
>>> import pandas as pd
>>> from ads.data_labeling.data_labeling_service import DataLabeling
>>> ads.set_auth("api_key")
>>> dls = DataLabeling()
>>> dls.list_dataset()
>>> metadata_path = dls.export(dataset_id="your dataset id",
...     path="oci://<bucket_name>@<namespace>/folder")
>>> df = pd.DataFrame.ads.read_labeled_data(metadata_path)
Initialize a DataLabeling class.
- Parameters
compartment_id (str, optional) – OCID of the compartment that contains the data labeling datasets.
dls_cp_client_auth (dict, optional) – Data Labeling control plane client auth. Default is None. The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
dls_dp_client_auth (dict, optional) – Data Labeling data plane client auth. Default is None. The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Returns
Nothing.
- Return type
None
- export(dataset_id: str, path: str, include_unlabeled=False) str
Exports the dataset identified by dataset_id, saves the metadata and records jsonl files to the object storage path provided by the user, and returns the path of the metadata jsonl file.
- Parameters
dataset_id (str) – The dataset id of which the snapshot will be generated.
path (str) – The object storage path to store the generated snapshot. “oci://<bucket_name>@<namespace>/prefix”
include_unlabeled (bool, Optional. Defaults to False.) – Whether to include unlabeled records or not.
- Returns
oci path of the metadata jsonl file.
- Return type
str
- list_dataset(**kwargs) DataFrame
List all the datasets created from the data labeling service under a given compartment.
- Parameters
kwargs (dict, optional) – Additional keyword arguments that are passed to the oci.data_labeling_service.DataLabelingManagementClient.list_datasets method.
- Returns
pandas dataframe which contains the dataset information.
- Return type
pandas.DataFrame
- Raises
Exception – If pagination.list_call_get_all_results() fails
ads.data_labeling.metadata module
- class ads.data_labeling.metadata.Metadata(source_path: str = '', records_path: str = '', labels: List[str] = <factory>, dataset_name: str = '', compartment_id: str = '', dataset_id: str = '', annotation_type: str = '', dataset_type: str = '')
Bases:
DataClassSerializable
The class representing the labeled dataset metadata.
- source_path
The location where the source data (image/text/document) is stored.
- Type
str
- records_path
The location where the records jsonl file is stored.
- Type
str
- labels
List of classes/labels for the dataset.
- Type
List
- dataset_name
Dataset display name on the Data Labeling Service console.
- Type
str
- compartment_id
Compartment id of the labeled dataset.
- Type
str
- dataset_id
Dataset id.
- Type
str
- annotation_type
Type of the labeling/annotation task. Currently supports SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION, BOUNDING_BOX.
- Type
str
- dataset_type
Type of the dataset. Currently supports TEXT, IMAGE, DOCUMENT.
- Type
str
- annotation_type: str = ''
- compartment_id: str = ''
- dataset_id: str = ''
- dataset_name: str = ''
- dataset_type: str = ''
- classmethod from_dls_dataset(dataset: Dataset) Metadata
Constructs a Metadata instance from an OCI DLS dataset.
- Parameters
dataset (OCIDLSDataset) – OCIDLSDataset object.
- Returns
The ads labeled dataset metadata instance.
- Return type
Metadata
- labels: List[str]
- records_path: str = ''
- source_path: str = ''
- to_dataframe() DataFrame
Converts the metadata to dataframe format.
- Returns
The metadata in Pandas dataframe format.
- Return type
pandas.DataFrame
- to_dict() Dict
Converts to dictionary representation.
- Returns
The metadata in dictionary type.
- Return type
Dict
ads.data_labeling.ner module
- class ads.data_labeling.ner.NERItem(label: str = '', offset: int = 0, length: int = 0)
Bases:
object
NERItem class which is a representation of a token span.
- label
Entity name.
- Type
str
- offset
The token span’s entity start index position in the text.
- Type
int
- length
Length of the token span.
- Type
int
- label: str = ''
- length: int = 0
- offset: int = 0
- to_spacy() tuple
Converts one NERItem to the spacy format.
- Returns
NERItem in the spacy format
- Return type
Tuple
- class ads.data_labeling.ner.NERItems(items: List[NERItem] = <factory>)
Bases:
object
NERItems class consists of a list of NERItem.
- to_spacy() List[tuple]
Converts NERItems to the spacy format.
- Returns
List of NERItems in the Spacy format.
- Return type
List[tuple]
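The relationship between offset, length, and a spaCy-style entity tuple can be sketched as follows. This assumes spaCy's character-offset convention of (start, end, label) with an exclusive end; the reference above does not spell out the tuple layout, so treat it as an assumption. SpanSketch and spans_to_spacy are hypothetical stand-ins, not the ads classes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanSketch:
    """Hypothetical stand-in for NERItem; not the ads class."""

    label: str = ""
    offset: int = 0
    length: int = 0

    def to_spacy(self) -> Tuple[int, int, str]:
        # Assumed layout: (start_char, end_char, label), end exclusive.
        return (self.offset, self.offset + self.length, self.label)


def spans_to_spacy(spans: List[SpanSketch]) -> List[Tuple[int, int, str]]:
    """Convert every span in the list, preserving order."""
    return [span.to_spacy() for span in spans]


text = "Alice moved to Paris"
spans = [SpanSketch("PERSON", 0, 5), SpanSketch("CITY", 15, 5)]
entities = spans_to_spacy(spans)
for start, end, label in entities:
    print(label, text[start:end])
```

Slicing the text with the resulting start and end values is a quick way to check that an offset and length pair actually covers the intended token span.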
- exception ads.data_labeling.ner.WrongEntityFormatLabelIsEmpty
Bases:
ValueError
- exception ads.data_labeling.ner.WrongEntityFormatLabelNotString
Bases:
ValueError
- exception ads.data_labeling.ner.WrongEntityFormatLengthIsNegative
Bases:
ValueError
- exception ads.data_labeling.ner.WrongEntityFormatLengthNotInteger
Bases:
ValueError
- exception ads.data_labeling.ner.WrongEntityFormatOffsetIsNegative
Bases:
ValueError
- exception ads.data_labeling.ner.WrongEntityFormatOffsetNotInteger
Bases:
ValueError
ads.data_labeling.record module
- class ads.data_labeling.record.Record(path: str = '', content: Optional[Any] = None, annotation: Optional[Union[Tuple, str, List[BoundingBoxItem], List[NERItem]]] = None)
Bases:
object
Class representing Record.
- path
File path.
- Type
str
- content
Content of the record.
- Type
Any
- annotation
Annotation/label of the record.
- Type
Union[Tuple, str, List[BoundingBoxItem], List[NERItem]]
- annotation: Union[Tuple, str, List[BoundingBoxItem], List[NERItem]] = None
- content: Any = None
- path: str = ''
- to_dict() Dict
Convert the Record instance to a dictionary.
- Returns
Dictionary representation of the Record instance.
- Return type
Dict
- to_tuple() Tuple[str, Any, Union[Tuple, str, List[BoundingBoxItem], List[NERItem]]]
Convert the Record instance to a tuple.
- Returns
Tuple representation of the Record instance.
- Return type
Tuple
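As a rough picture of the two conversions, the sketch below models a record as a dataclass with the three documented fields. RecordSketch is a hypothetical stand-in, and the dictionary keys are assumed to mirror the attribute names; the actual ads implementation may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class RecordSketch:
    """Hypothetical stand-in for Record; not the ads class."""

    path: str = ""
    content: Any = None
    annotation: Any = None

    def to_dict(self) -> Dict[str, Any]:
        # Assumed key names: the same as the attribute names.
        return {
            "path": self.path,
            "content": self.content,
            "annotation": self.annotation,
        }

    def to_tuple(self) -> Tuple[str, Any, Any]:
        # Field order follows the documented signature: path, content, annotation.
        return (self.path, self.content, self.annotation)


record = RecordSketch(path="path/to/file1", content="file content", annotation="yes")
print(record.to_dict())
print(record.to_tuple())
```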
ads.data_labeling.mixin.data_labeling module
- class ads.data_labeling.mixin.data_labeling.DataLabelingAccessMixin
Bases:
object
Mixin class for labeled text data.
- static read_labeled_data(path: Optional[str] = None, dataset_id: Optional[str] = None, compartment_id: Optional[str] = None, auth: Optional[Dict] = None, materialize: bool = False, encoding: str = 'utf-8', include_unlabeled: bool = False, format: Optional[str] = None, chunksize: Optional[int] = None)
Loads the dataset generated by data labeling service from either the export file or the Data Labeling Service.
- Parameters
path ((str, optional). Defaults to None) – The export file path, can be either local or object storage path.
dataset_id ((str, optional). Defaults to None) – The dataset OCID.
compartment_id (str. Defaults to the compartment_id from the env variable.) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
materialize ((bool, optional). Defaults to False) – Whether the content of the dataset file should be loaded or it should return the file path to the content. By default the content will not be loaded.
encoding ((str, optional). Defaults to 'utf-8') – Encoding of files. Only used for “TEXT” dataset.
include_unlabeled ((bool, optional). Default to False) – Whether to load the unlabeled records or not.
format ((str, optional). Defaults to None) –
Output format of annotations. Can be None, “spacy” for the Entity Extraction dataset type, or “yolo” for the Object Detection dataset type.
When None, it outputs List[NERItem] or List[BoundingBoxItem].
When “spacy”, it outputs List[Tuple].
When “yolo”, it outputs List[List[Tuple]].
chunksize ((int, optional). Defaults to None) – The amount of records that should be read in one iteration. The result will be returned in a generator format.
- Returns
pd.DataFrame if chunksize is not specified. Generator[pd.DataFrame] if chunksize is specified.
- Return type
Union[Generator[pd.DataFrame, Any, Any], pd.DataFrame]
Examples
>>> import pandas as pd
>>> import ads
>>> from ads.common import auth as authutil
>>> df = pd.DataFrame.ads.read_labeled_data(path="path_to_your_metadata.jsonl",
...     auth=authutil.api_keys(),
...     materialize=False)
                       Path       Content     Annotations
--------------------------------------------------------------------
0   path/to/the/content/file                       yes
1   path/to/the/content/file                       no
>>> df = pd.DataFrame.ads.read_labeled_data(dataset_id="your_dataset_ocid",
...     compartment_id="your_compartment_id",
...     auth=authutil.api_keys(),
...     materialize=False)
                       Path       Content     Annotations
--------------------------------------------------------------------
0   path/to/the/content/file                       yes
1   path/to/the/content/file                       no
- render_bounding_box(options: Optional[Dict] = None, content_column: str = 'Content', annotations_column: str = 'Annotations', categories: Optional[List[str]] = None, limit: int = 50, path: Optional[str] = None) None
Renders the bounding box dataset. Displays only the first 50 rows by default; use limit to change this.
- Parameters
options (dict) – The color options specified for rendering.
content_column (Optional[str]) – The column name with the content data.
annotations_column (Optional[str]) – The column name for the annotations list.
categories (Optional[List[str]]) – The list of object categories in proper order for model training. Only used when bounding box annotations are in YOLO format. Example: ['cat', 'dog', 'horse']
limit (Optional[int]. Defaults to 50) – The maximum number of records to display.
path (Optional[str]) – Path to save the image with annotations to local directory.
- Returns
Nothing
- Return type
None
Examples
>>> import pandas as pd
>>> import ads
>>> from ads.common import auth as authutil
>>> df = pd.DataFrame.ads.read_labeled_data(path="path_to_your_metadata.jsonl",
...     auth=authutil.api_keys(),
...     materialize=True)
>>> df.ads.render_bounding_box(content_column="Content", annotations_column="Annotations")
- render_ner(options: Dict = None, content_column: str = 'Content', annotations_column: str = 'Annotations', limit: int = 50) None
Renders the NER dataset. Displays only the first 50 rows by default; use limit to change this.
- Parameters
options (dict) – The color options specified for rendering.
content_column (Optional[str]) – The column name with the content data.
annotations_column (Optional[str]) – The column name for the annotations list.
limit (Optional[int]. Defaults to 50) – The maximum number of records to display.
- Returns
Nothing
- Return type
None
Examples
>>> import pandas as pd
>>> import ads
>>> from ads.common import auth as authutil
>>> df = pd.DataFrame.ads.read_labeled_data(path="path_to_your_metadata.jsonl",
...     auth=authutil.api_keys(),
...     materialize=True)
>>> df.ads.render_ner(content_column="Content", annotations_column="Annotations")
ads.data_labeling.parser.export_metadata_parser module
ads.data_labeling.parser.export_record_parser module
- class ads.data_labeling.parser.export_record_parser.BoundingBoxRecordParser(dataset_source_path: str, format: Optional[str] = None, categories: Optional[List[str]] = None)
Bases:
RecordParser
BoundingBoxRecordParser class which parses the label of BoundingBox label data.
Initiates a RecordParser instance.
- Parameters
dataset_source_path (str) – Dataset source path.
format ((str, optional). Defaults to None.) – Output format of annotations.
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
RecordParser instance.
- Return type
RecordParser
- class ads.data_labeling.parser.export_record_parser.EntityType
Bases:
object
Entity type class for supporting multiple types of entities.
- GENERIC = 'GENERIC'
- IMAGEOBJECTSELECTION = 'IMAGEOBJECTSELECTION'
- TEXTSELECTION = 'TEXTSELECTION'
- class ads.data_labeling.parser.export_record_parser.MultiLabelRecordParser(dataset_source_path: str, format: Optional[str] = None, categories: Optional[List[str]] = None)
Bases:
RecordParser
MultiLabelRecordParser class which parses the label of Multiple label data.
Initiates a RecordParser instance.
- Parameters
dataset_source_path (str) – Dataset source path.
format ((str, optional). Defaults to None.) – Output format of annotations.
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
RecordParser instance.
- Return type
RecordParser
- class ads.data_labeling.parser.export_record_parser.NERRecordParser(dataset_source_path: str, format: Optional[str] = None, categories: Optional[List[str]] = None)
Bases:
RecordParser
NERRecordParser class which parses the label of NER label data.
Initiates a RecordParser instance.
- Parameters
dataset_source_path (str) – Dataset source path.
format ((str, optional). Defaults to None.) – Output format of annotations.
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
RecordParser instance.
- Return type
RecordParser
- class ads.data_labeling.parser.export_record_parser.RecordParser(dataset_source_path: str, format: Optional[str] = None, categories: Optional[List[str]] = None)
Bases:
Parser
RecordParser class which parses the labels from the record.
Examples
>>> from ads.data_labeling.parser.export_record_parser import SingleLabelRecordParser
>>> from ads.data_labeling.parser.export_record_parser import MultiLabelRecordParser
>>> from ads.data_labeling.parser.export_record_parser import NERRecordParser
>>> from ads.data_labeling.parser.export_record_parser import BoundingBoxRecordParser
>>> import fsspec
>>> import json
>>> from ads.common import auth as authutil
>>> labels = []
>>> with fsspec.open("/path/to/records_file.jsonl", **authutil.api_keys()) as f:
...     for line in f:
...         bounding_box_labels = BoundingBoxRecordParser("source_data_path").parse(json.loads(line))
...         labels.append(bounding_box_labels)
Initiates a RecordParser instance.
- Parameters
dataset_source_path (str) – Dataset source path.
format ((str, optional). Defaults to None.) – Output format of annotations.
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
RecordParser instance.
- Return type
RecordParser
- parse(record: Dict) Record
Extracts the annotations from the record content. Constructs and returns a Record instance containing the file path and the labels.
- Parameters
record (Dict) – Content of the record from the record file.
- Returns
Record instance which contains the file path as well as the annotations.
- Return type
Record
- class ads.data_labeling.parser.export_record_parser.RecordParserFactory
Bases:
object
RecordParserFactory class which contains a list of registered parsers and allows registering new RecordParsers.
- Current parsers include:
SingleLabelRecordParser
MultiLabelRecordParser
NERRecordParser
BoundingBoxRecordParser
- static parser(annotation_type: str, dataset_source_path: str, format: Optional[str] = None, categories: Optional[List[str]] = None) RecordParser
Gets the parser based on the annotation_type.
- Parameters
annotation_type (str) – Annotation type which can be SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION and BOUNDING_BOX.
dataset_source_path (str) – Dataset source path.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
RecordParser corresponding to the annotation type.
- Return type
RecordParser
- Raises
ValueError – If annotation_type is not supported.
- classmethod register(annotation_type: str, parser) None
Registers a new parser.
- Parameters
annotation_type (str) – Annotation type which can be SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION and BOUNDING_BOX.
parser (RecordParser) – A new Parser class to be registered.
- Returns
Nothing.
- Return type
None
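The register and parser methods describe a registry-style factory. A minimal sketch of that pattern is shown below, under the assumption that the factory keeps a mapping from annotation type to parser class. ParserSketch, SingleLabelSketch, ParserFactorySketch, and the record layout are all hypothetical and are not the ads implementation.

```python
from typing import Dict, Type


class ParserSketch:
    """Hypothetical minimal parser; stands in for RecordParser."""

    def __init__(self, dataset_source_path: str) -> None:
        self.dataset_source_path = dataset_source_path

    def parse(self, record: dict) -> tuple:
        raise NotImplementedError


class SingleLabelSketch(ParserSketch):
    def parse(self, record: dict) -> tuple:
        # Assumed record layout, for illustration only.
        return (self.dataset_source_path + record["path"], record["label"])


class ParserFactorySketch:
    # Maps an annotation type to the parser class that handles it.
    _parsers: Dict[str, Type[ParserSketch]] = {}

    @classmethod
    def register(cls, annotation_type: str, parser: Type[ParserSketch]) -> None:
        cls._parsers[annotation_type] = parser

    @staticmethod
    def parser(annotation_type: str, dataset_source_path: str) -> ParserSketch:
        if annotation_type not in ParserFactorySketch._parsers:
            raise ValueError(f"Unsupported annotation type: {annotation_type}")
        return ParserFactorySketch._parsers[annotation_type](dataset_source_path)


ParserFactorySketch.register("SINGLE_LABEL", SingleLabelSketch)
parser = ParserFactorySketch.parser("SINGLE_LABEL", "data/")
result = parser.parse({"path": "a.txt", "label": "yes"})
print(result)
```

Keeping the lookup in one place means a new annotation type only needs a new parser class and a register call; callers of parser stay unchanged.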
- class ads.data_labeling.parser.export_record_parser.SingleLabelRecordParser(dataset_source_path: str, format: Optional[str] = None, categories: Optional[List[str]] = None)
Bases:
RecordParser
SingleLabelRecordParser class which parses the label of Single label data.
Initiates a RecordParser instance.
- Parameters
dataset_source_path (str) – Dataset source path.
format ((str, optional). Defaults to None.) – Output format of annotations.
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: ['cat', 'dog', 'horse']
- Returns
RecordParser instance.
- Return type
RecordParser
ads.data_labeling.reader.dataset_reader module
The module containing classes to read labeled datasets. Allows reading labeled datasets from exports or from the cloud.
Classes
- LabeledDatasetReader
The LabeledDatasetReader class to read labeled dataset.
- ExportReader
The ExportReader class to read labeled dataset from the export.
- DLSDatasetReader
The DLSDatasetReader class to read labeled dataset from the cloud.
Examples
>>> from ads.common import auth as authutil
>>> from ads.data_labeling import LabeledDatasetReader
>>> ds_reader = LabeledDatasetReader.from_export(
... path="oci://bucket_name@namespace/dataset_metadata.jsonl",
... auth=authutil.api_keys(),
... materialize=True
... )
>>> ds_reader.info()
------------------------------------------------------------------------
annotation_type SINGLE_LABEL
compartment_id TEST_COMPARTMENT
dataset_id TEST_DATASET
dataset_name test_dataset_name
dataset_type TEXT
labels ['yes', 'no']
records_path path/to/records
source_path path/to/dataset
>>> ds_reader.read()
Path Content Annotations
-----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
2 path/to/the/content/file3 file content no
>>> next(ds_reader.read(iterator=True))
("path/to/the/content/file1", "file content", "yes")
>>> next(ds_reader.read(iterator=True, chunksize=2))
[("path/to/the/content/file1", "file content", "yes"),
("path/to/the/content/file2", "file content", "no")]
>>> next(ds_reader.read(chunksize=2))
Path Content Annotations
----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
>>> ds_reader = LabeledDatasetReader.from_DLS(
... dataset_id="dataset_OCID",
... compartment_id="compartment_OCID",
... auth=authutil.api_keys(),
... materialize=True
... )
- class ads.data_labeling.reader.dataset_reader.DLSDatasetReader(dataset_id: str, compartment_id: str, auth: Dict, encoding='utf-8', materialize: bool = False, include_unlabeled: bool = False)
Bases:
Reader
The DLSDatasetReader class to read labeled dataset from the cloud.
- read(self) Generator[Tuple, Any, Any]
Reads the labeled dataset.
Initializes the DLS dataset reader instance.
- Parameters
dataset_id (str) – The dataset OCID.
compartment_id (str) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files. The encoding is used to extract the metadata information of the labeled dataset and also to extract the content of the text dataset records.
materialize ((bool, optional). Defaults to False.) – Whether the content of dataset files should be loaded/materialized or not. By default the content will not be materialized.
include_unlabeled ((bool, optional). Defaults to False.) – Whether to load the unlabeled records or not.
- Raises
ValueError – When dataset_id is empty or not a string.
TypeError – When dataset_id is not a string.
- info() Metadata
Gets the labeled dataset metadata.
- Returns
The labeled dataset metadata.
- Return type
Metadata
- read(format: Optional[str] = None) Generator[Tuple, Any, Any]
Reads the labeled dataset records.
- Parameters
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
- Returns
The labeled dataset records.
- Return type
Generator[Tuple, Any, Any]
- class ads.data_labeling.reader.dataset_reader.ExportReader(path: str, auth: Optional[Dict] = None, encoding='utf-8', materialize: bool = False, include_unlabeled: bool = False)
Bases:
Reader
The ExportReader class to read labeled dataset from the export.
- read(self) Generator[Tuple, Any, Any]
Reads the labeled dataset.
Initializes the labeled dataset export reader instance.
- Parameters
path (str) – The metadata file path, can be either local or object storage path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files. The encoding is used to extract the metadata information of the labeled dataset and also to extract the content of the text dataset records.
materialize ((bool, optional). Defaults to False.) – Whether the content of dataset files should be loaded/materialized or not. By default the content will not be materialized.
include_unlabeled ((bool, optional). Defaults to False.) – Whether to load the unlabeled records or not.
- Raises
ValueError – When path is empty or not a string.
TypeError – When path is not a string.
- info() Metadata
Gets the labeled dataset metadata.
- Returns
The labeled dataset metadata.
- Return type
Metadata
- read(format: Optional[str] = None) Generator[Tuple, Any, Any]
Reads the labeled dataset records.
- Parameters
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
- Returns
The labeled dataset records.
- Return type
Generator[Tuple, Any, Any]
- class ads.data_labeling.reader.dataset_reader.LabeledDatasetReader(reader: Reader)
Bases:
object
The labeled dataset reader class.
- read(self, iterator: bool = False) Union[Generator[Any, Any, Any], pd.DataFrame]
Reads labeled dataset.
- from_export(cls, path: str, auth: Dict = None, encoding='utf-8', materialize: bool = False) 'LabeledDatasetReader'
Constructs a Labeled Dataset Reader instance.
Examples
>>> from ads.common import auth as authutil
>>> from ads.data_labeling import LabeledDatasetReader
>>> ds_reader = LabeledDatasetReader.from_export(
... path="oci://bucket_name@namespace/dataset_metadata.jsonl",
... auth=authutil.api_keys(),
... materialize=True
... )
>>> ds_reader = LabeledDatasetReader.from_DLS(
... dataset_id="dataset_OCID",
... compartment_id="compartment_OCID",
... auth=authutil.api_keys(),
... materialize=True
... )
>>> ds_reader.info()
------------------------------------------------------------------------
annotation_type SINGLE_LABEL
compartment_id TEST_COMPARTMENT
dataset_id TEST_DATASET
dataset_name test_dataset_name
dataset_type TEXT
labels ['yes', 'no']
records_path path/to/records
source_path path/to/dataset
>>> ds_reader.read()
Path Content Annotations
-----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
2 path/to/the/content/file3 file content no
>>> next(ds_reader.read(iterator=True))
("path/to/the/content/file1", "file content", "yes")
>>> next(ds_reader.read(iterator=True, chunksize=2))
[("path/to/the/content/file1", "file content", "yes"),
("path/to/the/content/file2", "file content", "no")]
>>> next(ds_reader.read(chunksize=2))
Path Content Annotations
----------------------------------------------------------------------
0 path/to/the/content/file1 file content yes
1 path/to/the/content/file2 file content no
Initializes the labeled dataset reader instance.
- Parameters
reader (Reader) – The Reader instance which reads and extracts the labeled dataset.
- classmethod from_DLS(dataset_id: str, compartment_id: Optional[str] = None, auth: Optional[dict] = None, encoding: str = 'utf-8', materialize: bool = False, include_unlabeled: bool = False) LabeledDatasetReader
Constructs Labeled Dataset Reader instance.
- Parameters
dataset_id (str) – The dataset OCID.
compartment_id (str. Defaults to the compartment_id from the env variable.) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
materialize ((bool, optional). Defaults to False.) – Whether the content of the dataset file should be loaded or it should return the file path to the content. By default the content will not be loaded.
- Returns
The LabeledDatasetReader instance.
- Return type
LabeledDatasetReader
- classmethod from_export(path: str, auth: Optional[dict] = None, encoding: str = 'utf-8', materialize: bool = False, include_unlabeled: bool = False) LabeledDatasetReader
Constructs Labeled Dataset Reader instance.
- Parameters
path (str) – The metadata file path, can be either local or object storage path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
materialize ((bool, optional). Defaults to False.) – Whether the content of the dataset file should be loaded or it should return the file path to the content. By default the content will not be loaded.
- Returns
The LabeledDatasetReader instance.
- Return type
LabeledDatasetReader
- info() Serializable
Gets the labeled dataset metadata.
- Returns
The labeled dataset metadata.
- Return type
- read(iterator: bool = False, format: Optional[str] = None, chunksize: Optional[int] = None) Union[Generator[Any, Any, Any], DataFrame]
Reads the labeled dataset records.
- Parameters
iterator ((bool, optional). Defaults to False.) – True if the result should be represented as a Generator. False if the result should be represented as a Pandas DataFrame.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” or “yolo”.
chunksize ((int, optional). Defaults to None.) – The number of records that should be read in one iteration. The result will be returned in a generator format.
- Returns
Union[Generator[Tuple[str, str, Any], Any, Any], Generator[List[Tuple[str, str, Any]], Any, Any], Generator[pd.DataFrame, Any, Any], pd.DataFrame] – The shape of the result depends on iterator and chunksize:
pd.DataFrame if iterator and chunksize are not specified.
Generator[pd.DataFrame] if iterator is False and chunksize is specified.
Generator[List[Tuple[str, str, Any]]] if iterator is True and chunksize is specified.
Generator[Tuple[str, str, Any]] if iterator is True and chunksize is not specified.
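The four result shapes above can be illustrated with a small stand-in. This is only a sketch of the dispatch logic, not the ADS implementation: `read_records` is a hypothetical helper, and a plain list plays the role of the DataFrame so the example runs without pandas or ADS.

```python
from typing import Any, Iterator, List, Optional, Tuple, Union

Record = Tuple[str, str, Any]  # (path, content, annotations), as in the return type above


def read_records(
    records: List[Record],
    iterator: bool = False,
    chunksize: Optional[int] = None,
) -> Union[List[Record], Iterator[Any]]:
    """Hypothetical stand-in: pick the result shape from iterator/chunksize."""

    def chunks() -> Iterator[List[Record]]:
        # Yield consecutive batches of at most `chunksize` records.
        for start in range(0, len(records), chunksize):
            yield records[start:start + chunksize]

    if iterator and chunksize is None:
        return iter(records)   # generator-like: one record per step
    if chunksize is not None:
        return chunks()        # one batch per step (a list here; a DataFrame or a list in ADS)
    return list(records)       # everything at once (a DataFrame in ADS)


data = [("a.txt", "A", "x"), ("b.txt", "B", "y"), ("c.txt", "C", "z")]
whole = read_records(data)
one_by_one = list(read_records(data, iterator=True))
batches = list(read_records(data, chunksize=2))
```

Specifying chunksize is what bounds memory use: each step holds at most chunksize records, whichever container the batch is delivered in.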
ads.data_labeling.reader.jsonl_reader module
- class ads.data_labeling.reader.jsonl_reader.JsonlReader(path: str, auth: Optional[Dict] = None, encoding='utf-8')
Bases:
Reader
JsonlReader class which reads the file.
Initiates a JsonlReader object.
- Parameters
path (str) – object storage path or local path for a file.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding of files. Only used for “TEXT” dataset.
Examples
>>> from ads.data_labeling.reader.jsonl_reader import JsonlReader
>>> path = "your/path/to/jsonl/file.jsonl"
>>> from ads.common import auth as authutil
>>> reader = JsonlReader(path=path, auth=authutil.api_keys(), encoding="utf-8")
>>> next(reader.read())
- read(skip: Optional[int] = None) Generator[Dict, Any, Any]
Reads and yields the content of the file.
- Parameters
skip ((int, optional). Defaults to None.) – The number of records that should be skipped.
- Returns
The content of the file.
- Return type
Generator[Dict, Any, Any]
- Raises
ValueError – If skip is provided and is not a positive integer.
FileNotFoundError – When file not found.
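The behaviour described for read (yield one dictionary per record, optionally skip leading records, reject an invalid skip) can be sketched with the standard library alone. `read_jsonl` below is a hypothetical local-file stand-in and not the JsonlReader code; it does not handle object storage paths or authentication.

```python
import json
import os
import tempfile
from itertools import islice
from typing import Any, Dict, Generator, Optional


def read_jsonl(path: str, skip: Optional[int] = None,
               encoding: str = "utf-8") -> Generator[Dict[str, Any], None, None]:
    """Yield one dict per non-empty line, optionally skipping leading lines."""
    if skip is not None and (not isinstance(skip, int) or skip <= 0):
        raise ValueError("`skip` must be a positive integer.")
    # open() raises FileNotFoundError when the file does not exist.
    with open(path, encoding=encoding) as f:
        for line in islice(f, skip or 0, None):
            if line.strip():
                yield json.loads(line)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "records.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    all_ids = [r["id"] for r in read_jsonl(demo_path)]
    tail_ids = [r["id"] for r in read_jsonl(demo_path, skip=1)]
```

Because the function is a generator, the file is read lazily and the validation runs when the first record is requested.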
ads.data_labeling.reader.metadata_reader module
- class ads.data_labeling.reader.metadata_reader.DLSMetadataReader(dataset_id: str, compartment_id: str, auth: dict)
Bases:
Reader
DLSMetadataReader class which reads the metadata jsonl file from the cloud.
Initializes the DLS metadata reader instance.
- Parameters
dataset_id (str) – The dataset OCID.
compartment_id (str) – The compartment OCID of the dataset.
auth (dict) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Raises
ValueError – When dataset_id is empty or not a string.
TypeError – When dataset_id is not a string.
- read() Metadata
Reads the content from the metadata file.
- Returns
The metadata of the labeled dataset.
- Return type
Metadata
- Raises
DatasetNotFoundError – If the dataset is not found.
ReadDatasetError – If any error occurred while attempting to read the dataset.
- exception ads.data_labeling.reader.metadata_reader.DatasetNotFoundError(id: str)
Bases:
Exception
- exception ads.data_labeling.reader.metadata_reader.EmptyMetadata
Bases:
Exception
Empty Metadata.
- class ads.data_labeling.reader.metadata_reader.ExportMetadataReader(path: str, auth: Optional[Dict] = None, encoding='utf-8')
Bases:
JsonlReader
ExportMetadataReader class which reads the metadata jsonl file from local/object storage path.
Initiates a JsonlReader object.
- Parameters
path (str) – object storage path or local path for a file.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding of files. Only used for “TEXT” dataset.
Examples
>>> from ads.data_labeling.reader.jsonl_reader import JsonlReader
>>> path = "your/path/to/jsonl/file.jsonl"
>>> from ads.common import auth as authutil
>>> reader = JsonlReader(path=path, auth=authutil.api_keys(), encoding="utf-8")
>>> next(reader.read())
- class ads.data_labeling.reader.metadata_reader.MetadataReader(reader: Reader)
Bases:
object
MetadataReader class which reads and extracts the labeled dataset metadata.
Examples
>>> from ads.data_labeling import MetadataReader
>>> import oci
>>> import os
>>> from ads.common import auth as authutil
>>> reader = MetadataReader.from_export_file("metadata_export_file_path",
...                                          auth=authutil.api_keys())
>>> reader.read()
Initiate a MetadataReader instance.
- Parameters
reader (Reader) – Reader instance which reads and extracts the labeled dataset metadata.
- classmethod from_DLS(dataset_id: str, compartment_id: Optional[str] = None, auth: Optional[dict] = None) MetadataReader
Constructs a MetadataReader instance.
- Parameters
dataset_id (str) – The dataset OCID.
compartment_id ((str, optional). Defaults to None.) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Returns
The MetadataReader instance whose reader is a DLSMetadataReader instance.
- Return type
MetadataReader
- classmethod from_export_file(path: str, auth: Optional[Dict] = None) MetadataReader
Constructs a MetadataReader instance.
- Parameters
path (str) – metadata file path, can be either local or object storage path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
- Returns
The MetadataReader instance whose reader is an ExportMetadataReader instance.
- Return type
MetadataReader
- exception ads.data_labeling.reader.metadata_reader.ReadDatasetError(id: str)
Bases:
Exception
ads.data_labeling.reader.record_reader module
- class ads.data_labeling.reader.record_reader.RecordReader(reader: Reader, parser: Parser, loader: Optional[Loader] = None, include_unlabeled: bool = False, encoding: str = 'utf-8', materialize: bool = False)
Bases:
object
The RecordReader class consists of a reader, a parser and a loader. The reader reads the content of the record file, the parser parses the labels of each record, and the loader loads the content of the file that each record points to.
Examples
>>> import os
>>> import oci
>>> from ads.data_labeling import RecordReader
>>> from ads.common import auth as authutil
>>> file_path = "/path/to/your_record.jsonl"
>>> dataset_type = "IMAGE"
>>> annotation_type = "BOUNDING_BOX"
>>> record_reader = RecordReader.from_export_file(file_path, dataset_type, annotation_type, "image_file_path", authutil.api_keys())
>>> next(record_reader.read())
Initiates a RecordReader instance.
- Parameters
reader (Reader) – Reader instance to read content from the record file.
parser (Parser) – Parser instance to parse the labels from record file.
loader (Loader. Defaults to None.) – Loader instance to load the content from the file path in the record.
materialize (bool, optional. Defaults to False.) – Whether to materialize the content using loader.
include_unlabeled ((bool, optional). Defaults to False.) – Whether to load the unlabeled records or not.
encoding (str, optional) – Encoding for text files. Used only to extract the content of the text dataset contents.
- Raises
ValueError – If the record reader or the record parser is not specified. If the loader is not specified when materialize is True.
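The division of work between reader, parser and loader can be sketched with plain callables. Everything below is illustrative: `iter_records`, the `"path"` and `"labels"` keys, and the in-memory `files` mapping are assumptions made for the example and are not part of the ADS API.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple


def iter_records(
    read: Callable[[], Iterable[Dict[str, Any]]],   # plays the Reader role
    parse: Callable[[Dict[str, Any]], Any],         # plays the Parser role
    load: Optional[Callable[[str], Any]] = None,    # plays the Loader role
    materialize: bool = False,
    include_unlabeled: bool = False,
) -> Iterator[Tuple[str, Any, Any]]:
    """Yield (path, content, labels); content is the path unless materialized."""
    if materialize and load is None:
        raise ValueError("A loader is required when materialize is True.")
    for raw in read():
        labels = parse(raw)
        if not labels and not include_unlabeled:
            continue  # drop unlabeled records by default
        path = raw["path"]
        content = load(path) if materialize else path
        yield path, content, labels


raw_records = [
    {"path": "a.txt", "labels": ["spam"]},
    {"path": "b.txt", "labels": []},
]
files = {"a.txt": "buy now", "b.txt": "hello"}

labeled = list(iter_records(lambda: raw_records, lambda r: r["labels"]))
everything = list(iter_records(lambda: raw_records, lambda r: r["labels"],
                               load=files.get, materialize=True,
                               include_unlabeled=True))
```

Keeping the three roles separate is what lets the same pipeline serve different sources and annotation types: only the callable that changes needs to be swapped.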
- classmethod from_DLS(dataset_id: str, dataset_type: str, annotation_type: str, dataset_source_path: str, compartment_id: Optional[str] = None, auth: Optional[Dict] = None, include_unlabeled: bool = False, encoding: str = 'utf-8', materialize: bool = False, format: Optional[str] = None, categories: Optional[List[str]] = None) RecordReader
Constructs Record Reader instance.
- Parameters
dataset_id (str) – The dataset OCID.
dataset_type (str) – Dataset type. Currently supports TEXT, IMAGE and DOCUMENT.
annotation_type (str) – Annotation Type. Currently TEXT supports SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION. IMAGE supports SINGLE_LABEL, MULTI_LABEL and BOUNDING_BOX. DOCUMENT supports SINGLE_LABEL and MULTI_LABEL.
dataset_source_path (str) – Dataset source path.
compartment_id ((str, optional). Defaults to None.) – The compartment OCID of the dataset.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
encoding ((str, optional). Defaults to 'utf-8'.) – Encoding for files.
materialize ((bool, optional). Defaults to False.) – Whether the content of the dataset file should be loaded or it should return the file path to the content. By default the content will not be loaded.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: [‘cat’,’dog’,’horse’]
- Returns
The RecordReader instance.
- Return type
RecordReader
- classmethod from_export_file(path: str, dataset_type: str, annotation_type: str, dataset_source_path: str, auth: Optional[Dict] = None, include_unlabeled: bool = False, encoding: str = 'utf-8', materialize: bool = False, format: Optional[str] = None, categories: Optional[List[str]] = None, includes_metadata=False) RecordReader
Initiates a RecordReader instance.
- Parameters
path (str) – Record file path.
dataset_type (str) – Dataset type. Currently supports TEXT, IMAGE and DOCUMENT.
annotation_type (str) – Annotation Type. Currently TEXT supports SINGLE_LABEL, MULTI_LABEL, ENTITY_EXTRACTION. IMAGE supports SINGLE_LABEL, MULTI_LABEL and BOUNDING_BOX. DOCUMENT supports SINGLE_LABEL and MULTI_LABEL.
dataset_source_path (str) – Dataset source path.
auth ((dict, optional). Defaults to None.) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate the IdentityClient object.
include_unlabeled ((bool, optional). Defaults to False.) – Whether to load the unlabeled records or not.
encoding ((str, optional). Defaults to "utf-8".) – Encoding for text files. Used only to extract the content of the text dataset contents.
materialize ((bool, optional). Defaults to False.) – Whether to materialize the content by loader.
format ((str, optional). Defaults to None.) – Output format of annotations. Can be None, “spacy” for dataset Entity Extraction type or “yolo” for Object Detection type. When None, it outputs List[NERItem] or List[BoundingBoxItem]. When “spacy”, it outputs List[Tuple]. When “yolo”, it outputs List[List[Tuple]].
categories ((List[str], optional). Defaults to None.) – The list of object categories in proper order for model training. Example: [‘cat’,’dog’,’horse’]
includes_metadata ((bool, optional). Defaults to False.) – Determines whether the export file includes metadata or not.
- Returns
A RecordReader instance.
- Return type
RecordReader
- read() Generator[Tuple[str, Union[List, str]], Any, Any]
Reads the record.
- Yields
Generator[Tuple[str, Union[List, str]], Any, Any] – File path, content and labels in a tuple.
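For the "yolo" annotation format mentioned in the format parameter, the sketch below shows how corner coordinates might map to the conventional YOLO tuple (class index, x_center, y_center, width, height). `corners_to_yolo` is a hypothetical helper, not the library's to_yolo, and emitting one tuple per label is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
YoloBox = Tuple[int, float, float, float, float]


def corners_to_yolo(top_left: Point, bottom_right: Point,
                    labels: Sequence[str],
                    categories: Sequence[str]) -> List[YoloBox]:
    """Convert normalized corners to (class, x_center, y_center, width, height)."""
    x_min, y_min = top_left
    x_max, y_max = bottom_right
    width, height = x_max - x_min, y_max - y_min
    x_center, y_center = x_min + width / 2, y_min + height / 2
    # The class index is the label's position in `categories`, which is why
    # the category order has to stay fixed for model training.
    # list.index raises ValueError for a label missing from `categories`.
    return [(categories.index(label), x_center, y_center, width, height)
            for label in labels]


boxes = corners_to_yolo(top_left=(0.25, 0.25), bottom_right=(0.75, 0.5),
                        labels=["dog"], categories=["cat", "dog", "horse"])
```

With the categories ordered as in the example from the parameter description, "dog" maps to class index 1.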
ads.data_labeling.visualizer.image_visualizer module
The module that helps to visualize Image Dataset.
- ads.data_labeling.visualizer.image_visualizer.render(items: List[LabeledImageItem], options: Dict = None)
Renders Labeled Image dataset.
Examples
>>> bbox1 = BoundingBoxItem(bottom_left=(0.3, 0.4),
...                         top_left=(0.3, 0.09),
...                         top_right=(0.86, 0.09),
...                         bottom_right=(0.86, 0.4),
...                         labels=['dolphin', 'fish'])
>>> record1 = LabeledImageItem(img_obj1, [bbox1])
>>> bbox2 = BoundingBoxItem(bottom_left=(0.2, 0.4),
...                         top_left=(0.2, 0.2),
...                         top_right=(0.8, 0.2),
...                         bottom_right=(0.8, 0.4),
...                         labels=['dolphin'])
>>> bbox3 = BoundingBoxItem(bottom_left=(0.5, 1.0),
...                         top_left=(0.5, 0.8),
...                         top_right=(0.8, 0.8),
...                         bottom_right=(0.8, 1.0),
...                         labels=['shark'])
>>> record2 = LabeledImageItem(img_obj2, [bbox2, bbox3])
>>> render(items = [record1, record2], options={"default_color":"blue", "colors": {"dolphin":"blue", "whale":"red"}})
- class ads.data_labeling.visualizer.image_visualizer.ImageLabeledDataFormatter
Bases:
object
The ImageLabeledDataFormatter class to render image items in a notebook session.
- static render_item(item: LabeledImageItem, options: Optional[Dict] = None, path: Optional[str] = None) None
Renders image dataset.
- Parameters
item (LabeledImageItem) – Item to render.
options (Optional[dict]) – Render options.
path (str) – Path to save the image with annotations to local directory.
- Returns
Nothing.
- Return type
None
- Raises
ValueError – If item is not provided. If path is not valid.
TypeError – If item is provided in a wrong format.
- class ads.data_labeling.visualizer.image_visualizer.LabeledImageItem(img: ImageFile, boxes: List[BoundingBoxItem])
Bases:
object
Data class representing Image Item.
- img
the labeled image object.
- Type
ImageFile
- boxes
a list of BoundingBoxItem
- Type
List[BoundingBoxItem]
- boxes: List[BoundingBoxItem]
- img: ImageFile
- class ads.data_labeling.visualizer.image_visualizer.RenderOptions(default_color: str, colors: Optional[dict])
Bases:
object
Data class representing render options.
- default_color
The specified default color.
- Type
str
- colors
The multiple specified colors.
- Type
Optional[dict]
- colors: Optional[dict]
- default_color: str
- classmethod from_dict(options: dict) RenderOptions
Constructs an instance of RenderOptions from a dictionary.
- Parameters
options (dict) – Render options in dictionary format.
- Returns
The instance of RenderOptions.
- Return type
RenderOptions
- to_dict()
Converts RenderOptions instance to dictionary format.
- Returns
The render options in dictionary format.
- Return type
dict
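A minimal sketch of how such a render-options data class could round-trip through a dictionary, using only the two documented fields. `RenderOptionsSketch` and its `color_for` helper are hypothetical and written for illustration; falling back to default_color for labels missing from colors is an assumption about how the options are meant to be used.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class RenderOptionsSketch:
    """Illustrative stand-in with the documented default_color and colors fields."""
    default_color: str
    colors: Optional[Dict[str, str]] = None

    @classmethod
    def from_dict(cls, options: dict) -> "RenderOptionsSketch":
        # Build the instance from the plain dictionary passed to render().
        return cls(default_color=options["default_color"],
                   colors=options.get("colors"))

    def to_dict(self) -> dict:
        return asdict(self)

    def color_for(self, label: str) -> str:
        # Hypothetical helper: per-label color, else the default color.
        return (self.colors or {}).get(label, self.default_color)


opts = RenderOptionsSketch.from_dict(
    {"default_color": "blue", "colors": {"dolphin": "blue", "whale": "red"}}
)
round_trip = opts.to_dict()
```

Under that assumption, a label such as 'shark' that has no entry in colors would be drawn in the default color.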
- exception ads.data_labeling.visualizer.image_visualizer.WrongEntityFormat
Bases:
ValueError
- ads.data_labeling.visualizer.image_visualizer.render(items: List[LabeledImageItem], options: Optional[Dict] = None, path: Optional[str] = None) None
Render image dataset.
- Parameters
items (List[LabeledImageItem]) – The list of LabeledImageItem to render.
options (dict, optional) – The options for rendering.
path (str) – Path to save the images with annotations to local directory.
- Returns
Nothing.
- Return type
None
- Raises
ValueError – If items are not provided. If path is not valid.
TypeError – If items are provided in a wrong format.
Examples
>>> bbox1 = BoundingBoxItem(bottom_left=(0.3, 0.4),
...                         top_left=(0.3, 0.09),
...                         top_right=(0.86, 0.09),
...                         bottom_right=(0.86, 0.4),
...                         labels=['dolphin', 'fish'])
>>> record1 = LabeledImageItem(img_obj1, [bbox1])
>>> render(items = [record1])
ads.data_labeling.visualizer.text_visualizer module
The module that helps to visualize NER Text Dataset.
- ads.data_labeling.visualizer.text_visualizer.render(items: List[LabeledTextItem], options: Dict = None) str
Renders NER dataset to Html format.
Examples
>>> record1 = LabeledTextItem("London is the capital of the United Kingdom", [NERItem('city', 0, 6), NERItem("country", 29, 14)])
>>> record2 = LabeledTextItem("Houston area contractor seeking a Sheet Metal Superintendent.", [NERItem("city", 0, 7)])
>>> result = render(items = [record1, record2], options={"default_color":"#DDEECC", "colors": {"city":"#DDEECC", "country":"#FFAAAA"}})
>>> display(HTML(result))
- class ads.data_labeling.visualizer.text_visualizer.LabeledTextItem(txt: str, ents: List[NERItem])
Bases:
object
Data class representing NER Item.
- txt
The labeled sentence.
- Type
str
- txt: str
- class ads.data_labeling.visualizer.text_visualizer.RenderOptions(default_color: str, colors: Optional[dict])
Bases:
object
Data class representing render options.
- default_color
The specified default color.
- Type
str
- colors
The multiple specified colors.
- Type
Optional[dict]
- colors: Optional[dict]
- default_color: str
- classmethod from_dict(options: dict) RenderOptions
Constructs an instance of RenderOptions from a dictionary.
- Parameters
options (dict) – Render options in dictionary format.
- Returns
The instance of RenderOptions.
- Return type
RenderOptions
- to_dict()
Converts RenderOptions instance to dictionary format.
- Returns
The render options in dictionary format.
- Return type
dict
- class ads.data_labeling.visualizer.text_visualizer.TextLabeledDataFormatter
Bases:
object
The TextLabeledDataFormatter class to render NER items into Html format.
- static render(items: List[LabeledTextItem], options: Optional[Dict] = None) str
Renders NER dataset to Html format.
- Parameters
items (List[LabeledTextItem]) – Items to render.
options (Optional[dict]) – Render options.
- Returns
Html representation of rendered NER dataset.
- Return type
str
- Raises
ValueError – If items are not provided.
TypeError – If items are provided in a wrong format.
- ads.data_labeling.visualizer.text_visualizer.render(items: List[LabeledTextItem], options: Optional[Dict] = None) str
Renders NER dataset to Html format.
- Parameters
items (List[LabeledTextItem]) – The list of NER items to render.
options (dict, optional) – The options for rendering.
- Returns
Html string.
- Return type
str
Examples
>>> record = LabeledTextItem("London is the capital of the United Kingdom", [NERItem('city', 0, 6), NERItem("country", 29, 14)])
>>> result = render(items = [record], options={"default_color":"#DDEECC", "colors": {"city":"#DDEECC", "country":"#FFAAAA"}})
>>> display(HTML(result))
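Reading the NERItem arguments in the example as (label, offset, length), the sketch below shows how such spans select substrings and how they might be wrapped in HTML. It is a simplified stand-in rather than the library's renderer: `entity_text` and `render_html` are hypothetical names, the `<mark>` markup is an arbitrary choice, and entities are assumed not to overlap.

```python
from html import escape
from typing import List, Tuple

Entity = Tuple[str, int, int]  # (label, offset, length), assumed from the example


def entity_text(text: str, entity: Entity) -> str:
    """Return the substring that the entity span covers."""
    _, offset, length = entity
    return text[offset:offset + length]


def render_html(text: str, entities: List[Entity]) -> str:
    """Wrap each non-overlapping entity span in a <mark> element."""
    parts, cursor = [], 0
    for label, offset, length in sorted(entities, key=lambda e: e[1]):
        parts.append(escape(text[cursor:offset]))  # plain text before the span
        span = escape(text[offset:offset + length])
        parts.append(f'<mark title="{escape(label)}">{span}</mark>')
        cursor = offset + length
    parts.append(escape(text[cursor:]))            # trailing plain text
    return "".join(parts)


sentence = "London is the capital of the United Kingdom"
entities = [("city", 0, 6), ("country", 29, 14)]
covered = [entity_text(sentence, e) for e in entities]
html_out = render_html(sentence, entities)
```

Checking the covered substrings before rendering is a quick way to catch an offset or length that is off by one.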