ads.data_labeling.mixin package

Submodules

ads.data_labeling.mixin.data_labeling module

class ads.data_labeling.mixin.data_labeling.DataLabelingAccessMixin[source]

Bases: object

Mixin class for labeled text data.

static read_labeled_data(path: str | None = None, dataset_id: str | None = None, compartment_id: str | None = None, auth: Dict | None = None, materialize: bool = False, encoding: str = 'utf-8', include_unlabeled: bool = False, format: str | None = None, chunksize: int | None = None)[source]

Loads a dataset generated by the Data Labeling Service, either from an export file or directly from the service.

Parameters:
  • path ((str, optional). Defaults to None) – The export file path, can be either local or object storage path.

  • dataset_id ((str, optional). Defaults to None) – The dataset OCID.

  • compartment_id (str. Defaults to the compartment_id from the env variable.) – The compartment OCID of the dataset.

  • auth ((dict, optional). Defaults to None) – The default authentication is set using the ads.set_auth API. To override the default, use ads.common.auth.api_keys or ads.common.auth.resource_principal to create the appropriate authentication signer and the kwargs required to instantiate an IdentityClient object.

  • materialize ((bool, optional). Defaults to False) – Whether to load the content of the dataset files or return only the file path to the content. By default, the content is not loaded.

  • encoding ((str, optional). Defaults to 'utf-8') – Encoding of the files. Only used for “TEXT” datasets.

  • include_unlabeled ((bool, optional). Defaults to False) – Whether to load the unlabeled records.

  • format ((str, optional). Defaults to None) –

    Output format of the annotations. Can be None, “spacy” for the Entity Extraction dataset type, or “yolo” for the Object Detection dataset type.

    • When None, it outputs List[NERItem] or List[BoundingBoxItem],

    • When “spacy”, it outputs List[Tuple],

    • When “yolo”, it outputs List[List[Tuple]].

  • chunksize ((int, optional). Defaults to None) – The number of records to read in one iteration. When set, the result is returned as a generator.
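
The difference between the default annotation objects and the “spacy” output of the format parameter can be sketched as follows. This is an illustration only: the Entity class is a hypothetical stand-in (it is not the ads NERItem class), and the (start, end, label) tuple layout is the character-offset convention commonly used for spaCy training data, assumed here rather than confirmed by this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    """Hypothetical stand-in for an entity annotation (not the ads NERItem)."""
    label: str
    offset: int  # index of the first character of the entity
    length: int  # number of characters in the entity


def to_span_tuples(entities: List[Entity]) -> List[Tuple[int, int, str]]:
    """Convert offset/length annotations into (start, end, label) tuples."""
    return [(e.offset, e.offset + e.length, e.label) for e in entities]


text = "Oracle opened an office in Austin."
entities = [Entity("ORG", 0, 6), Entity("CITY", 27, 6)]
spans = to_span_tuples(entities)

# Each span slices the original text back to the annotated entity.
entity_texts = [text[start:end] for start, end, _ in spans]
```

Inspect the Annotations column of a small dataset to confirm the actual shape before relying on this layout.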

Returns:

pd.DataFrame if chunksize is not specified. Generator[pd.DataFrame] if chunksize is specified.

Return type:

Union[Generator[pd.DataFrame, Any, Any], pd.DataFrame]
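
Because the return value is either a single DataFrame or a generator of DataFrames, calling code may want to treat both cases uniformly. A minimal sketch, assuming a hypothetical helper named iter_chunks and using plain lists as stand-ins for DataFrames so that it runs without ads or pandas installed:

```python
from inspect import isgenerator
from typing import Any, Iterator


def iter_chunks(result: Any) -> Iterator[Any]:
    """Yield chunks whether `result` is one object or a generator of objects."""
    if isgenerator(result):
        yield from result
    else:
        yield result


# Stand-ins for the two return shapes: a single table and a chunked generator.
single = ["row0", "row1", "row2"]
chunked = (single[i:i + 2] for i in range(0, len(single), 2))

total_single = sum(len(chunk) for chunk in iter_chunks(single))
total_chunked = sum(len(chunk) for chunk in iter_chunks(chunked))
```

With this wrapper the same loop body works whether or not chunksize was passed.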

Examples

>>> import pandas as pd
>>> import ads
>>> from ads.common import auth as authutil
>>> df = pd.DataFrame.ads.read_labeled_data(path="path_to_your_metadata.jsonl",
...                                         auth=authutil.api_keys(),
...                                         materialize=False)
                            Path       Content               Annotations
    --------------------------------------------------------------------
    0   path/to/the/content/file                                     yes
    1   path/to/the/content/file                                      no
>>> df = pd.DataFrame.ads.read_labeled_data(dataset_id="your_dataset_ocid",
...                                         compartment_id="your_compartment_id",
...                                         auth=authutil.api_keys(),
...                                         materialize=False)
                            Path       Content               Annotations
    --------------------------------------------------------------------
    0   path/to/the/content/file                                     yes
    1   path/to/the/content/file                                      no

render_bounding_box(options: Dict | None = None, content_column: str = 'Content', annotations_column: str = 'Annotations', categories: List[str] | None = None, limit: int = 50, path: str | None = None) → None[source]

Renders a bounding box dataset. By default, displays only the first 50 rows.

Parameters:
  • options (dict) – The color options used for rendering.

  • content_column (Optional[str]) – The column name with the content data.

  • annotations_column (Optional[str]) – The column name for the annotations list.

  • categories (Optional[List[str]]) – The list of object categories in proper order for model training. Only used when bounding box annotations are in YOLO format. Example: ['cat', 'dog', 'horse']

  • limit (Optional[int]. Defaults to 50) – The maximum number of records to display.

  • path (Optional[str]) – Path to a local directory in which to save the annotated image.

Returns:

Nothing

Return type:

None
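
The order of the categories list matters because YOLO-style annotations refer to classes by index rather than by name. A minimal sketch of that lookup, assuming the common YOLO label layout of (class_index, x_center, y_center, width, height) with normalized coordinates; the exact tuple layout produced by this library is not stated on this page, and label_boxes is a hypothetical helper:

```python
from typing import List, Tuple

# Assumed layout: (class_index, x_center, y_center, width, height)
YoloBox = Tuple[int, float, float, float, float]
Box = Tuple[float, float, float, float]


def label_boxes(boxes: List[YoloBox], categories: List[str]) -> List[Tuple[str, Box]]:
    """Replace each class index with the category name at that position."""
    return [(categories[idx], (xc, yc, w, h)) for idx, xc, yc, w, h in boxes]


categories = ["cat", "dog", "horse"]
boxes = [(1, 0.5, 0.5, 0.2, 0.3), (0, 0.25, 0.4, 0.1, 0.1)]
labeled = label_boxes(boxes, categories)
names = [name for name, _ in labeled]
```

Passing the categories in a different order than the one used for training would silently attach the wrong names to the boxes.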

Examples

>>> import pandas as pd
>>> import ads
>>> from ads.common import auth as authutil
>>> df = pd.DataFrame.ads.read_labeled_data(path="path_to_your_metadata.jsonl",
...                                         auth=authutil.api_keys(),
...                                         materialize=True)
>>> df.ads.render_bounding_box(content_column="Content", annotations_column="Annotations")

render_ner(options: Dict = None, content_column: str = 'Content', annotations_column: str = 'Annotations', limit: int = 50, return_html: bool = False) → None[source]

Renders an NER dataset. By default, displays only the first 50 rows.

Parameters:
  • options (dict) – The color options used for rendering.

  • content_column (Optional[str]) – The column name with the content data.

  • annotations_column (Optional[str]) – The column name for the annotations list.

  • limit (Optional[int]. Defaults to 50) – The maximum number of records to display.

Returns:

Nothing

Return type:

None
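
To give a rough idea of what rendering entity annotations over text involves, the following plain-text sketch wraps each annotated span in a bracketed marker. It is not how render_ner is implemented; mark_entities is a hypothetical helper, and it assumes non-overlapping (start, end, label) spans.

```python
from typing import List, Tuple


def mark_entities(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span of `text` as [span|LABEL]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # untouched text before the entity
        pieces.append(f"[{text[start:end]}|{label}]")   # the marked entity
        cursor = end
    pieces.append(text[cursor:])                        # trailing text after the last entity
    return "".join(pieces)


marked = mark_entities(
    "Oracle opened an office in Austin.",
    [(27, 33, "CITY"), (0, 6, "ORG")],
)
```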

Examples

>>> import pandas as pd
>>> import ads
>>> from ads.common import auth as authutil
>>> df = pd.DataFrame.ads.read_labeled_data(path="path_to_your_metadata.jsonl",
...                                         auth=authutil.api_keys(),
...                                         materialize=True)
>>> df.ads.render_ner(content_column="Content", annotations_column="Annotations")

Module contents