ads.text_dataset package

Submodules

ads.text_dataset.backends module

class ads.text_dataset.backends.Base

Bases: object

Base class for backends.

convert_to_text(fhandler: OpenFile, dst_path: str, fname: Optional[str] = None, storage_options: Optional[Dict] = None) str

Convert input file to a text file.

Parameters
  • fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

  • dst_path (str) – local folder or cloud storage prefix to save converted text files

  • fname (str, optional) – filename for converted output, relative to dirname or prefix, by default None

  • storage_options (dict, optional) – storage options for cloud storage

Returns

path to saved output

Return type

str

get_metadata(fhandler: OpenFile) Dict

Get metadata of a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Returns

dictionary of metadata

Return type

dict

read_line(fhandler: OpenFile) Generator[Union[str, List[str]], None, None]

Read lines from a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Yields

Generator – a generator that yields lines

read_text(fhandler: OpenFile) Generator[Union[str, List[str]], None, None]

Read entire file into a string.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Yields

Generator – a generator that yields text in the file
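The two read methods differ only in granularity: read_line yields one record per line, while read_text yields the whole file as a single record. A minimal sketch of that contract follows, using only the standard library. FakeOpenFile is a hypothetical stand-in for fsspec.core.OpenFile (assumed here to act as a context manager that yields a file-like object), and SketchBackend is illustrative, not the library's Base implementation.

```python
import io
from typing import Generator


class FakeOpenFile:
    """Hypothetical stand-in for fsspec.core.OpenFile (a context manager)."""

    def __init__(self, content: str) -> None:
        self._content = content

    def __enter__(self) -> io.StringIO:
        self._buffer = io.StringIO(self._content)
        return self._buffer

    def __exit__(self, *exc) -> None:
        self._buffer.close()


class SketchBackend:
    """Illustrative backend; not the actual ads.text_dataset.backends.Base."""

    def read_line(self, fhandler) -> Generator[str, None, None]:
        # Yield one record per line, without the trailing newline.
        with fhandler as f:
            for line in f:
                yield line.rstrip("\n")

    def read_text(self, fhandler) -> Generator[str, None, None]:
        # Yield the whole file as a single record.
        with fhandler as f:
            yield f.read()


backend = SketchBackend()
lines = list(backend.read_line(FakeOpenFile("first\nsecond\n")))
texts = list(backend.read_text(FakeOpenFile("first\nsecond\n")))
```

Both methods are generators, so nothing is read until the caller iterates.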

class ads.text_dataset.backends.PDFPlumber

Bases: Base

convert_to_text(fhandler, dst_path, fname=None, storage_options=None)

Convert input file to a text file.

Parameters
  • fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

  • dst_path (str) – local folder or cloud storage prefix to save converted text files

  • fname (str, optional) – filename for converted output, relative to dirname or prefix, by default None

  • storage_options (dict, optional) – storage options for cloud storage

Returns

path to saved output

Return type

str

get_metadata(fhandler)

Get metadata of a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Returns

dictionary of metadata

Return type

dict

read_line(fhandler)

Read lines from a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Yields

Generator – a generator that yields lines

read_text(fhandler)

Read entire file into a string.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Yields

Generator – a generator that yields text in the file

class ads.text_dataset.backends.Tika

Bases: Base

convert_to_text(fhandler, dst_path, fname=None, storage_options=None)

Convert input file to a text file.

Parameters
  • fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

  • dst_path (str) – local folder or cloud storage prefix to save converted text files

  • fname (str, optional) – filename for converted output, relative to dirname or prefix, by default None

  • storage_options (dict, optional) – storage options for cloud storage

Returns

path to saved output

Return type

str

detect_encoding(fhandler: OpenFile)

get_metadata(fhandler)

Get metadata of a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Returns

dictionary of metadata

Return type

dict

read_line(fhandler)

Read lines from a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Yields

Generator – a generator that yields lines

read_text(fhandler)

Read entire file into a string.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Yields

Generator – a generator that yields text in the file

ads.text_dataset.dataset module

class ads.text_dataset.dataset.DataLoader(engine: Optional[str] = None)

Bases: object

DataLoader binds an engine, a FileProcessor, and a file handler (in this case, fsspec) together to produce a dataframe of parsed text from files.

This class is expected to be used mainly through the TextDatasetFactory class.

processor

processor that is used for loading data.

Type

ads.text_dataset.extractor.FileProcessor

Examples

>>> import os
>>> import oci
>>> from ads.text_dataset.dataset import TextDatasetFactory as textfactory
>>> from ads.text_dataset.options import Options
>>> df = textfactory.format('pdf').engine('pandas').read_line(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> data_gen = textfactory.format('pdf').option(Options.FILE_NAME).backend('pdfplumber').read_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> textfactory.format('docx').convert_to_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.docx',
...     './extracted',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> textfactory.format('docx').convert_to_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.docx',
...     'oci://<bucket-name>@<namespace>/<out_path>',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> fields = textfactory.format('pdf').metadata_schema(
...     'oci://<bucket-name>@<namespace>/papers/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
... )
>>> df = textfactory.format('pdf').engine('pandas').option(Options.FILE_METADATA, {'extract': ['Author']}).read_text(
...     'oci://<bucket-name>@<namespace>/<path>/*.pdf',
...     storage_options={"config": oci.config.from_file(os.path.join("~/.oci", "config"))},
...     total_files=10,
... )
>>> df = textfactory.format('txt').engine('cudf').read_line(
...     'oci://<bucket-name>@<namespace>/<path>/*.log',
...     udf=r'^\[(\S+)\s(\S+)\s(\d+)\s(\d+\:\d+\:\d+)\s(\d+)]\s(\S+)\s(\S+)\s(\S+)\s(\S+)',
...     df_args={"columns": ["day", "month", "date", "time", "year", "type", "method", "status", "file"]},
...     n_lines_per_file=10,
... )

Initialize a DataLoader object.

Parameters

engine (str, optional) – dataframe engine, by default None.

Return type

None

backend(backend: Union[str, Base]) None

Set backend used for extracting text from files.

Parameters

backend ((str | ads.text_dataset.backends.Base)) – backend for extracting text from raw files.

Return type

None

convert_to_text(src_path: str, dst_path: str, encoding: str = 'utf-8', storage_options: Optional[Dict] = None) None

Convert files to plain text files.

Parameters
  • src_path (str) – path to source data file(s); glob patterns are supported

  • dst_path (str) – local folder or cloud storage (e.g., OCI object storage) prefix to save converted text files

  • encoding (str, optional) – encoding for files, by default utf-8

  • storage_options (Dict, optional) – storage options for cloud storage, by default None

Return type

None

engine(eng: str) None

Set engine for dataloader. Can be pandas or cudf.

Parameters

eng (str) – name of engine

Return type

None

Raises

NotSupportedError – raises error if engine passed in is not supported.
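The engine check can be sketched as follows. This is a minimal illustration, not the library's code: NotSupportedError here is a hypothetical stand-in for the library's exception, and SketchLoader is an invented class. Although the signatures on this page are annotated as returning None, the class examples chain calls such as format(...).engine(...).read_line(...), so the sketch assumes each setter hands back the loader.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for the library's NotSupportedError."""


# Engine names taken from the description of engine() on this page.
SUPPORTED_ENGINES = ("pandas", "cudf")


class SketchLoader:
    """Illustrative loader; not the actual DataLoader."""

    def __init__(self, engine=None):
        self._engine = None
        if engine is not None:
            self.engine(engine)

    def engine(self, eng):
        # Reject anything other than the documented engine names.
        if eng not in SUPPORTED_ENGINES:
            raise NotSupportedError(f"engine {eng!r} is not supported")
        self._engine = eng
        return self  # assumption: returning the loader is what allows chaining


loader = SketchLoader().engine("pandas")

try:
    SketchLoader().engine("spark")
    rejected = False
except NotSupportedError:
    rejected = True
```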

metadata_all(path: str, storage_options: Optional[Dict] = None, encoding: str = 'utf-8') Generator[Dict[str, Any], None, None]

Get metadata of all files that match the given path. Returns a generator.

Parameters
  • path (str) – path to data files; glob patterns are supported

  • storage_options (Dict, optional) – storage options for cloud storage, by default None

  • encoding (str, optional) – encoding of files, by default ‘utf-8’

Returns

generator of extracted metadata from files.

Return type

Generator

metadata_schema(path: str, n_files: int = 1, storage_options: Optional[Dict] = None, encoding: str = 'utf-8') List[str]

Get available fields in metadata by looking at the first n_files files that match the given path.

Parameters
  • path (str) – path to data files; glob patterns are supported

  • n_files (int, optional) – number of files to inspect, by default 1

  • storage_options (dict, optional) – storage options for cloud storage, by default None

  • encoding (str, optional) – encoding of files, by default utf-8

Returns

list of available fields in metadata

Return type

List[str]
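One plausible reading of "available fields" is the union of metadata keys across the inspected files. The sketch below works under that assumption. sketch_metadata_schema and the sample records are hypothetical, and the result is sorted only to make it deterministic.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def sketch_metadata_schema(metadata: Iterable[Dict[str, Any]], n_files: int = 1) -> List[str]:
    """Collect the metadata field names seen in the first n_files records."""
    fields = set()
    for record in islice(metadata, n_files):
        fields.update(record.keys())
    return sorted(fields)  # sorted only to make the result deterministic


# Hypothetical per-file metadata, in the shape metadata_all might yield.
per_file = [
    {"Author": "A. Writer", "Content-Type": "application/pdf"},
    {"Author": "B. Writer", "Creation-Date": "2021-01-05"},
]

schema_one = sketch_metadata_schema(iter(per_file), n_files=1)
schema_two = sketch_metadata_schema(iter(per_file), n_files=2)
```

With n_files=1 only the first file's fields are reported, so a larger n_files may be needed when files carry different metadata.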

option(opt: Options, spec: Optional[Any] = None) None

Set extraction options.

Parameters
  • opt (ads.text_dataset.options.Options) – an option defined in ads.text_dataset.options.Options

  • spec (Any, optional) – specifications that will be passed to option handler, by default None

Return type

None

read_line(path: str, udf: Union[str, Callable] = None, n_lines_per_file: int = None, total_lines: int = None, df_args: Dict = None, storage_options: Dict = None, encoding: str = 'utf-8') Union[Generator[Union[str, List[str]], None, None], DataFrame]

Read each file line by line. If path matches multiple files, lines from all files are combined.

Parameters
  • path (str) – path to data files; glob patterns are supported

  • udf ((callable | str), optional) – user defined function for processing each line, can be a callable or regex, by default None

  • n_lines_per_file (int, optional) – max number of lines read from each file, by default None

  • total_lines (int, optional) – max number of lines read from all files, by default None

  • df_args (dict, optional) – arguments passed to dataframe engine (e.g. pandas), by default None

  • storage_options (dict, optional) – storage options for cloud storage, by default None

  • encoding (str, optional) – encoding of files, by default ‘utf-8’

Returns

returns either a data generator or a dataframe.

Return type

(Generator | DataFrame)
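A regex udf with capture groups turns each line into a list of fields, which then lines up with the column names passed through df_args. The sketch below reuses the pattern and column names from the class examples and applies both line caps. sketch_read_line, the in-memory files dictionary, and the sample log lines are hypothetical. How the library treats non-matching lines is not stated here, so skipping them is an assumption.

```python
import re
from itertools import islice
from typing import Dict, Iterator, List, Optional

LOG_PATTERN = r'^\[(\S+)\s(\S+)\s(\d+)\s(\d+\:\d+\:\d+)\s(\d+)]\s(\S+)\s(\S+)\s(\S+)\s(\S+)'
COLUMNS = ["day", "month", "date", "time", "year", "type", "method", "status", "file"]


def sketch_read_line(files: Dict[str, str], udf: str,
                     n_lines_per_file: Optional[int] = None,
                     total_lines: Optional[int] = None) -> Iterator[List[str]]:
    pattern = re.compile(udf)

    def parsed() -> Iterator[List[str]]:
        for text in files.values():
            # Cap the raw lines taken from each file.
            for line in islice(text.splitlines(), n_lines_per_file):
                match = pattern.match(line)
                if match:  # assumption: non-matching lines are skipped
                    yield list(match.groups())

    # Cap the parsed lines taken across all files.
    return islice(parsed(), total_lines)


# Hypothetical file contents keyed by file name.
files = {
    "a.log": "[Mon Jan 05 10:15:30 2021] INFO GET 200 /index.html\n"
             "[Mon Jan 05 10:15:31 2021] INFO GET 404 /missing.html\n",
    "b.log": "[Tue Jan 06 08:00:00 2021] WARN POST 500 /submit\n",
}

rows = list(sketch_read_line(files, LOG_PATTERN, n_lines_per_file=1))
records = [dict(zip(COLUMNS, row)) for row in rows]
capped = list(sketch_read_line(files, LOG_PATTERN, total_lines=1))
```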

read_text(path: str, udf: Union[str, Callable] = None, total_files: int = None, storage_options: Dict = None, df_args: Dict = None, encoding: str = 'utf-8') Union[Generator[Union[str, List[str]], None, None], DataFrame]

Read each file into a text string. If path matches multiple files, each file corresponds to one record.

Parameters
  • path (str) – path to data files; glob patterns are supported

  • udf ((callable | str), optional) – user defined function for processing each line, can be a callable or regex, by default None

  • total_files (int, optional) – max number of files to read, by default None

  • df_args (dict, optional) – arguments passed to dataframe engine (e.g. pandas), by default None

  • storage_options (dict, optional) – storage options for cloud storage, by default None

  • encoding (str, optional) – encoding of files, by default ‘utf-8’

Returns

returns either a data generator or a dataframe.

Return type

(Generator | DataFrame)
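The difference from read_line is that each file contributes exactly one record. A minimal sketch, with hypothetical names throughout (sketch_read_text and the in-memory files dictionary), follows. Applying a callable udf to the whole text of each file is an assumption made for illustration.

```python
from itertools import islice
from typing import Callable, Dict, Iterator, Optional


def sketch_read_text(files: Dict[str, str],
                     udf: Optional[Callable[[str], str]] = None,
                     total_files: Optional[int] = None) -> Iterator[str]:
    # Each file contributes exactly one record; total_files caps the file count.
    for text in islice(files.values(), total_files):
        # Assumption: a callable udf is applied to the whole text of the file.
        yield udf(text) if udf is not None else text


# Hypothetical file contents keyed by file name.
files = {"a.txt": "Alpha\nBeta\n", "b.txt": "Gamma\n", "c.txt": "Delta\n"}
records = list(sketch_read_text(files, udf=str.upper, total_files=2))
```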

with_processor(processor_type: str) None

Set file processor.

Parameters

processor_type (str) – type of processor, which corresponds to format of the file.

Return type

None

class ads.text_dataset.dataset.TextDatasetFactory

Bases: object

A class that generates a dataloader given a file format.

static format(format_name: str) DataLoader

Instantiates the DataLoader class and seeds it with the right kind of FileProcessor, for example PDFProcessor for pdf. The FileProcessorFactory returns the processor based on the format type.

Parameters

format_name (str) – name of format

Returns

a DataLoader object.

Return type

ads.text_dataset.dataset.DataLoader
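The seeding step can be sketched as a static factory method that creates a loader and attaches the processor for the requested format. All class names and the PROCESSORS table below are hypothetical stand-ins. The real lookup goes through FileProcessorFactory.

```python
class SketchProcessor:
    """Hypothetical stand-in for a FileProcessor."""


class SketchPDFProcessor(SketchProcessor):
    """Hypothetical stand-in for a PDF-specific processor."""


# Hypothetical format-to-processor table.
PROCESSORS = {"txt": SketchProcessor, "pdf": SketchPDFProcessor}


class SketchLoader:
    def __init__(self):
        self.processor = None

    def with_processor(self, processor_type):
        # Look up the processor class for the format and instantiate it.
        self.processor = PROCESSORS[processor_type]()
        return self  # assumption: returned so that calls can be chained


class SketchFactory:
    @staticmethod
    def format(format_name):
        # Create a fresh loader and seed it with the matching processor.
        return SketchLoader().with_processor(format_name)


pdf_loader = SketchFactory.format("pdf")
txt_loader = SketchFactory.format("txt")
```

Each call to format produces a new loader, so options set on one loader do not leak into another.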

ads.text_dataset.extractor module

class ads.text_dataset.extractor.FileProcessor(backend: Union[str, Base] = 'default')

Bases: object

Base class for all file processors. Files are opened using the fsspec library. The default implementation in the base class assumes text files.

This class is expected to be used inside ads.text_dataset.dataset.DataLoader.

backend(backend: Union[str, Base]) None

Set backend for file processor.

Parameters

backend (ads.text_dataset.backends.Base) – a backend for file processor

Return type

None

Raises

NotSupportedError – when specified backend is not supported.

backend_map = {'default': <class 'ads.text_dataset.backends.Base'>, 'tika': <class 'ads.text_dataset.backends.Tika'>}

convert_to_text(fhandler: OpenFile, dst_path: str, fname: Optional[str] = None, storage_options: Optional[Dict] = None) str

Convert input file to a text file.

Parameters
  • fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

  • dst_path (str) – local folder or cloud storage (e.g. OCI object storage) prefix to save converted text files

  • fname (str, optional) – filename for converted output, relative to dirname or prefix, by default None

  • storage_options (dict, optional) – storage options for cloud storage, by default None

Returns

path to saved output

Return type

str

get_metadata(fhandler: OpenFile) Dict

Get metadata of a file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Returns

dictionary of metadata

Return type

dict

read_line(fhandler: OpenFile, **format_reader_kwargs: Dict) Generator[Union[str, List[str]], None, None]

Yields lines from a file.

Parameters

fhandler (fsspec.core.OpenFile) – file handler returned by fsspec

Returns

a generator that yields lines from a file

Return type

Generator

read_text(fhandler: OpenFile, **format_reader_kwargs: Dict) Generator[Union[str, List[str]], None, None]

Yield contents from the entire file.

Parameters

fhandler (fsspec.core.OpenFile) – a file handler returned by fsspec

Returns

a generator that yields text from a file

Return type

Generator

class ads.text_dataset.extractor.FileProcessorFactory

Bases: object

Factory that manages all file processors. Provides functionality to get the processor corresponding to a given file type, or to register a custom processor for a specific file format.

Examples

>>> from ads.text_dataset.extractor import FileProcessor, FileProcessorFactory
>>> FileProcessorFactory.get_processor('pdf')
>>> class CustomProcessor(FileProcessor):
...     # custom logic here
...     pass
...
>>> FileProcessorFactory.register('new_format', CustomProcessor)

static get_processor(format)

processor_map = {'doc': <class 'ads.text_dataset.extractor.WordProcessor'>, 'docx': <class 'ads.text_dataset.extractor.WordProcessor'>, 'pdf': <class 'ads.text_dataset.extractor.PDFProcessor'>, 'txt': <class 'ads.text_dataset.extractor.FileProcessor'>}

classmethod register(fmt: str, processor: FileProcessor) None

Register custom file processor for a file format.

Parameters
  • fmt (str) – file format

  • processor (FileProcessor) – custom processor

Raises

TypeError – raised when processor is not a subclass of FileProcessor.
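The registration rule can be sketched as a class-level map guarded by a subclass check. The classes below are hypothetical stand-ins rather than the library's own. What the real get_processor does for an unknown format is not documented here, so the sketch simply lets the dictionary lookup fail.

```python
class SketchFileProcessor:
    """Hypothetical stand-in for FileProcessor."""


class SketchProcessorFactory:
    """Illustrative registry; not the actual FileProcessorFactory."""

    processor_map = {"txt": SketchFileProcessor}

    @staticmethod
    def get_processor(fmt):
        # Unknown formats raise KeyError in this sketch.
        return SketchProcessorFactory.processor_map[fmt]

    @classmethod
    def register(cls, fmt, processor):
        # Accept only classes derived from the base processor.
        if not (isinstance(processor, type) and issubclass(processor, SketchFileProcessor)):
            raise TypeError("processor must be a subclass of SketchFileProcessor")
        cls.processor_map[fmt] = processor


class CsvProcessor(SketchFileProcessor):
    """Hypothetical custom processor."""


SketchProcessorFactory.register("csv", CsvProcessor)

try:
    SketchProcessorFactory.register("bad", dict)
    rejected = False
except TypeError:
    rejected = True
```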

class ads.text_dataset.extractor.PDFProcessor(backend: Union[str, Base] = 'default')

Bases: FileProcessor

Extracts text content from PDF

backend_map = {'default': <class 'ads.text_dataset.backends.Tika'>, 'pdfplumber': <class 'ads.text_dataset.backends.PDFPlumber'>, 'tika': <class 'ads.text_dataset.backends.Tika'>}

class ads.text_dataset.extractor.WordProcessor(backend: Union[str, Base] = 'default')

Bases: FileProcessor

Extracts text content from doc or docx format.

backend_map = {'default': <class 'ads.text_dataset.backends.Tika'>, 'tika': <class 'ads.text_dataset.backends.Tika'>}
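Because each processor class carries its own backend_map, the same name 'default' resolves to a different backend depending on the processor. The sketch below mirrors the three maps listed on this page with hypothetical stand-in classes. The resolution logic, including how a backend instance is accepted, is an assumption.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for the library's NotSupportedError."""


class BaseBackend:
    """Hypothetical stand-in for backends.Base."""


class TikaBackend(BaseBackend):
    """Hypothetical stand-in for backends.Tika."""


class PlumberBackend(BaseBackend):
    """Hypothetical stand-in for backends.PDFPlumber."""


class SketchFileProcessor:
    backend_map = {"default": BaseBackend, "tika": TikaBackend}

    def __init__(self, backend="default"):
        self.backend(backend)

    def backend(self, backend):
        if isinstance(backend, str):
            # Resolve a name through the map of the concrete processor class.
            if backend not in self.backend_map:
                raise NotSupportedError(f"backend {backend!r} is not supported")
            self._backend = self.backend_map[backend]()
        elif isinstance(backend, BaseBackend):
            # Assumption: an already constructed backend is used as given.
            self._backend = backend
        else:
            raise NotSupportedError("backend must be a name or a backend instance")
        return self


class SketchPDFProcessor(SketchFileProcessor):
    backend_map = {"default": TikaBackend, "pdfplumber": PlumberBackend, "tika": TikaBackend}


text_default = SketchFileProcessor()._backend
pdf_default = SketchPDFProcessor()._backend
pdf_plumber = SketchPDFProcessor("pdfplumber")._backend

try:
    SketchFileProcessor("pdfplumber")
    rejected = False
except NotSupportedError:
    rejected = True
```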

ads.text_dataset.options module

class ads.text_dataset.options.FileOption(dataloader: ads.text_dataset.dataset.DataLoader)

Bases: OptionHandler

handle(fhandler: OpenFile, spec: Any) Any

class ads.text_dataset.options.MetadataOption(dataloader: ads.text_dataset.dataset.DataLoader)

Bases: OptionHandler

handle(fhandler: OpenFile, spec: Dict) List

class ads.text_dataset.options.OptionFactory

Bases: object

static option_handler(option: Options) OptionHandler

option_handlers = {<Options.FILE_NAME: 1>: <class 'ads.text_dataset.options.FileOption'>, <Options.FILE_METADATA: 2>: <class 'ads.text_dataset.options.MetadataOption'>}

classmethod register_option(option: Options, handler) None

class ads.text_dataset.options.OptionHandler(dataloader: ads.text_dataset.dataset.DataLoader)

Bases: object

handle(fhandler: OpenFile, spec: Any) Any

class ads.text_dataset.options.Options(value)

Bases: Enum

An enumeration.

FILE_METADATA = 2
FILE_NAME = 1
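The option classes in this module are undocumented, so the following is only a sketch of how an enum-keyed handler table of this shape might dispatch. Every name is hypothetical, the handlers take a plain dictionary instead of a dataloader and an fsspec file handler, and the behavior of each handler is an assumption based on the option names and on the {'extract': ['Author']} spec used in the DataLoader examples.

```python
from enum import Enum


class SketchOptions(Enum):
    FILE_NAME = 1
    FILE_METADATA = 2


class SketchHandler:
    """Illustrative base handler; not the actual OptionHandler."""

    def handle(self, file_info, spec):
        raise NotImplementedError


class FileNameHandler(SketchHandler):
    def handle(self, file_info, spec):
        # Assumption: FILE_NAME adds the path of the source file to the record.
        return file_info["path"]


class MetadataHandler(SketchHandler):
    def handle(self, file_info, spec):
        # Assumption: FILE_METADATA extracts the fields listed under 'extract'.
        return [file_info["metadata"].get(key) for key in spec["extract"]]


# Enum member to handler class, mirroring the shape of option_handlers.
HANDLERS = {
    SketchOptions.FILE_NAME: FileNameHandler,
    SketchOptions.FILE_METADATA: MetadataHandler,
}


def option_handler(option):
    # Instantiate the handler registered for the option.
    return HANDLERS[option]()


# Hypothetical description of one file.
file_info = {"path": "docs/a.pdf", "metadata": {"Author": "A. Writer", "Title": "Notes"}}

name = option_handler(SketchOptions.FILE_NAME).handle(file_info, None)
authors = option_handler(SketchOptions.FILE_METADATA).handle(file_info, {"extract": ["Author"]})
```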

Module contents