Configure PII

Let’s walk through each line of pii.yaml to understand the options for extending and customizing the operator for our use case.

Here is an example pii.yaml with every parameter specified:

kind: operator
type: pii
version: v1
spec:
    output_directory:
        url: oci://my-bucket@my-tenancy/results
        name: mydata-out.csv
    report:
        report_filename: report.html
        show_rows: 10
        show_sensitive_content: true
    input_data:
        url: oci://my-bucket@my-tenancy/mydata.csv
    target_column: target
    detectors:
        - name: default.phone
          action: anonymize
  • Kind: The YAML file always starts with kind: operator. ads opctl can run many other kinds of YAML files, so we need to specify that this one is an operator.

  • Type: The type of operator is pii.

  • Version: The only available version is v1.

  • Spec: Spec contains the bulk of the information for the specific problem.
    • input_data: This dictionary contains the details for how to read the input data.
      • url: Insert the URI for the dataset. If it’s on object storage, use the URI pattern oci://<bucket>@<namespace>/path/to/data.csv.

    • target_column: This string specifies the name of the column in the input data that contains the user data to process.

    • detectors: This list contains the details for each detector and action that will be taken.
      • name: The string specifies the name of the detector. The format is <type>.<entity> for default detectors and spacy.<model>.<entity> for spaCy detectors. See Configure Detector for more details.

      • action: The string specifies how to process the detected entity. Defaults to mask.

    • output_directory: This dictionary contains the details for where to put the output artifacts. The directory need not exist, but must be accessible by the Operator during runtime.
      • url: Insert the URI for the output location. If it’s on object storage, use the URI pattern oci://<bucket>@<namespace>/subfolder/.

      • name: The string specifies the name of the processed data file.

    • report: (optional) This dictionary specifies details for the generated report.
      • report_filename: The name of the report file, which is placed in the output_directory location. Defaults to report.html.

      • show_sensitive_content: Whether to show sensitive content in the report. Defaults to false.

      • show_rows: The number of rows shown in the report.
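
To show how the defaults described above reduce the amount of configuration, here is a minimal sketch of a pii.yaml that omits the optional report section and leaves out the detector action. The bucket, namespace, file, and column names are placeholders. That the operator accepts a detector entry with no action is an assumption, based only on the documented default of mask.

```yaml
# Hypothetical minimal pii.yaml; all bucket, namespace, file, and column
# names are placeholders.
kind: operator
type: pii
version: v1
spec:
    input_data:
        url: oci://my-bucket@my-namespace/mydata.csv
    output_directory:
        url: oci://my-bucket@my-namespace/results
        name: mydata-out.csv
    target_column: text
    detectors:
        # action is omitted here; it is documented to default to mask
        # (assumption: an entry without action is accepted)
        - name: default.phone
    # report is omitted because it is optional; report_filename is documented
    # to default to report.html and show_sensitive_content to false
```

Leaving show_sensitive_content at its default keeps the detected values out of the generated report, which is usually the safer choice when the report is shared.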

Configure Detector

A detector consists of name and action. The name parameter defines the detector that will be used, and the action parameter defines the way to process the entity.

Configure Name

We currently support the following types of detectors:

  • default

  • spacy

Default

This type uses scrubadub’s pre-defined detectors. Specify the name in the format default.<entity> (e.g., default.phone). See the scrubadub documentation for the list of supported detectors.

Note

To de-identify addresses with this tool, scrubadub_address is required. Follow the scrubadub_address installation instructions to install the required dependencies.

spaCy

To use spaCy’s NER to identify entities, specify the name in the format spacy.<model>.<entity> (e.g., spacy.en_core_web_sm.person). The “entity” value can correspond to any entity that spaCy recognizes. For a list of available models and entities, please refer to the spaCy documentation.
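
The two naming formats can sit side by side in the same detectors list. The fragment below is a sketch that reuses only the example names given above. It belongs under spec in pii.yaml, and it assumes the en_core_web_sm model is available in the environment where the operator runs.

```yaml
detectors:
    # default.<entity>: scrubadub's pre-defined phone detector
    - name: default.phone
      action: mask
    # spacy.<model>.<entity>: spaCy NER using the en_core_web_sm model
    # (assumption: the model is installed in the operator's environment)
    - name: spacy.en_core_web_sm.person
      action: mask
```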

Configure Action

We currently support the following types of actions:

  • mask

  • remove

  • anonymize

Mask

The mask action replaces the detected entity with a placeholder that names the entity type. For example, with the following configured detector:

name: spacy.en_core_web_sm.person
action: mask

After processing, the input text “Hi, my name is John Doe.” will become “Hi, my name is {{NAME}}.”

Remove

The remove action deletes the detected entity from the text without inserting any replacement. For example, with the following configured detector:

name: spacy.en_core_web_sm.person
action: remove

After processing, the input text “Hi, my name is John Doe.” will become “Hi, my name is .”

Anonymize

The anonymize action can be used to obfuscate the detected sensitive information. Currently, we provide context-aware anonymization for name, email, and number-like entities. For example, with the following configured detector:

name: spacy.en_core_web_sm.person
action: anonymize

After processing, the input text “Hi, my name is John Doe.” will become “Hi, my name is Joe Blow.”
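
Because each detector carries its own action, different entity types can be handled differently in a single run. The fragment below is a sketch of that idea and belongs under spec in pii.yaml. The default.email name assumes that email is among the detectors scrubadub provides; check the scrubadub documentation before relying on it.

```yaml
detectors:
    # person names are replaced with substitute names
    - name: spacy.en_core_web_sm.person
      action: anonymize
    # phone numbers are replaced with a placeholder
    - name: default.phone
      action: mask
    # email addresses are deleted without replacement
    # (assumption: scrubadub exposes an "email" detector)
    - name: default.email
      action: remove
```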