Plugin#

One of the most powerful features of ADSString is that you can expand and customize it. The .plugin_register() method allows you to add properties to the ADSString class. These plugins can be provided by third-party providers or developed by you. This section demonstrates how to connect the to the OCI Language service, and how to create a custom plugin.

Custom Plugin#

You can bind additional properties to ADSString using custom plugins. This allows you to create custom text processing extensions. A plugin has access to the self.string property in ADSString class. You can define functions that perform a transformation on the text in the object. All functions defined in a plugin are bound to ADSString and accessible across all objects of that class.

Assume that your text is "I purchased the gift on this card 4532640527811543 and the dinner on 340984902710890" and you want to know what credit cards were used. The .credit_card property returns the entire credit card number. However, for privacy reasons you don’t what the entire credit card number, but the last four digits.

To solve this problem, you can create the class CreditCardLast4 and use the self.string property in ADSString to access the text associated with the object. It then calls the .credit_card method to get the credit card numbers. Then it parses this to return the last four characters in each credit card.

The first step is to define the class that you want to bind to ADSString. Use the @property decorator and define a property function. This function only takes self. The self.string is accessible with the text that is defined for a given object. The property returns a list.

class CreditCardLast4:
    @property
    def credit_card_last_4(self):
        return [x[len(x)-4:len(x)] for x in ADSString(self.string).credit_card]

After the class is defined, it must be registered with ADSString using the .register_plugin() method.

ADSString.plugin_register(CreditCardLast4)

Take the text and make it an ADSString object, and call the .credit_card_last_4 property to obtain the last four digits of the credit cards that were used.

creditcard_numbers = "I purchased the gift on this card 4532640527811543 and the dinner on 340984902710890"
s = ADSString(creditcard_numbers)
s.credit_card_last_4

['1543', '0890']

OCI Language Services#

The OCI Language service provides pretrained models that provide sophisticated text analysis at scale.

The Language service contains these pretrained language processing capabilities:

Aspect-Based Sentiment Analysis: Identifies aspects from the given text and classifies each into positive, negative, or neutral polarity.
Key Phrase Extraction: Extracts an important set of phrases from a block of text.
Language Detection: Detects languages based on the given text, and includes a confidence score.
Named Entity Recognition: Identifies common entities, people, places, locations, email, and so on.
Text Classification: Identifies the document category and subcategory that the text belongs to.

Those are accessible in ADS using the OCILanguage plugin.

ADSString.plugin_register(OCILanguage)

Aspect-Based Sentiment Analysis#

Aspect-based sentiment analysis can be used to gauge the mood or the tone of the text.

The aspect-based sentiment analysis (ABSA) supports fine-grained sentiment analysis by extracting the individual aspects in the input document. For example, a restaurant review “The driver was really friendly, but the taxi was falling apart.” contains positive sentiment toward the taxi driver aspect. Also, it has a strong negative sentiment toward the service mechanical aspect of the taxi. Classifying the overall sentiment as negative would neglect the fact that the taxi driver was nice.

ABSA classifies each of the aspects into one of the three polarity classes, positive, negative, mixed, and neutral. With the predicted sentiment for each aspect. It also provides a confidence score for each of the classes and their corresponding offsets in the input. The range of the confidence score for each class is between 0 and 1, and the cumulative scores of all the three classes sum to 1.

In the next example, the sample sentence is analyzed. The two aspects, taxi cab and driver, have their sentiments determined. It defines the location of the aspect by giving its offset position in the text, and the length of the aspect in characters. It also gives the text that defines the aspect along with the sentiment scores and which sentiment is dominant.

t = ADSString("The driver was really friendly, but the taxi was falling apart.")
t.absa

Results of Aspect-Based Sentiment analysis

Key Phrase Extraction#

Key phrase (KP) extraction is the process of extracting the words with the most relevance, and expressions from the input text. It helps summarize the content and recognizes the main topics. The KP extraction finds insights related to the main points of the text. It understands the unstructured input text, and returns keywords and KPs. The KPs consist of subjects and objects that are being talked about in the document. Any modifiers, like adjectives associated with these subjects and objects, are also included in the output. Confidence scores for each key phrase that signify how confident the algorithm is that the identified phrase is a KP. Confidence scores are a value from 0 to 1.

The following example determines the key phrases and the importance of these phrases in the text (which is the value of test_text):

Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate,
investor, and philanthropist who is a co-founder, the executive chairman and
chief technology officer (CTO) of Oracle Corporation. As of October 2019, he was
listed by Forbes magazine as the fourth-wealthiest person in the United States
and as the sixth-wealthiest in the world, with a fortune of $69.1 billion,
increased from $54.5 billion in 2018.[4] He is also the owner of the 41st
largest island in the United States, Lanai in the Hawaiian Islands with a
population of just over 3000.

s = ADSString(test_text)
s.key_phrase

Language Detection#

The language detection tool identifies which natural language the input text is in. If the document contains more than one language, the results may not be what you expect. Language detection can help make customer support interactions more personable and quicker. Customer service chatbots can interact with customers based on the language of their input text and respond accordingly. If a customer needs help with a product, the chatbot server can field the corresponding language product manual, or transfer it to a call center for the specific language.

The following is a list of some of the supported languages:

Supported Languages#
Afrikaans	Albanian	Arabic	Armenian	Azerbaijani	Basque
Belarusian	Bengali	Bosnian	Bulgarian	Burmese	Cantonese
Catalan	Cebuano	Chinese	Croatian	Czech	Danish
Dutch	Eastern Punjabi	Egyptian Arabic	English	Esperanto	Estonian
Finnish	French	Georgian	German	Greek	Hebrew
Hindi	Hungarian	Icelandic	Indonesian	Irish	Italian
Japanese	Javanese	Kannada	Kazakh	Korean	Kurdish (Sorani)
Latin	Latvian	Lithuanian	Macedonian	Malay	Malayalam
Marathi	Minangkabau	Nepali	Norwegian (Bokmal)	Norwegian (Nynorsk)	Persian
Polish	Portuguese	Romanian	Russian	Serbian	Serbo-Croatian
Slovak	Slovene	Spanish	Swahili	Swedish	Tagalog
Tamil	Telugu	Thai	Turkish	Ukrainian	Urdu
Uzbek	Vietnamese	Welsh

The next example determines the language of the text, the ISO 639-1 language code, and a probability score.

s.language_dominant

Named Entity Recognition#

Named entity recognition (NER) detects named entities in text. The NER model uses NLP, which uses machine learning to find predefined named entities. This model also provides a confidence score for each entity and is a value from 0 to 1. The returned data is the text of the entity, its position in the document, and its length. It also identifies the type of entity, a probability score that it is an entity of the stated type.

The following are the supported entity types:

DATE: Absolute or relative dates, periods, and date range.
EMAIL: Email address.
EVENT: Named hurricanes, sports events, and so on.
FAC: Facilities; Buildings, airports, highways, bridges, and so on.
GPE: Geopolitical entity; Countries, cities, and states.
IPADDRESS: IP address according to IPv4 and IPv6 standards.
LANGUAGE: Any named language.
LOCATION: Non-GPE locations, mountain ranges, and bodies of water.
MONEY: Monetary values, including the unit.
NORP: Nationalities, religious, and political groups.
ORG: Organization; Companies, agencies, institutions, and so on.
PERCENT: Percentage.
PERSON: People, including fictional characters.
PHONE_NUMBER: Supported phone numbers.
- (“GB”) - United Kingdom
- (“AU”) - Australia
- (“NZ”) - New Zealand
- (“SG”) - Singapore
- (“IN”) - India
- (“US”) - United States
PRODUCT: Vehicles, tools, foods, and so on (not services).
QUANTITY: Measurements, as weight or distance.
TIME: Anything less than 24 hours (time, duration, and so on).
URL: URL

The following example lists the named entities:

s.ner

The output gives the named entity, its location, and offset position in the text. It also gives a probability and score that this text is actually a named entity along with the type.

Text Classification#

Text classification analyses the text and identifies categories for the content with a confidence score. Text classification uses NLP techniques to find insights from textual data. It returns a category from a set of predefined categories. This text classification uses NLP and relies on the main objective lies on zero-shot learning. It classifies text with no or minimal data to train. The content of a collection of documents is analyzed to determine common themes.

The next example classifies the text and gives a probability score that the text is in that category.

s.text_classification