ADSString also supports NLP parsing and is backed by Natural Language Toolkit (NLTK) or spaCy. Unless otherwise specified, NLTK is used by default. You can extract properties, such as nouns, adjectives, word counts, parts of speech tags, and so on from text with NLP.
The ADSString class can have one backend enabled at a time. Which properties are available depends on the backend, as do the results of calling a property. The following examples provide an overview of the available parsers and how to use them. Generally, a parser supports a common set of base properties, such as word and word_count. Parsers can also support additional properties.
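The idea of a single class-wide backend that determines which properties exist can be sketched in plain Python. This is only an illustration under stated assumptions: MiniString and available_properties are hypothetical names, the property sets are a partial selection of names mentioned in this section, and none of this reflects how ADSString is implemented internally.

```python
class MiniString(str):
    """Hypothetical stand-in: a str subclass with one class-wide NLP backend."""

    # Partial property names per backend, taken from this section (illustrative).
    _PROPERTIES = {
        "nltk": {"word", "word_count", "stem", "token"},
        "spacy": {"word", "word_count", "entity", "entity_location"},
    }
    _backend = "nltk"  # mirrors the documented default

    @classmethod
    def nlp_backend(cls, name):
        """Switch the backend for the whole class, not for one instance."""
        if name not in cls._PROPERTIES:
            raise ValueError(f"unknown backend: {name!r}")
        cls._backend = name

    @classmethod
    def available_properties(cls):
        """Return the property names exposed by the active backend."""
        return sorted(cls._PROPERTIES[cls._backend])


MiniString.nlp_backend("spacy")
print(MiniString.available_properties())
```

Because the backend is stored on the class, switching it affects every string created afterwards, which is why only one backend can be active at a time in a design like this.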
The Natural Language Toolkit (NLTK) is a powerful platform for processing human language data. It supports all the base properties and, in addition, the stem and token properties. The stem property returns a list of all the stemmed tokens. It reduces a token to its word stem, the form to which affixes such as suffixes and prefixes attach, which approximates the root of the word. The token property is similar to the word property, except it returns non-alphanumeric tokens and doesn’t force tokens to be lowercase.
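The distinction between token and word can be illustrated without any NLP library. The sketch below uses a simple regular expression as a stand-in tokenizer; the functions tokens and words are hypothetical helpers, and the NLTK tokenizer behaves differently in many cases (for example, it keeps hyphenated words together).

```python
import re


def tokens(text):
    """Split text into word-like runs and single punctuation marks, keeping case."""
    return re.findall(r"\w+|[^\w\s]", text)


def words(text):
    """Keep only the alphanumeric tokens and force them to lowercase."""
    return [t.lower() for t in tokens(text) if t.isalnum()]


sample = "Oracle Corporation (CTO), 1944."
print(tokens(sample))
print(words(sample))
# Tokens that have no identical counterpart among the words:
print(sorted(set(tokens(sample)) - set(words(sample))))
```

The set difference contains the punctuation marks and the capitalized forms, while a purely numeric token such as 1944 appears in both lists and drops out.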
The following example uses a sample of text about Larry Ellison to demonstrate the use of the NLTK properties.
from ads.feature_engineering.adsstring.string import ADSString

test_text = """ Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate, investor, and philanthropist who is a co-founder, the executive chairman and chief technology officer (CTO) of Oracle Corporation. As of October 2019, he was listed by Forbes magazine as the fourth-wealthiest person in the United States and as the sixth-wealthiest in the world, with a fortune of $69.1 billion, increased from $54.5 billion in 2018. He is also the owner of the 41st largest island in the United States, Lanai in the Hawaiian Islands with a population of just over 3000. """.strip()
ADSString.nlp_backend("nltk")
s = ADSString(test_text)
s.noun[1:5]
['Joseph', 'Ellison', 'August', 'business']
s.adjective
['American', 'chief', 'fourth-wealthiest', 'largest', 'Hawaiian']
s.word[1:5]
['joseph', 'ellison', 'born', 'august']
By taking the difference between token and word, you can see that the token set contains non-alphanumeric tokens and also the original-case versions of words.
list(set(s.token) - set(s.word))[1:5]
['Oracle', '1944', '41st', 'fourth-wealthiest']
The stem property takes the list of words and stems them. It produces morphological variations of a word’s root form. The following example stems some words, and shows some of the stemmed words that were changed.
list(set(s.stem) - set(s.word))[1:5]
['fortun', 'technolog', 'increas', 'popul']
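To make the effect of stemming concrete, here is a deliberately tiny suffix-stripping sketch. It is not the algorithm NLTK uses; toy_stem and its suffix list are hypothetical, chosen only so that a few words from the sample text are reduced in a similar way to the output above.

```python
# Suffixes are checked in this order; the first match is removed.
SUFFIXES = ("ation", "ing", "ed", "y", "e", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, leaving at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied, the word is its own stem


for w in ["fortune", "technology", "increased", "population", "island"]:
    print(w, "->", toy_stem(w))
```

A real stemmer applies many more rules and conditions, but the principle is the same: the result is a truncated root such as fortun or popul, which need not be a dictionary word.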
spaCy is an advanced NLP toolkit. It helps you understand what the words mean in context, and who is doing what to whom. It helps you determine what companies and products are mentioned in a document. The spaCy backend parses the base properties, such as word_count. It also supports the following additional properties:
entity: All entities in the text.
entity_artwork: The titles of books, songs, and so on.
entity_location: Locations, facilities, and geopolitical entities, such as countries, cities, and states.
entity_organization: Companies, agencies, and institutions.
entity_person: Fictional and real people.
entity_product: Product names and so on.
lemmas: A rule-based estimation of the roots of a word.
tokens: The base tokens of the tokenization process. This is similar to
word, but it includes non-alphanumeric values and the word case is preserved.
If the spacy module is installed, you can change the NLP backend using the ADSString.nlp_backend("spacy") command:
ADSString.nlp_backend("spacy")
s = ADSString(test_text)
s.noun[1:5]
['magnate', 'investor', 'philanthropist', 'co']
s.adjective
['American', 'executive', 'chief', 'fourth', 'wealthiest', 'largest']
s.word[1:5]
['Joseph', 'Ellison', 'born', 'August']
You can identify all the locations that are mentioned in the text.
s.entity_location
['the United States', 'the Hawaiian Islands']
You can also identify the organizations that are mentioned.
s.entity_organization
['CTO', 'Oracle Corporation', 'Forbes', 'Lanai']
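The entity properties can be thought of as a grouping of recognized entities by label. The sketch below shows that grouping on hand-written (text, label) pairs. The labels are spaCy-style entity labels, but the mapping to property names and the group_entities helper are assumptions made for illustration; the sketch does not call spaCy or ADS.

```python
from collections import defaultdict

# Assumed mapping from spaCy-style entity labels to the properties listed above.
LABEL_TO_PROPERTY = {
    "GPE": "entity_location",
    "LOC": "entity_location",
    "FAC": "entity_location",
    "ORG": "entity_organization",
    "PERSON": "entity_person",
    "PRODUCT": "entity_product",
    "WORK_OF_ART": "entity_artwork",
}


def group_entities(pairs):
    """Group entity texts by property name, skipping labels with no property."""
    grouped = defaultdict(list)
    for text, label in pairs:
        prop = LABEL_TO_PROPERTY.get(label)
        if prop is not None:
            grouped[prop].append(text)
    return dict(grouped)


# Hand-written pairs; the labels are assumed, not produced by a model.
pairs = [
    ("the United States", "GPE"),
    ("Oracle Corporation", "ORG"),
    ("Forbes", "ORG"),
    ("the Hawaiian Islands", "LOC"),
    ("2018", "DATE"),
]
print(group_entities(pairs))
```

As the Lanai result above shows, a statistical entity recognizer can assign an unexpected label, so grouped output is worth reviewing before it is relied on.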
Part of Speech Tags
The POS tagger in spaCy uses a smaller number of categories than the one in NLTK. For example, spaCy has the ADJ POS for all adjectives, while NLTK uses JJ for an adjective, JJR for a comparative adjective, and JJS for a superlative adjective. For fine-grained analysis of different parts of speech, NLTK is the preferred backend. However, spaCy’s reduced category set tends to produce fewer errors, at the cost of not being as specific.
The spaCy parsers produce the following POS tags:
ADJ: adjective; big, old, green, incomprehensible, first
ADP: adposition; in, to, during
ADV: adverb; very, tomorrow, down, where, there
AUX: auxiliary; is, has (done), will (do), should (do)
CONJ: conjunction; and, or, but
CCONJ: coordinating conjunction; and, or, but
DET: determiner; a, an, the
INTJ: interjection; psst, ouch, bravo, hello
NOUN: noun; girl, cat, tree, air, beauty
NUM: numeral; 1, 2017, one, seventy-seven, IV, MMXIV
PART: particle; ’s, not
PRON: pronoun; I, you, he, she, myself, themselves, somebody
PROPN: proper noun; Mary, John, London, NATO, HBO
PUNCT: punctuation; ., (, ), ?
SCONJ: subordinating conjunction; if, while, that
SYM: symbol; $, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB: verb; run, runs, running, eat, ate, eating
X: other; sfpksdpsxmsa
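The relationship between the fine-grained tags used by NLTK and the coarse categories listed above can be sketched as a lookup table. The mapping below is partial and approximate (for example, some verb forms would be tagged AUX rather than VERB in practice), and coarsen is a hypothetical helper, not part of ADS, NLTK, or spaCy.

```python
# Partial, illustrative mapping from fine-grained tags to coarse categories.
FINE_TO_COARSE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
}


def coarsen(tagged, default="X"):
    """Replace each fine-grained tag with its coarse category."""
    return [(word, FINE_TO_COARSE.get(tag, default)) for word, tag in tagged]


# Hand-tagged input for illustration.
tagged = [("American", "JJ"), ("wealthier", "JJR"), ("largest", "JJS"), ("island", "NN")]
print(coarsen(tagged))
```

All three adjective tags collapse into ADJ, which shows the information that is lost when moving from the fine-grained tag set to the coarse one.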