Processing Textual Data with Python
@raphaelmcobe
## Introduction/Motivation

--

### A (few) word(s) on unstructured data

- Structured or semi-structured data includes fields or markup that enable it to be *easily parsed by a computer*;
- Language is *unstructured data* that has been *produced by people* to be *understood by other people*;
- Retrieving information from unstructured data *is essential*!
- It is estimated that [80% of the world’s data is unstructured](https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/).
- Unstructured data *is not random!*
  - It has linguistic properties that make it understandable to other people.

--

### Language as Data

- Text mining:
  - The process of deriving insights from (unstructured) text data;
- Natural Language Processing (NLP):
  - The part of computer science and artificial intelligence that deals with human languages;
- Sentiment Analysis:
  - Just a small variant of text mining.

--

### Understanding text is hard

- Real-world data is often *messy* and *noisy*;
- Several factors make this process hard:
  - The encoding scheme used (ASCII, UTF-8, UTF-16, Latin-1, etc.) (see the sketch below);
  - Hundreds of natural languages, each with different syntax rules;
  - Words can be ambiguous (their meaning depends on context);
  - Case sensitivity, punctuation, numbers, hyperlinks, and the use of emoticons and emojis 😜;
  - Synonyms, abbreviations, acronyms, and spelling errors;
  - We often need to understand which words in a sentence are nouns and which are verbs.
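The encoding pitfall is easy to demonstrate; a small illustration (my example, not from the original deck): the same bytes decoded with the wrong codec silently produce garbled text.

```python
>>> # UTF-8 encodes "é" as two bytes...
>>> "café".encode("utf-8")
b'caf\xc3\xa9'
>>> # ...which, decoded with the wrong codec (Latin-1), become mojibake
>>> "café".encode("utf-8").decode("latin-1")
'cafÃ©'
```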
---

## Processing Textual data

--

### How to process textual data?

- We say “the numbers don’t lie”, not “the text doesn’t lie”;
- Machine learning:
  - Train statistical models on language as it changes;
  - Build models of language on *context-specific corpora*;
  - A model that describes language and can make inferences based on that description;
- To take advantage of the predictability of text, we need to define a constrained, numeric decision space on which the model can compute (a minimal sketch follows).
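A minimal sketch of such a numeric decision space, assuming scikit-learn is available (the library choice and the toy corpus are mine, not the deck's):

```python
# A bag-of-words matrix as a "constrained, numeric decision space".
# CountVectorizer is one common option; the slides do not prescribe a library.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the text does not lie", "the numbers do not lie"]
vectorizer = CountVectorizer()

# Each document becomes a row of word counts: a numeric space
# on which a statistical model can compute.
matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```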
--

### How to process textual data?

- Break the text processing into steps:
  - *Finding Parts of Text*
  - Finding Sentences
  - Finding People and Things
  - Detecting Parts of Speech
  - *Classifying Text and Documents*
  - Extracting Relationships
  - Combined Approaches

---

## Finding Parts of Text

Examples taken from the [Real Python Regular Expressions Tutorial](https://realpython.com/regex-python/)

--

### Regular Expressions

A (very) brief history

- In 1951, mathematician Stephen Cole Kleene described the concept of a regular language: a language that is recognizable by a finite automaton;
- In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor;
- Regular expressions then appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern.

--

### The Python `re` module

- Regex functionality in Python resides in a module named `re`. The `re` module contains many useful functions and methods. Let's play a little bit with the `search()` function:
- `re.search(<regex>, <string>)`

```python[1|2|3-4]
>>> import re
>>> text = "foo123foo"
>>> re.search("123", text)
<re.Match object; span=(3, 6), match='123'>
```
--

### The Python `re` module

- `span=(3, 6)` indicates the portion of `<string>` in which the match was found. This means the same thing as it would in slice notation:

```python[1|2]
>>> text[3:6]
'123'
```

--

### Regular Expressions in Python

- Any character or sequence of characters is a valid regular expression;
- The `search()` function returns the first sequence of characters that matches the RE;

```python[2-3|4-5]
>>> text = "abc123abc"
>>> re.search("a", text)
<re.Match object; span=(0, 1), match='a'>
>>> re.search("abc", text)
<re.Match object; span=(0, 3), match='abc'>
```
--

### Regular Expressions in Python

Metacharacters

- The real power of regex matching in Python emerges when `<regex>` contains special characters called *metacharacters*;
- These have a unique meaning to the regex matching engine;
- **Character class**:
  - Matches any single character that is in the class;
  - Defined using square brackets (`[]`);
  - Can be used to specify intervals.

```python[1-2|3-4|5-6|7-8]
>>> re.search('[0-9]', "abc123abc")
<re.Match object; span=(3, 4), match='1'>
>>> re.search('[2-9]', "abc123abc")
<re.Match object; span=(4, 5), match='2'>
>>> re.search('[0-9][0-9][0-9]', "abc123abc")
<re.Match object; span=(3, 6), match='123'>
>>> re.search('[2-9][0-9]', "abc123abc")
<re.Match object; span=(4, 6), match='23'>
```
--

### Regular Expressions in Python

Metacharacters - character classes

- A class can also have all of its characters enumerated:

```python[1-2|3-4]
>>> re.search('ab[cde]', "abc123abc")
<re.Match object; span=(0, 3), match='abc'>
>>> re.search('ab[cde]', "abd123abc")
<re.Match object; span=(0, 3), match='abd'>
```

- Complement a character class by specifying `^` as its first character:

```python
>>> re.search('ab[^cde]', "abc abd abe abf")
<re.Match object; span=(12, 15), match='abf'>
```
--

### Regular Expressions in Python

Metacharacters

- The dot (`.`) metacharacter matches any character except a newline, so it functions like a wildcard:

```python[1-2|3-4]
>>> re.search('1.3', "abc123abc")
<re.Match object; span=(3, 6), match='123'>
>>> print(re.search('foo.bar', 'foobar'))
None
```

--

### Regular Expressions in Python

Metacharacters

- Quantifiers:
  - Indicate how many times a portion of the regex must occur for the match to succeed;
  - `*` matches zero or more repetitions of the preceding regex:

```python[1-2|3-4|5-6]
>>> re.search('foo-*bar', 'foobar')    # Zero dashes
<re.Match object; span=(0, 6), match='foobar'>
>>> re.search('foo-*bar', 'foo-bar')   # One dash
<re.Match object; span=(0, 7), match='foo-bar'>
>>> re.search('foo-*bar', 'foo--bar')  # Two dashes
<re.Match object; span=(0, 8), match='foo--bar'>
```

- `.*` matches zero or more occurrences of any character:

```python
>>> re.search('foo.*bar', '# foo $qux@grault % bar #')
<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>
```
--

### Regular Expressions in Python

Metacharacters

- Quantifiers:
  - `+` matches one or more repetitions of the preceding regex;
  - `?` matches zero or one repetition of the preceding regex;
  - `{m}` matches exactly `m` repetitions of the preceding regex;
  - `{m,n}` matches any number of repetitions of the preceding regex from `m` to `n`, inclusive (see the sketch after this list).
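A quick sketch of these quantifiers in action (the example strings are mine, not from the original deck):

```python
>>> print(re.search('foo-+bar', 'foobar'))  # + needs at least one dash
None
>>> re.search('foo-?bar', 'foo-bar')        # ? allows zero or one dash
<re.Match object; span=(0, 7), match='foo-bar'>
>>> re.search('x{3}', 'xxxy')               # exactly three repetitions
<re.Match object; span=(0, 3), match='xxx'>
>>> re.search('x{2,4}', 'xxxxxx')           # two to four, matched greedily
<re.Match object; span=(0, 4), match='xxxx'>
```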
--

### Regular Expressions in Python

Metacharacters

- Anchors: zero-width matches.
  - They don’t match any actual characters in the search string;
  - Instead, they dictate a particular location in the search string where a match must occur;
- `^` or `\A` anchor a match to the start of the text:

```python[1-2|3-4]
>>> re.search('^foo', 'foobar')
<re.Match object; span=(0, 3), match='foo'>
>>> print(re.search('^foo', 'barfoo'))
None
```

- `$` or `\Z` anchor a match to the end of the text:

```python[1-2|3-4]
>>> re.search('bar$', 'foobar')
<re.Match object; span=(3, 6), match='bar'>
>>> print(re.search('bar$', 'barfoo'))
None
```

--

### Regular Expressions

Why are they so important?

- Removing unimportant text:
  - Use the `re.sub()` function;
- Strip all non-alphanumeric characters:

```python
>>> re.sub('[^a-zA-Z0-9_]+', '', "foo1!bar#~2")
'foo1bar2'
```

- Remove punctuation that ends a word (`\s` represents any whitespace character):

```python
>>> re.sub(r'[^a-zA-Z0-9_]\s', ' ', "Foo. Bar? Foo.Bar")
'Foo Bar Foo.Bar'
```

- Remove all numbers from the text:

```python
>>> re.sub('[0-9]+', '', "Foo 123 Bar 456")
'Foo  Bar '
```

---

## Further Cleaning the Text

Standard text mining procedures extract useful features from the text contents, including: *tokenization*, *removing stopwords*, and *lemmatization/stemming*.
--

### Tokenization

- The process of breaking strings into tokens, which in turn are small structures or units;
- Taking individual words rather than sentences breaks down the connections between words;
- It is efficient and convenient for computers to analyze;
- Most of the time, discovering which words appear in a text and counting how often they appear is sufficient to give insightful results;
- Transforms a text into a list of words;
- Removes useless information (using Regular Expressions!).

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

- The leading platform for building Python programs that work with human language data;
- A wonderful tool for teaching, and working in, computational linguistics using Python;
- Free reference [book](http://www.nltk.org/book/);
- Open Source.

```python
>>> import nltk
>>> nltk.download()
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Tokenization

```python[1-2|4-8|10-11]
# importing word_tokenize from nltk
from nltk.tokenize import word_tokenize

text = """
Ground control to Major Tom (10, 9, 8, 7).
Commencing countdown, engines on (6, 5, 4, 3).
Check ignition, and may God's love be with you (2, 1, lift off).
"""

token = word_tokenize(text)
print(token)
```

```
['Ground', 'control', 'to', 'Major', 'Tom', '(', '10', ',', '9', ',', '8', ',', '7', ')', '.', 'Commencing', 'countdown', ',', 'engines', 'on', '(', '6', ',', '5', ',', '4', ',', '3', ')', '.', 'Check', 'ignition', ',', 'and', 'may', 'God', "'s", 'love', 'be', 'with', 'you', '(', '2', ',', '1', ',', 'lift', 'off', ')', '.']
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Cleaning Before Tokenization

```python[1-2|4-5|7-8|10-12]
# Remove non-alphanumeric characters
text = re.sub(r'[^a-zA-Z0-9]', ' ', text)

# Remove all numbers
text = re.sub(r'[0-9]+', '', text)

# Remove extra whitespace
text = re.sub(r'\s+', ' ', text)

# Remove leading and trailing whitespace
text = re.sub(r'^\s', '', text)
text = re.sub(r'\s$', '', text)
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Tokenization

```python
# Now tokenize
token = word_tokenize(text)
print(token)
```

```
['Ground', 'control', 'to', 'Major', 'Tom', 'Commencing', 'countdown', 'engines', 'on', 'Check', 'ignition', 'and', 'may', 'God', 's', 'love', 'be', 'with', 'you', 'lift', 'off']
```

- Finding the frequency distribution of the tokens:

```python[1|3-4]
from nltk.probability import FreqDist

fdist = FreqDist(token)
fdist
```

```
FreqDist({'Ground': 1, 'control': 1, 'to': 1, 'Major': 1, 'Tom': 1, 'Commencing': 1, 'countdown': 1, 'engines': 1, 'on': 1, 'Check': 1, ...})
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Tokenization

- Let's inspect the lines from [Ghostbusters](https://www.imdb.com/title/tt0087332/) in the [Cornell Movie Lines Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):

```python[1-4|6-7|9-11]
# load the corpus and select the Ghostbusters lines
import pandas as pd
movie_db = pd.read_csv("./movie_lines_db.csv")
ghostbusters_lines = movie_db[movie_db['movie_name']=='ghostbusters'].iloc[0,1]

# tokenize
token = word_tokenize(ghostbusters_lines)

# build the frequency distribution and print the top 10 occurrences:
fdist = FreqDist(token)
fdist.most_common(10)
```

```
[('.', 535), ('I', 223), (',', 201), ('the', 162), ('?', 160), ('you', 158), ('to', 123), ('a', 110), ("'s", 85), ('it', 82)]
```
--

### Removing Stop Words

- Remove the useless words;
- Stopwords are words that appear frequently in many text fragments, but without significant meaning;
- They do not provide any meaning and are usually removed from texts;
  - e.g.: ‘I’, ‘the’, ‘a’, ‘of’;
- Import the stopwords from the NLTK library.

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Removing stopwords

- Continuing with the [Ghostbusters](https://www.imdb.com/title/tt0087332/) lines from the [Cornell Movie Lines Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):

```python
# importing stopwords from the nltk library
from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))

# Create a list of tokens, converting all text to lower case
ghostbusters_token = word_tokenize(ghostbusters_lines.lower())

# Keep only the words that are not in the set of stopwords
clean_lines = [word for word in ghostbusters_token if word not in english_stopwords]
print(clean_lines)
```

```
["'s", 'matter', ',', 'dear', '?', 'uuuuuuugh', '!', '!', 'hey', ',', 'sweetheart', ',', 'cut',...]
```

--

### Stemming

- Refers to normalizing words into their base or root form;
- Two common stemming methods: Porter stemming (removes common morphological and inflectional endings from words) and Lancaster stemming (a more aggressive stemming algorithm; see the comparison sketch below).

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Stemming

- Continuing with the [Ghostbusters](https://www.imdb.com/title/tt0087332/) lines from the [Cornell Movie Lines Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):

```python
# Importing PorterStemmer from the nltk library
from nltk.stem import PorterStemmer

pst = PorterStemmer()
stemmed_tokens = [pst.stem(word) for word in clean_lines]
print(stemmed_tokens)
```

```
[...,'probabl', 'would', "'ve", '.', "n't", 'glad', 'wait', '?','go', 'say', '``', 'eight', '.', "''", "'re", 'fantast', '!','eight', "o'clock", '?', 'okay', '.', 'give', 'second', '.','leav',...]
```
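To see how much more aggressive the Lancaster algorithm is, here is a small side-by-side sketch (the example words are mine, not from the deck; both stemmers ship with NLTK):

```python
# Comparing the two stemming methods mentioned above
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster typically produces shorter, more aggressive stems,
# e.g. 'maxim' where Porter keeps 'maximum'
for word in ["maximum", "running", "fantastic"]:
    print(word, porter.stem(word), lancaster.stem(word))
```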
---

## Sentiment Analysis

--

### Sentiment Analysis

- Also known as opinion mining;
- Used to determine whether a piece of text is positive, negative, or neutral;
- Focuses on polarity (positive, negative, neutral);
- Helps businesses monitor brand and product sentiment in customer feedback, and understand customer needs;
- Can use heavyweight machine learning techniques, such as artificial neural networks, or can be rule based.

--

### Sentiment Analysis

Rule Based

- Includes various NLP techniques developed in computational linguistics, such as stemming, tokenization, and text cleaning;
- Uses sentiment lexicons (i.e. lists of words and expressions);
- Very naive, since it doesn't take into account how words are combined in a sequence;
- Leaves out the context;
- Often requires fine-tuning and maintenance (a naive sketch follows).
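A minimal sketch of such a rule-based scorer, assuming a toy hand-made lexicon (the word lists and scoring rule are illustrative, not a real lexicon):

```python
# A toy rule-based sentiment scorer: count lexicon hits in the tokens.
# The two word sets below are illustrative stand-ins for a real
# sentiment lexicon such as VADER or the subjectivity lexicon.
POSITIVE = {"good", "great", "fantastic", "love", "best"}
NEGATIVE = {"bad", "terrible", "hate", "worst"}

def naive_sentiment(tokens):
    # +1 for each positive word, -1 for each negative word
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(naive_sentiment("this was the best idea ever".split()))   # positive
print(naive_sentiment("this was the worst idea ever".split()))  # negative
```

This also illustrates exactly the weakness named above: a phrase like “not good” would still be scored as positive, because word order and context are ignored.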
--

### Python Sentiment Analysis

After cleaning, tokenizing and stemming the text:

- Use well-known lexicons, e.g., [The subjectivity lexicon](http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/) or the [Valence Aware Dictionary and sEntiment Reasoner - VADER](https://github.com/cjhutto/vaderSentiment)

```python[1-4|6-7|9-11|13-14]
# first load VADER
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# instantiate the sentiment analyzer:
vader = SentimentIntensityAnalyzer()

# check the scores for some phrases:
print(vader.polarity_scores("This was the best idea I've had in a long time"))
{'neg': 0.0, 'neu': 0.682, 'pos': 0.318, 'compound': 0.6369}

print(vader.polarity_scores("This was the worst idea I've had in a long time"))
{'neg': 0.313, 'neu': 0.687, 'pos': 0.0, 'compound': -0.6249}
```

---

## Questions?

---

## References

--

### Examples mainly taken from:

- https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4
- https://monkeylearn.com/sentiment-analysis/
- https://medium.com/towards-artificial-intelligence/text-mining-in-python-steps-and-examples-78b3f8fd913b
- https://towardsdatascience.com/a-step-by-step-tutorial-for-conducting-sentiment-analysis-a7190a444366
- https://towardsdatascience.com/getting-started-with-text-analysis-in-python-ca13590eb4f7

--

### Some books:

- Ingersoll, Grant; Morton, Thomas; Farris, Andrew. Taming Text: How to Find, Organize, and Manipulate It. Shelter Island, NY: Manning, 2013.
- Bird, Steven; Klein, Ewan; Loper, Edward. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
- Vasiliev, Yuli. Natural Language Processing with Python and spaCy: A Practical Introduction. No Starch Press, 2020.
- Bengfort, Benjamin; Bilbro, Rebecca; Ojeda, Tony. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly Media, Inc., 2018.

---

## Thanks!

#### raphaelmcobe@gmail.com
#### [@raphaelmcobe](https://twitter.com/raphaelmcobe)
#### [CODATA-RDA Data Science Schools](https://www.datascienceschools.org/)