Processing Textual Data with Python
@raphaelmcobe
## Introduction/Motivation

--

### A (few) word(s) on unstructured data

- Structured or semi-structured data includes fields or markup that enable it to be *easily parsed by a computer*;
- Language is *unstructured data* that has been *produced by people* to be *understood by other people*;
- Retrieving information from unstructured data *is essential*!
- It is estimated that [80% of the world’s data is unstructured](https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/).
- Unstructured data *is not random!*
  - It has linguistic properties that make it understandable to other people.

--

### Language as Data

- Text mining:
  - The process of deriving insights from (unstructured) text data;
- Natural Language Processing (NLP):
  - The part of computer science and artificial intelligence that deals with human languages;
- Sentiment Analysis:
  - Just a small variant of text mining.

--

### Understanding text is hard

- Real-world data is often *messy* and *noisy*;
- Several factors make this process hard:
  - The encoding scheme used (ASCII, UTF-8, UTF-16, Latin-1, etc.) (see the sketch below);
  - Hundreds of natural languages, each with different syntax rules;
  - Words can be ambiguous (their meaning depends on context);
  - Case sensitivity, punctuation, numbers, hyperlinks, and the use of emoticons and emojis 😜;
  - Synonyms, abbreviations, acronyms, and spelling errors;
  - We often need to understand which words in a sentence are nouns and which are verbs.
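The encoding pitfall is easy to demonstrate; a small illustration (my example, not from the original deck): the same bytes decoded with the wrong codec silently produce garbled text.

```python
>>> # UTF-8 encodes "é" as two bytes...
>>> "café".encode("utf-8")
b'caf\xc3\xa9'
>>> # ...which, decoded with the wrong codec (Latin-1), become mojibake
>>> "café".encode("utf-8").decode("latin-1")
'cafÃ©'
```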
---

## Processing Textual data

--

### How to process textual data?

- We say “the numbers don’t lie”, not “the text doesn’t lie”;
- Machine learning:
  - Train statistical models on language as it changes;
  - Build models of language on *context-specific corpora*;
  - A model that describes language and can make inferences based on that description;
- To take advantage of the predictability of text, we need to define a constrained, numeric decision space on which the model can compute (a minimal sketch follows).
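A minimal sketch of such a numeric decision space, assuming scikit-learn is available (the library choice and the toy corpus are mine, not the deck's):

```python
# A bag-of-words matrix as a "constrained, numeric decision space".
# CountVectorizer is one common option; the slides do not prescribe a library.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the text does not lie", "the numbers do not lie"]
vectorizer = CountVectorizer()

# Each document becomes a row of word counts: a numeric space
# on which a statistical model can compute.
matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```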
--

### How to process textual data?

- Break the text processing into steps:
  - *Finding Parts of Text*
  - Finding Sentences
  - Finding People and Things
  - Detecting Parts of Speech
  - *Classifying Text and Documents*
  - Extracting Relationships
  - Combined Approaches

---

## Finding Parts of Text

Examples taken from the [Real Python Regular Expressions Tutorial](https://realpython.com/regex-python/)

--

### Regular Expressions

A (very) brief history

- In 1951, mathematician Stephen Cole Kleene described the concept of a regular language: a language that is recognizable by a finite automaton;
- In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor;
- Regular expressions then appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern.

--

### The Python `re` module

- Regex functionality in Python resides in a module named `re`. The `re` module contains many useful functions and methods. Let's play a little bit with the `search()` function:
- `re.search(<regex>, <string>)`

```python[1|2|3-4]
>>> import re
>>> text = "foo123foo"
>>> re.search("123", text)
<re.Match object; span=(3, 6), match='123'>
```
--

### The Python `re` module

- `span=(3, 6)` indicates the portion of `<string>` in which the match was found. This means the same thing as it would in slice notation:

```python[1|2]
>>> text[3:6]
'123'
```

--

### Regular Expressions in Python

- Any character or sequence of characters is a valid regular expression;
- The `search()` function returns the first sequence of characters that matches the RE;

```python[2-3|4-5]
>>> text = "abc123abc"
>>> re.search("a", text)
<re.Match object; span=(0, 1), match='a'>
>>> re.search("abc", text)
<re.Match object; span=(0, 3), match='abc'>
```
--

### Regular Expressions in Python

Metacharacters

- The real power of regex matching in Python emerges when `<regex>` contains special characters called *metacharacters*;
- These have a unique meaning to the regex matching engine;
- **Character class**:
  - Matches any single character that is in the class;
  - Defined using square brackets (`[]`);
  - Can be used to specify intervals.

```python[1-2|3-4|5-6|7-8]
>>> re.search('[0-9]', "abc123abc")
<re.Match object; span=(3, 4), match='1'>
>>> re.search('[2-9]', "abc123abc")
<re.Match object; span=(4, 5), match='2'>
>>> re.search('[0-9][0-9][0-9]', "abc123abc")
<re.Match object; span=(3, 6), match='123'>
>>> re.search('[2-9][0-9]', "abc123abc")
<re.Match object; span=(4, 6), match='23'>
```
--

### Regular Expressions in Python

Metacharacters - character classes

- A class can also have all of its characters enumerated:

```python[1-2|3-4]
>>> re.search('ab[cde]', "abc123abc")
<re.Match object; span=(0, 3), match='abc'>
>>> re.search('ab[cde]', "abd123abc")
<re.Match object; span=(0, 3), match='abd'>
```

- Complement a character class by specifying `^` as its first character:

```python
>>> re.search('ab[^cde]', "abc abd abe abf")
<re.Match object; span=(12, 15), match='abf'>
```
--

### Regular Expressions in Python

Metacharacters

- The dot (`.`) metacharacter matches any character except a newline, so it functions like a wildcard:

```python[1-2|3-4]
>>> re.search('1.3', "abc123abc")
<re.Match object; span=(3, 6), match='123'>
>>> print(re.search('foo.bar', 'foobar'))
None
```

--

### Regular Expressions in Python

Metacharacters

- Quantifiers:
  - Indicate how many times a portion of the regex must occur for the match to succeed;
  - `*` matches zero or more repetitions of the preceding regex:

```python[1-2|3-4|5-6]
>>> re.search('foo-*bar', 'foobar')    # Zero dashes
<re.Match object; span=(0, 6), match='foobar'>
>>> re.search('foo-*bar', 'foo-bar')   # One dash
<re.Match object; span=(0, 7), match='foo-bar'>
>>> re.search('foo-*bar', 'foo--bar')  # Two dashes
<re.Match object; span=(0, 8), match='foo--bar'>
```

- `.*` matches zero or more occurrences of any character:

```python
>>> re.search('foo.*bar', '# foo $qux@grault % bar #')
<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>
```
--

### Regular Expressions in Python

Metacharacters

- Quantifiers:
  - `+` matches one or more repetitions of the preceding regex;
  - `?` matches zero or one repetition of the preceding regex;
  - `{m}` matches exactly `m` repetitions of the preceding regex;
  - `{m,n}` matches any number of repetitions of the preceding regex from `m` to `n`, inclusive (see the sketch after this list).
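A quick sketch of these quantifiers in action (the example strings are mine, not from the original deck):

```python
>>> print(re.search('foo-+bar', 'foobar'))  # + needs at least one dash
None
>>> re.search('foo-?bar', 'foo-bar')        # ? allows zero or one dash
<re.Match object; span=(0, 7), match='foo-bar'>
>>> re.search('x{3}', 'xxxy')               # exactly three repetitions
<re.Match object; span=(0, 3), match='xxx'>
>>> re.search('x{2,4}', 'xxxxxx')           # two to four, matched greedily
<re.Match object; span=(0, 4), match='xxxx'>
```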
--

### Regular Expressions in Python

Metacharacters

- Anchors: zero-width matches.
  - They don’t match any actual characters in the search string;
  - Instead, they dictate a particular location in the search string where a match must occur;
- `^` or `\A` anchor a match to the start of the text:

```python[1-2|3-4]
>>> re.search('^foo', 'foobar')
<re.Match object; span=(0, 3), match='foo'>
>>> print(re.search('^foo', 'barfoo'))
None
```

- `$` or `\Z` anchor a match to the end of the text:

```python[1-2|3-4]
>>> re.search('bar$', 'foobar')
<re.Match object; span=(3, 6), match='bar'>
>>> print(re.search('bar$', 'barfoo'))
None
```

--

### Regular Expressions

Why are they so important?

- Removing unimportant text:
  - Use the `re.sub()` function;
- Strip all non-alphanumeric characters:

```python
>>> re.sub('[^a-zA-Z0-9_]+', '', "foo1!bar#~2")
'foo1bar2'
```

- Remove punctuation that ends a word (`\s` represents any whitespace character):

```python
>>> re.sub(r'[^a-zA-Z0-9_]\s', ' ', "Foo. Bar? Foo.Bar")
'Foo Bar Foo.Bar'
```

- Remove all numbers from the text:

```python
>>> re.sub('[0-9]+', '', "Foo 123 Bar 456")
'Foo  Bar '
```

---

## Further Cleaning the Text

Standard text mining procedures extract useful features from the text contents, including: *tokenization*, *removing stopwords*, and *lemmatization/stemming*.
--

### Tokenization

- The process of breaking strings into tokens, which in turn are small structures or units;
- Taking individual words rather than sentences breaks down the connections between words;
- It is efficient and convenient for computers to analyze;
- Most of the time, discovering which words appear in a text and counting how often they appear is sufficient to give insightful results;
- Transforms a text into a list of words;
- Removes useless information (using Regular Expressions!).

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

- The leading platform for building Python programs that work with human language data;
- A wonderful tool for teaching, and working in, computational linguistics using Python;
- Free reference [book](http://www.nltk.org/book/);
- Open Source.

```python
>>> import nltk
>>> nltk.download()
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Tokenization

```python[1-2|4-8|10-11]
# importing word_tokenize from nltk
from nltk.tokenize import word_tokenize

text = """
Ground control to Major Tom (10, 9, 8, 7).
Commencing countdown, engines on (6, 5, 4, 3).
Check ignition, and may God's love be with you (2, 1, lift off).
"""

token = word_tokenize(text)
print(token)
```

```
['Ground', 'control', 'to', 'Major', 'Tom', '(', '10', ',', '9', ',', '8', ',', '7', ')', '.', 'Commencing', 'countdown', ',', 'engines', 'on', '(', '6', ',', '5', ',', '4', ',', '3', ')', '.', 'Check', 'ignition', ',', 'and', 'may', 'God', "'s", 'love', 'be', 'with', 'you', '(', '2', ',', '1', ',', 'lift', 'off', ')', '.']
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Cleaning Before Tokenization

```python[1-2|4-5|7-8|10-12]
# Remove non-alphanumeric characters
text = re.sub(r'[^a-zA-Z0-9]', ' ', text)

# Remove all numbers
text = re.sub(r'[0-9]+', '', text)

# Remove extra whitespace
text = re.sub(r'\s+', ' ', text)

# Remove leading and trailing whitespace
text = re.sub(r'^\s', '', text)
text = re.sub(r'\s$', '', text)
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Tokenization

```python
# Now tokenize
token = word_tokenize(text)
print(token)
```

```
['Ground', 'control', 'to', 'Major', 'Tom', 'Commencing', 'countdown', 'engines', 'on', 'Check', 'ignition', 'and', 'may', 'God', 's', 'love', 'be', 'with', 'you', 'lift', 'off']
```

- Finding the frequency distribution of the tokens:

```python[1|3-4]
from nltk.probability import FreqDist

fdist = FreqDist(token)
fdist
```

```
FreqDist({'Ground': 1, 'control': 1, 'to': 1, 'Major': 1, 'Tom': 1, 'Commencing': 1, 'countdown': 1, 'engines': 1, 'on': 1, 'Check': 1, ...})
```

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Tokenization

- Let's inspect the lines from [Ghostbusters](https://www.imdb.com/title/tt0087332/) in the [Cornell Movie Lines Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):

```python[1-4|6-7|9-11]
# load the corpus and select the Ghostbusters lines
import pandas as pd
movie_db = pd.read_csv("./movie_lines_db.csv")
ghostbusters_lines = movie_db[movie_db['movie_name']=='ghostbusters'].iloc[0,1]

# tokenize
token = word_tokenize(ghostbusters_lines)

# build the frequency distribution and print the top 10 occurrences:
fdist = FreqDist(token)
fdist.most_common(10)
```

```
[('.', 535), ('I', 223), (',', 201), ('the', 162), ('?', 160), ('you', 158), ('to', 123), ('a', 110), ("'s", 85), ('it', 82)]
```
--

### Removing Stop Words

- Remove the useless words;
- Stopwords are words that appear frequently in many text fragments, but without significant meaning;
- They do not provide any meaning and are usually removed from texts;
  - e.g.: ‘I’, ‘the’, ‘a’, ‘of’;
- Import the stopwords from the NLTK library.

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Removing stopwords

- Continuing with the [Ghostbusters](https://www.imdb.com/title/tt0087332/) lines from the [Cornell Movie Lines Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):

```python
# importing stopwords from the nltk library
from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))

# Create a list of tokens, converting all text to lower case
ghostbusters_token = word_tokenize(ghostbusters_lines.lower())

# Keep only the words that are not in the set of stopwords
clean_lines = [word for word in ghostbusters_token if word not in english_stopwords]
print(clean_lines)
```

```
["'s", 'matter', ',', 'dear', '?', 'uuuuuuugh', '!', '!', 'hey', ',', 'sweetheart', ',', 'cut',...]
```

--

### Stemming

- Refers to normalizing words into their base or root form;
- Two common stemming methods: Porter stemming (removes common morphological and inflectional endings from words) and Lancaster stemming (a more aggressive stemming algorithm; see the comparison sketch below).

--

### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)

Stemming

- Continuing with the [Ghostbusters](https://www.imdb.com/title/tt0087332/) lines from the [Cornell Movie Lines Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):

```python
# Importing PorterStemmer from the nltk library
from nltk.stem import PorterStemmer

pst = PorterStemmer()
stemmed_tokens = [pst.stem(word) for word in clean_lines]
print(stemmed_tokens)
```

```
[...,'probabl', 'would', "'ve", '.', "n't", 'glad', 'wait', '?','go', 'say', '``', 'eight', '.', "''", "'re", 'fantast', '!','eight', "o'clock", '?', 'okay', '.', 'give', 'second', '.','leav',...]
```
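To see how much more aggressive the Lancaster algorithm is, here is a small side-by-side sketch (the example words are mine, not from the deck; both stemmers ship with NLTK):

```python
# Comparing the two stemming methods mentioned above
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster typically produces shorter, more aggressive stems,
# e.g. 'maxim' where Porter keeps 'maximum'
for word in ["maximum", "running", "fantastic"]:
    print(word, porter.stem(word), lancaster.stem(word))
```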
---

## Sentiment Analysis

--

### Sentiment Analysis

- Also known as opinion mining;
- Used to determine whether a piece of text is positive, negative, or neutral;
- Focuses on polarity (positive, negative, neutral);
- Helps businesses monitor brand and product sentiment in customer feedback, and understand customer needs;
- Can use heavyweight machine learning techniques, such as artificial neural networks, or can be rule based.

--

### Sentiment Analysis

Rule Based

- Includes various NLP techniques developed in computational linguistics, such as stemming, tokenization, and text cleaning;
- Uses sentiment lexicons (i.e. lists of words and expressions);
- Very naive, since it doesn't take into account how words are combined in a sequence;
- Leaves out the context;
- Often requires fine-tuning and maintenance (a naive sketch follows).
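A minimal sketch of such a rule-based scorer, assuming a toy hand-made lexicon (the word lists and scoring rule are illustrative, not a real lexicon):

```python
# A toy rule-based sentiment scorer: count lexicon hits in the tokens.
# The two word sets below are illustrative stand-ins for a real
# sentiment lexicon such as VADER or the subjectivity lexicon.
POSITIVE = {"good", "great", "fantastic", "love", "best"}
NEGATIVE = {"bad", "terrible", "hate", "worst"}

def naive_sentiment(tokens):
    # +1 for each positive word, -1 for each negative word
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(naive_sentiment("this was the best idea ever".split()))   # positive
print(naive_sentiment("this was the worst idea ever".split()))  # negative
```

This also illustrates exactly the weakness named above: a phrase like “not good” would still be scored as positive, because word order and context are ignored.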
--

### Python Sentiment Analysis

After cleaning, tokenizing and stemming the text:

- Use well-known lexicons, e.g., [The subjectivity lexicon](http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/) or the [Valence Aware Dictionary and sEntiment Reasoner - VADER](https://github.com/cjhutto/vaderSentiment)

```python[1-4|6-7|9-11|13-14]
# first load VADER
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# instantiate the sentiment analyzer:
vader = SentimentIntensityAnalyzer()

# check the scores for some phrases:
print(vader.polarity_scores("This was the best idea I've had in a long time"))
{'neg': 0.0, 'neu': 0.682, 'pos': 0.318, 'compound': 0.6369}

print(vader.polarity_scores("This was the worst idea I've had in a long time"))
{'neg': 0.313, 'neu': 0.687, 'pos': 0.0, 'compound': -0.6249}
```

---

## Questions?

---

## References

--

### Examples mainly taken from:

- https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4
- https://monkeylearn.com/sentiment-analysis/
- https://medium.com/towards-artificial-intelligence/text-mining-in-python-steps-and-examples-78b3f8fd913b
- https://towardsdatascience.com/a-step-by-step-tutorial-for-conducting-sentiment-analysis-a7190a444366
- https://towardsdatascience.com/getting-started-with-text-analysis-in-python-ca13590eb4f7

--

### Some books:

- Ingersoll, Grant; Morton, Thomas; Farris, Andrew. Taming Text: How to Find, Organize, and Manipulate It. Shelter Island, NY: Manning, 2013.
- Bird, Steven; Klein, Ewan; Loper, Edward. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
- Vasiliev, Yuli. Natural Language Processing with Python and spaCy: A Practical Introduction. No Starch Press, 2020.
- Bengfort, Benjamin; Bilbro, Rebecca; Ojeda, Tony. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly Media, Inc., 2018.

---

## Thanks!

#### raphaelmcobe@gmail.com
#### [@raphaelmcobe](https://twitter.com/raphaelmcobe)
#### [CODATA-RDA Data Science Schools](https://www.datascienceschools.org/)