Lecture 9 - Preprocessing and Cleaning: Text Normalization and Natural Language Processing¶

In [ ]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

Announcements:¶

  • Data Ethics 2 - next Wednesday, apparently! I'll get this out Soon.

Goals:¶

  • Know the meaning and purpose of some basic text normalization operations (from natural language processing):
    • Sentence tokenization
    • Lowercasing, contractions, punctuation, canonicalization
    • Stemming
    • Lemmatization
    • Stopword removal
  • Get some hands-on practice using the above

Start of Quarter Survey¶

Responses to the survey prompt (from Fall, when I had a bigger dataset):

Name one hobby or activity you enjoy outside of school.

In [94]:
hob = pd.read_csv("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lectures/L09/hobbies.csv", header=None)
hob
Out[94]:
0
0 I love riding my bike, recently I have been en...
1 I like hanging out with friends by going on wa...
2 I love reading eastern fantasy/cultivation nov...
3 Hiking!
4 I love singing! It's technically in-school, bu...
5 I like video games
6 I manage a home media server in my downtime, i...
7 Gaming
8 I enjoy long distance running! I've been doing...
9 Rock climbing
10 Volleyball
11 playing bass guitar
12 I like to play the guitar and to cook
13 Video Games
14 I enjoy biking and hiking outdoors, and readin...
15 Basketball
16 Mountain biking
17 Reading
18 Archery. Painting. Reading.
19 I really love baseball. My parents and I watch...
20 Reading
21 Hiking
22 rock climbing
23 Golf and Swiming
24 Lifting!
25 I enjoy working out, hanging out with friends,...

What I'd like to do:¶

In [95]:
hob[0].plot.hist()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[95], line 1
----> 1 hob[0].plot.hist()

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_core.py:1694, in PlotAccessor.hist(self, by, bins, **kwargs)
   1639 def hist(
   1640     self, by: IndexLabel | None = None, bins: int = 10, **kwargs
   1641 ) -> PlotAccessor:
   1642     """
   1643     Draw one histogram of the DataFrame's columns.
   1644 
   (...)   1692         >>> ax = df.plot.hist(column=["age"], by="gender", figsize=(10, 8))
   1693     """
-> 1694     return self(kind="hist", by=by, bins=bins, **kwargs)

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_core.py:1185, in PlotAccessor.__call__(self, *args, **kwargs)
   1182             label_name = label_kw or data.columns
   1183             data.columns = label_name
-> 1185 return plot_backend.plot(data, kind=kind, **kwargs)

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_matplotlib/__init__.py:71, in plot(data, kind, **kwargs)
     69         kwargs["ax"] = getattr(ax, "left_ax", ax)
     70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
     72 plt.draw_if_interactive()
     73 return plot_obj.result

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_matplotlib/core.py:516, in MPLPlot.generate(self)
    514 @final
    515 def generate(self) -> None:
--> 516     self._compute_plot_data()
    517     fig = self.fig
    518     self._make_plot(fig)

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_matplotlib/core.py:716, in MPLPlot._compute_plot_data(self)
    714 # no non-numeric frames or series allowed
    715 if is_empty:
--> 716     raise TypeError("no numeric data to plot")
    718 self.data = numeric_data.apply(type(self)._convert_to_ndarray)

TypeError: no numeric data to plot

Text Normalization¶

Text normalization: transforming text into standard or canonical forms.

Often needed to convert text data into tabular data.

Tools for text normalization?

  • Built-in Python string processing functions (e.g., strip, lower)
  • Python regular expressions (e.g., find and replace)
  • Linux command-line tools sed (stream editor) and tr (translate)
  • NLP toolkits; e.g., spacy, nltk (support tokenizing, stemming, lemmatizing, etc.)

As a rule, natural language presents many challenges for seemingly simple tasks.

String functions - quick demo:

In [96]:
hours_responses = [
    "8",
    "6.5",
    "12 hours",
    "5h",
    " 4 "
]

import re

def normalize(s):
    # grab the first run of digits (note: "6.5" loses its fractional part)
    m = re.search(r"\d+", s)
    return m[0]

[normalize(s) for s in hours_responses]
Out[96]:
['8', '6', '12', '5', '4']
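One quirk of the pattern above: `re.search(r"\d+", "6.5")` matches only the "6", dropping the fractional part. A small variant (my own sketch; `normalize_decimal` is a made-up name) that keeps decimals:

```python
import re

def normalize_decimal(s):
    # Match digits with an optional fractional part, so "6.5" survives intact
    m = re.search(r"\d+(?:\.\d+)?", s)
    return float(m[0])

print([normalize_decimal(s) for s in ["8", "6.5", "12 hours", "5h", " 4 "]])
# -> [8.0, 6.5, 12.0, 5.0, 4.0]
```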
In [97]:
words = hob.iloc[-1].item().split()
print("\n".join(words))
I
enjoy
working
out,
hanging
out
with
friends,
and
walking.
In [98]:
words = [w.strip("., ").lower() for w in words]
print("\n".join(words))
i
enjoy
working
out
hanging
out
with
friends
and
walking

Tokenization¶

Roughly defined: splitting a string into linguistically meaningful pieces.

For each of the following, think of an example piece of text where the naive approach does not give the desired result.

Word tokenization: break into word-ish pieces

  • Naive: split on spaces. str.split(' ')
In [ ]:
 
  • Failure cases:
    • " The paragraph starts here."
    • "The paragraphstarts here."
    • "The paragraph starts here---yet it's not over yet!"
    • "I like rock-climbing."
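A quick sketch (mine, not from the lecture) of how naive space-splitting behaves on a couple of the failure cases above:

```python
# Leading whitespace yields an empty-string "token"
naive = " The paragraph starts here.".split(' ')
print(naive)

# Punctuation and dashes stay glued to their neighbors
hyphens = "The paragraph starts here---yet it's not over yet!".split(' ')
print(hyphens)  # "here---yet" and "yet!" each come out as one token
```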
  • Sentence tokenization: breaking up paragraphs (or larger) into sentences.
    • Naive: split on periods. str.split('. ')
In [ ]:
 
  • Failure cases:
    • "Who thought this was a good idea? I didn't."
    • "Oh no..."
    • "Mr. Rogers is the coolest."
    • "J.R.R. Tolkien wrote some stuff."
    • The sentence ends with "a quote."
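Sketching the naive sentence splitter on the abbreviation cases (again my own demo):

```python
text = "Mr. Rogers is the coolest. J.R.R. Tolkien wrote some stuff."

# Splitting on ". " treats every period+space as a sentence boundary
pieces = text.split('. ')
print(pieces)
# "Mr" and "J.R.R" get chopped off as if they ended sentences
```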
  • Lowercasing
    • Naive: str.lower()
In [ ]:
 
  • Failure cases:
    • Joy vs joy, River vs river
    • "I have a 20 MBps internet connection."
    • lol vs LOL
    • PIN vs pin
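A quick demo (my own) of the information that str.lower() throws away:

```python
# Case-folding collapses distinctions the original text carried
name = "River Phoenix waded into the river.".lower()
print(name)                         # "River" (name) and "river" (noun) now match
print("My PIN is secret.".lower())  # the acronym PIN reads as the word "pin"
print("LOL".lower() == "lol")       # shouting vs. mild amusement: lost
```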

Other Text Normalization Operations¶

  • Expand contractions; instead of tokenizing, you could preprocess these away
    • the tokenizer expands "doesn't" to "does", "n't"
    • Could instead first expand it to "does not"
  • Canonicalize language variants; e.g., color vs colour
  • Converting numerical representation of words to numbers
    • "two and a half" -> 2.5
    • "four million" -> 4,000,000 or 4000000
  • Stripping accents or unicode characters
    • E.g., résumé to resume.
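Accent stripping can be done with the standard library alone. A minimal sketch (strip_accents is my own helper name): decompose each accented character into a base character plus combining marks, then drop the marks.

```python
import unicodedata

def strip_accents(s):
    # NFKD decomposition separates "é" into "e" + a combining acute accent;
    # dropping the combining marks leaves plain ASCII-ish text
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if not unicodedata.combining(c))

print(strip_accents("résumé"))  # -> resume
```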

Tokenization - Example:¶

In [100]:
import spacy

nlp = spacy.load('en_core_web_sm')
In [103]:
text = "Dr. Wehrwein doesn't work for the F.B.I. His résumé wouldn't qualify him for such a job."

ans = nlp(text)
list(ans)
Out[103]:
[Dr.,
 Wehrwein,
 does,
 n't,
 work,
 for,
 the,
 F.B.I.,
 His,
 résumé,
 would,
 n't,
 qualify,
 him,
 for,
 such,
 a,
 job,
 .]
In [104]:
list(ans.sents) # sentence tokenization
Out[104]:
[Dr. Wehrwein doesn't work for the F.B.I.,
 His résumé wouldn't qualify him for such a job.]
  • Stripping punctuation
    • If it isn't important for your task, could strip all punctuation out.
    • Beware of side effects; e.g., 192.168.1.1 -> 19216811.
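Here's a small demo (mine) of that side effect, stripping punctuation with str.translate:

```python
import string

# Blindly deleting every punctuation character mangles the IP address
naive = "Localhost is 127.0.0.1.".translate(
    str.maketrans('', '', string.punctuation))
print(naive)  # -> "Localhost is 127001"
```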
In [105]:
localhost = "Localhost is 127.0.0.1, whereas your home router is traditionally configured to be 192.168.0.1."
ans = nlp(localhost)
list(ans)
Out[105]:
[Localhost,
 is,
 127.0.0.1,
 ,,
 whereas,
 your,
 home,
 router,
 is,
 traditionally,
 configured,
 to,
 be,
 192.168.0.1,
 .]
In [108]:
tok = list(ans)
tok[-1].is_punct
Out[108]:
True
  • Removing stopwords
    • Stopwords: common function words like "to" "in" "the"
    • For some tasks they aren't important or relevant
      • E.g., topic detection
In [109]:
tok = [t for t in ans if (not t.is_stop and not t.is_punct)]
tok
Out[109]:
[Localhost, 127.0.0.1, home, router, traditionally, configured, 192.168.0.1]
In [ ]:
 

Stemming¶

Convert words to word stem (even if the stem itself isn't a whole word).

  • E.g., argue, argued, argues, arguing all replaced by argu.
  • Works without knowing the part of speech.
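As a toy illustration (my own, not a real stemmer -- Porter's algorithm, available in nltk, uses a much more careful rule cascade), crude suffix-stripping already reproduces the argue example:

```python
def toy_stem(word):
    # Strip the first matching suffix, keeping at least 3 characters of stem
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["argue", "argued", "argues", "arguing"]:
    print(w, "->", toy_stem(w))
# all four reduce to the stem "argu"
```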

Lemmatization¶

Like stemming, but attempts to infer part of speech and use custom rules based on part of speech.

Part-of-speech tagging¶

In [127]:
hobby = hob.iloc[0,0]
hobby
Out[127]:
'I love riding my bike, recently I have been enjoying riding my dirt jumper (type of bike made for dirt jumps) to the bike park at the Civic sport complex area.\xa0'
In [128]:
ans = nlp(hobby)
tok = [t for t in ans if (not t.is_stop and not t.is_punct)]
pd.DataFrame({"Token" : [t for t in tok],
              "Lemma" : [t.lemma_ for t in tok], 
              "POS"   : [t.pos_ for t in tok]})
Out[128]:
Token Lemma POS
0 love love VERB
1 riding rid VERB
2 bike bike NOUN
3 recently recently ADV
4 enjoying enjoy VERB
5 riding rid VERB
6 dirt dirt NOUN
7 jumper jumper NOUN
8 type type NOUN
9 bike bike NOUN
10 dirt dirt NOUN
11 jumps jump NOUN
12 bike bike NOUN
13 park park NOUN
14 Civic Civic PROPN
15 sport sport NOUN
16 complex complex ADJ
17 area area NOUN
18 SPACE
In [123]:
text = "I got rid of my shoes and went riding."
riding = nlp(text)
riding[2].lemma_
Out[123]:
'rid'
In [124]:
riding[-2].lemma_
Out[124]:
'rid'
In [125]:
riding[-2] == riding[2]
Out[125]:
False

Noun phrase parsing¶

In [129]:
list(ans.noun_chunks)
Out[129]:
[I,
 my bike,
 I,
 my dirt jumper,
 type,
 bike,
 dirt jumps,
 the bike park,
 the Civic sport complex area]

Named entity recognition¶

In [130]:
ans.ents
Out[130]:
(Civic,)
In [131]:
nlp("Jude Law visited New York City. Air Force One happened to be parked at JFK.").ents
Out[131]:
(Jude Law, New York City, Air Force One, JFK)
In [132]:
nlp("Jude Law visited New York City. Air Force One happened to be parked at JFK.".lower()).ents
Out[132]:
(new york city, air force one, jfk)
In [133]:
nlp("Big Bird visited New York City. Air Force One happened to be parked at JFK.".lower()).ents
Out[133]:
(big bird, new york city, air force one, jfk)

Sentiment analysis¶

In [134]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob")
Out[134]:
<spacytextblob.spacytextblob.SpacyTextBlob at 0x14a034eebe30>
In [135]:
yay = nlp("Today is a good day.")
boo = nlp("I'm feeling sad.")

print(yay._.blob.polarity)
print(boo._.blob.polarity)
0.7
-0.5
In [136]:
hob["Polarity"] = hob[0].apply(lambda x: nlp(x)._.blob.polarity)

sns.displot(data=hob, x="Polarity");
[displot: histogram of polarity scores for the hobby responses]
In [137]:
hob
Out[137]:
0 Polarity
0 I love riding my bike, recently I have been en... 0.175000
1 I like hanging out with friends by going on wa... 0.000000
2 I love reading eastern fantasy/cultivation nov... 0.500000
3 Hiking! 0.000000
4 I love singing! It's technically in-school, bu... 0.312500
5 I like video games 0.000000
6 I manage a home media server in my downtime, i... 0.500000
7 Gaming 0.000000
8 I enjoy long distance running! I've been doing... 0.171875
9 Rock climbing 0.000000
10 Volleyball 0.000000
11 playing bass guitar -0.150000
12 I like to play the guitar and to cook 0.000000
13 Video Games 0.000000
14 I enjoy biking and hiking outdoors, and readin... 0.400000
15 Basketball 0.000000
16 Mountain biking 0.000000
17 Reading 0.000000
18 Archery. Painting. Reading. 0.000000
19 I really love baseball. My parents and I watch... 0.262500
20 Reading 0.000000
21 Hiking 0.000000
22 rock climbing 0.000000
23 Golf and Swiming 0.000000
24 Lifting! 0.000000
25 I enjoy working out, hanging out with friends,... 0.400000
In [ ]: