Lecture 9 - Preprocessing and Cleaning: Text Normalization and Natural Language Processing¶

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

Announcements:¶

  • Data Ethics 2 - two articles to read and a short reflection to write before this coming Wednesday

Goals:¶

  • Know the meaning and purpose of some basic text normalization operations (from natural language processing):
    • Sentence tokenization
    • Lowercasing, contractions, punctuation, canonicalization
    • Stemming
    • Lemmatization
    • Stopword removal
  • Get some hands-on practice using the above

Start of Quarter Survey¶

Responses to the survey prompt:

Name one hobby or activity you enjoy outside of school.

In [2]:
hob = pd.read_csv("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lectures/L09/hobbies.csv", header=None)
hob
Out[2]:
0
0 I love riding my bike, recently I have been en...
1 I like hanging out with friends by going on wa...
2 I love reading eastern fantasy/cultivation nov...
3 Hiking!
4 I love singing! It's technically in-school, bu...
5 I like video games
6 I manage a home media server in my downtime, i...
7 Gaming
8 I enjoy long distance running! I've been doing...
9 Rock climbing
10 Volleyball
11 playing bass guitar
12 I like to play the guitar and to cook
13 Video Games
14 I enjoy biking and hiking outdoors, and readin...
15 Basketball
16 Mountain biking
17 Reading
18 Archery. Painting. Reading.
19 I really love baseball. My parents and I watch...
20 Reading
21 Hiking
22 rock climbing
23 Golf and Swiming
24 Lifting!
25 I enjoy working out, hanging out with friends,...

What I'd like to do:¶

In [10]:
hob[0].plot.hist()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 hob[0].plot.hist()

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_core.py:1409, in PlotAccessor.hist(self, by, bins, **kwargs)
   1349 def hist(
   1350     self, by: IndexLabel | None = None, bins: int = 10, **kwargs
   1351 ) -> PlotAccessor:
   1352     """
   1353     Draw one histogram of the DataFrame's columns.
   1354 
   (...)   1407         >>> ax = df.plot.hist(column=["age"], by="gender", figsize=(10, 8))
   1408     """
-> 1409     return self(kind="hist", by=by, bins=bins, **kwargs)

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_core.py:1030, in PlotAccessor.__call__(self, *args, **kwargs)
   1027             label_name = label_kw or data.columns
   1028             data.columns = label_name
-> 1030 return plot_backend.plot(data, kind=kind, **kwargs)

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_matplotlib/__init__.py:71, in plot(data, kind, **kwargs)
     69         kwargs["ax"] = getattr(ax, "left_ax", ax)
     70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
     72 plot_obj.draw()
     73 return plot_obj.result

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_matplotlib/core.py:499, in MPLPlot.generate(self)
    497 @final
    498 def generate(self) -> None:
--> 499     self._compute_plot_data()
    500     fig = self.fig
    501     self._make_plot(fig)

File /opt/miniforge/lib/python3.12/site-packages/pandas/plotting/_matplotlib/core.py:698, in MPLPlot._compute_plot_data(self)
    696 # no non-numeric frames or series allowed
    697 if is_empty:
--> 698     raise TypeError("no numeric data to plot")
    700 self.data = numeric_data.apply(type(self)._convert_to_ndarray)

TypeError: no numeric data to plot

Text Normalization¶

Text normalization: transforming text into standard or canonical forms.

Often needed to convert text data into tabular data.

As a rule, natural language presents many challenges for seemingly simple tasks.
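As a small illustration of why normalization helps turn free text into something countable, here is a sketch using a handful of the short survey responses shown above. The `normalize` helper is a hypothetical name, and lowercasing plus stripping punctuation is only one possible choice of canonical form.

```python
import string
from collections import Counter

# A few of the short responses from the hobby survey above.
responses = ["Hiking!", "Hiking", "Reading", "Reading",
             "Rock climbing", "rock climbing", "Video Games"]

def normalize(text):
    """Lowercase, drop ASCII punctuation, and trim surrounding whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().strip()

raw_counts = Counter(responses)
clean_counts = Counter(normalize(r) for r in responses)

print(raw_counts)    # "Hiking!" and "Hiking" are counted separately
print(clean_counts)  # variants collapse into one category each
```

Counts like these are numeric, so they can be plotted, unlike the raw strings.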

Tokenization¶

Roughly defined: splitting a string into linguistically meaningful pieces.

For each of the following, think of an example piece of text where the naive approach does not give the desired result.

  • Word tokenization: break into word-ish pieces
    • Naive: split on spaces. str.split(' ')
    • Failure cases:
      • " The paragraph starts here."
      • "The paragraphstarts here."
      • "The paragraph starts here---yet it's not over yet!"
      • "I like rock-climbing."
  • Sentence tokenization: breaking up paragraphs (or larger) into sentences.
    • Naive: split on periods. str.split('. ')
    • Failure cases:
      • "Who thought this was a good idea? I didn't."
      • "Oh no..."
      • "Mr. Rogers is the coolest."
      • "J.R.R. Tolkien wrote some stuff."
      • The sentence ends with "a quote."
  • Lowercasing
    • Naive: str.lower()
    • Failure cases:
      • Scott vs scott
      • Joy vs joy, River vs river
      • NumPy?
      • "I have a 20 MBps internet connection."
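The naive approaches can be tried directly on some of the failure cases listed above. This sketch uses only built-in string methods; the variable names are arbitrary.

```python
# Word tokenization by splitting on single spaces.
leading_space = " The paragraph starts here.".split(" ")
print(leading_space)
# The leading space produces an empty first token, and "here." keeps its period.

hyphenated = "I like rock-climbing.".split(" ")
print(hyphenated)
# "rock-climbing." stays a single token, punctuation included.

# Sentence tokenization by splitting on ". ".
abbreviation = "Mr. Rogers is the coolest.".split(". ")
print(abbreviation)
# The abbreviation "Mr." is treated as a sentence boundary.

question = "Who thought this was a good idea? I didn't.".split(". ")
print(question)
# The question mark is not a split point, so both sentences stay together.

# Lowercasing erases distinctions that capitalization carried.
lowered = "I have a 20 MBps internet connection.".lower()
print(lowered)
# "MBps" becomes "mbps", so the byte/bit distinction in the unit is lost.
```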
In [11]:
!python -m spacy download en_core_web_sm
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
In [12]:
!pip install spacytextblob

Tokenization - Example:¶

In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')
In [15]:
text = "Dr. Wehrwein doesn't work for the F.B.I. His résumé wouldn't qualify him for such a job."

ans = nlp(text)
list(ans)
Out[15]:
[Dr.,
 Wehrwein,
 does,
 n't,
 work,
 for,
 the,
 F.B.I.,
 His,
 résumé,
 would,
 n't,
 qualify,
 him,
 for,
 such,
 a,
 job,
 .]
In [16]:
list(ans.sents) # sentence tokenization
Out[16]:
[Dr. Wehrwein doesn't work for the F.B.I.,
 His résumé wouldn't qualify him for such a job.]

Other Text Normalization Operations¶

  • Expanding contractions: instead of leaving them to the tokenizer, you could preprocess them away
    • The tokenizer above split doesn't into does and n't
    • Could instead first expand it to "does not"
  • Canonicalizing language variants; e.g., color vs colour
  • Converting numbers written as words to numerals
    • "two and a half" -> 2.5
    • "four million" -> 4,000,000 or 4000000
  • Stripping accents or other non-ASCII characters
    • E.g., résumé to resume.
  • Stripping punctuation
    • If it isn't important for your task, could strip all punctuation out.
    • Beware of side effects; e.g., 192.168.1.1 -> 19216811.
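A few of these operations can be sketched with the standard library alone. The function names and the `CONTRACTIONS` table below are hypothetical and deliberately tiny; this is an illustration of the ideas, not a complete normalizer.

```python
import string
import unicodedata

def strip_accents(text):
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_punctuation(text):
    """Remove every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Hypothetical lookup table; a realistic one would be much longer.
CONTRACTIONS = {"doesn't": "does not", "wouldn't": "would not"}

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

print(strip_accents("résumé"))
print(strip_punctuation("192.168.1.1"))  # the side effect: dots in the address vanish
print(expand_contractions("His résumé wouldn't qualify him."))
```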
In [17]:
localhost = "Localhost is 127.0.0.1, whereas your home router is traditionally configured to be 192.168.0.1."
ans = nlp(localhost)
list(ans)
Out[17]:
[Localhost,
 is,
 127.0.0.1,
 ,,
 whereas,
 your,
 home,
 router,
 is,
 traditionally,
 configured,
 to,
 be,
 192.168.0.1,
 .]
In [18]:
tok = list(ans)
tok[-1].is_punct
Out[18]:
True
  • Removing stopwords
    • Stopwords: common function words like "to" "in" "the"
    • For some tasks they aren't important or relevant
      • E.g., topic detection
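The idea can also be sketched without an NLP library, assuming a small hand-picked stopword set. `STOPWORDS` here is illustrative only; toolkits ship much longer lists. The sentence is one of the survey responses.

```python
# Illustrative stopword set; not the list any particular library uses.
STOPWORDS = {"i", "to", "the", "and", "a", "in", "of"}

sentence = "I like to play the guitar and to cook"

# Lowercase first so that "I" matches the lowercase entry in the set.
tokens = sentence.lower().split()
content_words = [t for t in tokens if t not in STOPWORDS]

print(content_words)
```

What remains is mostly the words that carry the topic of the response.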
In [19]:
tok = [t for t in ans if (not t.is_stop and not t.is_punct)]
tok
Out[19]:
[Localhost, 127.0.0.1, home, router, traditionally, configured, 192.168.0.1]

Stemming¶

Convert each word to its stem, even if the stem itself isn't a whole word.

  • E.g., argue, argued, argues, arguing all replaced by argu.
  • Works without knowing the part of speech.
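To make the idea concrete, here is a crude suffix-stripping sketch. It is a toy, not the Porter algorithm or any library's stemmer; `SUFFIXES` and `toy_stem` are hypothetical names, and the rules are chosen only to reproduce the example above.

```python
# Suffixes are tried in this order, so longer ones are checked before "s" and "e".
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

words = ["argue", "argued", "argues", "arguing"]
stems = [toy_stem(w) for w in words]
print(stems)

# No part of speech is consulted, and the result need not be a real word.
print(toy_stem("riding"))
```

All four forms of "argue" map to the same stem, which is the point: different surface forms get counted as one item.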

Lemmatization¶

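Like stemming, but attempts to infer each word's part of speech and applies rules specific to that part of speech. The result (the lemma) is a dictionary form of the word; e.g., riding becomes ride.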
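A toy sketch of why the part of speech matters: the same surface word can have different lemmas depending on how it is used. The `LEMMAS` table and `toy_lemma` function are hypothetical; real lemmatizers combine much larger tables with rules.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("riding", "VERB"): "ride",
    ("jumps", "NOUN"): "jump",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def toy_lemma(word, pos):
    """Return the lemma for a (word, POS) pair, or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemma("saw", "VERB"))     # past tense of "see"
print(toy_lemma("saw", "NOUN"))     # the cutting tool
print(toy_lemma("riding", "VERB"))
```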

Part-of-speech tagging¶

In [20]:
hobby = hob.iloc[0,0]
hobby
Out[20]:
'I love riding my bike, recently I have been enjoying riding my dirt jumper (type of bike made for dirt jumps) to the bike park at the Civic sport complex area.\xa0'
In [21]:
ans = nlp(hobby)
tok = [t for t in ans if (not t.is_stop and not t.is_punct)]
pd.DataFrame({"Token" : [t for t in tok],
              "Lemma" : [t.lemma_ for t in tok], 
              "POS"   : [t.pos_ for t in tok]})
Out[21]:
Token Lemma POS
0 love love VERB
1 riding ride VERB
2 bike bike NOUN
3 recently recently ADV
4 enjoying enjoy VERB
5 riding ride VERB
6 dirt dirt NOUN
7 jumper jumper NOUN
8 type type NOUN
9 bike bike NOUN
10 dirt dirt NOUN
11 jumps jump NOUN
12 bike bike NOUN
13 park park NOUN
14 Civic Civic PROPN
15 sport sport NOUN
16 complex complex ADJ
17 area area NOUN
18 SPACE

Noun phrase parsing¶

In [22]:
list(ans.noun_chunks)
Out[22]:
[I,
 my bike,
 I,
 my dirt jumper,
 type,
 bike,
 dirt jumps,
 the bike park,
 the Civic sport complex area]

Named entity recognition¶

In [23]:
ans.ents
Out[23]:
(Civic,)
In [27]:
nlp("Jude Law visited NYC. Air Force One happened to be parked at JFK.".lower()).ents
Out[27]:
(air force one, jfk)

Sentiment analysis¶

In [24]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob")
Out[24]:
<spacytextblob.spacytextblob.SpacyTextBlob at 0x14e2640f11c0>
In [25]:
yay = nlp("Today is a good day.")
boo = nlp("I'm feeling sad.")

print(yay._.blob.polarity)
print(boo._.blob.polarity)
0.7
-0.5
In [26]:
sns.displot(hob[0].apply(lambda x: nlp(x)._.blob.polarity))
plt.gca().set_xlabel("Polarity")
Out[26]:
Text(0.5, 9.444444444444438, 'Polarity')
[Figure: distribution of sentiment polarity across the hobby responses; x-axis labeled "Polarity"]

Tools for text normalization?

  • Python regular expressions (e.g., find and replace)
  • Linux command-line tools sed (stream editor) and tr (translate)
  • NLP toolkits; e.g., spaCy, NLTK (support tokenizing, stemming, lemmatizing, etc.)
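As an example of the first option, here is a minimal find-and-replace sketch with Python's `re` module. The input sentence and the `SPELLINGS` mapping are made up for illustration.

```python
import re

text = "My   favourite colour   is  grey."

# Collapse each run of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", text)

# Canonicalize spelling variants; \b restricts matches to whole words.
# The mapping is a small illustrative sample, not a complete list.
SPELLINGS = {"colour": "color", "favourite": "favorite", "grey": "gray"}
pattern = re.compile(r"\b(" + "|".join(SPELLINGS) + r")\b")
canonical = pattern.sub(lambda m: SPELLINGS[m.group(1)], collapsed)

print(canonical)
```

Compiling one pattern from the dictionary keys keeps the replacement to a single pass over the text.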