Lecture 9 - Preprocessing and Cleaning: Text Normalization and Natural Language Processing¶
In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
Announcements:¶
- Data Ethics 2 - two articles to read and a short reflection to write before this coming Wednesday
Goals:¶
- Know the meaning and purpose of some basic text normalization operations (from natural language processing):
- Sentence tokenization
- Lowercasing, contractions, punctuation, canonicalization
- Stemming
- Lemmatization
- Stopword removal
- Get some hands-on practice using the above
Start of Quarter Survey¶
Responses to the survey prompt:
Name one hobby or activity you enjoy outside of school.
In [2]:
hob = pd.read_csv("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lectures/L09/hobbies.csv", header=None)
hob
Out[2]:
| | 0 |
|---|---|
| 0 | I love riding my bike, recently I have been en... |
| 1 | I like hanging out with friends by going on wa... |
| 2 | I love reading eastern fantasy/cultivation nov... |
| 3 | Hiking! |
| 4 | I love singing! It's technically in-school, bu... |
| 5 | I like video games |
| 6 | I manage a home media server in my downtime, i... |
| 7 | Gaming |
| 8 | I enjoy long distance running! I've been doing... |
| 9 | Rock climbing |
| 10 | Volleyball |
| 11 | playing bass guitar |
| 12 | I like to play the guitar and to cook |
| 13 | Video Games |
| 14 | I enjoy biking and hiking outdoors, and readin... |
| 15 | Basketball |
| 16 | Mountain biking |
| 17 | Reading |
| 18 | Archery. Painting. Reading. |
| 19 | I really love baseball. My parents and I watch... |
| 20 | Reading |
| 21 | Hiking |
| 22 | rock climbing |
| 23 | Golf and Swiming |
| 24 | Lifting! |
| 25 | I enjoy working out, hanging out with friends,... |
What I'd like to do:¶
In [10]:
hob[0].plot.hist()
TypeError: no numeric data to plot
(full pandas traceback omitted; a column of raw text can't be histogrammed directly)
Text Normalization¶
Text normalization: transforming text into standard or canonical forms.
Often needed to convert text data into tabular data.
As a rule, natural language presents many challenges for seemingly simple tasks.
Tokenization¶
Roughly defined: splitting a string into linguistically meaningful pieces.
For each of the following, think of an example piece of text where the naive approach does not give the desired result. (A short demonstration of a few of these failure cases follows the list.)
- Word tokenization: break into word-ish pieces
  - Naive: split on spaces, `str.split(' ')`
  - Failure cases:
    - " The paragraph starts here."
    - "The paragraphstarts here."
    - "The paragraph starts here---yet it's not over yet!"
    - "I like rock-climbing."
- Sentence tokenization: breaking up paragraphs (or larger) into sentences.
  - Naive: split on periods, `str.split('. ')`
  - Failure cases:
    - "Who thought this was a good idea? I didn't."
    - "Oh no..."
    - "Mr. Rogers is the coolest."
    - "J.R.R. Tolkien wrote some stuff."
    - The sentence ends with "a quote."
- Lowercasing
  - Naive: `str.lower()`
  - Failure cases:
    - Scott vs scott
    - Joy vs joy, River vs river
    - NumPy?
    - "I have a 20 MBps internet connection."
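To make a few of these failure cases concrete, here is a minimal sketch in plain Python (no libraries needed) showing what the naive approaches actually return:

# Naive word tokenization: split on spaces
" The paragraph starts here.".split(' ')
# -> ['', 'The', 'paragraph', 'starts', 'here.']   (spurious empty token; 'here.' keeps its period)

"I like rock-climbing.".split(' ')
# -> ['I', 'like', 'rock-climbing.']   (is "rock-climbing" one token or two?)

# Naive sentence tokenization: split on '. '
"Mr. Rogers is the coolest. J.R.R. Tolkien wrote some stuff.".split('. ')
# -> ['Mr', 'Rogers is the coolest', 'J.R.R', 'Tolkien wrote some stuff.']

# Naive lowercasing: casing that carried meaning is gone
"Joy and River went to the river.".lower()
# -> 'joy and river went to the river.'   (names and common nouns now look identical)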
In [11]:
!python -m spacy download en_core_web_sm
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.8.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 44.8 MB/s 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
In [12]:
!pip install spacytextblob
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: spacytextblob in /cluster/home/wehrwes/.local/lib/python3.12/site-packages (5.0.0)
(remaining "Requirement already satisfied" dependency lines omitted)
Tokenization - Example:¶
In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')
In [15]:
text = "Dr. Wehrwein doesn't work for the F.B.I. His résumé wouldn't qualify him for such a job."
ans = nlp(text)
list(ans)
Out[15]:
[Dr., Wehrwein, does, n't, work, for, the, F.B.I., His, résumé, would, n't, qualify, him, for, such, a, job, .]
In [16]:
list(ans.sents) # sentence tokenization
Out[16]:
[Dr. Wehrwein doesn't work for the F.B.I., His résumé wouldn't qualify him for such a job.]
Other Text Normalization Operations¶
- Expand contractions; instead of tokenizing them, you could preprocess these away
  - The tokenizer expanded `doesn't` to `does`, `n't`
  - Could instead first expand it to "does not" (see the sketch after this list)
- Canonicalize language variants; e.g., color vs colour
- Converting numerical representation of words to numbers
- "two and a half" -> 2.5
- "four million" -> 4,000,000 or 4000000
- Stripping accents or unicode characters
- E.g., résumé to resume.
- Stripping punctuation
- If it isn't important for your task, could strip all punctuation out.
- Beware of side effects; e.g., 192.168.1.1 -> 19216811.
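A minimal sketch of two of these operations: accent stripping with Python's standard-library `unicodedata` module, and contraction expansion with a toy lookup table (the table contents are illustrative only; a real one would need many more entries).

import unicodedata

# Strip accents: decompose each character (NFKD), then drop the combining marks.
def strip_accents(s):
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

strip_accents("résumé")   # -> 'resume'

# Expand contractions with a small lookup table.
contractions = {"doesn't": "does not", "wouldn't": "would not", "isn't": "is not"}
" ".join(contractions.get(w, w) for w in "it wouldn't work".split())
# -> 'it would not work'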
In [17]:
localhost = "Localhost is 127.0.0.1, whereas your home router is traditionally configured to be 192.168.0.1."
ans = nlp(localhost)
list(ans)
Out[17]:
[Localhost, is, 127.0.0.1, ,, whereas, your, home, router, is, traditionally, configured, to, be, 192.168.0.1, .]
In [18]:
tok = list(ans)
tok[-1].is_punct
Out[18]:
True
- Removing stopwords
- Stopwords: common function words like "to", "in", "the"
- For some tasks they aren't important or relevant
- E.g., topic detection
In [19]:
tok = [t for t in ans if (not t.is_stop and not t.is_punct)]
tok
Out[19]:
[Localhost, 127.0.0.1, home, router, traditionally, configured, 192.168.0.1]
Stemming¶
Convert words to their word stem (even if the stem itself isn't a whole word).
- E.g., argue, argued, argues, arguing all replaced by argu.
- Works without knowing the part of speech.
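spaCy doesn't ship a stemmer, so here's a minimal sketch using NLTK's PorterStemmer (assuming `nltk` is installed) to reproduce the example above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# The classic Porter example: all four forms collapse to the non-word stem "argu".
[stemmer.stem(w) for w in ["argue", "argued", "argues", "arguing"]]
# -> ['argu', 'argu', 'argu', 'argu']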
Lemmatization¶
Like stemming, but attempts to infer part of speech and use custom rules based on part of speech.
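A quick sketch using the `nlp` pipeline loaded above: the same surface form can get different lemmas depending on its inferred part of speech (exact output depends on the model version).

doc = nlp("I am meeting her at the meeting.")
# Show each token alongside its part-of-speech tag and lemma.
[(t.text, t.pos_, t.lemma_) for t in doc]
# Typically the verb "meeting" lemmatizes to "meet", while the noun "meeting" stays "meeting".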
Part-of-speech tagging¶
In [20]:
hobby = hob.iloc[0,0]
hobby
Out[20]:
'I love riding my bike, recently I have been enjoying riding my dirt jumper (type of bike made for dirt jumps) to the bike park at the Civic sport complex area.\xa0'
In [21]:
ans = nlp(hobby)
tok = [t for t in ans if (not t.is_stop and not t.is_punct)]
pd.DataFrame({"Token" : [t for t in tok],
"Lemma" : [t.lemma_ for t in tok],
"POS" : [t.pos_ for t in tok]})
Out[21]:
| | Token | Lemma | POS |
|---|---|---|---|
| 0 | love | love | VERB |
| 1 | riding | ride | VERB |
| 2 | bike | bike | NOUN |
| 3 | recently | recently | ADV |
| 4 | enjoying | enjoy | VERB |
| 5 | riding | ride | VERB |
| 6 | dirt | dirt | NOUN |
| 7 | jumper | jumper | NOUN |
| 8 | type | type | NOUN |
| 9 | bike | bike | NOUN |
| 10 | dirt | dirt | NOUN |
| 11 | jumps | jump | NOUN |
| 12 | bike | bike | NOUN |
| 13 | park | park | NOUN |
| 14 | Civic | Civic | PROPN |
| 15 | sport | sport | NOUN |
| 16 | complex | complex | ADJ |
| 17 | area | area | NOUN |
| 18 | | | SPACE |
Noun phrase parsing¶
In [22]:
list(ans.noun_chunks)
Out[22]:
[I, my bike, I, my dirt jumper, type, bike, dirt jumps, the bike park, the Civic sport complex area]
Named entity recognition¶
In [23]:
ans.ents
Out[23]:
(Civic,)
In [27]:
nlp("Jude Law visited NYC. Air Force One happened to be parked at JFK.".lower()).ents
Out[27]:
(air force one, jfk)
Sentiment analysis¶
In [24]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob")
Out[24]:
<spacytextblob.spacytextblob.SpacyTextBlob at 0x14e2640f11c0>
In [25]:
yay = nlp("Today is a good day.")
boo = nlp("I'm feeling sad.")
print(yay._.blob.polarity)
print(boo._.blob.polarity)
0.7
-0.5
In [26]:
sns.displot(hob[0].apply(lambda x: nlp(x)._.blob.polarity))
plt.gca().set_xlabel("Polarity")
Out[26]:
Text(0.5, 9.444444444444438, 'Polarity')
Tools for text normalization?¶
- Python regular expressions (e.g., find and replace) - see the sketch below
- Linux command-line tools `sed` (stream editor) or `tr` (translate)
- NLP toolkits, e.g., `spacy`, `nltk` (support tokenizing, stemming, lemmatizing, etc.)