Lecture 11 - Text Normalization; Data Ethics 1 Discussion¶

In [1]:
!pip install spacytextblob # this is for sentiment analysis
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: spacytextblob in /usr/local/lib/python3.8/dist-packages (4.0.0)
(remaining "Requirement already satisfied" lines for dependencies omitted)
In [2]:
import numpy as np
import seaborn as sns
import pandas as pd

Announcements:¶

Last talks (for a little while anyway) this week:

  • Thu 2/2 Szilard Vajda, 4pm CF 105 - Research (machine learning/data science/etc.; wine quality estimation?)
  • Fri 2/3 Szilard Vajda, 4pm CF 316 - Teaching Demo

Quiz 3 FMQ¶

Reasons to visualize:

  • To precisely measure the strengths of correlations or relationships in the data
  • To summarize a larger dataset using only a few numbers

Viz principles:

  • Avoid repetition
  • Minimize the data-ink ratio

Which visualization best illustrates:

  • the relationship between two numerical columns:
    • dot or line plot
  • a comparison of summary statistics of between 5 and 10 numerical columns?
    • table
    • bar chart
  • the observed distribution of a numerical column?
    • box and whisker plots

Chart 1 vs Chart 2

  • What does "chartjunk" mean?

Goals:¶

  • Know the meaning and purpose of some basic text normalization operations (from natural language processing):

    • Sentence tokenization
    • Lowercasing, contractions, punctuation, canonicalization
    • Stemming
    • Lemmatization
    • Stopword removal
  • Discuss Data Ethics 1

Text Normalization¶

Text normalization: transforming the various ways text can appear into standard or canonical forms. It's often needed to convert text data into tabular data.

  • Sentence tokenization: breaking up paragraphs (or larger) into sentences.
    • Challenges with the naive approach of splitting on periods; e.g., "Dr. Wehrwein doesn't work for the F.B.I." (see the sketch after this list)
  • Word tokenization: breaking text into word-ish pieces
  • Lowercasing
    • For many applications, the distinction between "Scott" and "scott" doesn't matter.
    • Brainstorm: what is one example or situation where the case of the word matters? What is one where it doesn't?
      • hope vs Hope, joy vs Joy
      • mb vs MB vs Mb
      • STEM vs stem, PIN vs pin
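For a sense of why the naive approach fails, here's a quick sketch (using plain Python string methods, not spaCy) that splits the example sentence on periods; lowercasing, by contrast, is a one-liner:

In [ ]:
ex = "Dr. Wehrwein doesn't work for the F.B.I."
print(ex.split("."))  # naive period-splitting shatters the abbreviations
print(ex.lower())     # lowercasing is trivial by comparison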
In [3]:
import spacy
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

Responses to the survey prompt:

Name one hobby or activity you enjoy outside of school.

In [4]:
hob = pd.read_csv("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lectures/L10/hobbies.csv", header=None)
hob
Out[4]:
0
0 I enjoy playing chess with people who are bett...
1 soccer
2 Running
3 I like to play Dungeons and Dragons
4 Building computers, cars, the WWU Racing team,...
... ...
119 I enjoy regularly going to the gym, as well as...
120 Skiing
121 Playing video games
122 I like skiing
123 Playing basketball and getting my nails painted

124 rows × 1 columns

In [5]:
hob.iloc[119,0]
Out[5]:
'I enjoy regularly going to the gym, as well as playing sports such as soccer and skiing over the winter. Ive also been playing video games since I was a child and managed to build my first PC about a year or two back, which was a hassle in itself. 3 hobbys but whatever.\xa0'
  • Expand contractions; instead of tokenizing, you could preprocess these away
    • tokenizer might expand "isn't" to isn | ' | t
    • Could instead expand it to "is not"
  • Canonicalize language variants; e.g., color vs colour
  • Converting numbers written as words into numerals
    • "two and a half" -> 2.5
    • "four million" -> 4,000,000 or 4000000
  • Stripping accents or other special Unicode characters
    • E.g., résumé to resume.
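A minimal sketch of two of these operations using only the standard library; the CONTRACTIONS lexicon here is a made-up toy, and real lexicons are far larger:

In [ ]:
import unicodedata

CONTRACTIONS = {"isn't": "is not", "I've": "I have"}  # toy lexicon for illustration

def expand_contractions(text):
    # swap each known contraction for its expanded form
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def strip_accents(text):
    # decompose accented characters, then drop the combining marks
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

print(expand_contractions("I've heard it isn't easy."))  # I have heard it is not easy.
print(strip_accents("résumé"))                           # resume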
In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')
In [7]:
ans = nlp(hob.iloc[119,0])
tok = [t for t in ans] # word tokenization
print(tok)
[I, enjoy, regularly, going, to, the, gym, ,, as, well, as, playing, sports, such, as, soccer, and, skiing, over, the, winter, ., I, ve, also, been, playing, video, games, since, I, was, a, child, and, managed, to, build, my, first, PC, about, a, year, or, two, back, ,, which, was, a, hassle, in, itself, ., 3, hobbys, but, whatever, .,  ]
In [12]:
[x**2 for x in range(15) if x > 5 and x < 10]  # aside: a list comprehension with a filter condition
Out[12]:
[36, 49, 64, 81]
In [13]:
tok = list(ans.sents) # sentence tokenization
print('\n'.join([str(t) for t in tok]))
I enjoy regularly going to the gym, as well as playing sports such as soccer and skiing over the winter.
Ive also been playing video games since I was a child and managed to build my first PC about a year or two back, which was a hassle in itself.
3 hobbys but whatever. 
In [14]:
s = """N.L.P. concerns itself with language understanding.

There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"""
In [15]:
list(nlp(s).sents)
Out[15]:
[N.L.P. concerns itself with language understanding.
 ,
 There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!]
  • Stripping punctuation
    • If it isn't important for your task, could strip all punctuation out.
    • Beware of side effects; e.g., 192.168.1.1 -> 19216811.
  • Removing stopwords
    • Stopwords: common function words like "to", "in", "the"
    • For some tasks they aren't important or relevant
      • E.g., topic detection
In [16]:
tok = [t for t in ans if (not t.is_stop and not t.is_punct)] # stopword and punctuation removal
print(tok)
[enjoy, regularly, going, gym, playing, sports, soccer, skiing, winter, ve, playing, video, games, child, managed, build, PC, year, hassle, 3, hobbys,  ]
  • Stemming: convert words to their stem (even if the stem itself isn't a valid word); see the sketch below this list.
    • E.g., argue, argued, argues, arguing are all replaced by argu.
    • Works without knowing the part of speech.
  • Lemmatization: like stemming, but attempts to infer part of speech and use custom rules based on part of speech.
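spaCy doesn't include a stemmer, but nltk (already installed above as a textblob dependency) does; here's a sketch running its Porter stemmer on the example words:

In [ ]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["argue", "argued", "argues", "arguing"]])  # all become 'argu'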
In [17]:
lem = [t.lemma_ for t in tok] # lemmatization
print([str(t) for t in tok])
print([str(t) for t in lem])
['enjoy', 'regularly', 'going', 'gym', 'playing', 'sports', 'soccer', 'skiing', 'winter', 've', 'playing', 'video', 'games', 'child', 'managed', 'build', 'PC', 'year', 'hassle', '3', 'hobbys', '\xa0']
['enjoy', 'regularly', 'go', 'gym', 'play', 'sport', 'soccer', 'ski', 'winter', 've', 'play', 'video', 'game', 'child', 'manage', 'build', 'pc', 'year', 'hassle', '3', 'hobby', '\xa0']
  • Part-of-speech tagging
In [18]:
pd.DataFrame({"Token" : [t for t in ans],
              "Lemma" : [t.lemma_ for t in ans], 
              "POS"   : [t.pos_ for t in ans]})
Out[18]:
Token Lemma POS
0 I I PRON
1 enjoy enjoy AUX
2 regularly regularly ADV
3 going go VERB
4 to to ADP
... ... ... ...
56 hobbys hobby NOUN
57 but but CCONJ
58 whatever whatever PRON
59 . . PUNCT
60 SPACE

61 rows × 3 columns

  • Noun phrase parsing
In [19]:
print(list(ans.noun_chunks))
[I, the gym, sports, soccer, the winter, I, video games, I, a child, my first PC, which, a hassle, itself, 3 hobbys, whatever]
  • Named entity recognition
In [22]:
nlp("Jude Law visited NYC").ents
Out[22]:
(Jude Law, NYC)
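Each entity also carries a predicted label; the exact labels depend on the model, but for this sentence you'd expect something like PERSON and GPE (geopolitical entity):

In [ ]:
doc = nlp("Jude Law visited NYC")
print([(ent.text, ent.label_) for ent in doc.ents])  # pair each entity with its predicted label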
  • Sentiment analysis
In [23]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob")
Out[23]:
<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f415a6f0580>
In [27]:
yay = nlp("Today is a good day.")
boo = nlp("I'm feeling eiofjae.")

print(yay._.blob.polarity)
print(boo._.blob.polarity)
0.7
0.0
In [35]:
hob
Out[35]:
0
0 I enjoy playing chess with people who are bett...
1 soccer
2 Running
3 I like to play Dungeons and Dragons
4 Building computers, cars, the WWU Racing team,...
... ...
119 I enjoy regularly going to the gym, as well as...
120 Skiing
121 Playing video games
122 I like skiing
123 Playing basketball and getting my nails painted

124 rows × 1 columns

Sentiment analysis on the hobbies responses - here's a histogram of the polarity of all the responses.

In [ ]:
g = sns.displot(hob[0].apply(lambda x: nlp(x)._.blob.polarity))  # polarity of each response
g.set(xlabel="polarity")

Tools for text normalization¶

  • Python string methods (s.replace, s.lower)
  • Python regular expressions (e.g., re.sub)
  • Linux command-line tools sed (stream editor) or tr (translate)
  • NLP toolkits; e.g., spacy, nltk (support tokenizing, stemming, lemmatizing, etc.)
In [28]:
s
Out[28]:
"N.L.P. concerns itself with language understanding.\n\nThere's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"
In [33]:
s.replace(".", "") # replace periods with nothing
Out[33]:
"NLP concerns itself with language understanding\n\nThere's lots of nuance in natural languages that we take for granted getting every edge case right is hard!"
In [30]:
import re
re.sub("[.!,;]", "", s) # replace any of these 4 punctuation characters with nothing
Out[30]:
"NLP concerns itself with language understanding\n\nThere's lots of nuance in natural languages that we take for granted getting every edge case right is hard"
In [31]:
re.sub("\s+", " ", s) # replace one or more whitespace characters (denoted \s) with a single space
Out[31]:
"N.L.P. concerns itself with language understanding. There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"

Data Ethics - Get Yer Data¶

In [ ]:
import pandas as pd

surprised = []  # scratch space for tallying live responses

df = pd.DataFrame(index=range(1,6))  # one row per rating value, 1 through 5

Poll time!¶

  1. On a scale from 1 (completely unsurprised) to 5 (shocked), how surprised were you by the data?
In [ ]:
df["Surprised?"] = []
df["Surprised?"].plot.bar()
  2. On a scale from 1 (unperturbed) to 5 (fully spooked), how creeped out were you by the amount or types of data?
In [ ]:
df["Creeped out?"] = []
df["Creeped out?"].plot.bar()
  3. On a scale from 1 (completely bogus) to 5 (perfectly accurate), how accurate did you find the data to be?
In [ ]:
df["Accurate?"] = []
df["Accurate?"].plot.bar()

Discussion in groups of three:¶

  1. Discuss any particularly notable findings that pertain to the questions you were asked to write about:
     1. How did your data compare to what you expected? Was there anything surprising, or creepy, or just plain strange? Describe the types of data that you see.
     2. How comprehensive was your download? Are you able to determine whether the company gave you everything they had, or were they more selective?
     3. What kinds of data science questions could someone answer about you based solely on this data? What kinds of data science questions could someone with access to millions of records like yours answer?
     4. Are you comfortable with the extent and/or accuracy of data collected? Does the company have controls for opting out of collection of the sorts of data you’d rather they not have? If not - or if the company suddenly decided tomorrow to remove those controls - what should our society do about this?
  2. Were your reactions similar or different? Was this due to your attitudes, or due to differences in the data?

Decide on one most interesting thing (from any of the above discussion) to share with the class. The person who woke up latest this morning will be your group's spokesperson.

Discussion as a class:¶

  1. Is there a problem here?
  2. If so, how should society solve it?