Lecture 11 - Text Normalization; Data Ethics 1 Discussion¶

In [1]:
!pip install spacytextblob # this is for sentiment analysis
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: spacytextblob in /usr/local/lib/python3.8/dist-packages (4.0.0)
(remaining "Requirement already satisfied" lines for dependencies omitted)
In [2]:
import numpy as np
import seaborn as sns
import pandas as pd

Announcements:¶

Last talks (for a little while anyway) this week:

  • Thu 2/2 Szilard Vajda, 4pm CF 105 - Research (machine learning/data science/etc.; wine quality estimation?)
  • Fri 2/3 Szilard Vajda, 4pm CF 316 - Teaching Demo

Quiz 3 FMQ¶

Reasons to visualize:

  • To precisely measure the strengths of correlations or relationships in the data
  • To summarize a larger dataset using only a few numbers

Viz principles:

  • Avoid repetition
  • Minimize the data-ink ratio

Which visualization best illustrates:

  • the relationship between two numerical columns:
    • dot or line plot
  • a comparison of summary statistics of between 5 and 10 numerical columns?
    • table
    • bar chart
  • the observed distribution of a numerical column?
    • box and whisker plots

Chart 1 vs Chart 2

  • What does "chartjunk" mean?

Goals:¶

  • Know the meaning and purpose of some basic text normalization operations (from natural language processing):

    • Sentence tokenization
    • Lowercasing, contractions, punctuation, canonicalization
    • Stemming
    • Lemmatization
    • Stopword removal
  • Discuss Data Ethics 1

Text Normalization¶

Text normalization: transforming the various ways text can appear into standard or canonical forms. It's often needed to convert text data into tabular data.

  • Sentence tokenization: breaking up paragraphs (or larger) into sentences.
    • Challenges with the naive approach of splitting on periods; e.g., "Dr. Wehrwein doesn't work for the F.B.I." (see the sketch after this list)
  • Word tokenization: breaking text into word-ish pieces
  • Lowercasing
    • For many applications, the distinction between "Scott" and "scott" doesn't matter.
    • Brainstorm: what is one example or situation where the case of the word matters? What is one where it doesn't?
      • hope vs Hope, joy vs Joy
      • mb vs MB vs Mb
      • STEM vs stem, PIN vs pin
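For a sense of why the naive approach fails, here's a quick sketch (using plain Python string methods, not spaCy) that splits the example sentence on periods; lowercasing, by contrast, is a one-liner:

In [ ]:
ex = "Dr. Wehrwein doesn't work for the F.B.I."
print(ex.split("."))  # naive period-splitting shatters the abbreviations
print(ex.lower())     # lowercasing is trivial by comparison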
In [3]:
import spacy
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

Responses to the survey prompt:

Name one hobby or activity you enjoy outside of school.

In [4]:
hob = pd.read_csv("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lectures/L10/hobbies.csv", header=None)
hob
Out[4]:
0
0 I enjoy playing chess with people who are bett...
1 soccer
2 Running
3 I like to play Dungeons and Dragons
4 Building computers, cars, the WWU Racing team,...
... ...
119 I enjoy regularly going to the gym, as well as...
120 Skiing
121 Playing video games
122 I like skiing
123 Playing basketball and getting my nails painted

124 rows × 1 columns

In [5]:
hob.iloc[119,0]
Out[5]:
'I enjoy regularly going to the gym, as well as playing sports such as soccer and skiing over the winter. Ive also been playing video games since I was a child and managed to build my first PC about a year or two back, which was a hassle in itself. 3 hobbys but whatever.\xa0'
  • Expand contractions; instead of tokenizing, you could preprocess these away
    • tokenizer might expand "isn't" to isn | ' | t
    • Could instead expand it to "is not"
  • Canonicalize language variants; e.g., color vs colour
  • Converting numbers written as words into numerals
    • "two and a half" -> 2.5
    • "four million" -> 4,000,000 or 4000000
  • Stripping accents or other special Unicode characters
    • E.g., résumé to resume.
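A minimal sketch of two of these operations using only the standard library; the CONTRACTIONS lexicon here is a made-up toy, and real lexicons are far larger:

In [ ]:
import unicodedata

CONTRACTIONS = {"isn't": "is not", "I've": "I have"}  # toy lexicon for illustration

def expand_contractions(text):
    # swap each known contraction for its expanded form
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def strip_accents(text):
    # decompose accented characters, then drop the combining marks
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

print(expand_contractions("I've heard it isn't easy."))  # I have heard it is not easy.
print(strip_accents("résumé"))                           # resume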
In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')
In [7]:
ans = nlp(hob.iloc[119,0])
tok = [t for t in ans] # word tokenization
print(tok)
[I, enjoy, regularly, going, to, the, gym, ,, as, well, as, playing, sports, such, as, soccer, and, skiing, over, the, winter, ., I, ve, also, been, playing, video, games, since, I, was, a, child, and, managed, to, build, my, first, PC, about, a, year, or, two, back, ,, which, was, a, hassle, in, itself, ., 3, hobbys, but, whatever, .,  ]
In [12]:
[x**2 for x in range(15) if x > 5 and x < 10]  # aside: a list comprehension with a filter condition
Out[12]:
[36, 49, 64, 81]
In [13]:
tok = list(ans.sents) # sentence tokenization
print('\n'.join([str(t) for t in tok]))
I enjoy regularly going to the gym, as well as playing sports such as soccer and skiing over the winter.
Ive also been playing video games since I was a child and managed to build my first PC about a year or two back, which was a hassle in itself.
3 hobbys but whatever. 
In [14]:
s = """N.L.P. concerns itself with language understanding.

There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"""
In [15]:
list(nlp(s).sents)
Out[15]:
[N.L.P. concerns itself with language understanding.
 ,
 There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!]
  • Stripping punctuation
    • If it isn't important for your task, could strip all punctuation out.
    • Beware of side effects; e.g., 192.168.1.1 -> 19216811.
  • Removing stopwords
    • Stopwords: common function words like "to", "in", "the"
    • For some tasks they aren't important or relevant
      • E.g., topic detection
In [16]:
tok = [t for t in ans if (not t.is_stop and not t.is_punct)] # stopword and punctuation removal
print(tok)
[enjoy, regularly, going, gym, playing, sports, soccer, skiing, winter, ve, playing, video, games, child, managed, build, PC, year, hassle, 3, hobbys,  ]
  • Stemming: convert words to their stem (even if the stem itself isn't a valid word); see the sketch below this list.
    • E.g., argue, argued, argues, arguing are all replaced by argu.
    • Works without knowing the part of speech.
  • Lemmatization: like stemming, but attempts to infer part of speech and use custom rules based on part of speech.
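spaCy doesn't include a stemmer, but nltk (already installed above as a textblob dependency) does; here's a sketch running its Porter stemmer on the example words:

In [ ]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["argue", "argued", "argues", "arguing"]])  # all become 'argu'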
In [17]:
lem = [t.lemma_ for t in tok] # lemmatization
print([str(t) for t in tok])
print([str(t) for t in lem])
['enjoy', 'regularly', 'going', 'gym', 'playing', 'sports', 'soccer', 'skiing', 'winter', 've', 'playing', 'video', 'games', 'child', 'managed', 'build', 'PC', 'year', 'hassle', '3', 'hobbys', '\xa0']
['enjoy', 'regularly', 'go', 'gym', 'play', 'sport', 'soccer', 'ski', 'winter', 've', 'play', 'video', 'game', 'child', 'manage', 'build', 'pc', 'year', 'hassle', '3', 'hobby', '\xa0']
  • Part-of-speech tagging
In [18]:
pd.DataFrame({"Token" : [t for t in ans],
              "Lemma" : [t.lemma_ for t in ans], 
              "POS"   : [t.pos_ for t in ans]})
Out[18]:
Token Lemma POS
0 I I PRON
1 enjoy enjoy AUX
2 regularly regularly ADV
3 going go VERB
4 to to ADP
... ... ... ...
56 hobbys hobby NOUN
57 but but CCONJ
58 whatever whatever PRON
59 . . PUNCT
60 SPACE

61 rows × 3 columns

  • Noun phrase parsing
In [19]:
print(list(ans.noun_chunks))
[I, the gym, sports, soccer, the winter, I, video games, I, a child, my first PC, which, a hassle, itself, 3 hobbys, whatever]
  • Named entity recognition
In [22]:
nlp("Jude Law visited NYC").ents
Out[22]:
(Jude Law, NYC)
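Each entity also carries a predicted label; the exact labels depend on the model, but for this sentence you'd expect something like PERSON and GPE (geopolitical entity):

In [ ]:
doc = nlp("Jude Law visited NYC")
print([(ent.text, ent.label_) for ent in doc.ents])  # pair each entity with its predicted label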
  • Sentiment analysis
In [23]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob")
Out[23]:
<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f415a6f0580>
In [27]:
yay = nlp("Today is a good day.")
boo = nlp("I'm feeling eiofjae.")

print(yay._.blob.polarity)
print(boo._.blob.polarity)
0.7
0.0
In [35]:
hob
Out[35]:
0
0 I enjoy playing chess with people who are bett...
1 soccer
2 Running
3 I like to play Dungeons and Dragons
4 Building computers, cars, the WWU Racing team,...
... ...
119 I enjoy regularly going to the gym, as well as...
120 Skiing
121 Playing video games
122 I like skiing
123 Playing basketball and getting my nails painted

124 rows × 1 columns

Sentiment analysis on the hobbies responses - here's a histogram of the polarity of all the responses.

In [ ]:
g = sns.displot(hob[0].apply(lambda x: nlp(x)._.blob.polarity))  # polarity of each response
g.set(xlabel="polarity")

Tools for text normalization¶

  • Python string methods (s.replace, s.lower)
  • Python regular expressions (e.g., re.sub)
  • Linux command-line tools sed (stream editor) or tr (translate)
  • NLP toolkits; e.g., spacy, nltk (support tokenizing, stemming, lemmatizing, etc.)
In [28]:
s
Out[28]:
"N.L.P. concerns itself with language understanding.\n\nThere's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"
In [33]:
s.replace(".", "") # replace periods with nothing
Out[33]:
"NLP concerns itself with language understanding\n\nThere's lots of nuance in natural languages that we take for granted getting every edge case right is hard!"
In [30]:
import re
re.sub("[.!,;]", "", s) # replace any of these 4 punctuation characters with nothing
Out[30]:
"NLP concerns itself with language understanding\n\nThere's lots of nuance in natural languages that we take for granted getting every edge case right is hard"
In [31]:
re.sub("\s+", " ", s) # replace one or more whitespace characters (denoted \s) with a single space
Out[31]:
"N.L.P. concerns itself with language understanding. There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"

Data Ethics - Get Yer Data¶

In [ ]:
import pandas as pd

surprised = []  # scratch space for tallying live responses

df = pd.DataFrame(index=range(1,6))  # one row per rating value, 1 through 5

Poll time!¶

  1. On a scale from 1 (completely unsurprised) to 5 (shocked), how surprised were you by the data?
In [ ]:
df["Surprised?"] = []
df["Surprised?"].plot.bar()
  2. On a scale from 1 (unperturbed) to 5 (fully spooked), how creeped out were you by the amount or types of data?
In [ ]:
df["Creeped out?"] = []
df["Creeped out?"].plot.bar()
  3. On a scale from 1 (completely bogus) to 5 (perfectly accurate), how accurate did you find the data to be?
In [ ]:
df["Accurate?"] = []
df["Accurate?"].plot.bar()

Discussion in groups of three:¶

  1. Discuss any particularly notable findings that pertain to the questions you were asked to write about:
     1. How did your data compare to what you expected? Was there anything surprising, or creepy, or just plain strange? Describe the types of data that you see.
     2. How comprehensive was your download? Are you able to determine whether the company gave you everything they had, or were they more selective?
     3. What kinds of data science questions could someone answer about you based solely on this data? What kinds of data science questions could someone with access to millions of records like yours answer?
     4. Are you comfortable with the extent and/or accuracy of data collected? Does the company have controls for opting out of collection of the sorts of data you’d rather they not have? If not - or if the company suddenly decided tomorrow to remove those controls - what should our society do about this?
  2. Were your reactions similar or different? Was this due to your attitudes, or due to differences in the data?

Decide on one most interesting thing (from any of the above discussion) to share with the class. The person who woke up latest this morning will be your group's spokesperson.

Discussion as a class:¶

  1. Is there a problem here?
  2. If so, how should society solve it?