!pip install spacytextblob # this is for sentiment analysis
import numpy as np
import seaborn as sns
import pandas as pd
Last talks (for a little while anyway) this week:
Reasons to visualize:
Viz principles:
Which visualization best illustrates:
Chart 1 vs Chart 2
Know the meaning and purpose of some basic text normalization operations (from natural language processing):
Discuss Data Ethics 1
Text normalization: transforming the various ways text can appear into standard or canonical forms. Often needed to convert text data into tabular data.
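As a tiny illustration (a sketch, not from the lecture): even before reaching for an NLP library, plain string operations can canonicalize case, surrounding whitespace, and punctuation.

```python
raw = "  The cat SAT on the mat!!  "

# lowercase and strip surrounding whitespace, then drop punctuation characters
norm = raw.strip().lower()
norm = "".join(ch for ch in norm if ch.isalnum() or ch.isspace())
print(norm)  # the cat sat on the mat
```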
import spacy
Responses to the survey prompt:
Name one hobby or activity you enjoy outside of school.
hob = pd.read_csv("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lectures/L10/hobbies.csv", header=None)
hob
  | 0 |
---|---|
0 | I enjoy playing chess with people who are bett... |
1 | soccer |
2 | Running |
3 | I like to play Dungeons and Dragons |
4 | Building computers, cars, the WWU Racing team,... |
... | ... |
119 | I enjoy regularly going to the gym, as well as... |
120 | Skiing |
121 | Playing video games |
122 | I like skiing |
123 | Playing basketball and getting my nails painted |
124 rows × 1 columns
hob.iloc[119,0]
'I enjoy regularly going to the gym, as well as playing sports such as soccer and skiing over the winter. Ive also been playing video games since I was a child and managed to build my first PC about a year or two back, which was a hassle in itself. 3 hobbys but whatever.\xa0'
import spacy
nlp = spacy.load('en_core_web_sm')
ans = nlp(hob.iloc[119,0])
tok = [t for t in ans] # word tokenization
print(tok)
[I, enjoy, regularly, going, to, the, gym, ,, as, well, as, playing, sports, such, as, soccer, and, skiing, over, the, winter, ., I, ve, also, been, playing, video, games, since, I, was, a, child, and, managed, to, build, my, first, PC, about, a, year, or, two, back, ,, which, was, a, hassle, in, itself, ., 3, hobbys, but, whatever, ., ]
[x**2 for x in range(15) if x > 5 and x < 10]
[36, 49, 64, 81]
tok = list(ans.sents) # sentence tokenization
print('\n'.join([str(t) for t in tok]))
I enjoy regularly going to the gym, as well as playing sports such as soccer and skiing over the winter. Ive also been playing video games since I was a child and managed to build my first PC about a year or two back, which was a hassle in itself. 3 hobbys but whatever.
s = """N.L.P. concerns itself with language understanding.
There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"""
list(nlp(s).sents)
[N.L.P. concerns itself with language understanding. , There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!]
tok = [t for t in ans if (not t.is_stop and not t.is_punct)] # stopword and punctuation removal
print(tok)
[enjoy, regularly, going, gym, playing, sports, soccer, skiing, winter, ve, playing, video, games, child, managed, build, PC, year, hassle, 3, hobbys, ]
lem = [t.lemma_ for t in tok] # lemmatization
print([str(t) for t in tok])
print([str(t) for t in lem])
['enjoy', 'regularly', 'going', 'gym', 'playing', 'sports', 'soccer', 'skiing', 'winter', 've', 'playing', 'video', 'games', 'child', 'managed', 'build', 'PC', 'year', 'hassle', '3', 'hobbys', '\xa0']
['enjoy', 'regularly', 'go', 'gym', 'play', 'sport', 'soccer', 'ski', 'winter', 've', 'play', 'video', 'game', 'child', 'manage', 'build', 'pc', 'year', 'hassle', '3', 'hobby', '\xa0']
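Stemming (mentioned below alongside nltk) is the cheaper cousin of lemmatization: it clips suffixes heuristically instead of looking up dictionary forms, so the result need not be a real word. A quick sketch with nltk's Porter stemmer (nltk is already installed here as a textblob dependency):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# unlike the lemmas above ("hobbys" -> "hobby"), stems can be non-words: "hobbies" -> "hobbi"
print([stemmer.stem(w) for w in ["playing", "hobbies", "skiing", "managed"]])
# ['play', 'hobbi', 'ski', 'manag']
```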
pd.DataFrame({"Token" : [t for t in ans],
"Lemma" : [t.lemma_ for t in ans],
"POS" : [t.pos_ for t in ans]})
  | Token | Lemma | POS |
---|---|---|---|
0 | I | I | PRON |
1 | enjoy | enjoy | AUX |
2 | regularly | regularly | ADV |
3 | going | go | VERB |
4 | to | to | ADP |
... | ... | ... | ... |
56 | hobbys | hobby | NOUN |
57 | but | but | CCONJ |
58 | whatever | whatever | PRON |
59 | . | . | PUNCT |
60 |   |   | SPACE |
61 rows × 3 columns
print(list(ans.noun_chunks))
[I, the gym, sports, soccer, the winter, I, video games, I, a child, my first PC, which, a hassle, itself, 3 hobbys, whatever]
nlp("Jude Law visited NYC").ents
(Jude Law, NYC)
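Each recognized entity also carries a label (e.g., PERSON, GPE for geopolitical entity). As a self-contained sketch of how labels attach to entities, here is a rule-based EntityRuler on a blank pipeline, standing in for the pretrained model's statistical NER:

```python
import spacy

# a blank English pipeline with hand-written entity rules (no model download needed)
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Jude Law"},
                    {"label": "GPE", "pattern": "NYC"}])

doc = nlp_rules("Jude Law visited NYC")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Jude Law', 'PERSON'), ('NYC', 'GPE')]
```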
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob")
<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f415a6f0580>
yay = nlp("Today is a good day.")
boo = nlp("I'm feeling eiofjae.")
print(yay._.blob.polarity)
print(boo._.blob.polarity)
0.7
0.0
hob
  | 0 |
---|---|
0 | I enjoy playing chess with people who are bett... |
1 | soccer |
2 | Running |
3 | I like to play Dungeons and Dragons |
4 | Building computers, cars, the WWU Racing team,... |
... | ... |
119 | I enjoy regularly going to the gym, as well as... |
120 | Skiing |
121 | Playing video games |
122 | I like skiing |
123 | Playing basketball and getting my nails painted |
124 rows × 1 columns
Sentiment analysis on the hobbies responses - here's a histogram of polarity of all the responses.
sns.displot(hob[0].apply(lambda x: nlp(x)._.blob.polarity)).set(xlabel="polarity")
Tools for text normalization:

- s.replace, s.lower
- re.sub
- sed (stream editor) or tr (translate)
- spacy, nltk (support tokenizing, stemming, lemmatizing, etc.)

s
"N.L.P. concerns itself with language understanding.\n\nThere's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"
s.replace(".", "") # replace periods with nothing
"NLP concerns itself with language understanding\n\nThere's lots of nuance in natural languages that we take for granted getting every edge case right is hard!"
import re
re.sub("[.!,;]", "", s) # replace any of these 4 punctuation characters with nothing
"NLP concerns itself with language understanding\n\nThere's lots of nuance in natural languages that we take for granted getting every edge case right is hard"
re.sub("\s+", " ", s) # replace one or more whitespace characters (denoted \s) with a single space
"N.L.P. concerns itself with language understanding. There's lots of nuance in natural languages that we take for granted... getting every edge case right is hard!"
import pandas as pd

# Poll results on a 1-5 scale; each column needs five counts, one per rating.
# The zeros below are placeholders for the tallies collected in class.
df = pd.DataFrame(index=range(1, 6))
df["Surprised?"] = [0, 0, 0, 0, 0]
df["Surprised?"].plot.bar()
df["Creeped out?"] = [0, 0, 0, 0, 0]
df["Creeped out?"].plot.bar()
df["Accurate?"] = [0, 0, 0, 0, 0]
df["Accurate?"].plot.bar()
- How did your data compare to what you expected? Was there anything surprising, or creepy, or just plain strange? Describe the types of data that you see.
- How comprehensive was your download? Are you able to determine whether the company gave you everything they had, or were they more selective?
- What kinds of data science questions could someone answer about you based solely on this data? What kinds of data science questions could someone with access to millions of records like yours answer?
- Are you comfortable with the extent and/or accuracy of data collected? Does the company have controls for opting out of collection of the sorts of data you’d rather they not have? If not - or if the company suddenly decided tomorrow to remove those controls - what should our society do about this?
Decide on one most interesting thing (from any of the above discussion) to share with the class. The person who woke up latest this morning will be your group's spokesperson.