Spring 2026
This pre-lab and lab give hands-on experience with text
normalization and natural language processing. In Part I, you will
practice normalizing text using the nltk package. In Part
II, you will use the spaCy NLP package to analyze a dataset of fake and
real news articles.
Details for submission and development environment are the same as for Lab 1; please see the Lab 1 handout if you need a refresher. As a reminder, if you choose to use Google Colab or any other alternative notebook hosting service, you must disable any built-in generative AI features.
You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing; you should pair up with someone you haven't worked with on a prior lab. If an odd number of students are in attendance at lab, your TA will arrange one group of three or one person working alone. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, arrange to meet as a pair outside of class to finish the remaining work.
There are four files you will need for this lab. You can find them on our class’s shared storage on the cluster at the following paths:
/cluster/academic/DATA311/202620/lab4/lab4.txt
This will be used in Part I when practicing text normalization.
Credit to the New York Times Magazine.
/cluster/academic/DATA311/202620/lab4/contractions.csv
This will also be used in Part I when practicing text normalization.
Credit to Andrew Tucker, San Jose State University Writing Center.
/cluster/academic/DATA311/202620/lab4/Fake.csv
This will be used in the Pre-Lab and Part II for spaCy NLP analysis.
Credit to clmentbisaillon's Kaggle dataset (Kaggle dataset page).
/cluster/academic/DATA311/202620/lab4/True.csv
This will be used in the Pre-Lab and Part II for spaCy NLP analysis.
Credit to clmentbisaillon's Kaggle dataset (Kaggle dataset page).
Please carefully read “Natural Language Processing With Python’s NLTK Package”
In a text editor of your choice that can export to PDF, answer these questions:
How many tokens are in the following sentence if you call
word_tokenize on it? “Muad’Dib learned rapidly because his
first training was in how to learn.”
How many sentence tokens are in the same sentence if you instead
call sent_tokenize?
What are stemming and lemmatizing and how do they differ?
Please carefully read “spaCy 101: Everything you need to know”
In the same editor, answer these questions:
How many tokens are in the following sentence after it is
tokenized by spaCy's small model (en_core_web_sm)?
Apple is looking at buying U.K. startup for $1 billion
What is a Named Entity? List two example entity labels.
What is a Doc object?
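To see spaCy's basic objects without spoiling the questions above, here is a sketch on a different sentence. It uses a blank English pipeline so no model download is needed; the pre-lab itself uses en_core_web_sm, which adds tagging and named-entity recognition on top of this.

```python
# Sketch of spaCy's Doc/Token objects using a tokenizer-only pipeline.
import spacy

nlp = spacy.blank("en")                                  # no trained model needed
doc = nlp("San Francisco startups raised $2 billion.")   # a Doc object

# A Doc is a sequence of Token objects with the original text preserved.
print([token.text for token in doc])
# With a trained model, doc.ents would hold named entities with labels
# such as GPE ("San Francisco") and MONEY ("$2 billion");
# a blank pipeline produces none.
```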
Please download nlp_spacy_and_counters.ipynb, as
well as True.csv and Fake.csv linked above,
and upload them to JupyterHub. Go through all the cells in the notebook
and answer these questions in the same editor as your previous
questions:
How long does it take to run the spaCy model on the
short_fake data?
How long does it take to run the spaCy model on the
short_real data?
What does token.tag_ == "NNPS" mean in the last
cell?
You can return to this notebook for examples of how to use spaCy to analyze news data in Part II of the lab.
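For the timing questions, a minimal reusable pattern is shown below. The work() callable here is a stand-in for whatever you are measuring, such as running the model over short_fake or short_real.

```python
# Minimal timing pattern for the "how long does it take" questions.
import time

def time_call(work):
    """Return (result, elapsed_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = work()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Trivial stand-in workload; in the notebook, `work` would wrap the
# spaCy processing of the short_fake or short_real texts.
result, elapsed = time_call(lambda: sum(range(1_000_000)))
print(f"elapsed: {elapsed:.3f}s")
```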
JupyterLab Resources
For this lab, you'll be using a somewhat beefy spaCy language model. When starting your server on JupyterHub, we recommend requesting 2 CPUs and 6 GB of memory; in our experience, this is comfortably sufficient. If the memory usage (displayed in the status bar at the bottom) goes into the red, restart your server with greater resources.
Create a fresh notebook, lab4.ipynb; title it and
include the names of both partners in a Markdown cell at the top. In
this notebook, load the contents of the lab4.txt document
(linked above) into a string. Then implement (and explain your processing in
comments and/or text boxes as needed) the following text normalization
steps:
Replace any instances of successive spaces or tabs with a single space character.
Use the contraction-expansion pairs in
contractions.csv to find and replace any instances of the
contractions anywhere in the document. Hint: don’t overthink this
one - use Python’s string processing features.
Use nltk’s sentence tokenization to break the
document into a list of sentences. Count and print the number of
sentences; compute and print the longest sentence (most
characters).
Use nltk’s word tokenization to break each sentence
into a list of tokens. (You will produce a list of lists, where inner
lists are lists of tokens.)
Lowercase every token.
Create a dictionary that maps each unique token to the number of times it appears. Which token appears most often? List the tokens that only appear once.
If you encounter any unexpected challenges, make sure to describe them and how you decided to deal with them.
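The overall shape of these steps can be sketched with the standard library alone. Note two assumptions here: the lab requires nltk's sent_tokenize / word_tokenize, for which the plain str.split() below is only a stand-in, and the two-column layout guessed for contractions.csv (contraction, expansion) should be checked against the actual file.

```python
# Sketch of the normalization pipeline using only the standard library.
# str.split() is a stand-in for nltk's tokenizers, and the contraction
# pairs below are a guess at what contractions.csv contains.
import re
from collections import Counter

text = "Don't   repeat\tyourself. Don't  overthink it."

# 1. Collapse runs of spaces/tabs into a single space.
text = re.sub(r"[ \t]+", " ", text)

# 2. Expand contractions with plain string replacement.
contractions = {"Don't": "Do not"}   # would be loaded from contractions.csv
for short, full in contractions.items():
    text = text.replace(short, full)

# 3-5. Tokenize (stand-in), lowercase, and count.
tokens = [tok.lower() for tok in text.split()]
counts = Counter(tokens)
print(counts.most_common(3))                         # most frequent tokens
print([tok for tok, n in counts.items() if n == 1])  # tokens appearing once
```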
In the same lab4.ipynb notebook as Part I, perform the
following analyses separately on (1) the True.csv data and
(2) the Fake.csv data. For analyses 1-5, use the
en_core_web_trf spaCy model; it is more expensive to
run, but also more powerful and better at recognizing named entities, among
other things.
Using news article titles, provide a list of all the countries, cities, and states that are mentioned in the month of March 2017. Briefly comment on your findings.
Using the news article titles, list the top five people mentioned on election day, November 8, 2016. Briefly comment on your findings.
Create histograms of real and fake article sentiment mentioning the top person (from the previous analysis). Briefly comment on your findings.
Plot histograms of real and fake news article sentiment (by title) for a two-month period of your choosing. Briefly comment on your findings.
Perform 1 additional analysis of your choosing on some date range/date of your choosing. Justify your analysis choice, and comment on your findings.
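One way to structure analysis 1 is a date filter plus an entity pass, sketched below. Two assumptions to verify: the "%B %d, %Y" format (e.g. "March 7, 2017") guessed for the CSV date column, and that en_core_web_trf is installed before calling the helper that loads it.

```python
# Sketch of analysis 1: filter titles to March 2017, then collect
# place-name (GPE) entities. The date format is an assumption about
# the Kaggle CSVs - verify it against the actual `date` column.
from datetime import datetime

def in_month(date_str, year, month):
    """True if a date string like 'March 7, 2017' falls in the given month."""
    try:
        d = datetime.strptime(date_str.strip(), "%B %d, %Y")
    except ValueError:
        return False          # skip rows with malformed dates
    return d.year == year and d.month == month

def places_in_titles(titles):
    """Collect GPE (countries/cities/states) entities from a list of titles.
    Requires the en_core_web_trf model to be installed."""
    import spacy
    nlp = spacy.load("en_core_web_trf")
    places = set()
    for doc in nlp.pipe(titles):
        places.update(ent.text for ent in doc.ents if ent.label_ == "GPE")
    return sorted(places)

print(in_month("March 7, 2017", 2017, 3))
```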
Time each spaCy model (en_core_web_sm,
en_core_web_md, en_core_web_lg,
and en_core_web_trf) on the first 5 full-text news
articles from True.csv. Display your findings in a bar
chart with one bar per spaCy model, with bar heights equal to the
elapsed time in each of your timing experiments. Briefly discuss your
findings.
Thanks to Brian Hutchinson and Aaron Tuor for developing and iterating on prior versions of this lab.