DATA 311 - Lab 4: Text Normalization and Natural Language Processing

Scott Wehrwein

Fall 2025

Overview

This pre-lab and lab give you hands-on experience with text normalization and natural language processing. In Part I, you will practice normalizing text using the nltk package. In Part II, you will use the spaCy NLP package to analyze a dataset of fake and real news articles.

Submission and Development Environment

Details for submission and development environment are the same as for Lab 1; please see the Lab 1 handout if you need a refresher. As a reminder, if you choose to use Google Colab or any other alternative notebook hosting service, you must disable any built-in generative AI features.

Collaboration Policy

You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students is in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.

You must work with a different partner for Lab 4 than you did for Labs 1 and 2.

The Data

There are four files you will need for this lab. You may choose to either download each file and upload a copy to the local storage of your JupyterHub server, or load them directly from these URLs in your notebook (see the loading sketch after this list).

  1. https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/lab4/lab4.txt
  2. https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/lab4/contractions.csv
  3. https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/lab4/Fake.csv
  4. https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/lab4/True.csv
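
For example, here is a minimal sketch of loading everything directly from the URLs. It assumes pandas and requests are available in your environment; the variable names (text, contractions, fake, real) are just choices made for this handout.

    import pandas as pd
    import requests

    base = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/lab4/"

    # lab4.txt is a plain text file, so fetch it as one string
    text = requests.get(base + "lab4.txt").text

    # the three CSVs load directly into DataFrames
    contractions = pd.read_csv(base + "contractions.csv")
    fake = pd.read_csv(base + "Fake.csv")
    real = pd.read_csv(base + "True.csv")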

Pre-Lab

Lab

Part I

Create a fresh notebook, lab4.ipynb; title it and include the names of both partners in a Markdown cell at the top. In this notebook, load the contents of the lab4.txt document (linked above) into a string. Then implement the following text normalization steps, explaining your processing in comments and/or Markdown cells as needed (a sketch of one possible approach appears after this list):

  1. Replace any run of successive spaces or tabs with a single space character.

  2. Use the contraction-expansion pairs in contractions.csv to find and replace any instances of the contractions anywhere in the document.

  3. Use nltk’s sentence tokenization to break the document into a list of sentences. Count and print the number of sentences; find and print the longest sentence (the one with the most characters).

  4. Use nltk’s word tokenization to break each sentence into a list of tokens. (You will produce a list of lists, where inner lists are lists of tokens.)

  5. Lowercase every token.

  6. Create a dictionary that maps each unique token to the number of times it appears. Which token appears most often? List the tokens that only appear once.

If you encounter any unexpected challenges, make sure to describe them and how you decided to deal with them.
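
If it helps to see the shape of a solution, here is a minimal sketch of steps 1-6. It assumes the document string is named text (as in the loading sketch above) and that contractions.csv has columns named contraction and expansion; check the file's actual header before relying on those names.

    import re
    import nltk
    from collections import Counter

    nltk.download("punkt")  # tokenizer models; newer nltk may also need "punkt_tab"

    # Step 1: collapse any run of spaces/tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)

    # Step 2: expand contractions (column names are assumptions; see above).
    for contraction, expansion in zip(contractions["contraction"],
                                      contractions["expansion"]):
        text = text.replace(contraction, expansion)

    # Step 3: sentence tokenization.
    sentences = nltk.sent_tokenize(text)
    print("Number of sentences:", len(sentences))
    print("Longest sentence:", max(sentences, key=len))

    # Step 4: word tokenization, producing a list of lists of tokens.
    tokenized = [nltk.word_tokenize(s) for s in sentences]

    # Step 5: lowercase every token.
    tokenized = [[tok.lower() for tok in sent] for sent in tokenized]

    # Step 6: map each unique token to its count.
    counts = Counter(tok for sent in tokenized for tok in sent)
    print("Most frequent token:", counts.most_common(1))
    print("Tokens appearing once:", [t for t, c in counts.items() if c == 1])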

Part II

In the same lab4.ipynb notebook as Part I, perform the following analyses separately on (1) the True.csv data and (2) the Fake.csv data. For analyses 1-5, use the en_core_web_sm spaCy model that we used in the pre-lab. (Sketches of possible starting points for analyses 1, 2, and 6 appear after this list.)

  1. Using the news article titles, list all the countries, cities, and states mentioned during March 2017. Briefly comment on your findings.

  2. Using the news article titles, determine the top five people mentioned on election day, November 8th, 2016. Briefly comment on your findings.

  3. Create histograms of sentiment for real and fake articles that mention the top person (from the previous analysis). Briefly comment on your findings.

  4. Plot histograms of real and fake news article sentiment (by title) for a two-month period of your choosing. Briefly comment on your findings.

  5. Perform one additional analysis on a date or date range of your choosing. Justify your choice of analysis, and comment on your findings.

  6. Time each spaCy model (en_core_web_sm, en_core_web_md, en_core_web_lg, and en_core_web_trf) running on the first 5 full-text news articles from True.csv. Display your findings in a bar chart with one bar per spaCy model, with bar height equal to the time elapsed in that timing experiment. Briefly discuss your findings.
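
As a starting point, here is a sketch of how analyses 1 and 2 might look with en_core_web_sm. It assumes the CSVs have title and date columns and that the DataFrames are named real and fake as in the earlier loading sketch; verify the actual column names first.

    import spacy
    import pandas as pd
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")
    real["date"] = pd.to_datetime(real["date"], errors="coerce")

    # Analysis 1: GPE entities (countries/cities/states) in March 2017 titles.
    march = real[(real["date"] >= "2017-03-01") & (real["date"] <= "2017-03-31")]
    places = set()
    for doc in nlp.pipe(march["title"].astype(str)):
        places.update(ent.text for ent in doc.ents if ent.label_ == "GPE")
    print(sorted(places))

    # Analysis 2: top five PERSON entities in election-day titles.
    eday = real[real["date"] == "2016-11-08"]
    people = Counter()
    for doc in nlp.pipe(eday["title"].astype(str)):
        people.update(ent.text for ent in doc.ents if ent.label_ == "PERSON")
    print(people.most_common(5))

Note that the core spaCy pipelines do not include a sentiment component, so for analyses 3 and 4 you will need to bring in a sentiment tool (for example, TextBlob or the spacytextblob pipeline extension) and apply the same date-filtering pattern.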
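
For analysis 6, one possible timing setup is sketched below. It assumes all four models are installed (en_core_web_trf additionally requires the spacy-transformers package) and that the full article text lives in a text column.

    import time
    import spacy
    import matplotlib.pyplot as plt

    models = ["en_core_web_sm", "en_core_web_md",
              "en_core_web_lg", "en_core_web_trf"]
    articles = real["text"].astype(str).head(5).tolist()

    elapsed = []
    for name in models:
        nlp = spacy.load(name)           # load outside the timed region
        start = time.perf_counter()
        for article in articles:
            nlp(article)                 # run the full pipeline on each article
        elapsed.append(time.perf_counter() - start)

    plt.bar(models, elapsed)
    plt.ylabel("elapsed time (s)")
    plt.title("spaCy model processing time, 5 full articles")
    plt.show()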

Rubric

Pre-Lab (9 points)

Lab (48 points)

Part I (24 points)
Part II (24 points)

Acknowledgements

Thanks to Brian Hutchinson and Aaron Tuor for developing and iterating on prior versions of this lab.