# Lecture 10 - NLP Exercise

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

Below I've pasted the text of a news article. Your goal is to use spaCy to get a list of the most frequently appearing terms in the article, as a coarse way to summarize what it's about.

In [84]:
article = """
If you live in any major city or suburb in the U.S., you may have noticed more and more parents hauling their kids around on bulky cargo bicycles. Some families are ditching their second car, forgoing a minivan, or going car-free altogether.

Cargo bikes have been around for more than a century — and they're popular elsewhere on the globe. But until a few years ago, they were all but forgotten in North America. Now they're making a comeback.

There are a few reasons behind the surge in cargo bike ridership, including better bicycle infrastructure, and bikes that are easy to ride, even if you're not an athlete.

Lelac Almagor was not a biker before she bought her cargo bike.

"I'm a very lazy person," Almagor says.

She thought she'd give the bike a try, and probably return it.

"But by the third day I was like, 'Oh, this is actually going to change my life,'" says Almagor, who lives in Washington, D.C.

Now she rides her three kids to school every day, even when it's pouring rain or hot and muggy.

"It's such a better start to my day, that now there's truly not weather that I would rather drive in," Almagor says.

She bought the bike six years ago — back then, she'd see maybe one other cargo bike parent at school. Now there are dozens crowding the bike racks, not to mention the riders she sees heading to other schools. Almagor got so into cargo bikes, she left her teaching career and now works for a bike company.

Bikes designed to carry kids

Today's cargo bikes are designed specifically for transporting kids, with comfy seats and rain canopies, and, importantly, electric motors.

Philip Koopman, a longtime D.C. bike shop owner, says there were no electric cargo bikes when his kids were young. He toted them around around the city in a trailer behind his bike, muscling up hills.

"Most people don't want to be sweaty when they get to work," Koopman says. "So having these different options, it just makes cycling so much more attainable for so many more people."

On a recent Saturday, Koopman was helping people test ride cargo bikes at the DC Family Bike Fest. The event drew hundreds of parents, including Patricia Stamper, who was trying out different bikes with her two kids.

"I'm 39, I'm losing weight, I need some help," Stamper says. "And this is cheaper than bariatric surgery, it's cheaper than Wegovy."

The bike she's eyeing is pricey, at around $2,500. But she thinks it'll be worth it, to get some exercise, get her kids to school, ride to work and go shopping.

While cargo bikes are expensive, riders point out they're much cheaper than cars, especially when you account for gas, insurance, parking and maintenance.

Better bike infrastructure

The first protected bike lane in the U.S. — separated from car traffic — was installed in 1967 in Davis, Calif. But it wasn't until 40 or 50 years later, in the 2000s, that other cities followed suit. Now such lanes can be found all over the country.

Minneapolis is often ranked as the best cycling city in the nation, with more than 200 miles of bike lanes and trails.

"If it's not safe to ride a bike, it's going to be hard to get people on bikes," says Laura Mitchell, who lives in the city.

Mitchell says getting a cargo bike was a "game changer" for her family. So, earlier this year, she started the Minneapolis Cargo Bike Library, where residents can check out a bike for free, to test it out, or for the occasional trip to a big box store. It's been so popular, she quickly had to cap the number of users.

People aren't only biking with their kids in cities with great infrastructure. Unlike Minneapolis, Houston is often near the bottom of bike rankings.

Brian Jackson, who lives in Houston, frequently drops off his two kids by cargo bike. He's the only one at his school and daycare. He gets a lot of quizzical looks.

"A lot of people are like, 'I've never seen anything like it,'" Jackson says.

Jackson does see a few other cargo-bike parents on the bike trails, and he wishes more people would give it a try. He says Houston has a "secret strong bike culture," and drivers usually give him plenty of space, especially when they see the kids on board.

A second bicycle boom

Since the very first bicycles, people jerry rigged them to carry cargo, including passengers. The heyday of the cargo bike — at least in North America and Europe — was about a century ago.

Back then, cities were buzzing with workmen on bicycles.

"There'd be knife sharpeners with a little studio set up on the back of their bikes. There were glaziers — people who fixed broken windows," says Jody Rosen, author of the book Two Wheels Good: The History and Mystery of the Bicycle.

It all started with the Great Bicycle Boom of the 1890s. Rosen says now we're in a new, 21st century bicycle boom, and cargo bikes are part of it.

Lelac Almagor, the D.C. cargo bike mom, says she thinks the biggest reason there are more people riding cargo bikes isn't the infrastructure or the e-bike technology. It's seeing other parents riding them. That makes it seem normal — just a practical way to get around with your kids.

The very best advertisement for cargo biking, she says? The carpool line at school.

"When the car line is wrapping around the block, just gliding past in our little flotilla of cargo bikes. There's no way, as a parent, not to be like, 'maybe we should do that.'"""


In [87]:
import spacy
nlp = spacy.load("en_core_web_sm")

I've started by simply running the article through spaCy, which tokenizes it and returns a `Doc` object:

In [88]:
art = nlp(article)

We can use a `Counter` from Python's `collections` module to count how often each item occurs in a collection.

In [64]:
from collections import Counter
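
As a quick illustration of how `Counter` behaves, here is a toy example on a hand-written list of words (the list is made up for this sketch and is not derived from the article):

```python
from collections import Counter

# A small hand-written list of words, purely for illustration.
words = ["bike", "cargo", "bike", "kids", "bike", "cargo"]

counts = Counter(words)

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('bike', 3), ('cargo', 2)]
```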

Here, we'll just count the frequency of the raw tokens (we need to call `str` on each one to get its text, rather than the `Token` object, which carries much more information).

In [82]:
token_counts = Counter([str(x) for x in art])

In [86]:
token_counts.most_common(10)

[(',', 79),
 ('.', 54),
 ('the', 36),
 ('\n\n', 34),
 ('a', 27),
 ('"', 25),
 ('bike', 24),
 ('to', 21),
 ('cargo', 19),
 ('in', 18)]

This is a start! But not quite where we'd like to be.

Use one or more of the following to get a list of key words or phrases that better summarize the article's subject:
* Punctuation and Stopword removal
* Lemmatization
* Entity tagging
* Noun phrase parsing
* Part of speech tagging
* Token similarity (thought: if some of the top tokens are very similar to each other, maybe pick a representative one? not sure if this will be helpful or not)