## Text data

### [Kaggle fake news dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?resource=download)

### [Spacy tutorial](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-spacy)



In [4]:
! python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m18.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:
# Downloading the files needed to load the spacy language model
!pip3 install spacytextblob

import pandas as pd
from datetime import datetime # for grabbing date ranges
import spacy # for natural language processing
from spacytextblob.spacytextblob import SpacyTextBlob # for sentiment analysis

# This is loading the spacy language model
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('spacytextblob')





<spacytextblob.spacytextblob.SpacyTextBlob at 0x7ebe69252310>

### NLP crash course with spacy

+ A text dataset is often called a corpus. Especially if it has been curated in some fashion (labeled, annotated, ...).

+ Spacy is a python package that performs standard NLP analyses of text.

+ Tokenization: Splitting text into units
  - Sentence
  - Word based tokenization (wasn't)
  - Subword based tokenization (morpheme-like units)
    + Byte pair encoding
    + WordPiece
    + Unigram
    + SentencePiece
  - Character based tokenization
+ Lemmatization: mapping word variants to canonical root form
  - removing pluralization
  - removing tense
+ Part of speech tagging: Labeling words with part of speech (verb, noun, etc.)
+ Noun phrase identification
+ Named entity recognition (NER)
  - person, place, country, currency, ...
+ Sentiment analysis

In [6]:
# Showing you attributes of tokens you have access to in Spacy.

# Example text for demonstrations
text = "Dr. James Harvey, the big furry cat ate the little brown mice who weren't very happy."

doc = nlp(text)

# Using a python "list comprehension" in order to go through all the tokens in the document"
print([t.text for t in doc]) # word based tokenization
print([t.lemma_ for t in doc]) # lemmatization
print([t.pos_ for t in doc]) # Part of speech tagging
print([t.tag_ for t in doc]) # Fine grained part of speech tagging
print([n for n in doc.noun_chunks]) # Noun phrase parsing
print(spacy.explain('JJ'))
print(nlp.get_pipe('ner').labels)
print([f'{ent.text}, {ent.label_}, {spacy.explain(ent.label_)}' for ent in doc.ents])

['Dr.', 'James', 'Harvey', ',', 'the', 'big', 'furry', 'cat', 'ate', 'the', 'little', 'brown', 'mice', 'who', 'were', "n't", 'very', 'happy', '.']
['Dr.', 'James', 'Harvey', ',', 'the', 'big', 'furry', 'cat', 'eat', 'the', 'little', 'brown', 'mouse', 'who', 'be', 'not', 'very', 'happy', '.']
['PROPN', 'PROPN', 'PROPN', 'PUNCT', 'DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'DET', 'ADJ', 'ADJ', 'NOUN', 'PRON', 'AUX', 'PART', 'ADV', 'ADJ', 'PUNCT']
['NNP', 'NNP', 'NNP', ',', 'DT', 'JJ', 'JJ', 'NN', 'VBD', 'DT', 'JJ', 'JJ', 'NNS', 'WP', 'VBD', 'RB', 'RB', 'JJ', '.']
[Dr. James Harvey, the big furry cat, the little brown mice, who]
adjective (English), other noun-modifier (Chinese)
('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')
['James Harvey, PERSON, People, including fictional']


In [8]:
fake = pd.read_csv('https://fw.cs.wwu.edu/~hutchib2/doc/Fake.csv')
real = pd.read_csv('https://fw.cs.wwu.edu/~hutchib2/doc/True.csv')

In [9]:
def clean_dates(df):
  """
  Translates dates from various string formats to python Datetime objects.
  Adds columns Date: Datetime, month: int, year: int, day: int.
  Also tosses out corrupted rows in the dataframe.

  :param df: (DataFrame) With column 'date'

  :returns: (DataFrame) Cleaned with additional columns Date, month, day, year
  """
  formats = {1: '%d-%b-%y', 0: '%B %d, %Y', 2: '%b %d, %Y'}
  dates = []
  bad = []
  for i, date in enumerate(df['date']):
    try:
      if date[0].isdigit():
        f = formats[1]
      elif len(date.split()[0].strip()) > 3:
        f = formats[0]
      else:
        f = formats[2]
      dates.append(datetime.strptime(date.strip(), f))
    except:
      bad.append([date, format, i])
  print(bad)
  df = df.drop([b[-1] for b in bad])
  df['Date'] = dates
  df['month'] = pd.DatetimeIndex(dates).month
  df['year'] = pd.DatetimeIndex(dates).year
  df['day'] = pd.DatetimeIndex(dates).day
  return df

In [10]:
fake = clean_dates(fake)
real = clean_dates(real)

[['https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/', <built-in function format>, 9358], ['https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/', <built-in function format>, 15507], ['https://100percentfedup.com/12-yr-old-black-conservative-whose-video-to-obama-went-viral-do-you-really-love-america-receives-death-threats-from-left/', <built-in function format>, 15508], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/hillarystreetart.jpg', <built-in function format>, 15839], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/entitled.jpg', <built-in function format>, 15840], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/hillarystreetart.jpg', <built-in function format>, 17432], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/entitled.jpg', <built-in function format>, 17433], ['MSNBC HOST Rudely Assumes Steel Worker Would Neve

In [11]:
# Boolean indexing
short_fake = fake[(fake.year == 2017) & (fake.month == 1)]


In [12]:
short_fake

Unnamed: 0,title,text,subject,date,Date,month,year,day
2749,Trump’s SCOTUS Pick Sided With Hobby Lobby Ag...,"On Tuesday, Donald Trump announced the identit...",News,"January 31, 2017",2017-01-31,1,2017,31
2750,It Took A Scathing Letter From Canada’s Prime...,Fox News couldn t wait to try to spin the Queb...,News,"January 31, 2017",2017-01-31,1,2017,31
2751,WATCH: Jake Tapper STUNNED Into Disbelief Lis...,Sean Spicer is doing his level best to make en...,News,"January 31, 2017",2017-01-31,1,2017,31
2752,An Anonymous Group Just Revealed The Direct P...,"Just after Donald Trump was sworn in, his admi...",News,"January 31, 2017",2017-01-31,1,2017,31
2753,Trump Jr. Just ‘Liked’ Tweet Praising Mosque ...,When it comes to how shameless the Trump famil...,News,"January 31, 2017",2017-01-31,1,2017,31
...,...,...,...,...,...,...,...,...
23076,SOUR GRAPES? Whatever happened to the ‘smooth ...,Andrew Malcolm McClatchy News You better stop...,Middle-east,"January 3, 2017",2017-01-03,1,2017,3
23077,HACKING DEMOCRACY? CIA Accusing Russia of Doin...,Peter Certo Other WordsEven in an election yea...,Middle-east,"January 3, 2017",2017-01-03,1,2017,3
23078,Good News for Silver in 2017,James Burgess Oil PricePrecious metals are an...,Middle-east,"January 3, 2017",2017-01-03,1,2017,3
23079,Gerald Celente: Top 10 Trends for 2017,"What can we expect in 2017? Inflated markets, ...",Middle-east,"January 2, 2017",2017-01-02,1,2017,2


In [13]:
short_real = real[(real.year == 2017) & (real.month == 1)]

In [15]:
%time fake_docs = short_fake['title'].apply(nlp)

CPU times: user 11 s, sys: 40.1 ms, total: 11 s
Wall time: 11.1 s


In [16]:
%time real_docs = short_real['title'].apply(nlp)

CPU times: user 7.45 s, sys: 36.3 ms, total: 7.49 s
Wall time: 8.47 s


In [17]:
doc = fake_docs.iloc[0]
print([t.text for t in doc]) # tokenized text of the document
print([t.lemma_ for t in doc]) # lemmatized text of the document
print([t.pos_ for t in doc]) # Part of speech for all tokens in the document
print([t.tag_ for t in doc]) # Fine-grained part of speech for all tokens in the document
print(spacy.explain('VBN')) # making sense out of the spacy acronyms
print([c for c in doc.noun_chunks]) # All the noun phrases in the document
print(len([doc for doc in fake_docs if round(doc._.blob.polarity, 2)  < -.9])) # Sentiment analysis of the document. <0 means negative >0 means positive.
print(len([doc for doc in real_docs if round(doc._.blob.polarity, 2) < -.9]))

print(len([doc for doc in fake_docs if round(doc._.blob.polarity, 2)  > .9])) # Sentiment analysis of the document. <0 means negative >0 means positive.
print(len([doc for doc in real_docs if round(doc._.blob.polarity, 2) > .9]))


[' ', 'Trump', '’s', 'SCOTUS', 'Pick', 'Sided', 'With', 'Hobby', 'Lobby', 'Against', 'Women', ',', 'Thinks', 'Christianity', 'Trumps', '‘', 'Secular', 'Courts', '’']
[' ', 'Trump', '’s', 'SCOTUS', 'Pick', 'side', 'with', 'Hobby', 'Lobby', 'against', 'Women', ',', 'think', 'Christianity', 'Trumps', "'", 'Secular', 'Courts', "'"]
['SPACE', 'PROPN', 'PART', 'PROPN', 'PROPN', 'VERB', 'ADP', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT', 'VERB', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT']
['_SP', 'NNP', 'POS', 'NNP', 'NNP', 'VBD', 'IN', 'NNP', 'NNP', 'IN', 'NNPS', ',', 'VBZ', 'NNP', 'NNP', '``', 'NNP', 'NNPS', "''"]
verb, past participle
[ Trump’s SCOTUS Pick, Hobby Lobby, Women, Christianity Trumps, ‘Secular Courts]
22
1
18
1


In [18]:
text = 'The big furry cat ate the little brown mouse.'
doc = nlp(text)
print([n for n in doc.noun_chunks])

[The big furry cat, the little brown mouse]


In [19]:
from collections import Counter

# c = Counter(['a', 'a', 'b', 'b', 'b'])
# print(c)
# d = Counter(['a', 'b', 'd', 'd'])
# d.most_common(1)

def count_nouns(doc):
  nouns = [
      token.lemma_ for token in doc if
             token.tag_ == "NNPS"]
  word_freq = Counter(nouns)
  return word_freq

# counts = real_docs.apply(count_nouns).sum().most_common(20)
# # print(counts)
counter_series = fake_docs.apply(count_nouns)
print(counter_series.iloc[5])
print(counter_series.iloc[0])
print(counter_series.sum().most_common(5))
print(fake_docs.apply(count_nouns).sum().most_common(20))

Counter({'Walls': 1})
Counter({'Women': 1, 'Courts': 1})
[('Republicans', 29), ('Americans', 18), ('Democrats', 16), ('Rights', 11), ('Women', 10)]
[('Republicans', 29), ('Americans', 18), ('Democrats', 16), ('Rights', 11), ('Women', 10), ('Supporters', 10), ('Dems', 8), ('Refugees', 8), ('Muslims', 7), ('Liberals', 7), ('Streets', 6), ('Blacks', 5), ('Orders', 4), ('Workers', 4), ('Delivers', 4), ('Regulations', 4), ('Conservatives', 4), ('Attacks', 4), ('Hits', 4), ('Sessions', 4)]
