# Lecture 4: Structured Data, Data Formats; Data Collection, Data Questions

## Announcements:
  * I created an `#events` channel on Discord where I'll post talk announcements.
  * There's also a `#rtyi` (relevant to your interests) channel; feel free to post links to course-relevant articles, etc.
  * Lab 1 due Thursday at 10pm

* Themes from the start of quarter survey
  * Experience levels: very mixed! Let's try to use this as an asset.
  * ![# Quarters at Western](https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lectures/L04/qtrs.png)
  * ![# CSCI courses taken](https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lectures/L04/csci.png)
  * Overall: lots of excitement! This is gonna be super fun!

* Talks this week (topics and rooms TBA):
  * Thu 1/19 4pm Ryan Bockmon - Research (CS education)
  * Fri 1/20 4pm Ryan Bockmon - Teaching



## Goals:
  * Know the basic data types that often appear in pandas columns (string, integer, floating-point, `object`)
  * Know about some of the various ways data can be structured (table, tree, graph, stream)
  * Know the basics of some of the most common data formats, including CSV, JSON, XML, (SQL). (Skiena 3.1.2)
  * Get a broad overview of some of the technologies used in data science and where our tools of choice fit in.
  * Think about who collects data, who makes it available, and know about some of the places you might find data available for download. (Skiena 3.2)

(saved for later:)
  * Asking questions: Think about what makes a good data science question and practice formulating some of your own. (Skiena 1.2)


## Data Types

* Types
    * `str`
    * integer
      * signed
        * `int32` (*roughly* -2 billion to +2 billion)
        * `int64` (*roughly* -9 quintillion to +9 quintillion)
      * unsigned
        * `uint8` (0 to 255)
        * `uint32` (*roughly* 0 to 4 billion)
        * `uint64` (*roughly* 0 to 18 quintillion)
    * floating-point
      * `float32` - *roughly* 7 decimal digits of precision
      * `float64` - *roughly* 15 decimal digits of precision
    * `object` - pandas type that usually holds columns of strings or of mixed types
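
To make the ranges above concrete, here is a minimal sketch that asks NumPy for the exact integer limits and lets pandas infer column types for a tiny made-up table (the column names are invented for illustration). One caveat: depending on your pandas version, the string column may be reported as `object` or as a dedicated string type.

```python
import numpy as np
import pandas as pd

# Exact limits of the integer types (the notes above give rounded values).
for int_type in (np.int32, np.int64, np.uint8, np.uint32, np.uint64):
    info = np.iinfo(int_type)
    print(f"{int_type.__name__:>6}: {info.min} to {info.max}")

# float32 keeps only about 7 significant decimal digits,
# so a 9-digit number gets rounded when stored.
as_float32 = np.float32(123456789)
as_float64 = np.float64(123456789)
print(int(as_float32))  # 123456792
print(int(as_float64))  # 123456789

# A small made-up table to show how pandas assigns column types.
df = pd.DataFrame({
    "name": ["a", "b", "c"],     # strings
    "count": [1, 2, 3],          # integers
    "score": [0.5, 1.5, 2.5],    # floating-point
    "mixed": [1, "two", 3.0],    # mixed types end up as object
})
print(df.dtypes)
```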

## Types of Structured Data

We talked about structured vs unstructured data. Structured data can come in many different forms, depending on *how* it's structured.

**One minute think:** what kind of *structure* might data have? Or in other words, how might it be organized? Think conceptual structure, not file formats.


* **Tables**
* Graphs (not plots)
  * maps (cartographic)
  * social networks
* (plot)
* Map (key-value pairs)
* Tree
  * family tree?
  * outline
  * genres
  * evolution
  * organization chart
* Time series / stream
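
As a rough illustration of the structures brainstormed above, here is one way each might look using plain Python containers. All names and values are made up; these are sketches of the *conceptual* structure, not of any file format.

```python
# A table: rows of records that share the same fields.
table = [
    {"station": "A", "docks": 10},
    {"station": "B", "docks": 15},
]

# A map: key-value pairs.
docks_by_station = {row["station"]: row["docks"] for row in table}

# A graph: nodes plus connections, stored here as an adjacency list.
friends = {
    "ana": ["ben", "cy"],
    "ben": ["ana"],
    "cy": ["ana"],
}

# A tree: each node has a list of children (like an outline or genre hierarchy).
genres = {
    "name": "music",
    "children": [
        {"name": "rock", "children": [{"name": "punk", "children": []}]},
        {"name": "jazz", "children": []},
    ],
}


def count_nodes(node):
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# A time series / stream: values in time order.
readings = [("09:00", 12.1), ("09:05", 12.4), ("09:10", 12.2)]

print(docks_by_station)     # {'A': 10, 'B': 15}
print(count_nodes(genres))  # 4
```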

## File Formats

* CSV (our bread and butter)
 * example: bike and 311 data from Lab 1
 * Reading a CSV file in Python, 3 ways: https://www.youtube.com/watch?v=fbl8fMQ9tQM
 * Lab 1 Q2.1 gives a taste of the craziness you can find in CSV files, and the capabilities [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) gives for dealing with it.
   * Variations: separators, quoting, headers or not, ...
 * Let's look at the raw contents of the `avengers.csv` file from last week:

```python
import requests

response = requests.get("https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/avengers/avengers.csv")
response.text
```

'URL,Name/Alias,Appearances,Current?,Gender,Probationary Introl,Full/Reserve Avengers Intro,Year,Years since joining,Honorary,Death1,Return1,Death2,Return2,Death3,Return3,Death4,Return4,Death5,Return5,Notes\rhttp://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,,Sep-63,1963,52,Full,YES,NO,,,,,,,,,Merged with Ultron in Rage of Ultron Vol. 1. A funeral was held. \rhttp://marvel.wikia.com/Janet_van_Dyne_(Earth-616),Janet van Dyne,1165,YES,FEMALE,,Sep-63,1963,52,Full,YES,YES,,,,,,,,,Dies in Secret Invasion V1:I8. Actually was sent tto Microverse later recovered\rhttp://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,,Sep-63,1963,52,Full,YES,YES,,,,,,,,,"Death: ""Later while under the influence of Immortus Stark committed a number of horrible acts and was killed.\'  This set up young Tony. Franklin Richards later brought him back"\rhttp://marvel.wikia.com/Robert_Bruce_Banner_(Earth-616),Robert Bruce Banner,2089,YES
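
The raw text above shows two CSV quirks: rows are separated by `\r`, and fields that contain quotes are wrapped in quotes with the inner quotes doubled. Here is a small sketch of how a CSV parser untangles that, using Python's built-in `csv` module on a shortened excerpt (only three of the columns are kept, for brevity). `pd.read_csv` handles doubled quotes the same way by default.

```python
import csv
import io

# A shortened excerpt in the style of the raw text above:
# rows end in "\r", and one field is quoted with doubled inner quotes.
raw = (
    'Name/Alias,Appearances,Gender\r'
    '"Henry Jonathan ""Hank"" Pym",1269,MALE\r'
    'Janet van Dyne,1165,FEMALE\r'
)

# newline="" leaves line endings untouched so the csv module can handle them.
rows = list(csv.reader(io.StringIO(raw, newline="")))
header, records = rows[0], rows[1:]

print(header)         # ['Name/Alias', 'Appearances', 'Gender']
print(records[0][0])  # Henry Jonathan "Hank" Pym
```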

   
* JSON (popular, especially for data with more nuanced structure)
 * example: [SEA_building_energy.json](https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/data/SEA_building_energy.json), via https://www.kaggle.com/city-of-seattle/sea-building-energy-benchmarking/version/8
* XML - highly structured like JSON, but more verbose and rigid; easier to specify schemas for; decreasing in popularity
* SQL - the query language for traditional relational databases; the stored data is generally not human-readable, but can be exported to JSON or CSV; see CSCI 330 for more on this
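
To give a feel for the "more nuanced structure" JSON allows, here is a minimal sketch that parses a made-up record and flattens it into something table-like. The field names are invented for illustration and are not the schema of the Seattle file linked above.

```python
import json

# A made-up record; field names are illustrative only.
text = """
{
  "building": "Example Hall",
  "year_built": 1998,
  "uses": ["office", "retail"],
  "energy": {"electricity_kwh": 120000, "gas_therms": 3400}
}
"""

record = json.loads(text)

# Nested JSON objects become dicts; JSON arrays become lists.
print(record["energy"]["electricity_kwh"])  # 120000
print(record["uses"])                       # ['office', 'retail']

# Flatten one level of nesting so each value gets its own "column".
flat = {}
for key, value in record.items():
    if isinstance(value, dict):
        for inner_key, inner_value in value.items():
            flat[f"{key}.{inner_key}"] = inner_value
    else:
        flat[key] = value
print(flat)
```

A CSV file would need a convention for the list in `uses` and the nested `energy` object; JSON stores both directly.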


## The Data Science Computing Landscape
* Languages
  * Julia:
    * Newcomer, designed for scientific computing.
    * Fast\* and high-level
    * Good for modeling physical systems, generating simulation data, among other things
  * Perl
    * great string manipulation, less great everything else
    * used to be popular, but fell behind Python in 2008 and continues to fall
    * may encounter it in legacy projects
  * R
    * Most popular language among statisticians
    * R vs Python is a popular battle
    * Can call R from Python, if needed
    * Nice visualization with the `ggplot2` package
  * Matlab
    * great linear algebra support, good visualization
    * decent number of packages
    * highly popular among engineers
    * GNU Octave is an open source alternative
    * Often used to generate simulation data
  * Java, C, C++
    * Show up mostly in big data projects, distributed computing
  * Mathematica / Wolfram Alpha
    * Good for numeric and symbolic mathematics
  * Excel
    * Widely used
  * Python
    * dominant language for data science
      * quick development
      * rich library support
      * general purpose
      * a great "glue" language to leverage lots of different libraries
    * slow?
      * sort of, but most of the time will be spent in calls to libraries that are highly optimized C, C++ or Fortran code
      * not a concern for most data science use cases
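
The "slow?" point can be seen in a small experiment: the same sum of squares computed with a pure-Python loop and with a single NumPy call that runs in compiled code. This is only a sketch; the exact timings depend on your machine, so none are claimed here.

```python
import time

import numpy as np

n = 1_000_000
values = list(range(n))
array = np.arange(n, dtype=np.int64)

# Pure Python: the interpreter executes every multiply and add.
start = time.perf_counter()
loop_total = 0
for v in values:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# NumPy: one call, with the loop running in optimized compiled code.
start = time.perf_counter()
numpy_total = int(np.dot(array, array))
numpy_seconds = time.perf_counter() - start

print(loop_total == numpy_total)  # True
print(f"loop: {loop_seconds:.4f}s, numpy: {numpy_seconds:.4f}s")
```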

* Useful Python Libraries
  * pandas - manipulate, summarize, clean tabular structured data
  * NumPy - fast numerical operations (linear algebra, Fourier transforms, etc.)
    * pandas is built on NumPy
  * matplotlib
    * 2d visualizations
  * seaborn
    * fancy data viz, built on matplotlib
  * SciPy
    * Assorted scientific computing functionality, including signal processing, statistics, optimization, etc.
  * scikit-learn
    * machine learning in Python (best for small-data applications)
  * Natural Language Processing
    * NLTK
    * spaCy
  * Deep learning: industrial-strength and/or research-grade machine learning; out of scope for this course
    * PyTorch
    * TensorFlow
    * OpenAI Gym
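
A tiny sketch of how two of these libraries fit together: pandas summarizes a (made-up) table, and the numeric data underneath can be handed to NumPy functions as an array.

```python
import numpy as np
import pandas as pd

# A small made-up table.
df = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [2.0, 4.0, 6.0]})

# pandas: summarize tabular data (here, the mean of each column).
means = df.mean()
print(means)

# The column's values are available as a NumPy array,
# so any NumPy function can operate on them.
x = df["x"].to_numpy()
print(type(x))
print(np.sqrt(x))
```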



## Data Collection

**Two minute think:** who (or which entities) collect data? Which ones might share that data with you?


* Researchers
* Government
* Businesses / corporations
* AI
* People (social interactions)
* Managers
* Computer viruses
* Animals? maybe in the future we'll have this on our computers?
* Cameras
* Web crawlers
* Libraries
* Hospitals
* NGOs

A few ideas:

* Companies, sometimes (privacy/liability/trade secrets)
 * Google ngrams: https://books.google.com/ngrams
 * APIs ("application programming interfaces") - a way to programmatically ask for data. Many companies offer them, including:
   * Twitter
   * New York Times
   * Facebook
   * Uber
* Governments
 * https://www.data.gov/
 * https://data.wa.gov/
 * https://www.whatcomcounty.us/716/Data
 * https://www.google.com/search?q=city+of+bellingham+data
 * https://data.seattle.gov/
* Scientists / academics
 * Have to search around - rarely available in centralized repositories; often not available.
   * https://scholar.google.com/ is good for finding papers related to a subject
* Other data sciency people
 * Data science/ML contests
   * https://www.kaggle.com/datasets
 * Data journalists:
   * https://data.fivethirtyeight.com/
   * https://open.nytimes.com/data/home
* You!
 * Any tech you have control over, you can log stuff!
   * smartphone, fitness tracker, ...
 * *some* online services let you download the data they have on you:
   * e.g.: https://support.strava.com/hc/en-us/articles/216918437-Exporting-your-Data-and-Bulk-Export#Bulk
 * Web scraping - later in this course
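
To show roughly what "programmatically asking for data" through an API looks like, here is a sketch that builds a request URL and parses a JSON reply. The endpoint, parameter names, and response are all hypothetical; real APIs document their own, and many require an API key. No network request is made here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/v1/trips"
params = {"start_date": "2023-01-01", "limit": 2}
url = f"{BASE_URL}?{urlencode(params)}"
print(url)  # https://api.example.com/v1/trips?start_date=2023-01-01&limit=2

# In practice you would fetch `url` (e.g. with requests.get(url).text).
# A hard-coded, made-up response stands in for that step here.
response_text = '[{"trip_id": 1, "minutes": 12}, {"trip_id": 2, "minutes": 7}]'

trips = json.loads(response_text)
total_minutes = sum(trip["minutes"] for trip in trips)
print(total_minutes)  # 19
```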

## Asking interesting data questions

* Consider a couple example datasets:
 * IMDB: All things movies. https://www.imdb.com/interfaces/
   * Films: title, duration, genre tags, date, cast, crew, user ratings, critic ratings, ...
   * People (actors, directors, writers, producers, crew): appearances/credits, birth/death dates, height, awards, ...
 * Boston Bike Share data: https://www.bluebikes.com/system-data
   * Trips: Trip Duration, Start Time and Date, Stop Time and Date, Start Station Name & ID, End Station Name & ID, Bike ID, User Type (Casual = Single Trip or Day Pass user; Member = Annual or Monthly Member), Birth Year, Gender (self-reported by member)
   * Stations: ID, Name, GPS coordinates, # docks

* Think-pair-share: Come up with one interesting question you might want to answer with each of these datasets.




### Ideas from the class:


**Insight**: Datasets can often answer questions that the data isn't directly about.
 * Example: baseball stats (http://www.baseball-reference.com) has details about ~20,000 major league baseball players over the last 150 years. This data includes handedness and birth/death dates, so we can study a possible link between handedness and lifespan.
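
Here is a sketch of how the handedness question could be set up in pandas. The rows below are invented placeholders (deliberately chosen so the two groups come out equal) and say nothing about real players; the column names are assumptions, not the site's actual schema.

```python
import pandas as pd

# Invented placeholder rows; NOT real player data.
players = pd.DataFrame({
    "throws": ["L", "R", "R", "L"],
    "birth_year": [1900, 1905, 1910, 1920],
    "death_year": [1970, 1980, 1980, 1995],
})

# Derive the quantity the question is about, then compare groups.
players["lifespan"] = players["death_year"] - players["birth_year"]
summary = players.groupby("throws")["lifespan"].agg(["count", "mean"])
print(summary)
```

With real data you would also want to ask whether the groups are comparable (e.g. born in the same eras) before reading anything into a difference.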