Lecture 1 Notes

Announcements

Start of quarter survey on Canvas. Fill out by tonight.
Please fill out Nick’s office hours scheduling poll.
Lab 1 Pre-Lab and Lab 1 are available.
- Pre-lab is due by the start of your lab - submit on Canvas
- During Wednesday’s class you will have the chance to ask questions pertaining to the Pre-Lab
- The Pre-Lab is to be done individually, while the Lab itself will be done in pairs.

Goals

Understand what is meant by a few properties of data:
- Structured vs unstructured
- Numerical vs categorical
- Big vs small
Know about some different kinds of structured data, including tables, trees, graphs
Know the properties of some basic datatypes that we can use to represent data on computers
- signed and unsigned integers, floats, strings, “objects”
Know how to start a Jupyter server on the department’s JupyterHub.
Be able to navigate and work in Jupyter notebooks with
- Basic markdown syntax in text cells
- Python code in code cells
Get a basic understanding of the landscape of data science tools, and which ones we’ll focus on using in this course.

Notes / Agenda

Questions on the syllabus?

Properties of Data

structured/unstructured
numerical/categorical
big/small

Different kinds of structured data:

Tables
- gradebook
- event log
- Relational database
Graphs (not plots)
- maps (cartographic)
- social networks
(Plots)
Map (key-value pairs)
- potentially with nesting (e.g. JSON)
Tree
- org chart
- e.g. XML
Time series / stream
- temperature records
- video
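
A minimal sketch of how the kinds of structured data listed above might look as plain Python values. All names and values here are made up for illustration; in this course, tables will usually live in pandas rather than in lists of dictionaries.

```python
import json

# Table: every row has the same columns (like a gradebook).
# "lab1" is numerical; "major" is categorical.
gradebook = [
    {"student": "A", "lab1": 9.5, "major": "CS"},
    {"student": "B", "lab1": 7.0, "major": "Math"},
]

# Graph: nodes and the edges between them (like a small social network),
# stored here as an adjacency list.
friends = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

# Map: key-value pairs, potentially nested (parsed from JSON text).
record = json.loads('{"name": "A", "contact": {"email": "a@example.com"}}')

# Tree: every node except the root has exactly one parent (like an org chart).
org_chart = {
    "name": "CEO",
    "reports": [
        {"name": "CTO", "reports": []},
        {"name": "CFO", "reports": []},
    ],
}

# Time series: values ordered by time (like temperature records).
temperatures = [("2024-01-01", 3.5), ("2024-01-02", 4.1), ("2024-01-03", 2.8)]


def count_nodes(node):
    """Count the nodes in a tree shaped like org_chart."""
    return 1 + sum(count_nodes(child) for child in node["reports"])


# Numerical columns support arithmetic; categorical columns support counting.
mean_lab1 = sum(row["lab1"] for row in gradebook) / len(gradebook)
majors = {row["major"] for row in gradebook}

print(mean_lab1)               # 8.25
print(sorted(majors))          # ['CS', 'Math']
print(count_nodes(org_chart))  # 3
```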

Worksheet - #1

Data Types:

str
integer
- signed
  - int32 (roughly -2.1 billion to +2.1 billion)
  - int64 (roughly -9.2 quintillion to +9.2 quintillion)
- unsigned
  - uint8 (0 - 255)
  - uint32 (roughly 0 to 4.3 billion)
  - uint64 (roughly 0 to 18 quintillion)
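
The ranges above follow directly from the number of bits. A small sketch that computes them exactly; `int_range` and `wrap_uint8` are hypothetical helper names, not part of any library. Python's own `int` has no fixed width, so the wraparound of a fixed-width type is only simulated here with modular arithmetic.

```python
def int_range(bits, signed):
    """Return (min, max) for a fixed-width integer with the given bit count."""
    if signed:
        # One bit pattern in two's complement goes to the extra negative value.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def wrap_uint8(value):
    """Simulate storing a value in a uint8, which keeps only the low 8 bits."""
    return value % 256


for name, bits, signed in [
    ("int32", 32, True),
    ("int64", 64, True),
    ("uint8", 8, False),
    ("uint32", 32, False),
    ("uint64", 64, False),
]:
    low, high = int_range(bits, signed)
    print(f"{name:>6}: {low:,} to {high:,}")

print(wrap_uint8(255 + 1))  # 0: one past the maximum wraps around
```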
floating-point
- float32 - roughly 7 decimal digits of precision
- float64 - roughly 15 decimal digits of precision
- Representation, by example for float32:
  - 1 sign bit (\(S\)), 8 exponent bits (\(E\)), 23 mantissa bits (\(M\))
  - value is similar to scientific notation, but in base 2: \((-1)^S \times 1.M \times 2^{(E-127)}\)
- not exact, and mathematically equivalent calculations don’t necessarily give you the same answer!
  - 0.1 + 0.2 == 0.3 evaluates to False!
  - if you’re comparing for equality, avoid using ==
  - instead, check whether they’re closer than some tolerance
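
A sketch of both float points above, using only the standard library. `float32_fields` and `float32_value` are hypothetical helper names; the reconstruction handles normal numbers only (not zero, subnormals, infinity, or NaN).

```python
import math
import struct


def float32_fields(x):
    """Split x (stored as a float32) into its sign, exponent, and mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits
    mantissa = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, mantissa


def float32_value(sign, exponent, mantissa):
    """Rebuild a normal float32 as (-1)^S * 1.M * 2^(E - 127)."""
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


# -6.5 = -1.625 * 2^2, so S = 1, E = 2 + 127 = 129, and M encodes 0.625.
s, e, m = float32_fields(-6.5)
print(s, e, m)                 # 1 129 5242880
print(float32_value(s, e, m))  # -6.5

# Comparing floats: exact equality fails, a tolerance check succeeds.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```

`math.isclose` uses a relative tolerance of `1e-09` by default; a comparison against zero needs an explicit `abs_tol`, because a relative tolerance around zero is always zero.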
object - Pandas type that usually wraps columns of strings and other mixed types

JupyterHub

Also see the full guide. Here are the basic steps:

If off campus, connect to VPN
Go to https://csci-head.cluster.cs.wwu.edu/
For now, keep all the defaults in Server options, except select DATA311 from the “Environment” dropdown. Press Start.
When you’re done, please shut your server down:
1. Go to File > Hub Control Panel
2. Press the red Stop My Server button.

JupyterLab

Things to point out:

file browser (you are on the cluster filesystem - separate homedir)
Launcher
- Notebook
- Console
- Terminal
Jupyter Notebook basics
- Python cells
  - Contain Python code
  - You can run and re-run cells
  - State is maintained after running a cell
  - The value of the last line, if any, is displayed (not printed)
- Markdown cells:
  - Allow you to intersperse formatted text with code.
  - Type your markdown syntax, then “run” the cell to see it rendered with formatting.
  - Basic markdown formatting:
    - headings ## Level 2 Heading
    - lists (bulleted, numbered)
      - * bulleted list item
      - 1. numbered list item
    - *italics*, **bold**, `monospace`
    - code block:
````
```language
z = 4
```
````
    - link: [link text](link url)
    - image: ![alt text](image url)
- Why Jupyter?
  - Interleaved display
    - Quick, interactive development cycle
    - Reproducibility. Cardinal rule of data science: Always start with the raw data.
- Why not Jupyter?
  - State gets confusing if you run cells out of order, or selectively re-run cells
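
A rough simulation of the state problem above, outside of Jupyter. Each string stands in for a notebook cell, and `run_cells` (a hypothetical helper) executes them against one shared namespace, roughly as a notebook kernel does. The cell contents are made up for illustration.

```python
def run_cells(cells, order):
    """Run cell sources in the given order against one shared namespace."""
    namespace = {}
    for i in order:
        exec(cells[i], namespace)
    return namespace


cells = [
    "prices = [10, 20, 30]",             # cell 0: load the "raw data"
    "prices = [p * 2 for p in prices]",  # cell 1: transform in place
    "total = sum(prices)",               # cell 2: summarize
]

# Running top to bottom gives the intended answer.
top_to_bottom = run_cells(cells, [0, 1, 2])["total"]
print(top_to_bottom)  # 120

# Re-running cell 1 before cell 2 silently doubles the data a second time.
rerun_cell_1 = run_cells(cells, [0, 1, 1, 2])["total"]
print(rerun_cell_1)  # 240
```

Restarting the kernel and running all cells from the top is the notebook equivalent of the first call, which is one reason to always start from the raw data.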

Worksheet - #2

The Data Science Computing Landscape

It’s a big (and growing!) world out there:

https://media.datacamp.com/legacy/image/upload/v1675362554/Marketing/Blog/The_MLOps_Tooling_Landscape.pdf

Languages
- Julia
  - Newcomer, designed for scientific computing.
  - Fast* and high-level
  - Good for modeling physical systems, generating simulation data, among other things
- R
  - Most popular language among statisticians
  - R vs Python is a popular battle
  - Can call R from Python, if needed
  - Nice visualization with the ggplot2 package
- Matlab
  - great linear algebra support, good visualization
  - decent number of packages
  - highly popular among engineers
  - GNU Octave is an open source alternative
  - Often used to generate simulation data
- Java, C, C++
  - Show up mostly in big data projects, distributed computing
- Mathematica / Wolfram Alpha
  - Good for numeric and symbolic mathematics
- Excel
  - Widely used
- Python
  - dominant language for data science
    - quick development
    - rich library support
    - general purpose
    - a great “glue” language to leverage lots of different libraries
  - slow?
    - sort of, but most of the time will be spent in calls to libraries that are highly optimized C, C++ or Fortran code
    - not a concern for most data science use cases
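
A small sketch of the "slow?" point above. It assumes CPython, where the built-in `sum` is implemented in C; it stands in here for optimized library calls such as NumPy's. `python_loop_sum` is a hypothetical name, and the timings will vary by machine, so none are shown.

```python
import timeit


def python_loop_sum(values):
    """Add up values one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(200_000))

# Both approaches compute the same result.
print(python_loop_sum(values) == sum(values))  # True

# Time each approach over a few repetitions.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"Python loop:  {loop_time:.4f} s")
print(f"built-in sum: {builtin_time:.4f} s")
```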
Useful Python Libraries
- NumPy - fast numerical operations (linear algebra, Fourier transform, etc.)
- pandas - manipulate, summarize, clean tabular structured data
  - pandas is built on numpy
- matplotlib
  - 2d visualizations
- seaborn
  - fancy data viz, built on matplotlib
- SciPy
  - Assorted scientific computing functionality, including signal processing, statistics, optimization, etc.
- scikit-learn
  - machine learning in Python (best for small-data applications)
- Natural Language Processing
  - NLTK
  - spaCy
- Deep learning: industrial-strength and/or research-grade machine learning; out of scope for this course
  - PyTorch
  - TensorFlow