Lecture 1 Notes

Announcements

Start of quarter survey on Canvas. Fill out by tonight.
Lab 1 Pre-Lab and Lab 1 are now available in only-slightly-unofficial-draft form. Canvas assignments haven’t been released, but the writeup is posted and I don’t anticipate major changes.
- Pre-lab is due by the start of your lab - submit on Canvas
- During Wednesday’s class you will have the chance to ask questions pertaining to the Pre-Lab, so Thursday labfolk may want to work on it before then.
- The Pre-Lab is to be done individually, while the Lab itself will be done in pairs.

Goals

Know about some different kinds of structured data, including tables, trees, graphs
Know the properties of some basic datatypes that we can use to represent data on computers
- signed and unsigned integers, floats, strings, “objects”
Know how to start a Jupyter server on the department’s JupyterHub.
Be able to navigate and work in Jupyter with
- Basic markdown syntax in text cells
- Python code in code cells
Get a basic understanding of the landscape of data science tools, and which ones we’ll focus on using in this course.

Notes / Agenda

Questions on the syllabus?

Properties of Data - Continued

Different kinds of structured data:

Tables
- gradebook
- event log
- Relational database
Graphs (not plots)
- maps (cartographic)
- social networks
(Plots)
Map (key-value pairs)
- potentially with nesting (e.g. JSON)
Tree
- org chart
- e.g. XML
Time series / stream
- temperature records
- video
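
A minimal sketch of how the kinds of structured data listed above might look as plain Python objects. All names and values here are made up for illustration; real data would usually come from files or databases.

```python
import json

# Hypothetical toy data; every name and value below is invented.

# Table: rows that all share the same fields (a tiny gradebook).
gradebook = [
    {"student": "Ada", "lab1": 92, "lab2": 88},
    {"student": "Grace", "lab1": 85, "lab2": 95},
]

# Graph: nodes plus edges, stored here as an adjacency list (a small social network).
friends = {
    "Ada": ["Grace", "Alan"],
    "Grace": ["Ada"],
    "Alan": ["Ada"],
}

# Map with nesting: key-value pairs whose values may be lists or other maps (JSON-like).
profile = {"name": "Ada", "courses": ["DATA311"], "contact": {"email": "ada@example.edu"}}
profile_json = json.dumps(profile)  # serialize to a JSON string

# Tree: every node has exactly one parent and any number of children (an org chart).
org_chart = {
    "title": "Chair",
    "reports": [
        {"title": "Instructor", "reports": []},
        {"title": "Lab Manager", "reports": []},
    ],
}

# Time series: (timestamp, value) pairs kept in time order (temperature records).
temperatures = [
    ("2024-01-01T00:00", 3.5),
    ("2024-01-01T01:00", 3.1),
    ("2024-01-01T02:00", 2.8),
]
```

Note that the graph allows cycles (Ada and Grace point at each other), while the tree does not.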

Worksheet - #1

Data Types:

str
integer
- signed
  - int32 (roughly -2 billion to +2 billion)
  - int64 (roughly -9 quintillion to +9 quintillion)
- unsigned
  - uint8 (0 - 255)
  - uint32 (roughly 0 to 4 billion)
  - uint64 (roughly 0 to 18 quintillion)
floating-point
- float32 - roughly 7 decimal digits of precision
- float64 - roughly 15 decimal digits of precision
- not exact, and mathematically equivalent calculations don’t necessarily give you the same answer!
  - 0.1 + 0.2 == 0.3 evaluates to False!
  - if you’re comparing for equality, avoid using ==
  - instead, check whether they’re closer than some tolerance
object - pandas dtype for columns that hold arbitrary Python objects, most often strings or mixed types
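
A small sketch, using only the Python standard library, of where the integer ranges above come from and how to compare floats with a tolerance. The helper names (`signed_range`, `unsigned_range`) are invented for this example; in practice these fixed-width types would usually come from a library such as NumPy.

```python
import math
import struct

def signed_range(bits):
    """Smallest and largest values of a two's-complement signed integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def unsigned_range(bits):
    """Smallest and largest values of an unsigned integer."""
    return 0, 2 ** bits - 1

int32_range = signed_range(32)     # (-2147483648, 2147483647)
int64_range = signed_range(64)     # about -9.2 to +9.2 quintillion
uint8_range = unsigned_range(8)    # (0, 255)
uint32_range = unsigned_range(32)  # (0, 4294967295)
uint64_range = unsigned_range(64)  # up to about 18.4 quintillion

# Python's built-in float is a float64.
exact_equal = (0.1 + 0.2 == 0.3)                    # False: rounding error
within_tolerance = abs((0.1 + 0.2) - 0.3) < 1e-9    # True: manual tolerance check
close_enough = math.isclose(0.1 + 0.2, 0.3)         # True: relative tolerance, default 1e-09

# Round-trip 0.1 through 4 bytes to see how much precision a float32 keeps.
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(as_float32)  # close to 0.1, but only to about 7 significant digits
```

The right tolerance depends on the scale of the numbers being compared; `1e-9` here is just an illustrative choice.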

JupyterHub

If off campus, connect to VPN
Go to https://csci-head.cluster.cs.wwu.edu/
For now, keep all the defaults in Server options, except select DATA311 from the “Environment” dropdown. (If it doesn’t exist, keep Default.) Press Start.
When you’re done, please shut your server down:
1. Go to File > Hub Control Panel
2. Press the red Stop My Server button.

JupyterLab

Things to point out:

file browser
Launcher
- Notebook
- Console
- Terminal
Jupyter Notebook basics
- Python cells
  - Contain Python code
  - You can run and re-run cells
  - State is maintained after running a cell
  - The value of the last line, if any, is displayed (not printed)
- Markdown cells
  - Allow you to intersperse formatted text with code.
  - Type your markdown syntax, then “run” the cell to see it rendered with formatting.
  - Basic markdown formatting
- Why Jupyter?
  - Interleaved display
    - Quick, interactive development cycle
    - Reproducibility. Cardinal rule of data science: Always start with the raw data.
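
A rough sketch of the Python cell behavior described above, written as one script with comments marking where each notebook cell would begin. The variable names and values are hypothetical.

```python
# Cell 1: define some state.
scores = [88, 92, 79]

# Cell 2: uses the variable from Cell 1; state persists between cells.
average = sum(scores) / len(scores)

# Cell 3: in a notebook, a cell that ends with a bare expression displays
# that expression's value, with no print() call needed.
# (In a plain script this line does nothing visible.)
average

# Cell 4: print() writes text output explicitly instead.
print(average)

# Cell 5: re-running a cell repeats its side effects. The second append
# below stands in for running this same cell a second time.
scores.append(95)
scores.append(95)
```

Because state accumulates like this, restarting the kernel and running every cell from the top, starting from the raw data, is a simple way to check that a notebook is reproducible.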

Worksheet - #2

The Data Science Computing Landscape

It’s a big (and growing!) world out there:

https://media.datacamp.com/legacy/image/upload/v1675362554/Marketing/Blog/The_MLOps_Tooling_Landscape.pdf

Languages
- Julia:
  - Newcomer, designed for scientific computing.
  - Fast* and high-level
  - Good for modeling physical systems, generating simulation data, among other things
- R
  - Most popular language among statisticians
  - R vs Python is a popular battle
  - Can call R from Python, if needed
  - Nice visualization with the ggplot2 package
- Matlab
  - great linear algebra support, good visualization
  - decent number of packages
  - highly popular among engineers
  - GNU Octave is an open source alternative
  - Often used to generate simulation data
- Java, C, C++
  - Show up mostly in big data projects, distributed computing
- Mathematica / Wolfram Alpha
  - Good for numeric and symbolic mathematics
- Excel
  - Widely used
- Python
  - dominant language for data science
    - quick development
    - rich library support
    - general purpose
    - a great “glue” language to leverage lots of different libraries
  - slow?
    - sort of, but most of the time will be spent in calls to libraries that are highly optimized C, C++ or Fortran code
    - not a concern for most data science use cases
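
A quick, hedged illustration of the "slow?" point, assuming the standard CPython interpreter: the same sum computed with a hand-written Python loop and with the built-in `sum`, which CPython implements in C. The function name `python_loop_sum` is made up for this sketch, and the exact timings will vary by machine.

```python
import timeit

values = list(range(1_000_000))

def python_loop_sum(numbers):
    """Add the numbers one at a time in interpreted Python code."""
    total = 0
    for n in numbers:
        total += n
    return total

# Time three runs of each approach.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=3)
builtin_seconds = timeit.timeit(lambda: sum(values), number=3)

print(f"Python loop:  {loop_seconds:.3f} s")
print(f"built-in sum: {builtin_seconds:.3f} s")
```

Both approaches give the same answer; the built-in is usually several times faster because the loop itself runs in compiled code. Libraries such as NumPy apply the same idea to whole arrays.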
Useful Python Libraries
- NumPy - fast numerical operations (linear algebra, Fourier transforms, etc.)
- pandas - manipulate, summarize, clean tabular structured data
  - pandas is built on NumPy
- matplotlib
  - 2d visualizations
- seaborn
  - fancy data viz, built on matplotlib
- SciPy
  - Assorted scientific computing functionality, including signal processing, statistics, optimization, etc.
- scikit-learn
  - machine learning in Python (best for small-data applications)
- Natural Language Processing
  - NLTK
  - spaCy
- Deep learning: industrial-strength and/or research-grade machine learning; out of scope for this course
  - PyTorch
  - TensorFlow
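
A minimal sketch of pandas and NumPy working together, assuming both libraries are installed (for example in the DATA311 environment). The gradebook is hypothetical, and filling a missing score with 0 is an arbitrary choice made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical gradebook with one missing lab score.
grades = pd.DataFrame({
    "student": ["Ada", "Grace", "Alan"],
    "lab1": [92, 85, np.nan],
    "lab2": [88, 95, 90],
})

# Inspect the column types pandas inferred.
print(grades.dtypes)

# Clean: replace the missing lab1 score with 0 (an assumption, not a rule).
cleaned = grades.fillna({"lab1": 0})

# Summarize: the mean of each lab column.
lab_means = cleaned[["lab1", "lab2"]].mean()

# pandas is built on NumPy: pull out the underlying array and sum each row.
lab_array = cleaned[["lab1", "lab2"]].to_numpy()
row_totals = lab_array.sum(axis=1)
print(row_totals)
```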