Announcements
- Start of quarter survey on Canvas. Fill out by tonight.
- Lab 1 Pre-Lab and Lab 1 are now available in
only-slightly-unofficial-draft form. Canvas assignments haven’t been
released, but the writeup is posted and I don’t anticipate major
changes.
- The Pre-Lab is due by the start of your lab; submit it on Canvas.
- During Wednesday’s class you will have the chance to ask questions
pertaining to the Pre-Lab, so Thursday labfolk may want to work on it
before then.
- The Pre-Lab is to be done individually, while the Lab itself will be
done in pairs.
Goals
- Know about some different kinds of structured data, including
tables, trees, graphs
- Know the properties of some basic datatypes that we can use to
represent data on computers
- signed and unsigned integers, floats, strings, “objects”
- Know how to start a Jupyter server on the department’s
JupyterHub.
- Be able to navigate and work in Jupyter with
- Basic markdown syntax in text cells
- Python code in code cells
- Get a basic understanding of the landscape of data science tools,
and which ones we’ll focus on using in this course.
Notes / Agenda
Questions on the syllabus?
Properties of Data - Continued
Different kinds of structured data:
- Tables
- gradebook
- event log
- Relational database
- Graphs (not plots)
- maps (cartographic)
- social networks
- (Plots)
- Map (key-value pairs)
- potentially with nesting (e.g. JSON)
- Tree
- Time series / stream
- temperature records
- video
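As a rough illustration of the kinds of structured data listed above, here is a minimal sketch using only built-in Python containers. All names and values (`gradebook`, `friends`, `record`, `tree`, `temps`) are made up for illustration; real projects would more likely use dedicated libraries for tables and graphs.

```python
import json

# Table: every row has the same columns (a tiny, made-up gradebook).
gradebook = [
    {"student": "A", "lab1": 9, "lab2": 7},
    {"student": "B", "lab1": 8, "lab2": 10},
]

# Graph (not a plot): nodes plus edges between them, stored here as an
# adjacency list for a small, made-up social network.
friends = {
    "ana": ["ben", "cy"],
    "ben": ["ana"],
    "cy": ["ana"],
}

# Map (key-value pairs) with nesting, parsed from a JSON string.
record = json.loads('{"name": "ana", "tags": ["x", "y"], "address": {"city": "Z"}}')

# Tree: each node has a label and a list of child nodes.
tree = {
    "label": "root",
    "children": [
        {"label": "left", "children": []},
        {"label": "right", "children": []},
    ],
}

def count_nodes(node):
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node["children"])

# Time series: values ordered by timestamp (made-up temperature records).
temps = [
    ("2024-01-01T00:00", 3.5),
    ("2024-01-01T01:00", 3.1),
    ("2024-01-01T02:00", 2.8),
]
```

Note that the tree is itself a nested map; what makes it a tree is the parent-child convention, not the container type.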
Worksheet - #1
Data Types:
- str
- integer
- signed
- int32 (roughly -2 billion to +2 billion)
- int64 (roughly -9 quintillion to +9 quintillion)
- unsigned
- uint8 (0 to 255)
- uint32 (roughly 0 to 4 billion)
- uint64 (roughly 0 to 18 quintillion)
- floating-point
- float32: roughly 7 decimal digits of precision
- float64: roughly 15 decimal digits of precision
- not exact, and mathematically equivalent calculations don’t
necessarily give you the same answer! For example,
0.1 + 0.2 == 0.3 evaluates to False.
- if you’re comparing floats for equality, avoid using ==; instead,
check whether the two values are closer than some tolerance
- object
- pandas dtype that usually wraps columns of
strings and other mixed types
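The ranges and precision figures above can be checked with a short sketch that uses only the standard library. The helpers `int_range` and `to_float32` are hypothetical names introduced here for illustration; in practice NumPy exposes the same information, but this version assumes nothing beyond plain Python.

```python
import math
import struct
import sys

def int_range(bits, signed):
    """Return (min, max) for a fixed-width integer with the given bit count."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

int32_range = int_range(32, signed=True)     # about -2 billion to +2 billion
uint8_range = int_range(8, signed=False)     # 0 to 255
uint64_range = int_range(64, signed=False)   # 0 to about 18 quintillion

# Python's built-in float is a float64; `dig` is the number of decimal
# digits it can reliably represent.
float64_digits = sys.float_info.dig

def to_float32(x):
    """Round a Python float to the nearest float32 value and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Agrees with math.pi only to roughly 7 significant digits.
pi32 = to_float32(math.pi)

# Exact comparison fails because 0.1, 0.2 and 0.3 are not exactly
# representable in binary floating point.
exact_equal = (0.1 + 0.2 == 0.3)

# Tolerance-based comparison: are the two values "close enough"?
close_enough = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

print(int32_range, uint8_range, uint64_range)
print(float64_digits, pi32, exact_equal, close_enough)
```

The right tolerance depends on the scale of your data and how much error has accumulated; `rel_tol=1e-9` is only a reasonable default for float64 values, not a universal choice.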
JupyterHub
If off campus, connect to VPN
Go to https://csci-head.cluster.cs.wwu.edu/
For now, keep all the defaults in Server options, except select
DATA311 from the “Environment” dropdown. (If it doesn’t exist, keep
Default). Press Start.
When you’re done, please shut your server down:
- Go to File > Hub Control Panel
- Press the red Stop My Server button.
JupyterLab
Things to point out:
file browser
Launcher
- Notebook
- Console
- Terminal
Jupyter Notebook basics
- Python cells
Contain Python code
You can run and re-run cells
State is maintained after running a cell
The value of the last line, if any, is displayed (not
printed)
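To make the "displayed (not printed)" distinction concrete, here is a small sketch written as plain Python. It assumes a plain string value, for which the notebook shows the `repr` of the last line; other objects (for example tables) may use richer display formats. The variable names are made up for illustration.

```python
text = "line one\nline two"

# What a notebook shows when the last line of a cell is just `text`:
# the repr, with quotes and the newline written as an escape sequence.
displayed = repr(text)

# What print(text) writes: the string's contents, with a real line break.
printed = str(text)

print(displayed)
print(printed)

# State is maintained between runs: in a notebook, re-running a cell
# containing only `counter += 1` would keep incrementing the same variable.
counter = 0
counter += 1
counter += 1  # stands in for running the same cell a second time
```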
Markdown cells:
- Allow you to intersperse formatted text with code.
- Type your markdown syntax, then “run” the cell to see it rendered
with formatting.
- Basic markdown formatting
- Why jupyter?
- Interleaved display
- Quick, interactive development cycle
- Reproducibility. Cardinal rule of data science: Always start
with the raw data.
Worksheet - #2
The Data Science Computing Landscape
It’s a big (and growing!) world out there:
https://media.datacamp.com/legacy/image/upload/v1675362554/Marketing/Blog/The_MLOps_Tooling_Landscape.pdf
- Languages
- Julia:
- Newcomer, designed for scientific computing.
- Fast* and high-level
- Good for modeling physical systems, generating simulation data,
among other things
- R
- Most popular language among statisticians
- R vs Python is a popular battle
- Can call R from Python, if needed
- Nice visualization with the ggplot2 package
- Matlab
- great linear algebra support, good visualization
- decent number of packages
- highly popular among engineers
- GNU Octave is an open source alternative
- Often used to generate simulation data
- Java, C, C++
- Show up mostly in big data projects, distributed computing
- Mathematica / Wolfram Alpha
- Good for numeric and symbolic mathematics
- Excel
- Python
- dominant language for data science
- quick development
- rich library support
- general purpose
- a great “glue” language to leverage lots of different libraries
- slow?
- sort of, but most of the time will be spent in calls to libraries
that are highly optimized C, C++ or Fortran code
- not a concern for most data science use cases
- Useful Python Libraries
- NumPy - fast numerical operations (linear algebra, Fourier
transform, etc.)
- pandas - manipulate, summarize, clean tabular structured data
- matplotlib
- seaborn
- fancy data viz, built on matplotlib
- SciPy
- Assorted scientific computing functionality, including signal
processing, statistics, optimization, etc.
- scikit-learn
- machine learning in Python (best for small-data applications)
- Natural Language Processing
- Deep learning: industrial-strength and/or research-grade machine
learning; out of scope for this course
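Returning to the "slow?" point about Python above, the following sketch compares a hand-written Python loop against the built-in `sum`, which is implemented in C in the standard CPython interpreter. Here `sum` is only a stand-in for the optimized library calls mentioned in the notes; the function name `python_loop_sum` is made up, and the timings will vary from machine to machine.

```python
import time

def python_loop_sum(values):
    """Add up the values one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total

values = list(range(1_000_000))

# Time the pure-Python loop.
start = time.perf_counter()
loop_result = python_loop_sum(values)
loop_seconds = time.perf_counter() - start

# Time the built-in, which does the same work inside compiled code.
start = time.perf_counter()
builtin_result = sum(values)
builtin_seconds = time.perf_counter() - start

print(f"loop:    {loop_seconds:.4f} s")
print(f"builtin: {builtin_seconds:.4f} s")
```

Both approaches return the same answer; the difference is where the time is spent, which is why pushing heavy work into library calls usually matters more than the speed of the Python code around them.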