Announcements
- Start of quarter survey on Canvas. Fill out by tonight.
- Please fill out Nick’s office hours scheduling poll.
- Lab 1 Pre-Lab and Lab 1 are available.
- The Pre-Lab is due by the start of your lab; submit it on Canvas.
- During Wednesday’s class you will have the chance to ask questions
pertaining to the Pre-Lab
- The Pre-Lab is to be done individually, while the Lab itself will be
done in pairs.
Goals
Understand what is meant by a few properties of data:
- Structured vs unstructured
- Numerical vs categorical
- Big vs small
Know about some different kinds of structured data, including
tables, trees, graphs
Know the properties of some basic datatypes that we can use to
represent data on computers
- signed and unsigned integers, floats, strings, “objects”
Know how to start a Jupyter server on the department’s
JupyterHub.
Be able to navigate and work in Jupyter notebooks with
- Basic markdown syntax in text cells
- Python code in code cells
Get a basic understanding of the landscape of data science tools,
and which ones we’ll focus on using in this course.
Notes / Agenda
Questions on the syllabus?
Properties of Data
Properties of data
- structured/unstructured
- numerical/categorical
- big/small
Different kinds of structured data:
- Tables
- gradebook
- event log
- Relational database
- Graphs (not plots)
- maps (cartographic)
- social networks
- (Plots)
- Map (key-value pairs)
- potentially with nesting (e.g. JSON)
- Tree
- Time series / stream
- temperature records
- video
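As a rough illustration of the kinds of structured data listed above, here is a minimal sketch using only built-in Python containers. All names and values (students, scores, readings, the section number) are made up for illustration; in practice you would more likely reach for pandas or a dedicated graph library.

```python
import json

# Table: every row has the same columns (a tiny, made-up gradebook).
gradebook = [
    {"student": "Ada", "lab1": 9.5, "lab2": 8.0},
    {"student": "Grace", "lab1": 10.0, "lab2": 9.0},
]

# Graph: nodes connected by edges (a made-up social network),
# stored as an adjacency list.
friends = {
    "Ada": ["Grace", "Alan"],
    "Grace": ["Ada"],
    "Alan": ["Ada"],
}

# Map (key-value pairs) with nesting, as in JSON.
record_json = '{"name": "Ada", "courses": {"DATA311": {"section": 1}}}'
record = json.loads(record_json)

# Tree: each node has a value and a list of children.
tree = {"value": "root", "children": [
    {"value": "left", "children": []},
    {"value": "right", "children": [{"value": "leaf", "children": []}]},
]}


def count_nodes(node):
    """Count the nodes in a tree by visiting each child recursively."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# Time series: (timestamp, value) pairs in time order (made-up temperatures).
temperatures = [("2024-01-01T00:00", 3.5), ("2024-01-01T01:00", 3.1)]

mean_lab1 = sum(row["lab1"] for row in gradebook) / len(gradebook)
print(mean_lab1)                                 # 9.75
print(len(friends["Ada"]))                       # 2
print(record["courses"]["DATA311"]["section"])   # 1
print(count_nodes(tree))                         # 4
```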
Worksheet - #1
Data Types:
- str
- integer
- signed
- int32 (roughly -2 billion to +2 billion)
- int64 (roughly -9 quintillion to +9 quintillion)
- unsigned
- uint8 (0 to 255)
- uint32 (roughly 0 to 4 billion)
- uint64 (roughly 0 to 18 quintillion)
- floating-point
- float32 - roughly 7 decimal digits of precision
- float64 - roughly 15 decimal digits of precision
- Representation, by example for float32:
- 1 sign bit (\(S\)), 8 exponent bits (\(E\)), 23 mantissa bits (\(M\))
- value is similar to scientific notation, but in base 2: \((-1)^S \times 1.M \times 2^{(E-127)}\)
- not exact, and mathematically equivalent calculations don’t
necessarily give you the same answer!
- 0.1 + 0.2 == 0.3 evaluates to False!
- if you’re comparing floats for equality, avoid using ==
- instead, check whether they’re closer than some tolerance
- object - pandas type that usually wraps columns of
strings and other mixed types
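The sketch below uses only the Python standard library to check a few of the claims in this list. The helper names (`int_range`, `decode_float32`, `float32_value`) are made up for illustration. The decoding follows the sign/exponent/mantissa layout described above, where a normalized float32 has the value (-1)^S × 1.M × 2^(E-127); it ignores special cases such as zero, infinities, and NaN.

```python
import math
import struct


def int_range(bits, signed):
    """Return the (min, max) values for an integer with the given bit width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


print(int_range(32, signed=True))    # (-2147483648, 2147483647)
print(int_range(8, signed=False))    # (0, 255)


def decode_float32(x):
    """Split x (stored as a float32) into sign, exponent, and mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa


def float32_value(sign, exponent, mantissa):
    """Rebuild a normalized float32 value: (-1)^S * 1.M * 2^(E - 127)."""
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


s, e, m = decode_float32(-6.5)
print(s, e, m)                       # 1 129 5242880
print(float32_value(s, e, m))        # -6.5

# Floating-point results are not exact, so compare with a tolerance.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```

`math.isclose` uses a relative tolerance by default; when working with arrays, NumPy offers a similar tolerance-based comparison.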
JupyterHub
Also see the full
guide. Here are the basic steps:
If off campus, connect to VPN
Go to https://csci-head.cluster.cs.wwu.edu/
For now, keep all the defaults in Server options, except select
DATA311 from the “Environment” dropdown. Press Start.
When you’re done, please shut your server down:
- Go to File > Hub Control Panel
- Press the red Stop My Server button.
JupyterLab
Things to point out:
Worksheet - #2
The Data Science Computing Landscape
It’s a big (and growing!) world out there:
https://media.datacamp.com/legacy/image/upload/v1675362554/Marketing/Blog/The_MLOps_Tooling_Landscape.pdf
- Languages
- Julia:
- Newcomer, designed for scientific computing.
- Fast* and high-level
- Good for modeling physical systems, generating simulation data,
among other things
- R
- Most popular language among statisticians
- R vs Python is a popular battle
- Can call R from Python, if needed
- Nice visualization with the ggplot2 package
- Matlab
- great linear algebra support, good visualization
- decent number of packages
- highly popular among engineers
- GNU Octave is an open source alternative
- Often used to generate simulation data
- Java, C, C++
- Show up mostly in big data projects, distributed computing
- Mathematica / Wolfram Alpha
- Good for numeric and symbolic mathematics
- Excel
- Python
- dominant language for data science
- quick development
- rich library support
- general purpose
- a great “glue” language to leverage lots of different libraries
- slow?
- sort of, but most of the time will be spent in calls to libraries
that are highly optimized C, C++ or Fortran code
- not a concern for most data science use cases
- Useful Python Libraries
- NumPy - fast numerical operations (linear algebra, Fourier
transform, etc.)
- pandas - manipulate, summarize, clean tabular structured data
- matplotlib
- seaborn
- fancy data viz, built on matplotlib
- SciPy
- Assorted scientific computing functionality, including signal
processing, statistics, optimization, etc.
- scikit-learn
- machine learning in Python (best for small-data applications)
- Natural Language Processing
- Deep learning: industrial-strength and/or research-grade machine
learning; out of scope for this course
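To make the "slow?" point about Python concrete, here is a small, dependency-free sketch comparing a hand-written Python loop with the built-in `sum`, which is implemented in C in the standard CPython interpreter. Using `sum` as a stand-in for an optimized library such as NumPy is an assumption made to keep the example self-contained; the list size and repeat count are arbitrary, and the timings will vary by machine.

```python
import timeit

values = list(range(100_000))


def python_loop_sum(xs):
    """Add the numbers one at a time in interpreted Python."""
    total = 0
    for x in xs:
        total += x
    return total


# Both approaches compute the same answer.
loop_result = python_loop_sum(values)
builtin_result = sum(values)

# Time each approach over 20 repetitions.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(loop_result == builtin_result)
print(f"Python loop: {loop_time:.4f} s, built-in sum: {builtin_time:.4f} s")
```

On most machines the built-in call is noticeably faster, which is the same reason code that spends its time inside NumPy or pandas calls is usually fast enough.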