Announcements
- Start of quarter survey - thanks!
- Lots of data science premajors (yay!)
- Other related majors here out of interest (awesome!)
- Lots of cool hobbies!
- Office hours are posted on the course webpage:
- Scott (in CF 471):
- Monday 3-4
- Thursday 10:30-11:30
- Friday 1-1:50
- and by appointment
- Nick (in CF 163):
- Lab 1 Pre-Lab is due by the start of your lab - individual,
PDF submission on Canvas.
- We’ll be working with numpy today, so be sure to ask about any
questions that have come up so far!
- Lab 1 is either Wednesday or Thursday, depending on your
section.
Goals
- Get a basic understanding of the landscape of data science tools,
and which ones we’ll focus on using in this course.
- Get an appreciation for the efficiency gains to be had by using
numpy over native Python
- Know the basics of creation and manipulation of numerical arrays
with numpy.
- Know how to work with multidimensional arrays, including indexing,
slicing, operations along axes, and boolean indexing/masking
- Know the basics of how multidimensional array operations are
broadcast across singleton dimensions.
Notes / Agenda
The Data Science Computing Landscape
It’s a big (and growing!) world out there:
https://media.datacamp.com/legacy/image/upload/v1675362554/Marketing/Blog/The_MLOps_Tooling_Landscape.pdf
- Languages
- Julia:
- Newcomer, designed for scientific computing.
- Fast* and high-level
- Good for modeling physical systems, generating simulation data,
among other things
- R
- Most popular language among statisticians
- R vs Python is a popular battle
- Can call R from Python, if needed
- Nice visualization with the ggplot2 package
- Matlab
- great linear algebra support, good visualization
- decent number of packages
- highly popular among engineers
- GNU Octave is an open source alternative
- Often used to generate simulation data
- Java, C, C++
- Show up mostly in big data projects, distributed computing
- Mathematica / Wolfram Alpha
- Good for numeric and symbolic mathematics
- Excel
- Python
- dominant language for data science
- quick development
- rich library support
- general purpose
- a great “glue” language to leverage lots of different libraries
- slow?
- sort of, but most of the time will be spent in calls to libraries
that are highly optimized C, C++ or Fortran code
- not a concern for most data science use cases
- Useful Python Libraries
- NumPy - fast numerical operations (linear algebra, Fourier
transforms, etc.)
- pandas - manipulate, summarize, clean tabular structured data
- matplotlib
- seaborn
- fancy data viz, built on matplotlib
- SciPy
- Assorted scientific computing functionality, including signal
processing, statistics, optimization, etc.
- scikit-learn
- machine learning in Python (best for small-data applications)
- Natural Language Processing
- Deep learning: industrial-strength and/or research-grade machine
learning; out of scope for this course
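The "slow?" point above can be checked with a rough timing sketch. This is only an illustration, not a benchmark: the array size, the sum-of-squares task, and the function names are assumptions chosen for the example, and the actual timings will vary by machine.

```python
import timeit

import numpy as np

n = 1_000_000
values = list(range(n))               # plain Python list
arr = np.arange(n, dtype=np.int64)    # the same numbers as a NumPy array


def python_sum_of_squares():
    # The loop runs in the Python interpreter, one element at a time.
    return sum(v * v for v in values)


def numpy_sum_of_squares():
    # The loop runs inside NumPy's compiled code.
    return int(np.sum(arr * arr))


# Both approaches compute the same number.
assert python_sum_of_squares() == numpy_sum_of_squares()

t_python = timeit.timeit(python_sum_of_squares, number=3)
t_numpy = timeit.timeit(numpy_sum_of_squares, number=3)
print(f"pure Python: {t_python:.3f} s, NumPy: {t_numpy:.3f} s")
```

On typical hardware the NumPy version should be noticeably faster, because the per-element work happens in optimized compiled code rather than in the interpreter.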
[Live Jupyter Demo]
Creating arrays
- array, zeros, ones, *_like
- Python list-like slicing basics
- elementwise operations
- array/scalar, array/array
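A minimal sketch of the topics above, for reference after the live demo. The variable names and values are made up for illustration; the live demo may use different ones.

```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])   # from a Python list
z = np.zeros(5)                 # five 0.0 values (float by default)
o = np.ones(5)                  # five 1.0 values (float by default)
z_like = np.zeros_like(a)       # zeros with the same shape and dtype as a

# List-like slicing: start:stop:step, just like Python lists
first_three = a[:3]             # [1 2 3]
every_other = a[::2]            # [1 3 5]
last = a[-1]                    # 5

# Elementwise operations
doubled = a * 2                 # array/scalar: [ 2  4  6  8 10]
summed = a + o                  # array/array:  [2. 3. 4. 5. 6.]

print(first_three, every_other, last)
print(doubled, summed)
```

Note that `a * 2` multiplies every element, whereas `[1, 2, 3, 4, 5] * 2` on a Python list repeats the list.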
Exercise 1 - in pairs
Multidimensional arrays
- 2D arrays
- dimensions (a.shape, a.ndim)
- Elementwise operations (array/scalar, array/array)
- Indexing, slicing, boolean indexing/masking
- A few useful functions:
- transpose, reshape
- sum, mean, max, min; axis kwarg
- Broadcasting (basic example)
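The multidimensional topics above can be sketched as follows. This is an illustrative example with made-up data (a 3x4 array of the integers 0 through 11), not necessarily what the live demo uses; the `keepdims` argument is an extra not listed above, included to show a singleton dimension.

```python
import numpy as np

# A 2D array with 3 rows and 4 columns:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
m = np.arange(12).reshape(3, 4)
print(m.shape, m.ndim)            # (3, 4) 2

# Indexing and slicing: one index or slice per dimension
element = m[1, 2]                 # 6
first_row = m[0]                  # [0 1 2 3]
second_col = m[:, 1]              # [1 5 9]
block = m[:2, 1:3]                # [[1 2], [5 6]]

# Boolean indexing/masking
mask = m % 2 == 0                 # True where the entry is even
evens = m[mask]                   # [ 0  2  4  6  8 10]

# transpose and the axis kwarg
t = m.T                           # shape (4, 3)
total = m.sum()                   # 66
col_sums = m.sum(axis=0)          # collapse rows:    [12 15 18 21]
row_means = m.mean(axis=1)        # collapse columns: [1.5 5.5 9.5]

# Broadcasting: shape (3, 4) combined with shape (4,)
# The column means are subtracted from every row.
col_centered = m - m.mean(axis=0)

# Broadcasting across a singleton dimension: (3, 4) with (3, 1)
# keepdims=True keeps the collapsed axis as a length-1 dimension.
row_centered = m - m.mean(axis=1, keepdims=True)

print(col_centered)
print(row_centered)
```

A useful check when working with `axis`: the axis you name is the one that disappears from the shape of the result.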
Exercise 2 - in pairs