Lecture 2 Notes

Announcements

Start of quarter survey - thanks!
- Lots of data science premajors (yay!)
- Other related majors here out of interest (awesome!)
- Lots of cool hobbies!
Office hours are posted on the course webpage:
- Scott (in CF 471):
  - Monday 3-4
  - Thursday 10:30-11:30
  - Friday 1-1:50
  - and by appointment
- Nick (in CF 163):
  - Friday 3-4
Lab 1 Pre-Lab is due by the start of your lab - individual, PDF submission on Canvas.
- We’ll be working with numpy today, so be sure to ask any questions that have come up so far!
Lab 1 is either Wednesday or Thursday, depending on your section.

Goals

Get a basic understanding of the landscape of data science tools, and which ones we’ll focus on using in this course.
Get an appreciation for the efficiency gains to be had by using numpy over native Python.
Know the basics of creation and manipulation of numerical arrays with numpy.
Know how to work with multidimensional arrays, including indexing, slicing, operations along axes, and boolean indexing/masking.
Know the basics of how multidimensional array operations are broadcast across singleton dimensions.

Notes / Agenda

The Data Science Computing Landscape

It’s a big (and growing!) world out there:

https://media.datacamp.com/legacy/image/upload/v1675362554/Marketing/Blog/The_MLOps_Tooling_Landscape.pdf

Languages
- Julia:
  - Newcomer, designed for scientific computing.
  - Fast* and high-level
  - Good for modeling physical systems, generating simulation data, among other things
- R
  - Most popular language among statisticians
  - R vs Python is a popular battle
  - Can call R from Python, if needed
  - Nice visualization with the ggplot2 package
- Matlab
  - great linear algebra support, good visualization
  - decent number of packages
  - highly popular among engineers
  - GNU Octave is an open source alternative
  - Often used to generate simulation data
- Java, C, C++
  - Show up mostly in big data projects, distributed computing
- Mathematica / Wolfram Alpha
  - Good for numeric and symbolic mathematics
- Excel
  - Widely used
- Python
  - dominant language for data science
    - quick development
    - rich library support
    - general purpose
    - a great “glue” language to leverage lots of different libraries
  - slow?
    - sort of, but most of the run time is typically spent in calls to libraries written in highly optimized C, C++, or Fortran
    - not a concern for most data science use cases
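
To make the "slow?" point concrete, here is a minimal sketch comparing a pure-Python loop with the equivalent numpy call. The function names and the array size are made up for illustration, and the timings will vary by machine, so treat the printed numbers as a rough indication only.

```python
import timeit

import numpy as np

n = 200_000  # arbitrary size, chosen only for illustration
values = list(range(n))
arr = np.arange(n, dtype=np.int64)


def python_sum_of_squares(xs):
    """Sum of squares with an explicit Python loop."""
    total = 0
    for x in xs:
        total += x * x
    return total


def numpy_sum_of_squares(a):
    """Same computation; the loop runs inside numpy's compiled code."""
    return int(np.sum(a * a))


py_result = python_sum_of_squares(values)
np_result = numpy_sum_of_squares(arr)

# Time a few repetitions of each version.
py_time = timeit.timeit(lambda: python_sum_of_squares(values), number=3)
np_time = timeit.timeit(lambda: numpy_sum_of_squares(arr), number=3)

print(f"pure Python: {py_time:.4f} s")
print(f"numpy:       {np_time:.4f} s")
```

Both versions compute the same value; the numpy version is usually much faster because the per-element work happens in compiled code rather than in the Python interpreter.
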
Useful Python Libraries
- NumPy - fast numerical operations (linear algebra, Fourier transforms, etc.)
- pandas - manipulate, summarize, clean tabular structured data
  - pandas is built on numpy
- matplotlib
  - 2d visualizations
- seaborn
  - fancy data viz, built on matplotlib
- SciPy
  - Assorted scientific computing functionality, including signal processing, statistics, optimization, etc.
- scikit-learn
  - machine learning in Python (best for small-data applications)
- Natural Language Processing
  - NLTK
  - spaCy
- Deep learning: industrial-strength and/or research-grade machine learning; out of scope for this course
  - PyTorch
  - TensorFlow

`numpy`: our favorite tool for working with raw numerical data

[Live Jupyter Demo]

Creating arrays

array, zeros, ones, *_like
- dtype argument
Python list-like slicing basics
elementwise operations
- array/scalar, array/array
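
The topics above can be sketched in a few lines. This is not the live demo itself, just a small stand-in; the variable names and values are chosen only for illustration.

```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4])           # from a Python list
z = np.zeros(4)                      # floats (float64) by default
o = np.ones(4, dtype=np.int32)       # dtype argument picks the element type
z_like = np.zeros_like(a)            # same shape and dtype as a

# Python list-like slicing
first_two = a[:2]       # elements at index 0 and 1
every_other = a[::2]    # every second element
reversed_a = a[::-1]    # reversed order

# Elementwise operations
scaled = a * 10         # array/scalar: each element times 10
summed = a + o          # array/array: elements paired up by position

print(scaled)
print(summed)
```

Unlike Python lists, `a * 10` multiplies each element rather than repeating the list, and `a + o` adds element by element rather than concatenating.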

Exercise 1 - in pairs

Multidimensional arrays

2D arrays
dimensions (a.shape, a.ndim)
Elementwise operations (array/scalar, array/array)
Indexing, slicing, boolean indexing/masking
A few useful functions:
- transpose, reshape
- sum, mean, max, min; axis kwarg
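
A short sketch of the 2D topics listed so far, using a small 3x4 array made up for illustration:

```python
import numpy as np

# 3 rows, 4 columns: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
m = np.arange(12).reshape(3, 4)

print(m.shape)   # (3, 4)
print(m.ndim)    # 2

# Indexing and slicing
element = m[1, 2]        # row 1, column 2 -> 6
first_row = m[0]         # [0, 1, 2, 3]
second_col = m[:, 1]     # [1, 5, 9]
block = m[:2, 1:3]       # [[1, 2], [5, 6]]

# Boolean indexing/masking
mask = m > 6             # boolean array with the same shape as m
big = m[mask]            # 1D array of the selected elements: [7, 8, 9, 10, 11]

# transpose and reductions along an axis
t = m.T                      # shape (4, 3)
total = m.sum()              # 66, over all elements
col_sums = m.sum(axis=0)     # collapse the rows: [12, 15, 18, 21]
row_sums = m.sum(axis=1)     # collapse the columns: [6, 22, 38]
row_means = m.mean(axis=1)   # [1.5, 5.5, 9.5]
col_max = m.max(axis=0)      # [8, 9, 10, 11]
```

One way to remember the axis kwarg: the axis you name is the one that disappears from the result's shape.
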
Broadcasting (basic example)
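
A possible basic broadcasting example, with small arrays made up for illustration. Dimensions of size 1 (and missing leading dimensions) are stretched to match the other operand.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)    # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])      # shape (3,)
col = np.array([[100], [200]])    # shape (2, 1)

# (2, 3) + (3,): row is added to every row of m
plus_row = m + row    # [[10, 21, 32], [13, 24, 35]]

# (2, 3) + (2, 1): col is added to every column of m
plus_col = m + col    # [[100, 101, 102], [203, 204, 205]]

# (2, 1) + (3,): both operands are stretched, giving shape (2, 3)
grid = col + row      # [[110, 120, 130], [210, 220, 230]]

# keepdims=True keeps the reduced axis as size 1, so the row means
# have shape (2, 1) and broadcast back against m
centered = m - m.mean(axis=1, keepdims=True)   # [[-1, 0, 1], [-1, 0, 1]]
```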

Exercise 2 - in pairs
