import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
penguins = sns.load_dataset("penguins")
sns.relplot(data=penguins, x="flipper_length_mm", y="bill_length_mm")
X = penguins[["flipper_length_mm", "bill_length_mm"]].to_numpy()
X.shape
(344, 2)
X
array([[181. , 39.1], [186. , 39.5], [195. , 40.3], [ nan, nan], [193. , 36.7], [190. , 39.3], [181. , 38.9], [195. , 39.2], [193. , 34.1], [190. , 42. ], [186. , 37.8], [180. , 37.8], [182. , 41.1], [191. , 38.6], [198. , 34.6], [185. , 36.6], [195. , 38.7], [197. , 42.5], [184. , 34.4], [194. , 46. ], [174. , 37.8], [180. , 37.7], [189. , 35.9], [185. , 38.2], [180. , 38.8], [187. , 35.3], [183. , 40.6], [187. , 40.5], [172. , 37.9], [180. , 40.5], [178. , 39.5], [178. , 37.2], [188. , 39.5], [184. , 40.9], [195. , 36.4], [196. , 39.2], [190. , 38.8], [180. , 42.2], [181. , 37.6], [184. , 39.8], [182. , 36.5], [195. , 40.8], [186. , 36. ], [196. , 44.1], [185. , 37. ], [190. , 39.6], [182. , 41.1], [179. , 37.5], [190. , 36. ], [191. , 42.3], [186. , 39.6], [188. , 40.1], [190. , 35. ], [200. , 42. ], [187. , 34.5], [191. , 41.4], [186. , 39. ], [193. , 40.6], [181. , 36.5], [194. , 37.6], [185. , 35.7], [195. , 41.3], [185. , 37.6], [192. , 41.1], [184. , 36.4], [192. , 41.6], [195. , 35.5], [188. , 41.1], [190. , 35.9], [198. , 41.8], [190. , 33.5], [190. , 39.7], [196. , 39.6], [197. , 45.8], [190. , 35.5], [195. , 42.8], [191. , 40.9], [184. , 37.2], [187. , 36.2], [195. , 42.1], [189. , 34.6], [196. , 42.9], [187. , 36.7], [193. , 35.1], [191. , 37.3], [194. , 41.3], [190. , 36.3], [189. , 36.9], [189. , 38.3], [190. , 38.9], [202. , 35.7], [205. , 41.1], [185. , 34. ], [186. , 39.6], [187. , 36.2], [208. , 40.8], [190. , 38.1], [196. , 40.3], [178. , 33.1], [192. , 43.2], [192. , 35. ], [203. , 41. ], [183. , 37.7], [190. , 37.8], [193. , 37.9], [184. , 39.7], [199. , 38.6], [190. , 38.2], [181. , 38.1], [197. , 43.2], [198. , 38.1], [191. , 45.6], [193. , 39.7], [197. , 42.2], [191. , 39.6], [196. , 42.7], [188. , 38.6], [199. , 37.3], [189. , 35.7], [189. , 41.1], [187. , 36.2], [198. , 37.7], [176. , 40.2], [202. , 41.4], [186. , 35.2], [199. , 40.6], [191. , 38.8], [195. , 41.5], [191. , 39. ], [210. , 44.1], [190. , 38.5], [197. , 43.1], [193. 
, 36.8], [199. , 37.5], [187. , 38.1], [190. , 41.1], [191. , 35.6], [200. , 40.2], [185. , 37. ], [193. , 39.7], [193. , 40.2], [187. , 40.6], [188. , 32.1], [190. , 40.7], [192. , 37.3], [185. , 39. ], [190. , 39.2], [184. , 36.6], [195. , 36. ], [193. , 37.8], [187. , 36. ], [201. , 41.5], [192. , 46.5], [196. , 50. ], [193. , 51.3], [188. , 45.4], [197. , 52.7], [198. , 45.2], [178. , 46.1], [197. , 51.3], [195. , 46. ], [198. , 51.3], [193. , 46.6], [194. , 51.7], [185. , 47. ], [201. , 52. ], [190. , 45.9], [201. , 50.5], [197. , 50.3], [181. , 58. ], [190. , 46.4], [195. , 49.2], [181. , 42.4], [191. , 48.5], [187. , 43.2], [193. , 50.6], [195. , 46.7], [197. , 52. ], [200. , 50.5], [200. , 49.5], [191. , 46.4], [205. , 52.8], [187. , 40.9], [201. , 54.2], [187. , 42.5], [203. , 51. ], [195. , 49.7], [199. , 47.5], [195. , 47.6], [210. , 52. ], [192. , 46.9], [205. , 53.5], [210. , 49. ], [187. , 46.2], [196. , 50.9], [196. , 45.5], [196. , 50.9], [201. , 50.8], [190. , 50.1], [212. , 49. ], [187. , 51.5], [198. , 49.8], [199. , 48.1], [201. , 51.4], [193. , 45.7], [203. , 50.7], [187. , 42.5], [197. , 52.2], [191. , 45.2], [203. , 49.3], [202. , 50.2], [194. , 45.6], [206. , 51.9], [189. , 46.8], [195. , 45.7], [207. , 55.8], [202. , 43.5], [193. , 49.6], [210. , 50.8], [198. , 50.2], [211. , 46.1], [230. , 50. ], [210. , 48.7], [218. , 50. ], [215. , 47.6], [210. , 46.5], [211. , 45.4], [219. , 46.7], [209. , 43.3], [215. , 46.8], [214. , 40.9], [216. , 49. ], [214. , 45.5], [213. , 48.4], [210. , 45.8], [217. , 49.3], [210. , 42. ], [221. , 49.2], [209. , 46.2], [222. , 48.7], [218. , 50.2], [215. , 45.1], [213. , 46.5], [215. , 46.3], [215. , 42.9], [215. , 46.1], [216. , 44.5], [215. , 47.8], [210. , 48.2], [220. , 50. ], [222. , 47.3], [209. , 42.8], [207. , 45.1], [230. , 59.6], [220. , 49.1], [220. , 48.4], [213. , 42.6], [219. , 44.4], [208. , 44. ], [208. , 48.7], [208. , 42.7], [225. , 49.6], [210. , 45.3], [216. , 49.6], [222. , 50.5], [217. 
, 43.6], [210. , 45.5], [225. , 50.5], [213. , 44.9], [215. , 45.2], [210. , 46.6], [220. , 48.5], [210. , 45.1], [225. , 50.1], [217. , 46.5], [220. , 45. ], [208. , 43.8], [220. , 45.5], [208. , 43.2], [224. , 50.4], [208. , 45.3], [221. , 46.2], [214. , 45.7], [231. , 54.3], [219. , 45.8], [230. , 49.8], [214. , 46.2], [229. , 49.5], [220. , 43.5], [223. , 50.7], [216. , 47.7], [221. , 46.4], [221. , 48.2], [217. , 46.5], [216. , 46.4], [230. , 48.6], [209. , 47.5], [220. , 51.1], [215. , 45.2], [223. , 45.2], [212. , 49.1], [221. , 52.5], [212. , 47.4], [224. , 50. ], [212. , 44.9], [228. , 50.8], [218. , 43.4], [218. , 51.3], [212. , 47.5], [230. , 52.1], [218. , 47.5], [228. , 52.2], [212. , 45.5], [224. , 49.5], [214. , 44.5], [226. , 50.8], [216. , 49.4], [222. , 46.9], [203. , 48.4], [225. , 51.1], [219. , 48.5], [228. , 55.9], [215. , 47.2], [228. , 49.1], [216. , 47.3], [215. , 46.8], [210. , 41.7], [219. , 53.4], [208. , 43.3], [209. , 48.1], [216. , 50.5], [229. , 49.8], [213. , 43.5], [230. , 51.5], [217. , 46.2], [230. , 55.1], [217. , 44.5], [222. , 48.8], [214. , 47.2], [ nan, nan], [215. , 46.8], [222. , 50.4], [212. , 45.2], [213. , 49.9]])
penguins["species"].value_counts()
Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64
Y = penguins["species"].map({"Gentoo": 1, "Adelie": 2, "Chinstrap": 3}).to_numpy()
Y.shape
(344,)
Y
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
This is hard, and our intuition tends to fall apart. However, real high-dimensional data often lies on a lower-dimensional manifold.
What the heck does that mean?
If you sliced, rotated, projected, or otherwise warped your space in just the right way, you could represent the same information with fewer dimensions. The smallest number of dimensions needed to represent your data is called its intrinsic dimensionality.
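As a small illustration of this idea, the sketch below builds synthetic 3-D points that all lie on a 2-D plane, then counts how many directions of the centered data carry any variance. The data, the tolerance `tol`, and the helper name `estimate_linear_dimension` are made up for this example. It only detects flat (linear) structure, so treat it as a rough stand-in for intrinsic dimensionality rather than a general estimator.

```python
import numpy as np

def estimate_linear_dimension(X, tol=1e-8):
    """Count directions with non-negligible variance (linear structure only)."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Compare against the largest singular value to get a scale-free threshold.
    return int(np.sum(singular_values > tol * singular_values[0]))

rng = np.random.default_rng(0)
# 200 points described by 2 hidden coordinates...
latent = rng.normal(size=(200, 2))
# ...embedded in 3-D by a fixed linear map, so they all lie on a plane.
embedding = np.array([[1.0, 0.0, 2.0],
                      [0.0, 1.0, -1.0]])
X_plane = latent @ embedding

print(X_plane.shape)                       # (200, 3)
print(estimate_linear_dimension(X_plane))  # 2
```

The points are stored with three coordinates, but two are enough to describe them.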
Goal: discover structure without any ground-truth labels.
What might we mean by structure?
Dimensionality reduction techniques are ways to get your $n$ feature vectors from $d$ dimensions to $d'$ dimensions (where $d' < d$).
Good for:
In all cases, these likely come at the expense of some accuracy.
Here are two common approaches that are limited to linear notions of stretching, slicing, warping, etc.:
Principal component analysis (PCA) reduces dimensionality by finding $d'$ new features (each is a linear combination of the old features) that explain as much variance as possible.
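A minimal sketch of this idea using only NumPy, assuming the technique described is principal component analysis computed via the SVD of the centered data. The function name `pca_reduce` and the synthetic data are illustrative and not from the original notebook. Real data with missing values (like the `nan` rows in `X` above) would need those rows dropped first.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (n x d) onto its top n_components principal directions."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, sorted by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]                      # shape (d', d)
    explained_variance = singular_values**2 / (len(X) - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    # Each new feature is a linear combination of the old (centered) features.
    return centered @ components.T, ratio               # shapes (n, d'), (d',)

rng = np.random.default_rng(0)
# Synthetic 3-D data with very different spread along each axis.
X_demo = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])

X_reduced, explained_ratio = pca_reduce(X_demo, n_components=2)
print(X_reduced.shape)        # (500, 2)
print(explained_ratio.sum())  # close to 1: two components capture most of the variance
```

Here almost nothing is lost by dropping the third direction, because the data barely varies along it.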
Random projection reduces dimensionality by multiplying $X_{n \times d}$ by a random matrix $P_{d \times d'}$, resulting in a reduced-dimensionality dataset $X'_{n \times d'}$.
Huh?
Somewhat surprisingly, this works pretty well. What's our metric for "works"? It approximately preserves pairwise distances between points.
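To make that claim concrete, here is a rough sketch that projects synthetic data through a random matrix and compares every pairwise distance before and after. The text above does not say how $P$ is drawn, so the Gaussian entries scaled by $1/\sqrt{d'}$ are an assumption (one common choice), as are the sizes `n`, `d`, `d_prime` and the helper `pairwise_distances`.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all distinct pairs of rows of X."""
    sq_norms = (X**2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    upper = np.triu_indices(len(X), k=1)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(sq_dists[upper], 0.0))

rng = np.random.default_rng(0)
n, d, d_prime = 100, 1000, 300

X_high = rng.normal(size=(n, d))
# Assumption: Gaussian entries, scaled so squared lengths are preserved on average.
P = rng.normal(size=(d, d_prime)) / np.sqrt(d_prime)
X_low = X_high @ P

ratios = pairwise_distances(X_low) / pairwise_distances(X_high)
print(X_low.shape)                 # (100, 300)
print(ratios.min(), ratios.max())  # both should be fairly close to 1
```

A ratio of exactly 1 would mean a distance was preserved perfectly. Smaller values of `d_prime` spread the ratios further from 1.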
A common family of distance metrics is the $L^p$ distance:
$$d_p(a, b) = \sqrt[p]{\sum_{i=1}^d |a_i - b_i|^p}$$

When $p = 2$, this is the Euclidean distance we're all used to, based on the Pythagorean theorem; in 2D, it reduces to:

$$\sqrt{(b_x - a_x)^2 + (b_y - a_y)^2}$$
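The formula translates almost directly into NumPy. This is a small sketch for finite $p \ge 1$; the function name `lp_distance` is made up for the example.

```python
import numpy as np

def lp_distance(a, b, p):
    """L^p distance between vectors a and b, for finite p >= 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(lp_distance(a, b, p=1))  # 7.0 (sum of absolute differences)
print(lp_distance(a, b, p=2))  # 5.0 (Euclidean: a 3-4-5 triangle)
```

For a cross-check, `np.linalg.norm(np.subtract(a, b), ord=p)` should give the same values.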
Different values of $p$ give different behavior:
A few examples of the "unit circle" under different $L^p$ distances:
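These shapes can be regenerated with a sketch like the one below, which computes the 2-D points whose $L^p$ distance from the origin is exactly 1. The helper name `lp_unit_circle` and the particular values of $p$ are chosen for illustration only.

```python
import numpy as np

def lp_unit_circle(p, n_points=400):
    """Points (x, y) in 2-D whose L^p distance from the origin is 1."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    directions = np.column_stack([np.cos(theta), np.sin(theta)])
    # Rescale each direction so that its L^p norm is exactly 1.
    norms = (np.abs(directions) ** p).sum(axis=1) ** (1.0 / p)
    return directions / norms[:, None]

circles = {p: lp_unit_circle(p) for p in (1, 2, 4)}
for p, points in circles.items():
    lp_norms = (np.abs(points) ** p).sum(axis=1) ** (1.0 / p)
    print(p, points.shape, np.allclose(lp_norms, 1.0))
```

Passing each array to `plt.plot(points[:, 0], points[:, 1])` draws the shape: $p = 1$ gives a diamond, $p = 2$ the familiar circle, and larger $p$ looks more and more like a square.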
It's worth noting that these distances become less and less useful as $d$ gets larger. There are a few ways to think about this, but the simplest is just that more dimensions mean more opportunities for points to be far apart.
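One way to see this numerically is the rough experiment below. It assumes points drawn uniformly at random from the unit hypercube, which is a choice made for this illustration, and the helper `distance_summary` is hypothetical. For each $d$ it reports the average Euclidean distance from a query point to the others, along with how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def distance_summary(d, n_points=200, seed=0):
    """Mean distance and relative nearest/farthest contrast for one query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    return dists.mean(), contrast

for d in (2, 10, 100, 1000):
    mean_dist, contrast = distance_summary(d)
    print(d, mean_dist, contrast)
```

As $d$ grows, the mean distance rises while the contrast shrinks: everything ends up roughly equally far away, so "nearest" carries less information.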