import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    "Income": [0.49, 0.18, 0.31, 0.40, 0.24],
    "CrimeRate": [0.09, 0.45, 0.23, 0.19, 0.48],
})
Generally: unseen data is drawn from the same distribution as your dataset.
Consequence: We don't assume correlation is causation, but we do assume that observed correlations will hold in unseen data.
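As a small illustration of an "observed correlation", the sketch below computes the Pearson correlation between the two columns of the toy dataset. It uses NumPy arrays (`income`, `crime`) that simply copy the values from the DataFrame; the variable names are illustrative and not part of the original notes.

```python
import numpy as np

# Values copied from the "Income" and "CrimeRate" columns of the toy dataset.
income = np.array([0.49, 0.18, 0.31, 0.40, 0.24])
crime = np.array([0.09, 0.45, 0.23, 0.19, 0.48])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two variables.
r = float(np.corrcoef(income, crime)[0, 1])
print(f"Pearson correlation: {r:.3f}")
```

The correlation in this sample is strongly negative. The modeling assumption is only that a similar relationship would show up in new data from the same distribution, not that income causes crime or the reverse.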
Specific model classes make additional assumptions. For example, linear regression assumes:
Generalization is the ability of a model to perform well on unseen data (i.e., data that was not in the training set).
sns.scatterplot(data=df, x="Income", y="CrimeRate")
Consider the following possible model classes:
Question 1: Which of these choices of model class will result in the best (lowest) empirical risk $\hat{R}(h; \mathcal{X})$?
Question 2: Which of these will result in the best (lowest) empirical risk on another batch of data $\mathcal{X}'$ drawn from the same distribution as $\mathcal{X}$?
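One way to explore Question 1 numerically is sketched below. It assumes, purely for illustration, that the candidate model classes are polynomials of degree 0 through 4 and that the loss is squared error; neither choice is stated in the notes. The helper `empirical_risk` is a hypothetical name.

```python
import numpy as np

# The five (Income, CrimeRate) points from the toy dataset.
income = np.array([0.49, 0.18, 0.31, 0.40, 0.24])
crime = np.array([0.09, 0.45, 0.23, 0.19, 0.48])

def empirical_risk(coeffs, x, y):
    """Mean squared error of the polynomial with coefficients `coeffs` on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Assumed model classes: polynomials of degree 0, 1, 2, 3, 4,
# each fit by least squares on the training data.
train_risk = {}
for degree in range(5):
    coeffs = np.polyfit(income, crime, deg=degree)
    train_risk[degree] = empirical_risk(coeffs, income, crime)

for degree, risk in train_risk.items():
    print(f"degree {degree}: empirical risk = {risk:.6f}")
```

Because each class contains the previous one, the training risk cannot increase with degree, and the degree-4 polynomial passes through all five points, so its empirical risk is essentially zero. This sketch says nothing about Question 2: measuring risk on $\mathcal{X}'$ would require a second batch of data, which is not available here.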
Let's formalize the above distinction.
What we truly care about is a quantity known as (true) risk: $R(h; \mathcal{X})$.
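One common way to write the two quantities side by side is given here as a sketch. It assumes a loss function $\ell$, a dataset $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n}$, and an underlying data distribution $\mathcal{D}$; these symbols are assumptions and do not appear in the notes, which write true risk with $\mathcal{X}$ as its second argument.

$$
\hat{R}(h; \mathcal{X}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big),
\qquad
R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell\big(h(x), y\big)\big]
$$

Empirical risk averages the loss over the data in hand, while true risk is the expected loss on a fresh draw from the distribution.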
There are three contributors to risk:
To understand bias and variance, we need to consider a hypothetical: