DATA 311 - Lab 8: Linear Regression - YMMV

Scott Wehrwein

Winter 2023

Introduction

In this lab, you’ll work on a regression problem, using linear regression (and some of the tricks that can be used to make it more effective) to make some predictions. Then, you’ll do some analysis to validate and interpret the model.

Collaboration Policy

This lab will be done individually. As usual, you may spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.

Getting Started

There is no starter notebook for this lab. Create a notebook with a title and your name at the top.

The Data

For this lab, we’ll work with the mpg dataset that comes built into Seaborn. Your job is to build a model that effectively predicts the miles per gallon column based on the values in the other columns. You may want to spend a couple minutes getting familiar with the dataset and its columns.
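One quick way to get familiar with a dataframe is to tabulate each column's type, missing values, and number of distinct values. The sketch below is only an illustration: the helper name `summarize` is hypothetical, and the tiny frame is made-up stand-in data so the example runs offline. In your notebook you would pass in the frame returned by `seaborn.load_dataset("mpg")` instead.

```python
import numpy as np
import pandas as pd


def summarize(df):
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Made-up stand-in rows (NOT the real dataset); in the lab, use
# df = seaborn.load_dataset("mpg") and call summarize(df).
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 27.0],
    "horsepower": [130.0, np.nan, 70.0, 88.0],
    "origin": ["usa", "usa", "japan", "europe"],
})
summary = summarize(toy)
print(summary)
```

Columns with missing values or non-numeric types are worth noting now, since LinearRegression cannot be fit on them without some preprocessing.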

Your Tasks

There are two parts to this lab. In the first, you’ll create data splits and try out various tricks to get linear regression to predict miles per gallon (the mpg column) with high accuracy. In the second part, you’ll examine your trained model to validate (i.e., convince yourself that it’s doing its job well) and interpret it (i.e., use the model to learn which features are most significant predictors of mpg).

Part 1 - Developing The Model

Start by splitting the dataset into training, validation, and testing sets as we discussed in lecture. Since the dataset is sorted by model year, you should probably randomize the splits. Remember that your test set is sacred: until you’re sure you’re done modifying your model, do not touch your test set! You may find the sklearn.model_selection.train_test_split function helpful; I strongly recommend passing a number into the random_state argument so that if you run the code again, you get the same splits - otherwise you risk having data in a test set that you previously trained or validated a model on. I split my data into train, val, and test sets of 250, 75, and 67, respectively.
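A three-way split can be built from two calls to train_test_split. This is a minimal sketch, not the only way to do it: the helper name `three_way_split` and the seed value are arbitrary choices, and the arrays are stand-in data with 392 rows (chosen so the split sizes match the 250/75/67 mentioned above).

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, n_val, n_test, seed=0):
    """Shuffle and split into train/val/test; n_val and n_test are row counts."""
    # First carve off the test set, then split the remainder into train/val.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Stand-in data: 392 rows, 2 features; y holds the row index so we can
# check that no row lands in more than one split.
X = np.arange(392 * 2).reshape(392, 2)
y = np.arange(392)

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(
    X, y, n_val=75, n_test=67, seed=311)
print(len(X_train), len(X_val), len(X_test))  # 250 75 67
```

Because random_state is fixed, rerunning the cell reproduces exactly the same rows in each split.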

Your next task is to train a successful linear regression model to predict the mpg column. You’ll likely want to make use of the sklearn.linear_model.LinearRegression model for this.

Evaluating Your Model

Before we even train a model, we need to know how to tell if a model is good. For this I suggest looking at two metrics:

The coefficient of determination is a relative measure of performance that is unaffected by the scale of your data. It’s computed as \(1 - \frac{u}{v}\), where \(u\) is the sum of squared residuals (that is, the thing your model is trying to minimize), and \(v\) is the sum of squared deviations of \(y\) from its mean. This is a measure of how much better you’re doing than just predicting the mean of the \(y\)’s being scored; if you did that, you’d get a score of 0, and if you have no error you’d get the maximal score of 1. It’s possible to do worse than predicting the mean, so the score can go negative. This metric is conveniently computed by LinearRegression’s score method.
The root mean squared error is the square root of the average squared residual. Note that this will be sensitive to the scale of your data, but it is in the units of the quantity we’re trying to predict, so it gives us a better sense of the absolute performance of the model. Though it’s not too hard to compute this yourself, you can also use sklearn.metrics.mean_squared_error with the squared kwarg set to False; newer scikit-learn releases deprecate that kwarg in favor of sklearn.metrics.root_mean_squared_error.
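Both metrics are short enough to compute by hand, which is a useful check on your understanding. The sketch below uses synthetic stand-in data and a hypothetical helper named `r2_and_rmse`; it follows the \(1 - \frac{u}{v}\) definition above and compares the result against LinearRegression's score method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def r2_and_rmse(y_true, y_pred):
    """Return (coefficient of determination, root mean squared error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    u = np.sum(residuals ** 2)                  # sum of squared residuals
    v = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
    return 1.0 - u / v, np.sqrt(np.mean(residuals ** 2))


# Synthetic stand-in data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

model = LinearRegression().fit(X_train, y_train)
for name, Xs, ys in [("train", X_train, y_train), ("val", X_val, y_val)]:
    r2, rmse = r2_and_rmse(ys, model.predict(Xs))
    print(f"{name}: R^2 = {r2:.3f} (score: {model.score(Xs, ys):.3f}), RMSE = {rmse:.3f}")
```

Reporting both splits side by side makes a train/validation gap easy to spot.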

When evaluating your model, check its performance on both the training and validation sets. Large differences in training and validation accuracy can suggest overfitting, so keep an eye out for that.

Trying Out Tricks

Though we’re using a standard linear regression model, there are still a lot of decisions we can make that may affect the performance of your model. I recommend training the simplest possible model first, then trying out different ideas for preprocessing that might help your model perform better. These are listed in no particular order, and each one may or may not help - it’s up to you to play around and find what works.

Column choice: What’s the effect of including or excluding certain columns from your training data?
Categorical columns: Relatedly, the categorical columns aren’t immediately applicable to a regression problem because they’re not numerical. But you can convert them to numbers in various ways.
- Turn each possible value into an integer. If your input data has a survey question with possible answers “yes”, “maybe”, and “no”, then “yes” becomes 0, “maybe” becomes 1, and “no” becomes 2.
- Use a “one-hot encoding”, where each possible value becomes its own feature with a 1 if the datapoint is in that category. In the above example, you’d have a “yes” column with a 1 where the original column was “yes” and zeros everywhere else, and likewise for a “maybe” column and a “no” column.
The above can, once again, be done manually, but sklearn.preprocessing also has this functionality built in: OrdinalEncoder and OneHotEncoder.
Data Scaling: If the magnitudes of your input features differ by a lot, the model may do a better job of fitting when the features are all scaled to \(z\)-scores. Even if this doesn’t affect model performance, it can help with interpretability (see Part 2). You can use sklearn.preprocessing.StandardScaler(), or there’s also a RobustScaler that is less sensitive to outliers.
Feature Expansions: As discussed in class, nonlinear relationships can be modeled by applying nonlinear transformations to the feature values before fitting the model. This could be any function, be it polynomial, exponential, or something else. Simple things are pretty easy to do by yourself, but sklearn has a sklearn.preprocessing.PolynomialFeatures tool that will give you all the polynomial combinations of features. For example, if you started with input features \(\begin{bmatrix} x_1 & x_2 \end{bmatrix}\) and asked for 2nd-order polynomial features you’d get \(\begin{bmatrix}1 & x_1 & x_2 & x_1^2 & x_1x_2 & x_2^2 \end{bmatrix}\). Be careful when using this - it explodes the number of features and increases your model complexity quickly, which increases the danger of overfitting.
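Several of the tricks above can be chained together so that the same preprocessing is applied to every split. The following is one possible arrangement, not a recommended recipe: it assumes a ColumnTransformer inside a pipeline, uses made-up stand-in data with hypothetical column choices, and picks degree 2 arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Made-up stand-in data (NOT the real dataset), so the sketch runs offline.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "weight": rng.uniform(1500, 5000, n),
    "horsepower": rng.uniform(50, 230, n),
    "origin": rng.choice(["usa", "europe", "japan"], n),
})
toy["mpg"] = (45 - 0.006 * toy["weight"] - 0.03 * toy["horsepower"]
              + rng.normal(scale=1.0, size=n))

numeric_cols = ["weight", "horsepower"]
categorical_cols = ["origin"]
feature_cols = numeric_cols + categorical_cols

preprocess = ColumnTransformer([
    # Expand numeric columns to 2nd-order terms, then scale to z-scores.
    # include_bias=False because LinearRegression fits its own intercept.
    ("num", make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          StandardScaler()), numeric_cols),
    # One 0/1 column per category; unseen categories become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = make_pipeline(preprocess, LinearRegression())

train, val = toy.iloc[:90], toy.iloc[90:]
model.fit(train[feature_cols], train["mpg"])

n_features = model[0].transform(train[feature_cols]).shape[1]
val_score = model.score(val[feature_cols], val["mpg"])
print("features after preprocessing:", n_features)
print("validation R^2:", round(val_score, 3))
```

Fitting the preprocessing only on the training split, as the pipeline does here, keeps validation and test statistics from leaking into the scaler or encoder.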

You’re not limited to the above - feel free to explore the other preprocessing features built into scikit-learn, or come up with your own ideas. The scores you achieve will depend on your data splits, but based on my experiments, you’ll likely be able to achieve a coefficient of determination score of well over 0.8 on both training and validation - ideally you can go even higher than that.

Part 2 - Validation and Interpretation

One of the nice things about linear regression models (in contrast to many fancier techniques) is that they are relatively explainable. This means we can probe the model and understand some things about how it’s working, which is useful both to build confidence that it’s a well-behaved model, and also to help us understand things about the underlying data.

Try out the following “sanity checks” to help validate your model, and comment on whether each shows the “good outcome” - namely, that our model is behaving as expected.

Calculate the residuals (that is, the differences between your predictions and the ground-truth values) on the validation set and plot their distribution. If the model is working well and its assumptions hold, we should see a Gaussian-like distribution with a mean near zero.
We can’t scatterplot all our features vs the predictions because our features are too high-dimensional. But if we scatterplot our predictions vs the ground-truth values, we should see a roughly linear relationship.
We can check for homoscedasticity - the property that residuals do not vary depending on the value of \(y\) - by plotting the residuals versus the ground-truth \(y\) value. If we get a scatterplot with no visible patterns or correlation, that’s a good sign.
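The three checks above can be drawn side by side. This is a sketch under a few assumptions: the helper name `plot_diagnostics` is hypothetical, residuals are taken as ground truth minus prediction (the sign convention is a choice), and the arrays are synthetic stand-ins for your validation targets and predictions.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_diagnostics(y_true, y_pred):
    """Residual histogram, predicted-vs-true scatter, residuals-vs-true scatter."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred  # sign convention: truth minus prediction

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    axes[0].hist(residuals, bins=20)
    axes[0].set(title="Residual distribution", xlabel="residual", ylabel="count")

    axes[1].scatter(y_true, y_pred, s=10)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    axes[1].plot(lims, lims, color="gray", linestyle="--")  # perfect predictions
    axes[1].set(title="Predicted vs. ground truth",
                xlabel="ground truth", ylabel="predicted")

    axes[2].scatter(y_true, residuals, s=10)
    axes[2].axhline(0, color="gray", linestyle="--")
    axes[2].set(title="Residuals vs. ground truth",
                xlabel="ground truth", ylabel="residual")

    fig.tight_layout()
    return fig, residuals


# Synthetic stand-ins for validation targets and model predictions.
rng = np.random.default_rng(0)
y_true = rng.uniform(10, 45, size=75)
y_pred = y_true + rng.normal(scale=3.0, size=75)

fig, residuals = plot_diagnostics(y_true, y_pred)
print("mean residual:", residuals.mean())
```

In your notebook, replace the synthetic arrays with your validation targets and the output of your model's predict method.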

Finally, linear regression gives us a directly interpretable signal about what features the model found useful: the coefficients themselves. You can access these via linear_regression_model.coef_, and see the weight applied to each input feature to compute the output value. If this number is large in magnitude for a given feature, that feature was important in computing the result; if it is close to zero, then that feature didn’t matter much.

Caveat 1: the scale of your features also affects the coefficient. If two equally important features vary from 0 to 1,000 (feature 1) and 0 to 10 (feature 2), the coefficient on feature 1 will be 1/100th of the feature 2 coefficient. For this reason, coefficients are best interpreted when you’ve normalized your input features to \(z\)-scores before fitting the model so they have around the same range. If you didn’t do that above, go ahead and add a preprocessing step that scales the features so we can interpret the coefficients.
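The effect in Caveat 1 is easy to reproduce on synthetic data. In this sketch the two made-up features contribute equally to \(y\) over their full ranges (0 to 1,000 and 0 to 10); the data, noise level, and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
f1 = rng.uniform(0, 1000, n)  # feature 1: large range
f2 = rng.uniform(0, 10, n)    # feature 2: small range

# Each feature moves y by about 10 units across its full range,
# so they are "equally important" by construction.
y = 0.01 * f1 + 1.0 * f2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([f1, f2])

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print("raw coefficients:   ", raw.coef_)     # differ by a factor of roughly 100
print("scaled coefficients:", scaled.coef_)  # roughly equal to each other
```

After scaling to \(z\)-scores, each coefficient reads as the change in \(y\) per one standard deviation of that feature, which puts the features on comparable footing.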

Caveat 2: Keep in mind that other preprocessing, especially feature expansions, will also affect the coefficients; if you turned your original features into a new set of features, you’ll get coefficients for the new features. You’ll need to find some way to interpret the new coefficients in terms of the original columns, based on the transformations you did.

Show (e.g., with a bar plot) the coefficients on each input feature and comment on which features were most important to the model; does this make intuitive sense?

Evaluating on the Test Set

When you’ve refined your preprocessing and model, performed the above validation steps, and you’re happy with its performance, go ahead and run it on the test set and comment on your test accuracy. If it is similar to your validation accuracy, great! If it isn’t, that’s okay - you won’t lose credit. However, if there’s evidence that you cheated and ran on the test set more than once, you will lose credit.

Extra Credit

When I trained my model, the coefficients revealed a somewhat disappointing result: the model year is one of the largest predictors of mpg; this says that cars got more efficient over time, but it has nothing to do with the actual properties of the engines. Try training a model that uses only values related to the engine itself, and see what kind of accuracy you can achieve.

Submitting your work

Notebook

Submit a single notebook with your final training, validation, and interpretation. Your submitted notebook should not include all the experimentation you did - just submit a clean, readable version of how you trained, validated, and interpreted your best model. See the rubric for details of what I’ll be grading on.

Survey

Finally, fill out the Week 8 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.

Rubric

Part 1 is worth 15 points:

10 points: you arrived at a sensible and reasonably accurate model
5 points: the training code is clear and readable, and the preprocessing steps are well documented

Part 2 is worth 15 points:

4 points for each validation step:
- 2 points for correctness (i.e., you plotted the correct things)
- 2 points for explaining what it shows (i.e., you interpreted the results correctly)
3 points for reporting your accuracy on the test set

Extra Credit

Up to 2 points for a good model trained only on engine-related features.