Fall 2021
In this lab, you’ll enter a low-key machine learning competition on Kaggle and see how high you can climb on the leaderboard.
This Kaggle contest ends at 11:59PM UTC on November 30th, which corresponds to 4:00pm on Tuesday, November 30th in our local time zone. You won’t be able to submit to the leaderboard after that, so make sure your submissions are done by then!
You have until 10:00pm on Tuesday, November 30th to submit your notebook and screenshot to Canvas.
This lab will be done individually. In contrast with prior individual labs, I want you to do this one completely on your own. You may discuss general approaches and strategies, but you must do so away from computers: you should never see any part of anyone else’s code, and nobody else should see any part of yours.
You’ll need to create an account on Kaggle.com to enter the contest. After that, head over to the Tabular Playground Series November competition and read through the rules. This is a low-key competition that’s designed to be a little more interesting than the basic “hello world” datasets, but still approachable for relative novices.
The data for this contest is a little bigger than what we've worked with before. For this reason, you should not download a copy of the data to your home directory in the CS labs.
If you’re working in the CS labs, you should be able to access my copy of the data directly:
"/web/faculty/wehrwes/courses/data311_21f/data/spam/{name_of_file}") pd.read_csv(
where {name_of_file}
is train.csv
(the training data), test.csv
(the testing data, provided without labels), or sample_submission.csv
, a sample of the format your submission is expected to have.
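For example, a minimal sketch of loading all three files on the lab machines (using the path above) might look like this:

```python
import pandas as pd

base = "/web/faculty/wehrwes/courses/data311_21f/data/spam/"
train = pd.read_csv(base + "train.csv")
test = pd.read_csv(base + "test.csv")
sample_submission = pd.read_csv(base + "sample_submission.csv")

print(train.shape, test.shape)  # sanity-check sizes before modeling
```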
If you’re working on your own computer, you can download the zip file containing all the data either from my webpage here (preferred, if you’re on the campus network) or from Kaggle.
The Kaggle competition page explains your task: you are solving a binary classification problem. A few details worth highlighting:

- Your submission must follow a specific format; take a look at the sample_submission.csv file to see what it needs to look like. You can ask your classifier to produce these scores using predict_proba or decision_function, as we saw in L33.
- The evaluation metric is AUC (area under the ROC curve); sklearn has this built in under sklearn.metrics.roc_auc_score. A sketch putting both pieces together appears below.
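Here's a minimal sketch of producing scores, evaluating them with AUC on a held-out split, and writing a submission file. It assumes the data was loaded as shown earlier, with an "id" column and a "target" label column; verify the actual column names against the real files:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed column names -- check them against train.csv yourself.
X = train.drop(columns=["id", "target"])
y = train["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba returns one column per class; column 1 is the positive class.
val_scores = clf.predict_proba(X_val)[:, 1]
print("validation AUC:", roc_auc_score(y_val, val_scores))

# Scores for the unlabeled test set, in the sample_submission format.
submission = sample_submission.copy()
submission["target"] = clf.predict_proba(test.drop(columns=["id"]))[:, 1]
submission.to_csv("submission.csv", index=False)
```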
Suggestions:
- Use a small subset of the data (e.g., drawn with the DataFrame's sample method) to develop and test your models.
- In our usual terminology, train.csv is our training set, the 19% of the test data behind the public leaderboard is our validation set, and the remaining 81% is our test set. However, since you can only submit 5 times per day, I recommend splitting a validation set off from the training data so you can check your model on that as many times as you'd like (see the sketch after this list).
- You'll notice that, on the public leaderboard, more than half of the submissions are hovering between 0.74 and just over 0.75. You should be able to get into this range (and extra credit is available if you get at or near the top of the leaderboard!). My first submission had a major error in it and scored an AUC of around 0.53; my second submission, which did not involve a lot of tuning, achieved 0.74493.
- Consider transforming your features (tools for this live in sklearn.preprocessing).
- Consider tuning your model's hyperparameters (tools for this live in sklearn.model_selection).
- Since the data is large, speed may also be a factor: if a model is slow to train, you'll be able to try out fewer things, and you may end up with worse performance.
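Here is one way several of these suggestions might fit together; this is a sketch under the same assumed column names as before, with made-up subset sizes and parameter grids:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Develop on a small random subset first (DataFrame.sample), then rerun
# on the full training data once the approach looks promising.
small = train.sample(n=50_000, random_state=0)
X = small.drop(columns=["id", "target"])  # assumed column names
y = small["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Chain preprocessing (sklearn.preprocessing) with a model, then tune its
# hyperparameters via cross-validated grid search (sklearn.model_selection).
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)
```

A model that trains quickly lets you iterate more, which is often worth more than a marginally better but much slower one.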
Submit a notebook showing how you trained your classifier. If you tried other approaches that didn’t work as well, feel free to tell me about those.
Also submit a screenshot of your score on the Kaggle public leaderboard, along the lines of the following (this is what I was able to achieve without trying very hard):

[screenshot: an example public leaderboard entry]
As in past labs, your notebook should be correct and clear. Your AUC should be in the range described above.