Fall 2021
In this lab, you’ll enter a low-key machine learning competition on Kaggle and see how high you can climb on the leaderboard.
This Kaggle contest ends at 11:59PM UTC on November 30th, which corresponds to 4:00pm on Tuesday, November 30th in our local time zone. You won’t be able to submit to the leaderboard after that, so make sure your submissions are done by then!
You have until 10:00pm on Tuesday, November 30th to submit your notebook and screenshot to Canvas.
This lab will be done individually. In contrast with prior individual labs, I want you to do this one completely on your own. You may discuss general approaches and strategies, but you must do so away from computers: you should never see any part of anyone else’s code, and nobody else should see any part of yours.
You’ll need to create an account on Kaggle.com to enter the contest. After that, head over to the Tabular Playground Series November competition and read through the rules. This is a low-key competition that’s designed to be a little more interesting than the basic “hello world” datasets, but still approachable for relative novices.
The data for this contest is a little bigger than what we've worked with before. For this reason, you should not download a copy of the data to your home directory in the CS labs.
If you’re working in the CS labs, you should be able to access my copy of the data directly:
"/web/faculty/wehrwes/courses/data311_21f/data/spam/{name_of_file}") pd.read_csv(
where {name_of_file}
is train.csv
(the training data), test.csv
(the testing data, provided without labels), or sample_submission.csv
, a sample of the format your submission is expected to have.
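For example, a minimal sketch of loading all three files on the lab machines (using the path above) might look like this:

```python
import pandas as pd

base = "/web/faculty/wehrwes/courses/data311_21f/data/spam/"
train = pd.read_csv(base + "train.csv")
test = pd.read_csv(base + "test.csv")
sample_submission = pd.read_csv(base + "sample_submission.csv")

print(train.shape, test.shape)  # sanity-check sizes before modeling
```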
If you’re working on your own computer, you can download the zip file containing all the data either from my webpage here (preferred, if you’re on the campus network) or from Kaggle.
The Kaggle competition page explains your task: you are solving a binary classification problem. A few details worth highlighting:

- Your submission must follow a specific format; take a look at the sample_submission.csv file to see what it needs to look like. You can ask your classifier to produce these scores using predict_proba or decision_function, as we saw in L33.
- The evaluation metric is AUC (area under the ROC curve); sklearn has this built in under sklearn.metrics.roc_auc_score. A sketch putting both pieces together appears below.
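Here's a minimal sketch of producing scores, evaluating them with AUC on a held-out split, and writing a submission file. It assumes the data was loaded as shown earlier, with an "id" column and a "target" label column; verify the actual column names against the real files:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed column names -- check them against train.csv yourself.
X = train.drop(columns=["id", "target"])
y = train["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba returns one column per class; column 1 is the positive class.
val_scores = clf.predict_proba(X_val)[:, 1]
print("validation AUC:", roc_auc_score(y_val, val_scores))

# Scores for the unlabeled test set, in the sample_submission format.
submission = sample_submission.copy()
submission["target"] = clf.predict_proba(test.drop(columns=["id"]))[:, 1]
submission.to_csv("submission.csv", index=False)
```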
Suggestions:
- Use a small subset of the data (e.g., drawn with the DataFrame's sample method) to develop and test your models.
- In our usual terminology, train.csv is our training set, the 19% of the test data behind the public leaderboard is our validation set, and the remaining 81% is our test set. However, since you can only submit 5 times per day, I recommend splitting a validation set off from the training data so you can check your model on that as many times as you'd like (see the sketch after this list).
- You'll notice that, on the public leaderboard, more than half of the submissions are hovering between 0.74 and just over 0.75. You should be able to get into this range (and extra credit is available if you get at or near the top of the leaderboard!). My first submission had a major error in it and scored an AUC of around 0.53; my second submission, which did not involve a lot of tuning, achieved 0.74493.
- Consider transforming your features (tools for this live in sklearn.preprocessing).
- Consider tuning your model's hyperparameters (tools for this live in sklearn.model_selection).
- Since the data is large, speed may also be a factor: if a model is slow to train, you'll be able to try out fewer things, and you may end up with worse performance.
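Here is one way several of these suggestions might fit together; this is a sketch under the same assumed column names as before, with made-up subset sizes and parameter grids:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Develop on a small random subset first (DataFrame.sample), then rerun
# on the full training data once the approach looks promising.
small = train.sample(n=50_000, random_state=0)
X = small.drop(columns=["id", "target"])  # assumed column names
y = small["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Chain preprocessing (sklearn.preprocessing) with a model, then tune its
# hyperparameters via cross-validated grid search (sklearn.model_selection).
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)
```

A model that trains quickly lets you iterate more, which is often worth more than a marginally better but much slower one.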
Submit a notebook showing how you trained your classifier. If you tried other approaches that didn’t work as well, feel free to tell me about those.
Also submit a screenshot of your score on the Kaggle public leaderboard, along the lines of the following (this is what I was able to achieve without trying very hard):

[screenshot: an example public leaderboard entry]
As in past labs, your notebook should be correct and clear. Your AUC should be in the range described above.