DATA 311 - Lab 9: Enter an ML Competition!

Scott Wehrwein

Fall 2021

Introduction

In this lab, you’ll enter a low-key machine learning competition on Kaggle and see how high you can climb on the leaderboard.

A note on timing:

This Kaggle contest ends at 11:59 PM UTC on November 30th, which is just before 4:00 PM on Tuesday, November 30th in our local time zone. You won’t be able to submit to the leaderboard after that, so make sure your submissions are in by then!

You have until 10:00pm on Tuesday, November 30th to submit your notebook and screenshot to Canvas.

Collaboration Policy

This lab will be done individually. In contrast with prior individual labs, I want you to do this one completely on your own. You may discuss general approaches and strategies, but you must do so away from computers: you should never see any part of anyone else’s code, and nobody else should see any part of yours.

Getting Started

You’ll need to create an account on Kaggle.com to enter the contest. After that, head over to the Tabular Playground Series November competition and read through the rules. This is a low-key competition that’s designed to be a little more interesting than the basic “hello world” datasets, but still approachable for relative novices.

The Data

The data for this contest is a little bigger than what we’ve worked with before. For this reason, you should not download a copy of the data to your home directory in the CS labs.

If you’re working in the CS labs, you should be able to access my copy of the data directly:

import pandas as pd
df = pd.read_csv("/web/faculty/wehrwes/courses/data311_21f/data/spam/{name_of_file}")

where {name_of_file} is train.csv (the training data), test.csv (the testing data, provided without labels), or sample_submission.csv, a sample of the format your submission is expected to have.
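For instance, a minimal loading sketch might look like the following (the path is the CS-lab copy described above; if you downloaded the data yourself, point DATA_DIR at your local folder instead):

# Minimal loading sketch; adjust DATA_DIR if you downloaded the data yourself.
import pandas as pd

DATA_DIR = "/web/faculty/wehrwes/courses/data311_21f/data/spam"
train = pd.read_csv(DATA_DIR + "/train.csv")                          # features + labels
test = pd.read_csv(DATA_DIR + "/test.csv")                            # features only, no labels
sample_submission = pd.read_csv(DATA_DIR + "/sample_submission.csv")  # expected submission format

print(train.shape, test.shape)
print(train.columns)  # check the actual column names (id, feature columns, label)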

If you’re working on your own computer, you can download the zip file containing all the data either from my webpage here (preferred, if you’re on the campus network) or from Kaggle.

Your Tasks

The Kaggle competition page explains your task: you are solving a binary classification problem. A few details are worth highlighting:

Suggestions:

How good is good enough?

You’ll notice that, on the public leaderboard, more than half of the submissions hover between an AUC of 0.74 and just over 0.75. You should be able to get into this range (and extra credit is available if you finish at or near the top of the leaderboard!). My first submission had a major error in it and scored an AUC of around 0.53; my second submission, which did not involve much tuning, achieved 0.74493.
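If you want a rough sanity check before tuning anything, a baseline along the following lines should already land well above the ~0.5 of a broken submission. This is only a sketch: the column names "id" and "target" are assumptions (check train.columns), and HistGradientBoostingClassifier is just one reasonable fast choice, not a required approach.

# Baseline sketch, not a tuned solution. Column names ("id", "target") are
# assumptions; verify them against train.columns before running.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

DATA_DIR = "/web/faculty/wehrwes/courses/data311_21f/data/spam"
train = pd.read_csv(DATA_DIR + "/train.csv")

X = train.drop(columns=["id", "target"])
y = train["target"]

model = HistGradientBoostingClassifier()  # handles large tabular data quickly
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print("Cross-validated AUC:", scores.mean())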

What should you try to make your score better?

Since the data is large, speed may also be a factor: if a model is slow to train, you’ll be able to try out fewer things and you may end up with worse performance.
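Once you’re happy with a model, you’ll need to turn its predictions on test.csv into a file matching sample_submission.csv before you can appear on the leaderboard. A minimal sketch follows (again, "id" and "target" are assumed column names, and the model choice is just an example):

# Submission-file sketch. Paths and column names ("id", "target") are
# assumptions; check the actual files to confirm.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

DATA_DIR = "/web/faculty/wehrwes/courses/data311_21f/data/spam"
train = pd.read_csv(DATA_DIR + "/train.csv")
test = pd.read_csv(DATA_DIR + "/test.csv")

model = HistGradientBoostingClassifier()
model.fit(train.drop(columns=["id", "target"]), train["target"])

submission = pd.DataFrame({
    "id": test["id"],
    # AUC is computed from scores, so submit probabilities rather than 0/1 labels
    "target": model.predict_proba(test.drop(columns=["id"]))[:, 1],
})
submission.to_csv("submission.csv", index=False)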

Submitting your work

Submit a notebook showing how you trained your classifier. If you tried other approaches that didn’t work as well, feel free to tell me about those.

Also submit a screenshot of your score on the Kaggle public leaderboard, along the lines of the following example, which shows what I was able to achieve without trying very hard:

Rubric

As in past labs, your notebook should be correct and clear. Your AUC should be in the range described above.