Fall 2025
In this lab, we will use simple machine learning techniques to solve a prediction problem: predicting the cuisine of a recipe given a list of its ingredients.
This lab will be done individually. You can brainstorm with any classmates about ideas for feature representations, preprocessing, or experimental process. However, your solution must be uniquely your own. If you had fruitful discussions that led to ideas you used, you must acknowledge the ideas or suggestions you received from other classmates in an acknowledgments section at the end of the notebook you submit.
The data for this lab (recipes.json) comes in JSON format containing a list of recipes, each represented as a JSON object like the following:
{
  "id": 24717,
  "cuisine": "indian",
  "ingredients": [
    "tumeric",
    "vegetable stock",
    "tomatoes",
    "garam masala",
    "naan",
    "red lentils",
    "red chili peppers",
    "onions",
    "spinach",
    "sweet potatoes"
  ]
},

Your task is to use the ingredients list for a recipe to predict which one of 20 cuisine labels it has.
You can read this into a list of Python dicts as follows:

import json

with open('recipes.json', 'r') as f:
    recipes = json.load(f)

Read the following:
- the scikit-learn documentation for train_test_split

In a text editor of your choice that is capable of exporting to PDF, answer these questions:

- What is a Pipeline?
- What does the weights hyperparameter do?
- What does the stratify argument to the train_test_split function do?

Optional, if you want to get an early start on the lab: load the data, create a train/val split, figure out some basic way of representing each recipe as a feature vector, and train a KNN classifier on it.
Create a new notebook for Lab 7. Title it and include your name at the top. Your notebook will eventually have the 5 sections described below.
Load the data as described above, then immediately split the data
into a training set and a validation
set. The specifics are up to you, but I recommend using a
constant random seed (pass a constant integer to the
random_state argument to train_test_split) so
that your split is reproducible and your training set won’t change.
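For example, a split along these lines, assuming the recipes list loaded earlier (the 80/20 ratio, the seed value, and the use of stratify here are illustrative choices, not requirements):

from sklearn.model_selection import train_test_split

# A fixed random_state keeps the split reproducible across notebook runs.
# stratify keeps the cuisine proportions similar in both splits
# (see the pre-lab question about this argument).
cuisines = [r['cuisine'] for r in recipes]
train_recipes, val_recipes = train_test_split(
    recipes, test_size=0.2, random_state=42, stratify=cuisines
)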
Next, create a new section of your notebook titled Baselines. In this section, write code to calculate or estimate the classification accuracy on your validation set of at least two simple baselines. At a minimum, you should estimate the accuracy of random guessing and calculate the accuracy of predicting the most common label.
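A minimal sketch of these two baselines, assuming the train_recipes/val_recipes split from the previous section:

from collections import Counter

y_val = [r['cuisine'] for r in val_recipes]

# Random guessing among the cuisines seen in training: expected accuracy
# is 1/(number of labels), i.e. 1/20 = 5% for this dataset.
n_labels = len({r['cuisine'] for r in train_recipes})
random_guess_acc = 1 / n_labels

# Most-common-label baseline: always predict the most frequent cuisine in
# the training set, and measure how often it matches the validation labels.
most_common = Counter(r['cuisine'] for r in train_recipes).most_common(1)[0][0]
most_common_acc = sum(label == most_common for label in y_val) / len(y_val)

print(f"random guessing: {random_guess_acc:.3f}")
print(f"most common label ({most_common}): {most_common_acc:.3f}")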
In this section, you’ll build the data matrix X and ground-truth labels y that are required for supervised learning. As a reminder, X is an n × d array where each of n rows is a d-dimensional feature vector for one datapoint, while y is a 1D length-n vector of ground-truth labels. Keep in mind that your X should not include the cuisine label - that’s the prediction target, so we can’t include it in our features!
Since the data we have about each recipe does not lend itself to trivial conversion to a feature vector, this part is likely to be the most interesting part of the lab, and your choices here will have the greatest effect on your classification accuracy. You have latitude to do this pretty much however you want! That said, I want you to start simple at first: Rather than dream up the fanciest way to do this, start by setting a stronger baseline than the simple ones you calculated above.
Construct your strong-baseline dataset and store it in a numpy array
called X_baseline_train, and apply the same processing to
produce X_baseline_val for your validation set.
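One simple illustrative encoding is a multi-hot "bag of ingredients" vector over the training vocabulary; this is just one reasonable starting point, not the required approach:

import numpy as np

# Vocabulary of ingredients seen in the training set; one feature column each.
vocab = sorted({ing for r in train_recipes for ing in r['ingredients']})
col_of = {ing: i for i, ing in enumerate(vocab)}

def to_multihot(recipe_list):
    # Each row gets a 1 in the column of every ingredient it contains;
    # ingredients never seen in training are simply ignored.
    X = np.zeros((len(recipe_list), len(vocab)), dtype=np.float32)
    for row, r in enumerate(recipe_list):
        for ing in r['ingredients']:
            if ing in col_of:
                X[row, col_of[ing]] = 1.0
    return X

X_baseline_train = to_multihot(train_recipes)
X_baseline_val = to_multihot(val_recipes)
y_train = [r['cuisine'] for r in train_recipes]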
While developing your feature extraction approach, you may want to work with only a subset of the data if your processing is at all time-consuming. When you run this section of the notebook for the final time before submission, make sure you’re using the entire training set.
Train a K-nearest-neighbors classifier on your dataset using
sklearn. For now, you don’t need to tune hyperparameters -
just use Euclidean distance and a value of K that seems sensible to you.
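For instance (K=5 here is only a placeholder; sklearn's default metric is already Euclidean):

from sklearn.neighbors import KNeighborsClassifier

# KNeighborsClassifier uses Euclidean (Minkowski p=2) distance by default.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_baseline_train, y_train)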
Use the trained baseline classifier to make predictions and evaluate
its performance. Write this code in a function so you
can use it again later! Your evaluation function might, for example,
take the following arguments:
X_train, X_val, y_train, y_val, classifier.
At a minimum, your evaluation section should output the overall validation accuracy and a confusion matrix; I recommend using sklearn's confusion_matrix to calculate it and Seaborn's heatmap to visualize it.

To get credit for this section, your strong baseline must have better validation accuracy than both the random guessing and most-common-label baselines you developed in Part 2.
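One possible shape for such an evaluation function, using the argument list suggested above (fitting inside the function is a design choice; you could instead pass in an already-trained classifier):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(X_train, X_val, y_train, y_val, classifier):
    # Fit, predict, and report validation accuracy.
    classifier.fit(X_train, y_train)
    preds = classifier.predict(X_val)
    acc = accuracy_score(y_val, preds)
    print(f"validation accuracy: {acc:.3f}")

    # Confusion matrix visualized as a heatmap.
    labels = sorted(set(y_val) | set(preds))
    cm = confusion_matrix(y_val, preds, labels=labels)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, xticklabels=labels, yticklabels=labels)
    plt.xlabel('predicted cuisine')
    plt.ylabel('true cuisine')
    plt.show()
    return acc

evaluate(X_baseline_train, X_baseline_val, y_train, y_val, knn)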
Now the fun part: see how high you can get your validation accuracy! Find a better feature extraction strategy, choose a better classifier model, and/or tune hyperparameters to get your classification accuracy as high as you can get it.
When you’re finished, this section should include fully reproducible code that begins with the dataset loaded in Part 1 and ends with a call to the evaluation function you wrote in Part 5, showing your best model’s performance. Don’t modify your code from Parts 2-4, and keep any modifications to Part 5 general enough that it still works to evaluate your strong baseline.
The design space here is huge! Here are a few suggestions and tips:
- Try representing each recipe's ingredient text with pretrained word embeddings, e.g. spaCy's word vectors (use a _trf model for best results). You could also find more advanced embedding models such as HuggingFace's Sentence Transformers.

When you're happy with your final model, make sure you've run the evaluation function to show its performance characteristics.
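As one illustration of the embedding suggestion above, here's a sketch using Sentence Transformers; the model name is just a common lightweight default, and joining ingredients with commas is an arbitrary choice:

from sentence_transformers import SentenceTransformer

# Embed each recipe by encoding its ingredient list as a single string.
# 'all-MiniLM-L6-v2' is a small general-purpose model; others may do better.
model = SentenceTransformer('all-MiniLM-L6-v2')
X_emb_train = model.encode([', '.join(r['ingredients']) for r in train_recipes])
X_emb_val = model.encode([', '.join(r['ingredients']) for r in val_recipes])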
I’ve held out a test set of recipes; I have the correct labels, but
I’m not giving them to you. In the last section of your notebook, you
will load test.json and make predictions on the
5,000 recipes therein using both your strong baseline classifier and
your final model classifier. The format of test.json is
identical to the training data, except that the cuisine label is
missing.
For each set of predictions, you’ll save out a CSV file with two
columns, id and cuisine. I recommend using
DataFrame.to_csv with the index=False
parameter. The first few rows of a valid predictions CSV should look
like this (of course your cuisine labels will likely be different):
id,cuisine
5065,irish
42709,mexican
36038,japanese
25841,french
19724,southern_us
(...)
Each CSV should have exactly one row per test set recipe ID; the order of the rows does not matter. Generate two such CSV files:
Make predictions on the test set using your strong baseline
classifier, and save them in strong_baseline.csv.
Make predictions on the test set using your final model, and save
them in final_model.csv.
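A sketch of this step; featurize_like_training is a hypothetical stand-in for whatever feature extraction you applied to the training data, and strong_baseline_clf for your trained strong-baseline classifier:

import json
import pandas as pd

with open('test.json', 'r') as f:
    test_recipes = json.load(f)

# featurize_like_training is hypothetical: apply exactly the same feature
# extraction you used on the training set.
X_test = featurize_like_training(test_recipes)
preds = strong_baseline_clf.predict(X_test)

pd.DataFrame({
    'id': [r['id'] for r in test_recipes],
    'cuisine': preds,
}).to_csv('strong_baseline.csv', index=False)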
For the pre-lab, submit your answers in PDF format to the Lab 7 Pre-Lab assignment on Canvas.
For the lab, submit your notebook (lab7.ipynb) and your CSV files. Make sure that strong_baseline.csv and final_model.csv are submitted, containing estimated labels for the test set produced by your methods.

Your test set predictions will be used to calculate the following:
Strong baseline performance (5 points): 1 point per 4% accuracy above 20%, capped at 5 points (reached at 40% accuracy). That is:
min(5, (pct_correct - 20)/4),
where pct_correct is the percent (0-100) of correct
predictions on the test set by your strong baseline.
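For example, a strong baseline that gets 36% of the test predictions correct earns min(5, (36 - 20)/4) = 4 points.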
Final model performance (5+ points):