Fall 2025
In this lab, we will use simple machine learning techniques to solve a prediction problem: predicting the cuisine of a recipe given a list of its ingredients.
This lab will be done individually. You can brainstorm with any classmates about ideas for feature representations, preprocessing, or experimental process. However, your solution must be uniquely your own. If you had fruitful discussions that led to ideas you used, you must acknowledge the ideas or suggestions you received from other classmates in an acknowledgments section at the end of the notebook you submit.
The data for this lab (recipes.json) comes in JSON format containing a list of recipes, each represented as a JSON object like the following:
{
  "id": 24717,
  "cuisine": "indian",
  "ingredients": [
    "tumeric",
    "vegetable stock",
    "tomatoes",
    "garam masala",
    "naan",
    "red lentils",
    "red chili peppers",
    "onions",
    "spinach",
    "sweet potatoes"
  ]
},

Your task is to use the ingredients list for a recipe to predict which one of 20 cuisine labels it has.
You can read this into a list of Python dicts as follows:

import json

with open('recipes.json', 'r') as f:
    recipes = json.load(f)

Read the following:
- the scikit-learn documentation for train_test_split

In a text editor of your choice that is capable of exporting to PDF, answer these questions:

- What is a Pipeline?
- What does the weights hyperparameter do?
- What does the stratify argument to the train_test_split function do?

Optional, if you want to get an early start on the lab: load the data, create a train/val split, figure out some basic way of representing each recipe as a feature vector, and train a KNN classifier on it.
Create a new notebook for Lab 7. Title it and include your name at the top. Your notebook will eventually have the 5 sections described below.
Load the data as described above, then immediately split the data
into a training set and a validation
set. The specifics are up to you, but I recommend using a
constant random seed (pass a constant integer to the
random_state argument to train_test_split) so
that your split is reproducible and your training set won’t change.
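For example, a split along these lines, assuming the recipes list loaded earlier (the 80/20 ratio, the seed value, and the use of stratify here are illustrative choices, not requirements):

from sklearn.model_selection import train_test_split

# A fixed random_state keeps the split reproducible across notebook runs.
# stratify keeps the cuisine proportions similar in both splits
# (see the pre-lab question about this argument).
cuisines = [r['cuisine'] for r in recipes]
train_recipes, val_recipes = train_test_split(
    recipes, test_size=0.2, random_state=42, stratify=cuisines
)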
Next, create a new section of your notebook titled Baselines. In this section, write code to calculate or estimate the classification accuracy on your validation set of at least two simple baselines. At a minimum, you should estimate the accuracy of random guessing and calculate the accuracy of predicting the most common label.
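A minimal sketch of these two baselines, assuming the train_recipes/val_recipes split from the previous section:

from collections import Counter

y_val = [r['cuisine'] for r in val_recipes]

# Random guessing among the cuisines seen in training: expected accuracy
# is 1/(number of labels), i.e. 1/20 = 5% for this dataset.
n_labels = len({r['cuisine'] for r in train_recipes})
random_guess_acc = 1 / n_labels

# Most-common-label baseline: always predict the most frequent cuisine in
# the training set, and measure how often it matches the validation labels.
most_common = Counter(r['cuisine'] for r in train_recipes).most_common(1)[0][0]
most_common_acc = sum(label == most_common for label in y_val) / len(y_val)

print(f"random guessing: {random_guess_acc:.3f}")
print(f"most common label ({most_common}): {most_common_acc:.3f}")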
In this section, you’ll build the data matrix X and ground-truth labels y that are required for supervised learning. As a reminder, X is an n × d array where each of n rows is a d-dimensional feature vector for one datapoint, while y is a 1D length-n vector of ground-truth labels. Keep in mind that your X should not include the cuisine label - that’s the prediction target, so we can’t include it in our features!
Since the data we have about each recipe does not lend itself to trivial conversion to a feature vector, this part is likely to be the most interesting part of the lab, and your choices here will have the greatest effect on your classification accuracy. You have latitude to do this pretty much however you want! That said, I want you to start simple at first: Rather than dream up the fanciest way to do this, start by setting a stronger baseline than the simple ones you calculated above.
Construct your strong-baseline dataset and store it in a numpy array
called X_baseline_train, and apply the same processing to
produce X_baseline_val for your validation set.
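One simple illustrative encoding is a multi-hot "bag of ingredients" vector over the training vocabulary; this is just one reasonable starting point, not the required approach:

import numpy as np

# Vocabulary of ingredients seen in the training set; one feature column each.
vocab = sorted({ing for r in train_recipes for ing in r['ingredients']})
col_of = {ing: i for i, ing in enumerate(vocab)}

def to_multihot(recipe_list):
    # Each row gets a 1 in the column of every ingredient it contains;
    # ingredients never seen in training are simply ignored.
    X = np.zeros((len(recipe_list), len(vocab)), dtype=np.float32)
    for row, r in enumerate(recipe_list):
        for ing in r['ingredients']:
            if ing in col_of:
                X[row, col_of[ing]] = 1.0
    return X

X_baseline_train = to_multihot(train_recipes)
X_baseline_val = to_multihot(val_recipes)
y_train = [r['cuisine'] for r in train_recipes]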
While developing your feature extraction approach, you may want to work with only a subset of the data if your processing is at all time-consuming. When you run this section of the notebook for the final time before submission, make sure you’re using the entire training set.
Train a K-nearest-neighbors classifier on your dataset using
sklearn. For now, you don’t need to tune hyperparameters -
just use Euclidean distance and a value of K that seems sensible to you.
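For instance (K=5 here is only a placeholder; sklearn's default metric is already Euclidean):

from sklearn.neighbors import KNeighborsClassifier

# KNeighborsClassifier uses Euclidean (Minkowski p=2) distance by default.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_baseline_train, y_train)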
Use the trained baseline classifier to make predictions and evaluate
its performance. Write this code in a function so you
can use it again later! Your evaluation function might, for example,
take the following arguments:
X_train, X_val, y_train, y_val, classifier.
At a minimum, your evaluation section should output the overall validation accuracy and a confusion matrix; I recommend using sklearn's confusion_matrix to calculate it and Seaborn's heatmap to visualize it.

To get credit for this section, your strong baseline must have better validation accuracy than both the random guessing and most-common-label baselines you developed in Part 2.
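One possible shape for such an evaluation function, using the argument list suggested above (fitting inside the function is a design choice; you could instead pass in an already-trained classifier):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(X_train, X_val, y_train, y_val, classifier):
    # Fit, predict, and report validation accuracy.
    classifier.fit(X_train, y_train)
    preds = classifier.predict(X_val)
    acc = accuracy_score(y_val, preds)
    print(f"validation accuracy: {acc:.3f}")

    # Confusion matrix visualized as a heatmap.
    labels = sorted(set(y_val) | set(preds))
    cm = confusion_matrix(y_val, preds, labels=labels)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, xticklabels=labels, yticklabels=labels)
    plt.xlabel('predicted cuisine')
    plt.ylabel('true cuisine')
    plt.show()
    return acc

evaluate(X_baseline_train, X_baseline_val, y_train, y_val, knn)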
Now the fun part: see how high you can get your validation accuracy! Find a better feature extraction strategy, choose a better classifier model, and/or tune hyperparameters to get your classification accuracy as high as you can get it.
When you’re finished, this section should include fully reproducible code that begins with the dataset loaded in Part 1 and ends with a call to the evaluation function you wrote in Part 5, showing your best model’s performance. Don’t modify your code from Parts 2-4, and keep any modifications to Part 5 general enough that it still works to evaluate your strong baseline.
The design space here is huge! Here are a few suggestions and tips:
- Try representing each recipe's ingredient text with pretrained word embeddings, e.g. spaCy's word vectors (use a _trf model for best results). You could also find more advanced embedding models such as HuggingFace's Sentence Transformers.

When you're happy with your final model, make sure you've run the evaluation function to show its performance characteristics.
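As one illustration of the embedding suggestion above, here's a sketch using Sentence Transformers; the model name is just a common lightweight default, and joining ingredients with commas is an arbitrary choice:

from sentence_transformers import SentenceTransformer

# Embed each recipe by encoding its ingredient list as a single string.
# 'all-MiniLM-L6-v2' is a small general-purpose model; others may do better.
model = SentenceTransformer('all-MiniLM-L6-v2')
X_emb_train = model.encode([', '.join(r['ingredients']) for r in train_recipes])
X_emb_val = model.encode([', '.join(r['ingredients']) for r in val_recipes])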
I’ve held out a test set of recipes; I have the correct labels, but
I’m not giving them to you. In the last section of your notebook, you
will load test.json and make predictions on the
5,000 recipes therein using both your strong baseline classifier and
your final model classifier. The format of test.json is
identical to the training data, except that the cuisine label is
missing.
For each set of predictions, you’ll save out a CSV file with two
columns, id and cuisine. I recommend using
DataFrame.to_csv with the index=False
parameter. The first few rows of a valid predictions CSV should look
like this (of course your cuisine labels will likely be different):
id,cuisine
5065,irish
42709,mexican
36038,japanese
25841,french
19724,southern_us
(...)
Each CSV should have exactly one row per test set recipe ID; the order of the rows does not matter. Generate two such CSV files:
Make predictions on the test set using your strong baseline
classifier, and save them in strong_baseline.csv.
Make predictions on the test set using your final model, and save
them in final_model.csv.
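A sketch of this step; featurize_like_training is a hypothetical stand-in for whatever feature extraction you applied to the training data, and strong_baseline_clf for your trained strong-baseline classifier:

import json
import pandas as pd

with open('test.json', 'r') as f:
    test_recipes = json.load(f)

# featurize_like_training is hypothetical: apply exactly the same feature
# extraction you used on the training set.
X_test = featurize_like_training(test_recipes)
preds = strong_baseline_clf.predict(X_test)

pd.DataFrame({
    'id': [r['id'] for r in test_recipes],
    'cuisine': preds,
}).to_csv('strong_baseline.csv', index=False)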
For the pre-lab, submit your answers in PDF format to the Lab 7 Pre-Lab assignment on Canvas.
For the lab, submit your notebook (lab7.ipynb) and your CSV files. Make sure that strong_baseline.csv and final_model.csv are submitted, containing estimated labels for the test set produced by your methods.

Your test set predictions will be used to calculate the following:
Strong baseline performance (5 points): 1 point per 4% accuracy above 20%, capped at 5 points (reached at 40% accuracy). That is:
min(5, (pct_correct - 20)/4),
where pct_correct is the percent (0-100) of correct
predictions on the test set by your strong baseline.
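For example, a strong baseline that gets 36% of the test predictions correct earns min(5, (36 - 20)/4) = 4 points.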
Final model performance (5+ points):