For our project, we used a dataset on smoking in the UK from https://www.openintro.org/data/index.php?data=smoking. It includes several columns of demographic information, such as gender, age, income, and a few other categories. We wanted to see whether machine learning models could find a relationship between these categories and the number of cigarettes a person smokes each day. The data includes consumption columns for both weekdays and weekends; for simplicity, we chose to focus on weekdays.
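To make the setup concrete, here is a minimal sketch of how the data could be loaded and prepared with pandas. The file name and the column names ("amt_weekdays", "amt_weekends", "smoke") are assumptions based on the OpenIntro listing, and nonsmokers' missing weekday amounts are treated as 0:

```python
import pandas as pd

# Load the OpenIntro UK smoking data (file name assumed; the CSV is
# available from the dataset page linked above)
df = pd.read_csv("smoking.csv")

# Target: cigarettes smoked per weekday; nonsmokers have no recorded
# amount, so treat missing values as 0
y = df["amt_weekdays"].fillna(0)

# Features: the demographic columns, one-hot encoded. Drop the weekend
# column and (for the unbiased models) the "smoke" indicator
X = pd.get_dummies(
    df.drop(columns=["amt_weekdays", "amt_weekends", "smoke"]),
    drop_first=True,
)
```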
Before training any models, we established a few baselines to judge the efficacy of our results. We started with two: a zeroes baseline, which predicts 0 for everyone, and a mean baseline, which predicts the target column's mean value for everyone. As shown below, most people in the dataset are nonsmokers, so predicting 0 is a reasonable baseline. The resulting mean squared errors (MSE) of these baselines were 7.8 and 7.0, respectively.
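A rough sketch of how these two baselines can be scored, assuming a simple train/test split (the exact split, and therefore the exact numbers, will differ from ours):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Zeroes baseline: predict 0 cigarettes for everyone
mse_zeroes = mean_squared_error(y_test, np.zeros(len(y_test)))

# Mean baseline: predict the training set's mean target value for everyone
mse_mean = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))

print(f"zeroes baseline MSE: {mse_zeroes:.2f}")
print(f"mean baseline MSE:   {mse_mean:.2f}")
```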
We started with a linear regression model, but found it barely improved on our baselines, at 6.93 MSE. A grid-searched SVR model did even worse, at 7.1 MSE.
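A sketch of both models with scikit-learn, continuing from the code above; the SVR parameter grid shown here is illustrative, not the exact grid we searched:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Plain linear regression on the demographic features
linreg = LinearRegression().fit(X_train, y_train)
print("linear regression MSE:",
      mean_squared_error(y_test, linreg.predict(X_test)))

# SVR with a small grid search over its main hyperparameters
svr_grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVR()),
    param_grid={"svr__C": [0.1, 1, 10], "svr__epsilon": [0.1, 0.5, 1.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
svr_grid.fit(X_train, y_train)
print("SVR (grid search) MSE:",
      mean_squared_error(y_test, svr_grid.predict(X_test)))
```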
Trying another approach, we added back a column we had initially dropped from the data: "smoke". This column is binary, indicating whether each person is a smoker or a nonsmoker. With this column restored, we computed one last baseline against which to test our new "biased" models: it predicts 0 for each nonsmoker and the smokers' mean smoking amount for each smoker. This baseline had an MSE of 4.27.
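A sketch of this biased baseline, continuing from the code above and assuming the "smoke" column takes the values "Yes" and "No":

```python
# "Biased" baseline: predict 0 for nonsmokers and the smokers' mean
# weekday amount for smokers
is_smoker = df["smoke"].eq("Yes")

smoker_mean = y_train[is_smoker.loc[y_train.index]].mean()
biased_pred = np.where(is_smoker.loc[y_test.index], smoker_mean, 0.0)

print("biased baseline MSE:", mean_squared_error(y_test, biased_pred))
```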
Unfortunately, we found ourselves in the same situation as before, with our new regression and SVR models both performing at roughly 4.29 MSE.
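The biased models follow the same pattern as before, just with the "smoke" indicator kept among the features; a minimal sketch for the regression case:

```python
# Same setup as above, but keep the "smoke" indicator as a feature
X_biased = pd.get_dummies(
    df.drop(columns=["amt_weekdays", "amt_weekends"]), drop_first=True
)
Xb_train, Xb_test = X_biased.loc[y_train.index], X_biased.loc[y_test.index]

linreg_biased = LinearRegression().fit(Xb_train, y_train)
print("biased linear regression MSE:",
      mean_squared_error(y_test, linreg_biased.predict(Xb_test)))
```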
It seems that our attempt at using machine learning to predict smoking rates, at least with this particular dataset, was fruitless. Within both the biased and the unbiased sets, the models' MSEs are nearly indistinguishable from each other and from their baselines.
Despite this, we still wanted to explore the impact of each feature in our set. Using the feature coefficients from our regression models, we plotted these weights.
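A minimal sketch of how such a plot can be produced, using the coefficients of the unbiased linear regression from the code above:

```python
import matplotlib.pyplot as plt

# Plot the fitted coefficients of the unbiased linear regression as
# feature weights, sorted for readability
weights = pd.Series(linreg.coef_, index=X_train.columns).sort_values()
weights.plot.barh(figsize=(8, 10))
plt.xlabel("coefficient weight")
plt.title("Linear regression feature weights")
plt.tight_layout()
plt.show()
```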
This graph indicates that, apart from gender, none of our feature columns is very useful in determining our target. However, even gender is not as impactful as it appears here, as shown by the coefficient weights of our biased regression model:
This shows that the amount any of these people smoke on a given day is predicted almost entirely by whether they are a smoker or a nonsmoker. That is obviously not very insightful; it mainly tells us that features like age, income, and marital status are functionally irrelevant.
We found that our data was not well suited to these machine learning techniques. Our prediction models were unable to do meaningfully better than our baselines, and the feature weights told us very little about what actually drove the predictions. However, this does seem to indicate that regardless of a person's position in life, they are nearly as capable of being a heavy smoker as anyone else.