We set out to answer a simple predictive question:
Using a car model's features, can we predict its MSRP for 2021?
Our goal was to create a model that could predict the manufacturer's suggested retail price (MSRP) for new vehicles sold within a calendar year. The function of such a model would be to interpolate prices for theoretical cars, not to predict future prices or market trends. The model could also be used to evaluate how reasonably priced a vehicle was during this period, based on its known features and its similarity to other vehicles on the market.
We collected data from cars.com, which has detailed lists of specifications for vehicles of different makes, models, and years. While exploring the available data, we noticed that many vehicles had multiple trim options listed on their specs page, often with significant differences in specifications. For instance, the 2021 RAM 1500 had 38 different available trims, with MSRPs ranging from \$32,795 to \$70,325. Clearly there were some major differences driving the prices of these variations, and this was exactly the sort of thing we wanted our model to predict. Therefore, when collecting our data, we classified vehicles not just by make and model, but also by trim.
The specifications pages themselves included a myriad of data for each vehicle model and trim. All possible color options for the exterior and interior of the vehicle are listed, along with various optional packages that add extra features. A lengthy Specifications table lists statistics for categories such as Engine, Steering, Dimensions, and Entertainment.
Using the data from these tables, we planned to generate a number of feature vectors to represent each car and train a predictive model for MSRP.
A lot of scraping was required to collect the data we had found. The code we created to collate the specifications data into a CSV had to iterate through three layers of web scraping:
#### Scraping names of each 2021 vehicle model
#### Scraping trims of each 2021 vehicle model
#### Scraping specifications for each trim
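The scrape itself can be sketched as three nested loops. The URL patterns and CSS selectors below are hypothetical stand-ins (cars.com's real page structure was more involved), but the layering matches the process we followed:

```python
# Sketch of the three-layer scrape; URLs and selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

BASE = "https://www.cars.com/research"  # hypothetical entry point

def get_soup(url):
    """Fetch a page and parse it into a BeautifulSoup tree."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

rows = []
# Layer 1: names of each 2021 vehicle model
for model_link in get_soup(f"{BASE}/2021").select("a.model-link"):
    # Layer 2: trims of each model
    for trim_link in get_soup(model_link["href"]).select("a.trim-link"):
        # Layer 3: the Specifications table for each trim
        spec_soup = get_soup(trim_link["href"])
        specs = {r.select_one("th").get_text(strip=True): r.select_one("td").get_text(strip=True)
                 for r in spec_soup.select("table.specs tr")}
        rows.append(specs)
```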
Altogether we generated a .csv file with 2,432 rows and 172 columns*, requiring about 90 minutes of runtime.
*We went back a couple of times to revise our scraping code and regenerate our data table. We found that the number of cars and trims varied slightly between these runs, presumably because cars.com added more pages to their website. The additions were not significant enough to affect our overall data collection, but it does mean that future runs of this code will produce slightly different data than that used in our project.
The first step in the (very, very long) cleaning process was to sort the data into numerical and categorical columns. The CSV file we generated had collected data directly from the specs tables, and included units, "N/A" entries, and other anomalous text features. For columns with missing values, we replaced "N/A" entries with "-1" to mark them for later filling and correcting while still allowing conversion to integer or float data types. Since none of the specifications in our data involved negative values, the only instances of -1 came from data marked as blank or N/A.
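As a rough illustration of this step (the file and column names are stand-ins, and some columns needed bespoke handling), the numeric cleanup looked something like:

```python
import pandas as pd

df = pd.read_csv("specs_2021.csv")  # hypothetical file name

def clean_numeric(col):
    """Strip units and other text, then mark blank/"N/A" entries as -1."""
    digits = col.astype(str).str.replace(r"[^\d.]", "", regex=True)  # drops ' lbs', 'N/A', etc.
    return pd.to_numeric(digits.replace("", "-1"), errors="coerce").fillna(-1)

df["Curb weight (lbs)"] = clean_numeric(df["Curb weight (lbs)"])  # column name illustrative
```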
There were also many columns containing more complex details that needed to be 'decoded' before they could be used as feature vectors. Tire sizes, for instance, follow a specific format: a code like P275/55R20 encodes a 275 mm section width, a 55% aspect ratio (sidewall height as a percentage of width), and a 20 in rim diameter.
Because of the massive variation in specific tires that are issued with vehicles, the tire size columns were replaced with columns for width, aspect ratio, and diameter, which would better quantify the specifics of different tire categories.
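A sketch of that decoding, assuming sizes in the common `P275/55R20` style (the helper and column names are illustrative):

```python
import re

# width (mm) / aspect ratio, optional speed rating, R, rim diameter (in)
TIRE_RE = re.compile(r"(\d{3})/(\d{2,3})\s*[A-Z]*R(\d{2})")

def parse_tire_size(size):
    """Return (width_mm, aspect_ratio, diameter_in), or -1s if unparseable."""
    match = TIRE_RE.search(str(size))
    if not match:
        return (-1, -1, -1)
    return tuple(int(g) for g in match.groups())

df[["Tire width", "Tire aspect ratio", "Tire diameter"]] = (
    df["Front tire size"].apply(parse_tire_size).tolist()  # column name illustrative
)
```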
Altogether, this first round of cleaning involved (1) formatting existing data, (2) deriving new columns from existing data, and (3) dropping columns replaced by the previous processes. (These steps are included in the 'Main Column Cleaning' section of this cleaning notebook.)
When I bought my first car, I did not base my search on the cold cranking amps, the fifth gear ratio, or the gross vehicle weight. While some customers might look for specific performance-based features to match their needs (such as those looking to tow or off-road with a vehicle), the vast majority of people are more focused on the little details: the gadgets on the radio, the driving-assistance options, and the adjustability of the seats. These features are not nearly as standardized as brake types, dimensions, or even tire sizes, so instead of giving each its own column we generated five list-valued columns for each trim option: Entertainment Features, Exterior Features, Interior Features, Mechanical Features, and Safety Features.
The individual entries in these columns were long lists of frustratingly specifically named features:
['Cloth Bench Seat^^', 'Analog Appearance^^', 'Cruise Control w/Steering Wheel Controls^^', 'Passenger Visor Vanity Mirror w/Passenger Illumination^^', 'Systems Monitor^^', 'Metal-Look Gear Shifter Material^^', 'Proximity Key For Push Button Start Only^^', 'Day-Night Rearview Mirror^^', 'HVAC -inc: Underseat Ducts and Console Ducts^^', 'Front Center Armrest w/Storage and Rear Center Armrest^^', 'Full Cloth Headliner^^', 'Full Carpet Floor Covering -inc: Carpet Front And Rear Floor Mats^^', 'Rear Cupholder^^', 'Vinyl Door Trim Insert^^', 'Power Door Locks w/Autolock Feature^^', 'Fade-To-Off Interior Lighting^^', 'Rear 60/40 Folding Seat^^', '3 Rear Seat Head Restraints^^', 'Valet Function^^', 'Power 1st Row Windows w/Driver And Passenger 1-Touch Up/Down^^', 'Integrated Voice Command w/Bluetooth^^', 'Outside Temp Gauge^^', 'Manual Tilt/Telescoping Steering Column^^', 'Interior Trim -inc: Deluxe Sound Insulation, Metal-Look Instrument Panel Insert, Metal-Look Door Panel Insert and Chrome/Metal-Look Interior Accents^^', 'Seats w/Cloth Back Material^^', 'Sentry Key Engine Immobilizer^^', 'Remote Keyless Entry w/Integrated Key Transmitter, Illuminated Entry and Panic Button^^', 'Pickup Cargo Box Lights^^', 'Front Seat Back Map Pockets^^', 'Power Rear Windows^^', 'GPS Antenna Input^^', '40/20/40 Split Bench Seat^^', 'Leather Steering Wheel^^', '4 Way Front Headrests^^', 'Manual Adjust 4-Way Front Passenger Seat^^', 'Trip Computer^^', 'Compass^^', 'Instrument Panel Bin, Dashboard Storage, Driver And Passenger Door Bins and 1st Row Underseat Storage^^', 'Passenger Seat^^', 'Redundant Digital Speedometer^^', '4-Way Driver Seat -inc: Manual Recline and Fore/Aft Movement^^', 'Locking Glove Box^^', 'Manual Air Conditioning^^', 'Front Facing Cloth Rear Seat^^', 'Illuminated Front Cupholder^^', '2 12V DC Power Outlets^^', 'Front Map Lights^^', 'Mini Overhead Console and 2 12V DC Power Outlets^^', 'Gauges -inc: Speedometer, Odometer, Voltmeter, Oil Pressure, Engine Coolant Temp, Tachometer, Oil Temperature, Transmission Fluid Temp, Engine Hour Meter, Trip Odometer and Trip Computer^^']
(This specific entry was taken from the 'Interior Features' of the 2021 RAM 1500 Big Horn 4x2 Quad Cab 6'4" and is just one of many, many such lists. The double carets were added in the scraping process to be used as a delimiter later in the analysis, and are not original to the data.)
Clearly, there is a lot of information here that is relevant to buying a car and likely affects its starting price. "Integrated Voice Command w/Bluetooth" and "Push Button Start" are features that many people might look for and be willing to pay extra for in today's vehicle market. However, the naming of these features is not at all standardized, and among all the manufacturers and vehicle types there are likely many different naming conventions for features that serve the exact same purpose. In order to fully realize our goal for this model, we needed a method for reducing these feature names to their most basic functions so that we could group and compare them across many different makes and models.
This is where `nltk` and `TextBlob` came in. Using as little natural language processing as possible, we were able to separate the noun phrases from within the lists of features. These noun phrases were far more general and appeared much more frequently than specific feature names, and while some of them were a bit strange or vague, they were all extracted from the original data and thus represented the presence of some item in the original feature list.
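A minimal sketch of the extraction with `TextBlob` (it needs the nltk corpora installed, e.g. via `python -m textblob.download_corpora`, and the exact phrases returned can vary by version):

```python
from textblob import TextBlob

def extract_noun_phrases(raw_features):
    """Split the '^^'-delimited feature string and collect its noun phrases."""
    phrases = []
    for feature in raw_features.split("^^"):
        phrases.extend(TextBlob(feature.lower()).noun_phrases)
    return phrases

raw = "Integrated Voice Command w/Bluetooth^^Proximity Key For Push Button Start Only^^"
print(extract_noun_phrases(raw))
```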
Our list now looked something like this:
['cloth bench seat', 'analog appearance', 'cruise', 'controls', 'passenger visor vanity mirror', 'illumination', 'systems monitor', 'metal-look gear shifter material', 'proximity key', 'push button start', 'day-night rearview mirror', 'hvac', 'underseat ducts', 'console ducts', 'front', 'armrest', 'rear', 'armrest', 'cloth headliner', 'carpet floor covering', 'carpet front', 'rear floor mats', 'rear cupholder', 'vinyl door trim insert', 'power door locks', 'feature', 'fade-to-off interior lighting', 'rear', 'folding seat', 'rear seat head restraints', 'valet function', 'power', '1st row', 'windows', 'passenger', 'up/down', 'integrated voice', 'command w/bluetooth', 'temp gauge', 'manual tilt/telescoping steering column', 'interior trim', 'deluxe sound insulation', 'metal-look instrument panel insert', 'metal-look door panel insert', 'chrome/metal-look interior accents', 'material', 'key engine immobilizer', 'remote keyless entry', 'key transmitter', 'illuminated entry', 'panic button', 'pickup cargo', 'lights', 'front seat', 'pockets', 'power rear windows', 'gps antenna input', 'bench seat', 'leather steering', 'front headrests', 'manual adjust', 'front passenger seat', 'trip computer', 'compass', 'instrument panel bin', 'dashboard storage', 'passenger door bins', '1st row', 'underseat storage', 'passenger seat', 'redundant digital speedometer', '4-way driver', 'seat', 'manual recline', 'fore/aft movement', 'locking glove', 'manual', 'conditioning', 'front facing cloth rear seat', 'illuminated front cupholder', 'dc power outlets', 'front', 'lights', 'mini overhead console', 'dc power outlets', 'gauges', 'speedometer', 'odometer', 'voltmeter', 'pressure', 'engine coolant temp', 'tachometer', 'temperature', 'transmission fluid temp', 'engine hour meter', 'trip odometer', 'trip computer']
Of course, even with the noun phrases isolated, there were still many features that had similar functions but resulted in slightly different list entries. In some cases, differences in punctuation or plurals distinguished almost identical phrases, like `'driver seat w/8-way power adjustment'` and `'drivers seat 8-way power adjustment'`. In other cases, similar features might have slightly different descriptions or specifications, like `'dc power outlets'` and `'ac power outlets'`. In order to include as much detail in these features as possible while maintaining a (semi-)reasonable number of categories in our data, we needed to use string matching to combine near-similar noun phrases.
It was time to get fuzzy. Since the string matching sought to combine similar features from across different vehicles, it needed to be performed on a larger list containing all possible noun phrases for a category. After combining all the features for each of our five categories (Entertainment, Exterior, Interior, Mechanical, and Safety) into five mega-lists, we needed a way to iterate through each list and compare strings. Even this was a multi-step process. Iterating through the list from start to finish and replacing similar strings would successfully reduce the number of unique elements, but there was a high probability that sub-optimal replacements would occur.
Say `strA` matches with `strB`, and `strB` matches with `strC`, but `strA` and `strC` are not quite similar enough to match. If `strA` were used first for comparison, it would replace `strB` but leave `strC` as a distinct category. However, if `strB` were used first, it would replace both of the other strings, reducing them to just a single category.
In order to prioritize the noun phrases with the highest string-matching compatibility, we iterated through each mega-list and compared each noun phrase to every other unique noun phrase using `fuzz.ratio`. The average ratio (the sum of the ratios divided by the number of elements compared against) was stored alongside each unique element in a dictionary. This dictionary was then used to create a list of noun phrases in descending order of compatibility ratio, so that the most compatible phrases sat at the front of the list.
We then iterated through the ordered list, comparing each unique element using `fuzz.ratio`, beginning with the most compatible phrases and working down to the least compatible. Whenever a ratio was greater than 90, the element and its compatible replacement were added to a dictionary. The result was a dictionary capable of 'translating' less compatible phrases into more compatible, similar phrases.
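Putting the two passes together, the dictionary construction looked roughly like this sketch, using `fuzz.ratio` from fuzzywuzzy, where `phrases` is one of the five mega-lists:

```python
from fuzzywuzzy import fuzz

def build_translation_dict(phrases, threshold=90):
    """Map less compatible noun phrases to more compatible near-matches."""
    unique = sorted(set(phrases))
    # Pass 1: average similarity of each phrase to every other unique phrase.
    avg_ratio = {}
    for a in unique:
        ratios = [fuzz.ratio(a, b) for b in unique if b != a]
        avg_ratio[a] = sum(ratios) / len(ratios)
    # Most compatible phrases first, so they get used as the replacements.
    ordered = sorted(unique, key=avg_ratio.get, reverse=True)
    # Pass 2: walk down the ordered list; any later phrase scoring above the
    # threshold against an earlier (more compatible) phrase maps to it.
    translation = {}
    for i, keeper in enumerate(ordered):
        if keeper in translation:
            continue  # already absorbed by a more compatible phrase
        for candidate in ordered[i + 1:]:
            if candidate not in translation and fuzz.ratio(keeper, candidate) > threshold:
                translation[candidate] = keeper
    return translation
```

The two quadratic passes over thousands of unique phrases are what made each dictionary so slow to generate.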
Finally, the dictionaries generated for each of the five categories could be used to transform the features listed for each vehicle into lists of more common noun phrases that were likely to match well across vehicle makes and models. (These dictionaries took quite a bit of time to generate, up to an hour each, so we saved them as .pickle files; the files can be found here.) These columns were one-hot-encoded soon after, so that each vehicle was marked as either having or not having each feature.
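The encoding step can be sketched with scikit-learn's `MultiLabelBinarizer`, assuming the translated lists live in a column like `Interior Features`:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
interior = pd.DataFrame(
    mlb.fit_transform(df["Interior Features"]),  # lists of translated phrases
    columns=mlb.classes_,
    index=df.index,
)
df = df.drop(columns=["Interior Features"]).join(interior)
```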
There are some important things that should be noted:
Once all the cars had been updated with the blob- and fuzz-ified feature lists, we checked just how common each feature was (i.e., how many vehicles had it listed). Only features shared by at least 200 vehicles (~10% of our cars) were kept. Even after all of our processing and renaming there were over 3,000 distinct interior features, and of these we kept just 194.
Fuzzy string matching uses similarity ratios between the characters of two strings. Even with a 90-percent similarity threshold, many different features were mapped to each other in our final dictionaries. This was especially common when features involved numbers: `'8-way power passenger': '4-way power passenger'`, for instance. Ultimately, with the conversion between similar strings, some information was lost and other information was replaced with inaccurate counterparts. However, this does not invalidate our model, as even these inaccuracies reflect elements that were present in the original data. In the end, the one-hot-encoded column for each feature does not necessarily represent whether a car has that exact feature, but whether it has some feature that is similar or related to the listed column. A vehicle may have an 8-way, 6-way, or even 4-way power passenger seat, but the fact that a noun phrase extracted from its features met the requirements to string-match with the others is enough to indicate a general similarity between marketed features.
It was around here that we started doing some preliminary exploratory analysis and basic linear regression to see how well our hard-earned data worked in the real world. It was also here that we remembered all of the -1s we had added previously to mark spots for dropping, filling, or otherwise adjusting data. These placeholders were now producing glaring artifacts in our different data categories. So it was time to go back to the cleaners.
We identified 63 numerical columns that had been augmented with negative values to replace missing data. These columns spanned a diverse group of categories, and we were careful in considering how to address each one. First, we dropped all the columns that had fewer than 1,000 real values, which brought the total down to 21 columns. For the rest, we used one of two strategies to fill in missing data:
We noticed a high correlation between torque and horsepower, and found that though the two measurements have different units, they are roughly equal in numerical value. We therefore planned to replace one with the other wherever there were missing values in our data set. However, we soon discovered that the missing entries came from Tesla models, which had neither torque nor horsepower listed and thus could not be filled this way. Since we didn't want to drop these cars from our data, we researched the appropriate values and inserted them manually. Ultimately, we did end up dropping the Model Y cars, since they were missing significant specs and hadn't been fully released yet.
For all other categories, we used group averages to fill in missing values. To do this, we isolated each target column one by one and grouped it by vehicle type. After averaging the positive values within each group, we replaced the missing values with that average.
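In pandas terms, the fill looked roughly like this (the grouping and target column names are illustrative):

```python
import numpy as np

def fill_with_group_mean(df, column, group="Vehicle type"):
    """Replace the -1 markers with the mean of the real values in each group."""
    real = df[column].where(df[column] >= 0, np.nan)  # mask the -1 placeholders
    df[column] = real.fillna(real.groupby(df[group]).transform("mean"))
    return df

df = fill_with_group_mean(df, "Front head room (in)")
```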
Lastly, before exporting a final cleaned CSV file, we checked through all the columns and removed those that had poorly distributed values or sparse entries. Some of the columns dropped were `Year` (which was 2021 for every vehicle), `Option weight - front` (all entries were 0.0), `Step-up height - side` (which had only 12 real entries), and `Ninth gear ratio (:1)` (which had a near-zero standard deviation).
With all missing values filled, our data was finally (finally) complete and ready to be used for exploration and prediction. Every column was now a cleaned and filled numerical type, a one-hot-encoded vehicle feature, or a categorical type with cleaned and clearly listed entries. These categorical types (such as `Drivetrain`, which specified whether a vehicle was all-wheel drive, four-wheel drive, etc.) kept their original entries to make the cleaned data more comprehensible. While we planned to encode these into numerical columns, leaving the categories alone at this stage gave us more flexibility in our later analysis to experiment with different encoding methods and to sort by the original categorical types.
For a list of the columns included in the final cleaned CSV, along with the transformations applied to them, refer to our docs.
We began our exploratory analysis of the data by looking at our prediction target: MSRP.
We can see that the distribution of MSRPs in our data set is heavily skewed to the right, with a mean of about \$50,000. The median is significantly lower than the mean, at about \$42,500, further confirming the skew. Our histogram also shows this: the majority of our data (>50%) falls to the left of the mean.
Next, we checked each of our columns to see how well they correlated with MSRP.
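This check amounts to ranking the numeric columns by the magnitude of their Pearson correlation with MSRP, roughly:

```python
# Absolute Pearson correlation of every numeric column with MSRP.
correlations = (
    df.select_dtypes("number").corr()["MSRP"]
      .drop("MSRP")
      .sort_values(key=abs, ascending=False)
)
print(correlations.head(10))
```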
The plot can be viewed and expanded in a separate tab, and can also be viewed in the runnable analysis notebook.
(0.0 = Gas, 1.0 = Hybrid, 2.0 = Electric)
Interestingly, when looking at the horsepower graph, we can see some outliers at the far right of the chart (colored dark purple). Upon closer inspection, those values are Tesla vehicles, which are skewing our graph. We manually looked these Tesla cars up to verify the values, and they turned out to be correct. The other dark purple dots at the 600-horsepower mark are also Tesla vehicles. It's interesting to see that Tesla is practically in the lead for horsepower, and considering that Teslas are not the most expensive cars on the list, that's rather impressive.
(0.0 = Gas, 1.0 = Hybrid, 2.0 = Electric)
Looking at the torque graph below, Tesla is once again skewing the graph along the right side of the chart. Around the 1,000 lb-ft area of the x-axis, we can also see some gas-powered vehicles thrown into the mix (the salmon/tan-colored values). After some inspecting, we found that these are Ford diesel pickup models. We did additional research and confirmed that these torque values are correct for all the Ford pickup types. This was surprising, as these trucks had vastly different torque values compared to other gas-powered cars.
Below are two graphs. On the left are all the cars by the number of INTERIOR colors they have listed; on the right, by the number of EXTERIOR colors. Interestingly, there are many more exterior colors available. Only 12% of our cars have 5 or fewer EXTERIOR colors to choose from, while 88% have more than 5. Meanwhile, 90% of our cars have 5 or fewer INTERIOR colors, and only 10% have more than 5. It seems that consumers consider exterior color more important than interior color, which may be why more options are available.
Personally, I find this interesting. When buying a car, consumers do not expect much variety when it comes to head room. As you can see from the graph, most cars have around 35" to 40" of head room. The values that lie outside this range are cargo/passenger vans.
Finding a good baseline is important for judging the worth of a model, and we tried a couple of approaches to find one. First we took the mean of the 2021 car MSRPs and predicted that value for everything, which produced some terrible evaluation results.
We had better luck with columns that correlated highly with MSRP. Two we found with over 70% correlation were `HorsePower` and `Rear Wheel Width (in)`. These gave decent prediction results, with an R² of 58% for MSRP and a mean squared error of roughly 18,000. An error of 18,000 wasn't too bad considering the range of MSRP values was 321,100.
Satisfied that this was the best baseline we were going to get, we made horsepower our baseline: if our model couldn't do better than this, it wasn't worth anything.
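A sketch of the two baselines (the split parameters are illustrative):

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[["HorsePower"]]
y = df["MSRP"]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

mean_baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
hp_baseline = LinearRegression().fit(X_train, y_train)

print("mean-MSRP baseline R^2:", mean_baseline.score(X_val, y_val))  # ~0 by construction
print("horsepower baseline R^2:", hp_baseline.score(X_val, y_val))
```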
To start off our model testing, we made train/validation splits of our slimmed and full data sets and fit linear regression models to each. These served as the baselines for our model testing: any further change to the model or the data sets needed to be justified by an increase in accuracy over these baselines.
Every model we tried was consistently overfit, so we experimented with PCA to see whether reducing the number of features might help. We found that PCA did bring the training and validation results closer together, but the scores were also markedly worse than those of the non-PCA data set on both the training and validation splits. We also noticed that the PCA data set earned validation scores similar to the slim version of our data set, with a generally better training score than slim.
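Our PCA experiment looked roughly like the following; the 95%-variance cutoff is an assumption, since the exact number of components isn't recorded here:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Full numeric feature matrix: everything except the target.
X_full = df.select_dtypes("number").drop(columns=["MSRP"])
y = df["MSRP"]
X_train, X_val, y_train, y_val = train_test_split(X_full, y, random_state=0)

# Scale first (PCA is variance-sensitive), then keep enough components
# to explain 95% of the variance.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
pca_model.fit(X_train, y_train)
print("train R^2:", pca_model.score(X_train, y_train))
print("val   R^2:", pca_model.score(X_val, y_val))
```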
The linear regression model did pretty well, but we wanted to experiment with other models, so next we tried RidgeCV. We chose RidgeCV because it automatically performs (efficient) leave-one-out cross-validation to tune its regularization. We tried it with all three of our data splits: the slim version of our data set, the full data set, and the PCA-modified version. It did best on our full data set, and its scores were generally better than the basic linear regression model's.
Next we combined polynomial features with RidgeCV to see if poly-fitting our data would give better results than RidgeCV alone. It exceeded our wildest expectations, scoring an impressive R² of 98% and a mean squared error of 4,221 on the validation data. This is amazingly accurate, and it also, surprisingly, greatly reduced the difference between the training and validation scores.
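The final model amounts to a small pipeline; the polynomial degree and alpha grid below are assumptions, and it reuses the full-data split from the PCA sketch above:

```python
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

final_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # degree is an assumption
    RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]),           # alpha grid is illustrative
)
final_model.fit(X_train, y_train)
print("val R^2:", final_model.score(X_val, y_val))
```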
With this result we were pretty confident this was our final model, so we examined its residuals to get a better idea of how effective it was.
Here is a histogram of our model's residuals on the validation split. The residuals are sharply centered around zero, meaning most predictions land very close to the true MSRP, which is pretty good.
We graphed the homoscedasticity of our model below, and we can see that the model becomes less accurate for more expensive vehicles. This is not too surprising, since expensive luxury cars are much rarer and our data is heavily concentrated on cheaper models around the $50,000 mark.
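Both residual plots can be produced along these lines, continuing from the fitted pipeline above:

```python
import matplotlib.pyplot as plt

predictions = final_model.predict(X_val)
residuals = y_val - predictions

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=50)
ax1.set(title="Validation residuals", xlabel="Residual ($)")
ax2.scatter(predictions, residuals, alpha=0.3)
ax2.axhline(0, color="red")  # a flat band around this line would indicate homoscedasticity
ax2.set(title="Residuals vs. predicted MSRP", xlabel="Predicted MSRP ($)", ylabel="Residual ($)")
plt.tight_layout()
plt.show()
```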
For each of the different "Vehicle Types" and "Makes" of vehicles, we ran training and validation data through the model to test how well it did on each group, and then graphed the results. The graph has the number of cars in each group along the x-axis and the R² score along the y-axis.
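One plausible sketch of how the per-group scores were computed, assuming the already-fitted model is scored on each group's rows (the grouping column names are stand-ins for ours; the same loop works for vehicle type):

```python
# Collect (group size, R^2) pairs for plotting size vs. score.
group_scores = {}
for make, group in df.groupby("Make"):
    X_g = group[X_train.columns]  # same features the model was trained on
    group_scores[make] = (len(group), final_model.score(X_g, group["MSRP"]))
```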
Our model did great on the test set: we got an R² of 95%, and the mean squared error was 6,026. Below we include some graphs of our test-split results. The first is a histogram of the residuals; the second is a scatterplot of predicted MSRP vs. real MSRP.
*This notebook contains all of our exploratory analysis and ML model testing.*