A Statistical Exploration and Predictive Modeling of Ford Car Attributes

Spring 2024 Data Science Project

Yuvraj Delada, Jaskaran Gill, Anchita Shukla

Member 1: Anchita Shukla, Contribution: 100%

Member 2: Yuvraj Delada, Contribution: 100%

Member 3: Jaskaran Gill, Contribution: 100%

"We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial." - Anchita Shukla, Yuvraj Delada, Jaskaran Gill 5/7/2024


We all explored different datasets and decided which one would be best, as well as what exploratory analyses we should perform and how we should clean the data. We also wrote the steps together for the Data Curation part.

Anchita explored the t-test, I explored ANOVA, and Jaskaran did the Chi-squared test, and we each wrote up our respective tests. For the initial regression, Jaskaran and I wrote the code while Anchita figured out a better way to do the regression and provided insights into what was happening. I wrote the Ridge regression, while Anchita and Jaskaran each used the data to answer the questions proposed in the introduction. We all worked on the insights and conclusions, as well as the introduction.

Introduction

Ford has been a leading car brand in America for over a century and is known for its high-quality vehicles. The United States is Ford's largest market, as wholesales to U.S. dealerships reached 1.7 million vehicles in 2021. Additionally, according to a Statista study, in 2022 Ford overtook Toyota as the leading car brand in the United States based on vehicle sales, delivering about 1.8 million units to U.S. customers. The dataset we interpret in this analysis contains information about used Ford car prices and the various factors that may influence them. It comes from Kaggle, a data science platform with over 276,000 high-quality public datasets.

We were interested in this data for several reasons. From a business perspective, analyzing the factors that influence used car prices can help identify which variables are the key drivers of pricing in the used car market. For Ford, understanding the price dynamics in this market can inform decisions about consumer preferences and help identify trends. A predictive model built from this dataset would be valuable to both buyers and sellers of used cars. The dataset has 17,966 observations and 7 predictor variables for the response variable, price. The numerical explanatory variables are production year (year), number of miles traveled (mileage), annual tax in dollars (tax), miles per gallon (mpg), and the car's engine size in liters (engineSize). The categorical explanatory variables are the car's transmission type (transmission) and fuel type (fuelType).


Using our data, we explored associations between car models and transmission types, the influence of fuel type on mileage performance, and variations in price points across different fuel types. Building upon what we uncovered through these exploratory processes, we created a predictive model to answer questions that can benefit both consumers and Ford, informing both parties' decision making. The critical questions we aim to answer are:

  • How do specific car statistics influence the price of a Ford vehicle?
  • How well does the model predict the prices of Ford vehicles in different price ranges?

We aim to provide insight into what influences the pricing of vehicles from the best-selling car company in America, which can help Ford in product development, pricing strategies, and even marketing efforts. Analyzing how the predictions behave across different price ranges is useful for consumers who may view this model as a guide: it is important to know how well the model performs in the price range they are shopping in.

Sources:

  • Scikit-Learn GridSearchCV (open-source ML library for Python):

    https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

  • Chi-Square Test:

    https://www.geeksforgeeks.org/ml-chi-square-test-for-feature-selection/

  • T-Test:

    https://www.geeksforgeeks.org/t-test/

  • One-Way ANOVA:

    https://www.geeksforgeeks.org/one-way-anova/

  • Data Curation:

    https://www.techtarget.com/searchbusinessanalytics/definition/data-curation

In [1]:
#imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency, ttest_ind, f_oneway
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score

Data Curation

  1. We imported the dataset as a DataFrame in Pandas.

    https://www.kaggle.com/datasets/adhurimquku/ford-car-price-prediction/data

  2. Using info(), we can see that all columns already have their desired data types, so there is no need to convert anything to numeric values.
  3. All the columns have the same non-null count, which is an indication that there are no missing values. We confirm this with code later.
  4. The 'year' column indicates the year in which the car was produced. Raw calendar years (roughly 1996-2020 in this data) are not very meaningful numbers on their own, so to properly quantify how old these cars are, we convert the production year into the car's age. We can then use the age to draw meaningful conclusions.
  5. Removed duplicate data. There were duplicate rows, as the row count dropped from 17,966 (the non-null count shown above) to 17,812; the summary table further below shows 17,811 because one additional outlier row is removed.
  6. Checked for any missing data. This confirms that there are indeed no missing values and that we can proceed to exploring the data.
  7. Reset the indices after cleaning, as the indices no longer corresponded one-to-one to their rows.
In [2]:
# read in the ford csv
df = pd.read_csv("ford.csv")

#check the info of the DataFrame for any abnormalities
print(df.info())

#change the 'year' column to something more meaningful,'age'.
df['age_of_car'] = 2024 - df['year']
del df['year'] #no need for 'year' anymore

#get rid of duplicate data
df.drop_duplicates(inplace=True)

# this block of code checks for any missing data
missing_values = df.isnull().sum()
if missing_values.any():
    print("columns with missing vals:")
    print(missing_values[missing_values > 0])
else:
    print("none")

#reset the index to be properly 0-indexed after some cleaning
df.reset_index(drop=True, inplace=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17966 entries, 0 to 17965
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         17966 non-null  object 
 1   year          17966 non-null  int64  
 2   price         17966 non-null  int64  
 3   transmission  17966 non-null  object 
 4   mileage       17966 non-null  int64  
 5   fuelType      17966 non-null  object 
 6   tax           17966 non-null  int64  
 7   mpg           17966 non-null  float64
 8   engineSize    17966 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 1.2+ MB
None
none
In [3]:
#we see a negative age (-36, i.e. a listed production year of 2060), which does not
#make sense! Run this alone first to see the negative value.
df.describe()
df[df.age_of_car == -36]

#drop the outlier
df.drop(df.query("age_of_car == -36").index, inplace = True)
df.describe()
Out[3]:
               price        mileage           tax           mpg    engineSize    age_of_car
count   17811.000000   17811.000000  17811.000000  17811.000000  17811.000000  17811.000000
mean    12269.880523   23379.381955    113.309865     57.909545      1.350620      7.140026
std      4736.220719   19418.128363     62.032540     10.132348      0.432593      2.026478
min       495.000000       1.000000      0.000000     20.800000      0.000000      4.000000
25%      8999.000000   10000.000000     30.000000     52.300000      1.000000      6.000000
50%     11289.000000   18274.000000    145.000000     58.900000      1.200000      7.000000
75%     15295.000000   31092.500000    145.000000     65.700000      1.500000      8.000000
max     54995.000000  177644.000000    580.000000    201.800000      5.000000     28.000000

Exploratory Data Analysis With Visualizations

1. Chi-Squared Test

$H_{0}$: There is no association between the car model and the transmission type

$H_{a}$: There is an association between the car model and the transmission type

Contingency Table Creation: The first step involves creating a contingency table that shows the frequency distribution of the two variables. This table is essential for the Chi-Squared test, as it provides the observed frequencies in each category.

Chi-Squared Test Computation: Using chi2_contingency, we compute the Chi-Squared statistic and the p-value. The p-value helps us determine whether to reject the null hypothesis (no association between the variables).

Visualization: We plot a bar chart to visually inspect the distribution of transmission types across different car models. This helps in understanding the data intuitively.

In [4]:
# Chi-Squared Test
contingency_table = pd.crosstab(df['model'], df['transmission'])
_, p_val, _, _ = chi2_contingency(contingency_table)
print("P-value for Chi-Squared Test:", p_val)
contingency_table.plot(kind='bar', figsize=(10, 6))
plt.title("Model vs. Transmission Type")
plt.xlabel("Model")
plt.ylabel("Count")
plt.show()
P-value for Chi-Squared Test: 0.0

Chi-Squared Test Conclusion:

This test investigates the association between two categorical variables, which in this case are the car model and transmission type. With a p-value of effectively 0, we reject the null hypothesis, indicating that a significant association exists between the car model and the transmission type.

2. Two-Sample T-Test

$H_{0}$: The mean mpg is not different across petrol and diesel cars.

$H_{a}$: The mean mpg is significantly different across petrol and diesel cars.

Data Selection: Extract MPG data for each group (petrol and diesel cars).

T-Test Execution: Perform the two-sample T-test to compare the means of the two groups. The result is expressed through a p-value.

Visualization: Histograms are plotted to show the distribution of MPG values for both fuel types, highlighting the central tendency and dispersion.

In [5]:
# Two Sample T-Test
petrol_mpg = df[df['fuelType'] == 'Petrol']['mpg']
diesel_mpg = df[df['fuelType'] == 'Diesel']['mpg']
_, p_val = ttest_ind(petrol_mpg, diesel_mpg)
print("P-value for T Test:", p_val)
plt.figure(figsize=(8, 6))
plt.hist([petrol_mpg, diesel_mpg], bins=10, label=['Petrol', 'Diesel'], density=True)
plt.title("MPG Distribution for Petrol and Diesel Cars")
plt.xlabel("MPG")
plt.ylabel("Density")
plt.legend()
plt.show()
P-value for T Test: 0.0

Two-Sample T-Test Conclusion:

This test compares the mean miles per gallon between two groups, petrol and diesel cars, to see if there is a significant difference between them. Given a p-value of effectively 0 again, we reject the null hypothesis, indicating that the mean mpg significantly differs between petrol and diesel cars.

3. One-Way ANOVA Test

$H_{0}$: There is no significant difference in mean prices of cars among different fuel types.

$H_{a}$: At least one fuel type has a significantly different mean price than the other fuel types.

Grouping Data: First, group data by fuelType and collect price data for each group.

ANOVA Test: Perform the one-way ANOVA to see if there are significant differences in the mean prices among the different fuel types.

Visualization: A boxplot shows price distributions across fuel types, highlighting median, quartiles, and outliers.

In [6]:
#One-way ANOVA
unique_fuels = df['fuelType'].unique() # ['Petrol', 'Diesel', 'Hybrid', 'Electric', 'Other']
fuel_type_groups = df.groupby('fuelType')['price']
_, p_val = f_oneway(*[group for name, group in fuel_type_groups])
print("P-value for ANOVA Test:", p_val)
plt.figure(figsize=(10, 6))
sns.boxplot(x='fuelType', y='price', data=df, hue='fuelType', palette='Set2', dodge=False)
plt.title("Price Distribution across Different Fuel Types")
plt.xlabel("Fuel Type")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
P-value for ANOVA Test: 1.9977378926368497e-179

One-Way ANOVA Test Conclusion:

This test compares means across multiple groups; in this case, it assesses whether car prices vary significantly across different fuel types. With a p-value close to 0, we reject the null hypothesis and conclude that there is a significant difference in car prices among the different fuel types.

Overall Conclusion:

Each of the statistical tests applied results in rejecting the null hypothesis, due to p-values at or near zero, leading us to conclude that there are significant differences or associations in each case.

  • There is a significant association between the car model and transmission type.
  • There is a significant difference in the mean mpg between petrol and diesel cars.
  • There is a significant difference in the mean prices among cars of different fuel types.

These conclusions reinforce the idea that crucial characteristics of a car, such as the model, fuel type, and transmission type, play a vital role in outcomes such as fuel efficiency and car pricing.

Primary Analysis and Visualizations

Given the nature of the questions that we want to answer, as well as the format of the dataset, regression analysis will be used. The main goal is to predict Ford car prices based on various features of previously listed Ford vehicles, which addresses the first question proposed in the introduction. The target variable (car price) is continuous, and we want to understand it in relation to one or more predictor variables, making regression analysis suitable.

We start our regression by identifying the target variable and the features whose influence on it we want to inspect. After splitting our data into training and test sets, we use one-hot encoding on the categorical features so that the regression can handle categorical values. We then train our model.

In [7]:
# Splitting features and target variable
X = df.drop(columns=['price'])
y = df['price']

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing categorical features
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

onehot = OneHotEncoder(handle_unknown='ignore')

# Preprocessing pipeline for categorical variables
categorical_transformer = Pipeline(steps=[
    ('onehot', onehot)
])

# Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
    ])

# Append regression model to preprocessing pipeline
regression_model = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('regressor', LinearRegression())])

# Train the model
regression_model.fit(X_train, y_train)

# Make predictions
y_pred = regression_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
Mean Absolute Error: 2681.0750883750725
In [8]:
# Make predictions on the remaining data
y_remaining_pred = regression_model.predict(X)

# Assess model accuracy
mae_remaining = mean_absolute_error(y, y_remaining_pred)
r2_remaining = r2_score(y, y_remaining_pred)
print("Mean Absolute Error on Remaining Data:", mae_remaining)
print("R-squared on Remaining Data:", r2_remaining)

# Create visualizations
# Scatter plot of actual vs. predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y, y_remaining_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($)')
plt.ylabel('Predicted Price ($)')
plt.title('Actual vs. Predicted Prices')
plt.grid(True)
plt.show()

# Distribution of residuals
residuals = y - y_remaining_pred
plt.figure(figsize=(10, 6))
sns.histplot(residuals, bins=30, kde=True)
plt.xlabel('Residuals ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.grid(True)
plt.show()
Mean Absolute Error on Remaining Data: 2669.9070307127663
R-squared on Remaining Data: 0.4440272329154157

As we can see, the model did not perform very well: the mean absolute error is around 2,700 dollars, a large fraction of the average car price (about 12,270 dollars), and the R^2 is relatively low, with only about 44% of the variability in the data explained by the model. Part of the problem is that the ColumnTransformer above lists only the categorical columns, and its default remainder='drop' silently discards the numerical features, so this first model never sees mileage, age, or engine size. To properly answer the questions proposed in this analysis, we must improve the model. The following code applies another form of regression, Ridge regression, with the numerical features scaled and included, and reports statistics that show how much the model improved.
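If we wanted to keep the plain linear regression, one simple option (a sketch, not something run in this notebook, assuming the X_train/X_test split and categorical_cols from the previous cells are reused) would be to tell the ColumnTransformer to pass the remaining numerical columns through instead of dropping them:

# Sketch (not run above): keep the numerical features in the baseline linear model
# by passing through any columns the categorical encoder does not handle.
baseline_with_numeric = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(
        transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
        remainder='passthrough')),  # numerical columns are no longer silently dropped
    ('regressor', LinearRegression())
])
baseline_with_numeric.fit(X_train, y_train)
print("MAE with numeric features:", mean_absolute_error(y_test, baseline_with_numeric.predict(X_test)))

Instead, the next cell moves to Ridge regression with the numerical features scaled and included, which also regularizes the many one-hot encoded columns.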

In [9]:
# Separating numerical and categorical columns
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

# Setting up preprocessing pipeline for numerical variables
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Combine all of it into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
    ])

# Updated regression model, using the preprocessor we just made and Ridge
ridge_model_improved = Pipeline(steps=[('preprocessor', preprocessor),
                                       ('regressor', Ridge())])

# Define a range of alpha values for Ridge Regression to try
param_grid = {
    'regressor__alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}

# Grid search for hyperparameter tuning, allows us to find the best params
grid_search = GridSearchCV(ridge_model_improved, param_grid, cv=5, scoring='neg_mean_absolute_error')
grid_search.fit(X_train, y_train)

# Train the model with best parameters
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)

# Make predictions
y_pred = best_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Best Ridge Regression Model:")
print("Mean Absolute Error:", mae)
print("R-squared:", r2)

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($)')
plt.ylabel('Predicted Price ($)')
plt.title('Actual vs. Predicted Prices (Ridge Regression)')
plt.grid(True)
plt.show()
Best Ridge Regression Model:
Mean Absolute Error: 1351.0499581605309
R-squared: 0.8578125364945244

Now we see that the best Ridge model yields a lower MAE and a much better R^2, with about 86% of the variance in the data explained by the model. To determine which features influence the predicted price the most, we can run the following code, which inspects the model's coefficients.

In [10]:
# Retrieve coefficients of all features (the numerical columns come first in the
# ColumnTransformer, followed by the one-hot encoded categorical columns)
coefficients = best_model.named_steps['regressor'].coef_

# Get feature names. Note: the one-hot encoder expands each categorical column
# into many dummy columns, so the zip() below truncates; the numerical labels are
# correct, but the last three printed lines are really the coefficients of the
# first three one-hot columns rather than of the whole categorical feature.
# In recent scikit-learn versions, full names are available via
# best_model.named_steps['preprocessor'].get_feature_names_out().
feature_names = numerical_cols + categorical_cols

# Create a dictionary mapping feature names to coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Print feature coefficients
for feature, coefficient in feature_coefficients.items():
    print(f"{feature}: {coefficient}")
mileage: -1214.1141925518411
tax: -47.87943202212063
mpg: -733.7412275620932
engineSize: 1234.0538223049728
age_of_car: -2263.0816456289454
model: -3984.0314459804003
transmission: -2825.4352273861523
fuelType: -2359.862612889144

We can judge the statistics of the car to evaluate the first question we proposed. We see mileage, mpg, engineSize, and age_of_car and their respective coefficients; the largest in absolute value are age_of_car and engineSize, so we inspect them. Because the numerical features were standardized before fitting, each coefficient describes the effect of a one-standard-deviation change rather than a one-unit change: a one-standard-deviation increase in the age of the car (about two years) is associated with a decrease of roughly 2,263 dollars in price, holding all other factors constant, and a one-standard-deviation increase in engine size (about 0.43 liters) is associated with an increase of roughly 1,234 dollars. Both directions make intuitive sense and are valuable insights for both consumers and Ford.
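Because those coefficients are on the standardized scale, a quick way to recover approximate per-unit effects (dollars per additional year of age, dollars per extra liter of engine size) is to divide each numerical coefficient by the standard deviation the scaler learned for that feature. A minimal sketch, assuming the fitted best_model and numerical_cols from the cells above:

# Sketch: convert standardized Ridge coefficients back to per-unit effects.
# Assumes best_model and numerical_cols are defined as in the cells above.
fitted_preprocessor = best_model.named_steps['preprocessor']
scaler = fitted_preprocessor.named_transformers_['num'].named_steps['scaler']

# The numerical columns come first in the ColumnTransformer, so their
# coefficients occupy the first len(numerical_cols) positions.
num_coefs = best_model.named_steps['regressor'].coef_[:len(numerical_cols)]

for name, coef, std in zip(numerical_cols, num_coefs, scaler.scale_):
    print(f"{name}: {coef / std:.2f} dollars per one-unit increase")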

In [11]:
# Define price ranges
price_ranges = [
    (0, 10000),
    (10000, 20000),
    (20000, 30000),
    (30000, 40000),
    (40000, 50000),
    (50000, np.inf)
]

# Evaluate model performance for each price range
for low, high in price_ranges:
    mask = (y_test >= low) & (y_test < high)
    y_test_range = y_test[mask]
    y_pred_range = y_pred[mask]

    if len(y_test_range) > 0:
        mae_range = mean_absolute_error(y_test_range, y_pred_range)
        r2_range = r2_score(y_test_range, y_pred_range)

        print(f"Price Range: ${low} - ${high}")
        print(f"Number of Samples: {len(y_test_range)}")
        print(f"Mean Absolute Error: {mae_range:.2f}")
        print(f"R-squared: {r2_range:.2f}")
        print("---")
    else:
        print(f"No samples in the price range: ${low} - ${high}")
        print("---")
Price Range: $0 - $10000
Number of Samples: 1288
Mean Absolute Error: 1341.15
R-squared: 0.01
---
Price Range: $10000 - $20000
Number of Samples: 2091
Mean Absolute Error: 1189.84
R-squared: 0.73
---
Price Range: $20000 - $30000
Number of Samples: 163
Mean Absolute Error: 2872.99
R-squared: -0.88
---
Price Range: $30000 - $40000
Number of Samples: 19
Mean Absolute Error: 6460.98
R-squared: -7.18
---
Price Range: $40000 - $50000
Number of Samples: 2
Mean Absolute Error: 3691.65
R-squared: -53.71
---
No samples in the price range: $50000 - $inf
---

Judging by the per-range results, the model performs best in the 10,000-20,000 dollar range (R^2 of 0.73 and the lowest MAE) and still produces small absolute errors below 10,000 dollars, while its predictions degrade sharply above 20,000 dollars, where the test set contains very few samples. Note that R^2 computed within a narrow price band is naturally lower because there is little variance left to explain, so MAE is the more informative metric here. For consumers, this suggests the model is a more reliable guide when shopping for lower-priced Fords than for the more expensive ones.
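As a supplementary check (a sketch, not part of the results above), each band's error can be put on a common scale by dividing its MAE by the band's mean actual price, reusing y_test, y_pred, and price_ranges from the previous cell:

# Sketch: relative MAE per price band, which is easier to compare across bands
# than raw MAE or within-band R^2. Reuses y_test, y_pred, and price_ranges.
for low, high in price_ranges:
    mask = (y_test >= low) & (y_test < high)
    if mask.sum() == 0:
        continue
    mae_band = mean_absolute_error(y_test[mask], y_pred[mask])
    print(f"${low}-${high}: relative MAE = {mae_band / y_test[mask].mean():.1%} (n={mask.sum()})")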

Insights and Conclusions:

The comprehensive analysis of Ford car attributes has illuminated several important factors that influence the prices of used Ford vehicles, leveraging a rich dataset to develop predictive models with practical implications. This project has bridged the gap between theoretical data science techniques and real-world applications, providing clear insights into automotive market dynamics.

For an uninformed reader, the project serves as a detailed introduction to how data science can be applied in understanding and predicting car prices based on historical data. The explanations of statistical tests and modeling processes, alongside the use of visual aids, help make complex concepts easier to understand and engaging. By detailing each step from data curation to model evaluation, the project ensures that readers without a prior background can grasp the significance of each analysis phase and understand how these contribute to the final model's predictions.

For those already familiar with the topic, the project offers a deeper dive into specific data science applications within the automotive industry. The analysis of different fuel types, transmission systems, and car models with respect to price provides comparative insight that could enhance an experienced reader's understanding of market trends. Additionally, the exploration of various statistical tests and regression models offers a practical demonstration of improving model performance on real-world datasets.

Overall, the project not only informs but also equips readers with knowledge about advanced data analysis techniques applied in a meaningful context. It highlights the critical role of data-driven decision-making in business strategies and consumer awareness, making a compelling case for the power of analytics in transforming industries. This balance of educational and practical insights ensures that all readers, irrespective of their prior knowledge, come away with both a foundational understanding and deeper insights into data science applications in the automotive sector.