Member 1: Anchita Shukla, Contribution: 100%
Member 2: Yuvraj Delada, Contribution: 100%
Member 3: Jaskaran Gill, Contribution: 100%
"We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial." - Anchita Shukla, Yuvraj Delada, Jaskaran Gill 5/7/2024
We all explored different datasets and decided together which one would be best, what exploratory analyses to run, and how to clean the data. We also wrote the steps together for the Data Curation part.
Anchita explored the t-test, I explored ANOVA, and Jaskaran did the chi-squared test, and we each wrote up our respective tests. For the initial regression, Jaskaran and I wrote the code while Anchita figured out a better way to do the regression and provided insights into what was happening. I wrote the Ridge regression, while Anchita and Jaskaran each used the data to answer the questions posed in the introduction. We all worked on the insights and conclusions, as well as the introduction.
Ford has been a leading car brand in America for over a century and is known for its high-quality vehicles. The United States is the largest market for Ford, as wholesales to U.S. dealerships reached 1.7 million vehicles in 2021. Additionally, according to a Statista study, in 2022 Ford overtook Toyota as the leading car brand in the United States based on vehicle sales, delivering about 1.8 million units to U.S. customers. The dataset we will be interpreting in this analysis contains information about used Ford car prices and the various factors which may influence the price. It is from Kaggle, a data science platform with over 276,000 high-quality public datasets. We were interested in this data for several reasons. From a business perspective, analyzing the factors that influence used car prices can help identify which variables are the key drivers of pricing in the used car market. For Ford, understanding the price dynamics in this market can help them make informed decisions about consumer preferences and identify trends. Building a predictive model from this dataset would be valuable to both buyers and sellers of used cars. This particular dataset has 17,966 observations and 8 predictor variables for the response variable, price. The numerical explanatory variables are production year (year), number of miles traveled (mileage), annual tax in dollars (tax), miles per gallon (mpg), and the car's engine size in liters (engineSize). The categorical explanatory variables are the car's model (model), transmission type (transmission), and fuel type (fuelType).
Using our data, we explored associations between car models and transmission types, the influence of fuel type on mileage performance, and variations in price points across different fuel types. Building upon what we uncovered through these exploratory processes, we created a predictive model to answer questions that can benefit both consumers and Ford, impacting both parties' decision-making process. Critical questions we aim to answer are:
1. Can we predict the price of a used Ford car from its attributes, and which features influence price the most?
2. How well does the model's predictive performance hold up across different price ranges?
We aim to provide insight that can help Ford, the best-selling car brand in America, determine what influences the pricing of its used vehicles, informing product development, pricing strategies, and even marketing efforts. Analyzing the model's predictions across different price ranges is useful for consumers who may use this model as a guide: they should know how well the model performs in the price range they are shopping in.
Sources:
Scikit-Learn GridSearchCV (open-source ML library for Python):
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Chi-Square Test:
https://www.geeksforgeeks.org/ml-chi-square-test-for-feature-selection/
T-Test:
One-Way ANOVA:
Data Curation:
https://www.techtarget.com/searchbusinessanalytics/definition/data-curation
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency, ttest_ind, f_oneway
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
We imported the dataset as a DataFrame in Pandas.
https://www.kaggle.com/datasets/adhurimquku/ford-car-price-prediction/data
# read in the ford csv
df = pd.read_csv("ford.csv")
#check the info of the DataFrame for any abnormalities
print(df.info())
#convert 'year' into a more meaningful feature, 'age_of_car' (as of 2024)
df['age_of_car'] = 2024 - df['year']
del df['year']  # no need for 'year' anymore
#get rid of duplicate data
df.drop_duplicates(inplace=True)
# this block of code checks for any missing data
missing_values = df.isnull().sum()
if missing_values.any():
    print("columns with missing vals:")
    print(missing_values[missing_values > 0])
else:
    print("none")
#reset the index to be properly 0-indexed after the cleaning above
df.reset_index(drop=True, inplace=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17966 entries, 0 to 17965
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   model         17966 non-null  object
 1   year          17966 non-null  int64
 2   price         17966 non-null  int64
 3   transmission  17966 non-null  object
 4   mileage       17966 non-null  int64
 5   fuelType      17966 non-null  object
 6   tax           17966 non-null  int64
 7   mpg           17966 non-null  float64
 8   engineSize    17966 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 1.2+ MB
None
none
#df.describe() reveals a negative age, which does not make sense! Run it alone
#first to see the negative value (-36, i.e., a listed year of 2060).
df.describe()
df[df.age_of_car == -36]  # inspect the offending rows
#drop the outlier
df.drop(df.query("age_of_car == -36").index, inplace = True)
df.describe()
| | price | mileage | tax | mpg | engineSize | age_of_car |
|---|---|---|---|---|---|---|
| count | 17811.000000 | 17811.000000 | 17811.000000 | 17811.000000 | 17811.000000 | 17811.000000 |
| mean | 12269.880523 | 23379.381955 | 113.309865 | 57.909545 | 1.350620 | 7.140026 |
| std | 4736.220719 | 19418.128363 | 62.032540 | 10.132348 | 0.432593 | 2.026478 |
| min | 495.000000 | 1.000000 | 0.000000 | 20.800000 | 0.000000 | 4.000000 |
| 25% | 8999.000000 | 10000.000000 | 30.000000 | 52.300000 | 1.000000 | 6.000000 |
| 50% | 11289.000000 | 18274.000000 | 145.000000 | 58.900000 | 1.200000 | 7.000000 |
| 75% | 15295.000000 | 31092.500000 | 145.000000 | 65.700000 | 1.500000 | 8.000000 |
| max | 54995.000000 | 177644.000000 | 580.000000 | 201.800000 | 5.000000 | 28.000000 |
$H_{0}$: There is no association between the car model and the transmission type
$H_{a}$: There is an association between the car model and the transmission type
Contingency Table Creation: The first step involves creating a contingency table that shows the frequency distribution of the two variables. This table is essential for the Chi-Squared test, as it provides the observed frequencies in each category.
Chi-Squared Test Computation: Using chi2_contingency, we compute the Chi-Squared statistic and the p-value. The p-value helps us determine whether to reject the null hypothesis (no association between the variables).
Visualization: We plot a bar chart to visually inspect the distribution of transmission types across different car models. This helps in understanding the data intuitively.
# Chi-Squared Test
contingency_table = pd.crosstab(df['model'], df['transmission'])
_, p_val, _, _ = chi2_contingency(contingency_table)
print("P-value for Chi-Squared Test:", p_val)
contingency_table.plot(kind='bar', figsize=(10, 6))
plt.title("Model vs. Transmission Type")
plt.xlabel("Model")
plt.ylabel("Count")
plt.show()
P-value for Chi-Squared Test: 0.0
This test investigates the association between two categorical variables, in this case the car model and the transmission type. With a p-value that is effectively zero, we reject the null hypothesis, indicating that a significant association exists between the car model and the transmission type.
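For reference, the test statistic compares the observed counts in the contingency table with the counts expected under independence:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$

where $O_{ij}$ is the observed count of cars with model $i$ and transmission type $j$, and $N$ is the total number of cars. The farther the observed counts fall from the expected ones, the larger the statistic and the smaller the p-value.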
$H_{0}$: The mean mpg is not different across petrol and diesel cars.
$H_{a}$: The mean mpg is significantly different across petrol and diesel cars.
Data Selection: Extract MPG data for each group (petrol and diesel cars).
T-Test Execution: Perform the two-sample T-test to compare the means of the two groups. The result is expressed through a p-value.
Visualization: Histograms are plotted to show the distribution of MPG values for both fuel types, highlighting the central tendency and dispersion.
# Two Sample T-Test
petrol_mpg = df[df['fuelType'] == 'Petrol']['mpg']
diesel_mpg = df[df['fuelType'] == 'Diesel']['mpg']
_, p_val = ttest_ind(petrol_mpg, diesel_mpg)
print("P-value for T Test:", p_val)
plt.figure(figsize=(8, 6))
plt.hist([petrol_mpg, diesel_mpg], bins=10, label=['Petrol', 'Diesel'], density=True)
plt.title("MPG Distribution for Petrol and Diesel Cars")
plt.xlabel("MPG")
plt.ylabel("Density")
plt.legend()
plt.show()
P-value for T Test: 0.0
This test compares the mean miles per gallon between two groups, petrol and diesel cars, to see if there is a significant difference between them. Given a p-value of effectively zero again, we reject the null hypothesis, indicating that the mean mpg significantly differs between petrol and diesel cars.
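One caveat: scipy's ttest_ind assumes equal variances in the two groups by default. As a quick robustness check (a sketch, not part of the recorded run above), Welch's t-test drops that assumption:

# Welch's t-test: does not assume equal variances across the two groups
_, p_val_welch = ttest_ind(petrol_mpg, diesel_mpg, equal_var=False)
print("P-value for Welch's T Test:", p_val_welch)

Given how far apart the two distributions sit in the histogram, the conclusion is unlikely to change, but it is good practice to check.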
$H_{0}$: There is no significant difference in mean prices of cars among different fuel types.
$H_{a}$: At least one fuel type has a significantly different mean price than the other fuel types.
Grouping Data: First, group data by fuelType and collect price data for each group.
ANOVA Test: Perform the one-way ANOVA to see if there are significant differences in the mean prices among the different fuel types.
Visualization: A boxplot shows price distributions across fuel types, highlighting median, quartiles, and outliers.
#One-way ANOVA
# fuel types present in the data: 'Petrol', 'Diesel', 'Hybrid', 'Electric', 'Other'
fuel_type_groups = df.groupby('fuelType')['price']
_, p_val = f_oneway(*[group for name, group in fuel_type_groups])
print("P-value for ANOVA Test:", p_val)
plt.figure(figsize=(10, 6))
sns.boxplot(x='fuelType', y='price', data=df, hue='fuelType', palette='Set2', dodge=False)
plt.title("Price Distribution across Different Fuel Types")
plt.xlabel("Fuel Type")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
P-value for ANOVA Test: 1.9977378926368497e-179
This test compares the means across multiple groups; in this case, it assesses whether car prices vary significantly across different fuel types. With a p-value close to zero, we reject the null hypothesis and conclude that there is a significant difference in car prices among the different fuel types.
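ANOVA tells us that at least one fuel type differs in mean price, but not which ones. A post-hoc pairwise comparison would identify the specific pairs; below is a sketch using Tukey's HSD, assuming the statsmodels package is available (it is not among the imports above):

# Post-hoc pairwise comparison of mean price across fuel types (Tukey HSD)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(endog=df['price'], groups=df['fuelType'], alpha=0.05)
print(tukey.summary())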
The statistical tests each resulted in rejecting the null hypothesis, with p-values at or near zero, indicating a significant association or difference in each case.
Given the nature of the questions we want to answer, as well as the format of the dataset, regression analysis will be used. The main goal is to predict used Ford car prices based on various features and historical Ford car prices, which is the first question proposed in the introduction. The target variable (price) is continuous, and we want to understand it in relation to one or more predictor variables, making regression analysis suitable.
We start our regression by identifying the target variable and the features whose influence on it we want to inspect. After splitting our data for modeling purposes, we use one-hot encoding to process the categorical features so the regression can handle categorical values (a toy illustration follows below). We then train our model.
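To illustrate what the encoder does, here is a minimal sketch on a toy transmission column (values chosen for illustration): each category level becomes its own 0/1 indicator column.

# Toy illustration of one-hot encoding: each category becomes a 0/1 column
toy = pd.DataFrame({'transmission': ['Manual', 'Automatic', 'Manual', 'Semi-Auto']})
enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(toy)
print(enc.get_feature_names_out())  # one column per category level
print(encoded.toarray())

The handle_unknown='ignore' option means a category seen only at prediction time encodes as all zeros rather than raising an error, which is why the pipeline below uses it.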
# Splitting features and target variable
X = df.drop(columns=['price'])
y = df['price']
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing categorical features
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
onehot = OneHotEncoder(handle_unknown='ignore')
# Preprocessing pipeline for categorical variables
categorical_transformer = Pipeline(steps=[
    ('onehot', onehot)
])
# Column Transformer
# NOTE: ColumnTransformer defaults to remainder='drop', so this first model
# passes only the one-hot encoded categorical features to the regression;
# the numerical columns are silently excluded here
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
    ])
# Append regression model to preprocessing pipeline
regression_model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', LinearRegression())])
# Train the model
regression_model.fit(X_train, y_train)
# Make predictions
y_pred = regression_model.predict(X_test)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
Mean Absolute Error: 2681.0750883750725
# Make predictions on the entire dataset (the prints below call it the
# "remaining" data, but note this includes the training rows)
y_remaining_pred = regression_model.predict(X)
# Assess model accuracy
mae_remaining = mean_absolute_error(y, y_remaining_pred)
r2_remaining = r2_score(y, y_remaining_pred)
print("Mean Absolute Error on Remaining Data:", mae_remaining)
print("R-squared on Remaining Data:", r2_remaining)
# Create visualizations
# Scatter plot of actual vs. predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y, y_remaining_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($)')
plt.ylabel('Predicted Price ($)')
plt.title('Actual vs. Predicted Prices')
plt.grid(True)
plt.show()
# Distribution of residuals
residuals = y - y_remaining_pred
plt.figure(figsize=(10, 6))
sns.histplot(residuals, bins=30, kde=True)
plt.xlabel('Residuals ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.grid(True)
plt.show()
Mean Absolute Error on Remaining Data: 2669.9070307127663
R-squared on Remaining Data: 0.4440272329154157
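One caveat with the evaluation above: predicting on the full dataset scores the model partly on rows it was trained on, which can flatter the metrics. A less biased estimate could come from k-fold cross-validation; a sketch follows (cross_val_score is not among the imports above):

# 5-fold cross-validated MAE: every row is scored by a model that never saw it
from sklearn.model_selection import cross_val_score
cv_mae = -cross_val_score(regression_model, X, y, cv=5,
                          scoring='neg_mean_absolute_error')
print("Cross-validated MAE:", cv_mae.mean())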
As we can see, the model did not perform very well: the mean absolute error was around 2,700 dollars, roughly 22% of the average car price (about 12,270 dollars). The R^2 is also relatively low, with only 44% of the variability in the data explained by the model. Part of the explanation is that the pipeline above passed only the one-hot encoded categorical features to the regression (ColumnTransformer drops unlisted columns by default), so the numerical features went unused. To properly answer the questions proposed in this analysis, we must try to improve our model. The following code applies another form of regression, Ridge regression, this time including the scaled numerical features, and reports statistics showing how much the model improved.
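For context, Ridge regression is ordinary least squares with an L2 penalty on the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left( \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2 \right)$$

The regularization strength $\alpha$ shrinks coefficients toward zero, trading a little bias for lower variance; the grid search below selects $\alpha$ from a set of candidate values. This is also why the numerical features are standardized first: the penalty treats all coefficients on the same scale.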
# Separating numerical and categorical columns
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
# Setting up preprocessing pipeline for numerical variables
numerical_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
# Combine numerical and categorical preprocessing into a single preprocessor
# (unlike the first model, the numerical columns are now scaled and included)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
    ])
# Updated regression model, using the preprocessor we just made and Ridge
ridge_model_improved = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', Ridge())])
# Define a range of alpha values for Ridge Regression to try
param_grid = {
'regressor__alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}
# Grid search for hyperparameter tuning, allows us to find the best params
grid_search = GridSearchCV(ridge_model_improved, param_grid, cv=5, scoring='neg_mean_absolute_error')
grid_search.fit(X_train, y_train)
# Retrieve the model refit with the best parameters
# (GridSearchCV already refits on the full training set by default, so the
# explicit fit below is redundant but harmless)
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)
# Make predictions
y_pred = best_model.predict(X_test)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Best Ridge Regression Model:")
print("Mean Absolute Error:", mae)
print("R-squared:", r2)
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($)')
plt.ylabel('Predicted Price ($)')
plt.title('Actual vs. Predicted Prices (Ridge Regression)')
plt.grid(True)
plt.show()
Best Ridge Regression Model:
Mean Absolute Error: 1351.0499581605309
R-squared: 0.8578125364945244
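The recorded output does not show which regularization strength won; a quick check (a sketch, not part of the recorded run) would be:

# Inspect the alpha value selected by the grid search
print("Best parameters:", grid_search.best_params_)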
Now, we see that the best Ridge model yielded a much lower MAE and a much better R^2, with about 86% of the variance in the data explained by the model. To see which features influence price the most, we can run the following code, which inspects the coefficients of the features.
# Retrieve coefficients of all features from the fitted Ridge regressor
coefficients = best_model.named_steps['regressor'].coef_
# Get feature names
# NOTE: one-hot encoding produces more coefficients than original columns, and
# zip stops at the shorter sequence. The numerical labels below are correct
# (the 'num' transformer comes first), but the single values shown for model,
# transmission, and fuelType are really the first few one-hot level
# coefficients -- see the sketch after the discussion below
feature_names = numerical_cols + categorical_cols
# Create a dictionary mapping feature names to coefficients
feature_coefficients = dict(zip(feature_names, coefficients))
# Print feature coefficients
for feature, coefficient in feature_coefficients.items():
    print(f"{feature}: {coefficient}")
mileage: -1214.1141925518411
tax: -47.87943202212063
mpg: -733.7412275620932
engineSize: 1234.0538223049728
age_of_car: -2263.0816456289454
model: -3984.0314459804003
transmission: -2825.4352273861523
fuelType: -2359.862612889144
We can use the numerical coefficients to address the first question we proposed. Among mileage, tax, mpg, engineSize, and age_of_car, the largest in absolute value are age_of_car and engineSize, so we inspect those. Because the numerical features were standardized, each coefficient reflects the effect of a one standard deviation change, holding all other factors constant: a one standard deviation increase in a car's age (about 2 years) corresponds to a price decrease of roughly 2,263 dollars, while a one standard deviation increase in engine size (about 0.43 liters) corresponds to a price increase of roughly 1,234 dollars. Both make sense, and are valuable insights for consumers and Ford alike. The values printed for model, transmission, and fuelType should be read with caution: one-hot encoding gives each category level its own coefficient, so a single number per categorical column is not meaningful (the sketch below shows how to label them all).
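For completeness, here is a sketch of how to pair every coefficient with its correct name, using the fitted preprocessor's expanded feature names (assumes a scikit-learn version with get_feature_names_out, i.e. 1.0 or later):

# Correctly pair every coefficient with its expanded feature name,
# including one name per one-hot encoded category level
expanded_names = best_model.named_steps['preprocessor'].get_feature_names_out()
for name, coef in zip(expanded_names, best_model.named_steps['regressor'].coef_):
    print(f"{name}: {coef}")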
# Define price ranges
price_ranges = [
(0, 10000),
(10000, 20000),
(20000, 30000),
(30000, 40000),
(40000, 50000),
(50000, np.inf)
]
# Evaluate model performance separately within each price range
for low, high in price_ranges:
    mask = (y_test >= low) & (y_test < high)
    y_test_range = y_test[mask]
    y_pred_range = y_pred[mask]
    if len(y_test_range) > 0:
        mae_range = mean_absolute_error(y_test_range, y_pred_range)
        r2_range = r2_score(y_test_range, y_pred_range)
        print(f"Price Range: ${low} - ${high}")
        print(f"Number of Samples: {len(y_test_range)}")
        print(f"Mean Absolute Error: {mae_range:.2f}")
        print(f"R-squared: {r2_range:.2f}")
        print("---")
    else:
        print(f"No samples in the price range: ${low} - ${high}")
        print("---")
Price Range: $0 - $10000
Number of Samples: 1288
Mean Absolute Error: 1341.15
R-squared: 0.01
---
Price Range: $10000 - $20000
Number of Samples: 2091
Mean Absolute Error: 1189.84
R-squared: 0.73
---
Price Range: $20000 - $30000
Number of Samples: 163
Mean Absolute Error: 2872.99
R-squared: -0.88
---
Price Range: $30000 - $40000
Number of Samples: 19
Mean Absolute Error: 6460.98
R-squared: -7.18
---
Price Range: $40000 - $50000
Number of Samples: 2
Mean Absolute Error: 3691.65
R-squared: -53.71
---
No samples in the price range: $50000 - $inf
---
Judging by the R^2 and MAE values in each range, the model performs best for lower-priced cars, with the $10,000-$20,000 range predicted most reliably, while performance degrades sharply above $20,000, where samples are scarce. Consumers should therefore lean on our analysis when considering lower-end cars rather than more expensive ones.
The comprehensive analysis of Ford car attributes has illuminated several important factors that influence the prices of used Ford vehicles. We leveraged a rich dataset to develop predictive models with practical implications. This project has successfully bridged the gap between theoretical data science techniques and real-world applications, providing clear insights into automotive market dynamics.
For an uninformed reader, the project serves as a detailed introduction to how data science can be applied to understanding and predicting car prices from historical data. The explanations of statistical tests and modeling processes, alongside the use of visual aids, help make complex concepts easier to understand and more engaging. By detailing each step from data curation to model evaluation, the project ensures that readers without a prior background can grasp the significance of each analysis phase and understand how these contribute to the final model's predictions.
For those already familiar with the topic, the project offers a deeper dive into specific data science applications within the automotive industry. The detailed analysis of different fuel types, transmission systems, and car models with respect to price provides comparative insight that could enhance an experienced reader's understanding of market trends. Additionally, the exploration of various statistical tests and regression models provides a practical demonstration of improving model performance on real-world datasets.
Overall, the project not only informs but also equips readers with knowledge about advanced data analysis techniques applied in a meaningful context. It highlights the critical role of data-driven decision-making in business strategies and consumer awareness, making a compelling case for the power of analytics in transforming industries. This balance of educational and practical insights ensures that all readers, irrespective of their prior knowledge, come away with both a foundational understanding and deeper insights into data science applications in the automotive sector.