Using everyone's favorite diamonds dataset, I was able to predict the selling price of the diamonds with a high degree of accuracy from the variables provided.
- Python was used to transform, explore, and evaluate the data, and to build the machine learning model.
- PowerShell was used to create a custom Conda virtual environment and to install all the necessary packages.
- Power BI was used to visualize the data and the model results.
Let's take a look at the data before we proceed further.
```python
import seaborn as sns

# Assumed load step; the project notebook may read the data from a CSV instead
df = sns.load_dataset('diamonds')
df.head()
```
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
To reduce the complexity of the data, I created a ratio of two of the physical dimensions (x/y) and chose to drop the original columns.
```python
import pandas as pd

# Create the dimension ratio before encoding so it carries through to the features
df['xy'] = df['x'] / df['y']

# Encoding categorical variables into machine-readable values
d_df = pd.get_dummies(df)
corr_heatmap(d_df.corr())  # Custom heatmap plotting function (see notebook)

X = d_df.drop(['price', 'x', 'y', 'z'], axis=1)  # Dropping target variable & highly correlated columns
y = d_df['price']
```
The actual process of building the model isn't very exciting. If you're curious, you can see the project notebook located in the 'resources' directory.
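For completeness, here's a minimal sketch of the training step, assuming a scikit-learn `StandardScaler` (the `s` that gets pickled below) and a `RandomForestRegressor` (the `rf`); the notebook's actual split and hyperparameters may differ:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Hold out a test set for the metrics reported below (assumed 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

s = StandardScaler()                  # Saved later as 'scaler.pkl'
X_train_s = s.fit_transform(X_train)
X_test_s = s.transform(X_test)        # Fit on training data only to avoid leakage

rf = RandomForestRegressor(random_state=42)  # Saved later as 'random_forest_model.pkl'
rf.fit(X_train_s, y_train)
```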
- Think of RMSE as roughly the average difference between the predicted and actual prices (technically the square root of the mean squared error, so larger misses are penalized more heavily).
| Model | RMSE |
|---|---|
| KNN | 1170.659452 |
| MLR | 1120.830488 |
| RF | 552.995054 |
| Lasso | 1120.723179 |
| Null (mean of y) | 3942.168776 |
- R² measures the proportion of variance in price the model explains; the closer it is to 1, the better the model fits the data.
| Model | R² |
|---|---|
| KNN | 0.91133 |
| MLR | 0.918722 |
| RF | 0.980161 |
| Lasso | 0.918734 |
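For reference, here's a minimal sketch of how these two metrics can be computed with scikit-learn, assuming the fitted `rf` and the held-out split from the training sketch above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

preds = rf.predict(X_test_s)
rmse = np.sqrt(mean_squared_error(y_test, preds))  # Same units as price (dollars)
r2 = r2_score(y_test, preds)                       # Proportion of variance explained

# Null baseline: always predict the mean training price
null_preds = np.full(len(y_test), y_train.mean())
null_rmse = np.sqrt(mean_squared_error(y_test, null_preds))

print(f"RMSE: {rmse:.2f}  R2: {r2:.5f}  Null RMSE: {null_rmse:.2f}")
```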
```python
# Save ML model, scaler, and processed data to disk
import pickle
import os

# Directory path and file names
directory_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources"
model_file_name = 'random_forest_model.pkl'
scaler_file_name = 'scaler.pkl'
processed_data_file_name = 'processed_diamond_data.csv'

# Full paths
model_full_path = os.path.join(directory_path, model_file_name)
scaler_full_path = os.path.join(directory_path, scaler_file_name)
data_full_path = os.path.join(directory_path, processed_data_file_name)

# Save ML model and scaler to disk
with open(model_full_path, 'wb') as model_file:
    pickle.dump(rf, model_file)
with open(scaler_full_path, 'wb') as scaler_file:
    pickle.dump(s, scaler_file)

# Saving the processed data as a csv
df.to_csv(data_full_path, index=False)
```
Here's a preview of the data once the model's predictions have been appended:

| | carat | cut | color | clarity | depth | table | price | x | y | z | xy | predictions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0.992462 | 377.0 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 1.013021 | 404.8 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 0.995086 | 349.6 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 0.992908 | 372.0 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 0.997701 | 402.1 |
The following query shows how I imported the model I created and exported from the project notebook, then loaded it into Power BI using Power Query.
- The Python code shown after the query is the actual script from the 'Run_Python_script' step, presented in a more readable format.
```
let
    Source = Csv.Document(File.Contents("C:\Users\conno\workspace\projects\diamond_price_prediction\resources\processed_diamond_data.csv"), [Delimiter=",", Columns=11, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    Run_Python_script = Python.Execute("# 'dataset' holds the input data for this script#(lf)
        import pandas as pd#(lf)
        import pickle#(lf)#(lf)
        # Loading random forest model & scaler#(lf)
        file_path = r""C:\Users\conno\workspace\projects\diamond_price_prediction\resources\random_forest_model.pkl""#(lf)
        scaler_path = r""C:\Users\conno\workspace\projects\diamond_price_prediction\resources\scaler.pkl""#(lf)
        with open(file_path, 'rb') as file:#(lf)
            model = pickle.load(file)#(lf)
        with open(scaler_path, 'rb') as scaler_file:#(lf)
            scaler = pickle.load(scaler_file)#(lf)#(lf)
        # Feature Engineering#(lf)
        d_dataset = pd.get_dummies(dataset)#(lf)
        d_dataset = d_dataset.drop(['price', 'x', 'y', 'z'], axis=1)#(lf)
        X = scaler.transform(d_dataset)#(lf)#(lf)
        # Make predictions#(lf)
        dataset['predictions'] = model.predict(X)", [dataset=PromotedHeaders]),
    dataset = Run_Python_script{[Name="dataset"]}[Value],
    // The index will serve as our data points on the scatter plot
    Added_Index = Table.AddIndexColumn(dataset, "Index", 0, 1, Int64.Type),
    Changed_DType = Table.TransformColumnTypes(Added_Index,
        {{"carat", type number}, {"cut", type text}, {"color", type text}, {"clarity", type text}, {"depth", type number}, {"table", type number}, {"price", Int64.Type}, {"x", type number}, {"y", type number}, {"z", type number}, {"xy", type number}, {"predictions", Int64.Type}}
    )
in
    Changed_DType
```
```python
# 'dataset' holds the input data for this script
import pandas as pd
import pickle

# Loading random forest model & scaler
file_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources\random_forest_model.pkl"
scaler_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources\scaler.pkl"

with open(file_path, 'rb') as file:
    model = pickle.load(file)
with open(scaler_path, 'rb') as scaler_file:
    scaler = pickle.load(scaler_file)

# Feature Engineering
d_dataset = pd.get_dummies(dataset)
d_dataset = d_dataset.drop(['price', 'x', 'y', 'z'], axis=1)
X = scaler.transform(d_dataset)

# Make predictions
dataset['predictions'] = model.predict(X)
```
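One caveat worth flagging: `pd.get_dummies` must produce the same columns, in the same order, as it did at training time, or `scaler.transform` will fail or silently misalign. A defensive variant (a hypothetical safeguard, not part of the original script, assuming scikit-learn ≥ 1.0 and a scaler that was fitted on a DataFrame) could reindex against the scaler's recorded feature names:

```python
# Hypothetical safeguard (not in the original script): align the dummy
# columns with the feature names the scaler was fitted on.
train_columns = list(scaler.feature_names_in_)  # Recorded by scikit-learn >= 1.0
d_dataset = d_dataset.reindex(columns=train_columns, fill_value=0)
X = scaler.transform(d_dataset)
```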
- Finally, with the predictions loaded into Power BI, it's time to build the final report. After some initial data modeling and measure development, we have a clean, easy-to-understand report that's ready for end-user consumption.