Project Overview

Using everyone's favorite Diamond dataset, I built a model that predicts the selling price of each diamond from the variables I was given. The best model, a random forest, reached an R2 of about 0.98 (see Model Evaluation).

Tools

- Python was used to transform, explore, and evaluate the data, and to build the machine learning model.
- PowerShell was used to create a custom Conda virtual environment and to install all the necessary packages.
- Power BI was used to visualize the data and the model results.

Python

In the following section, I'll walk through key points of the project as well as some of the code.

Let's take a look at the data before we proceed further.

df.head()

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75

Feature Correlation

To reduce the complexity of the data, I created a ratio of two of the dimension columns (x / y) and chose to drop the original x, y, and z columns, which are highly correlated with one another.

# Encoding categorical variables into machine readable values
d_df = pd.get_dummies(df)
corr_heatmap(d_df.corr()) # Calling heatmap plotting function (see notebook)

Correlation Heatmap

Feature Engineering

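df['xy'] = df['x'] / df['y'] # Ratio of the x and y dimensions
d_df = pd.get_dummies(df) # Re-encoding so the new 'xy' column is included
X = d_df.drop(['price', 'x', 'y', 'z'], axis=1) # Dropping target variable & highly correlated columns
y = d_df['price']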
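
As a rough illustration of this step, the sketch below rebuilds the five rows shown by df.head() and applies the same ratio, encoding, and drop. It assumes the ratio column is added before the dummy encoding so that 'xy' ends up in X. The hand-typed DataFrame is only a stand-in for the full dataset.

```python
import pandas as pd

# The five rows shown by df.head() earlier in this README (stand-in for the full data)
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color": ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
    "table": [55.0, 61.0, 65.0, 58.0, 58.0],
    "price": [326, 326, 327, 334, 335],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})

# Add the ratio first, then encode, so the encoded frame contains 'xy'
df["xy"] = df["x"] / df["y"]
d_df = pd.get_dummies(df)

# Features: everything except the target and the raw dimension columns
X = d_df.drop(["price", "x", "y", "z"], axis=1)
y = d_df["price"]

print(X.columns.tolist())
```

Each categorical column (cut, color, clarity) becomes one indicator column per category, while the numeric columns pass through unchanged.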

Model Evaluation

The actual process of building the model isn't very exciting. If you're curious, you can see the project notebook located in the 'resources' directory.

RMSE results

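RMSE (root mean squared error) is the square root of the average squared difference between the predicted values and the actual values. It is expressed in the same units as price, and large errors weigh more heavily than small ones. The Null row is a baseline that always predicts the mean price.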

	RMSE
KNN	1170.659452
MLR	1120.830488
RF	552.995054
Lasso	1120.723179
Null (mean y value)	3942.168776
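
To make the metric concrete, here is a minimal sketch of how RMSE and the null (mean) baseline can be computed. The rmse helper and the four toy prices are illustrative assumptions, not values from the project notebook.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values for illustration only (not taken from the diamond data)
actual = [300, 500, 700, 900]
predicted = [320, 480, 710, 880]

# Null model: always predict the mean of the actual values
mean_price = sum(actual) / len(actual)
null_predictions = [mean_price] * len(actual)

model_rmse = rmse(actual, predicted)
null_rmse = rmse(actual, null_predictions)

print(round(model_rmse, 2))  # 18.03
print(round(null_rmse, 2))   # 223.61
```

A model is only useful if its RMSE is well below the null baseline, which is why the Null row is included in the table above.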

R2 results

The closer the R2 value is to 1, the better the model fits the data.

	R2
KNN	0.91133
MLR	0.918722
RF	0.980161
Lasso	0.918734
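
For reference, R2 compares the model's squared error with that of the mean-only baseline. The sketch below uses a hypothetical r_squared helper and toy numbers (not the project data). scikit-learn's r2_score computes the same quantity.

```python
def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy values for illustration only (not taken from the diamond data)
actual = [300, 500, 700, 900]
predicted = [320, 480, 710, 880]

model_r2 = r_squared(actual, predicted)

# Predicting the mean for every row gives an R2 of exactly 0
null_r2 = r_squared(actual, [600] * len(actual))

print(round(model_r2, 4))  # 0.9935
print(null_r2)             # 0.0
```

This is why the null model has no row in the R2 table: by construction it scores 0 on the data its mean was computed from.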

Exporting model

# Save ML model to disk
import pickle
import os

# Directory path and file names
directory_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources"
model_file_name = 'random_forest_model.pkl'
scaler_file_name = 'scaler.pkl'
processed_data_file_name = 'processed_diamond_data.csv'

# Full paths
model_full_path = os.path.join(directory_path, model_file_name)
scaler_full_path = os.path.join(directory_path, scaler_file_name)
data_full_path = os.path.join(directory_path, processed_data_file_name)

# Save the ML model and the scaler to disk
with open(model_full_path, 'wb') as model_file:
    pickle.dump(rf, model_file)

with open(scaler_full_path, 'wb') as scaler_file:
    pickle.dump(s, scaler_file)

# Saving the processed data as a csv
df.to_csv(data_full_path, index=False)

	carat	cut	color	clarity	depth	table	price	x	y	z	xy	predictions
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43	0.992462	377.0
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31	1.013021	404.8
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31	0.995086	349.6
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63	0.992908	372.0
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75	0.997701	402.1
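
One possible variation on the export step is sketched below: it builds paths with pathlib instead of a hard-coded absolute path and checks the round trip by loading the file back. The save_artifact and load_artifact helpers are hypothetical, and a plain dictionary stands in for the fitted model so the example runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifact(obj, path):
    """Pickle obj to path, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)

def load_artifact(path):
    """Load a pickled object. Only unpickle files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Stand-in object; in the project this would be the fitted model (rf) or scaler (s)
stand_in_model = {"kind": "stand-in for the fitted random forest"}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "resources" / "random_forest_model.pkl"
    save_artifact(stand_in_model, model_path)
    restored = load_artifact(model_path)

print(restored == stand_in_model)  # True
```

Reloading the file right after saving is a cheap way to confirm the artifact is usable before pointing Power BI at it.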

PowerQuery

The following query shows how I loaded the processed data into Power BI using PowerQuery and applied the model and scaler that I exported from the project notebook.

The Python code that follows the query is the same script used in the 'Run_Python_script' step, presented in a more readable format.

let
    Source = Csv.Document(File.Contents("C:\Users\conno\workspace\projects\diamond_price_prediction\resources\processed_diamond_data.csv"),[Delimiter=",", Columns=11, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),

    Run_Python_script = Python.Execute("# 'dataset' holds the input data for this script#(lf)
      import pandas as pd#(lf)
      import pickle#(lf)#(lf)# Loading random forest model & scaler#(lf)
      file_path = r""C:\Users\conno\workspace\projects\diamond_price_prediction\resources\random_forest_model.pkl""#(lf)
      scaler_path = r""C:\Users\conno\workspace\projects\diamond_price_prediction\resources\scaler.pkl""#(lf)
      with open(file_path, 'rb') as file:#(lf)
          model = pickle.load(file)#(lf)
      with open(scaler_path, 'rb') as scaler_file:#(lf)
        scaler = pickle.load(scaler_file)#(lf)#(lf)
      # Feature Engineering#(lf)
      d_dataset = pd.get_dummies(dataset)#(lf)
      d_dataset = d_dataset.drop(['price', 'x', 'y', 'z'], axis=1)#(lf)
      X = scaler.transform(d_dataset)#(lf)#(lf)
      # Make predictions#(lf)
      dataset['predictions'] = model.predict(X)",[dataset=PromotedHeaders]
    ),
    dataset = Run_Python_script{[Name="dataset"]}[Value],

    // The index will serve as our data points on the scatter plot
    Added_Index = Table.AddIndexColumn(dataset, "Index", 0, 1, Int64.Type),
    Changed_DType = Table.TransformColumnTypes(Added_Index,
      {{"carat", type number}, {"cut", type text}, {"color", type text}, {"clarity", type text}, {"depth", type number}, {"table", type number}, {"price", Int64.Type}, {"x", type number}, {"y", type number}, {"z", type number}, {"xy", type number}, {"predictions", Int64.Type}}
    )
in
    Changed_DType

# 'dataset' holds the input data for this script
import pandas as pd
import pickle

# Loading random forest model & scaler
file_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources\random_forest_model.pkl"
scaler_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources\scaler.pkl"
with open(file_path, 'rb') as file:
    model = pickle.load(file)
with open(scaler_path, 'rb') as scaler_file:
    scaler = pickle.load(scaler_file)

# Feature Engineering
d_dataset = pd.get_dummies(dataset)
d_dataset = d_dataset.drop(['price', 'x', 'y', 'z'], axis=1)
X = scaler.transform(d_dataset)

# Make predictions
dataset['predictions'] = model.predict(X)
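
One thing to watch for in a script like this: pd.get_dummies only creates columns for categories present in the data it is given, while the scaler and model expect the columns seen during training. A minimal sketch of one way to guard against a mismatch is shown below. The training_columns list is a hypothetical stand-in for a column list saved at training time, which the project does not currently export.

```python
import pandas as pd

# Hypothetical: the encoded column order recorded when the model was trained
training_columns = ["carat", "cut_Good", "cut_Ideal", "cut_Premium"]

# New data that happens to contain only one 'cut' category
dataset = pd.DataFrame({"carat": [0.23, 0.31], "cut": ["Ideal", "Ideal"]})

# Encoding alone yields only the categories present in this data
d_dataset = pd.get_dummies(dataset)

# Reindex to the training layout, filling any missing indicator columns with 0
aligned = d_dataset.reindex(columns=training_columns, fill_value=0)

print(d_dataset.columns.tolist())  # ['carat', 'cut_Ideal']
print(aligned.columns.tolist())    # ['carat', 'cut_Good', 'cut_Ideal', 'cut_Premium']
```

In this project the query scores the same file that was used for training, so the columns already line up. The alignment step matters mainly when new or filtered data is scored.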

Power BI

Finally, it's time to load the model into Power BI for the final report. After some initial data modeling and measure development, we have a clean, easy-to-understand report that's ready for end-user consumption.
