Project Overview


Using everyone's favorite diamonds dataset, I predicted the selling price of each diamond with a high degree of accuracy from the variables provided.

Tools

  • Python was used to transform, explore, and evaluate the data, and to build the machine learning model.
    • PowerShell was used to create a custom Conda virtual environment and to install all the necessary packages.
  • Power BI was used to visualize the data and the model results.

Python

In the following section, I'll walk through key points of the project as well as some of the code.

Let's take a look at the data before we proceed further.

```python
df.head()
```

```
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
```

Feature Correlation

The dimension columns are highly correlated with one another, so to reduce the complexity of the data I created a ratio of two dimensions and dropped the original columns.

```python
# Encoding categorical variables into machine-readable values
d_df = pd.get_dummies(df)
corr_heatmap(d_df.corr())  # Calling heatmap plotting function (see notebook)
```
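As a quick illustration of what `pd.get_dummies` does here (a toy frame, not the project data): each category of a text column becomes its own 0/1 indicator column.

```python
import pandas as pd

# Toy frame with one categorical column, mirroring the encoding step above
toy = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"]})
encoded = pd.get_dummies(toy)

# Numeric columns pass through; 'cut' is split into one indicator per category
print(list(encoded.columns))  # ['carat', 'cut_Ideal', 'cut_Premium']
```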

Correlation Heatmap

(correlation heatmap image)

Feature Engineering

```python
d_df['xy'] = d_df['x'] / d_df['y']               # Ratio of two dimensions replaces both columns
X = d_df.drop(['price', 'x', 'y', 'z'], axis=1)  # Dropping target variable & highly correlated columns
y = d_df['price']
```
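The notebook also splits and scales the data before modeling (a scaler is pickled later on). A minimal sketch of that step, assuming scikit-learn's `train_test_split` and `StandardScaler` with hypothetical stand-in data; the notebook's actual names and parameters may differ:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real feature matrix and target
X = np.array([[0.23, 61.5], [0.21, 59.8], [0.29, 62.4], [0.31, 63.3]])
y = np.array([326, 326, 334, 335])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

s = StandardScaler().fit(X_train)   # Fit the scaler on training data only
X_train_s = s.transform(X_train)    # ...then apply it to both splits
X_test_s = s.transform(X_test)
```

Fitting the scaler on the training split only avoids leaking test-set statistics into the model.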

Model Evaluation

The actual process of building the model isn't very exciting. If you're curious, you can see the project notebook located in the 'resources' directory.
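For context, here is a minimal sketch of how a random forest regressor (the best performer below) might be trained with scikit-learn; synthetic data stands in for the scaled diamond features, and the notebook's actual hyperparameters may differ:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic stand-in for the scaled diamond features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# In-sample fit; a held-out test set gives an honest estimate in practice
print(rf.score(X, y))
```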

RMSE results

  • Think of RMSE as roughly the average difference between the predicted values and the actual values.

| Model | RMSE |
| --- | ---: |
| KNN | 1170.659452 |
| MLR | 1120.830488 |
| RF | 552.995054 |
| Lasso | 1120.723179 |
| Null (mean y value) | 3942.168776 |
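Concretely, RMSE is the square root of the mean of the squared prediction errors. A small worked example with made-up numbers:

```python
import numpy as np

actual = np.array([326, 327, 334])
predicted = np.array([377, 350, 372])

# Squared errors: 2601, 529, 1444 -> mean ~1524.67 -> sqrt ~39.05
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
```

Because the errors are squared before averaging, RMSE penalizes large misses more heavily than a plain average of absolute errors would.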

R² results

  • The closer the R² value is to 1, the better the model fits the data.

| Model | R² |
| --- | ---: |
| KNN | 0.91133 |
| MLR | 0.918722 |
| RF | 0.980161 |
| Lasso | 0.918734 |
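R² compares the model's squared errors against those of the null model that always predicts the mean (the same baseline as the "Null" RMSE row above): R² = 1 − SS_res / SS_tot. A small worked example with made-up numbers:

```python
import numpy as np

actual = np.array([326.0, 327.0, 334.0, 335.0])
predicted = np.array([330.0, 325.0, 336.0, 333.0])

ss_res = np.sum((actual - predicted) ** 2)       # Model's squared errors: 28
ss_tot = np.sum((actual - actual.mean()) ** 2)   # Mean-only baseline's squared errors: 65

r2 = 1 - ss_res / ss_tot  # ~0.569
```

An R² of 0 means the model does no better than predicting the mean; negative values mean it does worse.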

Exporting model

```python
# Persist the trained model, scaler, and processed data to disk
import pickle
import os

# Directory path and file names
directory_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources"
model_file_name = 'random_forest_model.pkl'
scaler_file_name = 'scaler.pkl'
processed_data_file_name = 'processed_diamond_data.csv'

# Full paths
model_full_path = os.path.join(directory_path, model_file_name)
scaler_full_path = os.path.join(directory_path, scaler_file_name)
data_full_path = os.path.join(directory_path, processed_data_file_name)

# Save ML model and scaler to disk
with open(model_full_path, 'wb') as model_file:
    pickle.dump(rf, model_file)

with open(scaler_full_path, 'wb') as scaler_file:
    pickle.dump(s, scaler_file)

# Saving the processed data as a csv
df.to_csv(data_full_path, index=False)
```
The processed data now includes the engineered `xy` ratio and the model's predictions:

```
   carat      cut color clarity  depth  table  price     x     y     z        xy  predictions
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43  0.992462        377.0
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31  1.013021        404.8
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31  0.995086        349.6
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63  0.992908        372.0
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75  0.997701        402.1
```
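Before wiring the pickled model into Power BI, it's worth a sanity check that it survives the round trip and reproduces its predictions. A sketch using an in-memory buffer (the project writes real `.pkl` files; the model here is a throwaway stand-in):

```python
import io
import pickle

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Throwaway model standing in for the project's trained random forest
X, y = make_regression(n_samples=50, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Round-trip through pickle (in memory here; the project uses .pkl files on disk)
buf = io.BytesIO()
pickle.dump(rf, buf)
buf.seek(0)
reloaded = pickle.load(buf)

# The reloaded model must produce identical predictions
assert (reloaded.predict(X) == rf.predict(X)).all()
```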

PowerQuery

The following Power Query (M) module shows how I loaded the processed data into Power BI and ran the model I exported from the project notebook against it.

  • The Python code further below is the actual script from the 'Run_Python_script' step, presented in a more readable format.
```
let
    Source = Csv.Document(File.Contents("C:\Users\conno\workspace\projects\diamond_price_prediction\resources\processed_diamond_data.csv"),[Delimiter=",", Columns=11, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),

    Run_Python_script = Python.Execute("# 'dataset' holds the input data for this script#(lf)
      import pandas as pd#(lf)
      import pickle#(lf)#(lf)# Loading random forest model & scaler#(lf)
      file_path = r""C:\Users\conno\workspace\projects\diamond_price_prediction\resources\random_forest_model.pkl""#(lf)
      scaler_path = r""C:\Users\conno\workspace\projects\diamond_price_prediction\resources\scaler.pkl""#(lf)
      with open(file_path, 'rb') as file:#(lf)
          model = pickle.load(file)#(lf)
      with open(scaler_path, 'rb') as scaler_file:#(lf)
          scaler = pickle.load(scaler_file)#(lf)#(lf)
      # Feature Engineering#(lf)
      d_dataset = pd.get_dummies(dataset)#(lf)
      d_dataset = d_dataset.drop(['price', 'x', 'y', 'z'], axis=1)#(lf)
      X = scaler.transform(d_dataset)#(lf)#(lf)
      # Make predictions#(lf)
      dataset['predictions'] = model.predict(X)",[dataset=PromotedHeaders]
    ),
    dataset = Run_Python_script{[Name="dataset"]}[Value],

    // The index will serve as our data points on the scatter plot
    Added_Index = Table.AddIndexColumn(dataset, "Index", 0, 1, Int64.Type),
    Changed_DType = Table.TransformColumnTypes(Added_Index,
      {{"carat", type number}, {"cut", type text}, {"color", type text}, {"clarity", type text}, {"depth", type number}, {"table", type number}, {"price", Int64.Type}, {"x", type number}, {"y", type number}, {"z", type number}, {"xy", type number}, {"predictions", Int64.Type}}
    )
in
    Changed_DType
```
```python
# 'dataset' holds the input data for this script
import pandas as pd
import pickle

# Loading random forest model & scaler
file_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources\random_forest_model.pkl"
scaler_path = r"C:\Users\conno\workspace\projects\diamond_price_prediction\resources\scaler.pkl"
with open(file_path, 'rb') as file:
    model = pickle.load(file)
with open(scaler_path, 'rb') as scaler_file:
    scaler = pickle.load(scaler_file)

# Feature Engineering
d_dataset = pd.get_dummies(dataset)
d_dataset = d_dataset.drop(['price', 'x', 'y', 'z'], axis=1)
X = scaler.transform(d_dataset)

# Make predictions
dataset['predictions'] = model.predict(X)
```

Power BI

  • Finally, the data and predictions are loaded into Power BI for the final report. After some initial data modeling and measure development, the result is a clean, easy-to-understand report that's ready for end-user consumption.

(Power BI report screenshot)