How do I make an accurate prediction on new data when the target variable is missing? #18

Open
scoroman opened this issue Jul 6, 2024 · 0 comments

scoroman commented Jul 6, 2024

Hi, I get a shape error when trying to make predictions on new, unseen data that lacks the target variable the model(s) were trained on, so I use placeholder values for the target variable as a substitute for the missing data. However, when I use placeholders such as np.zeros, previous values, averages, etc., my prediction error goes from <1% to at least 8% :(

# Old data
X_train, X_test, y_train, y_test = train_test_split(X, y)  

def tell_me_about_the_data(*arrays):
  .....

tell_me_about_the_data(X_train, X_test, y_train, y_test)
# These datasets are np.arrays of shape (1999, 42), so the model(s) will be trained on 42 features.

# import, train, fine tune and fit the model(s)
   ........

# New data
tell_me_about_the_data(new_data)
# This dataset is an np.array of shape (30, 41): only 41 features, while the best models were trained and fit on 42. A single forward-pass prediction on a dataset whose target variable is unknown or missing therefore raises a shape error.

make_predictions = model.predict(new_data)
# Raises a ValueError (or whichever error signals a shape mismatch): got shape (x, 41), but the model expected shape (x, 42)


# Using placeholders for the target_variable column fixes the shape error but gives poor predictions: the error rises to 8% or more
new_data['target_variable'] = np.zeros(len(new_data))  # or the average of old_data, or some other filler
make_predictions = model.predict(new_data)
# MSE = 25%

last_known_target_features = old_data['target_variable'].tail(30).to_numpy()  # plain array, so assignment is positional rather than index-aligned
new_data['target_variable'] = last_known_target_features
make_predictions = model.predict(new_data)
# MSE = 8%

# The original model's MSE on the held-out y_test set is < 1%. I want predictions on new data to get close to the < 1% error I originally trained and tested on.
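
A possible reading of the mismatch, offered as a guess because the training code above is elided: if X has 42 columns and new_data has 41, the target column may have been left inside X when the model was fit. The sketch below shows the alternative of splitting the target off before fitting, so the model only ever expects the 41 columns that new data will have. Everything in it is an assumption for illustration: the synthetic data, the position of the target as the last column, and the plain NumPy least-squares fit standing in for whatever model(s) were actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the old data: 41 feature columns plus 1 target column.
n_rows, n_features = 1999, 41
features = rng.normal(size=(n_rows, n_features))
true_weights = rng.normal(size=n_features)
target = features @ true_weights + rng.normal(scale=0.01, size=n_rows)
old_data = np.column_stack([features, target])  # shape (1999, 42)

# Separate the target BEFORE fitting, so the model only ever sees 41 inputs.
# Assumption: the target is the last column.
X = old_data[:, :-1]  # shape (1999, 41)
y = old_data[:, -1]   # shape (1999,)

# Simple hold-out split (75% train, 25% test).
split = int(0.75 * n_rows)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def add_bias(a):
    """Append a column of ones so the fit includes an intercept."""
    return np.column_stack([a, np.ones(len(a))])


# Ordinary least squares as a placeholder for the real model.
weights, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)


def predict(a):
    """Predict the target from a (n, 41) feature array."""
    return add_bias(a) @ weights


test_mse = np.mean((predict(X_test) - y_test) ** 2)

# New data has the same 41 feature columns and no target column,
# so no placeholder is needed.
new_data = rng.normal(size=(30, n_features))
predictions = predict(new_data)
print(X.shape, new_data.shape, predictions.shape)
```

If the target really was among the 42 training columns, the <1% test error would partly reflect the model reading the answer from its own input, which would also explain why any placeholder makes the error jump.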