Handling of NaNs #80

naoise-h · 2023-03-20T12:00:48Z

Currently, a NaN in a dataset produces a distinguishing event: for most functions (all except nanmean, nanvar and nanstd) the output is always NaN whenever the input contains one, so the output itself reveals that a NaN is present. Is the best solution for every function to simply ignore NaNs, as nanmean, nanvar and nanstd do?
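
To illustrate the propagation described above, here is a minimal sketch using plain NumPy rather than the library's differentially private functions. It shows only the underlying arithmetic: a single NaN makes the ordinary mean NaN, while `np.nanmean` skips it.

```python
import numpy as np

# One missing value among otherwise ordinary data.
data = np.array([1.0, 2.0, np.nan, 4.0])

# The ordinary mean propagates the NaN to the output.
plain_mean = np.mean(data)

# nanmean ignores the NaN and averages the remaining values [1, 2, 4].
nan_mean = np.nanmean(data)

print(plain_mean, nan_mean)
```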

For single-dimensional problems, there should be no issue, as removing NaNs is a simple deterministic pre-processing step. For multi-dimensional problems, removing whole rows because of a single NaN also discards the valid entries in those rows, which may prove problematic for utility. Is there justification here for doing something fancier, like mapping NaN to a value within the range?
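
The two options in the paragraph above can be sketched as follows. Both helpers, `drop_nan_rows` and `map_nan_to_midpoint`, are hypothetical names made up for this illustration and are not part of the library. The midpoint of the bounds is just one possible "value within the range", chosen here because it does not depend on the data.

```python
import numpy as np

def drop_nan_rows(X):
    """Hypothetical helper: drop NaN entries (1-D) or rows containing any NaN (2-D)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        return X[~np.isnan(X)]
    return X[~np.isnan(X).any(axis=1)]

def map_nan_to_midpoint(X, lower, upper):
    """Hypothetical helper: replace each NaN with the midpoint of its feature's bounds."""
    X = np.asarray(X, dtype=float)
    midpoint = (np.asarray(lower, dtype=float) + np.asarray(upper, dtype=float)) / 2
    return np.where(np.isnan(X), midpoint, X)

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

dropped = drop_nan_rows(X)                                    # only the complete row survives
mapped = map_nan_to_midpoint(X, [0.0, 0.0], [10.0, 10.0])     # all three rows are kept
```

In this small example, dropping rows keeps one of three rows, while mapping keeps every row at the cost of introducing a made-up value for each missing entry.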

inf may also need special consideration, although it is usually handled when the data is clipped: inf clips to the upper bound and -inf to the lower bound. NaN, by contrast, has no obvious value to map to. When the algorithm requires the norm of each row to be clipped (as in LogisticRegression), mapping inf to a value is no longer trivial. Do we map inf to a value that ensures the row's norm matches the clip, or do we also scale the rest of the row?
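
A small NumPy sketch of the behaviour described above, assuming element-wise clipping is done with `np.clip` and norm clipping by rescaling the row (the library's own clipping code may differ). The bounds `0.0` and `1.0` and the name `max_norm` are illustrative values only.

```python
import numpy as np

# Element-wise clipping: -inf and inf land on the bounds, NaN stays NaN.
values = np.array([-np.inf, 0.5, np.inf, np.nan])
clipped = np.clip(values, 0.0, 1.0)

# Norm clipping: a row containing inf has an infinite norm.
row = np.array([1.0, np.inf])
max_norm = 1.0  # illustrative clipping norm
norm = np.linalg.norm(row)

# Naive rescaling multiplies by max_norm / inf == 0, and inf * 0 is NaN,
# so the inf entry turns into NaN instead of being clipped.
with np.errstate(invalid="ignore"):
    scaled = row * (max_norm / norm)

print(clipped, norm, scaled)
```

Rescaling alone therefore does not resolve the question: the inf entry has to be mapped to some finite value before or instead of scaling, which is exactly the choice asked about above.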

naoise-h added the bug and enhancement labels on Mar 20, 2023
