
Ordinal encoding not working as expected #275

Open
ablaom opened this issue Sep 16, 2024 · 11 comments
Labels: invalid (This doesn't seem right)

ablaom commented Sep 16, 2024

In stepping through fit for NeuralNetworkRegressor, using the data at the top of the test file regressors.jl, I am getting some unexpected behaviour.

Here is a minimal version of that data giving the same behaviour:

using MLJBase, MLJFlux, Tables

X = (
    Column2 = categorical(repeat(['a', 'b', 'c'], 10)),
    Column3 = categorical(repeat(["b", "c", "d"], 10), ordered = true),
)
y = rand(Float32, 30)

schema(X)
# ┌─────────┬──────────────────┬──────────────────────────────────┐
# │ names   │ scitypes         │ types                            │
# ├─────────┼──────────────────┼──────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{Char, UInt32}   │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{String, UInt32} │
# └─────────┴──────────────────┴──────────────────────────────────┘

And the model:

model = NeuralNetworkRegressor()

Okay, now the following lines are copied from fit, as given in "src/mlj_model_interface.jl" on the dev branch:

# Get input properties
shape = MLJFlux.shape(model, X, y)
cat_inds = MLJFlux.get_cat_inds(X)
pure_continuous_input = isempty(cat_inds)

# Decide whether to enable entity embeddings (e.g., ImageClassifier won't)
enable_entity_embs = MLJFlux.is_embedding_enabled(model) && !pure_continuous_input

# Prepare entity embeddings inputs and encode X if entity embeddings enabled
featnames = []
if enable_entity_embs
    X = MLJFlux.convert_to_table(X)
    featnames = Tables.schema(X).names
end

# entityprops is (index = cat_inds[i], levels = num_levels[i], newdim = newdims[i])
# for each categorical feature
default_embedding_dims = enable_entity_embs ? model.embedding_dims : Dict{Symbol, Real}()
entityprops, entityemb_output_dim =
    MLJFlux.prepare_entityembs(X, featnames, cat_inds, default_embedding_dims)
X, ordinal_mappings = MLJFlux.ordinal_encoder_fit_transform(X; featinds = cat_inds)

At this point I expect X to have Continuous scitype - no more categoricals. However:

schema(X)
# ┌─────────┬──────────────────┬─────────────────────────────────────────┐
# │ names   │ scitypes         │ types                                   │
# ├─────────┼──────────────────┼─────────────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{AbstractFloat, UInt32} │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{AbstractFloat, UInt32} │
# └─────────┴──────────────────┴─────────────────────────────────────────┘

The raw element type is Float32, but these columns are getting wrapped as categorical vectors:

typeof(X.Column2)
CategoricalVector{AbstractFloat, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}} (alias for CategoricalArray{AbstractFloat, 1, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}})
ablaom added the "invalid" label Sep 16, 2024
@EssamWisam (Collaborator) commented:

Thanks for catching this. I believe it's highly likely because I use recode from CategoricalArrays during transform. This was intended to preserve the categorical type (which is useful in MLJTransforms, where the indices are kept as integers, not floats).

I will confirm this and try to implement ordinal_encoder_transform differently to fix. But maybe just not today.
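For illustration, a minimal sketch (assuming CategoricalArrays is available; the data here is hypothetical, not from the test file) of how recode keeps the result wrapped as categorical when mapping levels to numbers:

```julia
using CategoricalArrays

v = categorical(['a', 'b', 'c', 'a'])

# recode returns another CategoricalArray, so the "encoded" entries
# are CategoricalValue wrappers rather than raw floats
w = recode(v, 'a' => 1.0f0, 'b' => 2.0f0, 'c' => 3.0f0)

w isa CategoricalArray  # true: still categorical after encoding
```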

@EssamWisam (Collaborator) commented:

@ablaom could you check now on the entityembeds branch?


ablaom commented Sep 26, 2024

Okay, there is still a problem. For some reason the vectors, although now consisting of Float32 values, still have the abstract element type AbstractFloat:

julia> X.Column2
5-element Vector{AbstractFloat}:
 1.0f0
 2.0f0
 3.0f0
 4.0f0
 5.0f0

julia> [X.Column2...]
5-element Vector{Float32}:
 1.0
 2.0
 3.0
 4.0
 5.0

I looked into this, and it seems to arise from the nature of recode: it does not infer the narrowest element type for the returned vector. I am resolving this in a PR by applying [v...] to the output v (as shown above), but the need for this suggests the recode approach is less than optimal. With this pattern the compiler is not able to resolve the appropriate concrete type, which we actually know a priori.

For present purposes it does not matter too much, but @EssamWisam you may want to revisit your encoders at MLJTransforms to ensure we are not encountering the same issue there. When vectors have an abstract element type, this can lead to a degradation in performance, and to other issues.
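The narrowing behaviour described above can be reproduced in plain Julia; `identity.(v)` is another common idiom (an assumption here, not necessarily what the PR uses) that re-infers a concrete element type:

```julia
# A vector with abstract element type, though every element is a Float32
v = AbstractFloat[1.0f0, 2.0f0, 3.0f0]

w1 = [v...]        # splatting into an array literal promotes to Float32
w2 = identity.(v)  # broadcasting identity also narrows the element type

eltype(v)   # AbstractFloat
eltype(w1)  # Float32
eltype(w2)  # Float32
```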

@EssamWisam (Collaborator) commented:

Thanks for your initiative in applying this fix. Yes, I will revisit for MLJTransforms. I do believe that the ordinal encoder there should return values still wrapped in CategoricalValue but not the frequency or target encoders.

What I am thinking about now is whether doing encoded_vector = map(x -> encode_dict[x], cat_vector) would be more efficient than recode followed by your trick. It's rational to assume it is, unless you tell me otherwise @ablaom.
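A sketch of that map-based approach (the names and data are hypothetical): with a concretely typed dictionary, map infers a concrete element type directly, with no narrowing step needed afterwards:

```julia
cat_vector = ['a', 'b', 'c', 'a']
encode_dict = Dict('a' => 1.0f0, 'b' => 2.0f0, 'c' => 3.0f0)  # Dict{Char, Float32}

# One pass over the data; element type inferred from the dict's value type
encoded_vector = map(x -> encode_dict[x], cat_vector)

eltype(encoded_vector)  # Float32
```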


ablaom commented Sep 26, 2024

If you need "my trick" you are already doing something suboptimal.

I didn't check this, but maybe the problem is not recode at all but that your dictionary (or Pair list) for replacements already has an abstract value type parameter. Is this possible?
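A minimal sketch of that hypothesis (illustrative data only): the same lookup through a dictionary with an abstract value type produces an abstractly typed result, while a concretely typed dictionary does not:

```julia
# Same values, but the value type annotated as the abstract AbstractFloat
d_abstract = Dict{Char, AbstractFloat}('a' => 1.0f0, 'b' => 2.0f0)
d_concrete = Dict('a' => 1.0f0, 'b' => 2.0f0)  # inferred as Dict{Char, Float32}

v1 = map(x -> d_abstract[x], ['a', 'b'])
v2 = map(x -> d_concrete[x], ['a', 'b'])

eltype(v1)  # AbstractFloat -- the supertype is forced through
eltype(v2)  # Float32
```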

@EssamWisam (Collaborator) commented:

Quite embarrassingly, that turned out to be the case. Somehow, it completely slipped my mind that this forces the element type to remain the supertype. I will aim to remove the abstract type annotation in the dict for MLJTransforms as well...

@ablaom I have already done so here in the entityembeds branch, so I believe we no longer need the trick...


ablaom commented Sep 27, 2024

Okay, thanks. I think easiest is for me to separately fix the dict type in #276, which I've not merged yet.

@EssamWisam (Collaborator) commented:

> Okay, thanks. I think easiest is for me to separately fix the dict type in #276, which I've not merged yet.

@ablaom I made a commit there with the fix. Hope it helps.


ablaom commented Nov 3, 2024

Although resolved, I'm re-opening this issue because the test suite currently lacks a test to catch a recurrence, and this has already led to an issue at #281.

@EssamWisam (Collaborator) commented:

Is it implied that I should add tests of a particular nature to the current test suite?


ablaom commented Nov 12, 2024

Not sure I understand the question. I am proposing that new tests be added to the test suite to catch errors like the one encountered in the opening post. Nothing more, nothing less.
