
Ordinal encoding not working as expected #275

Open
ablaom opened this issue Sep 16, 2024 · 11 comments
Labels: invalid (This doesn't seem right)

ablaom commented Sep 16, 2024

In stepping through fit for NeuralNetworkRegressor, using the data at the top of the test file regressors.jl, I am getting some unexpected behaviour.

Here is a minimal version of that data giving the same behaviour:

using MLJBase, MLJFlux, Tables

X = (
    Column2 = categorical(repeat(['a', 'b', 'c'], 10)),
    Column3 = categorical(repeat(["b", "c", "d"], 10), ordered = true),
)
y = rand(Float32, 30)

schema(X)
# ┌─────────┬──────────────────┬──────────────────────────────────┐
# │ names   │ scitypes         │ types                            │
# ├─────────┼──────────────────┼──────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{Char, UInt32}   │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{String, UInt32} │
# └─────────┴──────────────────┴──────────────────────────────────┘

And the model:

model = NeuralNetworkRegressor()

Okay, now the following lines are copied from fit, as given in "src/mlj_model_interface.jl" on the dev branch:

# Get input properties
shape = MLJFlux.shape(model, X, y)
cat_inds = MLJFlux.get_cat_inds(X)
pure_continuous_input = isempty(cat_inds)

# Decide whether to enable entity embeddings (e.g., ImageClassifier won't)
enable_entity_embs = MLJFlux.is_embedding_enabled(model) && !pure_continuous_input

# Prepare entity embeddings inputs and encode X if entity embeddings enabled
featnames = []
if enable_entity_embs
    X = MLJFlux.convert_to_table(X)
    featnames = Tables.schema(X).names
end

# entityprops is (index = cat_inds[i], levels = num_levels[i], newdim = newdims[i])
# for each categorical feature
default_embedding_dims = enable_entity_embs ? model.embedding_dims : Dict{Symbol, Real}()
entityprops, entityemb_output_dim =
    MLJFlux.prepare_entityembs(X, featnames, cat_inds, default_embedding_dims)
X, ordinal_mappings = MLJFlux.ordinal_encoder_fit_transform(X; featinds = cat_inds)

At this point I expect X to have Continuous scitype - no more categoricals. However:

schema(X)
# ┌─────────┬──────────────────┬─────────────────────────────────────────┐
# │ names   │ scitypes         │ types                                   │
# ├─────────┼──────────────────┼─────────────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{AbstractFloat, UInt32} │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{AbstractFloat, UInt32} │
# └─────────┴──────────────────┴─────────────────────────────────────────┘

The raw element type is Float32, but these columns are getting wrapped as categorical vectors:

typeof(X.Column2)
CategoricalVector{AbstractFloat, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}} (alias for CategoricalArray{AbstractFloat, 1, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}})
ablaom added the "invalid" label Sep 16, 2024
@EssamWisam (Collaborator) commented:

Thanks for catching this. I believe it's highly likely because I use recode from CategoricalArrays during transform. This was intended to preserve the categorical type (which is useful in MLJTransforms, where the indices are kept as integers, not floats).

I will confirm this and try to implement ordinal_encoder_transform differently to fix. But maybe just not today.
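For illustration, a minimal sketch (assuming CategoricalArrays is available; the data here is hypothetical, not from the test file) of how recode keeps the result wrapped as categorical when mapping levels to numbers:

```julia
using CategoricalArrays

v = categorical(['a', 'b', 'c', 'a'])

# recode returns another CategoricalArray, so the "encoded" entries
# are CategoricalValue wrappers rather than raw floats
w = recode(v, 'a' => 1.0f0, 'b' => 2.0f0, 'c' => 3.0f0)

w isa CategoricalArray  # true: still categorical after encoding
```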

@EssamWisam (Collaborator) commented:

@ablaom could you check now on the entityembeds branch?


ablaom commented Sep 26, 2024

Okay, there is still a problem. For some reason the vectors, although now consisting of Float32 values, still have the abstract element type AbstractFloat:

julia> X.Column2
5-element Vector{AbstractFloat}:
 1.0f0
 2.0f0
 3.0f0
 4.0f0
 5.0f0

julia> [X.Column2...]
5-element Vector{Float32}:
 1.0
 2.0
 3.0
 4.0
 5.0

I looked into this, and it seems to arise from the nature of recode: it does not infer the narrowest element type for the returned vector. I am resolving this in a PR by applying [v...] to the output v (as shown above), but the need for this suggests the recode approach is less than optimal. With this pattern the compiler is not able to resolve the appropriate concrete type, which we actually know a priori.

For present purposes it does not matter too much, but @EssamWisam you may want to revisit your encoders at MLJTransforms to ensure we are not encountering the same issue there. When vectors have an abstract element type, this can lead to a degradation in performance, and to other issues.
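The narrowing behaviour described above can be reproduced in plain Julia; `identity.(v)` is another common idiom (an assumption here, not necessarily what the PR uses) that re-infers a concrete element type:

```julia
# A vector with abstract element type, though every element is a Float32
v = AbstractFloat[1.0f0, 2.0f0, 3.0f0]

w1 = [v...]        # splatting into an array literal promotes to Float32
w2 = identity.(v)  # broadcasting identity also narrows the element type

eltype(v)   # AbstractFloat
eltype(w1)  # Float32
eltype(w2)  # Float32
```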

@EssamWisam (Collaborator) commented:

Thanks for your initiative in applying this fix. Yes, I will revisit for MLJTransforms. I do believe that the ordinal encoder there should return values still wrapped in CategoricalValue but not the frequency or target encoders.

What I am thinking about now is whether doing encoded_vector = map(x -> encode_dict[x], cat_vector) would be more efficient than recode followed by your trick. It's rational to assume it is, unless you tell me otherwise @ablaom.
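A sketch of that map-based approach (the names and data are hypothetical): with a concretely typed dictionary, map infers a concrete element type directly, with no narrowing step needed afterwards:

```julia
cat_vector = ['a', 'b', 'c', 'a']
encode_dict = Dict('a' => 1.0f0, 'b' => 2.0f0, 'c' => 3.0f0)  # Dict{Char, Float32}

# One pass over the data; element type inferred from the dict's value type
encoded_vector = map(x -> encode_dict[x], cat_vector)

eltype(encoded_vector)  # Float32
```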


ablaom commented Sep 26, 2024

If you need "my trick" you are already doing something suboptimal.

I didn't check this, but maybe the problem is not recode at all but that your dictionary (or Pair list) for replacements already has an abstract value type parameter. Is this possible?
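A minimal sketch of that hypothesis (illustrative data only): the same lookup through a dictionary with an abstract value type produces an abstractly typed result, while a concretely typed dictionary does not:

```julia
# Same values, but the value type annotated as the abstract AbstractFloat
d_abstract = Dict{Char, AbstractFloat}('a' => 1.0f0, 'b' => 2.0f0)
d_concrete = Dict('a' => 1.0f0, 'b' => 2.0f0)  # inferred as Dict{Char, Float32}

v1 = map(x -> d_abstract[x], ['a', 'b'])
v2 = map(x -> d_concrete[x], ['a', 'b'])

eltype(v1)  # AbstractFloat -- the supertype is forced through
eltype(v2)  # Float32
```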

@EssamWisam (Collaborator) commented:

Quite embarrassingly, that turned out to be the case. Somehow, it completely slipped my mind that this forces the element type to remain the supertype. I will aim to remove the abstract type annotation in the dict for MLJTransforms as well...

@ablaom I have already done so here in the entityembeds branch, so I believe we no longer need the trick...


ablaom commented Sep 27, 2024

Okay, thanks. I think easiest is for me to separately fix the dict type in #276, which I've not merged yet.

@EssamWisam (Collaborator) commented:

> Okay, thanks. I think easiest is for me to separately fix the dict type in #276, which I've not merged yet.

@ablaom I made a commit there with the fix. Hope it helps.


ablaom commented Nov 3, 2024

Although resolved, I'm re-opening this issue because the test suite currently lacks a test to catch a recurrence, and this has already led to an issue at #281.

@EssamWisam (Collaborator) commented:

Is it implied that I should add tests of a particular nature to the current test suite?


ablaom commented Nov 12, 2024

Not sure I understand the question. I am proposing that new tests be added to the test suite to catch errors like the one encountered in the opening post. Nothing more, nothing less.
