OpenML integration: Columns as Symbols #579

Closed
drcxcruz opened this issue Jun 27, 2020 · 4 comments

Comments

@drcxcruz
Contributor

drcxcruz commented Jun 27, 2020

Describe the bug

Some dataframe column names show up as "normal" symbols (e.g. :age), while others show up as Symbol("...") because they contain a hyphen. Columns named this way are painful to get to :)

For example, while the following line works fine
dfAdult |> @filter(_.workclass == "Self-emp-not-inc")

the next statement does not work because native-country is stored as Symbol("native-country")
dfAdult |> @filter(_.native-country == "Mexico")

To Reproduce

using MLJ, DataFrames, Queryverse

rowtable = OpenML.load(1590)
dfAdult = DataFrame(rowtable)

julia> names(dfAdult)
15-element Array{Symbol,1}:
 :age
 :workclass
 :fnlwgt
 :education
 Symbol("education-num")
 Symbol("marital-status")
 :occupation
 :relationship
 :race
 :sex
 Symbol("capital-gain")
 Symbol("capital-loss")
 Symbol("hours-per-week")
 Symbol("native-country")
 :class


julia> dfAdult |> @filter(_.workclass == "Self-emp-not-inc")
julia> dfAdult |> @filter(_.native-country == "Mexico")

Error showing value of type QueryOperators.EnumerableFilter{NamedTuple{(:age, :workclass, :fnlwgt, :education, Symbol("education-num"), Symbol("marital-status"), :occupation, :relationship, :race, :sex, Symbol("capital-gain"), Symbol("capital-loss"), Symbol("hours-per-week"), Symbol("native-country"), :class),Tuple{Int64,SubString{String},Int64,SubString{String},Int64,SubString{String},SubString{String},SubString{String},SubString{String},SubString{String},Int64,Int64,Int64,SubString{String},SubString{String}}},QueryOperators.EnumerableIterable{NamedTuple{(:age, :workclass, :fnlwgt, :education, Symbol("education-num"), Symbol("marital-status"), :occupation, :relationship, :race, :sex, Symbol("capital-gain"), Symbol("capital-loss"), Symbol("hours-per-week"), Symbol("native-country"), :class),Tuple{Int64,SubString{String},Int64,SubString{String},Int64,SubString{String},SubString{String},SubString{String},SubString{String},SubString{String},Int64,Int64,Int64,SubString{String},SubString{String}}},Tables.DataValueRowIterator{NamedTuple{(:age, :workclass, :fnlwgt, :education, Symbol("education-num"), Symbol("marital-status"), :occupation, :relationship, :race, :sex, Symbol("capital-gain"), Symbol("capital-loss"), Symbol("hours-per-week"), Symbol("native-country"), :class),Tuple{Int64,SubString{String},Int64,SubString{String},Int64,SubString{String},SubString{String},SubString{String},SubString{String},SubString{String},Int64,Int64,Int64,SubString{String},SubString{String}}},Tables.Schema{(:age, :workclass, :fnlwgt, :education, Symbol("education-num"), Symbol("marital-status"), :occupation, :relationship, :race, :sex, Symbol("capital-gain"), Symbol("capital-loss"), Symbol("hours-per-week"), Symbol("native-country"), :class),Tuple{Int64,SubString{String},Int64,SubString{String},Int64,SubString{String},SubString{String},SubString{String},SubString{String},SubString{String},Int64,Int64,Int64,SubString{String},SubString{String}}},Tables.RowIterator{NamedTuple{(:age, 
:workclass, :fnlwgt, :education, Symbol("education-num"), Symbol("marital-status"), :occupation, :relationship, :race, :sex, Symbol("capital-gain"), Symbol("capital-loss"), Symbol("hours-per-week"), Symbol("native-country"), :class),Tuple{Array{Int64,1},Array{SubString{String},1},Array{Int64,1},Array{SubString{String},1},Array{Int64,1},Array{SubString{String},1},Array{SubString{String},1},Array{SubString{String},1},Array{SubString{String},1},Array{SubString{String},1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{SubString{String},1},Array{SubString{String},1}}}}}},var"#16#18"}:
ERROR: type NamedTuple has no field native

Expected behavior
dfAdult |> @filter(_.native-country == "Mexico")

to work without error :)

Additional context

Thank you for your time and expertise. Let me know if I am misunderstanding how Symbol Columns work in Julia.

Versions
Julia 1.4.2

(@v1.4) pkg> status
Status C:\Users\BCP\.julia\environments\v1.4\Project.toml
[336ed68f] CSV v0.6.1
[324d7699] CategoricalArrays v0.7.7
[aaaa29a8] Clustering v0.14.1
[861a8166] Combinatorics v1.0.2
[d58978e5] Dagger v0.8.0
[a93c6f00] DataFrames v0.20.2
[7806a523] DecisionTree v0.10.5
[31c24e10] Distributions v0.23.4
[38e38edf] GLM v1.3.9
[0e44f5e4] Hwloc v1.0.3
[09f84164] HypothesisTests v0.10.0
[7073ff75] IJulia v1.21.2
[b1bec4e5] LIBSVM v0.4.0
[add582a8] MLJ v0.11.2
[a7f614a8] MLJBase v0.13.5
[e80e1ace] MLJModelInterface v0.2.7
[d491faf4] MLJModels v0.9.10
[e1d29d7a] Missings v0.4.3
[6f286f6a] MultivariateStats v0.7.0
[5fb14364] OhMyREPL v0.5.5
[54e16d92] PrettyPrinting v0.2.0
[1a8c2f83] Query v0.12.2
[612083be] Queryverse v0.6.1
[8523bd24] ShapML v0.3.0
[2913bbd2] StatsBase v0.33.0
[bd369af6] Tables v1.0.4
[112f6efa] VegaLite v2.2.0
[009559a3] XGBoost v1.1.1

@darenasc
Collaborator

Thanks @drcxcruz for reporting this. I'll take a look at it.

@darenasc
Collaborator

Hi @drcxcruz, the following options should work for filtering the data.

filter(row -> row[Symbol("native-country")] == "Mexico", dfAdult)
filter(row -> row[:workclass] == "Self-emp-not-inc", dfAdult)
filter(row -> row["workclass"] == "Self-emp-not-inc", dfAdult)

The dash in the column names seems to be what trips up the @filter macro. I'll be looking into that.
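
For anyone wondering where "type NamedTuple has no field native" comes from, here is a minimal sketch using only Base Julia. It suggests that the parser reads the hyphen as a minus sign, so the row is asked for a field called native. The row below is a made-up stand-in for one record of the dataset, not real OpenML output, and the variable names are only for illustration.

```julia
# Base Julia only; no packages needed.

# The parser treats the hyphen as subtraction, so this expression is read as
# (row.native - country) == "Mexico".
ex = Meta.parse("row.native-country == \"Mexico\"")
lhs = ex.args[2]        # left-hand side of the == comparison
println(lhs.args[1])    # operator applied on the left-hand side
println(lhs.args[2])    # first operand of that operator

# Hypothetical stand-in for a single row with a hyphenated field name.
row = NamedTuple{(:workclass, Symbol("native-country"))}(("Private", "Mexico"))

# Looking the field up with an explicit Symbol sidesteps the parsing problem.
is_mexico = getproperty(row, Symbol("native-country")) == "Mexico"
println(is_mexico)
```

The same idea is behind the row[Symbol("native-country")] form above: the name is passed as a value instead of being written as an identifier.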

@drcxcruz
Contributor Author

hi,

The - character is also problematic when I use CSV.File to open the data file. A simple way to tackle the issue is to convert every - to _ in the column names.

Thanks
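
A rough sketch of that conversion, assuming the column names are available as a vector of Symbols (as names(dfAdult) returns in the output above). The cols vector here is a shortened, hand-typed copy of those names, and clean is just an illustrative variable name.

```julia
# Base Julia only; a shortened copy of the column names from the report.
cols = [:age, :workclass, Symbol("education-num"), Symbol("native-country")]

# Replace every hyphen with an underscore and turn the result back into a Symbol.
clean = [Symbol(replace(String(c), "-" => "_")) for c in cols]
println(clean)

# With DataFrames loaded, the new names could then be applied in place, e.g.
# rename!(dfAdult, clean)   # assumes clean covers every column, in order
```

After renaming, a column such as native_country is a plain identifier, so _.native_country inside @filter should no longer be parsed as a subtraction.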

@ablaom
Member

ablaom commented Jul 8, 2020

ablaom closed this as completed Jul 8, 2020
3 participants