Add a hint to tell that the row iterator state is the row number in tables with row access #342

ronisbr · 2023-09-06T13:40:27Z

Hi!

We have a problem in PrettyTables.jl related to huge matrices with row access. The current method, AFAIK, to access an element in those tables is to iterate the rows until we reach the desired one and then retrieve the column:

    # Get the i-th row by iterating the row table.
    it, state = iterate(rtable.table)

    for _ in 2:i
        it, state = iterate(rtable.table, state)
        it === nothing && error("The row `i` does not exist.")
    end

This method is OK for any usage that requires to read the entire data. However, when you are printing huge tables, you usually want a part of the table. In the case of PrettyTables.jl, we have a mode in which we print the beginning and the end of the table. Hence, for each printed column, we need to iterate through all the rows to the very bottom, even though we only need a few of them, leading to a very slow process.

The problem appeared here: ronisbr/PrettyTables.jl#217 (comment)

If we add to the Tables.jl API some hint that the state in the iteration is the row id, I can access, let's say, the element in the row 1,000,000 by just:

it, ~ = iterate(stable.table, 1_000_000)

instead of looping through the entire iterator.

In the case of GeoTables.jl, which uses the row ID as the state, the printing process was reduced from 8s to 0.001s when we only show 10 rows and 10 columns with "middle" cropping mode.

bkamins · 2023-09-06T18:32:13Z

I think there would be a need for "trait" that verifies if a given table has a known number of rows and these rows are indexable (as not all tables support this). E.g. nothing would be returned if a table does not support row indexing and otherwise row could be returned.

ablaom · 2023-09-06T21:21:41Z

Yes!

bkamins · 2023-09-07T07:30:09Z

OK. I have looked at it a bit more. So:

the DataAPI.nrow function is in the API.
the Tables.subset function is also in the API.

The only question is how well this API is supported in the ecosystem, but I think that the best way forward is to just start using it and then the packages that miss it can add it.

Probably more work needs to be done with nrow/ncol support which is worse than Tables.subset support currently.

(and in PrettyTables.jl maybe what can be done is e.g. try running nrow and if it throws MethodError do fallback to the slow variant)

bkamins · 2023-09-07T07:47:51Z

Ah - and there is an internal Tables.rowcount function, so it could be used instead of nrow. We just need to wait for @quinnj to comment on the use of Tables.rowcount. Maybe we just need to define in Tables.jl the default fallback:

`DataAPI.nrow(x) = Tables.rowcount(x)`

?

ronisbr · 2023-09-07T12:25:41Z

Hi @bkamins !

But even if the interface allow us to obtain the number of rows, there is the problem about how to obtain the n-th row. There is nothing in the interface describing what the state should be in the iterator Tables.AbstractRow. Sometime ago, I was using the row number as the state. However, SparseVariables.jl uses 0 as the initial state, which was leading to problems:

ronisbr/PrettyTables.jl#174

In fact, I think that state can even be some kind of object, leading to the same problem.

AFAIK, the only way today to safely obtain the n-th row, even if the object can return the number of rows, is to iterate through the rows.

bkamins · 2023-09-07T13:35:56Z

No, the API is Tables.subset to which you pass an index and this index should be 1-based.

In other words: Tables.jl tables should support Tables.subset with 1-based indexing:

julia> df = DataFrame(a=1:5)
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> rt = Tables.rowtable(df)
5-element Vector{NamedTuple{(:a,), Tuple{Int64}}}:
 (a = 1,)
 (a = 2,)
 (a = 3,)
 (a = 4,)
 (a = 5,)

julia> ct = Tables.columntable(df)
(a = [1, 2, 3, 4, 5],)

julia> Tables.subset(df, 4)
DataFrameRow
 Row │ a
     │ Int64
─────┼───────
   4 │     4

julia> Tables.subset(rt, 4)
(a = 4,)

julia> Tables.subset(ct, 4)
(a = 4,)

ronisbr · 2023-09-07T14:00:57Z

Perfect! Thank you very much @bkamins ! I will update PrettyTables.jl to improve this use case.

ronisbr · 2023-09-12T20:43:57Z

Unfortunately, it seems that Tables.subet is only optional. For example, GeoTables does not implement it:

julia> using GeoTables

julia> a = georef((a=rand(1000,1000),));

julia> Tables.subset(a, 1)
ERROR: ArgumentError: no default `Tables.subset` implementation for type: GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}
Stacktrace:
 [1] subset(x::GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}, inds::Int64; viewhint::Nothing, view::Nothing)
   @ Tables ~/.julia/packages/Tables/AcRIE/src/Tables.jl:632
 [2] subset(x::GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}, inds::Int64)
   @ Tables ~/.julia/packages/Tables/AcRIE/src/Tables.jl:608
 [3] top-level scope
   @ REPL[27]:1

@juliohm Is it possible to implement Tables.subset for GeoTables? I am planning to check if there is a valid implementation of Tables.subset to obtain the i-th row. If not, I will fallback to the default iteration. Hence, if you implement this function, that problem related with vcrop_mode = :middle will vanish.

juliohm · 2023-09-12T21:00:14Z

Definitely. We will tackle this tomorrow and will let you know. Em ter., 12 de set. de 2023 17:44, Ronan Arraes Jardim Chagas < ***@***.***> escreveu:

…

Unfortunately, it seems that Tables.subet is only optional. For example, GeoTables does not implement it: julia> using GeoTables julia> a = georef((a=rand(1000,1000),)); julia> Tables.subset(a, 1) ERROR: ArgumentError: no default `Tables.subset` implementation for type: GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}} Stacktrace: [1] subset(x::GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}, inds::Int64; viewhint::Nothing, view::Nothing) @ Tables ~/.julia/packages/Tables/AcRIE/src/Tables.jl:632 [2] subset(x::GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}, inds::Int64) @ Tables ~/.julia/packages/Tables/AcRIE/src/Tables.jl:608 [3] top-level scope @ REPL[27]:1 @juliohm <https://github.com/juliohm> Is it possible to implement Tables.subset for GeoTables? I am planning to check if there is a valid implementation of Tables.subset to obtain the i-th row. If not, I will fallback to the default iteration. Hence, if you implement this function, that problem related with vcrop_mode = :middle will vanish. — Reply to this email directly, view it on GitHub <#342 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZQW3JBBB2ZB6ZP6UOJ34TX2DCRRANCNFSM6AAAAAA4NLS5NU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

bkamins · 2023-09-12T21:08:38Z

And also I guess preferably nrow from DataAPI.jl (if you do not support it already).

juliohm · 2023-09-12T21:13:27Z

We do support nrow and ncol from DataAPI.jl, thanks for checking @bkamins.

As we discussed in another issue, it would be really nice if we could port these names to Tables.nrows and Tables.ncols so that anything related to the tables interface lives in a single package.

bkamins · 2023-09-12T21:25:37Z

so that anything related to the tables interface lives in a single package.

Yes. Let us wait for other maintainers to comment there.

juliohm · 2023-09-13T12:58:10Z

@ronisbr a new patch release is available with the Tables.subset implementation.

Please let us know if something else is missing.

Yes. Let us wait for other maintainers to comment there.

It would be nice if Tables.jl had other maintainers with decision power besides @quinnj, to advance things more quickly.

tpapp · 2023-09-13T14:28:53Z

In general I like the fact that we have a single person who maintains a coherent design. I don't think think that an interface package like Tables.jl should prioritize quick changes; a lot of packages in the ecosystem rely on the API so it should move carefully.

quinnj · 2023-09-13T15:36:41Z

Yeah, Tables.subset is the right answer here, as pointed out by @bkamins.

On the Tables.rowcount question, that's really only an internal function meant to operate on column-oriented table sources. So it's safe to call on the result of Tables.columns, but not otherwise, so that makes me hesitate to set it as the default for DataAPI.nrows.

@juliohm, any JuliaData admin/maintainer is free to make changes to Tables.jl and don't require approval from me. We've built a, IMO, healthy process to act by committee though, hence the invitation from other maintainers to weigh in.

IMO, there's always going to be a bit of awkwardness between DataAPI.jl and Tables.jl interfaces. Some interfaces more obviously belong to one package or the other, but it also doesn't make sense for them to all live in one or the other. The functions in DataAPI have, so far, been those that aren't exactly tables-specific, so a more generic home seems appropriate for packages where it wouldn't make sense to take Tables.jl as a dependency but they want to overload a data-related interface. nrows/ncols are admittedly a bit awkward here because they seem tables related, but in reality, they impose a more strict interface that what Tables.jl requires (i.e. Tables.jl doesn't require the # of rows to be known), so it seemed prudent to have them live in DataAPI. So the awkwardness is understandable, but hopefully the clarification here helps explain the current arrangement?

juliohm · 2023-09-13T15:41:56Z

Can Tables.jl with unknown rows be considered the exceptional case instead of the most common case? Perhaps a separate trait to distinguish the two cases and then move nrows and ncols to Tables.jl? That would be more intuitive IMO. Em qua., 13 de set. de 2023 12:36, Jacob Quinn ***@***.***> escreveu:

…

Yeah, Tables.subset is the right answer here, as pointed out by @bkamins <https://github.com/bkamins>. On the Tables.rowcount question, that's really only an internal function meant to operate on column-oriented table sources. So it's safe to call on the result of Tables.columns, but not otherwise, so that makes me hesitate to set it as the default for DataAPI.nrows. @juliohm <https://github.com/juliohm>, any JuliaData admin/maintainer is free to make changes to Tables.jl and don't require approval from me. We've built a, IMO, healthy process to act by committee though, hence the invitation from other maintainers to weigh in. IMO, there's always going to be a bit of awkwardness between DataAPI.jl and Tables.jl interfaces. Some interfaces more obviously belong to one package or the other, but it also doesn't make sense for them to all live in one or the other. The functions in DataAPI have, so far, been those that aren't exactly tables-specific, so a more generic home seems appropriate for packages where it wouldn't make sense to take Tables.jl as a dependency but they want to overload a *data*-related interface. nrows/ncols are admittedly a bit awkward here because they *seem* tables related, but in reality, they impose a more strict interface that what Tables.jl requires (i.e. Tables.jl doesn't require the # of rows to be known), so it seemed prudent to have them live in DataAPI. So the awkwardness is understandable, but hopefully the clarification here helps explain the current arrangement? — Reply to this email directly, view it on GitHub <#342 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZQW3NOVP2E2MAOWSRCRD3X2HHJLANCNFSM6AAAAAA4NLS5NU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

juliohm · 2023-09-13T15:45:13Z

Alternatively define nrows(table) = undef by default. We just need the name in the context of tables as opposed to general data. Em qua., 13 de set. de 2023 12:41, Júlio Hoffimann < ***@***.***> escreveu:

…

Can Tables.jl with unknown rows be considered the exceptional case instead of the most common case? Perhaps a separate trait to distinguish the two cases and then move nrows and ncols to Tables.jl? That would be more intuitive IMO. Em qua., 13 de set. de 2023 12:36, Jacob Quinn ***@***.***> escreveu: > Yeah, Tables.subset is the right answer here, as pointed out by @bkamins > <https://github.com/bkamins>. > > On the Tables.rowcount question, that's really only an internal function > meant to operate on column-oriented table sources. So it's safe to call on > the result of Tables.columns, but not otherwise, so that makes me > hesitate to set it as the default for DataAPI.nrows. > > @juliohm <https://github.com/juliohm>, any JuliaData admin/maintainer is > free to make changes to Tables.jl and don't require approval from me. We've > built a, IMO, healthy process to act by committee though, hence the > invitation from other maintainers to weigh in. > > IMO, there's always going to be a bit of awkwardness between DataAPI.jl > and Tables.jl interfaces. Some interfaces more obviously belong to one > package or the other, but it also doesn't make sense for them to all live > in one or the other. The functions in DataAPI have, so far, been those that > aren't exactly tables-specific, so a more generic home seems appropriate > for packages where it wouldn't make sense to take Tables.jl as a dependency > but they want to overload a *data*-related interface. nrows/ncols are > admittedly a bit awkward here because they *seem* tables related, but in > reality, they impose a more strict interface that what Tables.jl requires > (i.e. Tables.jl doesn't require the # of rows to be known), so it seemed > prudent to have them live in DataAPI. So the awkwardness is understandable, > but hopefully the clarification here helps explain the current arrangement? > > — > Reply to this email directly, view it on GitHub > <#342 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAZQW3NOVP2E2MAOWSRCRD3X2HHJLANCNFSM6AAAAAA4NLS5NU> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >

bkamins mentioned this issue Sep 7, 2023

Add nrow and ncol fallback implementations #343

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a hint to tell that the row iterator state is the row number in tables with row access #342

Add a hint to tell that the row iterator state is the row number in tables with row access #342

ronisbr commented Sep 6, 2023 •

edited

Loading

bkamins commented Sep 6, 2023

ablaom commented Sep 6, 2023

bkamins commented Sep 7, 2023 •

edited

Loading

bkamins commented Sep 7, 2023

ronisbr commented Sep 7, 2023 •

edited

Loading

bkamins commented Sep 7, 2023

ronisbr commented Sep 7, 2023

ronisbr commented Sep 12, 2023

juliohm commented Sep 12, 2023 via email

bkamins commented Sep 12, 2023

juliohm commented Sep 12, 2023

bkamins commented Sep 12, 2023

juliohm commented Sep 13, 2023

tpapp commented Sep 13, 2023

quinnj commented Sep 13, 2023

juliohm commented Sep 13, 2023 via email

juliohm commented Sep 13, 2023 via email

Add a hint to tell that the row iterator state is the row number in tables with row access #342

Add a hint to tell that the row iterator state is the row number in tables with row access #342

Comments

ronisbr commented Sep 6, 2023 • edited Loading

bkamins commented Sep 6, 2023

ablaom commented Sep 6, 2023

bkamins commented Sep 7, 2023 • edited Loading

bkamins commented Sep 7, 2023

ronisbr commented Sep 7, 2023 • edited Loading

bkamins commented Sep 7, 2023

ronisbr commented Sep 7, 2023

ronisbr commented Sep 12, 2023

juliohm commented Sep 12, 2023 via email

bkamins commented Sep 12, 2023

juliohm commented Sep 12, 2023

bkamins commented Sep 12, 2023

juliohm commented Sep 13, 2023

tpapp commented Sep 13, 2023

quinnj commented Sep 13, 2023

juliohm commented Sep 13, 2023 via email

juliohm commented Sep 13, 2023 via email

ronisbr commented Sep 6, 2023 •

edited

Loading

bkamins commented Sep 7, 2023 •

edited

Loading

ronisbr commented Sep 7, 2023 •

edited

Loading