Define a query data structure for tech.ml.dataset #261

Closed
ezmiller opened this issue Aug 10, 2021 · 1 comment

Comments

@ezmiller
Contributor

We want to define a data structure specification for queries that can become canonical within tech.ml.dataset. Because such a query is plain data, the functions that accept it can inspect it and decide how best to execute it.

A concrete example of a function that could use this query definition is tech.v3.dataset.base/filter-column, whose signature is currently (dataset colname predicate) -> dataset. Here predicate can be a value or an instance of IFn. If filter-column took a query specification instead of an opaque function, it could decide how to execute the filter and pick the path best suited to the data: for example, binary search for ordered data, or the new column index-structure for unordered data. This is the behavior we want to unlock with this change.
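To make the motivation concrete, here is a minimal sketch in plain Clojure. Nothing in it comes from tech.ml.dataset: a vector stands in for a column, the `{:start .. :stop ..}` map is a hypothetical query shape, and `lower-bound`, `filter-by-predicate`, and `filter-by-range` are illustrative names. It only shows why a query expressed as data allows a faster path than an opaque predicate.

```clojure
;; Sketch only: a plain vector of numbers stands in for a dataset column,
;; and the {:start .. :stop ..} map is a hypothetical query shape.

(defn lower-bound
  "Index of the first element of sorted-vec that is >= x (binary search)."
  [sorted-vec x]
  (loop [lo 0
         hi (count sorted-vec)]
    (if (< lo hi)
      (let [mid (quot (+ lo hi) 2)]
        (if (< (nth sorted-vec mid) x)
          (recur (inc mid) hi)
          (recur lo mid)))
      lo)))

(defn filter-by-predicate
  "Opaque predicate: the only option is to test every element."
  [col pred]
  (filterv pred col))

(defn filter-by-range
  "Range described as data: keeps x where start <= x < stop.
  col must be a vector. When the caller knows it is sorted, two binary
  searches locate the matching slice; otherwise fall back to a scan."
  [col {:keys [start stop]} sorted?]
  (if sorted?
    (subvec col (lower-bound col start) (lower-bound col stop))
    (filterv #(and (<= start %) (< % stop)) col)))

(def sorted-col [1 3 3 5 8 13 21])

(filter-by-predicate sorted-col #(and (<= 3 %) (< % 13)))
;; => [3 3 5 8]
(filter-by-range sorted-col {:start 3 :stop 13} true)
;; => [3 3 5 8]
```

Both calls return the same rows, but only the second can skip the full scan, because the function can read the bounds out of the query instead of treating it as a black box.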

@cnuernber laid out a draft of what this might look like in a PR (see here). In it there are two simple query types: :any-of and :range. The filter data structures are maps that include a special key :filter-type and then other keys as needed based on the type of filter.

#:tech.v3.dataset{:filter-type :any-of
                  :values (set item-seq)}

#:tech.v3.dataset{:filter-type :range
                  :start start
                  :start-inclusive? start-inclusive?
                  :comparator comparator
                  :datatype op-dtype
                  :stop stop
                  :stop-inclusive? stop-inclusive?}
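As a rough illustration of how a consumer might interpret these maps, the sketch below dispatches on :tech.v3.dataset/filter-type with a multimethod and turns each query into a predicate. The names `any-of`, `in-range`, and `query->pred` are hypothetical and are not part of tech.ml.dataset. A plain vector stands in for a column, and the :datatype key is left out because this sketch does not need it.

```clojure
;; Hypothetical sketch: any-of, in-range and query->pred are illustrative
;; names, not functions from tech.ml.dataset.

(defn any-of
  "Builds an :any-of query map from a sequence of accepted values."
  [item-seq]
  #:tech.v3.dataset{:filter-type :any-of
                    :values (set item-seq)})

(defn in-range
  "Builds a :range query for the half-open interval [start, stop),
  using clojure.core/compare as the comparator."
  [start stop]
  #:tech.v3.dataset{:filter-type :range
                    :start start
                    :start-inclusive? true
                    :comparator compare
                    :stop stop
                    :stop-inclusive? false})

(defmulti query->pred
  "Turns a query map into a one-argument predicate."
  :tech.v3.dataset/filter-type)

(defmethod query->pred :any-of
  [{:tech.v3.dataset/keys [values]}]
  (fn [x] (contains? values x)))

(defmethod query->pred :range
  [{:tech.v3.dataset/keys [start start-inclusive? comparator
                           stop stop-inclusive?]}]
  (fn [x]
    (let [lo (comparator x start)
          hi (comparator x stop)]
      (and (if start-inclusive? (>= lo 0) (> lo 0))
           (if stop-inclusive? (<= hi 0) (< hi 0))))))

(def col [4 8 15 16 23 42])

(filterv (query->pred (any-of [8 42 99])) col)
;; => [8 42]
(filterv (query->pred (in-range 8 23)) col)
;; => [8 15 16]
```

Compiling to a predicate is only the fallback path. Since the query stays a map, another method on the same :filter-type could instead route a :range query to a binary search or an index lookup when the column supports it.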

It also bears mentioning that this data structure resembles the signature of the tech.v3.dataset.column-index-structure.select-from-index function, which is used to query a column's index structure. That function takes a mode, currently either :slice or :pick, and then a hash map of key-value pairs that specifies the query for that mode (see here). Whatever data structure we end up creating, we could change select-from-index to take that query structure. That would be one case of the query data structure becoming universal among TMD functions.

Another thing to keep in mind is that @ribelo and @genmeblog are working on "lifting" tech.ml.dataset column functions into tablecloth in this PR, which may be worth considering. I'm not sure yet whether the work being done there could influence how we define the query data structure here, or vice versa.

@cnuernber
Collaborator

Closing. I do think we could use data-structure-based queries for things like filter-column and filter, but no one is moving in that direction, and I would rather see users come up with their own pathways and then take some of those pathways and move them up the chain.
