Streamline Expansion of ReasonSets #330

Open
okennedy opened this issue Jun 26, 2019 · 0 comments

As discussed in the writeup of Variable Generating Relational Algebra and in this paper, Mimir uses special Expression objects called VGTerms (and DataWarnings) to encode uncertainty about a value being computed over. These objects 'tag' results with a warning. Every tag is associated with a 3-part identifier:
( model:String, index:Int, key:Seq[PrimitiveValue] )
Identifiers are generated as part of query processing. The model and index fields are static: They are a fixed part of the query. The key field is generated during query processing (a typical use case is to pass a ROWID).
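As a rough illustration of the static/dynamic split, the identifier can be sketched as a case class whose first two fields are fixed when the query is written and whose `key` is filled in per row. The names `WarningId`, `WarningIdExample`, and `tagFor`, and the simplified `PrimitiveValue` hierarchy, are stand-ins invented for this sketch and are not Mimir's actual classes.

```scala
// Simplified stand-in for Mimir's PrimitiveValue hierarchy (assumption for this sketch).
sealed trait PrimitiveValue
final case class StringPrimitive(v: String) extends PrimitiveValue
final case class IntPrimitive(v: Long) extends PrimitiveValue

// Hypothetical 3-part identifier: (model, index) are static, key is dynamic.
final case class WarningId(model: String, index: Int, key: Seq[PrimitiveValue])

object WarningIdExample {
  // The first parameter list is known from the query text alone;
  // the second is only available while rows are being processed.
  def tagFor(model: String, index: Int)(rowId: String): WarningId =
    WarningId(model, index, Seq(StringPrimitive(rowId)))
}
```

Partially applying `tagFor` with the static fields gives a function from a ROWID to a full identifier, which mirrors how the key is supplied during query processing.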

Specific identifiers are dropped during normal (BestGuess) query processing, and the query tracks only whether a given row or cell has been tagged (and not which specific warning tagged it). To understand what the warning is, a user needs to issue a separate query:

ANALYZE SELECT ....

This query returns a sequence of Reason objects, each of which includes a human-readable explanation and a process for resolving the error.

Analyze queries are handled by AnalyzeUncertainty. Handling these queries is a two-stage process. The first stage is AnalyzeUncertainty.explainSubsetWithoutOptimizing. This is a static pass over the query (i.e., no data is touched) that identifies every VGTerm (and DataWarning) in the query. The static pass returns a collection of ReasonSets. Every ReasonSet includes the static components of the identifier (model, index), as well as a relational algebra query that generates the dynamic components (the fields of key).

The second pass (which presently happens in the front-end) expands each ReasonSet into a collection of Reason objects by executing the query for each ReasonSet, generating the appropriate keys, and filling those into Reasons. Typical usage is through the take function of ReasonSet, which returns a limited number of Reasons.
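The current per-ReasonSet expansion can be sketched roughly as follows. This is a minimal model, not Mimir's code: `ReasonSetSketch`, the simplified `Reason`, and the `keyQuery` thunk (standing in for executing the ReasonSet's relational algebra query) are all assumptions made for illustration, and keys are plain strings rather than PrimitiveValues.

```scala
// Hypothetical, simplified Reason: identifier fields plus a readable message.
final case class Reason(model: String, index: Int, key: Seq[String], message: String)

// Hypothetical ReasonSet: static (model, index) plus a query that yields keys.
// Each call to keyQuery() stands in for executing the key-generating query.
final case class ReasonSetSketch(
  model: String,
  index: Int,
  keyQuery: () => Iterator[Seq[String]]
) {
  // Expand at most n Reasons by running the key query and filling in the keys.
  def take(n: Int): Seq[Reason] =
    keyQuery().take(n).map { k =>
      val keyText = k.mkString(",")
      Reason(model, index, k, "Model " + model + " (index " + index + ") tagged key " + keyText)
    }.toList
}
```

In this model every ReasonSet runs its own `keyQuery`, so N ReasonSets derived from one source query trigger N separate executions.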

This ReasonSet expansion process includes a significant amount of redundant computation, as the ReasonSet queries are generally generated from the same source query. The goal of this project would be to streamline this expansion process, through materialized views, parallel execution, or inlining multiple queries into the same single-pass query.
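One of the options above, inlining several ReasonSet queries into a single pass, might look something like the sketch below. It assumes that each ReasonSet's key query can be reduced to a row-level condition and a key projection over the same source rows, which will not hold for every query. `SinglePassSketch`, `KeyExtractor`, and `expandSinglePass` are hypothetical names, and rows are modelled as simple string maps.

```scala
object SinglePassSketch {
  type Row = Map[String, String]

  // Hypothetical per-ReasonSet description: when a row is tagged, and which key it yields.
  final case class KeyExtractor(
    model: String,
    index: Int,
    condition: Row => Boolean,
    key: Row => Seq[String]
  )

  // Scan the source rows once, collecting up to `limit` keys for every ReasonSet.
  def expandSinglePass(
    source: Iterator[Row],
    sets: Seq[KeyExtractor],
    limit: Int
  ): Map[(String, Int), Vector[Seq[String]]] = {
    val init: Map[(String, Int), Vector[Seq[String]]] =
      sets.map(s => (s.model, s.index) -> Vector.empty[Seq[String]]).toMap
    source.foldLeft(init) { (acc, row) =>
      sets.foldLeft(acc) { (a, s) =>
        val id = (s.model, s.index)
        if (a(id).size < limit && s.condition(row)) a.updated(id, a(id) :+ s.key(row))
        else a
      }
    }
  }
}
```

The trade-off is that the shared scan cannot stop early until every ReasonSet has reached its limit, so a materialized view or parallel execution may be a better fit when only one ReasonSet is needed.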
