-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance incremental computation support in Texera #2165
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks good and clean! Left some small comments in code. Although, I am not very sure about the new incremental join operator, what is it behavior if the inputs are already retractable?
var rewrittenLogicalPlan = | ||
WorkflowCacheRewriter.transform(logicalPlan, opResultStorage, opsToReuseCache) | ||
rewrittenLogicalPlan.operatorMap.values.foreach(initOperator) | ||
|
||
// perform rewrite to enforce progressive computation constraints | ||
rewrittenLogicalPlan = ProgressiveRetractionEnforcer.enforceDelta(rewrittenLogicalPlan, context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest creating a new variable name for each step of the rewrite, as they are rewrites with different purposes.
private def shouldEmitOutput(): Boolean = { | ||
System.currentTimeMillis - lastUpdatedTime > UPDATE_INTERVAL_MS | ||
} | ||
|
||
private def emitOutputAndResetState(): scala.Iterator[Tuple] = { | ||
lastUpdatedTime = System.currentTimeMillis | ||
val resultIterator = getPartialOutputs() | ||
this.partialObjectsPerKey = new mutable.HashMap[List[Object], List[Object]]() | ||
resultIterator | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see similar code for partial and final aggregate operators to do time-based snapshots to push partial results out. If the time-based snapshot is a universal strategy for incremental operators to push out partial results, is it better to make it a standard framework?
@@ -272,9 +272,9 @@ public BuilderV2 add(String attributeName, AttributeType attributeType, Object f | |||
*/ | |||
public BuilderV2 addSequentially(Object[] fields) { | |||
checkNotNull(fields); | |||
checkSchemaMatchesFields(schema.getAttributes(), Lists.newArrayList(fields)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need such an assertion for the normal tuple fields. if we need to add new fields (e.g., retraction or not), we can treat it separately? If so, I can do it in a future PR.
|
||
import scala.collection.mutable.ArrayBuffer | ||
|
||
object ProgressiveRetractionEnforcer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some doc to explain this enforcer's duty?
@@ -71,6 +72,7 @@ class SpecializedAggregateOpDesc extends AggregateOpDesc { | |||
} | |||
Schema | |||
.newBuilder() | |||
.add(ProgressiveUtils.insertRetractFlagAttr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we can have a Builder.allowRetract()
to add this attribute internally for users?
val builder = Tuple | ||
.newBuilder(operatorSchemaInfo.outputSchemas(0)) | ||
.add(left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a case where the input left tuples and/or right tuples are already supporting retraction?
will revisit after complier refactoring. |
This PR enhances incremental computation support in Texera, including:
PartialAggregateOpExec
andFinalAggregateOpExec
are updated to use incremental computation. They will perdoically emit partial results to downstream.Aggregate
andLineChart
operators now use the new aggregation framework. Other aggregate-based visualizations are not using it as they are now implemented with Python UDFs and HTML visualizations.WordCloud
is not using the new framework, asWordCloud
is a special top-k aggregation.supportRetractableInput
to indicate whether an operator support retractions as input tuples.For detailed technical presentation on incremental computation, see this slide and descriptions in this PR