Preserve records batch order when SchemaCastScanExec is involved #70
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
🗣 Description
EnforceDistribution PhisicalOptimizerRule aggressively enforces inputs to be split and processed in parallel which leads to re-ordering of record batches containing few records.
benefits_from_input_partitioning
setting enforces 1-1 relationship betweenSchemaCastScanExec
and its input preventing re-ordering. This does not block partitioning/parallel processing in general, just ensures 1-1 mapping for SchemaCastScanExec and wrapped input.Example plan.
RepartitionExec
is automatically enforced by the optimization. Order of resultant records batches vary as query records are processed in parallel.🤔 Other approaches considered
output_partitioning
forVirtualExecutionPlan
to indicate that output is ordered - same behavior[prefer_existing_sort](https://docs.rs/datafusion/latest/datafusion/config/struct.OptimizerOptions.html#structfield.prefer_existing_sort)
and some other datafusion configuration options - same behaviorEnforceDistribution
optimizer (execution plan is updated by injectingRepartitionExec
and re-initializingSchemaCastScanExec
)