Prioritization of partitions launched by Automation #26207

maxfirman · 2024-11-28T22:37:10Z

The Problem

We have daily partitioned assets that are materialized using automation policies. The consumers of these assets want upstream changes to be propagated to the asset within a few minutes for recent partitions. They care mostly about the freshness of the latest partition.

There are regular "update" events throughout the day that target the latest partition. Occasionally we receive bulk "restatement" events that target a large number of partitions, which triggers a backfill. When this occurs the latest partition can get queued up behind historic partitions and therefore breach its freshness SLA.

Proposed Solution

One idea is to leverage the existing run queue prioritization functionality. In order for this to work, we would need a mechanism to dynamically tag the individual partition runs at "backfill launch time". The runs would be tagged with the dagster/priority value based on the partition age.

A generalisation of this idea would be to let the user define a "tag generating function" for the asset that gets called during run submission. The API for this could look as follows:

from datetime import date

from dagster import asset, DailyPartitionsDefinition


def op_tags_fn(context) -> list[dict]:
    """Tag generating function that computes 'dagster/priority' based on partition age."""
    partition_date = date.fromisoformat(context.partition_key)
    today = date.today()
    partition_age_days = (today - partition_date).days
    # Newer partitions get a higher (less negative) priority value.
    return [{"dagster/priority": -partition_age_days}]


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-11-01"),
    op_tags_fn=op_tags_fn,
)
def my_asset():
    """Asset with backfill partition prioritization."""

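To make the intended ordering concrete, here is a minimal standalone sketch of the same age-based priority calculation, with no Dagster dependency. The helper name `priority_tags` and the fixed `today` value are assumptions for illustration only, and the sketch assumes that tag values are strings and that runs with a higher `dagster/priority` are dequeued first.

```python
from datetime import date


def priority_tags(partition_key: str, today: date) -> dict[str, str]:
    """Return a dagster/priority tag that favours newer daily partitions (hypothetical helper)."""
    age_days = (today - date.fromisoformat(partition_key)).days
    # Assumption: tag values are strings and a higher integer means "launch sooner".
    return {"dagster/priority": str(-age_days)}


today = date(2024, 11, 28)
keys = ["2024-11-01", "2024-11-27", "2024-11-28"]
tags = {key: priority_tags(key, today) for key in keys}

# Order the keys the way a priority-aware queue would: highest priority first.
launch_order = sorted(keys, key=lambda k: int(tags[k]["dagster/priority"]), reverse=True)
print(launch_order)
```

With this scheme the latest partition always carries priority 0, so it would sort ahead of every historic partition in a restatement.
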
Alternative approaches

We have come up with an internal workaround. We split the partitions into "recent" ones (<1 month old), which are handled through an automation policy, and historic ones, which are handled with a scheduled backfill. The scheduled backfill has some intelligence, in that it only targets stale partitions. Nonetheless, this feels like a hack. We would prefer to handle this entirely with Declarative Automation.
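
As a rough sketch of the split described above (not the actual internal implementation), the recent/historic partitioning could be expressed as follows. The function name `split_by_age` and the 30-day cutoff are assumptions standing in for "<1 month old".

```python
from datetime import date, timedelta


def split_by_age(partition_keys, today, recent_days=30):
    """Split daily partition keys into (recent, historic) around an age cutoff."""
    cutoff = today - timedelta(days=recent_days)
    recent, historic = [], []
    for key in partition_keys:
        # Keys on or after the cutoff date count as "recent".
        target = recent if date.fromisoformat(key) >= cutoff else historic
        target.append(key)
    return recent, historic


recent, historic = split_by_age(
    ["2024-09-15", "2024-10-29", "2024-11-28"],
    today=date(2024, 11, 28),
)
print(recent, historic)
```

The recent keys would be left to the automation policy, while the historic keys would be handed to the scheduled backfill.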

maxfirman · 2024-11-30T15:48:45Z

Author

Having considered this further, one issue with my suggestion above is how to handle duplicate tags when multiple assets are materialised in a single run.

A simpler approach could be to add an optional partition_priority enum to the BackfillPolicy to specify the desired priority order. If specified, the partitions would be prioritised according to the lexicographical sort order of the partition key, either FORWARDS or REVERSE.

This could look as follows:

from dagster import asset, BackfillPolicy, PartitionPriority, DailyPartitionsDefinition


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    backfill_policy=BackfillPolicy(partition_priority=PartitionPriority.REVERSE),
)
def my_asset_with_backfill_priority():
    """
    Example asset with a 'REVERSE' partition priority.
    Recent partitions will be given the highest priority
    """

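The ordering semantics of this proposal can be sketched in plain Python. The `PartitionPriority` enum below mirrors the proposed (not existing) enum, and `order_partitions` is a hypothetical helper. ISO-formatted daily partition keys sort lexicographically in chronological order, which is why REVERSE puts the newest partition first.

```python
from enum import Enum


class PartitionPriority(Enum):
    """Stand-in for the proposed enum; not part of Dagster."""

    FORWARDS = "forwards"
    REVERSE = "reverse"


def order_partitions(partition_keys, priority):
    """Sort partition keys lexicographically in the requested direction."""
    return sorted(partition_keys, reverse=(priority is PartitionPriority.REVERSE))


keys = ["2024-01-02", "2024-01-10", "2024-01-01"]
print(order_partitions(keys, PartitionPriority.REVERSE))
print(order_partitions(keys, PartitionPriority.FORWARDS))
```

Because this relies only on string ordering, it would not give a chronological order for partition keys that are not zero-padded or not ISO-formatted.
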
2 replies

OwenKephart Dec 2, 2024
Maintainer

Hi @maxfirman ! The BackfillDaemon is intended to do this sort of prioritization for time-partitioned assets, in which it attempts to materialize the newest time partitions first. Do you have an example where you're not seeing this behavior?

That being said, I'm curious to hear more about the specific failure case for you here -- is it basically that the run queue is filling up because of the backfill runs, and then in a later evaluation a smaller update happens and this gets queued behind the backfill?

maxfirman Dec 6, 2024
Author

Thanks @OwenKephart. To be fair, maybe this was more of a hypothetical concern. I did not realise that the Backfill daemon already sorted the runs in this way. I will do some further testing to confirm that this will work for our use case.

We do have another related issue which is that we need to be able to set different concurrency limits for different assets. I'm aware of the Op/Asset level concurrency feature, however I have concerns about the scalability and reliability of that approach. It would be nice to be able to set "run tags" on the asset's BackfillPolicy. This would allow us to control the concurrency using the run tags. Alternatively the BackfillPolicy could support concurrency as part of its configuration:

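from dagster import asset, BackfillPolicy, DailyPartitionsDefinition


# Proposed: max_concurrency is a hypothetical BackfillPolicy option.
@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    backfill_policy=BackfillPolicy(max_concurrency=2),
)
def my_asset():
    ...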
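
To illustrate what controlling concurrency through run tags means, here is a simplified model of a tag-based limit. It is only a sketch and not Dagster's implementation. The function `admit_runs` and the `asset_group` tag key are made up for the example.

```python
from collections import Counter


def admit_runs(queued, in_progress, tag_key, limits):
    """Pick queued runs that fit under per-value limits for one tag key.

    queued / in_progress: lists of tag dicts, one per run.
    limits: maps a tag value to the maximum number of concurrent runs.
    """
    active = Counter(run.get(tag_key) for run in in_progress)
    admitted = []
    for run in queued:
        value = run.get(tag_key)
        limit = limits.get(value)
        if limit is not None and active[value] >= limit:
            continue  # this tag value is saturated; leave the run queued
        active[value] += 1
        admitted.append(run)
    return admitted


queued = [{"asset_group": "a"}, {"asset_group": "a"}, {"asset_group": "b"}]
in_progress = [{"asset_group": "a"}]
admitted = admit_runs(queued, in_progress, "asset_group", limits={"a": 2})
print(admitted)
```

Here one more run tagged "a" is admitted before the limit of 2 is reached, and the run tagged "b" is unaffected because no limit is set for it.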
