perf: maintain_order can lead to faster overall operations #20346

mwaddoups · 2024-12-18T11:20:36Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

# Generate a dataset with NUM_BIG_G * NUM_SMALL_G groups across two columns,
# and random data

NUM_BIG_G = 100000
NUM_SMALL_G = 100
N = 2

groups = np.arange(NUM_BIG_G)
small_groups = np.arange(NUM_SMALL_G)
data = np.random.normal(size=NUM_SMALL_G * NUM_BIG_G * N)

df = pl.DataFrame({
    "groups": np.repeat(np.repeat(groups, NUM_SMALL_G), N),
    "groups_2": np.repeat(np.tile(small_groups, NUM_BIG_G), N),
    "data": data,
})


def test_query(maintain_order: bool):

    _, prof = (
        df.lazy()
        .group_by("groups", "groups_2", maintain_order=maintain_order)
        .agg(pl.col("data").first())
        .group_by("groups", maintain_order=False)
        .agg(
            pl.col("data"),
        )
        .profile(no_optimization=True)
    )

    return prof.select("node", pl.col("end") - pl.col("start"))


print(test_query(True))
print(test_query(False))

This outputs

shape: (3, 2)
┌─────────────────────────────────┬─────────┐
│ node                            ┆ end     │
│ ---                             ┆ ---     │
│ str                             ┆ u64     │
╞═════════════════════════════════╪═════════╡
│ optimization                    ┆ 12      │
│ group_by_partitioned(groups, g… ┆ 1552986 │
│ group_by(groups)                ┆ 86992   │
└─────────────────────────────────┴─────────┘

shape: (3, 2)
┌─────────────────────────────────┬─────────┐
│ node                            ┆ end     │
│ ---                             ┆ ---     │
│ str                             ┆ u64     │
╞═════════════════════════════════╪═════════╡
│ optimization                    ┆ 5       │
│ group_by_partitioned(groups, g… ┆ 1261104 │
│ group_by(groups)                ┆ 164829  │
└─────────────────────────────────┴─────────┘

Log output

keys/aggregates are not partitionable: running default HASH AGGREGATION
POLARS_FORCE_PARTITION set: running partitioned HASH AGGREGATION
run PARTITIONED HASH AGGREGATION

keys/aggregates are not partitionable: running default HASH AGGREGATION
POLARS_FORCE_PARTITION set: running partitioned HASH AGGREGATION
run PARTITIONED HASH AGGREGATION

Issue description

I ran into an issue in my code where adding maintain_order=True made some aggregation workflows around 10% faster overall, even though the group_by step is slower.

I've tried to show the behaviour here minimally - although in this case the overall query is slower, the profiling shows that the second group_by is twice as fast following maintain_order=True rather than maintain_order=False.

In general, I see this behaviour for other aggregation ops (not just .first()) and most settings of NUM_BIG_G and NUM_SMALL_G - although it's definitely more pronounced when NUM_BIG_G >> NUM_SMALL_G. Laziness also doesn't matter.

I think some fast path or caching ends up helping out here? But I'm not familiar enough with rust yet to dig the issue all the way down in the rust side. No sorted flags are set on the frame at any point so I don't see anything obvious on the Python side.

Not 100% sure if this is a bug (and the same fast path exists but is missed with maintain_order=False) or just a misunderstanding of how caching works and why it's able to take a faster path here.

Expected behavior

I would expect the time for the second group_by(groups) in profiling to match in both cases.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            Linux-6.6.62-gentoo-mark-dist-x86_64-with-glibc2.39
Python:              3.12.4 (main, Aug 20 2024, 17:14:48) [GCC 13.2.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          2.37.0
great_tables         <not installed>
matplotlib           3.9.4
nest_asyncio         1.6.0
numpy                2.1.3
openpyxl             <not installed>
pandas               2.2.3
pyarrow              <not installed>
pydantic             2.10.3
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

The text was updated successfully, but these errors were encountered:

mwaddoups added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: maintain_order can lead to faster overall operations #20346

perf: maintain_order can lead to faster overall operations #20346

mwaddoups commented Dec 18, 2024

perf: maintain_order can lead to faster overall operations #20346

perf: maintain_order can lead to faster overall operations #20346

Comments

mwaddoups commented Dec 18, 2024

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions