Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: maintain_order can lead to faster overall operations #20346

Open
2 tasks done
mwaddoups opened this issue Dec 18, 2024 · 0 comments
Open
2 tasks done

perf: maintain_order can lead to faster overall operations #20346

mwaddoups opened this issue Dec 18, 2024 · 0 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@mwaddoups
Copy link

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

# Generate a dataset with NUM_BIG_G * NUM_SMALL_G groups across two columns,
# and random data

NUM_BIG_G = 100000
NUM_SMALL_G = 100
N = 2

groups = np.arange(NUM_BIG_G)
small_groups = np.arange(NUM_SMALL_G)
data = np.random.normal(size=NUM_SMALL_G * NUM_BIG_G * N)

df = pl.DataFrame({
    "groups": np.repeat(np.repeat(groups, NUM_SMALL_G), N),
    "groups_2": np.repeat(np.tile(small_groups, NUM_BIG_G), N),
    "data": data,
})


def test_query(maintain_order: bool):

    _, prof = (
        df.lazy()
        .group_by("groups", "groups_2", maintain_order=maintain_order)
        .agg(pl.col("data").first())
        .group_by("groups", maintain_order=False)
        .agg(
            pl.col("data"),
        )
        .profile(no_optimization=True)
    )

    return prof.select("node", pl.col("end") - pl.col("start"))


print(test_query(True))
print(test_query(False))

This outputs

shape: (3, 2)
┌─────────────────────────────────┬─────────┐
│ node                            ┆ end     │
│ ---                             ┆ ---     │
│ str                             ┆ u64     │
╞═════════════════════════════════╪═════════╡
│ optimization                    ┆ 12      │
│ group_by_partitioned(groups, g… ┆ 1552986 │
│ group_by(groups)                ┆ 86992   │
└─────────────────────────────────┴─────────┘

shape: (3, 2)
┌─────────────────────────────────┬─────────┐
│ node                            ┆ end     │
│ ---                             ┆ ---     │
│ str                             ┆ u64     │
╞═════════════════════════════════╪═════════╡
│ optimization                    ┆ 5       │
│ group_by_partitioned(groups, g… ┆ 1261104 │
│ group_by(groups)                ┆ 164829  │
└─────────────────────────────────┴─────────┘

Log output

keys/aggregates are not partitionable: running default HASH AGGREGATION
POLARS_FORCE_PARTITION set: running partitioned HASH AGGREGATION
run PARTITIONED HASH AGGREGATION

keys/aggregates are not partitionable: running default HASH AGGREGATION
POLARS_FORCE_PARTITION set: running partitioned HASH AGGREGATION
run PARTITIONED HASH AGGREGATION

Issue description

I ran into an issue in my code where adding maintain_order=True made some aggregation workflows around 10% faster overall, even though the group_by step is slower.

I've tried to show the behaviour here minimally - although in this case the overall query is slower, the profiling shows that the second group_by is twice as fast following maintain_order=True rather than maintain_order=False.

In general, I see this behaviour for other aggregation ops (not just .first()) and most settings of NUM_BIG_G and NUM_SMALL_G - although it's definitely more pronounced when NUM_BIG_G >> NUM_SMALL_G. Laziness also doesn't matter.

I think some fast path or caching ends up helping out here? But I'm not familiar enough with rust yet to dig the issue all the way down in the rust side. No sorted flags are set on the frame at any point so I don't see anything obvious on the Python side.

Not 100% sure if this is a bug (and the same fast path exists but is missed with maintain_order=False) or just a misunderstanding of how caching works and why it's able to take a faster path here.

Expected behavior

I would expect the time for the second group_by(groups) in profiling to match in both cases.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            Linux-6.6.62-gentoo-mark-dist-x86_64-with-glibc2.39
Python:              3.12.4 (main, Aug 20 2024, 17:14:48) [GCC 13.2.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          2.37.0
great_tables         <not installed>
matplotlib           3.9.4
nest_asyncio         1.6.0
numpy                2.1.3
openpyxl             <not installed>
pandas               2.2.3
pyarrow              <not installed>
pydantic             2.10.3
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@mwaddoups mwaddoups added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

1 participant