Sliding window for pl.Expr.over function #8976
Comments
+1, great idea. I guess we can rewrite this issue request as a "sliding window" clause. |
Yes, I was looking for something like this PySpark function. |
If you want to address the current row and only the first/last row, you can:
df = pl.DataFrame({
"id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
"value": ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
})
df.with_columns(all = pl.col("value") + pl.first("value").over("id") + pl.last("value").over("id"))
# shape: (9, 3)
# ┌─────┬───────┬─────┐
# │ id ┆ value ┆ all │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═══════╪═════╡
# │ 1 ┆ a ┆ aag │
# │ 2 ┆ b ┆ bbh │
# │ 3 ┆ c ┆ cci │
# │ 1 ┆ d ┆ dag │
# │ 2 ┆ e ┆ ebh │
# │ 3 ┆ f ┆ fci │
# │ 1 ┆ g ┆ gag │
# │ 2 ┆ h ┆ hbh │
# │ 3 ┆ i ┆ ici │
# └─────┴───────┴─────┘
If you have a specific example, perhaps a better suggestion could be made. |
For now, I was thinking of doing this instead
|
I don't want to compare it only with the first and current, but rather with everything in between. |
Basically, I want to rewrite this PySpark function in Polars
|
Are you asking to apply functions to an expanding window, in addition to the rolling window operations in Polars? To clarify: are you asking for this to be done per group as well as over a whole frame/series? |
In short, I want this rowsBetween functionality. Basically, using pl.Expr.over I want to be able to get a different result for each element of the group. For example, getting the accumulated sales for each product: I want to group by product, have each slice sorted by month, and apply a sum operation over the current row and all the preceding ones, so January will have only the sales for January, February the sales for January and February, and so on. It is a function that SQL has, and PySpark as well. I think it could be a useful enhancement. |
That sounds like a cumsum:
df = pl.DataFrame(dict(
product = ["A", "B", "A", "C", "C", "B", "A"],
date = ["2022/12/01", "2022/10/02", "2022/01/11", "2022/02/01", "2022/03/02", "2022/04/22", "2022/06/12"],
sales = [12, 11, 13, 14, 20, 11, 7]
))
(df.with_row_count()
.sort(pl.col("date").str.to_datetime().dt.month())
.with_columns(
cumsum = pl.col("sales").cumsum().over("product")
)
)
# shape: (7, 5)
# ┌────────┬─────────┬────────────┬───────┬────────┐
# │ row_nr ┆ product ┆ date ┆ sales ┆ cumsum │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ str ┆ str ┆ i64 ┆ i64 │
# ╞════════╪═════════╪════════════╪═══════╪════════╡
# │ 2 ┆ A ┆ 2022/01/11 ┆ 13 ┆ 13 │
# │ 3 ┆ C ┆ 2022/02/01 ┆ 14 ┆ 14 │
# │ 4 ┆ C ┆ 2022/03/02 ┆ 20 ┆ 34 │ # 14 + 20 (row_nr: 3, 4)
# │ 5 ┆ B ┆ 2022/04/22 ┆ 11 ┆ 11 │
# │ 6 ┆ A ┆ 2022/06/12 ┆ 7 ┆ 20 │ # 13 + 7 (row_nr: 2, 6)
# │ 1 ┆ B ┆ 2022/10/02 ┆ 11 ┆ 22 │ # 11 + 11 (row_nr: 5, 1)
# │ 0 ┆ A ┆ 2022/12/01 ┆ 12 ┆ 32 │ # 13 + 7 + 12 (row_nr: 2, 6, 0)
# └────────┴─────────┴────────────┴───────┴────────┘ |
Maybe not the best example then. However, this is not specific to .cumsum(). |
I also think this would be a useful addition to Polars. The SQL and Spark ‘preceding rows’ and ‘following rows’ syntax is very flexible and allows expressing complex calculations in an easy way. It’s probably the only thing I miss from PySpark. |
How does this relate to |
hi @ritchie46, I believe groupby_rolling does not allow one to 'look ahead' the way the 'preceding / following' syntax does. For example, here is a Stack Overflow question about reproducing the 'preceding / following' syntax in Polars: So this operation is possible in Polars using the solutions from SO, but it's more verbose and potentially less clear than a 'preceding / following' syntax. Overall, adding it could reduce the number of Polars functions to learn when first starting with Polars. A Polars syntax could be something like What do you think? |
There seem to be two questions going on here. For the first, which asks for an expanding/cumulative window, there's a min_periods argument you can set to 1, but then you also need to know the number of rows in the data (or per group), which I guess is an extra step. |
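A minimal sketch of that workaround, assuming the expanding window is emulated with Expr.rolling_sum (window size covering the whole frame, min_periods=1) applied per group via .over. The DataFrame and column names are illustrative, not taken from the thread, and the parameter name follows the comment above (newer Polars versions may call it min_samples):

```python
import polars as pl

df = pl.DataFrame({
    "product": ["A", "B", "A", "B", "A"],
    "sales": [10, 5, 20, 7, 30],
})

# Emulate an expanding ("cumulative") sum per group: make the rolling window
# at least as long as the frame, and set min_periods=1 so the short windows
# at the start of each group still produce a value instead of null.
df.with_columns(
    expanding_sum=pl.col("sales")
    .rolling_sum(window_size=df.height, min_periods=1)
    .over("product")
)
```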
hi @magarick, I agree that it's possible to reproduce most transformations in Polars today (using expressions as building blocks is indeed quite powerful). I think the opportunity is for an easy way to express those transformations. I teach Polars to mid-sized and big organizations, and how to do 'preceding / following' is a question that often comes up. There is oftentimes a way to reproduce the calculations with Polars syntax; however, this would simplify things (in my opinion). To wrap it up: I am not arguing that Polars should do something because other tools do it. Polars has the advantage of not taking all its syntax from other tools and keeping a clean API. However, in this case, it might be worth it, in my opinion. As for the example of preceding = 2i and following = unbounded, I did not take it from a specific use case. Curious to hear @ritchie46's thoughts. |
There is some support with groupby_rolling:
df = pl.DataFrame(dict(
id = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
idx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
val = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
))
(df.groupby_rolling(index_column="idx", by="id", period="5i", offset="-3i")
.agg(window = pl.col("val"))
.with_columns( # just to display it
pl.col("window").list.to_struct(n_field_strategy="max_width")
)
)
# shape: (11, 3)
# ┌─────┬─────┬─────────────────────────┐
# │ id ┆ idx ┆ window │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ struct[5] │
# ╞═════╪═════╪═════════════════════════╡
# │ 1 ┆ 1 ┆ {"a","b","c",null,null} │
# │ 1 ┆ 2 ┆ {"a","b","c","d",null} │
# │ 1 ┆ 3 ┆ {"a","b","c","d","e"} │
# │ 1 ┆ 4 ┆ {"b","c","d","e","f"} │
# │ 1 ┆ 5 ┆ {"c","d","e","f",null} │
# │ 1 ┆ 6 ┆ {"d","e","f",null,null} │
# │ 2 ┆ 7 ┆ {"g","h","i",null,null} │
# │ 2 ┆ 8 ┆ {"g","h","i","j",null} │
# │ 2 ┆ 9 ┆ {"g","h","i","j","k"} │
# │ 2 ┆ 10 ┆ {"h","i","j","k",null} │
# │ 2 ┆ 11 ┆ {"i","j","k",null,null} │
# └─────┴─────┴─────────────────────────┘
You don't get leading nulls though, so the "positions" are not "correct", and I'm not sure if there is a proper way to do unbounded. You can use
I imagine polars-sql will eventually have to support this RANGE BETWEEN FOLLOWING/PRECEDING UNBOUNDED syntax? So it seems like a good idea if a Polars idiom existed for it. |
I ran into this problem while trying to get the count of records that occurred before the current record. It is trivial to get this result using PySpark, but I am having a hard time getting it using Polars. I have created this StackOverflow question, but I am also copying its contents here.

StackOverflow Question

> Given a dataset with records of an event where the event can happen multiple times for the same ID, I want to find the aggregate of the previous records of that ID. Let's say I have the following table:
>
> | id  | datetime       | value |
> |-----|----------------|-------|
> | 123 | 20230101T00:00 | 2     |
> | 123 | 20230101T01:00 | 5     |
> | 123 | 20230101T03:00 | 7     |
> | 123 | 20230101T04:00 | 1     |
> | 456 | 20230201T04:00 | 1     |
> | 456 | 20230201T07:00 | 1     |
> | 456 | 20230205T04:00 | 1     |
>
> I want to create a new column "agg" that adds the previous values of "value" found for that same record to get the following table:
>
> | id  | datetime       | value | agg |
> |-----|----------------|-------|-----|
> | 123 | 20230101T00:00 | 2     | 0   |
> | 123 | 20230101T01:00 | 5     | 2   |
> | 123 | 20230101T03:00 | 7     | 7   |
> | 123 | 20230101T04:00 | 1     | 0   |
> | 456 | 20230201T04:00 | 1     | 1   |
> | 456 | 20230201T07:00 | 1     | 2   |
> | 456 | 20230205T04:00 | 1     | 3   |
>
> [Polars documentation](https://pola-rs.github.io/polars-book/user-guide/expressions/window/) says there is a window function, but it is not clear how to collect just the previous values of the current record. I know it is possible to do this with PySpark using:
>
> ```python
> window = Window.partitionBy('id').orderBy('datetime').rowsBetween(Window.unboundedPreceding, -1)
>
> (
>     df_pyspark
>     .withColumn('agg', f.sum('value').over(window).cast('int'))
>     .fillna(0, subset=['agg'])
> )
> ```
>
> ---
>
> I tried using

EDIT: There is an answer on StackOverflow that works for my case; maybe it solves some other people's problems here too. I still believe the solution isn't as intuitive as the PySpark Window, but it works for now. |
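For reference, a sketch of how "sum of all preceding rows per id" could be written with current Polars expressions; the shifted cum_sum trick is my own suggestion, not something confirmed in this thread:

```python
import polars as pl

df = pl.DataFrame({
    "id": [123, 123, 123, 123, 456, 456, 456],
    "datetime": [
        "20230101T00:00", "20230101T01:00", "20230101T03:00", "20230101T04:00",
        "20230201T04:00", "20230201T07:00", "20230205T04:00",
    ],
    "value": [2, 5, 7, 1, 1, 1, 1],
})

# Sort so the cumulative sum runs in time order, then subtract the current
# row's value to exclude it: effectively "UNBOUNDED PRECEDING .. -1" per id.
df.sort("datetime").with_columns(
    agg=(pl.col("value").cum_sum() - pl.col("value")).over("id")
)
```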
Honestly, this is the biggest drawback holding me back from switching much of my current code from PySpark to Polars. |
The particular example you've given is sort of possible - I believe it is what Pandas calls an "expanding window".
So it can be expressed in terms of rolling:
(I'm also not sure of how "efficient" this ends up being)
pl.Config(fmt_table_cell_list_len=-1)
df = pl.DataFrame(dict(letter=list("abcdefgh"))).with_row_index()
# period = df.height
# offset = -(df.height + 1)
(df.rolling(index_column=pl.col("index").cast(int), period="8i", offset="-9i")
.agg(
window = pl.col("letter")
)
)
# shape: (8, 2)
# ┌───────┬─────────────────────────────────────┐
# │ index ┆ window │
# │ --- ┆ --- │
# │ i64 ┆ list[str] │
# ╞═══════╪═════════════════════════════════════╡
# │ 0 ┆ [] │
# │ 1 ┆ ["a"] │
# │ 2 ┆ ["a", "b"] │
# │ 3 ┆ ["a", "b", "c"] │
# │ 4 ┆ ["a", "b", "c", "d"] │
# │ 5 ┆ ["a", "b", "c", "d", "e"] │
# │ 6 ┆ ["a", "b", "c", "d", "e", "f"] │
# │ 7 ┆ ["a", "b", "c", "d", "e", "f", "g"] │
# └───────┴─────────────────────────────────────┘
(There was an issue for adding dedicated "expanding" functions but it has gone stale: #4799) |
@cmdlineluser I've tried experimenting with offset and period and yeah, for some small window sizes it seems to work. However, trying to reproduce an "unbounded preceding" window causes my machine to crash with an OOM, even on a big cluster. I believe it's just not optimized for this, as it seems unable to compute it in lazy mode (that's just one hypothesis). |
Any update on this? |
I've been using Polars heavily for the past year and it has become my tool of choice. However, the lack of a Polars equivalent of Spark's
from datetime import date
import duckdb
import polars as pl
input_df = pl.DataFrame(
{
"element_id": [1, 1, 1, 2, 2, 2, 2],
"observation_dt": [
date(2024, 5, 25),
date(2024, 6, 1),
date(2024, 6, 7),
date(2024, 5, 7),
date(2024, 5, 12),
date(2024, 5, 14),
date(2024, 5, 20),
],
"feature": [0, 2, 1, 3, 6, 1, 0],
}
)
query = """
SELECT element_id,
observation_dt,
sum(feature) OVER (PARTITION BY element_id ORDER BY observation_dt RANGE BETWEEN INTERVAL 14 DAYS PRECEDING AND INTERVAL 1 DAYS PRECEDING) as transformed_feature
FROM input_df
"""
output_df = duckdb.sql(query).pl()
Where
and
I know that one workaround is to use |
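For this particular 14-days-preceding / 1-day-preceding window, one possible approximation uses DataFrame.rolling with a negative offset. This is a sketch under my own assumptions, not necessarily the workaround the author refers to, and the period/offset/closed combination should be checked against the DuckDB output:

```python
out = (
    input_df.sort("element_id", "observation_dt")
    .rolling(
        index_column="observation_dt",
        period="14d",    # window length of 14 days
        offset="-15d",   # start the window 15 days before the current row...
        closed="right",  # ...excluding that lower bound, so it spans roughly [-14d, -1d]
        group_by="element_id",
    )
    .agg(pl.col("feature").sum().alias("transformed_feature"))
)
```

One likely difference from the SQL result: rows whose window is empty come back with a sum of 0 rather than the NULL DuckDB produces.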
This problem still exists in polars 1.0.0. I believe that the PySpark solution using windows and unbounded preceding is the way to go for the Polars API. |
I hope a PySpark-like solution using rowsBetween will be available soon. |
This cannot work when each "group" has a different number of elements. |
A possible solution comes from using the SQLContext from Polars 1.0:
df = pl.DataFrame({
'NAME': ['a', 'a', 'a', 'b', 'b'],
'INDEX': [0, 1, 2, 0, 2],
'VAL': [1,1,1,1,1]
})
sql = pl.SQLContext()
sql.register('df', df)
sql.execute(
f"""
SELECT
NAME,
INDEX,
SUM(VAL) OVER (PARTITION BY NAME ORDER BY INDEX ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CUMSUM
FROM
df
"""
).collect()
I get the following
instead of the cumulative sum over the
|
@CarloNicolini I can't find anything in
It seems it's just silently ignored at the moment. |
Yeah, I tried with polars.sql too and the results are wrong. Probably this is intended, as there is no specific Polars API available yet for |
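For comparison, the cumulative sum that the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW clause is meant to produce can be obtained at the expression level; a sketch reusing the df defined in the comment above, with current Polars naming:

```python
# Order rows by INDEX so cum_sum accumulates in the intended order,
# then compute the running sum within each NAME partition.
df.sort("INDEX").with_columns(
    CUMSUM=pl.col("VAL").cum_sum().over("NAME")
)
```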
Any update on this issue? I’m in the process of porting a Spark code-base with a significant amount of |
(input_df
.with_row_index()
.join_where(input_df,
pl.col("element_id") == pl.col("element_id_right"),
pl.col("observation_dt_right").is_between(
pl.col("observation_dt").dt.offset_by("-14d"),
pl.col("observation_dt").dt.offset_by("-1d"),
)
)
.group_by("index")
.agg(
pl.col("element_id", "observation_dt").first(),
pl.col("feature_right").sum().alias("transformed_feature")
)
)
# shape: (5, 4)
# ┌───────┬────────────┬────────────────┬─────────────────────┐
# │ index ┆ element_id ┆ observation_dt ┆ transformed_feature │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ i64 ┆ date ┆ i64 │
# ╞═══════╪════════════╪════════════════╪═════════════════════╡
# │ 5 ┆ 2 ┆ 2024-05-14 ┆ 9 │
# │ 4 ┆ 2 ┆ 2024-05-12 ┆ 3 │
# │ 6 ┆ 2 ┆ 2024-05-20 ┆ 10 │
# │ 2 ┆ 1 ┆ 2024-06-07 ┆ 2 │
# │ 1 ┆ 1 ┆ 2024-06-01 ┆ 0 │
# └───────┴────────────┴────────────────┴─────────────────────┘
I suppose you could compare
But I'm guessing an SQL engine uses some other algorithms/optimizations, and these are not actually implemented as just non-equi joins? |
One way to avoid burdening the core Polars project (and its developers) with the panoply of exotic (or not) group-by requests would be to allow users to construct custom groupings (similar in spirit to the expression plugin API) and then pass them to the compute engine for executing aggregations. I filed an issue about this a while ago: #16584. @ritchie46, does something like this sound feasible? |
Problem description
I would like pl.Expr.over to have a parameter that determines the window edges, e.g. taking only the first row through the current row, or the current row through the last. So, not only the whole group, but also just parts of the group.
Maybe there is already such a feature, but I could not find it.
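For aggregate expressions specifically, the two window shapes described here ("first row through current row" and "current row through last row") can already be approximated per group with cumulative expressions; a sketch with made-up column names, not a substitute for the requested parameter:

```python
import polars as pl

df = pl.DataFrame({
    "product": ["A", "A", "A", "B", "B"],
    "sales": [1, 2, 3, 4, 5],
})

df.with_columns(
    # first row .. current row, per group
    from_start=pl.col("sales").cum_sum().over("product"),
    # current row .. last row, per group
    to_end=pl.col("sales").cum_sum(reverse=True).over("product"),
)
```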