DataFrame initialization from list of dataclasses is slow #20333

KilianKW · 2024-12-17T10:57:11Z

Description

First of all, thank you very much for polars! This is my first issue, and I wasn't sure whether this is a bug or a feature request.

I am using a list of dataclasses.dataclass instances to initialize a pl.DataFrame, which works fine. However, the initialization takes much longer than when I first convert the list of dataclass instances to a dict of columns and initialize the DataFrame from that:

import dataclasses
import time

import polars as pl

@dataclasses.dataclass
class Row:
    int_value: int
    str_value: str
    list_value: list[int]

ROWS = [
    Row(index, str(index), [index]*100) for index in range(10000)
]

def df_from_rows():
    return pl.DataFrame(ROWS)

def df_from_dict():
    data_dict = {}
    for field in dataclasses.fields(Row):
        data_dict[field.name] = [getattr(row, field.name) for row in ROWS]
    return pl.DataFrame(data_dict)

start = time.perf_counter()
df_from_rows()
end = time.perf_counter()
print(f"Time taken to create DataFrame from rows: {end - start}")

start = time.perf_counter()
df_from_dict()
end = time.perf_counter()
print(f"Time taken to create DataFrame from dict: {end - start}")

With polars 1.17.1 I get

Time taken to create DataFrame from rows: 0.5881143999995402
Time taken to create DataFrame from dict: 0.06102950000058627

I don't know the details, but my guess is that each dataclass instance is first converted to a tuple, which creates deep copies of its fields.
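The copying behaviour guessed at here can be checked with the standard library alone. The following is a minimal sketch, using a cut-down `Row` defined only for this illustration; it shows that `dataclasses.astuple()` and `dataclasses.asdict()` rebuild list fields rather than reusing them. It says nothing about what polars does internally.

```python
import dataclasses


@dataclasses.dataclass
class Row:
    int_value: int
    list_value: list[int]


row = Row(1, [1, 2, 3])

# astuple() and asdict() recurse into the fields and rebuild containers,
# so the list in the result is an equal but distinct object.
as_tuple = dataclasses.astuple(row)
as_dict = dataclasses.asdict(row)

print(as_tuple[1] == row.list_value)            # True: same contents
print(as_tuple[1] is row.list_value)            # False: a new list was built
print(as_dict["list_value"] is row.list_value)  # False: likewise for asdict()
```

With 10000 rows of 100-element lists, as in the benchmark above, that means every list is rebuilt once before the data reaches the DataFrame constructor.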

Would it be possible to enhance the initialization speed for this use case?


vanheck · 2024-12-17T11:37:08Z

Hi, polars uses dataclasses.asdict or dataclasses.astuple; you can check it here:

polars/py-polars/polars/_utils/construction/dataframe.py

Line 798 in b82b2b2

def _sequence_of_dataclasses_to_pydf(

def df_from_dict2():
    dicts = [dataclasses.asdict(md) for md in ROWS]
    return pl.DataFrame(dicts)

start = time.perf_counter()
df_from_dict2()
end = time.perf_counter()
print(f"Time taken to create DataFrame from dict: {end - start}")

This takes about as long as your df_from_rows().

KilianKW · 2024-12-17T11:55:22Z

Thanks for the insight. df_from_dict2() takes about the same time as df_from_rows() (i.e., around 10 times slower than df_from_dict()).

Would it make sense to incorporate my workaround (see df_from_dict(), probably in a more sophisticated form) into the DataFrame constructor?
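One possible generalization of the workaround is sketched below. The helper name `dataclasses_to_columns` is hypothetical and not part of polars; the sketch assumes all rows are instances of the same dataclass and reads the field names from the first row.

```python
import dataclasses
from typing import Any, Sequence


def dataclasses_to_columns(rows: Sequence[Any]) -> dict[str, list[Any]]:
    """Hypothetical helper: turn dataclass instances into a dict of columns."""
    if not rows:
        return {}
    # Assumption: every element has the same fields as the first one.
    names = [f.name for f in dataclasses.fields(rows[0])]
    # getattr() returns the stored value itself, so nothing is copied.
    return {name: [getattr(row, name) for row in rows] for name in names}


@dataclasses.dataclass
class Row:
    int_value: int
    str_value: str
    list_value: list[int]


rows = [Row(i, str(i), [i] * 3) for i in range(3)]
columns = dataclasses_to_columns(rows)
print(columns["int_value"])   # [0, 1, 2]
print(columns["list_value"])  # [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
```

The resulting dict can be passed to `pl.DataFrame(columns)` in the same way as in `df_from_dict()`. Unlike `dataclasses.asdict()`, this sketch does not expand nested dataclass fields into dicts, so a real implementation would need to decide how to handle them.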

vanheck · 2024-12-17T13:54:14Z

Your implementation makes sense! I don't know (I'm not a polars developer), but I think a pull request would be appreciated. Then again, wouldn't it be better to propose something like dataclasses.asdict(deepcopy=False) directly to Python?
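For reference, `dataclasses.asdict()` currently has no `deepcopy` parameter. The Python documentation suggests a dict comprehension over `dataclasses.fields()` as a shallow alternative. The sketch below wraps it in a helper whose name, `shallow_asdict`, is made up for this illustration.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass
class Row:
    int_value: int
    list_value: list[int]


def shallow_asdict(obj: Any) -> dict[str, Any]:
    """Hypothetical helper: field name -> value, without recursing or copying."""
    return {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}


row = Row(1, [1, 2, 3])
shallow = shallow_asdict(row)
deep = dataclasses.asdict(row)

print(shallow == deep)                          # True: same contents
print(shallow["list_value"] is row.list_value)  # True: the list is shared
print(deep["list_value"] is row.list_value)     # False: asdict() rebuilt it
```

Because the values are shared with the original instances, mutating a list in the shallow dict also changes the dataclass it came from.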

KilianKW · 2024-12-19T10:06:53Z

Some people already seem to be working on the superfluous deep copies in dataclasses, but I guess it will take quite some time until we see that in a Python release.

I can have a look at opening a PR here if one of the polars devs considers it useful.

KilianKW added the enhancement label (New feature or an improvement of an existing feature) Dec 17, 2024
