DataFrame initialization from list of dataclasses is slow #20333

Open
KilianKW opened this issue Dec 17, 2024 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@KilianKW

Description

First of all, thank you very much for polars! This is my first issue, and I wasn't sure whether it is a bug or a feature request.

I am using a list of dataclasses.dataclass instances to initialize a pl.DataFrame, which works fine. However, the initialization takes much longer than when I first convert the list of dataclass instances into a dict of columns and initialize the DataFrame from that:

import dataclasses
import time

import polars as pl

@dataclasses.dataclass
class Row:
    int_value: int
    str_value: str
    list_value: list[int]

ROWS = [
    Row(index, str(index), [index]*100) for index in range(10000)
]

def df_from_rows():
    return pl.DataFrame(ROWS)

def df_from_dict():
    data_dict = {}
    for field in dataclasses.fields(Row):
        data_dict[field.name] = [getattr(row, field.name) for row in ROWS]
    return pl.DataFrame(data_dict)

start = time.perf_counter()
df_from_rows()
end = time.perf_counter()
print(f"Time taken to create DataFrame from rows: {end - start}")

start = time.perf_counter()
df_from_dict()
end = time.perf_counter()
print(f"Time taken to create DataFrame from dict: {end - start}")

With polars 1.17.1 I get

Time taken to create DataFrame from rows: 0.5881143999995402
Time taken to create DataFrame from dict: 0.06102950000058627

I don't know the details, but my guess is that each dataclass is first converted to a tuple, which creates deep copies.
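The copying part of this guess can be checked without polars. The following is a minimal sketch using only the standard library: dataclasses.asdict and dataclasses.astuple rebuild list fields, while plain attribute access returns the original object. It does not show how much of the polars slowdown this copying accounts for.

```python
import dataclasses


@dataclasses.dataclass
class Row:
    int_value: int
    str_value: str
    list_value: list[int]


row = Row(1, "1", [1] * 100)

as_dict = dataclasses.asdict(row)
as_tuple = dataclasses.astuple(row)

# Both converters rebuild nested lists: equal contents, distinct objects.
copied_by_asdict = as_dict["list_value"] is not row.list_value
copied_by_astuple = as_tuple[2] is not row.list_value

# Plain attribute access hands back the very same list, with no copy.
shared_by_getattr = getattr(row, "list_value") is row.list_value

print(copied_by_asdict, copied_by_astuple, shared_by_getattr)
```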

Would it be possible to enhance the initialization speed for this use case?

KilianKW added the enhancement label Dec 17, 2024
@vanheck

vanheck commented Dec 17, 2024

Hi, polars uses asdict or astuple internally; you can check it here:

def _sequence_of_dataclasses_to_pydf(

def df_from_dict2():
    dicts = [dataclasses.asdict(md) for md in ROWS]
    return pl.DataFrame(dicts)

start = time.perf_counter()
df_from_dict2()
end = time.perf_counter()
print(f"Time taken to create DataFrame from asdict: {end - start}")

This takes about as long as your df_from_rows().

@KilianKW
Author

Thanks for the insight. df_from_dict2() takes about the same time as df_from_rows() (i.e., around 10 times slower than df_from_dict()).

Would it make sense to incorporate my work-around (see df_from_dict(), probably in a more sophisticated manner) within the constructor of DataFrame?
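One way a more general version of the workaround might look is sketched below. It is only an illustration under stated assumptions: dataclasses_to_columns is a hypothetical helper name, not part of polars, and it assumes every row has the same dataclass type as the first one. It uses the standard library alone so it can be run without polars installed.

```python
import dataclasses
from typing import Any, Sequence


def dataclasses_to_columns(rows: Sequence[Any]) -> dict[str, list[Any]]:
    """Hypothetical helper: build {field name: column values} without copying values."""
    if not rows:
        return {}
    first = rows[0]
    # is_dataclass() is also true for dataclass types, so exclude classes explicitly.
    if not dataclasses.is_dataclass(first) or isinstance(first, type):
        raise TypeError("expected a sequence of dataclass instances")
    # Field names come from the first row; all rows are assumed to share its type.
    names = [f.name for f in dataclasses.fields(first)]
    return {name: [getattr(row, name) for row in rows] for name in names}


@dataclasses.dataclass
class Point:
    x: int
    tags: list[str]


points = [Point(1, ["a"]), Point(2, ["b", "c"])]
columns = dataclasses_to_columns(points)
print(columns)
```

The resulting dict could then be passed to pl.DataFrame(columns), in the same way df_from_dict() does.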

@vanheck

vanheck commented Dec 17, 2024

Your implementation makes sense! I don't know (I'm not a polars developer), but I think a pull request would be appreciated. That said, wouldn't it be better to propose this directly to Python, e.g. as dataclasses.asdict(deepcopy=False)?
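For context, dataclasses.asdict currently accepts only a dict_factory keyword, so deepcopy=False is a proposal, not an existing option. The Python documentation suggests a dict comprehension over fields() when a shallow copy is wanted. A minimal sketch of that idea follows; shallow_asdict is a hypothetical name, and nested dataclasses would stay as instances instead of being converted.

```python
import dataclasses


def shallow_asdict(obj):
    """Hypothetical helper: like asdict(), but without recursing or copying values."""
    return {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}


@dataclasses.dataclass
class Row:
    int_value: int
    list_value: list[int]


row = Row(7, [7, 7, 7])
shallow = shallow_asdict(row)
deep = dataclasses.asdict(row)

# Same contents for flat fields, but only the shallow version shares the list.
print(shallow == deep)
print(shallow["list_value"] is row.list_value)
print(deep["list_value"] is row.list_value)
```

Whether a list of such shallow dicts would be as fast in polars as the column-wise dict is not measured here.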

@KilianKW
Author

KilianKW commented Dec 19, 2024

Some people already seem to be working on the superfluous deep copies in dataclasses, but I guess it will take quite some time until we see that in a Python release.

I can look into opening a PR here if one of the polars devs considers it useful.
