Parquet writer only writes one data page if it utilizes dictionary encoding #20141

Open
2 tasks done
coastalwhite opened this issue Dec 4, 2024 · 0 comments · May be fixed by #20148
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), P-high (Priority: high), python (Related to Python Polars)

Comments

@coastalwhite
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

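import io
import os

import polars as pl
import pyarrow.parquet as pq

# Ask Polars to log each Parquet page it reads (this produces the log output below).
os.environ['PARQUET_DO_VERBOSE'] = '1'
f = io.BytesIO()

# A single-category Enum column, which the writer dictionary-encodes.
df = pl.DataFrame({ "a": ["A"] * 10000 }, schema = { "a": pl.Enum(["A"]) })

# Write with Polars, requesting the smallest possible data page size.
f.seek(0)
df.write_parquet(f, data_page_size=1)
f.truncate()

# Read the file back so the page layout written by Polars is logged.
f.seek(0)
print("Polars:")
df = pl.read_parquet(f)

print()
print()
print()

# Write the read-back frame with PyArrow for comparison.
f.seek(0)
pq.write_table(df.to_arrow(), f, data_page_size=1)
f.truncate()

# Read the file back so the page layout written by PyArrow is logged.
f.seek(0)
print("PyArrow:")
df = pl.read_parquet(f)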

Log output

Polars:
Parquet DictPage ( num_values: 1, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray } )
Parquet DataPageV1 ( num_values: 10000, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )



PyArrow:
Parquet DictPage ( num_values: 1, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray } )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 1024, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
Parquet DataPageV1 ( num_values: 784, datatype: PrimitiveType { field_info: FieldInfo { name: "a", repetition: Optional, id: None }, logical_type: Some(String), converted_type: Some(Utf8), physical_type: ByteArray }, encoding: Some(RleDictionary) )
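The PyArrow layout in the log above can be checked with a little arithmetic: 10000 values split at 1024 values per page should give nine full pages plus one page of 784. The sketch below reproduces those counts. The helper name split_into_pages is hypothetical, and the 1024-value limit is simply what the log shows for this run, not a documented guarantee of either writer.

```python
def split_into_pages(num_values: int, values_per_page: int) -> list[int]:
    """Return the value count of each page for a fixed per-page limit.

    Hypothetical helper for illustration only; it is not part of Polars or PyArrow.
    """
    full_pages, remainder = divmod(num_values, values_per_page)
    pages = [values_per_page] * full_pages
    if remainder:
        # The last page holds whatever is left over.
        pages.append(remainder)
    return pages


# 1024 is the per-page value count observed in the PyArrow log (an observation, not a spec).
pages = split_into_pages(10_000, 1024)
print(len(pages), pages[-1], sum(pages))  # prints: 10 784 10000
```

By the same count, the Polars output corresponds to a single page holding all 10000 values.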

Issue description

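Polars writes only one data page when the column is dictionary-encoded, even when it is explicitly told to write more than one (here via data_page_size=1): all 10000 values end up in a single page.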

Expected behavior

Polars should divide the data over several data pages, as PyArrow does, so that reading with predicates is much faster.
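To illustrate why page granularity matters for predicates, here is a minimal sketch that assumes a reader able to skip whole pages using per-page min/max statistics. It is a toy model, not Polars' or PyArrow's reader: the Page class and the make_pages and rows_decoded helpers are hypothetical, and the sorted integer column is chosen only to make the effect visible.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy stand-in for a data page with min/max statistics (hypothetical)."""
    min_value: int
    max_value: int
    num_values: int


def make_pages(values: list[int], values_per_page: int) -> list[Page]:
    """Chunk a column into pages and record each page's min/max."""
    pages = []
    for start in range(0, len(values), values_per_page):
        chunk = values[start:start + values_per_page]
        pages.append(Page(min(chunk), max(chunk), len(chunk)))
    return pages


def rows_decoded(pages: list[Page], target: int) -> int:
    """Count rows a reader must decode for `value == target` when it can
    skip any page whose [min, max] range excludes the target."""
    return sum(p.num_values for p in pages if p.min_value <= target <= p.max_value)


values = list(range(10_000))           # sorted column, for illustration only
single = make_pages(values, 10_000)    # one page, like the Polars output
multi = make_pages(values, 1024)       # 1024-value pages, like the PyArrow output

print(rows_decoded(single, 5000))  # prints: 10000
print(rows_decoded(multi, 5000))   # prints: 1024
```

With a single page there is nothing to skip, so every row is decoded; with smaller pages only the one page whose range contains the target is touched.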

Installed versions

Replace this line with the output of pl.show_versions(). Leave the backticks in place.
@coastalwhite coastalwhite added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer P-high Priority: high labels Dec 4, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Dec 4, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Dec 4, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Dec 17, 2024