[C++][Parquet] Decryptor errors when scanning a dataset that uses uniform encryption #44852

Open
adamreeve opened this issue Nov 25, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

@pitrou pointed out that InternalFileDecryptor reusing the footer_data_decryptor_ could be problematic for multi-threaded Parquet reads: #43057 (comment)

I confirmed that this does lead to decryptor errors when scanning a Dataset with Parquet files that use uniform encryption by modifying the existing Parquet Dataset encryption tests:

diff --git a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
index 0287d593d1..6a13b1ee37 100644
--- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -90,7 +90,7 @@ class DatasetEncryptionTestBase : public ::testing::Test {
     auto encryption_config =
         std::make_shared<parquet::encryption::EncryptionConfiguration>(
             std::string(kFooterKeyName));
-    encryption_config->column_keys = kColumnKeyMapping;
+    encryption_config->uniform_encryption = true;
     auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
     // Directly assign shared_ptr objects to ParquetEncryptionConfig members
     parquet_encryption_config->crypto_factory = crypto_factory_;

This causes DatasetEncryptionTest::WriteReadDatasetWithEncryption to fail with an error like:

/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: AesDecryptor was wiped outDeserializing page header failed.

/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:1263  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/adam/dev/arrow/cpp/src/arrow/util/parallel.h:95  func(i, inputs[i])
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:208: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
  Actual: it does.

For LargeRowEncryptionTest::ReadEncryptLargeRows, I sometimes get the same AesDecryptor was wiped out error, but also see errors like:

/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: Failed decryption finalization
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:1263  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/adam/dev/arrow/cpp/src/arrow/util/parallel.h:95  func(i, inputs[i])
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:265: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
  Actual: it does.

I don't think it's possible to reproduce this from PyArrow only, as the uniform_encryption setting isn't exposed in PyArrow.
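
To illustrate the hazard described above, here is a minimal sketch of the general pattern that avoids it: each worker thread constructs its own decryptor rather than sharing one stateful instance. Everything in it is hypothetical and for illustration only. `ToyDecryptor`, `ToyDecryptorFactory`, and `DecryptColumnsInParallel` are invented names, the XOR "cipher" is a placeholder, and none of this reflects the actual `InternalFileDecryptor` or `AesDecryptor` code or how a fix would necessarily be implemented in Arrow.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a stateful decryptor. This is NOT the Parquet
// AesDecryptor API; the XOR transform only exists to make the sketch runnable.
class ToyDecryptor {
 public:
  explicit ToyDecryptor(std::uint8_t key) : key_(key) {}

  // Works through an internal scratch buffer, so two threads calling this on
  // the same instance at the same time would race on scratch_.
  std::string Decrypt(const std::string& ciphertext) {
    scratch_.assign(ciphertext.begin(), ciphertext.end());
    for (char& c : scratch_) c = static_cast<char>(c ^ key_);
    return scratch_;
  }

 private:
  std::uint8_t key_;
  std::string scratch_;
};

// Hypothetical factory that hands out a fresh decryptor per caller instead of
// caching and reusing a single shared instance.
class ToyDecryptorFactory {
 public:
  explicit ToyDecryptorFactory(std::uint8_t key) : key_(key) {}
  std::unique_ptr<ToyDecryptor> Make() const {
    return std::make_unique<ToyDecryptor>(key_);
  }

 private:
  std::uint8_t key_;
};

// Decrypts each "column" on its own thread. Every thread owns its decryptor,
// and each thread writes to a distinct element of `out`, so no mutable state
// is shared between threads.
std::vector<std::string> DecryptColumnsInParallel(
    const ToyDecryptorFactory& factory,
    const std::vector<std::string>& encrypted) {
  std::vector<std::string> out(encrypted.size());
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < encrypted.size(); ++i) {
    workers.emplace_back([&, i] {
      auto decryptor = factory.Make();
      out[i] = decryptor->Decrypt(encrypted[i]);
    });
  }
  for (auto& w : workers) w.join();
  return out;
}
```

Guarding a single shared decryptor with a mutex would be another option, at the cost of serializing decryption across columns. Which approach suits the real reader is a separate question that this sketch does not try to answer.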

Component(s)

C++, Parquet
