Table Provider for Parquet Files and In-Memory Arrays #11971

nrdmao33 · 2024-08-13T18:50:08Z

I am interested in using DataFusion to query Parquet files and in-memory data simultaneously. The in-memory data is being ingested in real time, and the Parquet files store historical data that was ingested earlier and written out to disk once it reached a certain size. Older files would be removed based on space or time criteria. When the system reboots, it has access to the saved files and begins populating a new in-memory table.
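A minimal, std-only sketch of this lifecycle, to make the flush and retention steps concrete. It does not use DataFusion or Parquet: `Segment` stands in for one file on disk, and `Store`, `flush_rows`, and `max_segments` are hypothetical names, with the size threshold counted in rows and retention counted in segments purely for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one Parquet file on disk.
struct Segment {
    rows: Vec<u64>,
}

/// Live buffer plus flushed segments, kept oldest first.
struct Store {
    live: Vec<u64>,
    segments: VecDeque<Segment>,
    flush_rows: usize,   // assumed size threshold, in rows
    max_segments: usize, // assumed retention limit, in segments
}

impl Store {
    fn new(flush_rows: usize, max_segments: usize) -> Self {
        Store {
            live: Vec::new(),
            segments: VecDeque::new(),
            flush_rows,
            max_segments,
        }
    }

    /// Appends one row, flushing the live buffer once it is full.
    fn ingest(&mut self, row: u64) {
        self.live.push(row);
        if self.live.len() >= self.flush_rows {
            // "Write out" the full buffer and start a fresh, empty one.
            let rows = std::mem::take(&mut self.live);
            self.segments.push_back(Segment { rows });
            // Retention: drop the oldest segments beyond the limit.
            while self.segments.len() > self.max_segments {
                self.segments.pop_front();
            }
        }
    }

    /// Rows visible to a query: all retained segments plus the live buffer.
    fn total_rows(&self) -> usize {
        self.segments.iter().map(|s| s.rows.len()).sum::<usize>() + self.live.len()
    }
}
```

A real implementation would presumably apply time- or byte-based retention instead of a segment count, and would rebuild `segments` from the files found on disk at startup.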

The basic idea would be to create a table provider that dispatches scan() calls to the Parquet provider and the in-memory provider and then merges the results. In effect, the ExecutionPlan would be a composite of the ExecutionPlan for the Parquet files and the ExecutionPlan for the in-memory table. I am not sure whether merging the results from two different execute() functions is a reasonable thing to do.
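The dispatch-and-merge shape can be modeled with the standard library alone. The sketch below is not DataFusion code: `RowSource`, `VecSource`, and `CompositeSource` are hypothetical names, `scan` is a simplified stand-in for a provider's scan() that only takes a limit, and rows are plain tuples rather than Arrow record batches.

```rust
/// One row: (timestamp, value).
type Row = (u64, f64);

/// Hypothetical stand-in for a table provider's scan().
trait RowSource {
    /// Returns a lazy stream of rows, optionally capped at `limit`.
    fn scan(&self, limit: Option<usize>) -> Box<dyn Iterator<Item = Row> + '_>;
}

/// Stand-in for either the historical files or the live buffer.
struct VecSource {
    rows: Vec<Row>,
}

impl RowSource for VecSource {
    fn scan(&self, limit: Option<usize>) -> Box<dyn Iterator<Item = Row> + '_> {
        let it = self.rows.iter().copied();
        match limit {
            Some(n) => Box::new(it.take(n)),
            None => Box::new(it),
        }
    }
}

/// Dispatches scan() to every child and concatenates the resulting streams.
struct CompositeSource {
    children: Vec<Box<dyn RowSource>>,
}

impl RowSource for CompositeSource {
    fn scan(&self, limit: Option<usize>) -> Box<dyn Iterator<Item = Row> + '_> {
        // Push the limit down to each child, then re-apply it to the merged
        // stream so the total never exceeds `limit`.
        let merged = self.children.iter().flat_map(move |c| c.scan(limit));
        match limit {
            Some(n) => Box::new(merged.take(n)),
            None => Box::new(merged),
        }
    }
}

/// Builds a two-child composite and scans it without a limit.
fn demo() -> Vec<Row> {
    let historical = VecSource { rows: vec![(1, 10.0), (2, 20.0)] };
    let live = VecSource { rows: vec![(3, 30.0)] };
    let table = CompositeSource {
        children: vec![Box::new(historical), Box::new(live)],
    };
    let rows: Vec<Row> = table.scan(None).collect();
    rows
}
```

In this model the merge is a plain concatenation with no ordering guarantee across children. DataFusion's `UnionExec`, which combines the partitions of several child plans into one plan, looks like the closest analogue on the ExecutionPlan side, though whether it fits this use case is exactly the open question.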

I would like to reuse as much of the Parquet table provider as possible. The current implementation uses ListingTable, but that might not be the right option, since it seems the list of files cannot change over time. Perhaps the better interface is ParquetFormat, which implements the FileFormat trait?

Also, any advice on the in-memory table would be appreciated. Perhaps there is already something similar available?

Replies: 0 comments