I am interested in using DataFusion to query Parquet files and in-memory data simultaneously. The in-memory data represents data that is being ingested in real time, while the Parquet files store historical data that was ingested earlier and written out to disk once the in-memory table reached a certain size. Older files would be removed based on space or time criteria. When the system reboots, it has access to the saved files and begins populating a new in-memory table.
The basic idea then would be to create a table provider that dispatches scan() calls to the Parquet provider and the in-memory provider and then merges the results. In effect, the ExecutionPlan would be a composite of the ExecutionPlan for the Parquet files and the ExecutionPlan for the in-memory table. I am not sure whether merging the results from two different execute() functions is a reasonable thing to do.
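To make the idea concrete, here is a rough sketch of the composite I have in mind. It uses plain Rust with hypothetical stand-in types (`Source`, `HistoricalFiles`, `LiveBuffer`, `Composite`), not the real DataFusion traits, so it only illustrates the shape of the dispatch-and-merge step and assumes both sources share one schema.

```rust
// Hypothetical stand-ins for illustration only; none of these are DataFusion types.
type Row = (u64, i64); // (timestamp, value)
type Batch = Vec<Row>;

/// Simplified stand-in for a table provider: a scan yields one batch per partition.
trait Source {
    fn scan(&self) -> Vec<Batch>;
}

/// Stands in for the Parquet side: each inner Vec plays the role of one file on disk.
struct HistoricalFiles {
    files: Vec<Batch>,
}

/// Stands in for the in-memory table that receives real-time data.
struct LiveBuffer {
    rows: Batch,
}

impl Source for HistoricalFiles {
    fn scan(&self) -> Vec<Batch> {
        self.files.clone()
    }
}

impl Source for LiveBuffer {
    fn scan(&self) -> Vec<Batch> {
        // An empty buffer contributes no partitions at all.
        if self.rows.is_empty() {
            Vec::new()
        } else {
            vec![self.rows.clone()]
        }
    }
}

/// The composite dispatches scan() to every child and concatenates the partitions.
struct Composite {
    children: Vec<Box<dyn Source>>,
}

impl Source for Composite {
    fn scan(&self) -> Vec<Batch> {
        self.children.iter().flat_map(|child| child.scan()).collect()
    }
}

/// Builds a small example table: two "files" plus a live buffer.
fn example_table() -> Composite {
    Composite {
        children: vec![
            Box::new(HistoricalFiles {
                files: vec![vec![(1, 10), (2, 20)], vec![(3, 30)]],
            }),
            Box::new(LiveBuffer { rows: vec![(4, 40)] }),
        ],
    }
}
```

In this sketch the merge is a plain concatenation of partitions, with no ordering or deduplication across the two sources. Whether that is enough depends on whether a row can ever be visible in both the in-memory table and a file at the same time.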
I would like to reuse as much of the Parquet table provider as possible. The current implementation uses ListingTable, but that might not be the right option, since it seems the list of files cannot change over time. Perhaps the better interface is ParquetFormat, which implements the FileFormat trait?
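What I mean by a file list that changes over time is roughly the following. This is again a sketch in plain Rust; `FileRegistry` and its methods are hypothetical names of mine, not anything DataFusion provides. Each scan would take a snapshot of the current list, while ingestion adds new files and retention evicts the oldest ones.

```rust
use std::path::PathBuf;
use std::sync::{Arc, RwLock};

/// Hypothetical shared registry of the Parquet files currently on disk,
/// kept in the order the files were written (oldest first).
#[derive(Clone, Default)]
struct FileRegistry {
    files: Arc<RwLock<Vec<PathBuf>>>,
}

impl FileRegistry {
    /// Called after the in-memory table has been flushed to a new file.
    fn add(&self, path: impl Into<PathBuf>) {
        self.files.write().unwrap().push(path.into());
    }

    /// Retention by count: keep the newest `max_files` and return the evicted
    /// paths so the caller can delete them from disk.
    fn evict_oldest(&self, max_files: usize) -> Vec<PathBuf> {
        let mut files = self.files.write().unwrap();
        let excess = files.len().saturating_sub(max_files);
        let evicted: Vec<PathBuf> = files.drain(..excess).collect();
        evicted
    }

    /// Called at scan time: a query plans against a fixed copy of the list,
    /// so later additions or evictions do not change it mid-query.
    fn snapshot(&self) -> Vec<PathBuf> {
        self.files.read().unwrap().clone()
    }
}
```

One open point with this approach is that a file evicted after a snapshot was taken could be deleted while a query is still reading it, so deletion would probably need to be deferred until in-flight scans finish.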
Also, any advice on the in-memory table would be appreciated. Perhaps there is already something similar available?