support apache arrow client #24586
Comments
Duplicate of #18038. You could comment there if you have a different and/or more specific use case.
I think this is not exactly a duplicate, but I can move the discussion there if you prefer. This ticket is more generally about having some kind of Arrow-based client for Trino, whereas #18038 is specifically about Arrow Flight. It's also not clear to me whether #18038 is about creating an Arrow Flight connector for Trino, so it can fetch data from an Arrow Flight server, or about implementing the Arrow Flight endpoints in Trino, so it can serve as an Arrow Flight server.
Thanks, I see the distinction now. It might actually be a much smaller effort to use Arrow instead of JSON as the transport format than to implement a client compatible with Arrow Flight SQL. I think the biggest issue with Arrow is how to handle Trino data types that don't map to Arrow types. This would require special logic on the client side. We should enumerate such types and check whether they invalidate the whole point of using Arrow, that is, whether it would be simpler to convert all data on the client side. Recent work on #22271 improved fetching data in the clients, so it should now be possible to benchmark the Arrow conversion done on both the client and the server side. On Slack you mentioned you had already tried implementing an ADBC client. Do you have any code you could share?
I agree that no matter what protocol is used, type conversion will be tricky. I took a pass at converting Trino types to Arrow types, and mostly they map well. The gaps are (unsurprisingly) around time and timestamp types. Arrow only supports times up to nanosecond precision, and only supports time zones scoped to a file or block of data, not to the row like Trino does. I handled these in my draft implementation by using Arrow extension types. Arrow extension types are a way to mark some Arrow physical type as representing a nonstandard logical type. In this case, I encoded picosecond-precision times as a struct with a nano timestamp and a pico adjustment, and times with time zone as a struct with a time and a zone offset. Sure, I can make a draft PR. Note, it's very much a draft, untested and in progress. Edit: oh, also HyperLogLog, which I didn't know enough about to attempt. Will have to figure out what to do for this.
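For readers following along, the struct encoding described in the comment above can be sketched roughly as follows. This is only an illustration in plain Python, not code from the draft PR: the names `PicoTimestamp`, `split_picos`, `join_picos`, and `TimeWithZone` are hypothetical, and the field layout (nanoseconds since the epoch plus a 0..999 picosecond remainder, and a time of day plus an offset in minutes) is an assumption about how such a struct might look.

```python
from typing import NamedTuple

PICOS_PER_NANO = 1_000


class PicoTimestamp(NamedTuple):
    """Hypothetical struct layout for a picosecond-precision timestamp."""
    epoch_nanos: int      # fits an Arrow nanosecond timestamp (int64)
    pico_adjustment: int  # remaining picoseconds, always in 0..999


class TimeWithZone(NamedTuple):
    """Hypothetical struct layout for a per-row time with time zone."""
    nanos_of_day: int     # time of day in nanoseconds
    offset_minutes: int   # UTC offset carried with each row


def split_picos(epoch_picos: int) -> PicoTimestamp:
    # divmod floors toward negative infinity, so the adjustment stays
    # non-negative even for instants before the epoch.
    nanos, picos = divmod(epoch_picos, PICOS_PER_NANO)
    return PicoTimestamp(nanos, picos)


def join_picos(value: PicoTimestamp) -> int:
    # Inverse of split_picos: rebuild the full picosecond count.
    return value.epoch_nanos * PICOS_PER_NANO + value.pico_adjustment
```

A client that only understands nanosecond timestamps could read `epoch_nanos` and ignore the adjustment, which is presumably part of the appeal of wrapping a standard physical type in an extension type.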
Created #24587 with the initial draft.
@dysn maybe it is worth checking whether there is a chance of getting these problematic types supported in Arrow itself?
That's a good point, since migrating from extension types to natively supported Arrow types would be a breaking change for clients. On the other hand, Arrow is a big project, so adding new types may be slow. Would you prefer that I open an issue with Arrow? Or would it be better for a core Trino contributor to start that discussion? FWIW, I think opening up that line of communication is worthwhile, especially if Trino ever wants to support Arrow Flight.
I am not a Trino maintainer or anything like that, but I think it is worth opening an issue with Arrow to see whether adding types would be acceptable. It is also worth hearing more from the Trino maintainers, as they have more experience integrating with other open-source projects and know more about the Trino core.
I looked into Arrow Flight and Flight SQL. Much of the protocol translates well to Trino, with a couple of exceptions. First, Flight handles fetching the actual result data as a gRPC call, which doesn't map to the Trino spooling protocol with the default STORAGE retrieval mode. I am optimistic that the Arrow team might be open to adding regular HTTP-based data fetching. More difficult is session management. Flight uses a session id cookie, but that implies that the server will track session properties and prepared statements. It's certainly possible to track those in Trino, but it would come with more questions and design work, I think. It's also possible to continue storing session info in headers using a custom gRPC interceptor, but that partly defeats the goal of being able to query Trino with a regular Flight SQL client. Edit: that said, in some ways I think going straight to implementing Flight SQL is better. It avoids having to implement a client for each language, and it encapsulates Arrow-specific behavior.
Arrow is primarily an in-memory columnar data format, but it also defines ways for applications to exchange such data efficiently. Many common data processing and analysis tools (e.g. Polars, DuckDB) can work directly on Arrow data.
At our company, we are seeing a pattern emerge where data scientists and analysts first generate modest amounts of data (100s of MB to 10s of GB) with heavy Trino queries, and then load that into something like Polars or DuckDB for further analysis. We would like to better support this pattern. One current pain point is that users must convert the row-oriented Trino data into Arrow data. Arrow provides utilities for this, but the conversion is not very efficient.
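To make the pain point concrete, here is a minimal sketch of the kind of row-to-column pivot a client has to perform today. It uses plain Python lists rather than real Arrow arrays, and the function name `rows_to_columns` is hypothetical. An actual client would hand the resulting columns to an Arrow library, but the per-value work shown here is the part that a server-side Arrow format would avoid.

```python
from typing import Any, Dict, List, Sequence


def rows_to_columns(
    column_names: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Dict[str, List[Any]]:
    """Pivot row-oriented results into one list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name in column_names}
    for row in rows:
        # Every value of every row is touched once on the client.
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns
```

For example, `rows_to_columns(["id", "name"], [[1, "a"], [2, "b"]])` yields one list for `id` and one for `name`.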
There are a few ways Trino could return Arrow data to users. One would be to implement the Arrow Flight (and higher-level Arrow Flight SQL) gRPC protocol. This seems like it could be a pretty big lift, and would likely require some extra up-front design. It may not be possible to support an acceptable subset of Trino's features on the existing Arrow Flight protocol. In particular, fetching data in Arrow Flight is a gRPC endpoint, which doesn't fit well with Trino's spooling protocol.
An easier way might be to implement a Java ADBC connector for Trino. ADBC is just a Java interface and doesn't require any specific protocol. There are two ways we could implement an ADBC connector. We could modify the direct protocol to return data in the Arrow wire format, or we could enable Trino to spool data in the Arrow IPC format. Of course, these aren't mutually exclusive.
Is an Arrow-based Trino client desirable? If so, how should it be implemented?