Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support apache arrow client #24586

Open
dysn opened this issue Dec 26, 2024 · 9 comments
Open

support apache arrow client #24586

dysn opened this issue Dec 26, 2024 · 9 comments

Comments

@dysn
Copy link

dysn commented Dec 26, 2024

Arrow is primarily an in memory columnar data format, but defines ways for applications to exchange such data efficiently. Many common data processing/analysis tools support working directly on arrow data efficiently (eg polars, duckdb)

At our company, we are seeing a pattern emerge where data scientists and analysts will first generate modest amounts of data (100s of mbs to 10s of gbs) with heavy trino queries, and then load that into something like polars or duckdb for further analysis. We would like to better support this pattern. One current pain point is that users must convert the row oriented trino data into arrow data. Arrow provides utilities for this, but that is not so efficient.

There are a few ways trino could return arrow data to users. One would be by implementing the arrow flight (and higher level arrow flight sql) grpc protocol. This seems like it could be a pretty big lift, and likely would require some extra up front design. It may not be possible to support an acceptable subset of trinos features on the existing arrow flight protocol. In particular, fetching data in arrow flight is a grpc endpoint, which doesnt fit well with trino's spooling protocol

An easier way might be to implement a java adbc connector for trino. ADBC is just a java interface and doesnt require any specific protocol. There are two ways we could implement and adbc connector. We could modify the direct protocol to return data in the arrow wire format, or we could enable trino to spool data in the arrow ipc format. Of course, these aren't mutually exclusive.

Is an arrow based trino client desirable? If so, how should that be implemented?

@nineinchnick
Copy link
Member

Duplicate of #18038, you could comment there if you have a different and/or more specific use case.

@dysn
Copy link
Author

dysn commented Dec 26, 2024

i think this is not exactly a duplicate, but i can move discussion there if you prefer

this ticket is more generally about having some kind of arrow based client for trino, whereas 18038 is specifically about arrow flight. its also not clear to me if 18038 is about creating an arrow flight connector for trino so it can fetch data from an arrow flight server, or implementing the arrow flight endpoints in trino so it can serve as an arrow flight server

@nineinchnick
Copy link
Member

nineinchnick commented Dec 26, 2024

Thanks, I see the distinction now. It might be actually a much smaller effort to use Arrow instead of JSON as the transport format, than implementing a client compatible with Arrow Flight SQL.

I think the biggest issue with Arrow is how to handle Trino data types that don't map to Arrow types. This would require special logic on the client side - we should enumerate such types and see if that would not invalidate the whole point of using Arrow, that is, if it would be simpler to convert all data on the client side.

Recent work on #22271 improved fetching data in the clients, so it should be possible now to benchmark the Arrow conversion done on both the client and server side.

On Slack you mentioned you already tried implementing an ADBC client, do you have any code you could share?

@nineinchnick nineinchnick reopened this Dec 26, 2024
@dysn
Copy link
Author

dysn commented Dec 26, 2024

i agree that no matter what protocol is used, type conversion will be tricky. i took a pass at converting trino types to arrow types, and mostly they map well. the gaps are around (unsurprisingly) time and timestamp types. arrow only supports times up to nanosecond precision, and only supports times with timezones scoped to a file or block of data, not to the row like trino does

i handled these in my draft implementation by using arrow extension types. arrow extension types are a way to mark some arrow physical type as representing nonstandard logical type. in this case, i encoded picosecond precision times as a struct with a nano timestamp and pico adjustment, and times with timezone as a struct with a time and zone offset.

sure, i can make a draft PR. note, its very much a draft, untested and in progress

edit: oh, also hyperloglog, which i didnt know enough about to attempt. will have to figure out what to do for this

@dysn
Copy link
Author

dysn commented Dec 26, 2024

created #24587 with the initial draft

@shohamyamin
Copy link
Contributor

@dysn maybe it is worth checking if there is a chance to make arrow supporting this problematic types in arrow itself?

@dysn
Copy link
Author

dysn commented Dec 26, 2024

thats a good point, since migrating from extension types to natively supported arrow types would be a breaking change for clients. on the other hand, arrow is a big project, so adding new types may be slow

would you prefer that i open an issue with arrow? or would it be better for a core trino contributor to start that discussion? FWIW, i think opening up that line of communication is worthwhile, especially if trino ever wants to support arrow flight

@shohamyamin
Copy link
Contributor

I am not a maintainer at Trino or something like that but I think that it is worth opening an issue at arrow to see if adding types can be acceptable
At the same time having this conversation at Trino and start the work here. If there will be progress from arrow side that will be great and make things simple.

But it is worth to hear more from the Trino maintainers as they have more experience with integrating with other open-source projects and know more about Trino core

@dysn
Copy link
Author

dysn commented Jan 3, 2025

i looked into arrow flight and flight sql. much of the protocol translates well to trino, with a couple exceptions

first, flight handles fetching actual result data as a grpc call, which doesnt map to trino spooling protocol with the default STORAGE retrieval mode. I am optimistic that the arrow team might would be open to adding regular http based data fetching.

more difficult is session management. flight uses a session id cookie, but that implies that the server will track session properties and prepared statements. its certainly possible to track those in trino, but it would come with more questions and design i think. its also possible to continue storing session info with headers using a custom grpc interceptor, but that kind of partly defeats the goal of being able to query trino with a regular flight sql client

edit: that said, in some ways i think going straight to implementing flight sql is better. it avoids having to implement a client for each language, and encapsulate arrow specific behavior

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants