Save out two join tables so we know how big they are
See whether they are broadcasting and whether we should hint the broadcast
How do we make the small tables shuffle to the big tables?
Data types and speed
The current implementation saves results out to csv in s3 (Athena's default behaviour) and then reads them back in from s3.
However, it is possible to save results out to parquet using a create table as statement.
This has two benefits:
Reading parquet from s3 is a lot faster (double the speed or more) than reading csv
The resultant dataframe in pandas is guaranteed to have the right data types.
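As a rough illustration of the idea, the user's query could be wrapped in a create table as statement that asks Athena for parquet output. This is only a sketch: `wrap_in_ctas`, the table name and the s3 location are made-up names for illustration, not anything that exists in the repo yet. The `WITH (format = ..., external_location = ...)` clause is Athena's CTAS syntax.

```python
def wrap_in_ctas(sql: str, table_name: str, s3_location: str) -> str:
    """Wrap a select statement so Athena writes its results as parquet.

    Hypothetical helper: the name and arguments are assumptions.
    """
    # Drop surrounding whitespace and any trailing semicolon so the
    # select can be embedded inside the larger statement.
    select_sql = sql.strip().rstrip(";").strip()
    return (
        f"CREATE TABLE {table_name}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS {select_sql}"
    )


if __name__ == "__main__":
    print(wrap_in_ctas("select * from db.my_table;", "db.temp_results", "s3://my-bucket/temp/"))
```

The parquet files written to the s3 location could then be read into pandas, and the temporary table dropped afterwards.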
One potential issue with this approach is that the user must submit a select statement (not e.g. a delete table statement). So, if we're worried about this, we would need to somehow parse the sql statement to make sure it's a select statement.
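A minimal sketch of such a check, assuming it is enough to look at the first keyword once comments are stripped. `is_select_statement` is a hypothetical name, and this is not a real sql parser: for example, it treats anything starting with `WITH` as a select.

```python
import re


def is_select_statement(sql: str) -> bool:
    """Rough check that a query starts with SELECT (or WITH).

    Hypothetical helper; only inspects the first keyword.
    """
    # Remove /* ... */ block comments, then -- line comments.
    no_block = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    no_comments = re.sub(r"--[^\n]*", " ", no_block)
    # Skip leading whitespace and opening brackets, then take the first word.
    match = re.match(r"[\s(]*([A-Za-z]+)", no_comments)
    return match is not None and match.group(1).upper() in {"SELECT", "WITH"}


if __name__ == "__main__":
    print(is_select_statement("select * from db.my_table"))
    print(is_select_statement("drop table db.my_table"))
```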
I previously had a go at this here; it works in most situations, but it's very rough and ready.
Once we've done this, we should probably deprecate the python_athena_tools repo.
Should be integrated into read_sql (and only write to parquet format - might require rewriting of get_athena_query_response innards)
Should throw an error when the user's sql is itself a CREATE TABLE AS statement, with a message something like: This function wraps your sql in a "CREATE TABLE AS" statement. Please use "get_athena_query_response" to run a "CREATE TABLE AS" statement.
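A sketch of how that guard might look. The function name `check_not_ctas` and the choice of `ValueError` are assumptions; the message is the one suggested above. For simplicity it rejects any statement that starts with CREATE TABLE.

```python
import re

CTAS_ERROR_MESSAGE = (
    'This function wraps your sql in a "CREATE TABLE AS" statement. '
    'Please use "get_athena_query_response" to run a "CREATE TABLE AS" statement.'
)


def check_not_ctas(sql: str) -> None:
    """Raise if the user's sql already starts with CREATE TABLE.

    Hypothetical helper; the exception type is an assumption.
    """
    if re.match(r"\s*CREATE\s+TABLE\b", sql, flags=re.IGNORECASE):
        raise ValueError(CTAS_ERROR_MESSAGE)


if __name__ == "__main__":
    check_not_ctas("select * from db.my_table")  # passes silently
    try:
        check_not_ctas("create table db.new_table as select 1")
    except ValueError as err:
        print(err)
```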