Data types and speed #15

Open
RobinL opened this issue Jan 27, 2020 · 2 comments

@RobinL
Member

RobinL commented Jan 27, 2020

  • Need job that just does term frequency adjustment
  • Save out two join tables so we know how big they are
  • See whether they are broadcasting and whether we should hint the broadcast
  • How do we make the small tables shuffle to the big tables?

Data types and speed

The current implementation saves results out to csv in s3 (Athena's default behaviour) and then reads them back in from s3.

However, it is possible to save results out to parquet using a CREATE TABLE AS (CTAS) statement.

This has two benefits:

  • Reading parquet from s3 is a lot faster (double the speed or more) than reading csv
  • The resultant dataframe in pandas is guaranteed to have the right data types.
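
As a rough illustration of the CTAS approach, the user's query could be wrapped along these lines before being sent to Athena. This is only a sketch: the function name `wrap_in_ctas` and its parameters are hypothetical and not part of this repo, and it assumes the target s3 location is empty, as Athena requires for CTAS output.

```python
def wrap_in_ctas(select_sql: str, database: str, table: str, s3_location: str) -> str:
    """Wrap a SELECT query in an Athena CREATE TABLE AS statement writing Parquet.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Drop surrounding whitespace and a trailing semicolon so the query
    # can be embedded after the AS keyword.
    body = select_sql.strip().rstrip(";").strip()
    return (
        f"CREATE TABLE {database}.{table}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS\n{body}"
    )


ctas_sql = wrap_in_ctas(
    "SELECT * FROM my_db.my_table;",  # assumed example query
    database="temp_db",               # assumed database name
    table="temp_result",              # assumed table name
    s3_location="s3://my-bucket/temp_result/",  # assumed, must be empty
)
print(ctas_sql)
```

The resulting parquet files could then be read into pandas from the external location, and the temporary table dropped afterwards.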

One potential issue with this approach is that the user must submit a select statement (not e.g. a delete table statement). So, if we're worried about this, we would need to somehow parse the sql statement to make sure it's a select statement.
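
A very simple version of that check might look like the sketch below. It is a heuristic rather than a real SQL parser: `looks_like_select` is a hypothetical name, it only inspects the first keyword after stripping comments, and it will wrongly reject queries that contain a semicolon or comment marker inside a string literal, or that start with a parenthesis.

```python
import re

# Matches "-- line comments" and "/* block comments */".
_COMMENTS = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)


def looks_like_select(sql: str) -> bool:
    """Heuristically decide whether `sql` is a single SELECT-style query."""
    stripped = _COMMENTS.sub(" ", sql).strip().rstrip(";").strip()
    # A remaining semicolon suggests more than one statement was submitted.
    if ";" in stripped:
        return False
    first_word = re.match(r"[A-Za-z]+", stripped)
    # WITH is allowed so that queries starting with a common table
    # expression are accepted.
    return bool(first_word) and first_word.group(0).upper() in {"SELECT", "WITH"}


print(looks_like_select("select * from my_table"))
print(looks_like_select("drop table my_table"))
```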

I previously had a go at this here; it works in most situations, but it's very rough and ready.

Once we've done this, we should probably deprecate the python_athena_tools repo.

@isichei
Contributor

isichei commented Jan 27, 2020

Additional notes after chat:

  • Should be integrated into read_sql (and only write to parquet format - might require rewriting of get_athena_query_response innards)
  • Should throw an error when the user's sql is itself a CREATE TABLE AS statement, with a message something like: This function wraps your sql in a "CREATE TABLE AS" statement. Please use "get_athena_query_response" to run a "CREATE TABLE AS" statement.
  • Should consider issue CREATE TEMP TABLE #16 when making this
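
The error for CREATE TABLE AS input suggested in the notes above could be sketched roughly as follows. The function name `reject_ctas` is hypothetical, the detection is a plain regular expression rather than a parser (it does not skip leading comments, for example), and the message is the wording proposed in this thread.

```python
import re

# Matches statements that begin with CREATE TABLE and later contain AS.
_CTAS = re.compile(r"^\s*CREATE\s+TABLE\b.*?\bAS\b", re.IGNORECASE | re.DOTALL)

CTAS_MESSAGE = (
    'This function wraps your sql in a "CREATE TABLE AS" statement. '
    'Please use "get_athena_query_response" to run a "CREATE TABLE AS" statement.'
)


def reject_ctas(sql: str) -> str:
    """Return `sql` unchanged, or raise if it is already a CTAS statement."""
    if _CTAS.match(sql):
        raise ValueError(CTAS_MESSAGE)
    return sql


reject_ctas("select * from my_table")  # passes through unchanged
try:
    reject_ctas("CREATE TABLE db.t AS SELECT * FROM my_table")
except ValueError as err:
    print(err)
```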

@isichei
Contributor

isichei commented Feb 4, 2020

Should fix #17


3 participants