Data types and speed #15

RobinL · 2020-01-27T06:35:08Z

Need job that just does term frequency adjustment
Save out two join tables so we know how big they are
See whether they are broadcasting and whether we should hint the broadcast
How do we make the small tables shuffle to the big tables?

Data types and speed

The current implementation saves results out to CSV in S3 (Athena's default behaviour) and then reads them back in from S3.

However, it is possible to save results out to Parquet using a CREATE TABLE AS (CTAS) statement.

This has two benefits:

Reading Parquet from S3 is a lot faster (double the speed or more) than reading CSV.
The resultant dataframe in pandas is guaranteed to have the right data types.
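As a rough illustration of the CTAS idea, here is a minimal sketch of wrapping a user's query so Athena writes Parquet rather than CSV. The function name `wrap_in_ctas` and its arguments are hypothetical (they are not part of the existing code); `format` and `external_location` are the standard Athena CTAS table properties.

```python
def wrap_in_ctas(sql: str, table: str, s3_location: str) -> str:
    """Wrap a SELECT statement in an Athena CREATE TABLE AS statement.

    Hypothetical helper: `table` is the (temporary) table to create and
    `s3_location` is the S3 prefix where Athena should write the Parquet files.
    """
    # Drop surrounding whitespace and any trailing semicolon so the
    # query can be embedded after "AS".
    body = sql.strip().rstrip(";").strip()
    return (
        f"CREATE TABLE {table}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS\n{body}"
    )


if __name__ == "__main__":
    print(wrap_in_ctas("select * from db.my_table;", "tmp.my_result", "s3://my-bucket/tmp/my_result/"))
```

The Parquet files under `s3_location` could then be read into pandas (for example with `pandas.read_parquet`), which is where the data types would come through from the file metadata instead of being inferred from CSV text.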

One potential issue with this approach is that the user must submit a SELECT statement (not, e.g., a delete table statement). So, if we're worried about this, we would need to parse the SQL to make sure it's a SELECT statement.
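For illustration only, a minimal sketch of what such a check might look like. This is not the earlier attempt mentioned in this issue, and `is_select_statement` is a hypothetical name. It is deliberately conservative and assumes a query starts with `SELECT` or `WITH`; it will wrongly reject queries wrapped in parentheses or containing `;` inside a string literal, and comment markers inside string literals would confuse it.

```python
import re

# Matches "-- line comments" and "/* block comments */".
_COMMENTS = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)


def is_select_statement(sql: str) -> bool:
    """Return True if `sql` looks like a single SELECT (or WITH ... SELECT) query."""
    # Remove comments, surrounding whitespace and one trailing semicolon.
    stripped = _COMMENTS.sub(" ", sql).strip().rstrip(";").strip()
    # Any remaining semicolon suggests more than one statement: reject.
    if ";" in stripped:
        return False
    first_word = stripped.split(None, 1)[0].lower() if stripped else ""
    return first_word in {"select", "with"}
```

A proper SQL parser would be more robust than this kind of keyword check, at the cost of an extra dependency.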

I previously had a go at this here; it works in most situations, but it's very rough and ready.

Once we've done this, we should probably deprecate the python_athena_tools repo.


isichei · 2020-01-27T10:08:57Z

Additional notes after chat:

Should be integrated into read_sql (and only write to parquet format - might require rewriting of get_athena_query_response innards)
Should throw an error when the user's SQL is itself a CREATE TABLE AS statement, with a message something like: 'This function wraps your sql in a "CREATE TABLE AS" statement. Please use "get_athena_query_response" to run a "CREATE TABLE AS" statement.'
Should consider issue CREATE TEMP TABLE #16 when making this
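A minimal sketch of the error behaviour described in the notes above, assuming a simple prefix check is good enough. `check_not_ctas` is a hypothetical helper name, and the check does not handle leading SQL comments; the message text is the one suggested in these notes.

```python
import re

# Matches SQL that begins with "CREATE TABLE" (any case, leading whitespace allowed).
_CREATE_TABLE = re.compile(r"^\s*create\s+table\b", re.IGNORECASE)

CTAS_ERROR = (
    'This function wraps your sql in a "CREATE TABLE AS" statement. '
    'Please use "get_athena_query_response" to run a "CREATE TABLE AS" statement.'
)


def check_not_ctas(sql: str) -> None:
    """Raise ValueError if the user's SQL is already a CREATE TABLE statement."""
    if _CREATE_TABLE.match(sql):
        raise ValueError(CTAS_ERROR)
```

Something like this could run at the top of read_sql, before the user's query is wrapped.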

isichei · 2020-02-04T13:57:08Z

Should fix #17

This was referenced Jan 27, 2020

pandas column-type overwrite #1

Open

ValueError: Bool column has NA values in column 59 #14

Open

isichei assigned gkelly900 Feb 4, 2020
