Add option for Input Table with Athena Linker connection #2339

StephenBowser · 2024-08-15T13:34:34Z

Is your proposal related to a problem?

While using Splink with AWS Athena, it would be great to be able to start from an existing table in Athena rather than a dataframe.

Describe the solution you'd like

When you create the Linker object, it takes an input dataframe and an Athena connection for output, as shown below. When you run Linker, it creates a table in Athena and stores the data in S3. This table is referenced later when running things like linker.training.

Example:
import boto3
from splink.backends.athena import AthenaAPI
from splink import Linker, SettingsCreator, splink_datasets

boto3_session = boto3.Session(region_name="eu-west-1")
df = splink_datasets.historical_50k  # input is a dataframe held in memory

# bucket, database and filepath are placeholders for your own S3/Athena settings
db_api = AthenaAPI(
    boto3_session,
    output_bucket=bucket,
    output_database=database,
    output_filepath=filepath,
)
settings = SettingsCreator( ... )
linker = Linker(df, settings, db_api=db_api)  # df is written to Athena/S3 here
linker.training. ......

For my use case, I already have tables set up in Athena that I want to run matching on. I would like an option to pass such a table to Linker() rather than only a dataframe. As it is now, I need to read the data into a dataframe, which then gets sent back to Athena/S3. This is redundant and problematic on larger datasets.

Describe alternatives you've considered

Currently I can read all of the data from the Athena tables into a dataframe and pass that to Linker. The data is large, and the round trip is redundant.
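
For illustration, the workaround could look roughly like the sketch below. It is not taken from Splink itself: the helper name load_athena_table and the table/database names are hypothetical, and it assumes awswrangler (the AWS SDK for pandas) is installed and AWS credentials are configured. The reader is injectable so the helper can be exercised without an AWS connection.

```python
from typing import Any, Callable, Optional


def load_athena_table(
    table: str,
    database: str,
    read_query: Optional[Callable[..., Any]] = None,
    **reader_kwargs: Any,
) -> Any:
    """Pull an entire Athena table into memory (the redundant step).

    Hypothetical helper for illustration only. By default it uses
    awswrangler.athena.read_sql_query, which returns a pandas DataFrame.
    """
    if read_query is None:
        # Assumption: awswrangler is installed and AWS credentials are set up.
        import awswrangler as wr

        read_query = wr.athena.read_sql_query

    # Selects every row, so the full table travels from S3 to the client.
    sql = f'SELECT * FROM "{table}"'
    return read_query(sql=sql, database=database, **reader_kwargs)
```

The returned dataframe would then be passed as df in the Linker(df, settings, db_api=db_api) call from the example, at which point the same data is written back to Athena/S3. For instance, load_athena_table("people", "my_db", boto3_session=boto3_session), where "people" and "my_db" are placeholder names.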

Additional context

StephenBowser added the enhancement New feature or request label Aug 15, 2024
