Add option for Input Table with Athena Linker connection #2339

Open
StephenBowser opened this issue Aug 15, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@StephenBowser
Is your proposal related to a problem?

While using Splink with AWS Athena it would be great to be able to start with an existing table in Athena rather than a dataframe.

Describe the solution you'd like

At the moment the Linker object takes an input dataframe and an Athena connection (db_api) for its output, as shown below. When the Linker is created, it creates a table in Athena from the dataframe and stores the data in S3. That table is then referenced by later steps such as linker.training.

Example:
import boto3
from splink import Linker, SettingsCreator, splink_datasets
from splink.backends.athena import AthenaAPI

boto3_session = boto3.Session(region_name="eu-west-1")
df = splink_datasets.historical_50k  # input is a dataframe held in memory

# bucket, database and filepath stand in for your own S3/Athena values
db_api = AthenaAPI(
    boto3_session,
    output_bucket=bucket,
    output_database=database,
    output_filepath=filepath,
)
settings = SettingsCreator( ... )
linker = Linker(df, settings, db_api=db_api)  # df is written to Athena/S3 here
linker.training. ......

For my use case, I already have tables set up in Athena that I want to run matching on. I would like an option to pass such a table to Linker() directly, rather than only being able to pass a dataframe. As it stands, I have to read the data into a dataframe, which is then sent straight back to Athena/S3. This round trip is redundant and becomes a problem on larger datasets.
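A rough sketch of the behaviour I have in mind, assuming an existing table would be passed as a string name. The `plan_input` helper, the action names and the `uploaded_input` table name are all hypothetical and only illustrate the idea; none of them are part of Splink's API:

```python
from typing import Any, Dict


def plan_input(source: Any, output_database: str) -> Dict[str, str]:
    """Decide how an input would be made available to Athena.

    Hypothetical helper for illustration only; not part of Splink.
    """
    if isinstance(source, str):
        # A string is treated as the name of a table that already exists
        # in Athena, so nothing is read locally or written back to S3.
        return {"action": "reference_existing_table", "table": source}
    # Anything else (e.g. a dataframe) follows the current path:
    # upload the data and register a new table in the output database.
    return {
        "action": "upload_and_register",
        "table": f"{output_database}.uploaded_input",  # made-up table name
    }
```

With something like this, a table name would skip the upload step entirely, while dataframe inputs would keep working exactly as they do today.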

Describe alternatives you've considered

Currently I can read all of the data from my Athena tables into a dataframe and pass that to the Linker. The data is large, and it just gets written back to Athena again, so this is redundant.

Additional context

@StephenBowser StephenBowser added the enhancement New feature or request label Aug 15, 2024