Is your proposal related to a problem?
When using Splink with AWS Athena, it would be great to be able to start from an existing table in Athena rather than a dataframe.
Describe the solution you'd like
The Linker object currently takes an input dataframe and an output Athena connection, as shown below. When you run Linker, it creates a table in Athena and stores the data in S3. This table is referenced later when running things like linker.training.
Example:
import boto3
from splink.backends.athena import AthenaAPI
from splink import Linker, SettingsCreator, splink_datasets
boto3_session = boto3.Session(region_name="eu-west-1")
df = splink_datasets.historical_50k
db_api = AthenaAPI(
    boto3_session,
    output_bucket=bucket,
    output_database=database,
    output_filepath=filepath,
)
settings = SettingsCreator( ... )
linker = Linker(df, settings, db_api=db_api)
linker.training. ......
For my use case, I already have tables set up in Athena that I want to run matching on. I would like an option to pass such a table to Linker() directly, rather than only being able to pass a dataframe. As it stands, I have to read the data into a dataframe, which then gets written back to Athena/S3. That round trip is redundant and becomes a problem on larger datasets.
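To make the request concrete, here is a rough sketch of the input handling I have in mind. It is not Splink code: `ExistingTable`, `UploadRequired` and `classify_input` are hypothetical names used only for illustration, and the assumption is that a plain string would be treated as the name of a table that already exists in Athena, while anything else keeps the current upload path.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class ExistingTable:
    """Hypothetical marker: the data already lives in Athena under this name."""
    name: str


@dataclass(frozen=True)
class UploadRequired:
    """Hypothetical marker: in-memory data that must first be written to S3."""
    data: Any


def classify_input(table_or_df: Union[str, Any]) -> Union[ExistingTable, UploadRequired]:
    """Decide whether an input can be referenced in place or must be uploaded."""
    # A plain string is treated as the name of an existing table, so nothing
    # needs to be read into memory or written back to S3.
    if isinstance(table_or_df, str):
        name = table_or_df.strip()
        if not name:
            raise ValueError("table name must not be empty")
        return ExistingTable(name=name)
    # Anything else is assumed to be dataframe-like and follows the current path.
    return UploadRequired(data=table_or_df)
```

With something like this, passing a table name such as "my_database.my_table" (a made-up name) would skip the dataframe round trip entirely, and passing a dataframe would behave as it does today.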
Describe alternatives you've considered
Currently I can read all of the data from the Athena tables into a dataframe first. This is large and redundant.
Additional context