
Support for DBI Method dbFetch() #185

Open
JavOrraca opened this issue Jul 30, 2024 · 3 comments
@JavOrraca

Most of my production jobs deploy to Kubernetes, where we aim to execute everything with minimal resources. This constraint necessitates batch processing in small increments. Previously, I utilized DBI::dbFetch() within my pipelines to handle data collection in batches and structured it as follows (pseudo-code):

batch_data <- DBI::dbFetch(res, n = batch_size)

For reference, here are the docs for dbFetch(): dbi.r-dbi.org/reference/dbFetch.html

As the docs state, res represents an object inheriting from DBIResult created by DBI::dbSendQuery(). I know DBI::dbSendQuery() won't work with {pool} but I've been able to replace most DBI::dbSendQuery() calls in my code with pool::dbExecute().
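For context, the substitution described above looks something like this. This is an illustrative sketch: the pool object and table name are assumptions, and `pool` forwards `dbExecute()` to a connection it manages for you.

```r
library(pool)

# `my_pool` is an assumed, already-created pool::dbPool() object
# dbExecute() runs a statement that returns no rows, so no result
# object needs to persist between calls -- this works with a pool
rows_affected <- dbExecute(
  my_pool,
  "UPDATE my_table SET processed = TRUE WHERE id = 42"
)
```

This works because `dbExecute()` is self-contained: the pool can check a connection out, run the statement, and return it, with no state left behind.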

Each time dbFetch() was executed within a loop, it'd intelligently fetch the next batch of records. There might be a more efficient method that I haven't yet discovered, but the alternative approach I currently employ with pool (which is effective but gross) involves manually setting start_row and end_row for each loop iteration and then executing a query like this:

batch_data <- pool::dbGetQuery(
  conn,
  glue::glue(
    "WITH CTE AS (
       SELECT {input_tbl}.*, ROW_NUMBER() OVER (ORDER BY 1) AS row_num
       FROM {db}.{input_tbl}
    )
    SELECT CTE.*
    FROM CTE
    WHERE row_num >= {start_row} AND row_num <= {end_row}"
  )
)

A {pool}-compatible dbFetch() would be incredibly useful. If there's a simpler solution that I've overlooked, I would appreciate any guidance. This package has already been a lifesaver and is now an integral part of my team's production cron jobs. Awesome package! 🙏

@hadley (Member) commented Jul 30, 2024

Unfortunately there's no way to implement dbFetch() with pool: in general, the connection used to create the query may have gone away, and pool will supply a new one that has no state from your previous query.

I'd suggest manually checking out (and returning) a connection and using dbFetch() on that.
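A minimal sketch of that suggestion, assuming an existing pool object (`my_pool`), a query string, and a `batch_size`; the names are illustrative:

```r
library(DBI)
library(pool)

conn <- poolCheckout(my_pool)   # take a persistent connection out of the pool

res <- dbSendQuery(conn, "SELECT * FROM my_table")
while (!dbHasCompleted(res)) {
  # dbFetch() keeps its position on `res`, so each call
  # returns the next batch of rows
  batch_data <- dbFetch(res, n = batch_size)
  # ... process batch_data ...
}

dbClearResult(res)   # release the result set
poolReturn(conn)     # hand the connection back to the pool
```

Because the checked-out connection is exclusively yours until `poolReturn()`, the result set's fetch position survives across loop iterations, which is exactly the state a pooled connection can't guarantee.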

@JavOrraca (Author)

Ah, got it... thank you. Subsetting a remote table by start_row and end_row would be ideal, but I've only found something like this to work for a few databases:

library(dplyr)  # dbplyr handles the SQL translation

conn |>
  tbl(I(glue::glue("{db}.{input_tbl}"))) |>
  filter(
    row_number() >= start_row,
    row_number() <= end_row
  ) |>
  collect()

I get that database tables have no specific row order but if you have any other pointers, it'd be appreciated. Thanks again.

@hadley (Member) commented Jul 30, 2024

You can still use dbFetch(); you just need to use poolCheckout() to get a persistent connection.
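For a cron job, it may be worth wrapping the checkout so the connection is returned even if a batch errors. A sketch under the same assumptions as above (`my_pool`, a SQL string, a `batch_size`, and a hypothetical `process()` callback), using `on.exit()` for cleanup:

```r
library(DBI)
library(pool)

fetch_in_batches <- function(pool, sql, batch_size, process) {
  conn <- poolCheckout(pool)
  # registered cleanups run on exit, even if process() throws;
  # after = FALSE puts dbClearResult() before poolReturn()
  on.exit(poolReturn(conn), add = TRUE)
  res <- dbSendQuery(conn, sql)
  on.exit(dbClearResult(res), add = TRUE, after = FALSE)

  while (!dbHasCompleted(res)) {
    process(dbFetch(res, n = batch_size))
  }
  invisible(NULL)
}
```

This keeps the checkout/return pairing in one place, so the rest of the pipeline only ever sees the pool.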
