
Support for DBI Method dbFetch() #185

Open
JavOrraca opened this issue Jul 30, 2024 · 3 comments
@JavOrraca

Most of my production jobs deploy to Kubernetes, where we aim to execute everything with minimal resources. This constraint necessitates batch processing in small increments. Previously, I utilized DBI::dbFetch() within my pipelines to handle data collection in batches and structured it as follows (pseudo-code):

batch_data <- DBI::dbFetch(res, n = batch_size)

For reference, here are the docs for dbFetch(): dbi.r-dbi.org/reference/dbFetch.html

As the docs state, res represents an object inheriting from DBIResult created by DBI::dbSendQuery(). I know DBI::dbSendQuery() won't work with {pool} but I've been able to replace most DBI::dbSendQuery() calls in my code with pool::dbExecute().
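For context, the substitution described above looks something like this. This is an illustrative sketch: the pool object and table name are assumptions, and `pool` forwards `dbExecute()` to a connection it manages for you.

```r
library(pool)

# `my_pool` is an assumed, already-created pool::dbPool() object
# dbExecute() runs a statement that returns no rows, so no result
# object needs to persist between calls -- this works with a pool
rows_affected <- dbExecute(
  my_pool,
  "UPDATE my_table SET processed = TRUE WHERE id = 42"
)
```

This works because `dbExecute()` is self-contained: the pool can check a connection out, run the statement, and return it, with no state left behind.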

Each time dbFetch() was executed within a loop, it'd intelligently fetch the next batch of records. There might be a more efficient method that I haven't yet discovered, but the alternative approach I currently employ with pool (which is effective but gross) involves manually setting start_row and end_row for each loop iteration and then executing a query like this:

batch_data <- pool::dbGetQuery(
  conn,
  glue::glue(
    "WITH CTE AS (
       SELECT {input_tbl}.*, ROW_NUMBER() OVER (ORDER BY 1) AS row_num
       FROM {db}.{input_tbl}
    )
    SELECT CTE.*
    FROM CTE
    WHERE row_num >= {start_row} AND row_num <= {end_row}"
  )
)

A {pool}-compatible dbFetch() would be incredibly useful. If there's a simpler solution that I've overlooked, I would appreciate any guidance. This package has already been a lifesaver and is now an integral part of my team's production cron jobs. Awesome package! 🙏

@hadley (Member) commented Jul 30, 2024

Unfortunately there's no way to implement dbFetch() with pool: in general, the connection used to create the query may have gone away, and pool will supply a new one that has no state from your previous query.

I'd suggest manually checking out (and returning) a connection and using dbFetch() on that.
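A minimal sketch of that suggestion, assuming an existing pool object (`my_pool`), a query string, and a `batch_size`; the names are illustrative:

```r
library(DBI)
library(pool)

conn <- poolCheckout(my_pool)   # take a persistent connection out of the pool

res <- dbSendQuery(conn, "SELECT * FROM my_table")
while (!dbHasCompleted(res)) {
  # dbFetch() keeps its position on `res`, so each call
  # returns the next batch of rows
  batch_data <- dbFetch(res, n = batch_size)
  # ... process batch_data ...
}

dbClearResult(res)   # release the result set
poolReturn(conn)     # hand the connection back to the pool
```

Because the checked-out connection is exclusively yours until `poolReturn()`, the result set's fetch position survives across loop iterations, which is exactly the state a pooled connection can't guarantee.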

@JavOrraca (Author)

Ah, got it... thank you. Subsetting a remote table by start_row and end_row would be ideal, but I've only found something like this to work for a few databases:

library(dplyr)  # dbplyr handles the SQL translation

conn |>
  tbl(I(glue::glue("{db}.{input_tbl}"))) |>
  filter(
    row_number() >= start_row,
    row_number() <= end_row
  ) |>
  collect()

I get that database tables have no specific row order but if you have any other pointers, it'd be appreciated. Thanks again.

@hadley (Member) commented Jul 30, 2024

You can still use dbFetch(); you just need to use poolCheckout() to get a persistent connection.
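For a cron job, it may be worth wrapping the checkout so the connection is returned even if a batch errors. A sketch under the same assumptions as above (`my_pool`, a SQL string, a `batch_size`, and a hypothetical `process()` callback), using `on.exit()` for cleanup:

```r
library(DBI)
library(pool)

fetch_in_batches <- function(pool, sql, batch_size, process) {
  conn <- poolCheckout(pool)
  # registered cleanups run on exit, even if process() throws;
  # after = FALSE puts dbClearResult() before poolReturn()
  on.exit(poolReturn(conn), add = TRUE)
  res <- dbSendQuery(conn, sql)
  on.exit(dbClearResult(res), add = TRUE, after = FALSE)

  while (!dbHasCompleted(res)) {
    process(dbFetch(res, n = batch_size))
  }
  invisible(NULL)
}
```

This keeps the checkout/return pairing in one place, so the rest of the pipeline only ever sees the pool.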
