Most of my production jobs deploy to Kubernetes, where we aim to execute everything with minimal resources. This constraint necessitates batch processing in small increments. Previously, I used DBI::dbFetch() within my pipelines to handle data collection in batches, structured as follows (pseudo-code):

batch_data <- DBI::dbFetch(res, n = batch_size)

For reference, here are the docs for dbFetch(): dbi.r-dbi.org/reference/dbFetch.html
As the docs state, res is an object inheriting from DBIResult, created by DBI::dbSendQuery(). I know DBI::dbSendQuery() won't work with {pool}, but I've been able to replace most DBI::dbSendQuery() calls in my code with pool::dbExecute().
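Spelled out a bit more, the loop is roughly this (the query, conn, batch_size, and process_batch() are just stand-ins for the real pipeline):

res <- DBI::dbSendQuery(conn, glue::glue("SELECT * FROM {db}.{input_tbl}"))
while (!DBI::dbHasCompleted(res)) {
  # each call picks up where the previous one stopped
  batch_data <- DBI::dbFetch(res, n = batch_size)
  process_batch(batch_data)  # stand-in for the per-batch work
}
DBI::dbClearResult(res)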
Each time dbFetch() was executed within a loop, it'd intelligently fetch the next batch of records. There might be a more efficient method that I haven't yet discovered, but the alternative approach I currently employ with pool (which is effective but gross) involves manually setting start_row and end_row for each loop iteration and then executing a query like this:
batch_data <- pool::dbGetQuery(
  conn,
  glue::glue(
    "WITH CTE AS (
       SELECT {input_tbl}.*, ROW_NUMBER() OVER (ORDER BY 1) AS row_num
       FROM {db}.{input_tbl}
     )
     SELECT CTE.*
     FROM CTE
     WHERE row_num >= {start_row} AND row_num <= {end_row}"
  )
)
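The surrounding loop then looks something like this (row_count, batch_size, and process_batch() are stand-ins for values the real job already has):

row_count <- pool::dbGetQuery(
  conn,
  glue::glue("SELECT COUNT(*) AS n FROM {db}.{input_tbl}")
)$n

for (start_row in seq(1, row_count, by = batch_size)) {
  end_row <- min(start_row + batch_size - 1, row_count)
  # re-run the windowed query above with the new start_row / end_row
  batch_data <- pool::dbGetQuery(
    conn,
    glue::glue(
      "WITH CTE AS (
         SELECT {input_tbl}.*, ROW_NUMBER() OVER (ORDER BY 1) AS row_num
         FROM {db}.{input_tbl}
       )
       SELECT CTE.* FROM CTE
       WHERE row_num >= {start_row} AND row_num <= {end_row}"
    )
  )
  process_batch(batch_data)  # stand-in for the per-batch work
}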
A {pool}-compatible dbFetch() would be incredibly useful. If there's a simpler solution that I've overlooked, I would appreciate any guidance. This package has already been a lifesaver and is now an integral part of my team's production cron jobs. Awesome package! 🙏
Unfortunately there's no way to implement dbFetch() with pool because in general it's possible that the connection used to create the query has gone away and pool is supplying a new one (and that new one obviously doesn't have any state about your previous query).
I'd suggest manually checking out (and returning) a connection and using dbFetch() on that.
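Roughly this (untested sketch; pool is your pool object, and query, batch_size, and process_batch() are whatever your job already uses):

conn <- pool::poolCheckout(pool)   # take a real DBIConnection out of the pool
res  <- DBI::dbSendQuery(conn, query)
while (!DBI::dbHasCompleted(res)) {
  batch_data <- DBI::dbFetch(res, n = batch_size)
  process_batch(batch_data)        # stand-in for the per-batch work
}
DBI::dbClearResult(res)
pool::poolReturn(conn)             # always hand the connection back

In production you'd want the poolReturn() inside on.exit() or tryCatch() so the connection still goes back if a batch errors.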
Ah, got it... thank you. Subsetting a remote table by start_row and end_row would be ideal, but I've only found something like this to work for a few db's:
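Something along the lines of a plain LIMIT/OFFSET window, for example (illustrative only; this syntax isn't accepted by every backend, and SQL Server, for one, spells it differently):

# illustrative only: LIMIT/OFFSET is not portable across backends
batch_data <- pool::dbGetQuery(
  conn,
  glue::glue(
    "SELECT *
     FROM {db}.{input_tbl}
     LIMIT {batch_size} OFFSET {start_row - 1}"
  )
)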