
dplyr::copy_to() performance with odbc 1.5.0 #7103

Closed
nathanhaigh opened this issue Nov 11, 2024 · 1 comment

@nathanhaigh

I'm not sure exactly where to start or who to refer this issue to. However, since it's a performance issue I noticed when calling dplyr::copy_to() after the recent odbc 1.5.0 release, I thought I'd report it here in the first instance. I am using a Snowflake DB backend.

The recent odbc 1.5.0 release has rolled back a change to the default value of batch_rows for the dbWriteTable() and dbBind() methods that was made in odbc 1.3.0 (https://odbc.r-dbi.org/news/index.html#minor-improvements-and-bug-fixes-1-5-0). The default has been changed from NA to 1024, which means that when using dplyr::copy_to() with a reasonably sized table, the inserts are done in batches of 1024 rows. For sizeable tables, the overhead of performing many inserts is very noticeable, and it is far slower than odbc 1.4.2, which simply inserts all the rows in one go. It seems that once the table is large enough, a file is uploaded using PUT and the entire contents of the file are inserted in one go.
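
To put the overhead into perspective, some rough back-of-the-envelope arithmetic (the round-trip count is my own estimate, not something I've measured from the driver):

```r
# With the odbc 1.5.0 default of batch_rows = 1024, copying a 400k-row table
# means roughly 391 separate bind/execute round trips instead of one bulk load
ceiling(400000 / 1024)
#> [1] 391
```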

I did some benchmarking of copying a table consisting of a single column of integer values and differing numbers of rows (small = 1024 rows, large = 20k rows, xlarge = 400k rows), using odbc 1.4.2 and comparing to odbc 1.5.0. The setup looked roughly like the sketch below; the results follow.
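
A minimal sketch of the timing setup (the DSN and table names here are placeholders rather than my real connection details, and I've used system.time() for simplicity):

```r
library(DBI)
library(dplyr)

# Placeholder connection -- substitute your own Snowflake DSN
con <- dbConnect(odbc::odbc(), dsn = "snowflake")

sizes <- c(small = 1024, large = 20000, xlarge = 400000)

# Time dplyr::copy_to() for each table size, recording elapsed seconds
elapsed <- vapply(names(sizes), function(nm) {
  dat <- data.frame(x = seq_len(sizes[[nm]]))
  unname(system.time(
    copy_to(con, dat, name = paste0("copy_to_bench_", nm),
            temporary = TRUE, overwrite = TRUE)
  )["elapsed"])
}, numeric(1))

elapsed
```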

Using odbc 1.4.2 (DBI 1.2.3, dplyr 1.1.4 and dbplyr 2.5.0):
[benchmark results image]

Using odbc 1.5.0 (DBI 1.2.3, dplyr 1.1.4 and dbplyr 2.5.0):
[benchmark results image]

As you can see, dplyr::copy_to() performance is severely impacted by odbc 1.5.0, especially for tables containing many rows. The xlarge table, containing a single column of ints 1:400000, took ~5s with odbc 1.4.2 vs 250s with odbc 1.5.0.

I see it is possible to set batch_rows globally via options() to a larger number of rows:

options(odbc.batch_rows = 1e9)

This fixes the performance issue (see below), but I'm unclear on the reason why odbc 1.5.0 made 1024 the default, and whether increasing this value to an arbitrarily large number might have consequences I haven't thought about.

[benchmark results image with odbc.batch_rows = 1e9]
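
As an aside, if I've read the odbc documentation correctly, batch_rows can also be passed per call rather than set globally, which avoids changing the session-wide default (the table name below is just a placeholder, and treat the argument pass-through as an assumption on my part):

```r
xlarge <- data.frame(x = 1:400000)

# batch_rows is (per my reading of the odbc docs) an argument to odbc's
# dbWriteTable() method; setting it to the row count writes everything in one batch
DBI::dbWriteTable(con, "XLARGE_TEST", xlarge, overwrite = TRUE,
                  batch_rows = nrow(xlarge))
```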

@DavisVaughan (Member)

Thanks, closing in favor of r-dbi/odbc#862
