Performance (redux) #6

Open
mattkerlogue opened this issue May 5, 2023 · 0 comments

A successor to previous performance issue (#3).

Both quick=TRUE and quick=FALSE methods will now handle text processing (not including comment text in cell content, repeated whitespace and multi-line cell content) without significant drains on performance.

For the standard example file, execution time is broadly comparable to {readODS}, but {tidyods} uses more memory, perhaps understandably given what it is extracting.

# Basic example file (sheet 2) extraction comparison
#> # A tibble: 7 × 5
#>   expression          min   median mem_alloc n_itr
#>   <bch:expr>     <bch:tm> <bch:tm> <bch:byt> <int>
#> 1 cells_quick      38.3ms   38.8ms  536.92KB    13
#> 2 cells_slow       62.4ms   65.8ms    1.01MB     8
#> 3 sheet_quick      44.5ms   49.1ms  617.89KB    11
#> 4 sheet_slow       68.2ms   79.6ms     1.1MB     7
#> 5 readODS          43.9ms   46.1ms  375.06KB    11

For large files, quick=TRUE is quicker than {readODS} (median of about 10.4s against 13.8s) and quick=FALSE is only slightly slower (about 16.2s to 16.5s). Interestingly, all {tidyods} extraction approaches use notably less memory than {readODS} (roughly 170MB to 310MB against 2.53GB).

# Postcode example file (sheet 2) extraction comparison
#> # A tibble: 7 × 5
#>   expression          min   median mem_alloc n_itr
#>   <bch:expr>     <bch:tm> <bch:tm> <bch:byt> <int>
#> 1 cells_quick       9.94s   10.44s  171.52MB     5
#> 2 cells_slow       14.93s   16.51s   289.8MB     5
#> 3 sheet_quick      10.26s   10.49s  187.64MB     5
#> 4 sheet_slow       15.47s   16.18s  311.19MB     5
#> 5 readODS          13.63s   13.83s    2.53GB     5

Performance bottlenecks are now largely due to {xml2} (and underlying libxml2) limitations that cannot be overcome without writing independent C/C++ code to handle XML extraction.

A critical limitation of libxml2 is that parsing a document requires available memory of roughly 4 times the file size.

In general for a balanced textual document the internal memory requirement is about 4 times the size of the UTF8 serialization of this document (example the XML-1.0 recommendation is a bit more of 150KBytes and takes 650KBytes of main memory when parsed)
GNOME libxml 2 documentation

As a precaution {tidyods} checks the size of the content.xml file inside the ODS zip container and compares this to the available memory reported by ps::ps_system_memory() to determine whether the XML can be safely processed.

This check is an internal function that throws an error when the XML is too large and invisibly returns TRUE when the XML is an acceptable size. The function has a verbose argument if you want a report on the file size, processing requirement and available memory.

tidyods:::check_xml_memory("path/to/small_file.ods")

#> Error in `check_xml_memory()`:
#> ! ODS file is too large to process
#> ℹ ODS XML is estimated to need 7.74 GB of memory, uncompressed content.xml
#>   file within path/to/small_file.ods is 1.93 GB in size.
#> ✖ Available system memory is estimated at 1.50 GB

tidyods:::check_xml_memory("path/to/small_file.ods", verbose = TRUE)

#> ℹ ODS XML is estimated to need 228.76 kB of memory, uncompressed content.xml 
#> file within path/to/small_file.ods is 57.19 kB in size.
#> ✔ Available system memory is estimated at 1.55 GB
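
The logic of the check described above can be sketched roughly as follows. This is an illustration only, not the {tidyods} implementation: the function names `estimate_xml_memory()`, `xml_fits_in_memory()` and `check_ods_size()` are hypothetical, the multiplier of 4 is the rule of thumb from the libxml2 documentation, and the `<=` comparison is an assumption about how the threshold is applied. The only real calls used are `utils::unzip(list = TRUE)`, which lists the uncompressed size of each file in the zip container, and `ps::ps_system_memory()`, whose `avail` element reports available memory in bytes.

```r
# Rule-of-thumb estimate: libxml2 needs about 4x the size of the XML file.
# (Hypothetical helper, not part of {tidyods}.)
estimate_xml_memory <- function(xml_bytes, multiplier = 4) {
  xml_bytes * multiplier
}

# Pure comparison of the estimated requirement against available memory.
# Whether the real check uses `<=` or `<` is an assumption here.
xml_fits_in_memory <- function(xml_bytes, avail_bytes, multiplier = 4) {
  estimate_xml_memory(xml_bytes, multiplier) <= avail_bytes
}

# Hypothetical wrapper: read the uncompressed size of content.xml from the
# ODS zip container and compare it with the available system memory.
# The default for `avail_bytes` requires the {ps} package to be installed.
check_ods_size <- function(path, avail_bytes = ps::ps_system_memory()$avail) {
  contents <- utils::unzip(path, list = TRUE)
  xml_bytes <- contents$Length[contents$Name == "content.xml"]
  if (length(xml_bytes) != 1) {
    stop("content.xml not found in ", path)
  }
  if (!xml_fits_in_memory(xml_bytes, avail_bytes)) {
    stop("ODS file is too large to process")
  }
  invisible(TRUE)
}

# Approximate figures from the two reports above, converted to bytes
xml_fits_in_memory(1.93e9, 1.50e9)   # 1.93 GB of XML, 1.50 GB available
xml_fits_in_memory(57.19e3, 1.55e9)  # 57.19 kB of XML, 1.55 GB available
```

With these inputs the first call returns FALSE (about 7.7 GB would be needed) and the second returns TRUE (about 229 kB would be needed), matching the error and success reports shown above.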