Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R aborts during linkage of two files when n rows > 25k #86

Open
wbakerrobinson opened this issue Sep 12, 2024 · 0 comments
Open

R aborts during linkage of two files when n rows > 25k #86

wbakerrobinson opened this issue Sep 12, 2024 · 0 comments

Comments

@wbakerrobinson
Copy link

I would like to link two files of size 50k rows. Whenever I try to link these files R aborts on me. I systematically sampled n records from each file starting at 10k. When n gets to 25k R starts aborting. Is anyone else having this problem with the most up to date versions of R, and fastLink?

My computer has 64 gb ram. Here is the session information:
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] tidycensus_1.6.5 stringr_1.5.1 fastLink_0.6.1 dplyr_1.1.4

loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 utf8_1.2.4 generics_0.1.3 tidyr_1.3.1 class_7.3-22 xml2_1.3.6 lpSolve_5.6.20
[8] KernSmooth_2.23-24 adagio_0.9.2 gtools_3.9.5 stringi_1.8.4 lattice_0.22-6 hms_1.1.3 magrittr_2.0.3
[15] grid_4.4.1 iterators_1.0.14 foreach_1.5.2 doParallel_1.0.17 jsonlite_1.8.8 Matrix_1.7-0 e1071_1.7-14
[22] DBI_1.2.3 httr_1.4.7 rvest_1.0.4 purrr_1.0.2 fansi_1.0.6 stringdist_0.9.12 codetools_0.2-20
[29] cli_3.6.3 rlang_1.1.4 tigris_2.1 crayon_1.5.3 units_0.8-5 plotrix_3.8-4 tools_4.4.1
[36] parallel_4.4.1 tzdb_0.4.0 uuid_1.2-1 vctrs_0.6.5 R6_2.5.1 proxy_0.4-27 lifecycle_1.0.4
[43] classInt_0.4-10 pkgconfig_2.0.3 pillar_1.9.0 data.table_1.16.0 glue_1.7.0 Rcpp_1.0.13 sf_1.0-17
[50] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.16.0 readr_2.1.5 compiler_4.4.1

Here is a reproducible example:

# Rstudio abort repex
# WBR
# 9/12/2024

# Packages
library(dplyr)
library(fastLink)
library(stringr)
library(tidycensus)

# Session information
sessionInfo()

# get zip code data for sim
or_zip <- get_acs(geography = "zcta",
                  variables = c("B01001_001"),
                  year = 2020,
                  survery = "acs5",
                  geometry = FALSE) %>% 
  select(GEOID, estimate) %>% 
  rename(zip_code = GEOID,
         pop = estimate) %>% 
  filter(str_detect(zip_code, "^97") & pop > 0)

or_zip_tot <- sum(or_zip$pop)

or_zip <- mutate(or_zip, pop_pct = pop/or_zip_tot)

# Function to create sim data
# Goal isn't to realistically simulate my dataset just create something to show problem
sim_data <- function(n_sample = 25000)
{
  tibble(age = rnorm(n_sample, mean = 50, sd = 20),
         sex = rbinom(n_sample, size = 1, prob = 0.5),
                 zip_code = sample(or_zip$zip_code, size = n_sample, replace = TRUE, or_zip$pop_pct),
                 disp_cat = sample(c(1, 2, 3, 4), size = n_sample, replace = TRUE, c(0.00278, 0.00915, 0.8811, 0.10697)),
                 LOS = rnbinom(n_sample, size = 2, prob = 0.2),
                 hours_since = round(runif(n_sample, 0, 8760)),
                 hospital_id = "test hosp") %>% 
    mutate(age = as.numeric(if_else(age < 0, NA_integer_, floor(age))),
           sex = factor(sex, labels = c("Male", "Female")),
           disp_cat = factor(disp_cat,
                             labels = c("Admit", "Deceased", "Discharged", "Transferred")),
           LOS = as.numeric(LOS),
           hours_since = as.numeric(hours_since))
}

# Function to link
blocked_link_fx <- function(df_split_A, df_split_B)
{
  require(fs)
  require(fastLink)
  
  df_dim_A <- dim(df_split_A)
  df_dim_B <- dim(df_split_B)
  cat("Facility:", unique(df_split_A$hospital_id), "\n")
  cat("DF A Num rows:", df_dim_A[1], "Num cols:", df_dim_A[2], "\n")
  cat("DF B Num rows:", df_dim_B[1], "Num cols:", df_dim_B[2], "\n")
  
  tryCatch({
    link_result <- fastLink(df_split_A, df_split_B,
                            varnames = c("age", "sex", "zip_code",
                                         "disp_cat", "LOS", "hours_since"),
                            numeric.match = c("age", "LOS", "hours_since"),
                            partial.match = c("age", "LOS", "hours_since"),
                            cut.a.num = 0.5,
                            cut.p.num = 2)
    
    return(link_result)
  }, error = function(e) {
    message("Error in processing: ", e)
    return(NULL)
  })
}

# Create sample data n = 10000
n <- 10000
n_exact <- 3000
set.seed(3)
df_A <- sim_data(n)

set.seed(6)
df_B <- bind_rows(slice_sample(df_A, n = n_exact),
                  sim_data(n - n_exact))

# Test linkage w/ 10000
test_link1 <- blocked_link_fx(df_A, df_B)

# Create sample data n = 25000
n <- 25000
n_exact <- 3000

set.seed(3)
df_A <- sim_data(n)

set.seed(6)
df_B <- bind_rows(slice_sample(df_A, n = n_exact),
                  sim_data(n - n_exact))

# Test linkage w/ 25000
test_link2 <- blocked_link_fx(df_A, df_B)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant