
Delayed progress updates to file due to file-system buffering [Was: No near-live progress update when running on a SLURM cluster] #149

Open
frederikziebell opened this issue Dec 19, 2022 · 17 comments



frederikziebell commented Dec 19, 2022

I want to run a loop in parallel on a SLURM cluster and use progressr to output near-live updates. Here's my setup:

test.R

library("future")
library("furrr")
library("progressr")

options(progressr.enable = TRUE)
handlers(global = TRUE)
handlers(handler_progress(
  format   = ":current/:total [:bar] :percent in :elapsed, ETA: :eta",
  width    = 60,
  complete = "="
))
plan(cluster)

f <- function(i) {
  Sys.sleep(3)
  i
}

local({
  n <- 100
  p <- progressor(n)
  result <- future_map(1:n, function(i){
    p(class = "sticky")
    f(i)
  })
})

test.sh

#!/bin/bash
#SBATCH --time=0-00:05:00
#SBATCH --ntasks 10
#SBATCH --output=slurm.out
Rscript test.R

The issue is that, while the job is running, cat slurm.out always shows

0/100 [--------------------------------------------------] 0% in 0s, ETA: ?

and no live progress is reported. How can I get it to show the current progress?

I have read this discussion on getting a progress bar using future.batchtools and that thread on printing messages to the console when they are generated, but I can't get it to work. Is this a bug?

@frederikziebell frederikziebell changed the title No live progress update when running on a SLURM cluster No near-live progress update when running on a SLURM cluster Dec 19, 2022
HenrikBengtsson (Owner) commented:

First of all, it could be that you're actually only running with one parallel worker instead of ten, because, I think, the line:

#SBATCH --ntasks 10

should be

#SBATCH --ntasks=10

I suggest you add:

plan(cluster)
message("Number of parallel workers: ", nbrOfWorkers())

and make sure that you get 10 in the Slurm log file. That said, even with a single parallel worker, you should get progress updates eventually, but it will of course take longer. Are you saying you're not getting any updates even if you run it to the very end?
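
As a sanity check independent of R, the allocation that Slurm actually granted can also be inspected from inside the job script itself. This is just a sketch relying on the standard Slurm-exported environment variables SLURM_NTASKS and SLURM_JOB_NODELIST; they are only defined inside a running job, hence the fallbacks:

```shell
# Report the granted allocation; outside a Slurm job these variables
# are unset, so print "unset" instead of an empty string.
echo "ntasks:   ${SLURM_NTASKS:-unset}"
echo "nodelist: ${SLURM_JOB_NODELIST:-unset}"
```

If "ntasks" does not come back as 10 here, the problem is in the job submission rather than in future/progressr.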


frederikziebell commented Dec 20, 2022

I already confirmed that I got 10 workers, both from availableWorkers() and from printing Sys.getpid() inside f. All of that is as expected. Indeed, I'm not getting any updates, not even at the end: when the job has finished, slurm.out is just empty.

Update: When I switch to plan(sequential), I am getting updates. As of writing this, the content of slurm.out is
87/100 [========================>----] 87% in 4m, ETA: 39s.

Further update: Using plan(future.batchtools::batchtools_slurm) also gives updates, but the job now takes almost 5 minutes, compared to 50 s with plan(cluster). I guess this comes from the overhead of launching new Slurm jobs, as mentioned here.


HenrikBengtsson commented Dec 20, 2022

Thanks for confirming. FWIW, I cannot reproduce this using R 4.2.2 with progressr 0.12.0 running on CentOS 7.9 and a Slurm 21.08.4 scheduler. The slurm.out file gets populated while the job is running, e.g.

$ sbatch test.sh
Submitted batch job 1065784

$ tail -f slurm.out 
Number of workers: 10
37/100 [==========>------------------]  37% in 14s, ETA: 24s

Troubleshooting suggestions

Use Rscript --vanilla test.R.

Add a print(sessionInfo()) at the very end to make sure the code actually completes and to log the session info.

Have you verified that you get output when you run the test.sh on a development/interactive node without involving Slurm? Something like:

$ chmod ugo+x test.sh
$ R_PARALLELLY_AVAILABLECORES_SYSTEM=10 ./test.sh
Number of workers: 10
21/100 [=====>-----------------------]  21% in  9s, ETA: 35s

Session details

$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)

$ sbatch --version
slurm 21.08.4
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /software/c4/cbi/software/R-4.2.2-gcc10/lib64/R/lib/libRblas.so
LAPACK: /software/c4/cbi/software/R-4.2.2-gcc10/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] progressr_0.12.0 furrr_0.3.1      future_1.30.0   

loaded via a namespace (and not attached):
 [1] compiler_4.2.2    parallelly_1.33.0 magrittr_2.0.3    cli_3.4.1        
 [5] tools_4.2.2       parallel_4.2.2    listenv_0.9.0     codetools_0.2-18 
 [9] digest_0.6.31     globals_0.16.2    rlang_1.0.6       purrr_0.3.5

HenrikBengtsson (Owner) commented:

One more troubleshooting suggestion:

When using only:

#SBATCH --ntasks=10

you might end up getting a job that can run on multiple hosts. To rule that problem out, make sure to ask for a single compute node;

#SBATCH --nodes=1
#SBATCH --ntasks=10

Then, although plan(cluster) should also work, use:

plan(multisession)

With these two changes, it'll parallelize on one compute node, on which your job/script got access to 10 CPU cores by Slurm. See if you get progress updates in that case.


frederikziebell commented Dec 20, 2022

Thanks for the suggestions. I'm also running R 4.2.2 and progressr 0.12.0 on CentOS 7.9, but my Slurm scheduler is 22.05.2. I tried the --vanilla switch, but it didn't make a difference. Running in interactive mode as you described gives progress-bar updates.

The issue persists if the cluster plan runs on one or multiple nodes. Ultimately, I need multiple nodes, because I need >200 cores for my computation.

Do you have any other ideas what I could try?

Session details

-bash-4.2$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)

-bash-4.2$ sbatch --version
slurm 22.05.2
Session info ───────────────────
 setting  value
 version  R version 4.2.2 (2022-10-31)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2022-12-20
 pandoc   1.12.3.1 @ /usr/bin/pandoc

Packages ────────────────────
 package     * version date (UTC) lib source
 cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.2)
 codetools     0.2-18  2020-11-04 [1] CRAN (R 4.2.1)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.2.2)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.2)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.2)
 furrr       * 0.3.1   2022-08-15 [1] CRAN (R 4.2.2)
 future      * 1.30.0  2022-12-16 [1] CRAN (R 4.2.2)
 globals       0.16.2  2022-11-21 [1] CRAN (R 4.2.2)
 hms           1.1.2   2022-08-19 [1] CRAN (R 4.2.2)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
 listenv       0.9.0   2022-12-16 [1] CRAN (R 4.2.2)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.2)
 parallelly    1.33.0  2022-12-14 [1] CRAN (R 4.2.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.2.2)
 progress      1.2.2   2019-05-16 [1] CRAN (R 4.2.2)
 progressr   * 0.12.0  2022-12-13 [1] CRAN (R 4.2.2)
 purrr         0.3.5   2022-10-06 [1] CRAN (R 4.2.2)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.2)
 rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.2)
 vctrs         0.5.1   2022-11-16 [1] CRAN (R 4.2.2)
──────────────────────────

HenrikBengtsson (Owner) commented:

Interesting and "odd" problem. Have you confirmed that the script actually completes, e.g.

result <- local({
  n <- 100
  p <- progressor(n)
  future_map(1:n, function(i){
    p(class = "sticky")
    f(i)
  })
})

str(result)
print(sessionInfo())

Because, given that you say you get:

0/100 [--------------------------------------------------]   0% in  0s, ETA:  ?

but nothing else, tells me that at least the first progress update, generated when calling progressor(), did go through. Even if you don't get near-live updates, the remaining progress updates should definitely roll in just before future_map() returns. So, if you don't get them even then, it sounds to me like future_map() never actually returns.

Another troubleshooting help is:

handlers("debug")

That outputs all progress-update signals as they are captured, without any "fancy" progress-bar rendering (erase, reprint, erase, etc.). See https://progressr.futureverse.org/#debugging for example output. If it only shows the first 'initiate' progress signal, then something is blocking everything that follows from appearing (which would be really odd to me right now).

Also, another thing you can try is to use with_progress(), instead of local() and handlers(global = TRUE). I'm just trying to come up with things to rule out.

Regarding:

Ultimately, I need multiple nodes, because I need >200 cores for my computation.

Note that R has a built-in limitation where you cannot run more than ~120 parallel workers, cf. https://parallelly.futureverse.org/reference/availableConnections.html. To get more, you'll have to tweak the R code in one place and build from source. If you want to help on that end, please consider reaching out to R-devel about this, so that limitation gets more attention - I've done it in the past, but I only have so much time to push that question.


frederikziebell commented Dec 21, 2022

The script actually completes, i.e. I put the session info after the loop as you suggested and it ends up in slurm.out. Switching to with_progress() didn't make a difference. I also tried handlers("debug"); here's the output, which I followed with tail -f:

[02:52:13.663] (0.000s => +0.001s) initiate: 0/100 (+0) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.082] (3.419s => +2.985s) update: 1/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.093] (3.431s => +2.648s) update: 2/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.106] (3.443s => +2.297s) update: 3/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.118] (3.455s => +1.976s) update: 4/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.131] (3.468s => +1.623s) update: 5/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.144] (3.481s => +1.328s) update: 6/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.157] (3.494s => +1.010s) update: 7/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.169] (3.506s => +0.678s) update: 8/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.181] (3.518s => +0.351s) update: 9/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.192] (3.530s => +0.030s) update: 10/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.215] (3.552s => +0.110s) update: 11/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.479] (3.816s => +0.026s) update: 12/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:17.876] (4.213s => +0.062s) update: 13/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:18.251] (4.588s => +0.102s) update: 14/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:18.632] (4.970s => +0.116s) update: 15/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:18.890] (5.227s => +0.067s) update: 16/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:19.264] (5.602s => +0.110s) update: 17/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:19.518] (5.855s => +0.021s) update: 18/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:19.893] (6.230s => +0.057s) update: 19/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:20.167] (6.504s => +0.035s) update: 20/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:20.266] (6.604s => +0.097s) update: 21/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:20.540] (6.877s => +0.064s) update: 22/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:20.913] (7.251s => +0.074s) update: 23/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:21.173] (7.510s => +0.001s) update: 24/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:21.547] (7.884s => +0.005s) update: 25/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:21.924] (8.261s => +0.078s) update: 26/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:22.178] (8.515s => +0.002s) update: 27/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:22.552] (8.889s => +0.034s) update: 28/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:22.925] (9.262s => +0.068s) update: 29/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:23.200] (9.537s => +0.065s) update: 30/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:23.300] (9.638s => +0.112s) update: 31/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:23.576] (9.913s => +0.097s) update: 32/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:23.952] (10.289s => +0.109s) update: 33/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:24.205] (10.542s => +0.030s) update: 34/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:24.580] (10.917s => +0.034s) update: 35/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:24.952] (11.289s => +0.102s) update: 36/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:25.205] (11.542s => +0.025s) update: 37/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:25.581] (11.919s => +0.060s) update: 38/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:25.955] (12.292s => +0.094s) update: 39/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:26.209] (12.546s => +0.017s) update: 40/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:26.230] (12.568s => +0.091s) update: 41/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:26.484] (12.821s => +0.001s) update: 42/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:26.858] (13.195s => +0.011s) update: 43/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:27.231] (13.568s => +0.052s) update: 44/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:27.608] (13.945s => +0.059s) update: 45/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:27.861] (14.198s => +0.007s) update: 46/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:28.235] (14.572s => +0.051s) update: 47/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:28.609] (14.946s => +0.083s) update: 48/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:28.866] (15.203s => +0.001s) update: 49/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:29.144] (15.481s => +0.001s) update: 50/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:29.245] (15.582s => +0.049s) update: 51/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:29.523] (15.860s => +0.037s) update: 52/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:29.898] (16.235s => +0.047s) update: 53/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:30.275] (16.612s => +0.092s) update: 54/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:30.654] (16.991s => +0.100s) update: 55/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:30.908] (17.245s => +0.050s) update: 56/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:31.286] (17.623s => +0.099s) update: 57/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:31.539] (17.877s => +0.012s) update: 58/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:31.912] (18.249s => +0.044s) update: 59/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:32.188] (18.525s => +0.041s) update: 60/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:32.291] (18.628s => +0.092s) update: 61/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:32.569] (18.906s => +0.079s) update: 62/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:32.944] (19.281s => +0.088s) update: 63/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:33.198] (19.535s => +0.011s) update: 64/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:33.573] (19.911s => +0.016s) update: 65/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:33.951] (20.288s => +0.090s) update: 66/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:34.204] (20.541s => +0.013s) update: 67/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:34.582] (20.919s => +0.051s) update: 68/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:34.957] (21.295s => +0.085s) update: 69/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:35.218] (21.555s => +0.015s) update: 70/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:35.241] (21.578s => +0.090s) update: 71/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:35.498] (21.836s => +0.005s) update: 72/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:35.878] (22.215s => +0.018s) update: 73/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:36.257] (22.594s => +0.066s) update: 74/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:36.637] (22.974s => +0.075s) update: 75/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:36.894] (23.231s => +0.029s) update: 76/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:37.277] (23.614s => +0.083s) update: 77/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:37.537] (23.874s => +0.002s) update: 78/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:37.913] (24.250s => +0.038s) update: 79/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:38.189] (24.526s => +0.034s) update: 80/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:38.291] (24.628s => +0.084s) update: 81/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:38.567] (24.904s => +0.070s) update: 82/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:38.951] (25.288s => +0.087s) update: 83/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:39.205] (25.542s => +0.010s) update: 84/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:39.584] (25.922s => +0.019s) update: 85/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:39.966] (26.304s => +0.098s) update: 86/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:40.224] (26.561s => +0.025s) update: 87/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:40.603] (26.940s => +0.063s) update: 88/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:40.990] (27.327s => +0.111s) update: 89/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:41.248] (27.585s => +0.037s) update: 90/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:41.271] (27.608s => +0.112s) update: 91/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:41.529] (27.866s => +0.028s) update: 92/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:41.907] (28.245s => +0.040s) update: 93/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:42.288] (28.625s => +0.089s) update: 94/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:42.684] (29.022s => +0.114s) update: 95/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:42.941] (29.278s => +0.069s) update: 96/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:43.203] (29.541s => +0.001s) update: 97/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:43.584] (29.921s => +0.040s) update: 98/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:43.963] (30.300s => +0.080s) update: 99/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:44.221] (30.558s => +0.006s) update: 100/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[02:52:44.222] (30.559s => +0.006s) update: 100/100 (+0) '' {clear=TRUE, enabled=TRUE, status=}

The interesting thing is that the first 'initiate' line gets written immediately, then nothing happens for a while, and then all the update lines are written at once. I have the feeling that although progress updates are recorded internally, they don't get through until the loop is finished.

HenrikBengtsson (Owner) commented:

That confirms that all the progress signals go through.

Looks like there's a delay in how frequently the slurm.out file is updated, i.e. I suspect it's buffered and only refreshed once in a while.

What happens if you skip R and instead run the following in your job script:

for i in {1..100}; do
   echo $i
   sleep 0.3
done

If you tail the file, do you see it being updated immediately (i.e. roughly three times per second), or is it buffered too?
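
If that pure-shell loop turns out to be buffered as well, one thing worth trying (an assumption on my part; it only addresses buffering of the command's own stdout, not buffering in the file-system layer) is to force line-buffered output with GNU coreutils' stdbuf, e.g. a shortened variant of the loop above:

```shell
# Force line-buffered stdout for the loop; each line should then be
# handed to the output file as soon as it is printed.
stdbuf -oL bash -c 'for i in {1..10}; do echo $i; sleep 0.3; done'
```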

frederikziebell (Author) commented:

I see the updates in your example immediately, so everything is fine there as well.

HenrikBengtsson (Owner) commented:

I see the updates in your example immediately, so everything is fine there as well.

Thanks for confirming.

I know you mentioned 'sequential' before, but can you also show me the output of:

library("future")
library("furrr")
library("progressr")

options(progressr.enable = TRUE)
handlers(global = TRUE)
handlers("debug")

plan(sequential)

f <- function(i) {
  Sys.sleep(0.3)
  i
}

local({
  n <- 100
  p <- progressor(n)
  result <- future_map(1:n, function(i){
    p(class = "sticky")
    f(i)
  })
})

This will tell me whether there's buffered output within your R session.

FYI, the information output by the 'debug' handler reveals when a progress update was signaled and how much later it was received by the handler, e.g.

[02:52:17.118] (3.455s => +1.976s) update: 4/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}

says it was signaled 3.455 s after the very first one, and that it took another 1.976 s to be received. That's more than expected, but most of them are delayed < 0.1 s, e.g.

[02:52:34.957] (21.295s => +0.085s) update: 69/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}

That one was signaled 21.295 s in, and received and output 0.085 s later. Thus, all those 'debug' lines should become visible to you within a few seconds of being signaled. Note that the whole process takes 30.559 s to complete, which means the updates are received and output well before it finishes. So, this still points to the output being buffered somewhere.

frederikziebell (Author) commented:

Here's the output, so it seems there is no notable buffering.

[11:16:49.016] (0.000s => +0.005s) initiate: 0/100 (+0) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:49.273] (0.257s => +0.178s) update: 1/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:49.595] (0.579s => +0.021s) update: 2/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:49.897] (0.881s => +0.001s) update: 3/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:50.199] (1.183s => +0.001s) update: 4/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:50.501] (1.485s => +0.001s) update: 5/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:50.803] (1.787s => +0.001s) update: 6/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:51.105] (2.088s => +0.001s) update: 7/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:51.407] (2.391s => +0.001s) update: 8/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:51.709] (2.693s => +0.001s) update: 9/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:52.011] (2.995s => +0.001s) update: 10/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:52.313] (3.297s => +0.001s) update: 11/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:52.615] (3.599s => +0.001s) update: 12/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:52.917] (3.901s => +0.001s) update: 13/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:53.219] (4.203s => +0.001s) update: 14/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:53.521] (4.505s => +0.001s) update: 15/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:53.823] (4.806s => +0.001s) update: 16/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:54.125] (5.108s => +0.001s) update: 17/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:54.426] (5.410s => +0.001s) update: 18/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:54.728] (5.712s => +0.001s) update: 19/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:55.031] (6.014s => +0.001s) update: 20/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:55.332] (6.316s => +0.001s) update: 21/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:55.634] (6.618s => +0.001s) update: 22/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:55.936] (6.920s => +0.001s) update: 23/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:56.238] (7.221s => +0.001s) update: 24/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:56.539] (7.523s => +0.001s) update: 25/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:56.841] (7.825s => +0.001s) update: 26/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:57.143] (8.127s => +0.001s) update: 27/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:57.445] (8.429s => +0.001s) update: 28/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:57.747] (8.731s => +0.001s) update: 29/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:58.049] (9.033s => +0.001s) update: 30/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:58.351] (9.335s => +0.001s) update: 31/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:58.653] (9.637s => +0.001s) update: 32/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:58.955] (9.939s => +0.001s) update: 33/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:59.257] (10.241s => +0.001s) update: 34/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:59.560] (10.544s => +0.001s) update: 35/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:16:59.862] (10.846s => +0.001s) update: 36/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:00.165] (11.149s => +0.001s) update: 37/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:00.467] (11.451s => +0.001s) update: 38/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:00.769] (11.753s => +0.001s) update: 39/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:01.070] (12.054s => +0.001s) update: 40/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:01.373] (12.357s => +0.001s) update: 41/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:01.676] (12.659s => +0.001s) update: 42/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:01.978] (12.962s => +0.001s) update: 43/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:02.281] (13.265s => +0.001s) update: 44/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:02.583] (13.567s => +0.001s) update: 45/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:02.885] (13.869s => +0.001s) update: 46/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:03.187] (14.171s => +0.001s) update: 47/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:03.489] (14.473s => +0.001s) update: 48/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:03.791] (14.775s => +0.001s) update: 49/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:04.094] (15.078s => +0.001s) update: 50/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:04.396] (15.380s => +0.001s) update: 51/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:04.699] (15.683s => +0.001s) update: 52/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:05.002] (15.986s => +0.001s) update: 53/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:05.304] (16.288s => +0.001s) update: 54/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:05.607] (16.591s => +0.001s) update: 55/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:05.909] (16.892s => +0.001s) update: 56/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:06.211] (17.195s => +0.001s) update: 57/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:06.513] (17.496s => +0.001s) update: 58/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:06.815] (17.798s => +0.001s) update: 59/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:07.117] (18.100s => +0.001s) update: 60/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:07.418] (18.402s => +0.001s) update: 61/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:07.721] (18.705s => +0.001s) update: 62/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:08.023] (19.006s => +0.001s) update: 63/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:08.324] (19.308s => +0.001s) update: 64/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:08.626] (19.610s => +0.001s) update: 65/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:08.928] (19.912s => +0.001s) update: 66/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:09.231] (20.215s => +0.001s) update: 67/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:09.533] (20.517s => +0.001s) update: 68/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:09.836] (20.820s => +0.001s) update: 69/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:10.138] (21.121s => +0.001s) update: 70/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:10.439] (21.423s => +0.001s) update: 71/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:10.741] (21.725s => +0.001s) update: 72/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:11.043] (22.027s => +0.001s) update: 73/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:11.345] (22.329s => +0.001s) update: 74/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:11.648] (22.631s => +0.001s) update: 75/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:11.950] (22.934s => +0.001s) update: 76/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:12.253] (23.236s => +0.001s) update: 77/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:12.554] (23.538s => +0.001s) update: 78/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:12.856] (23.840s => +0.001s) update: 79/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:13.159] (24.143s => +0.001s) update: 80/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:13.461] (24.445s => +0.001s) update: 81/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:13.764] (24.748s => +0.001s) update: 82/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:14.066] (25.050s => +0.001s) update: 83/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:14.369] (25.353s => +0.001s) update: 84/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:14.671] (25.655s => +0.001s) update: 85/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:14.973] (25.957s => +0.001s) update: 86/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:15.275] (26.259s => +0.001s) update: 87/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:15.577] (26.561s => +0.001s) update: 88/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:15.880] (26.863s => +0.001s) update: 89/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:16.182] (27.166s => +0.001s) update: 90/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:16.485] (27.469s => +0.001s) update: 91/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:16.787] (27.770s => +0.001s) update: 92/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:17.088] (28.072s => +0.001s) update: 93/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:17.390] (28.374s => +0.001s) update: 94/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:17.697] (28.681s => +0.006s) update: 95/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:17.999] (28.983s => +0.001s) update: 96/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:18.300] (29.284s => +0.001s) update: 97/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:18.602] (29.586s => +0.001s) update: 98/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:18.903] (29.887s => +0.001s) update: 99/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:19.205] (30.189s => +0.001s) update: 100/100 (+1) '' {clear=TRUE, enabled=TRUE, status=}
[11:17:19.206] (30.189s => +0.001s) update: 100/100 (+0) '' {clear=TRUE, enabled=TRUE, status=}

But I got the live update to work (🥳) by running your script with more delay and iterations:

library("future")
library("furrr")
library("progressr")

options(progressr.enable = TRUE)
handlers(global = TRUE)
handlers(handler_progress(
  format   = ":current/:total [:bar] :percent in :elapsed, ETA: :eta",
  width    = 60,
  complete = "="
))

plan(cluster)

f <- function(i) {
  Sys.sleep(3)
  i
}

local({
  n <- 500
  p <- progressor(n)
  result <- future_map(1:n, function(i){
    p(class = "sticky")
    f(i)
  })
})

Interestingly, the progress updates arrive in batches at roughly multiples of 100, i.e. the first update I saw was
100/500 [========>-------------------] 20% in 30s, ETA: 2m
So there seems to be some delay, and it was so large that my initial script finished before I saw the first update.

@HenrikBengtsson
Copy link
Owner

But I got the live update to work (🥳) by running your script with more delay and iterations

Great. It definitely sounds like buffering is happening somewhere and is eventually being flushed. It could be that the global file system buffers the output. OTOH, then I'd expect to see the same behavior when running sequentially.

If you add more output, will you get progress updates more frequently? You can do this by, for instance:

  result <- future_map(1:n, function(i) {
    message(sprintf("Hello from iteration #%d", i))
    message("Some letters: ", paste(rep(letters, times = 20), collapse = ", "))
    p()
    f(i)
  })

BTW, there's no need for you to use class = "sticky" in your example; that only makes a difference if you pass a message too, cf. https://progressr.futureverse.org/

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Dec 21, 2022

Here's the output, so it seems there is no notable buffering.

BTW, you still need to use tail -f to see if it's buffered. The "delay" estimates reported by 'debug' only measures the time between progress being signaled and being received by the 'debug' handler. It does not reflect buffering delays. Also, the 'debug' output you've reported is what I'd expect when things work as expected.
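To illustrate (a self-contained simulation, with a `mktemp` file standing in for the log; on the cluster you'd instead run `tail -f slurm.out` from a login node):

```shell
# Simulate watching a growing log file: append a few updates, then
# read back the most recent ones. With -f, tail keeps following
# the file as new lines are appended.
log=$(mktemp)
for i in 1 2 3; do echo "update $i/100" >> "$log"; done
tail -n 2 "$log"    # prints the last two updates
rm -f "$log"
```

If the file system buffers writes, `tail -f` will show updates arriving in bursts rather than one at a time.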

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Dec 21, 2022

Actually, to force the extra output to be more instant, we can use those sticky progress messages, e.g.

  result <- future_map(1:n, function(i) {
    p(c("Some letters: ", letters), class = "sticky", amount = 0)
    p(c("Some digits: ", 1:100), class = "sticky", amount = 0)
    p(c("More letters: ", LETTERS), class = "sticky", amount = 0)
    p(sprintf("Iteration #%d", i))
    f(i)
  })

That might flush the buffering sooner. If that's the case, then we know for sure buffering is taking place and that the buffer has a size that needs to be filled up before being flushed.

@frederikziebell
Copy link
Author

The trick with the sticky progress messages works, and I even get a working progress bar with the original test.R script. I just don't know if it's because of more buffer flushing or because it delays the loop so much that updates become visible.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Dec 21, 2022

The trick with the sticky progress messages works and I even get a working progress bar on the original test.R script.

Thanks for confirming.

I just don't know if it's because of more buffer flushing ...

I'm pretty sure it triggers any buffered output to be flushed. The buffering can be due to the file system caching writes to the file before actually pushing them out so that they are visible throughout the cluster. Whatever the buffer is, file cache or not, it typically has a maximum size, and when it fills up it is flushed to make room for the next batch. This makes file writing much more efficient overall, but it obviously introduces a delay when the file is monitored like here.

Regarding buffering: Searching online, there's a slight chance that using:

stdbuf -o0 -e0 Rscript test.R

will make a difference; it disables buffering of the output from the R process. If it's the file system that's buffering, I doubt this makes a difference, but it's worth a shot.
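For completeness, here is a sketch of test.sh with that stdbuf call spliced in (assuming GNU coreutils' stdbuf is available on the compute nodes):

```shell
#!/bin/bash
#SBATCH --time=0-00:05:00
#SBATCH --ntasks 10
#SBATCH --output=slurm.out
# -o0/-e0 = unbuffered stdout and stderr for the launched process
stdbuf -o0 -e0 Rscript test.R
```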

Moreover, because I suspect file-system buffering is happening, this might not work either, but you can also use the 'filesize' progress handler to report on the progress via the size of a custom file, e.g.

handlers(list(
  bar = handler_progress(
    format   = ":current/:total [:bar] :percent in :elapsed, ETA: :eta",
    width    = 60,
    complete = "="
  ),
  file = handler_filesize(
    file = "slurm.progress",
    intrusiveness = 0
  )
))

Comment: The names bar and file are just for human readability; they have no special meaning.

This way you can also monitor progress using ls -l:

$ ls -l slurm.progress 
-rw-rw-r-- 1 hb hb 68 Dec 21 12:13 slurm.progress
                    ^
                    +-- = 68% progress
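Going by that example, the file size in bytes tracks the progress percentage, so the percentage can also be read back with wc -c. A self-contained sketch that simulates a 68%-progress file using a temporary file:

```shell
# Simulate a progress file whose size in bytes encodes the
# percentage (68 bytes => 68% progress), then read it back.
f=$(mktemp)
printf '%.0s.' $(seq 1 68) > "$f"   # write 68 one-byte characters
wc -c < "$f" | tr -d ' '            # prints 68
rm -f "$f"
```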

@frederikziebell
Copy link
Author

The stdbuf option didn't make a difference, but using the filesize handler works as expected. So I guess the conclusion is that, on systems where the progress handler does not work (because of strange file-system caching or buffering), one should use handler_filesize.

@HenrikBengtsson HenrikBengtsson changed the title No near-live progress update when running on a SLURM cluster Delayed progress updates to file due to file-system buffering [Was: No near-live progress update when running on a SLURM cluster] Dec 12, 2023