hca upload files using a lot of cpu/memory #358

Open
malloryfreeberg opened this issue Jun 10, 2019 · 6 comments
@malloryfreeberg
Member

I was using `hca upload files *` to upload about 80GB of fastq files (16 files) from my local machine to an upload area. During the transfer, everything else running on my machine slowed down significantly. I don't remember experiencing this slowdown before, although I haven't had to transfer files from a local source in a while. It looks like my machine was maxed out on CPU usage (screenshots below). Is this normal or expected behavior? It doesn't seem ideal...

During transfer: [Activity Monitor screenshot, 2019-06-10 09:54]

After transfer: [Activity Monitor screenshot, 2019-06-10 13:45]

@sampierson
Member

sampierson commented Jun 12, 2019

I'm guessing this is because the CLI now does client-side checksumming. @maniarathi, do you stream the file while checksumming, or read it entirely into memory? Looking at the code, it appears to be streamed in chunks of `get_s3_multipart_chunk_size`. I wonder how much the memory balloons. Someone should try to reproduce this; unfortunately it can't be me, as I'm on a low-bandwidth link.
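For reference, this is roughly what chunked checksumming looks like, which should keep memory near one chunk at a time (a sketch only: the 64MB chunk size and function name are illustrative, not the CLI's actual code, and the real checksummer may compute several digests rather than just SHA-256):

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # illustrative; the CLI derives chunk size from get_s3_multipart_chunk_size

def checksum_streaming(path, chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so peak memory stays near one chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```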

@maniarathi
Contributor

I did actually test the memory footprint of this a while back, and it was 64MB, which is what you'd expect given that the file is streamed in chunks of that size.
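If anyone wants to re-check this on one of the large fastq files, here's a quick standalone measurement (assumes a 64MB chunk and SHA-256; this is not the CLI's own test harness, just a sketch):

```python
import hashlib
import resource
import sys

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform != "darwin" else rss / (1024 * 1024)

if __name__ == "__main__":
    digest = hashlib.sha256()
    with open(sys.argv[1], "rb") as fh:
        for chunk in iter(lambda: fh.read(64 * 1024 * 1024), b""):
            digest.update(chunk)
    print(f"sha256={digest.hexdigest()}  peak RSS≈{peak_rss_mb():.1f} MB")
```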

@sampierson
Member

@malloryfreeberg how much memory was consumed? Alas, your Activity Monitor screenshots don't show that.

@sampierson
Member

sampierson commented Jun 12, 2019

As for CPU, I expect that checksumming several files simultaneously will be quite CPU-intensive. Does the CLI limit parallelism? It looks like it does, based on your core count: `DEFAULT_THREAD_COUNT = multiprocessing.cpu_count() * 2`. On my machine `cpu_count()` returns 8, so the CLI would try to checksum all 16 files simultaneously. That's a bad thing.

There are several ways to fix this:

  1. parallelize less aggressively: reduce the default thread count
  2. provide a command-line option to limit parallelism further
  3. calculate checksums inline while uploading, which would limit parallelism based on available bandwidth

I realize #3 doesn't work well with the current architecture, since client-side and server-side checksums are compared before the upload starts. I wish there were a more efficient way to decide whether or not to upload.
We should probably do #1 and #2 (rough sketch below).
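Something along these lines would cover both: cap the default below `cpu_count() * 2` and expose a flag to lower it further. The `--checksum-threads` name and the halved default are placeholders, not an existing hca option:

```python
import argparse
import hashlib
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

# Illustrative default only; the current CLI uses cpu_count() * 2.
DEFAULT_THREAD_COUNT = max(1, multiprocessing.cpu_count() // 2)

def checksum_one(path, chunk_size=64 * 1024 * 1024):
    """Stream one file through a hash in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("--checksum-threads", type=int, default=DEFAULT_THREAD_COUNT,
                        help="max files to checksum in parallel (hypothetical option)")
    args = parser.parse_args()
    # Bound CPU usage by bounding how many files are checksummed at once.
    with ThreadPoolExecutor(max_workers=args.checksum_threads) as pool:
        for path, digest in pool.map(checksum_one, args.files):
            print(path, digest)

if __name__ == "__main__":
    main()
```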

@malloryfreeberg
Member Author

@sampierson @maniarathi I unfortunately didn't grab memory usage at the time. I can reproduce it, but I'll have to download the files to my local machine again :( Stay tuned!

@sampierson
Member

@malloryfreeberg Don't bother; I think we know what the culprit is. The problem is CPU, not memory.
