hca upload files using a lot of cpu/memory #358

Open
malloryfreeberg opened this issue Jun 10, 2019 · 6 comments
@malloryfreeberg
Member

I was using `hca upload files *` to upload about 80GB of fastq files (16 files) from my local machine to an upload area. During the transfer, everything else running on my machine slowed down significantly. I don't remember experiencing this slowdown before, although I haven't had to transfer files from a local source in a while. It looks like my machine was maxed out on CPU usage (screenshots below). Is this normal or expected behavior? It doesn't seem ideal...

During transfer: [Activity Monitor screenshot, 2019-06-10 09:54]

After transfer: [Activity Monitor screenshot, 2019-06-10 13:45]

@sampierson
Member

sampierson commented Jun 12, 2019

I'm guessing this is because the CLI now does client-side checksumming. @maniarathi, do you stream the file while checksumming, or read it entirely into memory? Looking at the code, it appears to be streamed in chunks of `get_s3_multipart_chunk_size`. I wonder how much the memory balloons. Someone should try to reproduce this; unfortunately it can't be me, as I'm on a low-bandwidth link.
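For reference, this is roughly what chunked checksumming looks like, which should keep memory near one chunk at a time (a sketch only: the 64MB chunk size and function name are illustrative, not the CLI's actual code, and the real checksummer may compute several digests rather than just SHA-256):

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # illustrative; the CLI derives chunk size from get_s3_multipart_chunk_size

def checksum_streaming(path, chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so peak memory stays near one chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```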

@maniarathi
Contributor

I did actually test the memory footprint of this a while back, and it was 64MB, which is what you'd expect given that the file is streamed in chunks of that size.
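If anyone wants to re-check this on one of the large fastq files, here's a quick standalone measurement (assumes a 64MB chunk and SHA-256; this is not the CLI's own test harness, just a sketch):

```python
import hashlib
import resource
import sys

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform != "darwin" else rss / (1024 * 1024)

if __name__ == "__main__":
    digest = hashlib.sha256()
    with open(sys.argv[1], "rb") as fh:
        for chunk in iter(lambda: fh.read(64 * 1024 * 1024), b""):
            digest.update(chunk)
    print(f"sha256={digest.hexdigest()}  peak RSS≈{peak_rss_mb():.1f} MB")
```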

@sampierson
Member

@malloryfreeberg how much memory was consumed? Alas, your Activity Monitor screenshots don't show that.

@sampierson
Member

sampierson commented Jun 12, 2019

As for CPU, I expect that checksumming several files simultaneously will be quite CPU-intensive. Does the CLI limit parallelism? It looks like it does, based on your core count: `DEFAULT_THREAD_COUNT = multiprocessing.cpu_count() * 2`. On my machine `cpu_count()` returns 8, so the CLI would try to checksum all 16 files simultaneously. That's a bad thing.

There are several ways to fix this:

  1. parallelize less aggressively: reduce the default thread count
  2. provide a command-line option to limit parallelism further
  3. calculate checksums inline while uploading, which would limit parallelism based on available bandwidth

I realize #3 doesn't work well with the current architecture, since client-side and server-side checksums are compared before the upload starts. I wish there were a more efficient way to decide whether or not to upload.
We should probably do #1 and #2 (rough sketch below).
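Something along these lines would cover both: cap the default below `cpu_count() * 2` and expose a flag to lower it further. The `--checksum-threads` name and the halved default are placeholders, not an existing hca option:

```python
import argparse
import hashlib
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

# Illustrative default only; the current CLI uses cpu_count() * 2.
DEFAULT_THREAD_COUNT = max(1, multiprocessing.cpu_count() // 2)

def checksum_one(path, chunk_size=64 * 1024 * 1024):
    """Stream one file through a hash in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("--checksum-threads", type=int, default=DEFAULT_THREAD_COUNT,
                        help="max files to checksum in parallel (hypothetical option)")
    args = parser.parse_args()
    # Bound CPU usage by bounding how many files are checksummed at once.
    with ThreadPoolExecutor(max_workers=args.checksum_threads) as pool:
        for path, digest in pool.map(checksum_one, args.files):
            print(path, digest)

if __name__ == "__main__":
    main()
```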

@malloryfreeberg
Member Author

@sampierson @maniarathi I unfortunately didn't grab memory usage at the time. I can reproduce it, but I'll have to download the files to my local machine again :( Stay tuned!

@sampierson
Member

@malloryfreeberg Don't bother; I think we know what the culprit is. The problem is CPU, not memory.
