Slow read of large blobs compared to gcloud storage #2726

Open
carlthome opened this issue Nov 29, 2024 · 1 comment
Labels
p2 (P2) · pending customer action · question (Customer Issue: question about how to use tool)

Comments


carlthome commented Nov 29, 2024

I have a ~5 GB blob that I'm consuming in Python code. The Google Compute Engine VM is in the same region as the Google Storage bucket.

When using gcloud storage cp gs://my-bucket/my-blob, it takes less than a minute to download to the VM.

When using python -c "with open('/gcs/my-bucket/my-blob', 'rb') as f: f.read()" it takes several minutes to download the blob into the running process.

I haven't tried whether cp /gcs/my-bucket/my-blob ~ would be faster, but I assume it's also slower than gcloud storage.

Why is this, and can we expect the same high download speeds that gcloud storage offers from simple reads of /gcs in a future GCSFuse release? The convenience of "just read it like a regular file system" is much appreciated, and we don't want to introduce additional bucket storage clients if we can avoid it.

carlthome added the p2 (P2) and question (Customer Issue: question about how to use tool) labels on Nov 29, 2024

kislaykishore (Collaborator) commented Nov 29, 2024

@carlthome I think your Python test script could be written more efficiently. Instead of downloading the entire 5 GiB into memory when you invoke f.read(), you could read the content in chunks:

# Read sequentially in fixed-size 1 MiB chunks so memory use stays flat.
with open('<file_path>', 'rb') as f:
    while f.read(1024 * 1024):
        pass
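
If you want to compare throughput directly, here's a minimal timing sketch of the chunked read (the path is a placeholder for your GCSFuse mount, and the 1 MiB chunk size is just an assumption worth experimenting with):

import time

path = '/gcs/my-bucket/my-blob'  # placeholder: path under your GCSFuse mount
chunk_size = 1024 * 1024         # 1 MiB; try larger values too

start = time.monotonic()
total = 0
with open(path, 'rb') as f:
    # Stream the blob chunk by chunk and count the bytes read.
    while chunk := f.read(chunk_size):
        total += len(chunk)
elapsed = time.monotonic() - start
print(f'{total / 1e9:.2f} GB in {elapsed:.1f} s '
      f'({total / 1e6 / elapsed:.0f} MB/s)')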

cp <file_path> <dest_path> should also not take much time.

You can also try a few of the performance optimizations mentioned here; some of them could be useful for this workload.

Lemme know how it goes.
