Difficulty with ingesting large files #181

Closed
Gpadh opened this issue Dec 1, 2023 · 1 comment

Gpadh commented Dec 1, 2023

I'm trying to ingest a 100+GB file of legal data into a Kernel Memory service. The data I would like to ingest are the "opinions" files from this link (https://com-courtlistener-storage.s3-us-west-2.amazonaws.com/list.html?prefix=bulk-data/). They are bzip2-compressed (.bz2) files.

To ingest the data, I use azcopy to copy a file into a container. A function is triggered whenever a file lands in this container. The function decompresses the .bz2 file and sends it to Kernel Memory for ingestion as a stream. The compressed file is about 30GB; decompressed, it grows to 100+GB.
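
For readers following along, the decompression step described above can be done without holding the whole file in memory. This is a minimal Python sketch, not the code from the linked repository; the function name `iter_decompressed` and the 1 MiB block size are assumptions chosen for illustration.

```python
import bz2
import io
from typing import BinaryIO, Iterator


def iter_decompressed(src: BinaryIO, block_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield decompressed bytes from a .bz2 stream in blocks of at most block_size."""
    # BZ2File accepts an already-open binary file object and decompresses lazily,
    # so only about one block of output is held in memory at a time.
    with bz2.BZ2File(src, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block


# Small round trip on in-memory data to show the intended usage.
original = b"opinion text\n" * 1000
compressed = io.BytesIO(bz2.compress(original))
restored = b"".join(iter_decompressed(compressed, block_size=4096))
```

In a real pipeline `src` would be the blob's read stream rather than a `BytesIO`, and each block would be forwarded instead of joined.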

This is the error message I get when I try to ingest the files into Kernel Memory:
[image: screenshot of the error message]

The repository to repro the issue is here: https://github.com/Gpadh/KMFileIngestion/tree/master

Please let me know if I can provide any more details to help.

dluc added the triage label Dec 4, 2023
dluc (Collaborator) commented Aug 27, 2024

Handling a file of over 100GB presents too many challenges, including network transfer limits, memory management constraints, VM capacity, and potential timeouts during processing.

To mitigate these issues, I recommend partitioning the data into smaller chunks, ideally 10-20 MB each.

This approach will help reduce the likelihood of errors related to memory allocation or network interruptions, and it will make the ingestion process more manageable.
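
As a rough illustration of the partitioning suggested above, the sketch below groups lines into parts that stay under a size cap. It is written in Python, and it assumes the decompressed data is newline-delimited so that cutting on line boundaries keeps records whole; that may not hold for every file in this dataset (for example, CSV fields containing embedded newlines). The name `partition_lines` and the 16 MiB default are hypothetical.

```python
from typing import Iterable, Iterator


def partition_lines(lines: Iterable[bytes], max_bytes: int = 16 * 1024 * 1024) -> Iterator[bytes]:
    """Group lines into parts of at most max_bytes each.

    A single line longer than max_bytes is emitted as its own oversized part.
    """
    buf = []
    size = 0
    for line in lines:
        # Flush the current part before it would exceed the cap.
        if buf and size + len(line) > max_bytes:
            yield b"".join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line)
    if buf:
        yield b"".join(buf)


# Tiny example: ten 11-byte lines with a 40-byte cap give parts of 3, 3, 3 and 1 lines.
sample = [b"a" * 10 + b"\n"] * 10
parts = list(partition_lines(sample, max_bytes=40))
```

Because iterating over a file opened with `bz2.open(path, "rb")` yields lines, `partition_lines(bz2.open(path, "rb"))` would produce the parts directly from the compressed file, and each part could then be submitted to the service as its own document.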

dluc closed this as completed Aug 27, 2024