Difficulty with ingesting large files #181

Gpadh · 2023-12-01T23:12:38Z

I'm trying to ingest a 100+GB file of legal data into a Kernel Memory service. The data I would like to access are the "opinions" files from this link (https://com-courtlistener-storage.s3-us-west-2.amazonaws.com/list.html?prefix=bulk-data/). They are bzip2-compressed (.bz2) files.

To ingest the data, I use azcopy to copy a file into a container. A function then triggers when a file lands in this container; it decompresses the .bz2 file and sends it to Kernel Memory for ingestion as a stream. The compressed file is about 30GB; once decompressed, it grows to 100+GB.

This is the error message I get when I try to ingest the files into Kernel Memory:

The repository to repro the issue is here: https://github.com/Gpadh/KMFileIngestion/tree/master

Please let me know if I can provide any more details to help.

dluc · 2024-08-27T19:14:14Z

Handling a file of over 100GB presents too many challenges, including network transfer limits, memory management constraints, VM capacity, and potential timeouts during processing.

To mitigate these issues, I recommend partitioning the data into smaller chunks, ideally 10-20MB each.

This approach will help reduce the likelihood of errors related to memory allocation or network interruptions, and it will make the ingestion process more manageable.
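
For illustration, here is a minimal Python sketch of that partitioning step, not the code from the linked repository. It decompresses the .bz2 file as a stream and writes parts of roughly 16MB, so the full 100+GB is never held in memory. The names `iter_parts`, `split_bz2`, and `CHUNK_BYTES` are hypothetical. It also assumes a line boundary is an acceptable place to split: if the data is CSV with newlines embedded in quoted fields, a record-aware splitter would be needed instead. Uploading each part to Kernel Memory is left out, since the client calls are not shown in this thread.

```python
import bz2
from pathlib import Path
from typing import BinaryIO, Iterator, List

# Hypothetical target size, chosen inside the suggested 10-20MB range.
CHUNK_BYTES = 16 * 1024 * 1024


def iter_parts(stream: BinaryIO, target_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield parts of roughly target_bytes, each ending on a line boundary."""
    buffer: List[bytes] = []
    size = 0
    for line in stream:  # reads one line at a time from the stream
        buffer.append(line)
        size += len(line)
        if size >= target_bytes:
            yield b"".join(buffer)
            buffer, size = [], 0
    if buffer:  # flush whatever remains after the last full part
        yield b"".join(buffer)


def split_bz2(source: Path, out_dir: Path, target_bytes: int = CHUNK_BYTES) -> List[Path]:
    """Decompress source incrementally and write numbered part files to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    # bz2.open decompresses on the fly, so only the current part is in memory.
    with bz2.open(source, "rb") as stream:
        for index, part in enumerate(iter_parts(stream, target_bytes)):
            path = out_dir / f"{source.stem}.part{index:06d}"
            path.write_bytes(part)
            written.append(path)
    return written
```

Each part file can then be sent for ingestion on its own, so a failure or timeout affects one small part instead of the whole dataset. Note that a single line larger than the target size still ends up whole in one part.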

dluc added the triage label Dec 4, 2023

dluc closed this as completed Aug 27, 2024
