Understanding the behaviour of Modin #6226
-
Hi @overseek944!
We read each part of the CSV file into a temporary buffer, and then pass that buffer as input to pandas' own read function. At the peak, each process can therefore hold both the buffer and the pandas DataFrame created from it, which can be roughly estimated as double the memory. This is also true for reading JSON files if they are created with
I believe that in this case the cause is simply a lack of RAM. You therefore need to either reduce the number of files you process at once, or increase the amount of RAM on the machine.
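A minimal sketch of the pattern described above (the function name `read_chunk` is hypothetical, not Modin's actual internal API): each worker reads its part of the file into an in-memory buffer, then hands that buffer to pandas. While pandas parses it, both the raw bytes and the resulting DataFrame are alive at once, which is where the roughly 2x peak comes from.

```python
import io

import pandas as pd


def read_chunk(raw_bytes: bytes) -> pd.DataFrame:
    # The whole chunk is held in memory as a buffer...
    buffer = io.BytesIO(raw_bytes)
    # ...while pandas builds a second, parsed copy from it.
    # Peak memory per process ~= len(raw_bytes) + sizeof(DataFrame).
    return pd.read_csv(buffer)


csv_bytes = b"a,b\n1,2\n3,4\n"
df = read_chunk(csv_bytes)
```

The buffer is only freed after `read_csv` returns, so with many processes parsing large chunks concurrently, the transient footprint can approach twice the size of the data being read.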
-
Data details
- Number of files: 5000+ JSON files
- Size per file: ~70 MB
- Total size: ~350 GB
Instance details
- Type: ml.m5.24xlarge
- Memory: 386 GB
- vCPUs: 96
Hi Team,
I want to understand what can be the potential reasons for this failure and how can this be fixed?
I am trying this in a TrainingJob where I am using SKLearnProcessor. This is not distributed, so I am using just a single ml.m5.24xlarge instance.
Reference -
https://stackoverflow.com/questions/76043804/ray-workers-being-killed-because-of-oom-pressure
I have gone through the post above, but I still do not understand how reading can incur up to 2x memory overhead.
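A back-of-envelope check using the numbers from this question, assuming the ~2x peak factor described in the answer above also applies to the JSON reads here:

```python
# Numbers stated in the question.
n_files = 5000       # 5000+ JSON files
file_mb = 70         # ~70 MB per file
ram_gb = 386         # memory on the ml.m5.24xlarge instance

raw_gb = n_files * file_mb / 1000   # ~350 GB of raw JSON
peak_gb = 2 * raw_gb                # buffer + parsed data alive at once

# ~700 GB estimated peak vs 386 GB of RAM: the workers are killed for OOM.
print(peak_gb, ram_gb, peak_gb > ram_gb)
```

Even before the 2x factor, 350 GB of raw JSON is close to the machine's 386 GB; with buffers and parsed DataFrames coexisting at peak, the estimate comfortably exceeds available RAM, which is consistent with Ray killing workers under memory pressure.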