
High memory usage with too many invalid fields. #50

Open
syxolk opened this issue Dec 15, 2017 · 3 comments

Comments

@syxolk
Contributor

syxolk commented Dec 15, 2017

I have a CSV with roughly 140k lines that should be validated with vladiate. The validation code looks like this:

from vladiate import Vlad
from vladiate.inputs import LocalFile
from vladiate.validators import SetValidator, UniqueValidator

Vlad(source=LocalFile("largefile.csv"), validators={
        "id": [UniqueValidator()],
        "parent_id": [SetValidator(lots_of_ids)],
    }).validate()

The number of items in lots_of_ids is also around 140k.

For some reason I had a case where the ids in lots_of_ids had no intersection with the values from the parent_id column in largefile.csv. In this case vladiate (correctly) collects all invalid rows. However, this drove my PC's RAM usage to at least 8 GB.

Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?

@di
Owner

di commented Dec 15, 2017

Hi @syxolk! Thanks for the report, and sorry that vladiate was eating up your memory.

Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?

This is currently not possible, but it could be a valuable optional feature.

I'm a little more interested in figuring out why so much memory is getting consumed and if the footprint can be reduced (I'm guessing it can, as I haven't really tested the upper bounds of this tool).
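For anyone wanting to dig into where the memory actually goes, the standard library's tracemalloc module can attribute allocations to source lines. This is a generic sketch, not code from vladiate; the list comprehension is just a stand-in for running Vlad(...).validate():

```python
import tracemalloc

# Start tracking allocations before the suspect code path runs.
tracemalloc.start()

# Stand-in workload; replace with the real validation call.
workload = ["x" * 1_000 for _ in range(10_000)]

# Snapshot and report the source lines holding the most memory.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)  # e.g. "<file>:<line>: size=…, count=…"
```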

@syxolk
Contributor Author

syxolk commented Dec 15, 2017

Hey, I found two issues that led to high memory usage in my particular case:

First, validate() collects all exceptions in the failures dictionary. In my case every row was a failure, so every row's exception was collected in there.

Second, the SetValidator raises a ValidationException whose message stringifies every element of valid_set. In my case, valid_set contained 140k UUIDs, which produced an enormous string.
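To put a rough number on that second issue, here is a synthetic reproduction (140k generated UUIDs standing in for the real data, and a hand-written message standing in for the actual exception text): embedding the whole set makes each message run to several million characters.

```python
import uuid

# Synthetic stand-in for the reported valid_set of ~140k UUIDs.
valid_set = {str(uuid.UUID(int=i)) for i in range(140_000)}

# Roughly what happens when the exception message embeds the entire set.
message = "'{}' was not in the set: {}".format("some-id", valid_set)

print(len(message))  # several million characters, built for every invalid row
```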

If we can fix at least one of the two issues the memory problem should be gone:

  • The exceptions are printed as debug messages and the logger defaults to the info level, so no exceptions are printed by default. Can we disable collecting all the exception objects?
  • Don't stringify the entire valid_set. Instead, cut it off at 100 elements, similar to what _log_validator_failures already does.
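The second bullet could look something like the sketch below. The function name and the exact output format are assumptions for illustration here, not the final vladiate implementation:

```python
from itertools import islice

def stringify_set(a_set, max_len):
    # Render at most max_len elements; note how many were suppressed.
    if len(a_set) <= max_len:
        return "{{{}}}".format(", ".join(repr(e) for e in sorted(a_set)))
    shown = ", ".join(repr(e) for e in islice(a_set, max_len))
    return "{{{}, ... ({} more suppressed)}}".format(shown, len(a_set) - max_len)
```

For the 140k-element case this keeps the exception message to a few kilobytes instead of several megabytes.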

PS: I'm happy to contribute!

@di
Owner

di commented Dec 19, 2017

The exceptions are printed as debug messages and the logger defaults to the info level, so no exceptions are printed by default. Can we disable collecting all the exception objects?

I think there are two things we could do here:

  1. We could only collect the failure exceptions if debugging is turned on. We'd need to add a --verbose flag or something, and I think we'd also need an additional variable to determine whether there were failures when not in debug mode (since we can't just check for a non-empty failures dict).

  2. We could collect something other than the entire exception (which is probably huge) into failures, like maybe just the exception's message?

I'm leaning towards the first one, since it seems like less work, and still preserves the entire exception for debugging, but I could be convinced otherwise.

Don't stringify the entire valid_set. Instead cut it at 100 elements, similarly to what is done in _log_validator_failures

I think this is a great idea.

PS: I'm happy to contribute!

PRs are welcome!

syxolk added a commit to syxolk/vladiate that referenced this issue Dec 20, 2017
A new function stringify_set only stringifies n elements of a given set.
syxolk added a commit to syxolk/vladiate that referenced this issue Dec 22, 2017
A new function stringify_set only stringifies n elements of a given set.
di pushed a commit that referenced this issue Jan 2, 2018
* Fix issue with SetValidator and large valid_set (#50)

A new function stringify_set only stringifies n elements of a given set.

* Use itertools.islice for better python2 performance

* Add SetValidator test for coverage

* Fix stringify_set: Return {...} instead of [...]

* Add test cases for stringify_set

* Fix stringify_set: Sort the elements before displaying for small sets

* Remove previously introduced test case for set validator

* Add parameter checks for stringify_set

* Rename stringify_set to _stringify_set

* Revert d355def
@syxolk mentioned this issue Jan 3, 2018