
High memory usage with too many invalid fields. #50

Open
syxolk opened this issue Dec 15, 2017 · 3 comments

Comments

@syxolk
Contributor

syxolk commented Dec 15, 2017

I have a CSV with roughly 140k lines that should be validated with vladiate. The validation code looks like this:

from vladiate import Vlad
from vladiate.inputs import LocalFile
from vladiate.validators import SetValidator, UniqueValidator

Vlad(source=LocalFile("largefile.csv"), validators={
        "id": [UniqueValidator()],
        "parent_id": [SetValidator(lots_of_ids)],
    }).validate()

The number of items in lots_of_ids is also around 140k.

For some reason I had a case where the ids in lots_of_ids had no intersection with the values from the parent_id column in largefile.csv. In this case vladiate (correctly) collects all invalid rows. However, this drove my PC's RAM usage to at least 8 GB.

Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?

@di
Owner

di commented Dec 15, 2017

Hi @syxolk! Thanks for the report, and sorry that vladiate was eating up your memory.

Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?

This is currently not possible, but it could be a valuable optional feature.

I'm a little more interested in figuring out why so much memory is getting consumed and if the footprint can be reduced (I'm guessing it can, as I haven't really tested the upper bounds of this tool).
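For anyone wanting to dig into where the memory actually goes, the standard library's tracemalloc module can attribute allocations to source lines. This is a generic sketch, not code from vladiate; the list comprehension is just a stand-in for running Vlad(...).validate():

```python
import tracemalloc

# Start tracking allocations before the suspect code path runs.
tracemalloc.start()

# Stand-in workload; replace with the real validation call.
workload = ["x" * 1_000 for _ in range(10_000)]

# Snapshot and report the source lines holding the most memory.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)  # e.g. "<file>:<line>: size=…, count=…"
```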

@syxolk
Contributor Author

syxolk commented Dec 15, 2017

Hey, I found two issues that led to high memory usage in my particular case:

First, validate() collects all exceptions in the failures dictionary. In my case every row was a failure, so every row's exception was collected in there.

Second, the SetValidator raises a ValidationException whose message stringifies every element of valid_set. In my case, valid_set contained 140k UUIDs, which produced an enormous string.
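To put a rough number on that second issue, here is a synthetic reproduction (140k generated UUIDs standing in for the real data, and a hand-written message standing in for the actual exception text): embedding the whole set makes each message run to several million characters.

```python
import uuid

# Synthetic stand-in for the reported valid_set of ~140k UUIDs.
valid_set = {str(uuid.UUID(int=i)) for i in range(140_000)}

# Roughly what happens when the exception message embeds the entire set.
message = "'{}' was not in the set: {}".format("some-id", valid_set)

print(len(message))  # several million characters, built for every invalid row
```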

If we can fix at least one of the two issues the memory problem should be gone:

  • The exceptions are printed as debug messages and the logger defaults to the info level, so no exceptions are printed by default. Can we disable collecting all the exception objects?
  • Don't stringify the entire valid_set. Instead, cut it off at 100 elements, similar to what _log_validator_failures already does.
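The second bullet could look something like the sketch below. The function name and the exact output format are assumptions for illustration here, not the final vladiate implementation:

```python
from itertools import islice

def stringify_set(a_set, max_len):
    # Render at most max_len elements; note how many were suppressed.
    if len(a_set) <= max_len:
        return "{{{}}}".format(", ".join(repr(e) for e in sorted(a_set)))
    shown = ", ".join(repr(e) for e in islice(a_set, max_len))
    return "{{{}, ... ({} more suppressed)}}".format(shown, len(a_set) - max_len)
```

For the 140k-element case this keeps the exception message to a few kilobytes instead of several megabytes.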

PS: I'm happy to contribute!

@di
Owner

di commented Dec 19, 2017

The exceptions are printed as debug messages and the logger defaults to the info level, so no exceptions are printed by default. Can we disable collecting all the exception objects?

I think there are two things we could do here:

  1. We could only collect the failure exceptions if debugging is turned on. We'd need to add a --verbose flag or something, and I think we'd also need an additional variable to determine whether there were failures when not in debug mode (since we can't just check for a non-empty failures dict).

  2. We could collect something other than the entire exception (which is probably huge) into failures, like maybe just the exception's message?

I'm leaning towards the first one, since it seems like less work, and still preserves the entire exception for debugging, but I could be convinced otherwise.

Don't stringify the entire valid_set. Instead cut it at 100 elements, similarly to what is done in _log_validator_failures

I think this is a great idea.

PS: I'm happy to contribute!

PRs are welcome!

syxolk added a commit to syxolk/vladiate that referenced this issue Dec 20, 2017
A new function stringify_set only stringifies n elements of a given set.
syxolk added a commit to syxolk/vladiate that referenced this issue Dec 22, 2017
A new function stringify_set only stringifies n elements of a given set.
di pushed a commit that referenced this issue Jan 2, 2018
* Fix issue with SetValidator and large valid_set (#50)

A new function stringify_set only stringifies n elements of a given set.

* Use itertools.islice for better python2 performance

* Add SetValidator test for coverage

* Fix stringify_set: Return {...} instead of [...]

* Add test cases for stringify_set

* Fix stringify_set: Sort the elements before displaying for small sets

* Remove previously introduced test case for set validator

* Add parameter checks for stringify_set

* Rename stringify_set to _stringify_set

* Revert d355def
@syxolk mentioned this issue Jan 3, 2018