Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve perforamance of _pluck_uniq_cols #216

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

mlazowik
Copy link

@mlazowik mlazowik commented May 2, 2020

Use a set instead of an OrderedDict with empty values for gathering
unique values.

O(n^2) -> O(nlogn)

@coveralls
Copy link

coveralls commented May 2, 2020

Coverage Status

Coverage decreased (-0.7%) to 83.255% when pulling b37f9e6 on mlazowik:perf-set into 8628d7e on kobotoolbox:master.

@mlazowik
Copy link
Author

mlazowik commented May 2, 2020

This buys around 2-3s per reports/{uid} request on a form with ~2k questions and ~100 revisions.

@dorey
Copy link
Contributor

dorey commented May 2, 2020

Thanks, @mlazowik for looking into this. It looks like a clear improvement

@dorey
Copy link
Contributor

dorey commented May 2, 2020

I think this was good at db116cc . It was clear what the code was doing and seemed to be an improvement.

I'd rather merge that than include the orderedset library unless it's helpful for further performance improvements.

Please note, that a lot of the formpack code will be in flux in the next month as we move asset.content in kobotoolbox/kpi over to a json-schema validated structure, which opens up the possibility of loading in just the diffs between versions.

@mlazowik
Copy link
Author

mlazowik commented May 2, 2020

Right, I added the dep b/c I wanted to extract what I've written into utils/ordered_set.py so that it's not loose in a random place, but then I noticed there is already a library for that.

I didn't see any more perf. improvement over my code. I can revert the dep, what do you think about extracting my little code to utils/ordered_set.py?

@dorey
Copy link
Contributor

dorey commented May 2, 2020

If it were going to be reused elsewhere in the code I think that would be helpful. The person who helped us early on was fond of using OrderedDict() with empty values as a workaround for not having a native python implementation of ordered set. I was not aware of performance cost of this approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants