
Sampling of alternatives: performance optimization #39

Open

smmaurer opened this issue Sep 5, 2018 · 2 comments

smmaurer commented Sep 5, 2018

This is a new issue to discuss performance optimization for sampling of alternatives in the MergedChoiceTable utility.

Earlier discussion is in issues #4, #5, and #11, now closed. Core functionality is implemented in PR #37.

Performance of the underlying sampling algorithms

Sampling items from a list is fastest with uniform weights and "replacement" of items (meaning that an item can appear multiple times in the sample).

In my tests, NumPy sampling implementations are generally faster than core Python (comparing NumPy 1.14 and Python 3.6 on Mac), but the differences are small.

Typical performance penalties

  • sampling with non-uniform weights: ~10x performance penalty
  • sampling without replacement: ~5x performance penalty
  • combining the two: ~50x performance penalty

Sampling time scales linearly with the sample size and also with the population size -- except for uniform sampling without replacement, which is essentially constant with respect to the population size.

The penalties above are for samples and populations greater than a few hundred thousand; with smaller datasets the performance is more idiosyncratic.
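For reference, here is a minimal sketch of the kind of microbenchmark behind these numbers, using np.random.choice. The population and sample sizes are illustrative, and exact timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

population = np.arange(1_000_000)            # items to sample from
weights = np.random.random(len(population))
weights /= weights.sum()                     # p must sum to 1 for np.random.choice
n = 500_000                                  # sample size

cases = {
    "uniform, with replacement":     dict(replace=True),
    "weighted, with replacement":    dict(replace=True, p=weights),
    "uniform, without replacement":  dict(replace=False),
    "weighted, without replacement": dict(replace=False, p=weights),
}

for label, kwargs in cases.items():
    t = timeit.timeit(lambda: np.random.choice(population, size=n, **kwargs),
                      number=1)
    print(f"{label}: {t:.2f} s")
```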

Performance in the context of MergedChoiceTable

In real-world use, there's also overhead from building and merging large DataFrames.

At the extremes, this can be quite significant. For example, generating a merged table from 1,000 choosers and 1,000 alternatives, without sampling (i.e. 1 million rows) takes about 40x longer than uniform sampling of 10 alternatives per chooser with replacement.

Here are some MergedChoiceTable performance benchmarks for sampling 10 alternatives from a set of 100k, for 100k choosers:

Sampling method                                       Time
Uniform sampling with replacement                     0.3 seconds
Alternative-specific weights, with replacement        0.4 seconds
Uniform sampling without replacement                  4.5 minutes
Alternative-specific weights, without replacement     6.5 minutes

At this scale there is only minimal penalty for non-uniform weights. But sampling without replacement is very inefficient, because it requires a separate random draw for each chooser.
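To make the bottleneck concrete, here is a rough sketch of the per-chooser pattern this implies (hypothetical names, not the actual MergedChoiceTable internals):

```python
import numpy as np

alternatives = np.arange(100_000)   # 100k alternatives
n_choosers = 100_000
sample_size = 10

# Without replacement, each chooser needs its own draw so that no
# alternative repeats within that chooser's sample -- one call per chooser.
samples = np.empty((n_choosers, sample_size), dtype=int)
for i in range(n_choosers):
    samples[i] = np.random.choice(alternatives, size=sample_size, replace=False)

# With replacement, a single vectorized draw covers all choosers at once.
pooled = np.random.choice(alternatives, size=(n_choosers, sample_size), replace=True)
```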

Interaction weights, which depend on both the chooser and the alternative, have the same problem. At smaller scales, the performance penalty is about 180x, a little worse than the penalty for sampling without replacement. (For this table I wasn't even able to run the benchmark, because it would require 10 billion interaction weights.)

These benchmarks are from a fast i7 iMac; an old i5 MacBook is about 50% slower.

Comparison with the older UrbanSim MNL codebase

Our older codebase, urbansim.urbanchoice, only supported the basic case of uniform sampling with replacement. It performs a little better than the new codebase for this use case, probably because its data-merging operations are better optimized.

Resources

Sampling-benchmarks.ipynb

Next steps

Here are some potential lines of investigation:

  1. Why are the repeated random draws (required for sampling without replacement or for interaction weights) so slow? Can we optimize them further?

  2. If the underlying sampling algorithms are part of the problem, are there alternatives to NumPy that would perform better?

  3. For interaction weights, can we get better performance by using generator functions or a different lookup format? What about optimizing for an additional case with shared weights ("buckets") that don't require as many separate draws? (See the sketch after this list.)
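To illustrate the bucket idea from item 3, here is a hypothetical sketch: choosers that share the same weight vector are grouped, and each bucket gets one vectorized draw rather than one draw per chooser. The names and data layout are assumptions, not existing code.

```python
import numpy as np

def bucketed_sample(bucket_of, bucket_weights, alternatives, sample_size):
    """Hypothetical: weighted sampling with replacement for many choosers,
    with one vectorized draw per weight bucket rather than per chooser.

    bucket_of      -- array mapping each chooser to a bucket id
    bucket_weights -- dict of bucket id -> probability vector over alternatives
    """
    samples = np.empty((len(bucket_of), sample_size), dtype=alternatives.dtype)
    for bucket, p in bucket_weights.items():
        members = np.flatnonzero(bucket_of == bucket)
        # A single call covers every chooser in this bucket.
        samples[members] = np.random.choice(
            alternatives, size=(len(members), sample_size), p=p)
    return samples
```

With k buckets, this costs k weighted draws instead of one per chooser, so the savings scale with how many choosers share each bucket.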

There's also the question of how to best calculate the weights themselves, which is a separate problem.

smmaurer commented

Just a note that there's nicer code now for the "alternative-specific weights, with replacement" case. We wrote it when working on choice simulation.

It's available as choicemodels.tools.monte_carlo_choices(), and can probably be swapped right into MergedChoiceTable for sampling of alternatives.
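The core trick, as I understand it, is to replace per-chooser calls to np.random.choice with one uniform draw per chooser plus a lookup against each chooser's cumulative probabilities. Here is a rough NumPy sketch of the idea; the real monte_carlo_choices() operates on long-format pandas data, so see the PR for the actual implementation.

```python
import numpy as np

def simulate_choices_sketch(probs):
    """Sketch: simulate one choice per chooser from a probability matrix.

    probs -- array of shape (n_choosers, n_alternatives), rows sum to 1
    """
    cumprobs = probs.cumsum(axis=1)             # row-wise CDF
    draws = np.random.random((len(probs), 1))   # one uniform draw per chooser
    # The chosen alternative is the first column whose cumulative
    # probability exceeds the draw; argmax finds the first True.
    return (cumprobs > draws).argmax(axis=1)
```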

See PR #43 for details and performance assessment.

smmaurer commented

Noting that the link to the Sampling-benchmarks notebook is broken. It should still be there somewhere, though.
