
mpl-1brc

Using MPL to solve the 1 Billion Row Challenge (site, GitHub).

See Sam's blog post for some discussion and details.

To run it, you need smlpkg and mpl installed.

$ git clone https://github.com/shwestrick/mpl-1brc
$ cd mpl-1brc
$ smlpkg sync
$ make
$ ./main @mpl procs 8 -- data/1M.txt --verbose   # run on example data set of 1 million measurements

The above command uses the included data/1M.txt input file, which contains 1 million measurements. For the full challenge, you need the file with 1 billion measurements. See the 1brc GitHub repo for instructions on how to generate it.
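
For reference, the upstream generator works roughly as follows (commands paraphrased from the 1brc repo's README; check that repo for the current instructions, since these may have changed):

$ git clone https://github.com/gunnarmorling/1brc
$ cd 1brc
$ ./mvnw clean verify
$ ./create_measurements.sh 1000000000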

Contributions are welcome if anyone wants to help improve performance! Feel free to submit PRs. See the ideas for performance improvements below. (I'm not sure about any of these, but they would be interesting to try.)

Current results

Here are my current results on 72 cores (144 hyperthreads).

With bounds checking disabled:

$ hyperfine --warmup 1 './main @mpl procs 144 set-affinity -- /usr3/data/1brc/measurements.txt --unsafe-no-bounds-checks'
Benchmark 1: ./main @mpl procs 144 set-affinity -- /usr3/data/1brc/measurements.txt --unsafe-no-bounds-checks
  Time (mean ± σ):      2.342 s ±  0.020 s    [User: 324.751 s, System: 2.546 s]
  Range (min … max):    2.317 s …  2.380 s    10 runs

And, with bounds checking enabled:

$ hyperfine --warmup 1 './main @mpl procs 144 set-affinity -- /usr3/data/1brc/measurements.txt'
Benchmark 1: ./main @mpl procs 144 set-affinity -- /usr3/data/1brc/measurements.txt
  Time (mean ± σ):      2.443 s ±  0.018 s    [User: 339.081 s, System: 2.687 s]
  Range (min … max):    2.417 s …  2.475 s    10 runs

These timings are not directly comparable with the timings reported in the competition, because of differences in hardware. (E.g., I'm using a much larger number of cores here.)

Key optimizations

I ended up making four key optimizations, which together yielded an improvement of approximately 10x over my initial working implementation.

The key optimizations were:

  1. Don't allocate any intermediate strings. Instead, use file offsets as keys in the hash table, and compare keys by reading directly from the contents of the file. (commit; sketched below)
  2. Don't tokenize the file. Instead, parse entries on-the-fly in the main loop. (commit)
  3. Reduce contention in the hash table by sharding the values into multiple copies. (commit; sketched below)
  4. Don't load the file into an array. Instead, operate directly on the mmap'ed file contents. (commits: a, b, c)
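
To make optimization 1 concrete, here is a minimal sketch (illustrative, not the actual code): equality and hashing walk the file in place, so no key strings are ever allocated. Here sub is an assumed function that reads byte i of the file contents, and a station name runs from its offset up to the ';' separator.

(* Compare two keys (byte offsets) by reading the file directly. *)
fun keyEquals sub (off1, off2) =
  let
    fun loop (i, j) =
      let val (c1, c2) = (sub i, sub j)
      in
        if c1 <> c2 then false        (* mismatch, or one name ended early *)
        else if c1 = #";" then true   (* both names ended together *)
        else loop (i + 1, j + 1)
      end
  in
    loop (off1, off2)
  end

(* Hash a key in place, again without allocating a string. The
   multiplier and seed are placeholders, not the real hash. *)
fun hashKey sub off =
  let
    fun loop (i, h) =
      let val c = sub i
      in
        if c = #";" then h
        else loop (i + 1, 0w31 * h + Word.fromInt (Char.ord c))
      end
  in
    loop (off, 0w7)
  end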
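
And a rough sketch of optimization 3 (again illustrative; the real code also needs atomic updates, which are elided here): each key's accumulated weight is split into several shards, an update touches only the shard chosen by the current processor, and the shards are combined once at the end.

(* weight = (count, sum, min, max) over integer-tenths temperatures *)
type weight = int * int * int * int

val emptyWeight : weight = (0, 0, 1000, ~1000)

fun combine ((c1, s1, mn1, mx1), (c2, s2, mn2, mx2)) : weight =
  (c1 + c2, s1 + s2, Int.min (mn1, mn2), Int.max (mx1, mx2))

val numShards = 8  (* an assumed tuning knob *)

(* Processor p updates only shard (p mod numShards), so concurrent
   updates to a hot key mostly hit different memory locations. *)
fun add (shards : weight array) p temp =
  let
    val s = p mod numShards
    val (c, sm, mn, mx) = Array.sub (shards, s)
  in
    Array.update (shards, s,
      (c + 1, sm + temp, Int.min (mn, temp), Int.max (mx, temp)))
  end

(* Combine all shards when reporting results. *)
fun total (shards : weight array) : weight =
  Array.foldl combine emptyWeight shards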

Potential improvements still TODO

  • Work directly on the mmap'ed file instead of loading it into an array first. (Done: this is optimization 4 above.)
  • Better hash function?
  • Faster parsing? There are some cool ideas floating around in discussions of other 1brc solutions.
  • Use block-local hash tables to avoid contention?
  • Shard the hash table to avoid contention. (Done: this is optimization 3 above.)
  • Option for disabling bounds checks. (Done: see the --unsafe-no-bounds-checks flag above.)
  • Store the components of the hash table values ("weights" in the code) in SoA style?
  • Pack the min and max components of the weights into a smaller value? (These don't need nearly as many bits as we're using currently... see the sketch after this list.)
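
On that last point, here is a minimal sketch of the packing idea (illustrative, not the actual code): temperatures are integer tenths in [-999, 999], so min and max each fit in 16 bits and can share a single 32-bit word.

(* Bias values into [0, 1998] so each fits in 16 unsigned bits. *)
fun pack (min, max) =
  Word32.orb
    (Word32.<< (Word32.fromInt (min + 999), 0w16),
     Word32.fromInt (max + 999))

fun unpack w =
  ( Word32.toInt (Word32.>> (w, 0w16)) - 999        (* min *)
  , Word32.toInt (Word32.andb (w, 0wxFFFF)) - 999   (* max *)
  )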

At the moment, my guess is that better parsing and hashing would lead to the biggest performance improvements. (We're getting great scalability, but single-core performance is a little slow.)
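
For what it's worth, here is one branch-light way to parse a temperature (illustrative, not the actual code). 1brc temperatures all have the form -?d?d.d, so a value can be read as integer tenths with a handful of byte reads. As before, sub is an assumed byte-reader over the file contents.

(* Parse a temperature of the form -?d?d.d as integer tenths.
   Returns (value, offset just past the trailing newline). *)
fun parseTemp sub i =
  let
    fun digit j = Char.ord (sub j) - Char.ord #"0"
    val (neg, i) = if sub i = #"-" then (true, i + 1) else (false, i)
    val (v, i) =
      if sub (i + 1) = #"." then             (* d.d  *)
        (10 * digit i + digit (i + 2), i + 3)
      else                                   (* dd.d *)
        (100 * digit i + 10 * digit (i + 1) + digit (i + 3), i + 4)
  in
    (if neg then ~v else v, i + 1)           (* i points at '\n' *)
  end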

I also suspected that avoiding copying the whole file into an array would be a big win, and it was: done, for a ~20% improvement (optimization 4 above).
