
Add micro-benchmarks #19

Merged
julienrf merged 4 commits into master from benchmark on Jan 28, 2017

Conversation

julienrf
Contributor

This PR adds a few micro-benchmarks that may help us to get some numbers when tuning our implementations.

I was not able to produce nice charts; the output is text-only (/cc @axel22).

@julienrf julienrf mentioned this pull request Jan 18, 2017
@szeiger
Contributor

szeiger commented Jan 18, 2017

You should talk to @lrytz and @retronym who are currently working on the benchmarking infrastructure for scalac.

@@ -16,5 +16,6 @@ parallelExecution in Test := false

 libraryDependencies ++= Seq(
   "org.scala-lang.modules" %% "scala-java8-compat" % "0.8.0",
-  "com.novocode" % "junit-interface" % "0.11" % "test"
+  "com.novocode" % "junit-interface" % "0.11" % "test",
+  "com.storm-enroute" %% "scalameter" % "0.8.2"
Contributor

Have you considered JMH (via sbt-jmh)? That's what we've used for other Scala library benchmarking so far.

Contributor Author

Yes, I actually started with JMH but didn’t find a way to make it correctly run the tests. The problem is that JMH has to guess how to build your test suite, and the only way you can tell it how to do so is via annotations with very poor composition capabilities. As a result, I was unable to define a configuration generating test data based on the test @Params (maybe that’s possible, but I couldn’t figure out how to do so…).
With scalameter, in contrast, the control is not inverted: you can build your test suite as you want and then ask scalameter to benchmark it.
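For illustration, here is a minimal sketch of that non-inverted style, assuming the ScalaMeter 0.8.x API (the generator sizes and names are placeholders):

import org.scalameter.api._

object ForeachBench extends Bench.LocalTime {
  // the test data is built with ordinary code before ScalaMeter sees it
  val sizes = Gen.exponential("size")(8, 512, 2)
  val lists = sizes.map(n => List.fill(n)(new Object))

  performance of "scala.List" in {
    measure method "foreach" in {
      using(lists) in { xs =>
        var n = 0
        xs.foreach(x => if (x eq null) n += 1)
        n
      }
    }
  }
}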

Are there known issues in using scalameter?

Contributor Author

BTW, there is another team responsible for writing a more polished test suite. In the meantime, I think this one can give us a correct estimate of the differences between the various collection implementations.

axel22

There are no known issues with using scalameter that I am aware of, and I use it directly in my projects. I am not actively maintaining it, since I am satisfied with the current feature set, which roughly covers:

- measuring running time
- measuring running time with outlier elimination
- measuring running time with GC elimination
- measuring memory footprint
- measuring method invocation counts
- measuring boxing counts
- measuring GC counts

There is an HTML reporter that outputs a webpage with curves, and there is also a MongoDB reporter that flushes the results into a MongoDB instance.

I personally switched from the Bench frontend to the JBench frontend, which is based on annotations, but both work fine.

Contributor

I'm not aware of any issues. I don't know why we use sbt-jmh instead. Maybe for historical reasons.


In my experience, scalameter is really good if you want to store the history of runs and maybe do regression testing. It's also good as long as the value you're measuring isn't less than a millisecond.
Also, I've found JMH to be a better tool for answering the "why?" question, as it can produce both the heatmap and the actual generated assembly.
I think that for the kind of benchmarking being done here, both would do fine.

Contributor

I remember when I started using JMH I found some workloads on which JMH gave useful results and Scalameter didn't. I don't remember any longer what this was, though, just that I specifically went looking for it based on the JMH documentation, and found it. Thyme also got it wrong but it usually got head-to-head comparisons qualitatively right due to explicitly testing them head-to-head. (But Thyme is vulnerable to peculiarities of JVM history, since it runs entirely within the VM you've already got.)

Anyway, for now, I think either Scalameter or JMH is fine. Scalameter is prettier.

@@ -0,0 +1,239 @@
package bench
Contributor

These benchmarks shouldn't be in the main project. In fact, the more sbt projects I create, the more I am convinced that a single-project build in . (like I created it here) is a bad idea. Eventually you'll have to refactor it, which means moving all existing sources (or using an ugly, non-standard sbt setup or complicating things with configurations instead of projects).

Contributor Author

These benchmarks shouldn't be in the main project

Can you elaborate why?

axel22

If you want to use configurations, here is an example of how you can create a separate SBT configuration for benchmarks:

https://github.com/scalameter/scalameter-examples/blob/master/basic-with-separate-config/build.sbt

If you do this, your dependency should have an extra % "bench" at the end.
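For example, the library dependency line would then look like this (a sketch, assuming the configuration is named "bench" as in the linked example):

libraryDependencies += "com.storm-enroute" %% "scalameter" % "0.8.2" % "bench"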

Otherwise, you could use a separate project.

Contributor

I suppose it's not really important here because we don't expect to publish any artifacts. In general you don't want to have test or benchmarking code be part of the main project / artifact to avoid cluttering the source file structure and introducing accidental dependencies.


benchmark("scala.List", scalaLists) { xs =>
var x: Object = null
xs.foreach(x = _)
Contributor

This can potentially be optimized away by a sufficiently aggressive JIT. You need something that leaves some sort of impact that depends on every element. For instance,

var n = 0
xs.foreach{ x => if (x eq null) n += 1 }
n

axel22

Agreed.

If you really want to be sure that this won't be completely eliminated, you would need to assign the value n to a volatile variable at the end of the benchmark. The problem is that aggressive JIT optimizations could also compile the benchmark function (or its caller) and aggressively inline everything, including the foreach call here.
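A minimal sketch of that idea (Sink is a hypothetical name):

object Sink { @volatile var value: Int = 0 }

var n = 0
xs.foreach { x => if (x eq null) n += 1 }
Sink.value = n // a volatile write that the JIT cannot prove to be unobserved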

Contributor

Doesn't ScalaMeter do a BlackHole-like thing to consume the results of a benchmark? It doesn't matter if the JIT compiler inlines the foreach because it will do that in production code, too, if it can. But it does matter if it can throw away the answer and thus throw away all the work.

@axel22 axel22 Jan 18, 2017

It does so for memory footprint measurement, but until now it did not do this for time measurements. From now on, it does so for running time too:

scalameter/scalameter@f82d561

About the foreach comment: I meant inlining both the caller and foreach in this test, which would be a prerequisite for completely throwing the result away.

}

benchmark("scala.List", scalaLists)(_.tail)
benchmark("List", lists)(_.tail)
Contributor

Does a single tail take enough time to be measurable? I rarely get good results on single ns-length operations, even with JMH, which arguably tries the hardest to make this work. You have to be careful with ArrayBuffer, where tail will be O(n), but I think this may need a little more thought.

axel22

+1 - it looks like tail would be a super-fast operation here, and you will not get any good numbers.

It would be better to use the tail operation in a while loop.
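A sketch of that shape, using the benchmark helper from this PR:

benchmark("scala.List", scalaLists) { xs =>
  // repeat tail until the list is exhausted, so the measured work scales with the size
  var rest = xs
  while (rest.nonEmpty) rest = rest.tail
  rest
}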

Contributor Author

Indeed, I think I first did it in a loop, but that took way too long with Arrays. I had the same issue with apply.

@@ -39,7 +39,7 @@ class ArrayBuffer[A] private (initElems: Array[AnyRef], initLength: Int)
     start = 0
   }
   else {
-    val newelems = new Array[AnyRef](end * 2)
+    val newelems = new Array[AnyRef](if (end == 0) 16 else (end * 2))
Contributor

@Ichoran Ichoran Jan 18, 2017

I think this algorithm could be improved in three aspects. First, allocating size 1 then 2 then 4 then 8 then 16 is pretty wasteful. I'd go for at least 4, maybe 8, on the first realloc. Second, no point doing an Array.copy if end==0. Third, you max out at (1 << 30) elements this way when you could have ((1 << 31) - 1) if you handled the last buffer increase separately. I know you didn't write the entire algorithm, but if we're improving this, let's do it all the way.
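A hypothetical sketch of a growth policy covering all three points (elems stands for the backing array field, and the array-length cap is a VM-dependent assumption):

def grownSize(end: Int): Int =
  if (end == 0) 8                              // 1: skip the 1/2/4 reallocations
  else if (end >= (1 << 30)) Int.MaxValue - 8  // 3: handle the last increase separately
  else end * 2

val newelems = new Array[AnyRef](grownSize(end))
if (end > 0) Array.copy(elems, 0, newelems, 0, end) // 2: no copy when end == 0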

Contributor Author

I just created #20 to keep track of this.

Contributor

Another problem of this implementation is integer overflow. We should always look at the current collections library for implementation details where these corner cases have already been discovered and fixed.

Contributor

The integer overflow isn't a terrible problem because you get an exception when you try to allocate -(1 << 31) bytes. It's not the ideal exception to have, but you were going to get an exception anyway, and practically nobody checks whether something is a NegativeArraySizeException or whatever we'd throw on a too-full array. But in general I fully agree that we need to keep careful watch on out of bounds errors and other such corner cases.

Contributor

The overflow cuts the usable size for arrays in half. Once you exceed MaxInt/2 elements, any attempt to grow the buffer fails, even though you should be able to grow up to MaxInt size.
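Concretely:

val end = (1 << 30) + 1 // just over MaxInt / 2 elements
val doubled = end * 2   // overflows to a negative Int
// new Array[AnyRef](doubled) then throws NegativeArraySizeException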

Contributor

@szeiger - Yes, I already pointed out the "max out at (1 << 30)" problem. I thought you meant that there was an overflow problem beyond the lack of capacity that I mentioned the first time around.

@Ichoran
Contributor

Ichoran commented Jan 18, 2017

Aside from using ScalaMeter instead of JMH, this looks reasonable to me. I haven't used ScalaMeter very much, so I can't vouch for its accuracy except to say that when I did head-to-head testing of ScalaMeter against Thyme (which I wrote), I didn't find any cases where Thyme seemed to get things right but ScalaMeter got things wrong.

Picking up the benchmarking framework from the main library might be better, though. I'm not sure. I don't think it's terribly important at this point as long as the tests are telling us something reasonably meaningful and we're not duplicating huge amounts of work.

@SethTisue
Member

Picking up the benchmarking framework from the main library might be better, though

not sure if everybody even knows about the existence of the (JMH-based) stuff in https://github.com/scala/scala/blob/2.12.x/test/benchmarks/README.md

that's what we normally use for benchmarking stdlib stuff, so if using the same stuff is workable here, we wouldn't have to wonder whether we're comparing apples and oranges when comparing results from two different tools



benchmark("scala.List", scalaLists)(_.tail)
benchmark("List", lists)(_.tail)
benchmark("LazyList", lazyLists)(_.tail)
axel22

Same here.


benchmark("scala.List", scalaLists) { xs =>
var x: Object = null
xs.foreach(x = _)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

If you really want to be sure that this want be completely eliminated, you would need to assign the value n to a volatile variable at the end of the benchmark. The problem is that aggressive JIT optimizations could also compile the benchmark function (or its caller) and aggressively inline everything including the foreach call here.


}

trait Cons extends Bench.ForkedTime with Generators {
@axel22 axel22 Jan 18, 2017

Note - the ForkedTime predefined template ignores GC:

https://github.com/scalameter/scalameter/blob/dd7f057e70c4653ba4a38afe9699c5c4c1f097a3/src/main/scala/org/scalameter/Bench.scala#L108

This is more stable (hence useful if you care about regression testing), but depending on what you want to measure, it might make more sense to include GC pauses in the measurement. You can do this by overriding the measurer method of the test to return a Measurer.Default instance.
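A sketch of that override, assuming the 0.8.x API and the Cons trait defined in this file:

trait ConsWithGcPauses extends Bench.ForkedTime with Generators {
  // include GC pauses in the timings instead of ignoring them
  override def measurer = new Measurer.Default
}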

Contributor Author

I would say that for micro-benchmarks like these it is better to ignore GCs (though for a more complex benchmark, like a complete sorting algorithm, maybe it would be better not to ignore them). What do you think?

@julienrf
Contributor Author

@Ichoran and @axel22, I tuned the benchmarks according to your comments: I now perform tail and apply within a while loop so the measured work takes longer. I didn't apply the @volatile trick, though.

@axel22

axel22 commented Jan 19, 2017 via email

@szeiger
Contributor

szeiger commented Jan 20, 2017

We also discussed this internally in the Scala team, and the consensus is that JMH is the preferred tool. The current benchmarking infrastructure is for the compiler only but could eventually be extended to cover the library, and it is based on JMH. The biggest argument against ScalaMeter, though, is binary compatibility. If there are any impediments that prevent us from getting useful numbers with JMH now, we can start with ScalaMeter, but all benchmarks that we want to keep around and move into the main Scala project together with the library will eventually have to be ported to JMH.

@julienrf
Contributor Author

@szeiger Alright, I will migrate to JMH.
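For reference, a hypothetical sketch of what one of these benchmarks might look like ported to JMH (via sbt-jmh); names and parameter values are illustrative:

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class TailBenchmark {
  @Param(Array("8", "64", "512"))
  var size: Int = _

  var xs: List[Int] = _

  @Setup(Level.Trial)
  def setup(): Unit = xs = List.fill(size)(0) // data derived from the @Param in a @Setup method

  @Benchmark
  def drainByTail(): List[Int] = {
    var rest = xs
    while (rest.nonEmpty) rest = rest.tail
    rest
  }
}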

@SethTisue
Member

The current benchmarking infrastructure is for the compiler only

what about the JMH-based stuff in https://github.com/scala/scala/tree/2.12.x/test/benchmarks ?

@odersky
Contributor

odersky commented Jan 22, 2017 via email

@julienrf
Contributor Author

I just migrated to JMH.

I found no equivalent of the “memory footprint” benchmark of ScalaMeter, so I added a small sub-project with some ad-hoc code doing that. This is not something I’m used to doing, so please let me know if you think we should use ScalaMeter (or anything else) for that.
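The ad-hoc measurement is roughly of this shape (a sketch; the names are illustrative, not the actual code):

def memoryFootprint(alloc: () => AnyRef): Long = {
  val rt = Runtime.getRuntime
  System.gc()
  val before = rt.totalMemory() - rt.freeMemory()
  val obj = alloc()
  System.gc()
  val after = rt.totalMemory() - rt.freeMemory()
  assert(obj ne null) // keep a live reference so the allocation is not collected early
  after - before
}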

@retronym
Member

@julienrf
Contributor Author

Thanks @retronym. I just gave it a try and got the following exception:

java.lang.IllegalAccessError: tried to access class org.openjdk.jol.vm.InstrumentationSupport$Installer from class org.openjdk.jol.vm.InstrumentationSupport
        at org.openjdk.jol.vm.InstrumentationSupport.saveAgentJar(InstrumentationSupport.java:109)
        at org.openjdk.jol.vm.InstrumentationSupport.tryDynamicAttach(InstrumentationSupport.java:93)
        at org.openjdk.jol.vm.InstrumentationSupport.instance(InstrumentationSupport.java:63)
        at org.openjdk.jol.vm.VM.current(VM.java:73)
        at org.openjdk.jol.info.GraphPathRecord.<init>(GraphPathRecord.java:60)
        at org.openjdk.jol.info.GraphPathRecord.<init>(GraphPathRecord.java:45)
        at org.openjdk.jol.info.GraphWalker.walk(GraphWalker.java:67)
        at org.openjdk.jol.info.GraphLayout.parseInstance(GraphLayout.java:56)
        at strawman.collection.MemoryFootprint$.$anonfun$benchmark$1(MemoryFootprint.scala:23)
        at scala.runtime.java8.JFunction1$mcJI$sp.apply(JFunction1$mcJI$sp.java:12)
        at scala.collection.immutable.List.map(List.scala:272)
        at strawman.collection.MemoryFootprint$.benchmark(MemoryFootprint.scala:22)

I also get the following error from time to time:

java.lang.IncompatibleClassChangeError: strawman.collection.MemoryFootprint and strawman.collection.MemoryFootprint$delayedInit$body disagree on InnerClasses attribute

This might be related to the nature of the bytecode generated by the Scala compiler.

Or maybe there is some installation step that is missing… I pushed the code in the benchmark-jol branch, in case you want to have a look.

I would be really happy to use the tools that you recommend, but I must confess that I am super tired of working with Java-based tools that break your programming model.

@retronym
Member

Yeah, JOL and JMH are somewhat magic, but part of that is necessary complexity from the jobs they are doing. But there are real downsides to using Scala frameworks for testing the Scala distribution itself; for instance, I've spent close to a week working through issues with Scalacheck/SBT in our build.

Hopefully #22 helps some.

@julienrf
Contributor Author

julienrf commented Jan 25, 2017

It is now possible to produce the following kind of chart for each benchmark (with the timeBenchmark/charts sbt task):

[chart: cons]

@ktoso

ktoso commented Jan 25, 2017

These are very cool, @julienrf! Would you consider contributing the charts code back to sbt-jmh?
I'd love for the Scala benchmarking ecosystem as a whole to get easier access to such things; it would improve the way people post results online a lot.

refs sbt/sbt-jmh#103

@DarkDimius

@julienrf, would you consider making the y (time) scale linear instead of logarithmic? Log scale makes it hard to see the factors.

@julienrf
Contributor Author

@ktoso The charts are specific to our benchmarks because they expect them to be parameterized by a size. Also, note that they add dependencies on JFreeChart and play-json to the build.

@ktoso

ktoso commented Jan 25, 2017

The dependency is not a problem for the plugin; it could be an extra plugin. If you'd be willing to help out, let's chat in the other ticket; it would be very cool IMO.

@julienrf
Contributor Author

@DarkDimius Here is what we would get, for instance:

Linear scale:

[chart: cons]

Log scale:

[chart: cons]

IMHO the problem with the linear scale is that the high values on the right make it difficult to read the small values on the left.

I think log scales are good for visualizing the behavior of one collection as a function of the number of its elements.

If we want to compare the factors between different collection implementations, I would normalize the values so that for size n we compare values between 0 and 1, for size n + 1 we also compare values between 0 and 1, etc. I will try to implement that, to see if it is worth it.
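A sketch of that normalization, with times grouped per size:

// for each size, divide every implementation's time by the maximum at that size
def normalize(bySize: Map[Int, Map[String, Double]]): Map[Int, Map[String, Double]] =
  bySize.map { case (size, times) =>
    val max = times.values.max
    size -> times.map { case (impl, t) => impl -> t / max }
  }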

@Ichoran
Contributor

Ichoran commented Jan 26, 2017

I strongly support log scales for comprehensibility.

If you are just doing head-to-head comparisons, you can set one collection as the baseline and plot everything else as log ratio against that collection. This has the effect of spreading the Y-axis out enough so everything is visible. Takes a few seconds longer to grok, but often worth it if you otherwise can't see the differences.
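A sketch of that transformation, given the times of each implementation at one size:

// plot each implementation as the log of its ratio against a chosen baseline
def logRatios(times: Map[String, Double], baseline: String): Map[String, Double] =
  times.map { case (impl, t) => impl -> math.log(t / times(baseline)) }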

@julienrf
Contributor Author

julienrf commented Jan 26, 2017

@DarkDimius We could have something like the following to figure out factors between implementations:

[chart: cons-factors]

For instance, we see that to build a collection of 8 elements, ArrayBuffer is ~30% faster than ListBuffer. With 512 elements, on the other hand, ListBuffer is ~20% faster than ArrayBuffer.

What do you think?

@szeiger
Contributor

szeiger commented Jan 26, 2017

Looks good! We should get this merged so we can use the infrastructure for benchmarking other changes.

@odersky
Contributor

odersky commented Jan 28, 2017

Looks good to me. @julienrf please merge at your convenience.

@julienrf julienrf merged commit 960e273 into master Jan 28, 2017
@julienrf julienrf deleted the benchmark branch January 28, 2017 10:43
@biboudis biboudis mentioned this pull request Apr 26, 2017