Add micro-benchmarks #19
Conversation
@@ -16,5 +16,6 @@ parallelExecution in Test := false

 libraryDependencies ++= Seq(
   "org.scala-lang.modules" %% "scala-java8-compat" % "0.8.0",
-  "com.novocode" % "junit-interface" % "0.11" % "test"
+  "com.novocode" % "junit-interface" % "0.11" % "test",
+  "com.storm-enroute" %% "scalameter" % "0.8.2"
Have you considered JMH (via sbt-jmh)? That's what we used for other Scala library benchmarking so far.
Yes, I actually started with JMH but didn't find a way to make it run the tests correctly. The problem is that JMH has to guess how to build your test suite, and the only way you can tell it how to do so is via annotations with very poor composition capabilities. As a result, I was unable to define a configuration that generates test data based on the test @Params (maybe that's possible, but I couldn't figure out how to do so…).
With scalameter, in contrast, the control is not inverted: you can build your test suite as you want and then ask scalameter to benchmark it.
Are there known issues in using scalameter?
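For illustration, a minimal ScalaMeter-style sketch of that non-inverted setup (the object name, generator sizes, and measured operation below are made up for the example, not taken from this PR):

import org.scalameter.api._

object ListForeachBenchmark extends Bench.LocalTime {
  // Test data is built as ordinary Scala values, no annotations required.
  val sizes: Gen[Int] = Gen.range("size")(100000, 500000, 200000)
  val lists: Gen[List[Int]] = sizes.map(n => List.range(0, n))

  // The suite is then handed to ScalaMeter to benchmark.
  performance of "List" in {
    measure method "foreach" in {
      using(lists) in { xs =>
        var sum = 0
        xs.foreach(sum += _)
        sum
      }
    }
  }
}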
BTW, there is another team which is responsible for writing a more polished test suite. In the meantime, I think this one can give us a reasonable estimate of the differences between the various collection implementations.
There are no known issues in using scalameter that I am aware of, and I use it directly in my projects. I am not actively maintaining it, since I am satisfied with the current feature set, which is roughly measuring running time, measuring running time with outlier elimination, measuring running time with GC elimination, measuring memory footprint, measuring method invocation counts, measuring boxing counts, and measuring GC counts.
There is an HTML reporter that outputs a webpage with curves, and there is also a MongoDB reporter that flushes the results into a MongoDB instance.
I personally switched from using the Bench frontend to the JBench frontend, which is based on annotations, but both work fine.
I'm not aware of any issues. I don't know why we use sbt-jmh instead. Maybe for historical reasons.
Based on my experience, scalameter is really good if you want to store the history of runs and maybe run regression testing. It's also fine as long as the value you're measuring isn't less than a millisecond.
Also, I've found JMH to be a better tool for answering the "why?" question, as it can produce both a heatmap and the actual generated assembly.
I think that for the kind of benchmarking done here, both would do fine.
I remember when I started using JMH I found some workloads on which JMH gave useful results and Scalameter didn't. I don't remember any longer what this was, though, just that I specifically went looking for it based on the JMH documentation, and found it. Thyme also got it wrong but it usually got head-to-head comparisons qualitatively right due to explicitly testing them head-to-head. (But Thyme is vulnerable to peculiarities of JVM history, since it runs entirely within the VM you've already got.)
Anyway, for now, I think either Scalameter or JMH is fine. Scalameter is prettier.
@@ -0,0 +1,239 @@
+package bench
These benchmarks shouldn't be in the main project. In fact, the more sbt projects I create, the more I am convinced that a single-project build in the root directory (like I created it here) is a bad idea. Eventually you'll have to refactor it, which means moving all existing sources (or using an ugly, non-standard sbt setup, or complicating things with configurations instead of projects).
> These benchmarks shouldn't be in the main project
Can you elaborate why?
If you want to use configurations, here is an example of how you can create a separate SBT configuration for benchmarks:
https://github.com/scalameter/scalameter-examples/blob/master/basic-with-separate-config/build.sbt
If you do this, your dependency should have an extra % "bench" at the end.
Otherwise, you could use a separate project.
I suppose it's not really important here because we don't expect to publish any artifacts. In general you don't want to have test or benchmarking code be part of the main project / artifact to avoid cluttering the source file structure and introducing accidental dependencies.
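As a rough sketch of the separate-project option (the project names and wiring below are illustrative assumptions, not what this PR currently does):

// build.sbt
lazy val collections = project.in(file("."))

lazy val bench = project.in(file("bench"))
  .dependsOn(collections)
  .settings(
    libraryDependencies += "com.storm-enroute" %% "scalameter" % "0.8.2",
    testFrameworks += new TestFramework("org.scalameter.ScalaMeterFramework"),
    // keep benchmark runs sequential and the code out of the published artifact
    parallelExecution in Test := false,
    publishArtifact := false
  )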
benchmark("scala.List", scalaLists) { xs =>
  var x: Object = null
  xs.foreach(x = _)
This can potentially be optimized away by a sufficiently aggressive JIT. You need something that leaves some sort of impact that depends on every element. For instance,
var n = 0
xs.foreach{ x => if (x eq null) n += 1 }
n
Agreed.
If you really want to be sure that this won't be completely eliminated, you would need to assign the value n to a volatile variable at the end of the benchmark. The problem is that aggressive JIT optimizations could also compile the benchmark function (or its caller) and aggressively inline everything, including the foreach call here.
Doesn't ScalaMeter do a BlackHole-like thing to consume the results of a benchmark? It doesn't matter if the JIT compiler inlines the foreach because it will do that in production code, too, if it can. But it does matter if it can throw away the answer and thus throw away all the work.
It does so for memory footprint measurement, but so far it did not do this for time measurements. From now on it does so for running time too:
About the foreach comment - I meant inlining both the caller and foreach in this test, which would be a prerequisite for completely throwing the result away.
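For comparison, this is roughly how JMH sidesteps the dead-code problem with its Blackhole (a hypothetical benchmark, not part of this PR; the fixture is made up):

import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
class ForeachBench {
  var xs: List[AnyRef] = _

  @Setup
  def prepare(): Unit = xs = List.fill(1000)(new Object)

  @Benchmark
  def foreachConsumed(bh: Blackhole): Unit =
    // Feeding each element to the Blackhole prevents the JIT from proving
    // the loop body dead and eliding the whole traversal.
    xs.foreach(x => bh.consume(x))
}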
}

benchmark("scala.List", scalaLists)(_.tail)
benchmark("List", lists)(_.tail)
Does a single tail take enough time to be measurable? I rarely get good results on single ns-length operations, even with JMH, which arguably tries the hardest to make this work. You have to be careful with ArrayBuffer, which will be O(n), but I think this may need a little more thought.
+1 - it looks like tail should be a super-fast operation here, and you will not get any good numbers. It would be better to use the tail operation in a while loop.
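Something along these lines, as a sketch (the helper name is made up): drive tail in a while loop so one measured invocation traverses the whole list instead of doing a single sub-nanosecond call.

def drainWithTail(xs: List[AnyRef]): Int = {
  var rest = xs
  var n = 0
  while (rest.nonEmpty) {
    rest = rest.tail
    n += 1
  }
  n // return something derived from the work so it cannot be elided
}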
Indeed, I think I first did it in a loop but that was way too long with Arrays. I had the same issue with apply.
@@ -39,7 +39,7 @@ class ArrayBuffer[A] private (initElems: Array[AnyRef], initLength: Int)
     start = 0
   }
   else {
-    val newelems = new Array[AnyRef](end * 2)
+    val newelems = new Array[AnyRef](if (end == 0) 16 else (end * 2))
I think this algorithm could be improved in three aspects. First, allocating size 1 then 2 then 4 then 8 then 16 is pretty wasteful; I'd go for at least 4, maybe 8, on the first realloc. Second, there is no point doing an Array.copy if end == 0. Third, you max out at (1 << 30) elements this way when you could have ((1 << 31) - 1) if you handled the last buffer increase separately. I know you didn't write the entire algorithm, but if we're improving this, let's do it all the way.
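A sketch of a growth policy along those lines (not the PR's code; the MaxArraySize constant is an assumed conservative JVM limit):

def grownCapacity(current: Int, needed: Int): Int = {
  val MaxArraySize = Int.MaxValue - 8                // assumption: conservative JVM array limit
  if (current == 0) math.max(16, needed)             // no tiny 1/2/4/8 steps (and no Array.copy needed)
  else if (current > MaxArraySize / 2) MaxArraySize  // handle the last increase separately, no overflow
  else math.max(current * 2, needed)
}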
I just created #20 to keep track of this.
Another problem of this implementation is integer overflow. We should always look at the current collections library for implementation details where these corner cases have already been discovered and fixed.
The integer overflow isn't a terrible problem because you get an exception when you try to allocate -(1 << 31) bytes. It's not the ideal exception to have, but you were going to get an exception anyway, and practically nobody checks whether something is a NegativeArraySizeException or whatever we'd throw on a too-full array. But in general I fully agree that we need to keep careful watch on out-of-bounds errors and other such corner cases.
The overflow cuts the usable size for arrays in half. Once you exceed MaxInt/2 size any attempt to grow the buffer fails even though you should be able to grow up to MaxInt size.
@szeiger - Yes, I already pointed out the "max out at (1 << 30)" problem. I thought you meant that there was an overflow problem beyond the lack of capacity that I mentioned the first time around.
Aside from using ScalaMeter instead of JMH, this looks reasonable to me. I haven't used ScalaMeter very much, so I can't vouch for its accuracy, except to say that when I did head-to-head testing of ScalaMeter against Thyme (which I wrote), I didn't find any cases where Thyme seemed to get things right but ScalaMeter got things wrong. Picking up the benchmarking framework from the main library might be better, though. I'm not sure. I don't think it's terribly important at this point, as long as the tests are telling us something reasonably meaningful and we're not duplicating huge amounts of work.
Not sure if everybody even knows about the existence of the (JMH-based) stuff in https://github.com/scala/scala/blob/2.12.x/test/benchmarks/README.md. That's what we normally use for benchmarking stdlib stuff, so if using the same setup is workable here, we wouldn't have to wonder whether we're comparing apples and oranges when comparing results from two different tools.
benchmark("scala.List", scalaLists)(_.tail)
benchmark("List", lists)(_.tail)
benchmark("LazyList", lazyLists)(_.tail)

Same here.
}

trait Cons extends Bench.ForkedTime with Generators {
Note - the ForkedTime predefined template ignores GC. This is more stable (hence useful if you care about regression testing), but depending on what you want to measure, it might make more sense to include GC pauses in the measurement. You can do this by overriding the measurer method of the test to return a Measurer.Default instance.
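A minimal sketch of that override, assuming ScalaMeter 0.8's API (the exact Measurer constructor is an assumption and may differ between versions):

import org.scalameter.api._

trait ConsIncludingGc extends Bench.ForkedTime with Generators {
  // Swap the GC-ignoring measurer from ForkedTime for the default one,
  // so GC pauses are included in the measured running time.
  override def measurer = new Measurer.Default
}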
I would say that for micro-benchmarks like these it is better to ignore GCs (though for a more complex benchmark, like a complete sorting algorithm, maybe it would be better not to ignore GCs). What do you think?
Sgtm. As I said, it mostly depends on whether you want to get a stable number, or an accurate average time.
If I wanted to track GC pressure, I would normally create a separate test for GC counts.
We also discussed this internally in the Scala team and the consensus is that JMH is the preferred tool. The current benchmarking infrastructure is for the compiler only but could eventually be extended to cover the library, and it is based on JMH. The biggest argument against ScalaMeter, though, is binary compatibility. If there are any impediments that prevent us from getting useful numbers with JMH now, we can start with ScalaMeter, but all benchmarks that we want to keep around and move into the main Scala project together with the library will have to be ported to JMH eventually.
@szeiger Alright, I will migrate to JMH.
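For reference, the sbt-jmh wiring is small; a sketch, with the plugin version and project layout below being assumptions rather than what ended up in the PR:

// project/plugins.sbt
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.2.21")

// build.sbt
lazy val collections = project.in(file("."))

lazy val benchmarks = project.in(file("benchmarks"))
  .dependsOn(collections)
  .enablePlugins(JmhPlugin)

Benchmark sources then live under benchmarks/src/main/scala and run with something like benchmarks/jmh:run.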
What about the JMH-based stuff in https://github.com/scala/scala/tree/2.12.x/test/benchmarks ?
Note that #14 has a better growing strategy for the backing array. It was
lifted from the current `ResizableArray`.
I just migrated to JMH. I found no equivalent of the “memory footprint” benchmark of ScalaMeter, so I added a small sub-project with some ad-hoc code doing that. This is not something I’m used to doing, so please let me know if you think we should use ScalaMeter (or anything else) for that.
I'd suggest using http://openjdk.java.net/projects/code-tools/jol/. Example code in http://igstan.ro/posts/2014-09-23-calculating-an-object-graphs-size-on-the-jvm.html
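A small sketch of what JOL gives you for the memory-footprint case (assuming a dependency on "org.openjdk.jol" % "jol-core"; the sample data is made up):

import org.openjdk.jol.info.GraphLayout

object FootprintSketch {
  def main(args: Array[String]): Unit = {
    val xs = List.fill(1000)(new Object)
    // GraphLayout walks the reachable object graph and sums instance sizes.
    val layout = GraphLayout.parseInstance(xs)
    println(s"${layout.totalSize()} bytes")
    println(layout.toFootprint) // per-class breakdown
  }
}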
Thanks @retronym. I just gave it a try and got the following exception:
I also get the following error from time to time:
This might be related to the nature of the bytecode generated by the Scala compiler, or maybe there is some installation step that is missing… I pushed the code so you can have a look. I would be really happy to use the tools that you recommend, but I must confess that I am super tired of working with Java-based tools that break your programming model.
Yeah, JOL and JMH are somewhat magic, but part of that is necessary complexity from the jobs they are doing. But there are real downsides to using Scala frameworks for testing the Scala distribution itself; for instance, I've spent close to a week working through problems with Scalacheck/SBT in our build. Hopefully #22 helps some.
These are very cool @julienrf! Would you consider contributing the charts code back to sbt-jmh? refs sbt/sbt-jmh#103
@julienrf, would you consider making the …
@ktoso The charts are specific to our benchmarks because they expect them to be parameterized by a …
The dependency is not a problem for the plugin; it could be an extra plugin. If you'd be willing to help out, let's chat in the other ticket; that would be very cool IMO.
@DarkDimius Here is what we would get, for instance (linear-scale and log-scale charts). IMHO the problem with the linear scale is that the high values on the right make it difficult to read the small values on the left. I think log scales are good to visualize the behavior of one collection over the number of its elements. If we want to compare the factors between different collection implementations, I would normalize the values so that for size …
I strongly support log scales for comprehensibility. If you are just doing head-to-head comparisons, you can set one collection as the baseline and plot everything else as a log ratio against that collection. This has the effect of spreading the Y-axis out enough so everything is visible. It takes a few seconds longer to grok, but often worth it if you otherwise can't see the differences.
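A tiny sketch of that normalization (names are illustrative): take one collection's timings as the baseline and report every other series as a log2 ratio against it.

def log2Ratios(baseline: Seq[Double], other: Seq[Double]): Seq[Double] =
  // 0 means same speed as the baseline, +1 means twice as slow, -1 twice as fast
  other.zip(baseline).map { case (t, b) => math.log(t / b) / math.log(2) }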
@DarkDimius We could have something like the following to figure out factors between implementations (see the chart). For instance, we see that to build a collection of 8 elements, … What do you think?
Looks good! We should get this merged so we can use the infrastructure for benchmarking other changes.
Looks good to me. @julienrf please merge at your convenience.
This PR adds a few micro-benchmarks that may help us to get some numbers when tuning our implementations.
I was not able to produce nice charts; the output is text-only (/cc @axel22).