A first look at WebAssembly performance

WebAssembly gives us the promise to run high performance code in the browser in a standardized way. Now that there are a few WebAssembly previews available I decided it’s time to take a look at their performance. One source for benchmarks is the well known Computer Language Benchmarks Game and I decided to pick nbody (it’s almost four years ago since I did so last time…).

After playing a bit with the results I decided to put the code on github. I’m looking forward to your corrections, improvements and feedback. I’m already excited what the results will look like in a few months…

The following versions were compared:

  • webAssembly: A WebAssembly version compiled from the original c version, because this turned out to be faster than the other version I checked
  • object: The fastest javascript version from the Computer Language Benchmarks Game. It uses a javascript objects for each body to store the data.
  • arrayPerObject: Each body’s data is stored in a plain javascript array.
  • floatArrayPerObject: Each body’s data is stored in a typed array
  • oneTypedArray: All body’s data is stored in a single typed array and the advance function is programmatically unrolled (quite crazy, isn’t it).
  • To get a baseline the fastest java and the original c version were added.

Here are the results:


(click to enlarge)

Firefox does pretty well. The WebAssembly implementation is the fastest browser version and close to the java baseline, but the pure javascript implementation isn’t really much behind. Seems like Javascript VMs are already pretty good at simple numeric code.
For the other browsers WebAssembly couldn’t beat the javascript versions yet. And Safari has a completely different idea what Javascript version it can optimize best.

The fine print

Browser versions:

  • Chrome Canary, 58.0.3004.0, invoked with
    --js-flags="--turbo --trace-opt --trace-deopt --trace-bailout"

    for turbofan and

    --js-flags="--trace-opt --trace-deopt --trace-bailout"

    for crankshaft.

  • Firefox 53.0a2 (2017-02-06) (64-Bit)
  • Safari Technology Preview Release 22 (Safari 10.2, WebKit 12604.1.4.2)

WebAssembly setup:

  • Emscripten and Binaryen were installed as described on Compile Emscripten from Source.  (emcc (Emscripten gcc/clang-like replacement) 1.37.2 (commit 70d370296036cc5046083a3e215cb605c90e004b))
  • The c source code was compiled with that command:
    emcc nbody.c -O3 -s WASM=1 -s SIDE_MODULE=1 -o nbody.wasm

C compiler:

  • The c version was compiled with gcc -O3  nbody.c -o nbody (which is Apple LLVM version 8.0.0 (clang-800.0.42.1))
    This version took 4.4 seconds on my machine and was faster than the fastest C version from the shootout, compiled with gcc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 nbody_fastest.c -o nbody_fastest, which took 4.9 seconds on my machine

Infrastructure:

All tests were performed on a 2015 MacBook Pro, 2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3. For all tests the best of three runs was selected for the result.

JS web frameworks benchmark – Round 5

Welcome to another round of the js web framework benchmark.
Once again a lot has happened:

  • Elm, Knockout, Nx, Binding.scala, Dio, Polymer, Simulacra and Svelete are new in this round.
  • Most other frameworks have been updated to their latest version (though it took a bit to write this blog post, so they might be outdated again by now…)
  • There are now two result tables: One for “keyed” implementations and one for “non-keyed”. I’ve written a separate blog post about that.  The classification as keyed or non-keyed is not complete yet and only based on the current benchmark implementation and does not indicate that the framework doesn’t support the other mode. Feel free to send me pull requests!
  • The “clear rows a 2nd time” benchmark has been removed. It showed a (maybe theoretical) issue in react that is fixed by now.

I’d really like to thank all contributors (at the time of writing there are 33 of them). Without you it would be impossible to make such a comparison.

The results are here:

Key observations (pun intended):

  • It’s still the case that frameworks are getting faster. Riot 3 is a big step in terms of performance in this benchmark.
  • Ember is included again in this round and 2.10-beta is a big improvement.
  • I’ll really have to rework my vanilla.js implementation again if it should become the fastest possible implementation again 😉
  • Apart from that many frameworks are in a performance range that should be fine for many requirements. Vue.js, kivi, vidom, plastiq, domvm and bobril in the keyed table show little weakness.
  • As for the non-keyed frameworks the results are much closer. Dio, inferno and svelte are incredibly fast, domvm and Vue.js are only a bit behind.

You can to click through the implementations in your browser here. Please keep in mind that all durations printed on the browser console are just approximations, but good enough to feel the effect between fast and slow implementations. The real results are gathered from the chrome timeline with the test driver in webdriver-ts.

JS web frameworks benchmark – Round 1

In this blog post I’ll try to compare the performance a few popular javascript frameworks. Measuring the performance of browser content can be challenging and I’m prepared for harsh replies.

One approach to measure the performance would be to use browser tools like the chrome timeline, which reveals exact timings, but has the disadvantage of being a manual and time consuming process and yielding only a single sample.

At first I tried automated benchmarking tools such as Benchpress or protractor-perf, but I didn’t really understand the results and thus decided to roll my own selenium webdriver benchmark. I wrote an additional blog entry to describe this approach. To put it short it measures the duration from the start of a dom click event to the end of the repainting of the dom by parsing the performance log. To reduce sampling artifacts it takes the average of runnig each benchmark 12 times ignoring the two worst results.

Screenshot_Angular2

Continue reading “JS web frameworks benchmark – Round 1”

C, Java and Javascript numeric benchmark and a big surprise

In older posts on this blog I showed that Java’s performance can come pretty close to C – at least for simple numeric benchmarks. Android’s Dalvik VM is often blamed for its bad performance, though things have improved since the introduction of a JIT. With all that HTML5 vs. native debate and the ongoing Javascript hype I though it would be cool to run some benchmarks and compare the performance of Java, Javascript and C on my notebook, an iPhone 4S and a Google Nexus 7.

I decided to run a mandelbrot benchmark first. This time I wanted to see a nice colored mandelbrot set and found a Javascript version on http://www.atopon.org/mandel/# . I decided to bring that source code to Java and C (no SSE intrinsics, just plain C code to support all platforms). Here are the results showing the duration in milliseconds for each language and platform:Mandelbrot_CJJS

  • On the MacBook Pro the performance is within a narrow range. C wins with 19 msecs, Java is second with 21 msecs and Javascript is amazingly quick with 22 msecs.
  • On the iPhone the Javascript Nitro Engine takes 113% more time that the C version. Nevertheless almost factor 2 isn’t too bad for a dynamic language.
  • The most surprising result was on the Google Nexus 7. C was fastest, the Java code on the Davlik VM was 144% slower and the V8 Javascript engine achieved to beat the Dalvik VM significantly. The first computation was 65% slower and after that warm up subsequent computations were only 42% slower than C. Once again: V8 is much faster than Dalvik and easily within a range of 2 to C!

A single benchmark doesn’t prove anything, so why not add another well known benchmark. I took NBody (as I always did on this blog ;-)).

The results for NBody confirmed those results. For C I took the fastest plain C implementation from the Computer Language Benchmarks Game. Once again the y-axis shows the duration (this time in seconds).NBody_CJJS

  • On the MacBook Pro Javascript was 44% slower than C (Java only 1.5%).
  • The Javascript Nitro VM on my iPhone 4S was 99% slower than C.
  • On the Nexus 7 there’s once again the same image: Java is 106% slower than C, Javascript only 43%. Amazing.

For simple numeric benchmarks the performance of Javascript is simply astonishing. On the MacBook Pro Javascript is incredibly close to the performance of C and Java. On the iPhone Javascript was within a range of factor 2.1. For me it was very surprising to see V8 on Android being able to beat Java on the Davlik VM by a large margin.

Fine print:

  • Slower and faster usually cause headaches in benchmarks (There a nice paper about that http://hal.inria.fr/docs/00/73/92/37/PDF/percentfaster-techreport.pdf). I sticked with the elapsed time, such that e.g. 42% slower means that the factor of the durations was 1.42.
  • On the MacBook Pro C was compiled with clang using -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 for x64. Java was Oracle Hotspot 1.8.0-ea-b87 on 64 bit (thus C2 aka Server Hotspot). Chrome was 28.0.1493.0, but the 32 bit version. I tried to compile V8 myself, but both the x86 and x64 custom built V8 were significantly slower than Chrome so I stick with Chrome.
  • On the iPhone I used a release configuration using clang with (among others) -O3 -arch armv7
  • The Google Nexus 7 runs Android 4.2.2, Chrome 26.0.1410.58. C was compiled with -march=armv6 -marm -mfloat-abi=softfp -mfpu=vfp -O3.

Scalar replacement: Automatic stack allocation in the java virtual machine

The java virtual machine recently introduced a very interesting optimization that allows to eliminate some object allocations. This optimization is called scalar replacement and depends on escape analysis. You can read more about it in an article by Brian Goetz.

Simply spoken when an object is identified as non-escaping the JVM can replace its allocation on the heap with an allocation of its members on the stack which mitigates the lack of user guided stack allocation. The optimization is enabled by default since JDK 6U23 in the hotspot server compiler.
Continue reading “Scalar replacement: Automatic stack allocation in the java virtual machine”

Java vs. C benchmark #2: JET, harmony and GCJ

In one of the comments regarding my Java vs. C benchmark Dmitry Leskov suggested including Excelsior JET. JET has an ahead of time compiler and is known to greatly reduce startup time for java applications. I’ve kept an eye on JET since version 4 or so and while the startup time has always been excellent the peek performance of the hotspot server compiler was better. With JET 5 performance for e.g. scimark has improved greatly so I decided to rerun the benchmark for JET 5 and JET 6 beta 2. JET 6 beta 2 is currently available on windows only and thus the tests were run under Windows Vista, JET 5 (and all other VMs) ran under Ubuntu. I also benchmarked JET 5 on Windows to check if there’s a large OS-related difference, but the results were within 2.4% (still a t-Test showed a significant difference). As a simplification I decided to publish only the Ubuntu JET 5 results. Nevertheless I’ll update the results when beta 3 becomes available for linux.

Another interesting VM is Apache Harmony. It is designed to be a complete open source JDK and it received a lot of attention when it started (and it became pretty quite nowadays). It started before Sun decided to open their JDK under the GPL, so if nothing else harmony was in my opinion one of the reasons that we have Sun’s openjdk project now. Harmony’s VM is based on a intel donation so that alone makes benchmarking interesting. Of course Harmony is still in the early stages and it would be almost a miracle if Harmony 1.0 could beat the performance of Sun’s JDK.

The third VM is also an ahead of time compiler. GCJ is a java frontend for the GCC and thus might produce code roughly identical to GCC. There isn’t too much publicity for GCJ despite its effort to become a really usable JVM. Combined with the GIJ interpreter and the gcj-dbtool GCJ is able to compile even complex applications like eclipse. GCJ uses classpath as its underlying implementation of the JDK classes which means some parts of the JDK are still missing. I decided to use the Ubuntu gcc 4.3 snapshot as it turned out to work best on my PC. Continue reading “Java vs. C benchmark #2: JET, harmony and GCJ”

Java vs. C benchmark

Java’s performance is perceived rather differently depending on who you ask ranging from Java-is-faster-than-C to “java is 10x slower”.
Without actually running some benchmarks it’s hard to tell who is actually right and of course every benchmark will show different results and both sides have good arguments. I don’t know of any real world applications that has been ported from C to Java in such a way that statements about their relative performance are valid. So the only source I know are (micro-)benchmarks. Besides the well known Scimark and linkpack benchmarks there are some interesting benchmarks on the “computer language benchmark game” formerly known as the great language shootout. It has often been criticized for too short duration and including warmup times for JITs. Still I like those benchmarks since they are not classic microbenchmarks, but (almost) every benchmark tries to stress a certain set of language features and returns a well defined output.
To make it short: I decided to select four computational intensive, IO-less benchmark from the shootout. Continue reading “Java vs. C benchmark”