WebAssembly gives us the promise to run high performance code in the browser in a standardized way. Now that there are a few WebAssembly previews available I decided it’s time to take a look at their performance. One source for benchmarks is the well known Computer Language Benchmarks Game and I decided to pick nbody (it’s almost four years ago since I did so last time…).
After playing a bit with the results I decided to put the code on github. I’m looking forward to your corrections, improvements and feedback. I’m already excited what the results will look like in a few months…
The following versions were compared:
- webAssembly: A WebAssembly version compiled from the original c version, because this turned out to be faster than the other version I checked
- floatArrayPerObject: Each body’s data is stored in a typed array
- oneTypedArray: All body’s data is stored in a single typed array and the advance function is programmatically unrolled (quite crazy, isn’t it).
- To get a baseline the fastest java and the original c version were added.
Here are the results:
(click to enlarge)
The fine print
- Emscripten and Binaryen were installed as described on Compile Emscripten from Source. (emcc (Emscripten gcc/clang-like replacement) 1.37.2 (commit 70d370296036cc5046083a3e215cb605c90e004b))
- The c source code was compiled with that command:
emcc nbody.c -O3 -s WASM=1 -s SIDE_MODULE=1 -o nbody.wasm
- The c version was compiled with gcc -O3 nbody.c -o nbody (which is Apple LLVM version 8.0.0 (clang-800.0.42.1))
This version took 4.4 seconds on my machine and was faster than the fastest C version from the shootout, compiled with gcc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 nbody_fastest.c -o nbody_fastest, which took 4.9 seconds on my machine
All tests were performed on a 2015 MacBook Pro, 2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3. For all tests the best of three runs was selected for the result.
A single benchmark doesn’t prove anything, so why not add another well known benchmark. I took NBody (as I always did on this blog ;-)).
The results for NBody confirmed those results. For C I took the fastest plain C implementation from the Computer Language Benchmarks Game. Once again the y-axis shows the duration (this time in seconds).
- Slower and faster usually cause headaches in benchmarks (There a nice paper about that http://hal.inria.fr/docs/00/73/92/37/PDF/percentfaster-techreport.pdf). I sticked with the elapsed time, such that e.g. 42% slower means that the factor of the durations was 1.42.
- On the MacBook Pro C was compiled with clang using -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 for x64. Java was Oracle Hotspot 1.8.0-ea-b87 on 64 bit (thus C2 aka Server Hotspot). Chrome was 28.0.1493.0, but the 32 bit version. I tried to compile V8 myself, but both the x86 and x64 custom built V8 were significantly slower than Chrome so I stick with Chrome.
- On the iPhone I used a release configuration using clang with (among others) -O3 -arch armv7
- The Google Nexus 7 runs Android 4.2.2, Chrome 26.0.1410.58. C was compiled with -march=armv6 -marm -mfloat-abi=softfp -mfpu=vfp -O3.
About half a year ago I published my first results for a C vs. JVMs benchmark. Some version updates appeared since then and so I thought it’s time for another run.
Some words about the benchmarks
Four of the five benchmarks stem from benchmarks found on the The Computer Language Benchmarks Game. They have been modified to reveal the peak performance of the virtual machines, which means that each benchmark basically runs 10 times in a single process. The first run might be negatively influenced by the JIT compiler and isn’t counted and only the remaining 9 times are used to compute the average duration.
The fifth benchmark is called “himeno benchmark” and was ported from C code to java. Himeno runs long enough such that the warm up phase doesn’t matter much.
Compilers and JVMs used in this comparison
- GCC 4.2.3 was taken as a performance baseline. All programs were compiled with the options “-O3 -msse2 -march=native -mfpmath=387 -funroll-loops -fomit-frame-pointer”. Please note that using profile guided optimizations might improve performance further.
- LLVM has just released version 2.3 and is a very interesting project. It can compile c and c++ code using a GCC frontend to bytecode. It comes with a JIT that runs the bytecode on the target platform (aditionally it even offers an ahead of time compiler). It is used for various interesting projects and companies like most noteably apple for an OpenGL JIT and some people seem to work on using LLVM as a JIT compiler for OpenJDK. I used the lli JIT compiler command with the options -mcpu=core2 -mattr=sse42.
- IBM has released it’s JDK 6 with it’s usual incredible bad marketing (of course there’s no windows version yet). I’ve used JDK 6 SR1, but I couldn’t find a readable list of what changes it includes. The older IBM JDK 5 is also included to see if it was worth the wait.
- Excelsior has released a new version of it’s ahead of time compiler JET. Both version 6.0 and 6.4 have been benchmarked. JET is particularly interesting because it combines fast startup time and high peak performance and is therefore just what you’d expect from a good desktop application compiler.
- Apache Harmony is also included in the benchmark. It’s aimed to become a full APL licenced java runtime and used for the google android platform. Recently the Apache Harmony Milestone 6 was released. I’ve taken a look at this version’s performance in this blog.
- BEA JRockit is no longer included due to a great uncertainness about it’s current and future availability. A short period after Oracle bought BEA all download links were removed from the web page. A few days ago it was announced that JRockit will no longer be available as a standalone download.
- SUN’s JDK 6 Update 2 and Update 6 were put to the test with the hotspot server compiler (i.e. -server option).
- All benchmarks were measured on my Dell Insprion 9400 notebook with 2GB of RAM and a intel Core 2 running at 2GHz under Ubuntu 8.04 (x86).
In one of the comments regarding my Java vs. C benchmark Dmitry Leskov suggested including Excelsior JET. JET has an ahead of time compiler and is known to greatly reduce startup time for java applications. I’ve kept an eye on JET since version 4 or so and while the startup time has always been excellent the peek performance of the hotspot server compiler was better. With JET 5 performance for e.g. scimark has improved greatly so I decided to rerun the benchmark for JET 5 and JET 6 beta 2. JET 6 beta 2 is currently available on windows only and thus the tests were run under Windows Vista, JET 5 (and all other VMs) ran under Ubuntu. I also benchmarked JET 5 on Windows to check if there’s a large OS-related difference, but the results were within 2.4% (still a t-Test showed a significant difference). As a simplification I decided to publish only the Ubuntu JET 5 results. Nevertheless I’ll update the results when beta 3 becomes available for linux.
Another interesting VM is Apache Harmony. It is designed to be a complete open source JDK and it received a lot of attention when it started (and it became pretty quite nowadays). It started before Sun decided to open their JDK under the GPL, so if nothing else harmony was in my opinion one of the reasons that we have Sun’s openjdk project now. Harmony’s VM is based on a intel donation so that alone makes benchmarking interesting. Of course Harmony is still in the early stages and it would be almost a miracle if Harmony 1.0 could beat the performance of Sun’s JDK.
The third VM is also an ahead of time compiler. GCJ is a java frontend for the GCC and thus might produce code roughly identical to GCC. There isn’t too much publicity for GCJ despite its effort to become a really usable JVM. Combined with the GIJ interpreter and the gcj-dbtool GCJ is able to compile even complex applications like eclipse. GCJ uses classpath as its underlying implementation of the JDK classes which means some parts of the JDK are still missing. I decided to use the Ubuntu gcc 4.3 snapshot as it turned out to work best on my PC. Continue reading
Java’s performance is perceived rather differently depending on who you ask ranging from Java-is-faster-than-C to “java is 10x slower”.
Without actually running some benchmarks it’s hard to tell who is actually right and of course every benchmark will show different results and both sides have good arguments. I don’t know of any real world applications that has been ported from C to Java in such a way that statements about their relative performance are valid. So the only source I know are (micro-)benchmarks. Besides the well known Scimark and linkpack benchmarks there are some interesting benchmarks on the “computer language benchmark game” formerly known as the great language shootout. It has often been criticized for too short duration and including warmup times for JITs. Still I like those benchmarks since they are not classic microbenchmarks, but (almost) every benchmark tries to stress a certain set of language features and returns a well defined output.
To make it short: I decided to select four computational intensive, IO-less benchmark from the shootout. Continue reading