Research question


First of all, I want to share my master's thesis research questions with you. There are two of them, but this first article focuses on the first one, which concerns performance.

RQ: Are there significant differences in execution performance between various implementations of the same benchmark in Kotlin and Java running on the Java Runtime Environment?

The Computer Language Benchmark Game

The dynamic metrics comparison is based on the idea popularized by one of the most popular cross-language benchmark suites: The Computer Language Benchmark Game (CLBG).

The project was introduced by Doug Bagley in 2000 as The Great Computer Language Shootout, with the goal of comparing all major languages. It has since developed into The Computer Language Benchmark Game, the most popular cross-language benchmark used in research. The project keeps growing with new problem benchmarks and language implementations, and it is systematically updated by its creator to follow programming market trends (by adding new languages, removing ones that are no longer used, and updating the list of benchmark algorithms).

The goal behind the CLBG benchmark is to answer the following question, asked by a 4chan user:

My question is if anyone here has any experience with simplistic benchmarking and could tell me which things to test for in order to get a simple idea of each language’s general performance?

To answer that question, CLBG presents a set of 10 different algorithmic problems, all of which are described on the official web page. Each problem comes with strict rules on how it must be implemented and what the algorithm may use to achieve a correct, equivalent result. Following that specification, the problems are implemented in the specific language to be measured.

To enable an objective comparison of the results, the CLBG benchmark always uses fixed scripts that implement the metrics for all of the experiments. The collected measurements are therefore independent of how the algorithms are implemented in the individual languages.

Benchmarks selection


During the development of the experiment, The Computer Language Benchmark Game consisted of 10 benchmark programs (sidenote: the number of benchmarks in CLBG changes over time). Each of them poses a different kind of problem which exercises different language paradigms, language features, and methodologies in general.

(I am not going to describe every benchmark in CLBG; if you are interested, I encourage you to read the information on the webpage.)

The main idea behind this experiment was to compare Java and Kotlin. To achieve that, the experiment used the base implementation of the programs in Java (taken from the CLBG benchmarks repository) and two implementations in Kotlin: converted and idiomatic (described in the next sections).

With these assumptions, benchmarks used in the experiment were selected based on two factors:

the best Java implementation taken from the CLBG repository has to be convertible to Kotlin

the programs must manipulate data that is as diverse as possible

The authors of the paper “JVM-Hosted Languages: They Talk the Talk, but do they Walk the Walk?” proposed a method of characterizing the CLBG corpus of programs by indicating whether a program mostly manipulates integers, floating-point numbers, pointers, or strings. This information helps us divide the benchmarks into groups.

Taking everything into account, only 6 of the CLBG benchmarks were used in the final experiment. Four of the ten benchmarks were rejected to stay consistent with the assumption that the Java code has to be convertible to Kotlin without large changes.

int - integer

fp - floating point

ptr - pointer

str - string

Table 1: Selected benchmarks with information about most manipulated data

Remarks

after benchmark selection (as depicted in Table 1), the final benchmark suite does not contain any programs which mostly manipulate string resources

Implementations


There are three implementations for every benchmark in this experiment.

Java

Kotlin-converted

Kotlin-idiomatic

All of these implementations were used in the experiments: compiled, executed, and measured by external Python scripts. None of the implementations contains any measurement code or irrelevant parts that could interfere with the final results.

Kotlin was split into two versions because I wanted to check the differences between multiple Kotlin implementations. I assumed that Kotlin-converted would have performance and bytecode similar to the Java version. The Kotlin-idiomatic implementation, on the other hand, was used to obtain benchmark results for code that is more likely to be produced by an experienced Kotlin programmer.

If you are interested in more implementation details, check out the Java vs Kotlin comparison repository.

Java implementation

All Java code was taken directly from the most recent benchmark implementations available in The Computer Language Benchmark Game repository, without any changes.



In the benchmark repository, there were multiple versions of the Java benchmarks. The ones used in the experiment are those that achieved the best results according to the leaderboard available on the CLBG webpage.

Kotlin-converted implementation

This implementation was created using the Java-to-Kotlin converter provided by IntelliJ IDEA. Multiple changes had to be made to the raw converted versions to bring the code to an executable state; most of them were necessary just to make the code compile and run.

One change was introduced into all Kotlin-converted implementations: to make the code compilable from the command-line interface, the main() method was extracted out of the original class to the top level of the file.
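To illustrate, here is a hypothetical sketch (not one of the actual benchmarks) of what that extraction looks like. The converter tends to keep main() inside the class, as Java does; moving it to the top level lets kotlinc compile and run the file directly.

```kotlin
// Hypothetical sketch of the change applied to the Kotlin-converted benchmarks.
// The converter typically leaves main() inside the class (e.g. in a companion
// object with @JvmStatic); here it has been extracted to the top level instead.

class Benchmark(private val n: Int) {
    // Placeholder workload standing in for a real benchmark body.
    fun run(): Long = (1..n.toLong()).sum()
}

// Top-level entry point, extracted outside the original class:
fun main() {
    println(Benchmark(10).run())
}
```

With a top-level main(), the file can be compiled and launched from the command line without worrying about which generated class holds the entry point.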

Kotlin-idiomatic implementation

Changes introduced into the Kotlin-idiomatic implementations are mostly based on:

Idioms — which lists frequently used Kotlin idioms. The site is part of the official Kotlin documentation

Coding Conventions — which contains the current recommended coding style for the Kotlin language. The page is also a part of Kotlin documentation

IDEA default code inspections

The Kotlin-idiomatic implementations are based on the converted versions.
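As a hypothetical illustration of the kind of rewrite this produces (not code from the actual benchmarks): converter output often preserves Java-style indexed loops and mutable accumulators, which the idiomatic pass replaces with standard-library functions recommended by the coding conventions and IDEA inspections.

```kotlin
// Converter-style version: Java idioms (index loop, mutable accumulator) preserved.
fun sumOfSquaresConverted(values: IntArray): Int {
    var sum = 0
    for (i in values.indices) {
        sum += values[i] * values[i]
    }
    return sum
}

// Idiomatic rewrite: expression body and a standard-library aggregate.
fun sumOfSquaresIdiomatic(values: IntArray): Int =
    values.sumOf { it * it }
```

Both functions compute the same result; whether such rewrites change performance is exactly what the Kotlin-idiomatic variant is meant to probe.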

Remarks

five out of six benchmarks work in parallel using Threads (in both the Java and Kotlin implementations)

none of the Kotlin implementations uses coroutines, whose use could significantly affect the performance results
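The thread-based pattern in question can be sketched as follows (a minimal, hypothetical example, not taken from the benchmarks): plain threads are spawned over independent chunks of work and joined, with no coroutine machinery involved.

```kotlin
import kotlin.concurrent.thread

// Minimal sketch of thread-based parallelism (no coroutines), in the spirit
// of the benchmark implementations; the per-chunk workload is a placeholder.
fun parallelSum(chunks: List<IntArray>): Long {
    val partial = LongArray(chunks.size)
    val workers = chunks.mapIndexed { i, chunk ->
        thread { partial[i] = chunk.sumOf { it.toLong() } }  // one thread per chunk
    }
    workers.forEach { it.join() }  // wait for every worker before combining
    return partial.sum()
}
```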

Language versions and hardware


Obligatory information about software and hardware environment used for benchmark execution.



The experiments were executed on a hardware environment whose details are presented in Table 2. The choice of Ubuntu Linux was dictated by the fact that it is the recommended OS for the CLBG metrics measurement scripts.

Table 2: Hardware environment

Table 3 presents the versions of Java and Kotlin used for executing the benchmark implementations. Both were the newest available versions of the languages at the time.

Table 3: Language versions

Remarks

all of the benchmark executions are done on Oracle HotSpot VM

Dynamic metrics


So yeah, what did I really measure in that dynamic metrics comparison?

I decided to compare the languages using the metrics most often considered the most important by programmers:

Execution time

Memory usage

CPU load

Every program was executed and measured 500 times.

Benchmark metrics are also based on those used in The Computer Language Benchmark Game. All of the programs were executed and measured using dedicated CLBG scripts.

Multiple measurement methods were experimented with during development of the Kotlin and Java benchmark suite, and they are still available in the repository. Initially, time was measured from Java/Kotlin code with methods like currentTimeMillis() or nanoTime() from the System class, but the results had a large variance, and this measurement method was eventually abandoned. Measuring other load metrics like CPU and memory also turned out to be far from straightforward and affected by various environment-dependent factors (a deep dive into benchmark measurement techniques is a subject for another long article!).
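For reference, the abandoned in-process approach looked roughly like this (a sketch, not the exact code from the repository): the workload is wrapped in System.nanoTime() calls inside the benchmark itself, which captures only the wrapped region and none of the JVM startup cost.

```kotlin
// Sketch of the abandoned in-process timing: wrap a block in nanoTime() calls.
// This measures only the wrapped region (no JVM startup, no warm-up control),
// which contributed to the high variance described above.
fun <T> timed(block: () -> T): Pair<T, Long> {
    val start = System.nanoTime()
    val result = block()
    val elapsedNanos = System.nanoTime() - start
    return result to elapsedNanos
}
```

A call such as `val (value, nanos) = timed { (1..1_000_000).sum() }` returns the workload's result together with the elapsed time in nanoseconds.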

All of that work led to the decision that applying the official CLBG measurement scripts would be the most objective method in this case. It also helps put all of the Kotlin measurement conclusions in the context of the results that CLBG presents for other languages.

Details of how each parameter is measured by the Python scripts are available on the CLBG measurements page.

Results

The full list of measurement results for each benchmark is available in the project repository.

Execution time

The figure below presents execution-time box plots for each benchmark and each implementation. The letters in brackets indicate the most-manipulated data type.