Abstract Computation has become a critical component of research in biology. A risk has emerged that computational and programming challenges may limit research scope, depth, and quality. We review various solutions to common computational efficiency problems in ecological and evolutionary research. Our review pulls together material that is currently scattered across many sources and emphasizes those techniques that are especially effective for typical ecological and environmental problems. We demonstrate how straightforward it can be to write efficient code and implement techniques such as profiling or parallel computing. We supply a newly developed R package (aprof) that helps to identify computational bottlenecks in R code and determine whether optimization can be effective. Our review is complemented by a practical set of examples and detailed Supporting Information material (S1–S3 Texts) that demonstrate large improvements in computational speed (ranging from 10.5 times to 14,000 times faster). By improving computational efficiency, biologists can feasibly solve more complex tasks, ask more ambitious questions, and include more sophisticated analyses in their research.

Citation: Visser MD, McMahon SM, Merow C, Dixon PM, Record S, Jongejans E (2015) Speeding Up Ecological and Evolutionary Computations in R; Essentials of High Performance Computing for Biologists. PLoS Comput Biol 11(3): e1004140. https://doi.org/10.1371/journal.pcbi.1004140

Editor: Francis Ouellette, Ontario Institute for Cancer Research, CANADA

Published: March 26, 2015

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: We thank the Evolutionary Biodemography Laboratory and the Modelling the Evolution of Ageing Independent Group of the Max Planck Society for Demographic Research (http://www.mpg.de) in Rostock (Germany) and Odense (Denmark) for supporting the working group where this paper was initiated. This study was supported by the Netherlands Foundation for Scientific Research (www.nwo.nl, NWO-ALW 801-01-009 to MDV & EJ; NWO-ALW 840.11.001 to EJ), the Smithsonian Tropical Research Institute (www.stri.si.edu, MDV) and the USA National Science Foundation (www.nsf.gov, NSF 640261 to SMM). The funders had no role in the preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

This is part of the PLOS Computational Biology Education collection.

Introduction Emerging fields such as ecoinformatics and computational ecology [1,2] bear witness to the fact that biology is becoming more quantitative and interdisciplinary. Such research often requires intensive computing, which may be limited by inefficient code that confines the size of a simulation model or restricts the scope of data analysis. It is therefore increasingly necessary for biologists to become versed in efficient programming [3], as well as in mathematics and statistics [4]. Computer scientists have developed many optimization methods (e.g., [5]); however, the efficient translation of mathematical models to computer code has received very little attention in biology [2]. Here we present an overview of techniques to improve computational efficiency in a wide variety of settings. Much of the information we present is currently scattered throughout various textbooks, articles, and online sources, and our goal here is to provide a convenient summary for biologists interested in improving the efficiency of their computational methods. In short, we 1) highlight the processes that slow down computation; 2) introduce techniques, supported by an R package, that help to decide whether and where optimization is needed; 3) give a step-by-step guide to implementing various basic but powerful optimization techniques; and 4) demonstrate the speed gains that can be achieved. We supplement this with more background information and detailed examples in the Supporting Information (S1–S3 Texts). The widespread adoption of the R programming language in the biological sciences has motivated a focus on techniques that are directly applicable to R, although many principles hold for other platforms.

Focus Many analyses in biology are computationally demanding. Examples include large matrix operations [6], optimizing likelihood functions with complex functional forms [7], many applications of bootstrapping or other randomization-based inference, network analysis [8], and Markov Chain Monte Carlo fits of hierarchical Bayesian models (e.g., [9]). Here, we focus on common issues with large databases and stochastic simulation models, applying general approaches for optimizing code to two simple examples:

1) Bootstrapping mean values 10,000 times in a moderately large dataset of 750 million records. This example is highly suited for parallel computation and employs common data protocols: indexing and grouping, resampling and calculating means, and formatting and saving output (Fig. 1A–C).

2) A simple stochastic two-species Lotka-Volterra competition model, which involves basic mathematical operations, random sampling from statistical distributions, and saving fairly large simulation results. Because change depends on the state of the population in the previous time step (a Markov process), a single run cannot be conducted in parallel (Fig. 1D–F); a minimal sketch of such a model is given below.
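To make the second example concrete, the following sketch shows a discrete-time stochastic Lotka-Volterra competition loop. The functional form, parameter values, and object names are illustrative only; the model actually benchmarked is given in S1 Text. The point to note is that each time step requires the abundances from the previous step, so a single simulation must run sequentially:

T.steps <- 1000                                  # number of time steps
N <- matrix(NA_real_, nrow = T.steps, ncol = 2)  # pre-allocated results matrix
N[1, ] <- c(10, 10)                              # initial abundances of the two species
K <- c(100, 100)                                 # carrying capacities
for (t in 1:(T.steps - 1)) {
  r <- rnorm(2, mean = 0.5, sd = 0.1)  # stochastic intrinsic growth rates, r ~ Norm(rm, rs)
  a <- rnorm(2, mean = 0.6, sd = 0.1)  # stochastic competition coefficients, a ~ Norm(am, as)
  N[t + 1, 1] <- N[t, 1] * exp(r[1] * (1 - (N[t, 1] + a[1] * N[t, 2]) / K[1]))
  N[t + 1, 2] <- N[t, 2] * exp(r[2] * (1 - (N[t, 2] + a[2] * N[t, 1]) / K[2]))
}

Because every iteration reads N[t, ], the iterations of one run cannot be distributed across cores, although many independent runs can be.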


Fig 1. Visualization of profiling output using the aprof package for R code, where the amount of time spent in each line of code is indicated by the blue bars. In A, B, and C, a bootstrap algorithm is shown, and in D, E, and F, a stochastic Lotka-Volterra competition model is shown. The consecutive optimizations described in the text are indicated by the red lines in B, C, E, and F, which mark the altered pieces of code. (A) An inefficiently coded bootstrap algorithm, with most time spent in lines 7–8. This algorithm shuffles the values of a large matrix (750,000 x 1000) stored in object "d", and then calculates columnwise the difference between the mean column values and the overall mean. (B) A slightly improved version in which the overall mean calculation is stored in object "avg". (C) A further improved version of the code in which column means are calculated by a specialized and vectorized function (colMeans). (D) A slow-running stochastic Lotka-Volterra model of species coexistence that runs a simulation over T years, where species have normally distributed intrinsic growth rates (r ∼ Norm(rm,rs)) and competition coefficients (a ∼ Norm(am,as)). (E) The Lotka-Volterra model is more efficient when the pre-allocation-and-fill method is applied. (F) Switching to a matrix to store results further decreases run time. A detailed description of each optimization step with profiling analysis is given online (S1 Text, sections 2 and 6). https://doi.org/10.1371/journal.pcbi.1004140.g001
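In condensed form, the optimization steps in panels A–C amount to moving an invariant calculation out of the repeated work and replacing a general-purpose function with a specialized, vectorized one. The lines below are a paraphrase for illustration, not the exact benchmark code (which is in S1 Text); "d" is assumed to be the large numeric matrix:

# (A) Naive: the overall mean of "d" is recomputed for every column
res <- apply(d, 2, function(x) mean(x) - mean(d))
# (B) Improved: compute the overall mean once and store it in "avg"
avg <- mean(d)
res <- apply(d, 2, function(x) mean(x) - avg)
# (C) Further improved: use the specialized, vectorized colMeans()
res <- colMeans(d) - avg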


Fig 2. Execution time, in minutes, required to complete various computational problems using the optimization techniques discussed. Panels A and B show execution time as a function of problem size: the time required for 10,000 bootstrap resamples conducted on datasets of varying size (A), and the time required to run a stochastic population model for an increasing number of time steps (B). "Naive" R code, in which no optimizations are applied, uses the most computing resources (solid lines in A and B). Optimized R code, with use of efficient functions and optimal data structures pre-allocated in memory (dashed lines in A and B), is faster. In both panels A and B, the largest speed-ups are obtained by using optimal R code (black lines). Subsequent use of parallelism yields further improvement (dot-dashed green line) in A. In panel B, using R's byte compiler reduced execution time beyond that of optimal R code (green dotted lines), while the smallest execution times were achieved by refactoring code in C (red dot-dashed lines). Panels C and D give the computing time (in minutes) needed to conduct the calculations from (C) Merow et al. [10] and (D) the calculations represented by Fig. 3 in Visser et al. [11]. Bars in panel C represent the original unaltered code from [10] (I), the unaltered code run in parallel (II), the revised R code in which we replaced a single data.frame with a matrix (III), and the revised code run in parallel (IV). Bars in panel D represent the original unaltered code [11] (I), the original code run in parallel (II), optimized R code (III), optimized R code using R's byte compiler (IV), optimized R code run in parallel (V), optimized R code using the byte compiler run in parallel (VI), code with key components refactored in C (VII), and parallel execution of the refactored code (VIII). All parallel computations were run on 4 cores, and code is provided in S1 Text, section 3, and S2 and S3 Texts. https://doi.org/10.1371/journal.pcbi.1004140.g002

The optimization of these examples can be followed in detail in S1 Text. In all cases, we obtained speed-ups of 10.5 to 14,000 times, with benefits that increase with the amount of computation (Fig. 2). Finally, we show the relevance of these techniques when applied to two previously published problems concerning spatial models [10] and the analysis of fitness landscapes [11] (documented in S2 and S3 Texts).
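As an illustration of the parallel speed-ups reported in Fig. 2, a bootstrap of this kind can be distributed over cores with R's base parallel package. The sketch below is not the benchmark code from the Supporting Information; the matrix "d" and the helper boot_once() are hypothetical, and mclapply() relies on forking, which is unavailable on Windows (parLapply() is the portable alternative):

library(parallel)
avg <- mean(d)                              # compute the invariant overall mean once
boot_once <- function(i) {
  idx <- sample(nrow(d), replace = TRUE)    # resample rows with replacement
  colMeans(d[idx, ]) - avg                  # columnwise deviation from the overall mean
}
res <- mclapply(seq_len(10000), boot_once, mc.cores = 4)  # 4 cores, as in Fig. 2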

When (Not) to Optimize? One should consider optimization only after the code works correctly and produces trustworthy results [12]. Correct code should be the primary goal in any analysis. Before optimizing, it is important to recall a warning well known among programmers: “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” [13]. Optimized code may be faster, but it tends to lose robustness and generality, be more complex and less accessible, introduce new bugs, and have limited portability and maintainability. Loosely written code in a high-level language may be slow, but it will be faster to develop and easier to prototype. Accordingly, it is sensible to prioritize robust, general, and simple code above “fast code”: robust and general programs work in multiple situations (S1 Text: examples 2.14 and 2.15), are reusable, and hence save development time, while clear, simple code saves time when revisiting old code (or when sharing it with peers). Slower code can thus lead to lower total project time when it is more generally applicable, or when the additional development and debugging time required for optimization exceeds what is saved in run time. Therefore, before attempting to optimize code, one should first determine whether doing so will be worthwhile.

What to Optimize? Amdahl’s law (Fig. 3) [14] provides insight into the value of making a specific section of code more efficient: unless this code section uses a very large fraction of the overall execution time, the reduction in run time for the whole program may be modest. For example, consider code that requires 120 minutes to run, of which one section can be sped up by a factor of 2. If that section consumes 95% of the original run time, optimization will reduce the total run time to 63 minutes. If that section consumes only 50% of the original run time, the total run time will only improve to 90 minutes (Fig. 3A). Amdahl’s law also shows that increased effort in optimization has diminishing returns (Fig. 3B).
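In the notation of Fig. 3, if a section taking up a fraction α of the original run time is made I times faster, the new total run time is the original time multiplied by (1 − α) + α/I. A few lines of R reproduce the worked example above (the function name is ours, for illustration only):

# Predicted total run time under Amdahl's law: a fraction 'alpha' of the
# original run time 'total' is sped up by a factor 'I'.
amdahl_time <- function(total, alpha, I) total * ((1 - alpha) + alpha / I)
amdahl_time(120, 0.95, 2)  # 63 minutes
amdahl_time(120, 0.50, 2)  # 90 minutes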


Fig 3. Projected improvements in total program run time using Amdahl's law. (A) Realized total speed-up when a section of code, taking up a fraction α of the total run time, is improved by a factor I (i.e., the expected program speed-up when the focal section runs I times faster). We see that optimization is only effective when the focal section of code consumes a large fraction of the total run time (α). (B) Total expected speed-up for different levels of α as a function of I (e.g., the number of parallel computations). Theoretical limits exist to the maximal improvement in speed, and these depend crucially and asymptotically on α; thus, both code optimization and investment in computing hardware are subject to the law of diminishing returns. All predictions here are subject to the scaling of the problem (S1 Text, section 2). https://doi.org/10.1371/journal.pcbi.1004140.g003

Empirical studies in computer science show that small sections of code often consume large amounts of the total run time [15]. Identifying these code sections allows effective and targeted optimization. “Code profilers” are software engineering tools that measure the performance of different parts of a program as it executes [16]. When dealing with large data sets or large matrices, where memory storage is limiting, memory profilers (e.g., Rprofmem) provide statistics with which to gauge memory efficiency. We illustrate the value of profiling in S1 Text, sections 1–3, using R’s profiler (Rprof) and a newly developed R package (aprof: “Amdahl's profiler”). This package helps to rapidly (and visually) identify code bottlenecks and potential optimization gains (as illustrated in Fig. 1).
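A typical line-profiling workflow looks like the sketch below; the file names are hypothetical, the worked examples are in S1 Text, sections 1–3, and the aprof help pages should be consulted for argument details:

Rprof("profile.out", line.profiling = TRUE)  # start R's sampling profiler with line information
source("slow_script.R")                      # run the code to be profiled
Rprof(NULL)                                  # stop the profiler

library(aprof)
slow.aprof <- aprof("slow_script.R", "profile.out")  # link profiler output to source lines
plot(slow.aprof)     # per-line time bars, as shown in Fig. 1
summary(slow.aprof)  # projected speed-ups from optimizing the slowest lines (Amdahl's law)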

Recommendations Optimizing code can provide efficiency gains of orders of magnitude, as our benchmark results show (e.g., Fig. 2). However, we do not recommend optimizing immediately: one will inevitably sacrifice clarity, generality, and robustness for speed. At the start of a project, the most productive approach (e.g., [3]) is often to write code in the highest-level language possible, ensuring first that the program runs correctly. High-level languages enable rapid decision-making and prototyping, and correct code enables checking of more optimized versions. When a performance boost is deemed worthwhile, for example through profiling, optimize only those parts identified as bottlenecks, to avoid sacrificing development time in favour of optimization [3,12]. The primary route for optimization should be efficient R code, which, as we show in Fig. 2, yields the largest gains for the least effort. The fastest-running code examples shown here are the instances where we called compiled code from R (Fig. 2, S1 Text: section 6). This technique is especially powerful when one can use the vast libraries of algorithms that already exist in C (and Fortran), which are often optimized and efficiently coded [12]. However, a programming language like C has a steeper learning curve, and when learning C requires too much time, we encourage biologists to collaborate with computer scientists in their research or to include contracts for computational consultation in grant budgets [3].
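As a simple illustration of why efficient R code should be the first step, compare an explicit loop, a byte-compiled version of the same loop, and the equivalent vectorised expression. This is an illustrative toy task, not one of our benchmarks; absolute timings differ between machines and R versions, and recent R versions byte-compile functions automatically, which narrows the gap for the compiled loop:

library(compiler)
x <- runif(1e6)
loop_sumsq   <- function(v) { s <- 0; for (vi in v) s <- s + vi^2; s }  # explicit loop
loop_sumsq_c <- cmpfun(loop_sumsq)                                      # byte-compiled loop
system.time(loop_sumsq(x))    # slowest
system.time(loop_sumsq_c(x))  # typically faster
system.time(sum(x^2))         # vectorised R: fastest pure-R option here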

Conclusion Learning how to program and to use computational resources efficiently is not merely a convenience; computing has become fundamental to the practice of science (e.g., [1–3,27]). In biology, research is striving toward ever more accurate projections to inform public leaders on nature management and to predict how ecosystems respond to change (e.g., [28–30]). More often than not, such accurate predictions will require high levels of detail, as natural systems are variable and include intricate levels of biotic and abiotic interactions (e.g., [31,32]). With these challenges ahead, the use of computationally intensive analyses in the biological sciences should not be constrained by programming practices.

Acknowledgments We thank Tyler Rinker for valuable comments on the examples in the Supporting Text files.