In this paper, we argue that undergraduate ecology courses should build familiarity with R for data management and analysis. We support this argument by quantifying the increased use of R for statistical analysis in journal articles over the last ten years and the growth in job advertisements that list R as a qualification. We then examine our own different approaches to teaching students how to use R in conjunction with RStudio for basic data management and for analyzing ecological datasets. We identify advantages common to both approaches, the strengths and weaknesses of each, and the lessons we learned from them. Finally, based on our experiences in those courses and what we have learned since, we make recommendations for those wishing to integrate R into their ecology course design. We endeavor to begin an important conversation and to lay the framework for improving pedagogy in teaching R to ecology undergraduates.

We have both recently begun to use the R programming environment in our teaching. R is an open‐source scripted programming language and environment used for statistical and graphical analysis of datasets, with additional packages readily available for extending its functionality (R Development Core Team 2018). Like other scripted programming languages, R allows researchers to keep a record of data management and analysis, which allows for reproducibility as one can return to the script at a later date and re‐execute it (Borer et al. 2009). R is free and flexible and is increasingly used in the field of ecology. Due to the rise in the use of more complicated statistics in ecology (e.g., mixed‐effects models and Bayesian statistics), use of R has increased as well (Touchon and McCoy 2016). In our own scholarly data analyses, we have each been using R and RStudio (RStudio Team 2016); the latter is an open‐source and free integrated development environment (IDE) for use with R (henceforth, we use “R” to mean R within the RStudio IDE). The rise in the use of R in ecology, coupled with increasing access to large and complex datasets (Strasser and Hampton 2012) and a push for reproducibility in research (Hampton et al. 2015, Klug et al. 2017), means that we should be exposing undergraduate students to these modern approaches. While we suspect R is being taught in many programs, there are still few publications that directly address the training of undergraduate ecology students in R.

We have each taught basic data organization using Microsoft Excel and have used Excel or other proprietary software for teaching data analysis, as have others (Cass and Ismay 2018). Because many high school students have experience with Excel, it is easier to adopt this software in the laboratory setting due to the relatively shallow learning curve (Cass and Ismay 2018). While it is important that students gain familiarity with spreadsheets, our own experience, as well as the literature, has shown that Excel is insufficient for teaching data management and analysis (Nash 2008, Cass and Ismay 2018). Excel does not allow for clear annotation and commenting, which is critical for reproducibility (Toelch and Ostwald 2018). Because of the point‐and‐click nature of formatting and visualizing data and performing statistical operations in Excel, there is no record of the steps performed in data analysis (Nash 2008). From a pedagogical standpoint, use of Excel can lead to headaches in grading student work and in determining where students need help, because each student may use a different approach and the lack of code or comments does not separate thought processes from procedural steps. While there are resources available for applying data management skills to using spreadsheets (Data Carpentry 2019), the functionality of Excel and other proprietary GUI‐based software is limited when applied to ecological problems (Bialek and Botstein 2004, Baker 2017).

Thorough training in ecology requires instruction in data management and analysis (Borer et al. 2009, Kloser et al. 2013, Stevenson et al. 2014, Klug et al. 2017). For the purposes of this paper, data management skills include data access (via nonproprietary formats and hardware), data organization, and quality control (e.g., looking for data entry errors); analysis skills include exploring data, choosing and applying appropriate statistical tests, converting data to graphical representations, and producing reproducible results (Borer et al. 2009, Wilson et al. 2014, Bravo et al. 2016). Often, undergraduate ecology students are exposed to data analysis in their coursework; however, proper data management is not often taught in undergraduate ecology courses, due primarily to lack of available course time (Strasser and Hampton 2012). Furthermore, surveyed ecology instructors gave additional reasons for omitting data management topics, including lack of preparation among students or instructors, large class sizes, lack of funding or resources, and the expectation that managing datasets is covered in other courses or in laboratory sections (Strasser and Hampton 2012). Learning new software for managing data can also present a steep learning curve for both instructors and students (Bialek and Botstein 2004, Baker 2017). Often during instruction, data are presented as ready for analysis and the important steps of data verification and exploratory analysis (e.g., summary statistics and examining distributions) are passed over.

R has a fairly steep learning curve in comparison with commonly used programs like Excel; students with no background in programming may find R more difficult to learn than they would a graphical user interface‐based software application (Baker 2017). Given this learning curve and the growing use of R in the field, it is imperative that undergraduate programs in biology and ecology begin teaching R to adequately prepare students for the next stages in their careers.

Our analyses of both ecology journals and job advertisements show an increase over the past ten years in the use of R in ecological data analysis and, consequently, an increased expectation that ecologists at varying stages of education and training have experience working with R. Thus, familiarity with R has become more commonly advertised as a skill required for post‐baccalaureate employment as well as entry into some graduate programs in ecology. The rise in the use of R for data management and analysis by ecologists can be partially explained by the emergence of R and RStudio as free open‐source options. The first public release of R came in 2000 (Revolution Analytics 2016), followed by the first release of RStudio in 2011 (RStudio 2019).

To measure the degree to which skills in R programming are expected for post‐undergraduate training and employment, we searched the term “R programming” in the archives of the Ecolog‐L Listserv, an established resource for available jobs in the ecological community since 1992 (Inouye 2018), for the periods 1 December–31 May of the years 2007–2008, 2012–2013, and 2017–2018. We found that within ten years (between the 2007–2008 and the 2017–2018 job seasons), the number of positions advertised on the Ecolog‐L listserv requiring R programming experience increased (Fig. 2). During 2007–2008, the search term “R programming” returned four positions, none of which was aimed at applicants holding only a bachelor's degree. However, in 2017–2018, there were 47 search results for “R programming,” and of these, four were for job opportunities (e.g., research technician) that required a bachelor's degree. There were also three posts for master's program assistantships and one post for an internship, each of which required both a bachelor's degree and knowledge of R. The increase from zero to 17% of positions (8 of 47) geared toward those with a minimum of an undergraduate degree over ten years, in addition to the increased use of R in ecological research, suggests that undergraduates who wish to pursue post‐graduate education and, increasingly, employment in ecology are expected to be able to use R.
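The 17% figure follows directly from the counts reported above. A minimal check in base R is sketched here; the variable names are illustrative and are not part of the original analysis.

```r
# Counts reported in the text for the 2017-2018 job season
total_results          <- 47  # search results for "R programming"
bachelors_jobs         <- 4   # e.g., research technician positions
masters_assistantships <- 3
internships            <- 1

# Positions open to applicants with a bachelor's degree as the minimum
bachelors_level <- bachelors_jobs + masters_assistantships + internships

# Share of all search results, as a rounded percentage
share_pct <- round(100 * bachelors_level / total_results)
print(share_pct)
```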

We examined 56, 54, and 44 Ecology papers (154 total) from 2008, 2013, and 2018, respectively. In 2008, only 50% of published papers indicated the software used for data analysis, and of those, only 14.3% (N = 4) indicated using R (Fig. 1). SAS was the most frequently mentioned software (N = 9) and was mentioned about twice as often as R. In 2013, software used for analysis was indicated in 96% of papers. Of those, 67% (N = 35) used R, whereas SAS was mentioned in only 23% of papers (N = 9; Fig. 1). In 2018, the software used for analysis was indicated in 82% (N = 36) of papers, of which 81% (N = 29) used R. Over time, authors gave more detail about how they used R. Of the 4 papers that used R in 2008, two specified the packages used, whereas in 2013, of the 35 papers that used R, 21 specified the packages used, and by 2018, of the 29 papers that used R, 27 specified the packages used. Across years, the most commonly used R packages were lme4 (N = 11), vegan (N = 9), and nlme (N = 7). A list of all identified packages and their frequency of use is in Table 1. Python, one of the most popular programming languages (TIOBE 2019) and used in data science (Coding Compiler 2019), was mentioned only once in the articles that we examined.
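The percentages above can be recomputed from the reported counts. The sketch below uses base R and assumes two values that the text does not state directly: the numbers of papers reporting software in 2008 (28) and 2013 (52) are back-calculated from the reported percentages. The data frame and column names are illustrative.

```r
# Counts taken from the text; "reported" for 2008 and 2013 is an assumption
# back-calculated from the stated percentages (50% of 56, 96% of 54).
usage <- data.frame(
  year     = c(2008, 2013, 2018),
  papers   = c(56, 54, 44),   # papers examined
  reported = c(28, 52, 36),   # papers naming their analysis software
  used_r   = c(4, 35, 29)     # of those, papers that used R
)

# Percentage of examined papers that named their software
usage$pct_reported <- round(100 * usage$reported / usage$papers)

# Percentage of software-reporting papers that used R
usage$pct_r <- round(100 * usage$used_r / usage$reported, 1)

print(usage)
```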

To determine the degree to which R is used in the analysis of ecological data and how that pattern has changed over time, we collected data from three temporal snapshots at five‐year intervals over the decade 2008–2018. To develop a representative understanding of how R is used for ecological data analysis (and not only for, e.g., constructing figures), we arbitrarily examined the first two issues of Ecology published in 2008, 2013, and 2018. We chose Ecology because it is the flagship journal for the discipline. Within those journal issues, we examined the methods and results sections in each paper that included data analysis and recorded the statistical software used if it was mentioned or apparent from the writing. For papers in which R was used, we also noted any packages that were identified.

Approaches to Teaching R in Our Classrooms

In fall 2017, ELB offered Forest Ecology, and in spring 2018, LAA offered Community Ecology. These were the first offerings for both courses at our primarily undergraduate institution. The courses had several features in common (Table 2). Each explicitly indicated development of R skills as a course component in the syllabus. In Community Ecology, the goal was to analyze and understand community ecology datasets with R. In Forest Ecology, the goal was to introduce data management, exploratory data analysis, and, to a lesser degree, hypothesis testing with R in the treatment of datasets collected by students in the field. In both courses, students were assigned the book Getting Started with R: An Introduction for Biologists (henceforth “GSWR”; Beckerman et al. 2017) as one of the required texts. While there are many introductory R texts (Wickham and Grolemund 2017), some even by biologists (Zuur et al. 2009, Crawley 2015), we chose GSWR for several reasons: Examples in the text use ecological datasets, the book assumes no previous knowledge of R or RStudio, the book introduces the tidyverse (a set of R packages that aid in organizing, manipulating, and plotting data, Wickham et al. 2019) for data examination and manipulation, and the book presents a coherent workflow for exploring a dataset. Importantly, for students, the book is also affordable.

Table 2. Similarities and differences between two upper‐level ecology courses each focused on teaching R as a key component of the course.

| Course characteristic | Forest Ecology | Community Ecology |
| --- | --- | --- |
| Introductory biology sequence (Biology 101 and 102) required | Yes | Yes |
| Biology 221 (Ecology) | Required | Recommended |
| Number of students | 11 | 8 |
| Laboratory | Yes | No |
| Number of 1° articles read/discussed | 8 | 5–6 (for one assignment, there were two groups of students who each read a different but related paper) |
| Presumed statistical knowledge for students entering course | Low; no assumptions made about prior statistical knowledge | Low; assumed students had a basic understanding of t‐tests and ANOVA |
| Number of R assignments | 22 low‐stakes and 2 larger assignments | 8 weekly assignments and one semester‐long analysis project |
| How were assignments submitted? | Via email as R scripts and Word documents with embedded figures, as applicable to each assignment | Via Assignments tool on Sakai (course management software); required R script file, completed handout, and graphs (if applicable to the assignment) |

Both courses were offered for advanced undergraduates, all of whom had completed our two‐semester sequence of introductory biology, most of whom had taken an introductory ecology course, and some of whom had completed an introductory statistics course; there were no pre‐requisites requiring experience with coding or statistics for either course. Finally, both courses were small (Community Ecology N = 8 students, Forest Ecology N = 11 students).

There were important differences between the courses, as well (Table 2). Community Ecology was organized as a twice‐weekly seminar course (with no laboratory) that met for 1.5 h each class period, whereas Forest Ecology met for one 1.5‐h class and one 5.5‐h extended laboratory period each week for fifteen weeks. Approximately two‐thirds of Community Ecology students had previous, though limited, experience with R, whereas no students in Forest Ecology had used R before. In Community Ecology, class time was divided between lecture on community ecology concepts and theories, with some time dedicated to in‐class activities and discussion as well as assessments (weekly quizzes and one mid‐term exam). In Forest Ecology, the 1.5‐h lecture periods were primarily used for lecture on forest ecology and for lessons in R. We used 13.5 of the 15 5.5‐h laboratory sessions for gathering data on forest structure as well as discussing papers from the primary literature, sample processing, and measuring chemical and physical properties of soil. The final two laboratories of the semester were used to learn community ordination with the vegan package (Oksanen et al. 2017) and for preparing students for the final data analysis project (Appendix S1).

Because of the distinct structures of our two courses, our approaches to teaching data management and analysis with R were also different, though complementary. The R assignments for each course are outlined in Table 3. In Community Ecology, students were given one assignment per week for the first half of the semester (for a total of eight assignments), each worth 5 to 15 points and based on a specific task tied to the content of the lecture (example assignments from both Community Ecology and Forest Ecology are in Appendix S1). Each assignment required the student to use RStudio to import and analyze a dataset and to answer questions by applying ecological knowledge. For example, when we covered competition and niches in lecture, the R assignment for that week focused on how to use the spaa package (Zhang 2016) to calculate niche overlap among MacArthur's warblers (MacArthur 1958). As the students gained more experience with R, assignments were designed to encourage them to recall how to do steps that they had done before (e.g., import data from a .csv file) rather than explicitly instruct them each time. Therefore, each student was expected to build on previous knowledge as they progressed through the assignments.
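To illustrate the kind of calculation involved in the niche overlap assignment, the following is a minimal sketch in base R rather than the spaa code used in the course. It assumes Pianka's index as the overlap measure, which is one common choice and may not be the one used in the assignment. The resource‐use proportions and species labels are hypothetical and are not MacArthur's data.

```r
# Hypothetical proportions of foraging time spent in five tree zones.
# Each row is one species and sums to 1.
use <- rbind(
  warbler_a = c(0.50, 0.30, 0.10, 0.05, 0.05),
  warbler_b = c(0.05, 0.10, 0.20, 0.30, 0.35),
  warbler_c = c(0.45, 0.35, 0.10, 0.05, 0.05)
)

# Pianka's index: sum(p * q) / sqrt(sum(p^2) * sum(q^2)).
# It ranges from 0 (no shared resource use) to 1 (identical use).
pianka <- function(p, q) {
  sum(p * q) / sqrt(sum(p^2) * sum(q^2))
}

# Fill a species-by-species overlap matrix
n <- nrow(use)
overlap <- matrix(NA_real_, n, n,
                  dimnames = list(rownames(use), rownames(use)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    overlap[i, j] <- pianka(use[i, ], use[j, ])
  }
}

print(round(overlap, 2))
```

In this made‐up example, the two species that forage in similar zones (warbler_a and warbler_c) have a higher overlap value than the pair that forages in different zones.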

Table 3. Week‐by‐week map of R skill development activities in each course. GSWR = Getting Started with R.

| Class week | Community Ecology | Forest Ecology |
| --- | --- | --- |
| 1 | Installing and Introduction to R and RStudio | GSWR preface through Chapter 1 |
| 2 | Modeling Competition in R: (1) Learn to plot data using R; (2) Introduce basic R script for interspecific competition models and give an opportunity to test different scenarios | Installing and Introduction to R and RStudio |
| 3 | Calculating niche overlap using R | GSWR Chapter 2: create .csv files, long vs. wide data formats; import data; install and run packages; introduction to dplyr, ggplot2 packages |
| 4 | Building R Skills with Predator Data: tidy, plot, and analyze some fish stomach content analysis data | GSWR Chapter 3 week 1: examine a dataset, change column names, basic filtering and grouping of data, basic data summary |
| 5 | No assignment (mid‐winter break) | GSWR Chapter 3 week 2: dplyr and data extraction & summary, factor variables |
| 6 | Building R Skills with Plant Cover Data: Build skills in R to prepare for final project. Examine datasets of plants impacted by volcanic eruptions on Mt. St. Helens (del Moral 2010). There are 3 possible analyses; pick two to complete this assignment. Analysis 1: Basic mapping. Analysis 2: Regression. Analysis 3: ANOVA and post hoc analysis | GSWR Chapter 4: basics of ggplot2, box plots, scatter plots, faceted plots |
| 7 | Use chi‐square to examine mutualisms: Use similar data to Bronstein and Ziv (1997) Yucca schottii and its pollinator moth Tegeticula yuccasella impacts the presence and activity of two other insect species, the beetle Carpophilus longus and the gall moth Prodoxus y‐inversus. | Problem Set 1 assigned |
| 8 | As week above | Review problem set answers |
| 9 | Plan analysis for and begin final project; meet with instructor | GSWR Chapter 5 week 1: introduction to statistical testing; chi‐square and two‐sample t‐test; more ggplot2. Begin formal analysis of class data |
| 10 | Spring Break (no class) | GSWR Chapter 5 week 2: linear regression, one‐way ANOVA |
| 11 | Measuring diversity: Measure diversity using the Simpson's Index and the Shannon Index; calculate “effective species” measurement | Continue formal analysis of class data with content learned in GSWR to present |
| 12 | Students worked on final R project | Problem Set 2 |
| 13 | Dissimilarity index and hierarchical clustering | GSWR Chapter 6: two‐way ANOVA |
| 14 | Finalize R projects and present | Introduction to vegan package; community structure; final project assigned |
| 15 | Last week of classes; no R skill‐building | GSWR Chapter 8: plotting tricks and tips with ggplot2 |
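As an illustration of the week 11 Community Ecology activity listed in Table 3, the sketch below computes the Shannon index, one form of Simpson's index, and the corresponding “effective species” numbers in base R. The abundance counts are hypothetical, and the choice of the Gini‐Simpson form (1 minus the sum of squared proportions) is an assumption, since the table does not say which variant was taught.

```r
# Hypothetical abundance counts for four species in one community
counts <- c(sp_a = 40, sp_b = 30, sp_c = 20, sp_d = 10)

# Convert counts to proportional abundances
p <- counts / sum(counts)

# Shannon index: H' = -sum(p * ln(p))
shannon <- -sum(p * log(p))

# Simpson's index in its Gini-Simpson form: 1 - sum(p^2)
simpson <- 1 - sum(p^2)

# Effective number of species: the number of equally abundant species
# that would give the same index value
effective_shannon <- exp(shannon)
effective_simpson <- 1 / sum(p^2)

print(c(shannon = shannon, simpson = simpson,
        effective_shannon = effective_shannon,
        effective_simpson = effective_simpson))
```

Both effective species values fall below the observed richness of four because the hypothetical community is uneven; they would equal four only if all species were equally abundant.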

In Community Ecology, students used the desktop version of RStudio for their analyses after downloading it independently to their computers. All assignments were written by the instructor or adapted from multiple sources (Gardener 2014; see Appendix S1 for additional references and sample assignments). Students were also assigned readings from GSWR and applied what they learned from those readings to write their own R code for analysis of data relevant to our course. Students were also expected to use R for a final project in which they analyzed a dataset in three different ways: first, through appropriate statistical analysis; second, through some form of visual analysis (either a graph or a map); and third, through analysis of community structure (e.g., diversity or niche overlap). The instructor assigned real datasets from the Ecological Data Wiki (2019) so students could practice data management skills, such as selecting and formatting the data they needed to do their analyses. Each student was required to meet with the instructor twice during the semester; the first time to discuss the three analyses that they planned to do and the second time to show their progress and troubleshoot, if necessary. The final product for this project was a poster that showed the results of their analyses as well as the R code they used to conduct their analyses. The students presented their posters in a symposium format at the end of the semester to an audience of their peers and department faculty.

In Forest Ecology, R and RStudio were presented in the second week of the semester. During the first R lesson, to motivate further R learning, students imported and worked with their own data collected from a local forest during the first laboratory period. Thereafter, we devoted less class time to R until the last third of the semester, but each week, students completed 1–4 short, low‐stakes R assignments, each worth only two points (1% of the homework grade), to reduce student anxiety. We worked through most of the GSWR book (Chapters 1–6 and 8, of 9 chapters). In each week's set of assignments, students first read the assigned chapter of GSWR and submitted an R script showing that they had worked through the material in the chapter. The later assignments during the same week asked students to use subsets of the data they collected in the field to complete tasks similar to those covered in that week's GSWR chapter. All homework assignments were submitted as R scripts via email to the instructor. As in Community Ecology, toward the start of the semester, assignment instructions were more detailed, including, for example, lines of R code that students should use as well as hints. As the semester progressed, assignments became less detailed so that students had to build on prior knowledge in order to complete the assignments. At both mid‐semester and the end of the semester, students completed a problem set based on analyzing aspects of their forest data. Each problem set listed specific required end products (figures, analyses, etc.) that students were asked to produce without any instruction, thus pushing students to gain more independence in managing data, writing R code, hypothesis testing, and data visualization. During the last third of the semester, we devoted class or laboratory time to analysis of our forest data and community ordination with R.
The analysis workflow presented over the course of the semester followed a similar focus to GSWR in how to approach a dataset and statistical testing, introduced in Chapter 4 (Table 4). In this workflow, the first three steps could be considered data management rather than exploratory or statistical data analysis.

Table 4. Steps to data analysis workflow adapted from Getting Started with R (Beckerman et al. 2017) and adopted as practice in Forest Ecology.

| Step | Description |
| --- | --- |
| 1 | Import data |
| 2 | Summarize and plot data to look for data entry errors, patterns in data, and outliers |
| 3 | Repair any data errors |
| 4 | Plot relationships and formulate expected outcomes of statistical tests |
| 5 | Run statistical tests and check for assumptions |
| 6 | Interpret results |
| 7 | Generate final figures/tables that are informed by results |
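The seven steps in Table 4 can be sketched as a single short script. The example below is a hedged illustration rather than a course assignment: it uses base R instead of the dplyr and ggplot2 functions students used, the tree diameter data are simulated, and the file, column, and plot names are hypothetical. One data entry error is planted deliberately so that steps 2 and 3 have something to find and repair.

```r
# Simulated (hypothetical) data: tree diameters in two forest plots,
# with one planted entry error (250 typed instead of 25.0).
set.seed(1)
trees <- data.frame(
  plot   = rep(c("ridge", "valley"), each = 10),
  dbh_cm = c(rnorm(10, mean = 22, sd = 3), rnorm(10, mean = 30, sd = 3))
)
trees$dbh_cm[3] <- 250
csv_path <- file.path(tempdir(), "trees.csv")
write.csv(trees, csv_path, row.names = FALSE)

# Send all plots to one PDF so the script also runs non-interactively
pdf(file.path(tempdir(), "workflow_plots.pdf"))

# Step 1: import data
dat <- read.csv(csv_path, stringsAsFactors = TRUE)

# Step 2: summarize and plot to look for entry errors and outliers
print(summary(dat))
boxplot(dbh_cm ~ plot, data = dat, main = "Raw data")

# Step 3: repair data errors (here, a misplaced decimal point)
bad <- dat$dbh_cm > 100
dat$dbh_cm[bad] <- dat$dbh_cm[bad] / 10

# Step 4: plot the relationship and state the expected outcome
# (expectation: valley trees are larger than ridge trees)
boxplot(dbh_cm ~ plot, data = dat, main = "Cleaned data")

# Step 5: run the test and check assumptions
fit <- t.test(dbh_cm ~ plot, data = dat)  # Welch two-sample t-test
normality_p <- tapply(dat$dbh_cm, dat$plot,
                      function(x) shapiro.test(x)$p.value)

# Step 6: interpret results
print(fit)
print(normality_p)

# Step 7: generate a final figure informed by the results
boxplot(dbh_cm ~ plot, data = dat,
        xlab = "Plot", ylab = "Diameter at breast height (cm)")
invisible(dev.off())
```

Keeping the repair in step 3 as code, rather than editing the .csv file by hand, leaves a record of what was changed and why, which is the reproducibility benefit the workflow is meant to teach.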

Most of the low‐stakes assignments emphasized use of the dplyr (Wickham et al. 2018) and ggplot2 (Wickham 2016) packages for manipulating and plotting data. Assignment instructions emphasized cleaning the data, for example, looking for and correcting mistakes in data entry, dealing with missing observations, examining data for outliers, etc. In both of the problem set assignments, students were asked to demonstrate, using their R code, that they had completed all seven of the GSWR steps in working with their data. This requirement reinforced the need for data management, including cleaning and repair, prior to analysis, steps often left out of the undergraduate ecology curriculum.

In total, students in Forest Ecology completed 22 low‐stakes R assignments and two larger problem sets in which they had to apply their data management and R skills (Table 3). Students in Community Ecology completed 8 weekly R assignments and a large final project (Table 3). By the end of the semester, students in both courses demonstrated that they were able to independently import .csv files into R, install and load packages, create and save R scripts, and create and save figures. Students in Forest Ecology were able to clean and repair datasets and look for outliers prior to analysis. In both classes, some students were able to run a series of statistical tests independently, and others with assistance. The R skills students developed and the R packages to which they were exposed are summarized in Table 5.

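Table 5. R skills and packages covered in courses (CE = Community Ecology, FE = Forest Ecology).

| R skill | CE | FE |
| --- | --- | --- |
| ANOVA | Y | Y |
| Calculating niche overlap | Y | N |
| Chi‐square | Y | Y |
| Correlation | N | Y |
| Create and save R scripts | Y | Y |
| Diversity indices | Y | Y |
| Dissimilarity indices | Y | N |
| Dendrograms | Y | N |
| Import data | Y | Y |
| Install and load packages | Y | Y |
| Installation of R and RStudio | Y | Y |
| Lotka–Volterra models | Y | N |
| Mapping | Y | N |
| Ordination | N | Y |
| Plotting with ggplot2 | Y | Y |
| Regression | Y | Y |
| Subsetting/filtering data | Y | Y |
| Two‐sample t‐test | N | Y |

| R package | CE | FE |
| --- | --- | --- |
| BiodiversityR | 1 | 0 |
| corrgram | 0 | 1 |
| corrplot | 0 | 1 |
| deSolve | 1 | 0 |
| dplyr | 4 | 23 |
| EcoSimR | 1 | 0 |
| ellipse | 0 | 1 |
| ggcorrplot | 0 | 2 |
| ggfortify | 0 | 3 |
| ggmap | 1 | 0 |
| ggplot2 | 3 | 20 |
| ggvegan | 0 | 1 |
| gridExtra | 0 | 1 |
| Hmisc | 0 | 1 |
| raster | 1 | 0 |
| readr | 0 | 24 |
| spaa | 2 | 0 |
| tidyr | 1 | 4 |
| vegan | 2 | 2 |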

We have presented two different approaches to teaching R in the context of undergraduate ecology courses. In Forest Ecology, the primary emphasis was on collecting and managing data and exploratory analysis, with limited attention to formal statistical analysis, whereas in Community Ecology, the primary emphasis was on using R to answer specific community ecology questions with already existing datasets. We found distinct advantages to teaching R regardless of approach and found that there were distinct strengths and weaknesses in each of these two approaches.

Community Ecology: strengths and weaknesses

The Community Ecology course did not have a laboratory component. As a result, students used published raw datasets that were freely available (Ecological Data Wiki 2019). By using these datasets in R, students were given an opportunity to apply their understanding of theory and concepts they learned in lecture and to see expected patterns within the data. Based on student feedback, the most valuable datasets in terms of student interest and learning were those taken from papers students discussed in the course. In Community Ecology, the focus was on the application of R to specific community ecology problems, and so students used specialized packages not covered in the GSWR book. With respect to student attitude and interest, we found that because the course description explicitly stated that R would be included, enrolled students expected that learning R would be part of the coursework. In addition, students were encouraged to learn from their errors. For example, in the first laboratory assignment, an error log was included to give students a place to record common errors in their code that they could return to later for reference. Informal comments from students about the material were generally positive about learning R, and at least three of the students in the course used R in their senior‐year capstone projects. Four students from Forest Ecology enrolled in a half‐credit course on using R for data management and analysis, offered by ELB the following semester; one of them was simultaneously enrolled in Community Ecology. One consequence of having a course without a laboratory is that students were limited to pre‐existing datasets. While this saved time so that more theoretical content was covered, students did not have the experience of collecting their own data or practicing good data management for each dataset.
However, for their final project, students were required to clean up a dataset and subset variables, with minimal R experience, before they moved on to data analysis. Another result of focusing more on course content, and not having a laboratory component, was that LAA did not spend a lot of time in class going over R assignments. Feedback was mostly limited to comments on assignments submitted via the learning management system, or in one‐on‐one meetings during office hours. Finally, there are few textbooks that incorporate R into theory in the field. Our textbook was purely conceptual, and most of the R assignments were adapted by the instructor. Therefore, there was a disconnect between readings and hands‐on assignments that may be better integrated with a text that uses R to work through relevant community ecology problems.