Introduction

Multivariate statistical analyses are typically used to summarise high‐dimensional data, test hypotheses involving multiple response variables, and examine relationships between large sets of variables (Legendre & Legendre, 1998; Härdle & Simar, 2007). The use of multivariate analyses is supplanting ‘simple’ descriptive analyses across ecology (see James & McCulloch, 1990 and Økland, 2007 for comment) and has become common in microbial ecology, where complex, multidimensional data sets abound (e.g. Ramette, 2007; Bertics & Ziebis, 2009; Frossard et al., 2012; Thioulouse et al., 2012; Hartmann et al., 2013; Rivers et al., 2013). Indeed, numerous software tools used by microbial ecologists implement multivariate analysis techniques and have been recommended as standard components of, for example, microbiome analysis (Kuczynski et al., 2012) and environmental studies (e.g. Zinger et al., 2012). Notable examples include the mothur software (Schloss et al., 2009), the Quantitative Insights Into Microbial Ecology (QIIME) platform (Caporaso et al., 2010), the phyloseq package (McMurdie & Holmes, 2013) and the Biodiversity Virtual e‐Laboratory (BioVeL; http://www.biovel.eu/) project. While such developments may lead one to conclude that standard statistical recipes and ‘workflows’ now exist for microbial ecology data, it is vital to recognise that gauging the appropriateness of a given technique to the data and phenomena under investigation is not necessarily a ‘cut and dried’ affair.

Firstly, it is essential to recognise that the application of statistical techniques to ecological data is the focus of a living field of study: numerical ecologists and statisticians routinely re‐evaluate the properties and limitations of even well‐known techniques in relation to ecological needs. For example, Legendre (2005b) recently re‐examined the value of the Kendall coefficient of concordance (W) in determining species associations in field survey data. Treating species as the ‘judges’ native to W's conceptual formulation allows the identification of species groups with similar ‘opinions’ (gauged by their variable values) which may be used as indicators of a given ecological phenomenon; however, Legendre describes several important caveats to the statistic's use in ecology, as not all variables are suited to its assumptions. Similarly, Warton & Hudson (2004) compared the effectiveness of the well‐known multivariate analysis of variance (MANOVA) to approaches that rely on the calculation of dissimilarities between sampling units rather than analyse abundance data directly. These authors present a developed case suggesting that the use of dissimilarity‐based approaches should be questioned and that alternatives may bring several advantages in generalisation and extensibility. Aside from re‐evaluation, proposals of new techniques and adaptations of existing techniques are steadily encountered. For example, Anderson (2001) developed a nonparametric multivariate analysis of variance approach which is argued to be better‐suited to ecological data while Zou et al. (2006) proposed a form of principal components analysis suited to the sparse data sets generated by, for example, genomic sequencing technologies. Approaches to meaningfully transform ecological data sets for ordination (Legendre & Gallagher, 2001), new ordination approaches (e.g. Pavoine et al., 2004) and methods to systematically assess the impact of rare phylotypes on analytical results (Gobet et al., 2010) provide other examples of relatively recent developments in ecologically oriented multivariate analysis. As they emerge, new techniques which show promise in an empirical setting often require review from expert statisticians to be fully understood. One example features the work of Borcard & Legendre (2002), who proposed a variant of the well‐known principal coordinates analysis to detect and characterise spatial structures in ecological data across all scales. In response to these authors' call for more thorough mathematical appraisal of their technique, Dray et al. (2006) developed supporting theory and connected the original method to a broader set of autocorrelation functions. From the above examples, it is clear that users of multivariate statistical techniques in microbial ecology must stay abreast of a steadily developing body of work involving a wide range of expertise.
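
To make the ‘species as judges’ idea behind the Kendall coefficient of concordance more concrete, the following minimal Python sketch computes the textbook form of W for a small, invented data set. It assumes that each species' abundances across sites contain no tied values and it omits the permutation testing and other safeguards discussed by Legendre (2005b); the function names and the example abundances are hypothetical and are not drawn from any cited study.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def kendall_w(judges):
    """Kendall's W for m judges (species) each scoring the same n objects (sites).

    Uses the untied formula W = 12 S / (m^2 (n^3 - n)), where S is the sum of
    squared deviations of the per-site rank totals from their mean.
    """
    m = len(judges)
    n = len(judges[0])
    ranked = [ranks(judge) for judge in judges]
    totals = [sum(r[i] for r in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical abundances of three species across four sites.
# All three species order the sites identically, so they are fully concordant.
concordant = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [5, 6, 7, 8],
]

# Two hypothetical species that order three sites in exactly opposite ways.
opposed = [
    [1, 2, 3],
    [3, 2, 1],
]

print(kendall_w(concordant))
print(kendall_w(opposed))
```

In this sketch, species that order the sites identically yield W = 1, whereas two species with exactly reversed orderings yield W = 0; real abundance data usually contain ties and would require the tie‐corrected form of the statistic.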

Secondly, to make informed methodological choices, users must be aware of the key debates that emerge in the multivariate analysis of ecological data. For example, a multi‐year discussion concerning the analysis of beta diversity using distance‐based and ‘raw data’ approaches recently unfolded in the journal Ecology (Legendre, 2005a; Laliberté, 2008; Legendre et al., 2008; Pélissier et al., 2008; Tuomisto & Ruokolainen, 2006, 2008). Distance and dissimilarity measures, such as the well‐known Bray–Curtis dissimilarity or Jaccard index, are conceptually appealing as they can address issues such as the handling of the double zero problem: accounting for the fact that observed absences (or zero abundances) of several ecological entities across the same sampling units are not necessarily indicators of similarity between those entities. However, the use of these measures introduces dependencies between objects (e.g. sites, samples, or experimental units) which may violate key assumptions of regression‐type analyses and may not deliver as much power as an examination of ‘raw’ presence–absence or abundance data. On another front, Warton et al. (2012) demonstrated that (dis)similarity‐based methods confound the mean–variance relationships characteristic of abundance (or other count‐based) data. These authors call for greater emphasis to be given to model‐based approaches, citing methods based on generalised estimating equations (Warton, 2011) and an original method named constrained additive ordination (Yee, 2006) as examples. Similar debate also surrounds aspects of experimental and sampling design, such as the issue (or, as some contend, nonissue) of pseudoreplication in ecological investigations (Hurlbert, 1984, 2004, 2009; Oksanen, 2001, 2004; Cottenie & De Meester, 2003; Coss, 2009; Koehnle & Schank, 2009; Schank & Koehnle, 2009; Prosser, 2010). While some insist that replication of treatments (or environmental contexts) across ‘truly’ independent sampling or experimental units must occur to draw valid conclusions, others argue that this may not be an achievable, or even necessary, goal in ecological investigations. The contemporary and faceted nature of such debates presents another challenge to the effective and duly cautious application of powerful analytical methods in microbial ecology.
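
The double zero problem raised above can be illustrated with a short Python sketch. It compares two invented abundance profiles before and after appending variables that are zero in both, using the Bray–Curtis dissimilarity, a Jaccard dissimilarity on presence–absence, and, as a contrasting measure not discussed in the text, a simple matching dissimilarity that counts joint absences as agreement. All names and values are hypothetical and serve only to show the behaviour of the measures.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum |x_i - y_i| / sum (x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def jaccard_dissimilarity(x, y):
    """1 - (variables present in both) / (variables present in at least one)."""
    shared = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return 1 - shared / either


def simple_matching_dissimilarity(x, y):
    """1 - (variables with the same presence/absence state) / (all variables).

    Joint absences (double zeros) count as matches here.
    """
    matches = sum(1 for a, b in zip(x, y) if (a > 0) == (b > 0))
    return 1 - matches / len(x)


# Hypothetical abundance profiles over four variables.
profile_a = [10, 0, 3, 5]
profile_b = [2, 4, 0, 5]

# Append six variables that are absent from both profiles (double zeros).
double_zeros = [0] * 6
profile_a_padded = profile_a + double_zeros
profile_b_padded = profile_b + double_zeros

for name, measure in [
    ("Bray-Curtis", bray_curtis),
    ("Jaccard", jaccard_dissimilarity),
    ("Simple matching", simple_matching_dissimilarity),
]:
    before = measure(profile_a, profile_b)
    after = measure(profile_a_padded, profile_b_padded)
    print(f"{name}: before={before:.3f}, after={after:.3f}")
```

In this example the Bray–Curtis and Jaccard values are unchanged by the appended double zeros, whereas the simple matching dissimilarity falls from 0.5 to about 0.2, so the two profiles appear more alike solely because of shared absences.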

Lastly, the harmonisation of canonical ecological theory with microbial ecology is ongoing (e.g. Prosser et al., 2007; Ramette, 2007) and faces the challenge of keeping pace with new molecular techniques, sequencing technologies and ecological sampling strategies both on global (Rusch et al., 2007; Karsenti et al., 2011; Zinger et al., 2011) and on local scales (e.g. Kuczynski et al., 2012; Böer et al., 2009; Zhou et al., 2013). Zinger et al. (2012) underscored this issue as well as its connection to the use of new statistical techniques in the field of aquatic microbial ecology.

The popularity of multivariate analyses is continuing to increase and their application to microbial ecological data has become technically simplified; however, a developed and up‐to‐date understanding of their properties and limitations is still not widespread in the community. As a result, many microbial ecologists who are not equipped with deep numerical training face a ‘black box’ approach to multivariate analysis and the associated risks of misapplying techniques or misinterpreting results. Reviewers, too, often face uncertainty in evaluating whether researchers have performed appropriate analyses and produced fair interpretations of their results. To support and promote the constantly developing understanding of multivariate analyses in microbial ecology, we present the GUide to STatistical Analysis in Microbial Ecology (GUSTA ME; http://mb3is.megx.net/gustame) – an online, dynamically updated resource with content tailored to the needs of the microbial ecology community.