This section reviews packages, relates some of those with similar functionality, and mentions how some of the packages can be used together. The sections in this review are ordered according to specific analytical approaches and the individual required steps.

A typical next step is the annotation of m/z values with putative metabolites using accurate mass lookup or, if the molecular formula was calculated, lookup of the formula in metabolite databases. It has to be noted that annotation by accurate mass search is by no means equivalent to identification. Under the assumption that all the metabolites measured in a sample have some biochemical relation, a global annotation strategy as used in ProbMetab can help as well. Here, the individual ranked lists of formulae are re-evaluated to also maximise the number of (potential) biochemical substrate-product pairs. The masstrixR package contains several utility functions for accurate mass lookup. These enable matching of measured m/z values against a given database or library and can additionally perform matching based on retention times (RT) and/or collisional cross sections (CCS) if available.
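The accurate mass lookup described above can be illustrated with a minimal sketch. It is written in plain Python and does not use ProbMetab or masstrixR; the `Candidate` class, the `annotate_mz` function, the 5 ppm default tolerance and the restriction to the [M+H]+ adduct are assumptions made for illustration only.

```python
from dataclasses import dataclass

PROTON_MASS = 1.007276  # Da; shift assumed for an [M+H]+ ion


@dataclass
class Candidate:
    name: str
    monoisotopic_mass: float  # neutral monoisotopic mass in Da


def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def annotate_mz(mz, library, tol_ppm=5.0, adduct_shift=PROTON_MASS):
    """Return (name, ppm error) for every library entry within tolerance."""
    hits = []
    for cand in library:
        theoretical = cand.monoisotopic_mass + adduct_shift
        err = ppm_error(mz, theoretical)
        if abs(err) <= tol_ppm:
            hits.append((cand.name, round(err, 2)))
    # Smallest absolute error first; ties keep library order.
    return sorted(hits, key=lambda h: abs(h[1]))


# Hypothetical three-entry library; glucose and fructose are isomers.
library = [
    Candidate("glucose", 180.06339),
    Candidate("fructose", 180.06339),
    Candidate("caffeine", 194.08038),
]
hits = annotate_mz(181.0707, library)
```

In this toy library both glucose and fructose are returned for the same m/z, which illustrates why an accurate mass hit is an annotation and not an identification.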

Detailed reconstructed isotope patterns can be used to determine the molecular formula of potential candidates. In molecular formula and isotope analysis, the m/z values and intensities for a given (set of) features can be used to calculate a ranked list of possible molecular formulas, based on the accurate mass and relative isotope abundances. The Rdisop, GenFormR and enviPat packages are able to simulate and decompose isotopic patterns into molecular formula candidates. Post-processing can calculate, e.g., the double bond equivalents (DBE) and similar characteristics to reduce the number of false positive assignments. An additional source of information to improve molecular formula estimation is the inclusion of MS/MS spectra, as used in MFAssignR, InterpretMSSpectrum or GenFormR.
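As an illustration of the DBE post-processing step, the following Python sketch computes the double bond equivalents of a formula candidate using the common rule DBE = C - (H + halogens)/2 + (N + P)/2 + 1. The function names are hypothetical, nitrogen and phosphorus are assumed to be trivalent, and this is not the implementation used by any of the packages named above.

```python
import re


def parse_formula(formula):
    """Count elements in a simple formula string such as 'C8H10N4O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def double_bond_equivalents(formula):
    """DBE assuming monovalent H/halogens and trivalent N and P."""
    c = parse_formula(formula)
    monovalent = sum(c.get(e, 0) for e in ("H", "F", "Cl", "Br", "I"))
    trivalent = c.get("N", 0) + c.get("P", 0)
    return c.get("C", 0) - monovalent / 2 + trivalent / 2 + 1


def plausible(formula):
    """Keep neutral candidates with a non-negative, integer DBE."""
    dbe = double_bond_equivalents(formula)
    return dbe >= 0 and dbe == int(dbe)
```

A filter of this kind only removes chemically implausible candidates; it does not rank the remaining formulas.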

In MS-based metabolomics, the characterisation and identification of metabolites involves several steps and approaches. After peak (feature) table generation, several tools can be used for grouping features that are postulated to originate from the same molecule. These include the widely used CAMERA for MS¹ data, as well as further packages, some of which are aimed particularly at DIA data. Other packages support interpretation of the relationship between the ion species, including adducts, isotopes and in-source fragmentation [82]. See Table 2 for a summary of these packages.

The result from the pre-processing is usually a matrix of abundances, with rows being features (or features grouped into compounds/molecules) and columns being the samples. Within the statistical community, it is nowadays common to manipulate data matrices with rows as observations and columns as features. This difference stems from the early days, when spreadsheet programs could only handle a limited number of columns, smaller than the number of, e.g., genes. Such matrices can be easily encapsulated into dedicated container classes provided by Bioconductor packages [34–36]. The main advantage of such objects is their inherent support for aligning quantitative data with related metadata (i.e., feature definitions/annotations as row metadata and sample annotations as column metadata). As an example, such an object can be generated from pre-processing results by adding the feature abundance matrix as the quantitative assay and the feature definitions and sample annotations as row and column annotations, respectively. Many Bioconductor packages for omics data analysis have native support for such objects.
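The idea of keeping the abundance matrix aligned with its row and column metadata can be sketched as follows. This is a toy Python container, not one of the Bioconductor classes; the class name `FeatureTable` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeatureTable:
    assay: list     # rows = features, columns = samples
    row_data: list  # one dict of feature annotations per row
    col_data: list  # one dict of sample annotations per column

    def __post_init__(self):
        # Refuse objects whose metadata do not match the matrix dimensions.
        if len(self.assay) != len(self.row_data):
            raise ValueError("one row_data entry per feature is required")
        if any(len(row) != len(self.col_data) for row in self.assay):
            raise ValueError("one col_data entry per sample is required")

    def select_samples(self, keep):
        """Subset columns; sample metadata are subset in the same step."""
        idx = [j for j, s in enumerate(self.col_data) if keep(s)]
        return FeatureTable(
            [[row[j] for j in idx] for row in self.assay],
            self.row_data,
            [self.col_data[j] for j in idx],
        )

    def samples_as_rows(self):
        """Transpose to the observations-by-variables layout."""
        return [list(col) for col in zip(*self.assay)]


ft = FeatureTable(
    assay=[[10.0, 12.0, 30.0], [5.0, 4.0, 1.0]],
    row_data=[{"mz": 181.07, "rt": 62.0}, {"mz": 195.09, "rt": 140.0}],
    col_data=[
        {"sample": "s1", "group": "control"},
        {"sample": "s2", "group": "control"},
        {"sample": "s3", "group": "case"},
    ],
)
cases = ft.select_samples(lambda s: s["group"] == "case")
```

Because subsetting goes through one method, quantitative data and annotations cannot drift out of sync, which is the main benefit attributed to such objects above.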

Currently, osd provides peak picking for unit resolution GC×GC-MS. While the msPeak package provides peak picking for GC×GC-MS data, the peak picking is done on the total ion chromatogram, thus not taking advantage of the mass selectivity provided by the MS detector. It does not appear that any package for R exists that provides peak picking for GC×GC-MS, LC×LC-MS or LC-IMS-MS, similar to (or even better than) commercial tools (e.g., ChromaTOF, GC Image, ChromSquare). Also, at least in the case of GC×GC-MS, unit mass resolution still seems to be the most common use-case, even though high-resolution MS could further improve signal deconvolution and ultimately, analyte identification. Such capabilities are crucial for moving these new powerful analytical approaches into mainstream metabolomics analysis.

There are several data independent MS/MS approaches, whereby MS/MS precursor selection is done, typically, on a scanning basis. These approaches perform precursor selection in a manner which does not depend on any feedback from the instrument control software or the MS level data. In practice, this precursor window can be either m / z or ion mobility-based. The processing tools within the R universe (discussed below) are so far underdeveloped for these approaches. With the increased popularity of multidimensional separation, the need for algorithms that can fully utilise the increased separation power is also increasing.

Most MS instruments offer the capability to perform selection (or filtering) of ions for fragmentation. The precursor selection can be performed through a quadrupole or ion trap, and fragmentation is often induced by collisions with an inert collision gas. Because this adds a level of mass spectrometry, it is called tandem MS, MS² or MS/MS. Ion trap instruments can further select fragment ions and acquire MSⁿ spectra.

Ion mobility separation (IMS) is a gas phase separation method offering resolution of ions based on molecular shape. This separation occurs on timescales of tens of microseconds, which generates a nested data structure in which there are dozens to hundreds of mass spectra collected across the IMS separation time scale. One can envision this as an ion mobility ‘chromatogram’—however, this chromatogram is nested within the actual chromatographic separation, thus LC-IMS-MS data is also four dimensional.

The vast majority of data collected for metabolomics comprises three dimensions: retention time, m/z, and intensity. However, there are more complicated analytical approaches that add further dimensions to the data. Two-dimensional chromatography offers two separations in the chromatographic (retention time) domain. The eluent from one column is captured by retention time range and transferred to a second column, where a fast orthogonal separation occurs. When coupled to a mass spectrometer, this generates four-dimensional data (m/z, first retention time, second retention time, intensity).

In addition to the most standard “spectra over time” representation of chromatographically separated MS data, there are several alternative ways to represent or simplify the data. The signal intensity for a given mass (or mass range) over chromatographic time can be represented as two equal-length vectors, one holding retention times and the other intensities. Examples of these vector pairs include the extracted ion chromatogram (EIC; sometimes also referred to as selected or eXtracted ion chromatogram, SIC or XIC), where these chromatograms represent the intensity of a given mass over (retention) time. The data thus contains no spectra, but several SICs. Frequently, this is accomplished by summarising the raw data in a two-dimensional matrix consisting of m/z and time dimensions, with each cell holding the signal intensity for that m/z and retention time range (or bin). Low mass resolution mass spectrometers often represent the data natively as SICs, and targeted data are also usually represented this way. Recent versions of xcms are also able to process such data, and additional xcms-based functionalities for analysis of targeted data can be found in the packages TargetSearch and SWATHtoMRM, while analysis of isotope labeled data can be found in the packages X13CMS, geoRge, and IsotopicLabelling. SIMAT also provides processing for targeted data and does not rely on xcms.
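The reduction of “spectra over time” to an ion chromatogram can be sketched in a few lines of Python. The input layout (a list of `(retention_time, [(mz, intensity), ...])` tuples), the function name and the 10 ppm default window are assumptions for illustration and do not reflect how xcms or the other packages store their data.

```python
def extract_ion_chromatogram(scans, target_mz, tol_ppm=10.0):
    """Return two equal-length vectors: retention times and summed intensities."""
    tol = target_mz * tol_ppm * 1e-6  # absolute m/z half-window
    rts, intensities = [], []
    for rt, peaks in scans:
        rts.append(rt)
        # Sum all signal inside the m/z window; scans without signal give 0.
        intensities.append(sum(i for mz, i in peaks if abs(mz - target_mz) <= tol))
    return rts, intensities


# Three hypothetical centroided scans.
scans = [
    (1.0, [(181.0706, 100.0), (200.0000, 50.0)]),
    (2.0, [(181.0709, 300.0)]),
    (3.0, [(150.0000, 10.0)]),
]
rts, eic = extract_ion_chromatogram(scans, 181.0707)
```

Repeating this for a grid of m/z bins yields the two-dimensional m/z by time matrix mentioned above.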

For the pre-processing of LC-MS and GC-MS data, xcms is widely used. A recent paper reviewed some of the xcms “family” packages [30], although many more packages exist that build on xcms by providing tools for specialised analyses, while others provide improvements of some of the xcms processing steps, such as improved peak picking. xcms itself provides several different algorithms for peak picking [31–33]. Other packages also provide peak picking for LC-MS data independently of xcms. In cases where the alignment of the peak data of different samples is considered (e.g., in cohort studies), some of these packages include methods to group the peaks by their m/z and retention times within tolerance levels. The groups are split into sub-groups using density functions, and the consensus m/z and retention time are assigned to each bin.

Chromatographic separation before MS enables better measurement of complex samples and the ability to separate isobaric compounds. Here, the mass spectra are acquired over time as the sample components separate on the chromatography column. The mass spectrum at any given time has the same data structure as any other mass spectrum: pairs of mass-to-charge ratio and intensity. As can be inferred from the above descriptions, chromatographically coupled mass spectrometry data is three-dimensional, with dimensions of retention time, m/z, and intensity.

Currently, one of the highest-throughput analytical approaches is direct infusion MS, where the sample is directly injected into the mass spectrometer without any chromatographic separation. This approach can be used with high mass resolution or ultra-high resolution mass spectrometers to discriminate isobaric analytes [29]. Summing or averaging the spectra acquired during the infusion generates a single mass spectrum, which is representative of that sample. Peak picking can be done using continuous wavelet transform-based peak detection, for which a wrapper function is also available. In the Flow Injection Analysis (FIA) approach, the sample is transiently injected into the carrier stream flowing directly into the MS instrument. In the absence of chromatographic separation, matrix effects are a challenge for quantification, especially in complex matrices. Data from FIA coupled to High-Resolution Mass Spectrometry can be processed with a dedicated workflow which provides efficient and robust peak detection and quantification.

The mass spectra can be recorded in profile (also called continuum) mode, but are often ‘centroided’. Centroiding is, in effect, a process of peak detection for a profile mode mass spectrum (hence in the m/z dimension, not in a chromatographic dimension): a Gaussian region of a continuum spectrum with a sufficiently high signal-to-noise ratio is integrated to give a centroided mass (a “stick” in the mass spectrum as opposed to a continuous signal) and an integrated area under the curve. This results in data of reduced size, since what was many m/z-intensity pairs is reduced to a single m/z-intensity pair. Practically, this reduces the file size considerably, and many data processing tools require MS data that has been centroided. The centroiding can be done either on the fly during acquisition by the instrument software, or as an initial processing step. Post-acquisition centroiding can be performed during conversion of the vendor data format to open formats, typically using msconvert from ProteoWizard [27, 28], which in some cases provides access to vendor centroiding algorithms or can alternatively use its own built-in centroiding method. Dedicated vendor tools can also be used, and some R packages also provide centroiding capabilities.
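The reduction of many m/z-intensity pairs to a single pair can be illustrated with a deliberately simple Python sketch. It treats every contiguous run of points above a noise threshold as one peak, reports the intensity-weighted mean m/z and integrates the area with the trapezoidal rule. This is an assumption-laden toy (no Gaussian fitting, no splitting of overlapping peaks) and not the algorithm used by msconvert, vendor software or any R package.

```python
def centroid_spectrum(mz, intensity, noise=0.0):
    """Collapse each above-noise region of a profile spectrum to (m/z, area)."""
    centroids = []
    n = len(mz)
    i = 0
    while i < n:
        if intensity[i] <= noise:
            i += 1
            continue
        # Extend the region while the signal stays above the noise level.
        j = i
        while j + 1 < n and intensity[j + 1] > noise:
            j += 1
        weights = intensity[i:j + 1]
        centre = sum(m * w for m, w in zip(mz[i:j + 1], weights)) / sum(weights)
        # Integrate including one flanking point on each side, if present.
        lo, hi = max(i - 1, 0), min(j + 1, n - 1)
        area = sum(
            (mz[k + 1] - mz[k]) * (intensity[k] + intensity[k + 1]) / 2.0
            for k in range(lo, hi)
        )
        centroids.append((centre, area))
        i = j + 1
    return centroids


# Hypothetical profile data: eight points collapse to two "sticks".
profile_mz = [100.00, 100.01, 100.02, 100.03, 100.04, 100.50, 100.51, 100.52]
profile_int = [0.0, 10.0, 20.0, 10.0, 0.0, 0.0, 5.0, 0.0]
sticks = centroid_spectrum(profile_mz, profile_int)
```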

For all mass spectrometers, the fundamental data generated is a mass spectrum, i.e., mass-signal intensity pairs. MS-based metabolomics data is typically acquired either as a single mass spectrum or as a collection of mass spectra over time, with the time axis (retention time) defined by chromatographic (or other time domain) separation. One of the first steps in metabolomics data processing is usually the reduction of the typically large raw data produced by the instrument to a much smaller set of so-called features, which are then subjected to downstream data analysis and interpretation. Features normally represent integrated peaks for a given mass that have been aligned across samples. Establishing these features is called feature detection. The applicable feature detection approaches and packages depend on the type and characteristics of the input data. This section describes the basic data structure for some of the common analytical approaches and shows appropriate tools in R for pre-processing such data; see Table 1 for an overview of the corresponding packages.

One of the most flexible packages for the handling of NIST msp files is. This package imports and exports the most attributes, although it does not entirely support generic attributes, and the export is very slow (we observed 20 min for an 8 MB file). In addition, a good library reader should also support mgf (mascot generic format) as available for download from GNPS [ 122 ] as well as other common formats such as the MassBank record format and different vendor library formats such as Bruker (.library, another msp flavour) and Agilent (.cef).

There are various R packages that support the import of NIST msp files (see Table 3 ), but the support of different dialects varies, e.g., the NIST-like spectral libraries from RIKEN PRIME [ 120 ] cannot be parsed by some readers. In addition, none of these packages currently supports the import of additional attributes such as ‘InChIKey: ’ or ‘Collision_energy: ’ as used in the export of MoNA libraries [ 121 ]. In essence, most of the packages support the format shown in Listing S1 (see Supplemental File S1 , ‘basic NIST’ in Table S1 ). Thepackage supports NIST msp files as shown in Listing S2 (see Supplemental File S1 , termed ‘canonical NIST’) and RIKEN PRIME provides a similar format with different attributes as shown in Listing S3 (see Supplemental File S1 ). The packages, andsupport the export of NIST msp files. The remaining packages partially support the export of results to NIST msp files (see Table S1 ).

NIST msp files and derived msp-like dialects are a commonly used plain text format for the representation of mass spectra. The msp format is described by NIST as part of their Library Conversion Tool [119] documentation, but has many different dialects due to rather loose format definitions. R packages that support the import and export of this file format can both use spectral libraries for identification and create and enrich spectral libraries with new data.
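To make the loose structure of msp-like files concrete, the following Python sketch reads records consisting of `Key: value` attribute lines followed by peak lines, with records separated by blank lines. The example record is hypothetical and only covers a basic dialect (peaks as whitespace-separated pairs, optionally several per line separated by semicolons); real files may deviate, and this is not the parser of any R package.

```python
def parse_msp(text):
    """Parse msp-like text into a list of {'attributes', 'peaks'} records."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:            # a blank line closes the current record
            current = None
            continue
        if current is None:
            current = {"attributes": {}, "peaks": []}
            records.append(current)
        if ":" in line and not line[0].isdigit():
            # Any 'Key: value' line is kept, including non-standard keys.
            key, value = line.split(":", 1)
            current["attributes"][key.strip()] = value.strip()
        else:
            for token in line.split(";"):
                parts = token.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
    return records


# Hypothetical two-record library in a basic msp dialect.
example = """Name: Compound A
InChIKey: HYPOTHETICALKEY-A
Num Peaks: 3
55 10; 82 25; 194 100

Name: Compound B
Num Peaks: 2
41 999
69 120
"""
library = parse_msp(example)
```

Keeping every attribute in a generic dictionary is one way to avoid silently dropping dialect-specific fields.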

A growing number of packages [99–101] support the annotation of lipids; see Table 2. They use a combination of lipid database lookup, spectral or selected fragment mass matching and in silico spectra prediction. To improve disambiguation between lipids of the same species that may only differ in their fatty acid chain composition, they usually rely on identifying specific MS/MS feature masses that are indicative of substructure fragments, such as the lipid headgroup, the headgroup with a certain fatty acid attached, or losses of fatty acid(s), and other modifications, such as oxidation. Additionally, they require certain intensity ratios between characteristic fragments of a lipid in order to identify the lipid species or subspecies.

Spectral matching of measured MS/MS data with spectral libraries is an important step in metabolite identification. Different possibilities for matching of two spectra exist, ranging from simple cosine similarity and the normalised dot product to X-Rank and proprietary algorithms. In, different spectra can be compared. Functions for comparison include the number of common peaks, their correlation, their dot product or alternatively a custom comparison function can be supplied. In addition, it will be possible to import spectra from different file formats such as NIST msp, mgf, and Bruker library toobjects using thepackage.therefore seems to be the most flexible R package for the computation of spectral similarities. Spectra are binned before comparison. Thepackage contains a simple cosine spectral matching between two spectra. The two spectra are aligned with each other within a definederror window using one spectrum as the reference. The feature-richcan import msp files and uses the dot product to calculate the spectral similarity, thepackage can perform spectral matching using different similarity functions, andimplements the probabilistic X-Rank algorithm [ 118 ].
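A cosine-type spectral match with peak alignment inside an m/z error window, using one spectrum as the reference, can be sketched as follows. The Python code is a generic illustration with hypothetical function names and a greedy nearest-peak alignment; it is not the scoring function of any of the packages discussed and omits common refinements such as intensity or mass weighting.

```python
import math


def align_peaks(reference, query, tol=0.01):
    """Pair each reference peak with the closest unused query peak within tol."""
    pairs, used = [], set()
    for ref_mz, ref_int in reference:
        best, best_diff = None, tol
        for k, (q_mz, _) in enumerate(query):
            diff = abs(q_mz - ref_mz)
            if k not in used and diff <= best_diff:
                best, best_diff = k, diff
        if best is None:
            pairs.append((ref_int, 0.0))   # reference peak without partner
        else:
            used.add(best)
            pairs.append((ref_int, query[best][1]))
    # Query peaks that matched nothing still count against the score.
    pairs.extend((0.0, q_int) for k, (_, q_int) in enumerate(query) if k not in used)
    return pairs


def cosine_similarity(reference, query, tol=0.01):
    """Normalised dot product of the aligned intensity vectors (0 to 1)."""
    pairs = align_peaks(reference, query, tol)
    dot = sum(a * b for a, b in pairs)
    norm = math.sqrt(sum(a * a for a, _ in pairs)) * math.sqrt(sum(b * b for _, b in pairs))
    return dot / norm if norm else 0.0


# Hypothetical spectra as (m/z, intensity) lists.
spec_a = [(50.0, 10.0), (100.0, 100.0)]
spec_b = [(50.004, 10.0), (100.003, 100.0)]   # same peaks, small m/z shift
spec_c = [(60.0, 10.0)]                        # no peak in common
```

Binning both spectra onto a common m/z grid before taking the dot product is the usual alternative to explicit alignment; the tolerance then corresponds to the bin width.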

While DDA and DIA are convenient methods, users might miss the accuracy and full control over what is fragmented in the targeted approach. The packagesandcan be combined into a workflow (see [ 108 ]) for the generation of records to be uploaded to MS/MS spectral databases (e.g., MassBank [ 109 ]) or to be used off-line.allows the user to specify an arbitrary number RT-pairs and first sorts them into non-overlapping subsets for which in a second step MS/MS methods (Bruker) or target lists (Agilent, Waters) are generated. It is possible to allow multiple collision energies in a single or separate experiment methods.was used for calculation of exact masses of adducts. MS/MS data were then acquired on a Bruker maXis plus UHR-Q-ToF-MS. After data collection each run was manually checked for data quality and processed with

MS/MS spectra can be further processed, for example by selecting a representative MS/MS spectrum among all spectra associated with a chromatographic peak or by fusing them into a consensus spectrum. Subsequently, spectra can be used in downstream analyses such as spectral matching or clustering. Due to the re-use of infrastructure from the MSnbase package, xcms has recently gained native support for MS/MS data handling and hence allows all MS/MS spectra associated with a feature or chromatographic peak to be extracted for further processing.

In data-independent acquisition mode (DIA), the isolation windows are broader or, in some cases, all ions are fragmented; e.g., the Weizmass library [107] is based on MSᴱ. The computational challenge for DIA data is to deconvolute the MS/MS data and assign the correct precursor ion. DIA data analysis support is currently being implemented in several R packages.

In data-dependent acquisition (DDA), the instrument is configured to apply a set of rules which determine which precursor ions are fragmented and for which MS/MS spectra are acquired. DDA approaches also produce a lot of spectra for background peaks or contaminants, which are often of limited use for the purpose of metabolomics studies. Using the RMassBank package, MS¹ and MS/MS data can be recalibrated and spectra can be cleaned of artifacts. After database lookup of corresponding identifiers, MassBank records are generated.

In targeted MS/MS, the instrument isolates specific masses (specified via method files) and fragments them. Manually writing targeted MS/MS methods from metabolomics data can be tedious if several tens to hundreds of ions need to be fragmented. The MetShot package supports creating targeted method files for some Bruker and Waters instruments. For all other vendors, optimised lists of non-overlapping peaks (RT-m/z pairs) can be generated to optimise acquisition in the lowest possible number of methods.
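The task of distributing RT-m/z pairs over the lowest possible number of methods, such that no two targets in one method overlap in retention time, is an interval partitioning problem. The Python sketch below uses the standard greedy first-fit strategy on targets sorted by retention time. The fixed acquisition window width and the function name are assumptions, and this is not a description of how MetShot is implemented.

```python
def schedule_targets(targets, rt_window=10.0):
    """Split (rt, mz) targets into methods without overlapping RT windows."""
    methods = []   # list of target lists, one per method
    ends = []      # end of the last RT window in each method
    for rt, mz in sorted(targets):
        start, end = rt - rt_window / 2.0, rt + rt_window / 2.0
        for k, last_end in enumerate(ends):
            if start >= last_end:          # fits after the last target
                methods[k].append((rt, mz))
                ends[k] = end
                break
        else:                              # overlaps everywhere: new method
            methods.append([(rt, mz)])
            ends.append(end)
    return methods


# Four hypothetical targets (RT in seconds, precursor m/z).
targets = [(60.0, 181.07), (62.0, 195.09), (75.0, 130.05), (64.0, 203.05)]
methods = schedule_targets(targets, rt_window=10.0)
```

Three of the four example targets co-elute within 10 s, so at least three methods are needed; the fourth target is appended to the first method.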

Generation of high-quality MS/MS spectral libraries and MS/MS data can be a tedious task. It involves wet lab steps of preparing solutions of reference standards as well as creating MS machine-specific acquisition methods. Several steps can be automated using different R packages presented here.

The annotation of features from MS¹ experiments alone has limited specificity. Additional structural information for metabolite identification is available from tandem MS and higher-order MSⁿ experiments. There are different approaches, ranging from targeted MS/MS experiments and DDA to DIA (e.g., MSᴱ, all-ion, broad-band CID, SWATH and other vendor terms). Table 3 provides a summarised overview of R packages for these types of experiments.

NMR metabolite annotation uses either chemical shifts and multiplicity matching from an existing database, such as Human Metabolome Database [ 125 128 ] (HMDB), a literature experimental search, or uses simulated reference library compounds [ 129 ] to match or to fit the existing biological spectra. 1D NMR data often is not sufficient for a confident assignment of the metabolite peaks [ 130 ] therefore complementary 2D spectral data acquisition are often required to confirm the assignment [ 131 ]. The only package that explicitly deals with 2D NMR isthat takes a targeted approach where the user defines regions of interest to be quantified and compared., originally written in MATLAB [ 132 ], uses both 1D and 2D NMR data for targeted profiling that is also available as an R version called. We are not aware of other R packages that handle 2D NMR data processing. Several general multiway statistical tools such as PARAFAC [ 133 ], Tucker3 [ 134 ] and MCR have been described [ 135 ] that are able to analyse 1D and 2D NMR data, see the section on statistical analysis for a list of packages available for these techniques.uses a Bayesian model and some template information such as chemical shifts,-couplings, multiplicity and intensity ratios derived from spectral database to automatically quantify metabolites in a targeted manner [ 136 ].

NMR is another analytical technique commonly used in metabolomics research. The pre-processing steps for NMR data normally include Fourier transformation, apodisation, zero filling, phase and baseline correction, and finally referencing and alignment of spectra. Other steps commonly used are removing the areas without any metabolites such as the water region (from 4.7 to 4.9 ppm), as they generally contain no useful information. There are several R packages that can carry out the above tasks (see Table 4 ). Theandare two examples of such R-based packages. The 1D NMR spectra can then be segmented into spectral regions (also known as bins or buckets) subjected directly to statistical data analysis after a normalisation step. The size of the bins could be fixed or variable (adopted or intelligent binning) based on NMR peaks or even each data point from each peak (full data point resolution) used for data analysis. The 123 ] package provides a graphical and interactive interface for 1D NMR spectral processing and analysis. Additionally, it provides various spectral alignment methods with the ability to use the corresponding experimental-factor levels in a visual and interactive environment, bridging the gap between experimental design and subsequent statistical analyses. Alternatively, peak picking (based on the regions of interest, ROI) can be performed and individual compounds can be identified and integrated prior to statistical analysis. Targeted profiling aims to identify and quantify specific compounds in a sample. The packages that use such approach (ROI) are, and. The bucketed/integrated spectra are normalised to minimise the biological and technical variation. The most common methods are normalisation to a constant sum (e.g., total sum of integral/bin intensities), probabilistic quotient normalisation [ 124 ] and dry weight tissue or protein content.
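Probabilistic quotient normalisation, one of the normalisation methods mentioned above, can be sketched as follows for already bucketed spectra. The Python code assumes a list of samples with equal numbers of bins, uses total-sum scaling as the initial step and the bin-wise median spectrum as the reference; the function name is hypothetical and the sketch is not taken from any of the R packages.

```python
from statistics import median


def pqn_normalise(spectra):
    """Probabilistic quotient normalisation of bucketed spectra (samples x bins)."""
    # Step 1: constant-sum scaling of every sample.
    scaled = [[v / sum(s) for v in s] for s in spectra]
    # Step 2: reference spectrum = median of each bin across samples.
    reference = [median(col) for col in zip(*scaled)]
    normalised = []
    for s in scaled:
        # Step 3: quotients against the reference, ignoring empty reference bins.
        quotients = [v / r for v, r in zip(s, reference) if r > 0]
        # Step 4: the median quotient is the estimated dilution factor.
        factor = median(quotients)
        normalised.append([v / factor for v in s])
    return normalised


# Hypothetical data: the second sample is a two-fold concentrated copy of the first,
# the third has one strongly changed bin.
bins = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, 3.0, 40.0],
]
norm = pqn_normalise(bins)
```

In the third sample, plain constant-sum scaling would shrink the three unchanged bins because of the single large one; the median quotient is less sensitive to that bin, which is the motivation for the method.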

Another, in metabolomics sometimes under-appreciated, analytical approach is UV absorption detection, usually coupled with an HPLC or UHPLC system. In some cases, the photo-diode array detector (DAD or PDA) is part of an LC-MS system, actually an LC-UV-MS setup. There are other detectors (e.g., fluorescence) with a different principle, but similar characteristics when it comes to the acquired data. Alignment and baseline correction are typically the first steps of pre-processing LC-UV data. Alignment can be achieved for example with theor thepackage while baseline correction can be achieved using the(or thepackages). Thepackage provides an alternative to using all channels (wavelengths) by first finding unique components (i.e., “pure” spectra) and then performing peak picking in these components. After alignment, general multiway statistical methods like PARAFAC, simultaneous component analysis (SCA), and Tucker Factor Analysis can be applied in the same manner as feature tables would be handled. Table 5 provides an overview of the available R packages for UV data.

MetaboAnalystR is a toolbox built over several R packages and contains more than 500 functions organised in eleven modules. The package was created to overcome the limitations of the homonymous web application by making it possible to create flexible customised workflows (including interoperability with other packages) and to deal with large data sets. Its functionalities cover a wide range of tools: exploratory statistical analysis, biomarker analysis, power analysis, biomarker meta-analysis, functional enrichment analysis, pathway and joint pathway analysis. Through an implementation of the mummichog algorithm [180], the package also allows pathways to be inferred from user-generated peak lists. Using the MetaboAnalyst knowledgebase, it provides access to metabolite set libraries, compound libraries and pathway libraries.

MetaboDiff is presented as an entry-level, user-friendly package for differential metabolomics analysis. The information contained in the input data (metabolomics measurements and metadata) is stored in S4 objects which are used for the downstream processing. The pre-processing consists of missing value imputation, outlier removal and data normalisation, while the data analysis part offers a variety of statistical methods including tools to explore how metabolites relate to each other in sub-pathways.

MOFA provides tools for the integration of data coming from different omics disciplines (multi-omics). Using factor analysis, it makes it possible to calculate hidden factors that capture the biological sample variation across multi-omics datasets, thus allowing marker discovery. MOFA also provides various tools for the visualisation of results. IntLIM also supports integration of other omics datasets with metabolomics data by leveraging linear modeling to identify gene-metabolite pairs whose relationship differs from one phenotype to another (e.g., positive correlation in one phenotype, negative or no correlation in another). IntLIM includes a user-friendly web interface to perform quality control of input data, identification of phenotype-dependent gene-metabolite pairs, and interactive visualisation of results. This tool is particularly useful for integrating transcriptomic and metabolomic or other omics data by generating novel hypotheses in a data-driven manner.

muma is a package designed to be compatible with MS and NMR generated data. The package mainly focuses on performing statistics. It does not contain functions for data extraction, and the user has to provide values arranged in a data.frame format. The pre-processing is limited to missing value imputation, noise filtering, variable scaling and normalisation. The package also provides tools for outlier detection, univariate and multivariate analysis. Notably, the package offers a script for Statistical TOtal Correlation SpectroscopY (STOCSY) on NMR data.

A great number of packages are available for performing statistics on metabolomics datasets. Some of them focus on specific tasks, such as sample size estimation, batch normalisation, exploratory data analysis, univariate hypothesis testing, multivariate modeling and omics data integration. Others, listed in the section ‘Multiple workflow steps’ in Table 6, adopt a more comprehensive approach, providing statistics toolboxes that cover different methods and functionalities.

Extracting a restricted list of features which still provides a high prediction performance (i.e., a molecular signature) is critical for biomarker validation and clinical diagnostics. Several strategies have been described for feature selection [ 174 175 ] (e.g., wrapper approaches such as Recursive Feature Elimination, Genetic Algorithms, or sparse models such as Lasso, Elastic Net, or sparse PLS). Such techniques are implemented in R packages, which also provide detailed comparisons on real datasets in terms of the stability and the size of the selected signature, the prediction performance of the final model, and the computation time [ 176 179 ].

The second strategy, “metabolite fingerprinting”, is commonly used in biomedicine, environmental metabolomics and eco-metabolomics to find metabolite patterns across metabolite profiles. Here, metabolites are characterised without necessarily identifying them, and characterisation usually occurs from spatiotemporally coarser scales to intrinsic scales within biological species [ 166 ]. Multivariate statistical methods are used that require reduction of high-dimensional data and, thus, ordination methods are commonly applied like (Orthogonal) Partial Least Squares regression (sometimes also coupled to Discriminant Analysis) ((O)PLS(-DA)), (Linear) Discriminant Analysis ((L)DA), and (Canonical) Correspondence Analysis ((C)CA) that make it possible to relate sets of explanatory variables containing species traits or environmental properties (such as soil type, plant height, smoker/non-smoker, gender, etc.) to the metabolite feature matrix [ 157 168 ]. Other machine learning methods like Random Forests (RF), Support Vector Machines (SVM) and Neural Networks (NN or ANN) are also applicable [ 169 ]. Lately, untargeted metabolomics data is related to other ‘omics using network analysis or Procrustes analysis to visualise (dis)similarities between two or more ‘omics data sets [ 170 173 ].

With regard to statistical analyses in untargeted metabolomics, two strategies can be differentiated that necessitate the use of different methods. The first strategy “metabolite profiling” is performed by most untargeted metabolomics studies. Here, a bottom-up approach is taken where sets or classes of pre-defined metabolites are studied usually in different phenotypes of the same biological species and differences in metabolites are usually related to more coarse functional or biological levels (e.g., to phenotype or to control vs. treatment in biomedical studies) [ 161 ]. Exploratory data analysis, univariate methods, hierarchical clustering (HCA), Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS) like methods are very common in metabolite profiling approaches. Feature/variable selection is performed to find only the most significant metabolite candidates that explain the underlying research question, usually using univariate methods to target only specific metabolites that are interesting to the research question of the study [ 162 165 ].

Following the feature detection and grouping steps outlined in the sections above, different paths to statistical analysis are available in R and Bioconductor. Once the “sample versus variable” feature matrix of molecule intensities or abundances has been generated, comprehensive statistical analyses can be performed by using the vast range of packages provided by the R statistical software and the Bioconductor project (see Table 6 ); see, for instance, StatisticalMethod biocViews [ 146 ] and the ExperimentalDesign [ 147 ], Cluster [ 148 ], Multivariate [ 149 ], MachineLearning [ 150 ] CRAN Task Views [ 151 ]. As mentioned in the introduction, we will only cover common statistical approaches used in metabolomics. Areas such as time-series analysis, clustering methods, machine learning and visualisation of high-dimensional data were dealt with in various books and literature reviews [ 152 160 ].

The analysis of identified compounds on the level of substance classes can give biochemical insights which are not obvious from the individual structures, or which are needed when the structures are not fully elucidated. The web tool ClassyFire is able to annotate a given structure with compound classes from its ChemOnt taxonomy as well as different substituents [251]. A dedicated R package supports the retrieval of substance classes based on InChIKeys using the RESTful API of the ClassyFire tool.

Several existing compound databases are useful for metabolomics. These can supply metadata such as common names and synonyms, database identifiers and experimental or predicted properties. Thepackage provides lookup of information available in PubChem [ 244 245 ], while thepackage provide query of a large number of databases including PubChem, ChemSpider [ 246 ], Wikidata [ 247 ], Chemical Translation Service [ 248 ], PHYSPROP [ 249 ], Chemical Identifier Resolver [ 250 ] and others.can be used to map identifiers (metabolites, but also genes and proteins, and interactions) between databases, e.g., PubChem to ChemSpider identifiers;andalso provide some useful web-retrieval functions.

A well-established package iswhich provide a comprehensive subset of functions from the Chemistry Development Kit [ 239 ].provides a computer readable representation of molecular structures and provide a wealth of functions to import structures from different molecule structure description formats, manipulate structures, visualise structures and calculate properties and molecular fingerprints. The packagecan then be used to compare fingerprints.provides reading and writing of InChI and InChIKeys [ 240 ].is an alternative to, providing many similar functions, with more tools for fingerprints, clustering and others through querying the ChemMine Tools web service [ 241 ].also has significantly faster parsing of SDF files, which can be an advantage when reading large databases. A large number of additional descriptors are available in the packagewhich focuses on quantitative predictive models.provides conversion between a large number of chemical structure formats using OpenBabel [ 242 ]. A notable exception is InChI/InChIKey, which is not directly supported byorand one would thus have to go throughandfor offline import from InChI tooris a package that combines the functionality of thewith that of, and. The packagemakes (part of) the functionality of the RDKit [ 243 ] toolkit available from within R.

provides a relational database of Metabolomics Pathways, integrates pathway, gene, and metabolite annotations from KEGG, HMDB, Reactome, and WikiPathways. The database is downloadable as a standalone MySQL dump for integration with other software, is also accessible through an R package, and includes a [ 270 ] web interface that supports four basic queries: (1) retrieve analytes (genes or metabolites) given a pathway name; (2) retrieve a pathway for one or more analytes; (3) retrieve analytes involved in the same reaction; (4) retrieve ontologies (cellular location, biofluid location, etc.) for metabolites. The web interface also supports pathway overrepresentation analysis on genes, metabolites, or genes and metabolites combined (query 3) and includes clustering of significantly enriched pathways according to the percentage of overlapping analytes between pathways. Furthermore, the web interface provides network visualisation of gene-metabolite relationships (query 4).

Another package, paxtoolsr, provides literature-curated pathway data in the Biological Pathway Exchange (BioPAX) format through an interface to the Pathway Commons database (including data from the NCI Pathway Interaction Database (PID), PantherDB, HumanCyc, Reactome, PhosphoSitePlus and HPRD). rWikiPathways is an interface between R and WikiPathways.org. Pathways can be queried, interrogated and downloaded to the R session. Furthermore, rWikiPathways associates metabolite information with pathways when the system code of a chemical database (e.g., HMDB, ChEBI or ChemSpider) is provided.

A plethora of pathway resources exist, aptly aggregated by Pathguide.org. Several of these resources can be accessed by R packages, which were partly reviewed in [ 268 ]:, and. Of these,stores pathway information for proteins and metabolites of currently fourteen species (version 1.28.0). Available databases are KEGG, Biocarta, Reactome, NCI/Nature Pathway Interaction Database, HumanCyc, Panther, SMPDB and PharmGKB.offers in addition topological and statistical pathway analysis tools for metabolomics data by interfaces with the Bioconductor packagesandand supports functionality to build own pathways. Furthermore,enables the creation and editing of biological pathways.makes it possible to visualise data on pathways, to perform statistics on pathway data, and provides an interface to WikiPathways.makes it possible to access the KEGG REST API via a client interface. The package provides utility to search keywords, convert identifiers and link across databases. The package also makes it possible to return amino acid sequences asor nucleotide sequences asobjects (from the 269 ] package).

The package MetaboLouise simulates longitudinal metabolomics data. The simulation builds on a mathematical representation that is parameterised according to underlying biological networks, i.e., by defining metabolites and the relations between them and by initialising enzyme rates. Optionally, the package implements functionality to vary the rates depending on the network state, to add external fluxes and to analyse results based on different parameters.

R offers packages to analyse metabolic systems and to estimate biochemical reaction rates in metabolic networks using flux balance analysis, e.g., BiGGR, abcdeFBA, sybil and fbar. For example, BiGGR interfaces with the BiGG database, which contains reconstructions of metabolic networks. After importing pathways from the database, flux balance analysis and downstream routines can be performed, e.g., linear optimisation routines or likelihood-based ensembles of calculated flux distributions fitting experimental data.

PAPi (Pathway activity profiling) assigns pathway activity scores to samples to represent the potential pathway activity and statistically detects affected pathways by applying a t-test or ANOVA. PAPi uses KEGG pathway identifiers. pathwayPCA, with gene selection in mind, offers multi-omics data analysis by estimating sample-specific pathway activities for pathways taken, e.g., from the rWikiPathways interface. pathwayPCA takes continuous, binary or survival outcomes as input and estimates contributions of individual genes towards pathway significance.

Many R packages guide the discovery of biomarkers for specific phenotypes. Among these is lilikoi, which maps features to pathways using standardised HMDB IDs and transforms metabolomic profiles into pathway-based profiles using pathway deregulation scores, a measure of how much a sample deviates from a normal level, followed by feature selection, classification and prediction. INDEED (INtegrated DiffErential Expression and Differential network analysis) aims to detect biomarkers by combining a differential expression analysis with a differential network analysis based on partial correlation, followed by a network topology analysis. Subsequently, activity scores are calculated from the differences detected in the differential expression analysis and the topology of the differential network; these scores guide the selection of biomarkers. Another R package for biomarker and feature selection is MoDentify, which finds modules that are regulated with respect to a given phenotype; modules are groups of correlating molecules that can span from a few metabolites to entire pathways. These groups are possibly functionally coordinated, coregulated or driven by a similar or the same biological process. The modules are identified by score maximisation using a multivariable linear regression model with the candidate module as the dependent variable and the phenotype and optional covariates as independent variables. Furthermore, MoDentify implements Gaussian graphical models in which, depending on the resolution, nodes reflect metabolites or entire pathways.
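Partial correlation, on which such differential network analyses are built, measures the association between two metabolites after the linear effect of a third variable has been removed. The sketch below uses the textbook first-order formula and is not taken from any of the packages above; the example vectors are constructed (hypothetical) so that x and y are related only through z.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical centred abundances: z drives both x and y, and the remaining
# variation in x and y (noise_x, noise_y) is mutually uncorrelated.
z = [-7, -5, -3, -1, 1, 3, 5, 7]
noise_x = [7, 1, -3, -5, -5, -3, 1, 7]
noise_y = [-7, 5, 7, 3, -3, -7, -5, 7]
x = [zi + 0.1 * e for zi, e in zip(z, noise_x)]
y = [zi + 0.1 * e for zi, e in zip(z, noise_y)]

print("marginal correlation:", pearson(x, y))
print("partial correlation given z:", partial_correlation(x, y, z))
```

The marginal correlation is high because both variables follow z, whereas the partial correlation is close to zero; an edge between x and y would therefore appear in a plain correlation network but not in a partial correlation network.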

MetaboDiff offers functionality to pinpoint metabolome-wide differences using PCA and t-distributed stochastic neighbor embedding (tSNE), building on the MultiAssayExperiment S4 class. Using a t-test or ANOVA, MetaboDiff identifies metabolites that differ in abundance between groups and, using WGCNA, identifies modules/sub-pathways that indicate changes in biological pathways. SDAMS (Semi-parametric differential abundance analysis method for proteomics and metabolomics data from mass spectrometry), building upon the SummarizedExperiment S4 class, performs differential abundance analysis by linking metabolite levels to phenotypic data; it is designed for data containing zero values and possibly non-normally distributed non-zero intensity values.

Another important aspect commonly executed is enrichment analysis to identify pathways that are up- or downregulated given an experimental condition. The R environment offers a whole range of enrichment analysis packages (e.g.,for metabolite data). Targeted more towards pathway analysis,is a Bioconductor package for enrichment analysis.detects discriminative metabolic features, maps these to known biological pathways of the KEGG database and detects enriched terms by a diffusion algorithm.offers enrichment analysis tools extending conventional gene set enrichment methods by incorporating pathway topologies.takes nodes rather than terms for analysis and uses network centralities as weight of nodes incorporating pathways from the Pathway Interaction Database (PID, [ 265 ]), including NCI/Nature Pathway Interaction, BioCarta [ 266 ], Reactome [ 267 ] and KEGG [ 258 260 ].
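As a point of reference for the enrichment methods discussed above, the simplest variant is over-representation analysis, which asks whether a pathway contains more significant metabolites than expected by chance. The following sketch computes the one-sided hypergeometric tail probability; it does not reproduce the diffusion- or topology-based algorithms mentioned in the text, and the pathway names and metabolite identifiers are hypothetical.

```python
from math import comb


def overrepresentation_p(hits, selected, pathway_size, background):
    """P(X >= hits) for X ~ Hypergeometric(background, pathway_size, selected).

    background   : number of metabolites that could have been detected
    pathway_size : number of background metabolites annotated to the pathway
    selected     : number of significant metabolites
    hits         : number of significant metabolites within the pathway
    """
    upper = min(selected, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(background, selected)


def enrich(significant, pathways, background):
    """Return {pathway name: p-value} for every pathway in `pathways`."""
    background = set(background)
    significant = set(significant) & background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background
        hits = len(significant & members)
        results[name] = overrepresentation_p(
            hits, len(significant), len(members), len(background)
        )
    return results


# Hypothetical identifiers: 20 measured metabolites, 4 of them significant.
background = [f"met{i:02d}" for i in range(20)]
pathways = {
    "pathway A": ["met00", "met01", "met02", "met03", "met04"],
    "pathway B": ["met10", "met11", "met12", "met13", "met14"],
}
significant = ["met00", "met01", "met02", "met15"]
p_values = enrich(significant, pathways, background)
```

The resulting p-values are uncorrected; with many pathways tested, a multiple-testing correction would normally follow.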

Several R packages enable pathway analysis that uses quantitative data of metabolites and maps these to biological pathways. The Bioconductor package pwOmics analyses proteomics, transcriptomics and other omics data in combination to highlight molecular mechanisms for single-point and time-series experiments. In downstream analyses, pwOmics allows for pathway, transcription factor and target gene identification.

Several R packages implement the functionality to generate metabolic networks. These networks can subsequently be analysed by their topological properties, be used to identify motifs that differ between experimental conditions or be queried to find associations between metabolic features. MetaMapR generates metabolic networks by integrating enzymatic transformation, structural similarity between metabolites, mass spectral similarity and empirical correlation information. To this end, MetaMapR queries biochemical reactions in KEGG and molecular fingerprints for structural similarities in PubChem. Furthermore, MetaMapR aims at incorporating metabolites with unknown biochemistry and unknown structures, and integrates other data sources (genomic, proteomic, clinical data). The package Metabox offers a pipeline for metabolomics data analysis, including functionality for data-driven network construction using correlation and for estimation of chemical structure similarity networks using substructure fingerprints. Its statistical analysis highlights metabolites that are altered based on the experimental design group, which can be further interrogated by network and pathway analysis tools. Furthermore, the package MetabNet includes functionality to perform targeted metabolome-wide association studies (MWAS) and to guide the association of unknowns to a specific metabolic pathway, followed by mapping a target metabolite to the metabolic network structure.
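A structural similarity network of the kind described above can be sketched by comparing fingerprints pairwise and keeping pairs above a threshold. The example below uses the Tanimoto coefficient, a common choice for binary fingerprints; whether the packages above use exactly this measure is an assumption. The fingerprints are hypothetical sets of "on" bit positions, and the threshold of 0.7 is arbitrary.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(fingerprints, threshold=0.7):
    """Return (compound_a, compound_b, score) for all pairs at or above the threshold."""
    edges = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        score = tanimoto(fp_a, fp_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Hypothetical substructure fingerprints for four compounds.
fingerprints = {
    "compound 1": {3, 17, 42, 88, 120, 256},
    "compound 2": {3, 17, 42, 88, 120, 300},
    "compound 3": {3, 17, 42, 88, 120, 256, 300},
    "compound 4": {5, 9, 511},
}
edges = similarity_edges(fingerprints, threshold=0.7)
```

The resulting edge list can be passed to any generic network analysis tool; lowering the threshold yields denser networks at the cost of less specific links.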

MetCirc, designed for the annotation of MS/MS features in untargeted metabolomics data, visualises the spectral similarity matrix (e.g., based on the normalised dot product) between MS/MS spectra in a Circos-like interactive shiny application. Within the shiny application, similarity scores can be thresholded, and MS/MS spectra can be interactively explored and annotated based on expert knowledge, given the similarity score and displayed spectral features. MetCirc relies on the MSnbase framework to store MS/MS spectral data and to calculate similarities between spectra. Similarly, CluMSID employs spectral similarity matching to guide annotation of MS/MS spectra and incorporates functionality to calculate correlation networks and to perform hierarchical and density-based clustering. compMS2Miner is another R package for MS/MS feature annotation and offers functionality for noise filtering, MS/MS substructure annotation, calculation of correlation- and spectral similarity-based networks and interactive visualisation.
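The idea behind a dot-product-based spectral similarity can be illustrated with a small sketch. The version below is the plain cosine of two intensity vectors after binning fragment m/z values; it deliberately omits the m/z and intensity weighting and the peak alignment that individual packages may apply, so it should not be read as the scoring function of any of them. The bin width and the example spectra are hypothetical.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (mz, intensity) peaks that fall into the same m/z bin."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def normalised_dot_product(peaks_a, peaks_b, bin_width=0.01):
    """Cosine similarity of two binned spectra (0 = no shared bins, 1 = same pattern)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical MS/MS spectra as lists of (m/z, intensity) pairs.
spectrum_1 = [(85.03, 120.0), (127.04, 560.0), (145.05, 1000.0), (163.06, 310.0)]
spectrum_2 = [(85.03, 90.0), (127.04, 610.0), (145.05, 950.0), (181.07, 200.0)]
score = normalised_dot_product(spectrum_1, spectrum_2)
```

Fixed-width binning can separate two nearly identical m/z values that lie on either side of a bin boundary; tolerance-based peak matching avoids this at the cost of a slightly longer implementation.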

Molecular networking starting from MS/MS data can enhance the annotation of metabolites., implemented in R, JavaScript and Python (available via a web interface on http://metdna.zhulab.cn ), combines MSand MS/MS data to putatively annotate features in metabolomics data sets [ 264 ].uses a metabolic reaction network-based recursive algorithm for metabolite annotation employing spectral matching of MS/MS spectra in an automatic fashion. The iterated application of similarity matching between reaction pairs, a substrate metabolite with its product metabolite displaying similar chemical structures, allows the expansion of annotation using seed metabolites or previously annotated metabolites.

As mentioned above in Section 2.2 , a major challenge in metabolomics is metabolite annotation, spanning the annotation of known compounds (dereplication) or annotation of unknown metabolites and proposing hypotheses of their structures. Network and pathway analysis can be employed to putatively annotate metabolites in metabolomics data sets. The Bioconductor packageaims at facilitating detection and putative annotation of unknown MSfeatures in untargeted metabolomic studies.infers networks by using an ensemble of statistical associations between intensity values across samples and structural information (mass difference matching between features to a list of enzymatic transformation, retention time adjustment) to infer metabolic networks and guide the annotation of especially specialised metabolites of plant, fungi or bacteria samples. Another package for improving annotation is the package, which incorporates a multi-criteria scoring algorithm to annotate mass features into different confidence levels.uses coelution, pathway level correlations, correlation and KEGG [ 258 260 ], HMDB, Toxin and Toxin Target Database (T3DB) [ 261 262 ], LipidMaps [ 263 ] and ChemSpider [ 246 ] for annotation and incorporates several filter steps, e.g., by defining modules of co-expressingfeatures using WGCNA and a topological overlap-based dissimilarity matrix and thereby categorising related metabolites into the same network modules.
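Mass difference matching, as referred to above, links two features when their m/z difference corresponds to a known biochemical transformation. The following sketch shows the principle only: the transformation list is a small illustrative subset with monoisotopic mass shifts, the feature identifiers and m/z values are hypothetical, and all features are assumed to be singly charged ions of the same adduct type. The statistical association and retention time criteria mentioned in the text are not included.

```python
# Monoisotopic mass shifts (Da) of a few common transformations (illustrative subset).
TRANSFORMATIONS = {
    "hydroxylation (+O)": 15.994915,
    "methylation (+CH2)": 14.015650,
    "hexosylation (+C6H10O5)": 162.052824,
}


def match_mass_differences(features, transformations=TRANSFORMATIONS, tolerance_da=0.005):
    """Return (lighter, heavier, transformation) for all feature pairs whose
    m/z difference matches a transformation within `tolerance_da`."""
    ordered = sorted(features.items(), key=lambda item: item[1])
    edges = []
    for i, (name_a, mz_a) in enumerate(ordered):
        for name_b, mz_b in ordered[i + 1:]:
            delta = mz_b - mz_a
            for label, shift in transformations.items():
                if abs(delta - shift) <= tolerance_da:
                    edges.append((name_a, name_b, label))
    return edges


# Hypothetical features (identifier -> m/z).
features = {"F1": 303.0499, "F2": 317.0656, "F3": 465.1028}
for lighter, heavier, label in match_mass_differences(features):
    print(f"{lighter} -> {heavier}: {label}")
```

An absolute tolerance in Da is used here for simplicity; a relative (ppm) tolerance is a common alternative for high-resolution data.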

The R environment offers a general infrastructure for network analysis. Functionality is implemented in a plethora of software packages, among othersor thesuite. These packages offer functions to generate networks from respective data input (e.g., adjacency matrices), to analyse networks, calculate network properties and to visualise networks. Generally, any kind of metabolomics data that can be converted to an interpretable format for one of these packages can be analysed by generic network analysis tools. For example,offers functionality to calculate similarity scores between MS/MS spectral data that can be readily interpreted as a spectral similarity network (see [ 257 ] for the pioneering work of mass spectral molecular networking for biological systems). Such networks can be analysed by the functions provided by the above-mentioned packages or by packages tailored more towards the analysis of biological data (e.g.,). Specifically interesting for metabolomics applications is, an R package to compare correlation networks from two different experimental conditions that builds on an association measure such as Pearson’s correlation coefficient to identify distinctive properties.enables testing of differential correlation of high-dimensional data sets by identifying the first principal component-based ‘eigen-molecules’ in the correlation networks.then tests these differential correlation values based on Fisher’s z-transformation to identify discriminating metabolite pairs that show different response to conditions. Another R package, more tailored towards the analysis of metabolomics data, is, which creates correlation-based networks from metabolite concentration data and analyses the networks based on graph spectra (group of eigenvalues in an adjacency matrix), spectral entropy, degree distribution and node centralities.also allows for KEGG pathway visualisation of metabolite data.
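The test for differential correlation based on Fisher's z-transformation can be written down in a few lines. The sketch below implements the standard two-sample z-test for independent Pearson correlation coefficients and is not the code of the package described above; the correlation values and sample sizes in the example are hypothetical.

```python
from math import atanh, erfc, sqrt


def differential_correlation(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent Pearson correlations.

    r1, r2 : correlation of the same metabolite pair in condition 1 and 2 (|r| < 1)
    n1, n2 : number of samples in each condition (each must exceed 3)
    Returns the z statistic and the two-sided p-value.
    """
    z1, z2 = atanh(r1), atanh(r2)  # Fisher's z-transformation
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical metabolite pair: strongly correlated in controls, not in cases.
z, p = differential_correlation(r1=0.8, n1=30, r2=0.1, n2=30)
print(f"z = {z:.2f}, p = {p:.4f}")
```

When applied to all metabolite pairs of a data set, the p-values need to be adjusted for multiple testing.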

The R environment offers packages to analyse networks of metabolomics data and metabolic pathways (see Table 8 ). Within this section, we refer to a ‘pathway’ as a linked series of chemical reactions between molecules, conveyed by enzymes that lead to a product or change in a cell. These molecules are also known as metabolites and transformations occur in the same cellular compartment or in close vicinity. The term ‘network’ refers to the entity of metabolites that are connected biologically, chemically or structurally (e.g., similarity between MS/MS spectra of two metabolites), functionally or by any other measure (e.g., statistically correlated).

2.8. Multifunctional Workflows

When dealing with non-targeted metabolomics data sets, data processing represents a key step for obtaining meaningful and consistent results. While the type and number of data processing methods may vary according to the experimental design and aim of the study, some key steps can be identified that are common to most metabolomics experiments. For this reason, several multifunctional R-based workflows have been developed over the years. A key advantage of using multifunctional workflows is that most of the functions the user needs are available within the same “environment”, so that the data do not have to be reformatted to comply with functions in other packages. In this respect, a quite common backbone of R workflows consists of a pre-processing step that generates an R object that can be used as an argument for different functions. Another advantage is that, in most cases, workflows allow a certain degree of flexibility so that functionalities can be used as standalone functions (modular workflows) to better comply with the user’s needs. The packages covering larger parts of metabolomics workflows available in R are listed in Table 9.

These multifunctional packages include comprehensive workflows that focus on multiple aspects, such as data pre-processing, data validation, preliminary statistical analysis and data visualisation of large metabolomics datasets. The considered workflows support both MS-based data (LC-MS and GC-MS) and data generated by different analytical platforms. MAIT (Metabolite Automatic Identification Toolkit) offers pre-processing, annotation, statistical analysis and data visualisation. It relies on xcms for peak picking and on CAMERA for the preliminary annotation. In addition to CAMERA, the peak annotation process includes a functionality that relates in-source mass losses to specific biotransformations. Human biotransformations are already included; additional biotransformation criteria can be added by the end user. MAIT also provides several statistical tools and visual representations (e.g., PCA, boxplot, PLS), as well as a function to perform identifications using accurate mass search in HMDB. MetMSLine shows some similarities with MAIT in terms of processing stages (xcms-based pre-processing, multivariate statistics, metabolite identification). Functionalities characterising MetMSLine include normalisation, signal drift correction using a smoothing method, noise transformation and outlier removal. SimExTargId is a wrapper of different software and R packages for LC-MS data. It includes tools for data conversion (ProteoWizard), peak picking and annotation (xcms and CAMERA), outlier detection and data correction (MetMSLine), and basic statistical analysis. A special feature of SimExTargId, aimed at metabolomics core facilities, is the real-time monitoring of the different workflow stages; users are notified by email in case of processing errors (e.g., outlier detection, signal drift). mzMatch is slightly different from the above-mentioned workflows and is designed to fit in a broader processing pipeline itself. The project also includes a dedicated file format (peakML) and a Java environment. The different modules can still be used independently. mzMatch supports peak picking and grouping using xcms, reproducibility calculation and data normalisation. The peakMonitor app identifies peaks using the local database. The identification is performed on the basis of m/z and retention time values with user-defined mass accuracy and retention time deviation values.
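Identification on the basis of m/z and retention time with user-defined tolerances amounts to a two-dimensional window search against a reference list. The sketch below shows this logic in general terms; it is not the implementation used by any of the workflows above, and the database entries, peak values and default tolerances (5 ppm, 10 s) are hypothetical.

```python
def ppm_error(measured_mz, reference_mz):
    """Relative mass error of a measured m/z in parts per million."""
    return (measured_mz - reference_mz) / reference_mz * 1e6


def match_peaks(peaks, database, ppm_tolerance=5.0, rt_tolerance=10.0):
    """Match peaks against a reference list by m/z (ppm) and retention time (s).

    peaks    : iterable of (peak_id, mz, rt)
    database : iterable of (compound, reference_mz, reference_rt)
    Returns (peak_id, compound, ppm error, RT difference) for every match.
    """
    hits = []
    for peak_id, mz, rt in peaks:
        for compound, ref_mz, ref_rt in database:
            ppm = ppm_error(mz, ref_mz)
            rt_diff = rt - ref_rt
            if abs(ppm) <= ppm_tolerance and abs(rt_diff) <= rt_tolerance:
                hits.append((peak_id, compound, ppm, rt_diff))
    return hits


# Hypothetical local database and measured peaks.
database = [("compound A", 180.0634, 60.0), ("compound B", 255.2330, 410.0)]
peaks = [
    ("P1", 180.0636, 62.0),   # within both tolerances of compound A
    ("P2", 180.0634, 120.0),  # m/z fits, retention time does not
    ("P3", 180.0700, 61.0),   # retention time fits, m/z does not
]
matches = match_peaks(peaks, database)
```

A match of this kind remains a putative annotation; several database entries can fall within the same window, in which case all candidates are returned.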

MetaDB is built by integrating the metaMS R package into a web application written in Grails. It has also been designed to be integrated with the MetaboLights database. MetaDB supports both LC-MS and GC-MS datasets and offers a wide range of functionalities, including data storage and metadata management (using the ISA-Tab format and ISACreator tool [ 300, 301 ]), peak picking and annotation (via metaMS, an xcms and CAMERA add-on) and QC plots.

MStractor is designed for non-expert users to carry out non-targeted data processing on LC-MS experiments. It gathers xcms and CAMERA functions in a user-friendly pipeline, requiring minimal input and providing graphical QC outputs throughout the workflow. It also includes a manual peak curation step and the possibility of calculating descriptive statistics for each sample class.

patRoon is an interface for different MS-based open source software for non-targeted data processing. patRoon covers different aspects of metabolomics workflows, such as file conversion to open data formats (mzXML and mzML), feature extraction and grouping (using several open software and the R packages xcms , OpenMS , enviPick ), extraction of MS and MS/MS data ( mzR ), component generation ( RAMClustR , CAMERA , nontarget ), formula calculation (GenForm) and compound identification through automatic annotation of MS/MS spectra ( MetFrag and SIRIUS with CSI:FingerID). Other functionalities include (interactive) visualisation and reporting of workflow data, comparison and combining results from different workflow algorithms and several data reduction and selection strategies.

specmine provides a general framework that addresses a variety of different analytical platforms, such as LC-MS, GC-MS, NMR, IR and UV-Vis. The package supports many data formats and includes the possibility of adding metadata in a tabular format. It relies on xcms for LC-MS and GC-MS data pre-processing, on hyperSpec for NMR, IR and UV-Vis data processing and on MAIT for metabolite identification. specmine provides scripts for missing value imputation, univariate and multivariate statistics and machine learning methods. Several case studies are available for testing purposes.

mQTL.NMR is a package specifically for the systematic analysis of 1H NMR metabolomics in quantitative genetics. The package mainly focuses on NMR spectral data pre-processing (normalisation, scaling and peak alignment), mQTL mapping in different model organisms, structural assignment of marker metabolites, and result visualisation.

enviMass is a comprehensive workflow for the data-mining of LC-MS and GC-MS datasets, which also supports MS/MS experiments. It provides the user with a graphical user interface (GUI) and a flexible workflow structure covering common processing steps such as data conversion, peak picking, noise removal, mass re-calibration, data normalisation, and blank subtraction. It also offers several more specific and advanced functionalities, including isotopologue and adduct grouping, homologous series detection and visualisation, estimation of atom counts for nontarget components, temporal sequences, profile trend detection and processing of both data-dependent and data-independent acquisition of MS/MS experiments. RMassScreening is a workflow for batch processing of LC-HRMS datasets using a script interface, YAML-based setting configuration and visual interactive data evaluation. It provides wrappers for script-based usage of enviPick and basic enviMass components, and implements suspect screening and combinatorial prediction of possible metabolites (transformation products) from parent compounds. A GUI provides facilities to analyse the results, grouped by sample groups and experimental timepoints, by applying freely adjustable filters.

MetaboNexus is an interactive data analysis platform for metabolomics experiments, which provides a user-friendly R shiny-based GUI designed to work without the need for web server connections. It allows pre-processing (using xcms and MZmine), data scaling, univariate and multivariate statistics (t-test, ANOVA, PCA, PLS-DA, Random Forest, heatmap), putative metabolite identification (library matching of MS and MS/MS adduct with METLIN, HMDB and MassBank databases), and several functions for data visualisation.

Table 9. R packages with multifunctional workflows.