Eden, E., Navon, R., Steinfeld, I., Lipson, D., and Yakhini, Z. (2009). GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists.

Copy-number variations (CNVs) were identified from the reference-based alignments. Initial read depth profiles were obtained for each isolate as the average read depth in non-overlapping windows of 1000 bp. In 68 samples (BE044-BE102, LA002, SP008-SP011, NA005-NA007, WI019), a deviation in read depth was detected: instead of fluctuating around a constant line, the read depth profile showed a convex trend, with high depth at the terminal regions of the chromosomes that gradually decreased toward the center. These samples also showed high local variance. This coverage bias is hereafter referred to as the “smiley pattern.” Because conventional CNV detection methods rely on read depth as a proxy for copy number, they were not applicable to the “smiley pattern” strains. To address this problem, a custom algorithm named Splint (available upon request) was developed, which instead measures the size of discontinuities in read depth using a discontinuous spline regression technique. In Splint, the data were modeled as the product of the bias and the copy number of each region, plus an error term. The bias was assumed to be a continuous curve (expected depth as a function of chromosomal location) and was modeled as a smoothing spline. The copy number, on the other hand, is a piecewise constant function with discontinuities at the breakpoints between regions of constant copy number; it was modeled as a sum of indicator functions, one for each region. After regression, the fitted coefficient of each indicator function is proportional to the copy number in the corresponding region. The regression method requires the locations of the discontinuities as input. Initially, these were located roughly by comparing the 50 kb regions to the left and right of each 1000 bp window.
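The windowing and the left/right comparison described above can be sketched in Python as follows. This is a minimal illustration, not the Splint code (which is available only upon request): the function names, the sign convention (right minus left), dropping a trailing partial window, and returning NaN for windows without a full flank on both sides are all assumptions made here.

```python
import numpy as np


def window_mean_depth(per_base_depth, window=1000):
    """Mean read depth in non-overlapping windows of `window` bp.

    Assumption: a trailing partial window is dropped.
    """
    per_base_depth = np.asarray(per_base_depth, dtype=float)
    n_windows = per_base_depth.size // window
    trimmed = per_base_depth[: n_windows * window]
    return trimmed.reshape(n_windows, window).mean(axis=1)


def left_right_median_difference(depth, flank=50):
    """Median depth of the right flank minus median depth of the left flank.

    depth: mean read depth per window along one chromosome.
    flank: number of windows per side (50 windows of 1000 bp = 50 kb).
    Assumption: windows lacking a full flank on either side get NaN.
    """
    depth = np.asarray(depth, dtype=float)
    n = depth.size
    score = np.full(n, np.nan)
    for i in range(flank, n - flank):
        left = depth[i - flank : i]            # windows to the left of window i
        right = depth[i + 1 : i + 1 + flank]   # windows to the right of window i
        score[i] = np.median(right) - np.median(left)
    return score
```

On a profile that steps from one constant depth to another, the score is zero far from the step and equals the size of the step in the windows around it.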
If the difference between the median depths of the left and right regions is small, the window is unlikely to contain a copy-number breakpoint; if the difference is large, it may contain one. This measure was smoothed (by moving average) and corrected for linear bias by subtracting a linear trendline. Peaks exceeding, in absolute value, 2.5 times the sample-wide median were annotated as breakpoints. However, this method gives only rough coordinates of discontinuities, delimiting large regions of constant copy number. After this rough estimation, an initial regression was run, and a hidden Markov model (HMM) was used to find regions where the regressed values differed significantly from the data. The HMM takes the deviation of the estimated curve from the data as input signals (15% greater than, 15% less than, or approximately equal) and aggregates high densities of deviant signals into output states (under-estimation, over-estimation, or correct estimation of copy number); better results were obtained when a special state was reserved for total deletions. Windows where the state changes were treated as likely breakpoints. The regression and HMM were re-evaluated until no more deviating regions could be found. The regression coefficients of the piecewise constant function in the final regression are proportional to the copy number in the corresponding regions, but the proportionality constant depends on the shape and scale of the continuous (spline) factor in the regression, which differs between chromosomes. The spline is constrained so that its value is always 1 at the left telomere of each chromosome. Using the regression coefficients of the piecewise function as a proportional proxy for copy number therefore implicitly assumes that the bias at the left telomere is the same for every chromosome.
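In symbols, the regression model and its normalization can be written as below. The notation is introduced here for illustration and is not taken from the Splint implementation: $y_i$ is the read depth in window $i$ at chromosomal location $x_i$, $s$ is the smoothing spline describing the bias, $R_k$ are the $K$ regions of constant copy number on the chromosome, $\beta_k$ are the fitted coefficients, and $\varepsilon_i$ is the error.

```latex
\[
  y_i \;=\; s(x_i)\,\sum_{k=1}^{K} \beta_k\,\mathbf{1}\!\left[x_i \in R_k\right] \;+\; \varepsilon_i,
  \qquad s(x_{\mathrm{left\ telomere}}) = 1 .
\]
```

If the true bias on a chromosome is $\lambda\, s(x)$, with $\lambda$ its value at the left telomere, the constraint forces the factor $\lambda$ into the coefficients, so each $\beta_k$ is proportional to $\lambda$ times the copy number of $R_k$. The coefficients are therefore comparable across chromosomes only if $\lambda$ is the same for all of them.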
The smiley pattern was observed to be generally similar on both sides of the chromosomes, so the regression was repeated with the spline value set to 1 at the right telomere, and the means of the two sets of regression coefficients were used to estimate the copy number. Splint was run using windows of 1000 bp and 500 bp. Shorter windows give higher resolution of the CNV calls at the cost of an increased rate of false-positive calls. Because results that depend on the window size were not deemed robust, only CNVs found in both the 1000 bp and 500 bp analyses were used in the final results. Functional enrichment analysis of CNV-driven genes was carried out using the GOrilla tool (Eden et al., 2009), with the complete set of S. cerevisiae genes as the reference. GO terms with a Benjamini-Hochberg false discovery rate (FDR)-adjusted q ≤ 0.05 were considered significant.
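For readers unfamiliar with the adjustment behind the q ≤ 0.05 cutoff, the standard Benjamini-Hochberg procedure can be sketched as follows. This is a generic illustration with a hypothetical function name; GOrilla reports its own adjusted values, so this code is not needed to reproduce the analysis and may differ in detail from what the tool computes.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, ascending
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i for rank i
    # Enforce monotonicity: each q-value is the minimum of the scaled values
    # at its own rank and at all higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(scaled, 1.0)           # cap at 1 and restore input order
    return q
```

A term would then be called significant when its entry in `benjamini_hochberg(p_values)` is at most 0.05.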