Original Blog Post

I've made a follow-up analysis of this paper, calculating whether fitness genes are conserved across species. The original post is available in my blog: http://bioinfoblog.it/2015/12/are-fitness-genes-more-conserved-my-first-30-minutes-attempt/ but I'll copy and paste it here as well. However some images may not be visualized properly here.

Are fitness genes more conserved across mammals?

A recently published paper by Hart et al presented a genome-wide CRISPR screening to identify fitness genes (a superset of essential genes) in five cell lines. The paper is quite impressive and shows the potentiality of CRISPR to generate large scale knockouts and to characterize the importance and function of genes in different conditions.

In the discussion the authors propose that fitness genes are more likely to be more conserved across species. However they do not follow-up on this hypothesis, probably for lack of space. They can’t be blamed as they already present a lot of results in the paper.

Distribution of conservation scores in the phastcons.100way.UCSC.hg19 track. Are essential genes more conserved than other genes? Distribution of conservation scores in the human genome. Are essential genes more conserved than other genes? This post presents a follow-up analysis on the hypothesis that fitness genes are more conserved than non-essential genes. I’ll take the original data from the paper, get the conservation scores from bioconductor data packages, and do a Wilcoxon test to compare the two distribution. The full code is available as a github repository, and please feel free to contribute if you want to do some free R/Bioconductor analysis.

Getting the data

Like most bioinformatics pipelines, half of the time is spent in preparing the data and importing other datasets. Luckily with the BioConductor data packages we can access a wide range of datasets without many efforts.

1. loading data & calculating number of rows

The starting point is table S3 from the paper, containing a list of 17,661 gene symbols, along with a bayesian score indicating whether gene is a “fitness gene” (essential) in a given cell line. Unfortunately the table doesn’t contain any gene id, so we will have to convert the gene symbols manually.

The data can be read using the xlsx library. In this analysis I will make an heavy use of dplyr and ggplot2, so let’s load these packages as well:

# reading initial data > library(dplyr) > library(ggplot2) > library(xlsx) > mmc3.raw = read.table("data/mmc3.xlsx", 1) > mmc3.raw %>% nrow [1] 17661 > mmc3.raw %>% summarise(n_distinct(Gene)) n_distinct(Gene) 1 17649 # reading initial data > library(dplyr) > library(ggplot2) > library(xlsx) > mmc3.raw = read.table("data/mmc3.xlsx", 1) > mmc3.raw %>% nrow [1] 17661 > mmc3.raw %>% summarise(n_distinct(Gene)) n_distinct(Gene) 1 17649 The last command highlights a possible problem with the data. There are 17,661 rows in the table, but only 17,649 are unique! Thus, some gene symbols are duplicated. To find them, we can combine filter() with duplicated(): ` # duplicated gene symbols > mmc3.raw %>% filter(duplicated(Gene) | duplicated(Gene, fromLast=T)) %>% dplyr::select(Gene) Gene 1 FFAR2 2 FFAR2 3 HBD 4 HBD 5 01-Mar 6 02-Mar 7 01-Mar 8 02-Mar 9 MMD2 10 MMD2 11 OFCC1 12 OFCC1 13 SFPQ 14 SFPQ 15 TEC 16 TEC # duplicated gene symbols > mmc3.raw %>% filter(duplicated(Gene) | duplicated(Gene, fromLast=T)) %>% dplyr::select(Gene) Gene 1 FFAR2 2 FFAR2 3 HBD 4 HBD 5 01-Mar 6 02-Mar 7 01-Mar 8 02-Mar 9 MMD2 10 MMD2 11 OFCC1 12 OFCC1 13 SFPQ 14 SFPQ 15 TEC 16 TEC

This also shows another problem with the data: it seems that some of the symbols have been converted to dates in excel (01-Mar, 02-Mar). We will fix these later.

2. converting symbols to entrez

We will start by loading the Homo.sapiens package, a container for many other packages which we will use in the analysis. See my tutorial on working with Genomics Data for more information.

The table for the symbol 2 entrez conversion is org.Hs.egSYMBOL2EG. There are many ways to use this data package but I prefer to convert it to a dataframe and do a left join:

# converting symbols to entrez (attempt 1) mmc3 = mmc3.raw %>% left_join(as.data.frame(org.Hs.egSYMBOL2EG), by=c("Gene"="symbol")) > mmc3 %>% head Gene BF_hct116 BF_hela BF_gbm BF_rpe1 BF_dld1 BF_a375_GeCKo 1 A1BG -73.388 -28.842 -3.132 -30.994 -8.070 -3.433 2 A1CF -52.561 -42.187 -24.579 -21.692 -10.265 -19.307 3 A2M -105.736 -42.970 -9.507 -31.308 -14.764 -7.414 4 A2ML1 -78.033 -64.316 -34.891 -41.933 -16.825 -15.788 5 A4GALT -60.333 -22.806 -2.786 -2.162 -1.841 -8.833 6 A4GNT -82.131 -59.956 -14.539 0.083 -8.300 -6.742 BF_hct116_shRNA numTKOHits gene_id 1 -20.387 0 1 2 -16.635 0 29974 3 -15.358 0 2 4 -32.065 0 144568 5 -8.323 0 53947 6 -20.125 0 51146 # converting symbols to entrez (attempt 1) mmc3 = mmc3.raw %>% left_join(as.data.frame(org.Hs.egSYMBOL2EG), by=c("Gene"="symbol")) > mmc3 %>% head Gene BF_hct116 BF_hela BF_gbm BF_rpe1 BF_dld1 BF_a375_GeCKo 1 A1BG -73.388 -28.842 -3.132 -30.994 -8.070 -3.433 2 A1CF -52.561 -42.187 -24.579 -21.692 -10.265 -19.307 3 A2M -105.736 -42.970 -9.507 -31.308 -14.764 -7.414 4 A2ML1 -78.033 -64.316 -34.891 -41.933 -16.825 -15.788 5 A4GALT -60.333 -22.806 -2.786 -2.162 -1.841 -8.833 6 A4GNT -82.131 -59.956 -14.539 0.083 -8.300 -6.742 BF_hct116_shRNA numTKOHits gene_id 1 -20.387 0 1 2 -16.635 0 29974 3 -15.358 0 2 4 -32.065 0 144568 5 -8.323 0 53947 6 -20.125 0 51146

The left_join command added a new column called gene_id, containing the entrez id of each gene.

At this point we should check whether all the id conversions occurred correctly. In principle if we had a 1:1 correspondence between gene symbol and entrez, we would expect to find 17,649 unique symbols (as we saw previously), and 17,649 entrez. However this doesn’t seem to be the case:

# count of unique gene symbols and entrez > mmc3 %>% summarise(n_distinct(Gene), n_distinct(gene_id)) n_distinct(Gene) n_distinct(gene_id) 1 17649 17326 # count of unique gene symbols and entrez > mmc3 %>% summarise(n_distinct(Gene), n_distinct(gene_id)) n_distinct(Gene) n_distinct(gene_id) 1 17649 17326

To identify which genes have not been converted correctly, we can just subset the rows where gene_id is NA. To improve readability, we can select only the symbol column, transpose it, and paste it to get a list of symbols that are not converted:

# symbols that failed the conversion to entrez on the first attempt > mmc3 %>% filter(is.na(gene_id)) %>% dplyr::select(Gene) %>% t %>% paste(collapse=", ") [1] "ABP1, ACN9, ..., C10orf112, ..., CXorf30, CXorf48, 01-Dec, DOM3Z, ..., LSMD1, 01-Mar, 02-Mar, .., 09-Mar, MEF2BNB, ..., 01-Sep, ..., ZFYVE20, ZNF259, ZSCAN5C" # symbols that failed the conversion to entrez on the first attempt > mmc3 %>% filter(is.na(gene_id)) %>% dplyr::select(Gene) %>% t %>% paste(collapse=", ") [1] "ABP1, ACN9, ..., C10orf112, ..., CXorf30, CXorf48, 01-Dec, DOM3Z, ..., LSMD1, 01-Mar, 02-Mar, .., 09-Mar, MEF2BNB, ..., 01-Sep, ..., ZFYVE20, ZNF259, ZSCAN5C"

The output shows that there are a lot of Orf genes, that some gene symbols have been converted to dates, and that other symbols are not identified. The latter must be un-official gene symbols used by the authors instead of the official Hugo symbols.

These problems can be solved by manually renaming the “date” genes, and by using the org.Hs.egALIAS table instead of org.Hs.egSYMBOL. This will however generate duplicated ids, which will have to filter out:

# converting symbols to entrez (final attempt) > mmc3 = mmc3.raw %>% mutate(Gene=gsub("\\d(\\d)-Mar", "MARCH\\1", Gene)) %>% mutate(Gene=gsub("(\\d\\d)-Sep", "SEP\\1", Gene)) %>% mutate(Gene=gsub("(\\d\\d)-Dec", "DEC\\1", Gene)) %>% left_join(as.data.frame(org.Hs.egALIAS2EG), by=c("Gene"="alias_symbol")) %>% filter(!duplicated(Gene)) > mmc3 %>% summarise(n_distinct(Gene), n_distinct(gene_id)) %>% print n_distinct(Gene) n_distinct(gene_id) 1 17648 17352 > mmc3.final = mmc3 %>% filter(!duplicated(gene_id)) # converting symbols to entrez (final attempt) > mmc3 = mmc3.raw %>% mutate(Gene=gsub("\\d(\\d)-Mar", "MARCH\\1", Gene)) %>% mutate(Gene=gsub("(\\d\\d)-Sep", "SEP\\1", Gene)) %>% mutate(Gene=gsub("(\\d\\d)-Dec", "DEC\\1", Gene)) %>% left_join(as.data.frame(org.Hs.egALIAS2EG), by=c("Gene"="alias_symbol")) %>% filter(!duplicated(Gene)) > mmc3 %>% summarise(n_distinct(Gene), n_distinct(gene_id)) %>% print n_distinct(Gene) n_distinct(gene_id) 1 17648 17352 > mmc3.final = mmc3 %>% filter(!duplicated(gene_id))

In the end we have 17,648 gene symbols corresponding to 17,352. For simplicity we remove all the duplicated entrez, and take only one per entrez.

2. Getting the conservation scores

The next step is to get the conservation scores for all human genes. My first thought was to use AnnotationHub to get the scores from UCSC. However, it seems it’s not there:

# searching for conservation tracks in AnnotationHub > library(AnnotationHub) > ahub = AnnotationHub() > ahub %>% query(c("hg19", "cons")) # searching for conservation tracks in AnnotationHub > library(AnnotationHub) > ahub = AnnotationHub() > ahub %>% query(c("hg19", "cons"))

The reason why the conservation scores are not in Annotation Hub is that there is a separate package for them. Thanks to Robert Castelo from UPF we can get the scores from the package phastCons100way.UCSC.hg19:

# get coordinates and conservation scores for all genes > allgenes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene) %>% subset(gene_id %in% mmc3$gene_id) > allgenes$cons100way = scores (phastCons100way.UCSC.hg19, allgenes) > allgenes GRanges object with 17095 ranges and 2 metadata columns: seqnames ranges strand | gene_id cons100way <Rle> <IRanges> <Rle> | <character> <numeric> 1 chr19 [ 58858172, 58874214] - | 1 0.06897089 10 chr8 [ 18248755, 18258723] + | 10 0.25185185 100 chr20 [ 43248163, 43280376] - | 100 0.17251561 1000 chr18 [ 25530930, 25757445] - | 1000 0.04142651 10000 chr1 [243651535, 244006886] - | 10000 0.10477027 ... ... ... ... ... ... ... 999 chr16 [ 68771195, 68869444] + | 999 0.12147190 9990 chr15 [ 34522197, 34630265] - | 9990 0.09580371 9991 chr9 [114979995, 115095944] - | 9991 0.23728249 9992 chr21 [ 35736323, 35743440] + | 9992 0.15518753 9997 chr22 [ 50961997, 50964905] - | 9997 0.10095944 ------- seqinfo: 93 sequences (1 circular) from hg19 genome # get coordinates and conservation scores for all genes > allgenes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene) %>% subset(gene_id %in% mmc3$gene_id) > allgenes$cons100way = scores (phastCons100way.UCSC.hg19, allgenes) > allgenes GRanges object with 17095 ranges and 2 metadata columns: seqnames ranges strand | gene_id cons100way <Rle> <IRanges> <Rle> | <character> <numeric> 1 chr19 [ 58858172, 58874214] - | 1 0.06897089 10 chr8 [ 18248755, 18258723] + | 10 0.25185185 100 chr20 [ 43248163, 43280376] - | 100 0.17251561 1000 chr18 [ 25530930, 25757445] - | 1000 0.04142651 10000 chr1 [243651535, 244006886] - | 10000 0.10477027 ... ... ... ... ... ... ... 999 chr16 [ 68771195, 68869444] + | 999 0.12147190 9990 chr15 [ 34522197, 34630265] - | 9990 0.09580371 9991 chr9 [114979995, 115095944] - | 9991 0.23728249 9992 chr21 [ 35736323, 35743440] + | 9992 0.15518753 9997 chr22 [ 50961997, 50964905] - | 9997 0.10095944 ------- seqinfo: 93 sequences (1 circular) from hg19 genome

The first instruction gets the coordinates of all human genes from the TxDb object, imported when we loaded Homo.sapiens. We also filter this object to retain only the genes tested in the paper.

The second instruction gets the conservation scores across 100 species, from the phastCons package.

Finally, we print the allgenes object to see if everything went well. Notice that only 17,095 of the 17,352 unique genes have coordinates – the rest are probably withdrawn genes or genes with no annotation.

3. Are core fitness genes more conserved across species?

Now that we have the correct entrez id and the scores, we can join them in a dataframe and start doing some analysis.

First, let’s create a dataframe with all the information:

> allgenes.df = allgenes %>% mcols %>% as.data.frame %>% left_join(mmc3, by="gene_id")<span class="pl-s"> %>% mutate(gene_type = ifelse(numTKOHits>0, "fitness", "nonfitness")) > allgenes.df %>% head gene_id cons100way Gene BF_hct116 BF_hela BF_gbm BF_rpe1 BF_dld1 1 1 0.06897089 A1BG -73.388 -28.842 -3.132 -30.994 -8.070 2 10 0.25185185 NAT2 -17.494 -12.672 -6.814 -10.237 -4.267 3 100 0.17251561 ADA -151.203 -42.485 -29.426 -28.761 -9.064 4 1000 0.04142651 CDH2 -140.898 -61.958 -25.142 44.041 -17.757 5 10000 0.10477027 AKT3 -100.130 -71.773 -32.805 -9.473 -17.645 6 10001 0.17247231 MED6 52.872 19.795 42.076 31.606 10.981 BF_a375_GeCKo BF_hct116_shRNA numTKOHits 1 -3.433 -20.387 0 2 -2.921 -1.511 0 3 -11.923 -4.137 0 4 -10.849 -24.690 1 5 21.261 -33.525 0 6 5.207 -29.754 5 > allgenes.df = allgenes %>% mcols %>% as.data.frame %>% left_join(mmc3, by="gene_id")<span class="pl-s"> %>% mutate(gene_type = ifelse(numTKOHits>0, "fitness", "nonfitness")) > allgenes.df %>% head gene_id cons100way Gene BF_hct116 BF_hela BF_gbm BF_rpe1 BF_dld1 1 1 0.06897089 A1BG -73.388 -28.842 -3.132 -30.994 -8.070 2 10 0.25185185 NAT2 -17.494 -12.672 -6.814 -10.237 -4.267 3 100 0.17251561 ADA -151.203 -42.485 -29.426 -28.761 -9.064 4 1000 0.04142651 CDH2 -140.898 -61.958 -25.142 44.041 -17.757 5 10000 0.10477027 AKT3 -100.130 -71.773 -32.805 -9.473 -17.645 6 10001 0.17247231 MED6 52.872 19.795 42.076 31.606 10.981 BF_a375_GeCKo BF_hct116_shRNA numTKOHits 1 -3.433 -20.387 0 2 -2.921 -1.511 0 3 -11.923 -4.137 0

4 -10.849 -24.690 1 5 21.261 -33.525 0 6 5.207 -29.754 5 The first command extracted the gene_id and conservation scores from the GenomicRanges object (mcols), transformed it to a data.frame, joined it with the other table to add the bayesian fitness scores, and then added a column to distinguish fitness and non-fitness genes.

We can plot the two distributions using ggplot2:

allgenes.df %>% ggplot(aes(x=gene_type, y=-log(cons100way), fill=gene_type)) + geom_violin() + theme_bw() + ggtitle("conservation scores of fitness and non-fitness genes") allgenes.df %>% ggplot(aes(x=gene_type, y=-log(cons100way), fill=gene_type)) + geom_violin() + theme_bw() + ggtitle("conservation scores of fitness and non-fitness genes")

Distribution of conservation scores for fitness and non-fitness genes. Unfortunately, there is no much difference there! Unfortunately, after all these efforts, it seems that there is not much difference!

To verify this further, we can compare the two distributions with a wilcoxon test:

> allgenes.df %>% wilcox.test(cons100way~gene_type, .) Wilcoxon rank sum test with continuity correction data: cons100way by fitness W = 26793000, p-value = 0.7679 alternative hypothesis: true location shift is not equal to 0 > allgenes.df %>% wilcox.test(cons100way~gene_type, .) Wilcoxon rank sum test with continuity correction data: cons100way by fitness W = 26793000, p-value = 0.7679 alternative hypothesis: true location shift is not equal to 0

Our p-values here is pretty clear. The two distributions are not significantly different!

Conclusions

This post provided an example on how the Bioconductor data packages allow to easily get the coordinates and ids of a set of genes, and intersect them with other annotation tables from other sources.

Unfortunately my analysis on the conservation of fitness genes proved to be inconclusive, as there is no much difference between the two sets. However there are many other things that can be tried, from using other tests to determine sequence conservation (e.g. dN/dS), to restricting the analysis to fewer species. If you have any other idea or proposal, feel free to drop a comment here or to contribute to the github repository.

Hart, T., Chandrashekhar, M., Aregger, M., Steinhart, Z., Brown, K., MacLeod, G., Mis, M., Zimmermann, M., Fradet-Turcotte, A., Sun, S., Mero, P., Dirks, P., Sidhu, S., Roth, F., Rissland, O., Durocher, D., Angers, S., & Moffat, J. (2015). High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities Cell, 163 (6), 1515-1526 DOI: 10.1016/j.cell.2015.11.015

Share this: