October 2nd 2018

Note: a lot of information is in the wrong order due to poor dating of notes. Information will be correctly cited in my dissertation, which will become available after the results are obtained/after graduation. For now, if any information here is incorrect or if you have any questions, please contact me!



Introduction

My project was based on bioinformatics and computational biology. Computational biology uses computational methods to study the biology of a sample; bioinformatics is about creating a solution to a biological problem.

DNA makes us what we are: our eye colour, hair, etc. It is coiled and contains the genetic material of the genome, that is, the set of genes. The coiled molecule of DNA contains two strands (polynucleotides), which are composed of units called nucleotides. Each nucleotide contains one of four nucleobases (bases): cytosine (C), guanine (G), adenine (A), or thymine (T). These bases are paired: G pairs with C and A pairs with T, forming base pairs.

Structure of DNA...

Credit: U.S. National Library of Medicine



BBC Radio 4 came to Aberystwyth to interview researcher and senior lecturer, Arwyn Edwards - podcast available here (1h44 and 2h57). He was assisted by his PhD students, André Soares and Aliyah Debbonaire. After his interview he wrote up his experience, which you can read here.



Edwards used a new sequencing technology, the Nanopore MinION, to obtain long, raw ACGT sequences from soil samples taken in London and Aberystwyth.

Edwards used Kaiju, an online webserver, to generate profiles of the sequenced samples, that is, to classify each read to a species.

In his brief findings, he stated that the London sample had more biodiversity, yet it was contaminated (with reads assigned to acne) by the “unorthodox” container the sample arrived in. Edwards flagged that the Aberystwyth soil was about twice as rich as the London compost.



An example of a sequence... AGGCTACGTGTGCCGTAGCAATATAACATACGA
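To make the base-pairing rule concrete (A pairs with T, G pairs with C), here is a minimal Python sketch that works out the complementary strand of the example sequence above. The function names are my own and are only for illustration.

```python
# Base-pairing rule: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(sequence):
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in sequence.upper())


def reverse_complement(sequence):
    """Return the complementary strand, read in the opposite direction."""
    return complement(sequence)[::-1]


if __name__ == "__main__":
    example = "AGGCTACGTGTGCCGTAGCAATATAACATACGA"
    print(complement(example))
    print(reverse_complement(example))
```

The two strands of DNA run in opposite directions, which is why the reverse complement is usually the more useful of the two outputs.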



In my project, I only handled the Aberystwyth data, from the sample collected from the Vice Chancellor, Elizabeth Treasure. Edwards ran the Nanopore before the event, during/for the event, and afterwards. The podcast-specific data was not included in my project, since Edwards had already done sufficient research on it and it is only a fragment of the whole run. I studied the before and after data-sets.



A total of ~2 million sequences were in our data-set.



Below is a time-line of my project, covering the tasks discussed at each weekly meeting.





pH acidity of soils...

Credit: UK Soil Observatory.

Note: Aberystwyth is in the mid-west of Wales, on the coast.

Last Edited: 13:54 | 2018/10/02

So Aberystwyth is quite acidic compared to London (due to climate)… Acidobacteria is a phylum of bacteria in the bacteria kingdom (see taxonomy ranks here). It was only recognised in 2012, despite being one of the most abundant and diverse phyla in Earth’s soils. It has been observed in mines, soils, and metal-contaminated soils, which is interesting here because there are metal mines near Aberystwyth that have contaminated the Rheidol streams, possibly resulting in metal-contaminated soils.

Amanda Clare, my supervisor, and I discussed looking into a variety of tools for studying Nanopore/long-read data to observe read count, quality, and time-yield plots. We also planned a BLAST job (BLAST “finds regions of local similarity between sequences”) to look into species; a minor issue is that BLAST is quite a slow process.

I used Kaiju and found that Acidobacteria was present, with a major portion classified as “unclassified” (not yet placed in a class group/subdivision). Furthermore, the GC content of the genomes is consistent within their subdivisions (the class ranks in Acidobacteria): for example, the species in subdivision 3 will all have around the same GC content. Subdivisions are also dependent on pH, e.g. at a pH of 4, subdivisions 1, 2, 3, and 13 are more likely to appear.

After meeting 3, we wanted to find a way to extract the sequences which Kaiju classified as Acidobacteria. Although Kaiju provides an output file with the sequence IDs, we can’t tell which of them are Acidobacteria, because each seqID is only paired with a taxonID. This is where the idea of acidoseq came from: take a Kaiju output file and a list of Acidobacteria taxonomy IDs and find the links.

I downloaded the full and partial genomes from NCBI (assembly) and found that the GC content was somewhat consistent in the subdivisions.
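The two ideas above, linking Kaiju sequence IDs to a list of Acidobacteria taxonomy IDs and measuring GC content, can be sketched in a few lines of Python. This is only an illustration and not the code of the package itself: the function names, the taxonomy IDs, and the reads are made up, and it assumes Kaiju's default tab-separated output (classification flag, read name, taxon ID).

```python
import io


def read_taxon_ids(handle):
    """Read one taxonomy ID per line, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}


def matching_read_ids(kaiju_handle, wanted_taxon_ids):
    """Return the IDs of classified reads whose taxon ID is wanted.

    Assumes three tab-separated columns per line:
    C/U flag, read name, taxon ID.
    """
    matches = set()
    for line in kaiju_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue  # skip malformed lines and unclassified reads
        read_id, taxon_id = fields[1], fields[2]
        if taxon_id in wanted_taxon_ids:
            matches.add(read_id)
    return matches


def gc_content(sequence):
    """Return the percentage of G and C bases (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)


if __name__ == "__main__":
    # Made-up stand-ins for a Kaiju output file and a taxonomy ID list;
    # 1000 and 2000 are placeholders, not real taxonomy IDs.
    kaiju_output = io.StringIO("C\tread1\t1000\nU\tread2\t0\nC\tread3\t2000\n")
    wanted = read_taxon_ids(io.StringIO("1000\n"))
    reads = {"read1": "GGCCAT", "read2": "ATATAT", "read3": "GCGCGC"}

    for read_id in sorted(matching_read_ids(kaiju_output, wanted)):
        print(read_id, round(gc_content(reads[read_id]), 1))
```

Keeping the taxonomy IDs in a set makes each lookup cheap, which matters when the Kaiju file has millions of lines.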
After the first week of my package being successful, we decided to expand it further by looking into the GC content of the sequences to see if we could plot the pattern of the subdivisions.

We found that the BLAST job on the 2 million reads took a month to process 400,000 sequences. We thought about filtering the data-set down to 200,000 reads so that a new BLAST job would only take 2 weeks, but we decided to look into an alternative. We found Blast2Go, which runs a BLAST job and looks at the genes in further detail.

The Blast2Go PRO subscription expired after a week, so we were again stuck on how to BLAST the sequences. So rather than a BLAST job to study all species, we used Blast2Go to create a database of Acidobacteria genomes and ran a local BLAST to find the sequences which were identified as Acidobacteria. Regarding acidoseq, we were plotting the AT content to compare with GC (a high AT content can indicate less stable DNA).

Because I could not use the fast BLAST of Blast2Go, I looked into Diamond (BLAST for proteins); a recent update to Diamond makes its XML results compatible with Blast2Go. I added a feature to acidoseq that outputs the subdivisions of sequences which have a particular GC content.

The Aberystwyth University IBERS cluster was slow during this time, so the Diamond job stopped multiple times. Meanwhile, we started to look into assembly: building up the sequences into larger ones.

The assembly job with Miniasm was unsuccessful: because soil is so diverse, the output didn’t build up larger sequences, the largest being 16,000 base pairs long. The Diamond XML did not work with Blast2Go; perhaps the job I ran didn’t have the correct parameters.

We thought to look into another assembly tool, Canu. However, the kernel versions on the compute nodes were out of sync with the head nodes and so out of date. We started to look into command line options for acidoseq.

We filtered the data to a quality score of 12 and a read length of 2,500, which left 89 reads.
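A filter like this can be sketched in plain Python. This is a rough illustration under stated assumptions, not the tool we actually ran: it assumes four-line FASTQ records with Phred+33 quality encoding, treats both numbers as minimum thresholds, and scores a read by the simple average of its Phred values. All function names are hypothetical.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        name = header.strip()[1:]  # drop the leading '@'
        read_id = name.split()[0] if name else ""
        yield read_id, sequence, quality


def mean_quality(quality, offset=33):
    """Arithmetic mean of the Phred scores in a quality string."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_quality=12, min_length=2500):
    """Keep reads that are long enough and have a high enough mean quality."""
    for read_id, sequence, quality in records:
        if len(sequence) >= min_length and mean_quality(quality) >= min_quality:
            yield read_id, sequence, quality


if __name__ == "__main__":
    # Tiny made-up FASTQ; the length threshold is lowered so the demo fits.
    fastq = io.StringIO(
        "@long_good\nACGTACGT\n+\n55555555\n"
        "@long_bad\nACGTACGT\n+\n########\n"
        "@short\nAC\n+\n55\n"
    )
    for read_id, _, _ in filter_reads(parse_fastq(fastq), min_length=4):
        print(read_id)
```

Some long-read tools average the error probabilities rather than the Phred scores themselves, which gives a lower mean, so a sketch like this will not necessarily reproduce the 89 reads.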
We decided to use Blast2Go to finally BLAST this data-set and look into the genes. We found Acidobacteria; however, due to lack of time, we couldn’t explore the genes further. We finally made acidoseq into a package and made it available (previously it was a script run via a terminal, but now command line options are used and the script does the rest). For the next two weeks, my time was mostly focused on writing up my dissertation.

During my final meeting, Amanda and I discussed corrections and she provided great feedback. Three days later, I submitted!

And so it is done! My Masters is completed and it feels great! After submission, I only had 4 days until the start of my PhD. I had such fun with this project that I made a Twitter bot, acidobot, that dispenses facts about Acidobacteria once a day! acidoseq is available on PyPI and GitHub.

I would like to thank Amanda for a fun project, and Edwards plus his team for the intellectual engagement.

tl;dr: project was great! I found an enjoyment for acid (Acidobacteria! Not the illegal substance), and I look forward to my PhD.