October 6th 2017

In early August 2017 I was accepted onto the Data Science Masters course at Aberystwyth University, where I also studied my undergraduate course, Business IT. I was simultaneously excited and nervous, so in preparation I started teaching myself Python (link to first script) & R.

I saw the position for a summer researcher in bioinformatics for an AU student and remembered I have a module next year (second semester 2018) on Statistics for Computational Biology, so I applied, hoping to get useful experience and practical knowledge. I know the basics of bioinformatics: studying molecules through analysing DNA sequences. I'm no expert in coding or in bioinformatics specifically, but I'm more than willing to learn & teach myself.



My understanding of bioinformatics so far (correct me if I'm wrong): DNA contains genes (genetics), which code for proteins (the mechanics of a cell). When a cell wants a protein, it finds that part of the DNA and copies it into RNA, which is then used to build the protein. Proteins are made up of amino acids, and there are 20 of them: 20! (factorial) is about 2.4 quintillion, and since amino acids can repeat within a chain, the number of possible molecules is far larger still. Bioinformatics is the study of these molecules through analysing DNA, RNA and amino acid sequences.
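As a quick check on the numbers in the paragraph above, here is a small Python sketch. The function name `possible_chains` is made up for illustration. It compares 20! (the orderings of 20 amino acids used once each) with the count of chains where amino acids may repeat, which is 20 to the power of the chain length.

```python
import math

AMINO_ACIDS = 20

# 20! counts the ways to order 20 distinct amino acids, each used exactly once.
orderings = math.factorial(AMINO_ACIDS)


def possible_chains(length, alphabet=AMINO_ACIDS):
    """Number of distinct chains of a given length when residues may repeat."""
    return alphabet ** length


print("20! =", orderings)
print("chains of length 10 =", possible_chains(10))
print("chains of length 100 has", len(str(possible_chains(100))), "digits")
```

Real proteins are usually hundreds of amino acids long, so the second count is the one that matters in practice.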

An analysis of current state of the art software on nanopore metagenomic data

which were “just right” for later variant calling and comparison

First gel: success! Image with DNA extracted.

Second gel from some of the extracted DNA: results unfortunately poor.

Third gel: changing the protocols resulted in positive results!

At the BCS event with my Poster!

Last Edited: 18:22 | 2017/10/06

This research project will be interesting. In preparation, I installed Linux Mint on a separate partition of my laptop: this benefits my research (installing packages and building software from source code), plus it may be useful for my Masters. I was tempted to try Ubuntu, but chose Mint as I’ve had experience with it.

For my first day, I read some papers about the sort of research conducted by an IBERS research team (from AU), to see what DNA datasets I’ll be analysing with different software. Here’s a brief overview of the one paper that specifically relates to my project (the terminology was a little difficult to understand at first, though after spending time reading up I felt a little more confident):

Paper: on a research expedition in a mine (South Wales), where there is no internet access, DNA was extracted to study the Earth’s subsurface and find bacteria. There are 2 datasets: the second version has better quality DNA because improved extraction protocols were used.

I am to obtain the data from the research expedition, run different software on the datasets, then analyse and compare the results.

In a meeting I learnt what nanopores are: DNA goes in, the nanopore has an electric current running through it, and as the DNA passes through the protein pore it disturbs that current, which creates a signal called a squiggle. The squiggles are then run through software to get an output, which can be visualised. I did some background research into what biologists look for: A, C, G and T. Specifically, GC content is what scientists are interested in: DNA with low GC content is less stable.
AT content, especially when repetitive, can indicate errors and bad quality.

First, I am to run the data through Goldilocks, so I spent time reading the documentation: it locates regions on the human genome that express a desired level of variability.

So... my first major issue with this project: Goldilocks is Python 2 (2.7) and my Spyder runs Python 3, so I had to install the appropriate packages for my machine. I continued but once again had some issues, so I asked the developer of Goldilocks, Sam Nicholls, for help.

That day was emotionally draining: I cried, struggled, tried, then cried again. I was tempted to quit, and the turmoil caused me to reconsider my future (I know... very dramatic).

However, I put my stubbornness aside, asked for help, and was finally able to produce graphs that I could analyse.

To continue: Goldilocks works with fasta files, not the newer fast5 files, so I installed poretools to do the conversion (it was also useful for converting to fastq files, which include quality scores), then used samtools to index the files (fasta.fai).

Side note: poretools created histograms of the data, and I was able to recreate the ones previously produced in the paper. I found it something of a personal success that I was able to recreate evidence.

Using Goldilocks, I found some sort of anomaly: in one graph, the GC ratio was low whilst T was suddenly high (50%). I am to find out if this is a data error or not. On the other hand, the other dataset was mostly covered in A and T content (first dataset: lower quality).

I went into the lab to watch DNA run through a nanopore sequencer: the MinION.
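The base-composition check described above (low GC, suddenly high T) can be sketched in plain Python. This is not how Goldilocks does it and it does not use the Goldilocks API: the function names, the two example reads and the 0.5 threshold (taken from the 50% figure above) are all assumptions for illustration.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def base_fractions(sequence):
    """Fraction of A, C, G and T in a sequence (other symbols are ignored)."""
    counts = Counter(sequence)
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def flag_t_heavy(records, threshold=0.5):
    """Names of reads whose T fraction is at or above the threshold."""
    return [name for name, seq in records
            if base_fractions(seq)["T"] >= threshold]


# Two invented reads, standing in for a real fasta file.
example = """>read_1
ACGTACGTGGCC
>read_2
TTTTTTATTTTC
""".splitlines()

records = list(read_fasta(example))
for name, seq in records:
    fractions = base_fractions(seq)
    gc = fractions["G"] + fractions["C"]
    print(name, "GC =", round(gc, 2), "T =", round(fractions["T"], 2))
print("T-heavy reads:", flag_t_heavy(records))
```

For a real file, `read_fasta(open("reads.fasta"))` would work the same way, where `reads.fasta` is a placeholder file name.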
At the end of my first week, I met up with my supervisor and we discussed my jobs for the next week. She seemed impressed with my progress, since I don’t have any background in bioinformatics or advanced coding abilities.

I am getting used to this (I think). Using poretools has turned out to be very useful: I have found out that the longest reads were potential errors and that the quality of the data is low overall. The research team left the location after 50 minutes but kept the run going: the longest reads occurred after 50 minutes. However, the T-heavy reads are not the same as the longest reads, so I’ll be testing whether the T-heavy reads also came after 50 minutes.

Interestingly, one T-heavy read is within the first 50 minutes. When blasted, the query cover is low, though the results included fungus and bacteria, which is what the research team was looking for. On a side note: I used the Linux command uniq to check there are no duplicate reads.

I once again attempted to use poretools to produce squiggles, which yet again gave no results. However, I tried it on the dataset without the MUX reads (QC reads: not quality filtered) and the results were ever so slightly different (still no diagrams though). I noticed there was an issue opened on GitHub for poretools; apparently one fix is to use poretools downloaded from the git repository rather than the Linux package.

At the beginning of the week, I met with my supervisor and she saw the report I had started: so far it is decent, but it needs a lot more comparison of the datasets (BP_v1 & BP_v2). After asking a researcher, Andre Soares, I found out that although both datasets are from the same mine, they were taken at different times of the year (BP1: Dec 2016, BP2: Apr 2017). It must be noted that the researcher stated this is field work, so the results are unfortunately usually low in quality.

I spent most of the week working on the report, improving it and including the analysis of the version 1 dataset.
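The duplicate check and the look at the longest reads mentioned above could also be done in Python. One thing to keep in mind with uniq is that it only compares adjacent lines, so it needs sorted input to catch every duplicate. The sketch below avoids that by grouping reads by sequence. The function names and the example reads are invented for illustration and are not from the real datasets.

```python
def find_duplicates(records):
    """Map each sequence seen more than once to the read names sharing it."""
    by_sequence = {}
    for name, seq in records:
        by_sequence.setdefault(seq, []).append(name)
    return {seq: names for seq, names in by_sequence.items() if len(names) > 1}


def longest_reads(records, n=2):
    """Return the n longest (name, sequence) records, longest first."""
    return sorted(records, key=lambda record: len(record[1]), reverse=True)[:n]


# Invented reads: read_a and read_c share the same sequence.
reads = [
    ("read_a", "ACGT"),
    ("read_b", "TTTTTTTT"),
    ("read_c", "ACGT"),
    ("read_d", "GGC"),
]

print("duplicates:", find_duplicates(reads))
print("longest:", longest_reads(reads, n=1))
```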
I need to Blast both datasets fully, but to do so I need to be added to the AU network's IBERS cluster.

I blasted some random reads from the datasets: specifically the long reads which weren’t T heavy, and those that were shorter and perhaps more reliable. Dataset 1 had nothing useful at all, but dataset 2 has bacteria: specifically bacteria that thrive at 40+ degrees, which doesn’t add up as the mine was 15-20 degrees.

I found out that I’ll be presenting my work through a poster to the AU Creevy lab (bioinformatics lab team). I have started to produce a poster, and my supervisor said they’ll also be available around the Computer Science department.

I was finally connected to the AU IBERS cluster and ran a blast on both datasets. The first dataset (BP_v1) had barely any good results: scores (query cover/read lengths) were all low (less than 50), with random species that included peppers and piranha. However, BP_v2’s results were much better, with scores of ~800, from which we could observe that there were different types of bacteria within the subsurface.

I went into the lab with the creator of Goldilocks and saw him run DNA through a gel and extract it, plus take some pictures under UV light, which I found out ‘degrades’ DNA quality. I also got the opportunity to watch a centrifuge (machine: vortex) in action.

The creator of Goldilocks had a few issues during a second run, but I had the pleasure of watching him work through and finally fix the problem...

So this is the final wrapping up of this summer research project: I am to blast the datasets against each other, plus use Centrifuge and compare the results with Blast, then use Pavian to create hierarchy tree diagrams from the results of both software (if there is enough time).
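Sorting Blast results into useful and not useful, as described above, can be sketched in Python. This assumes the results were saved in Blast's tabular format (`-outfmt 6`) with its default 12 columns, which the post does not confirm. The rows below are invented, and the cut-off of 50 on the bit score is only an assumption borrowed from the "less than 50" figure above.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(handle):
    """Read tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def best_hit_per_read(hits, min_bitscore=50.0):
    """Keep the highest-scoring hit per read, dropping hits below the cut-off."""
    best = {}
    for hit in hits:
        if hit["bitscore"] < min_bitscore:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Invented example rows (not real results from BP_v1 or BP_v2).
rows = [
    ["read_1", "subject_A", "91.5", "900", "60", "5", "1", "900", "10", "909", "0.0", "800"],
    ["read_1", "subject_B", "80.0", "200", "30", "4", "1", "200", "5", "204", "1e-20", "120"],
    ["read_2", "subject_C", "75.0", "60", "12", "2", "1", "60", "1", "60", "0.5", "42"],
]
example_file = io.StringIO("\n".join("\t".join(row) for row in rows))

hits = parse_blast_tabular(example_file)
best = best_hit_per_read(hits)
for read_name, hit in best.items():
    print(read_name, "->", hit["sseqid"], "bit score", hit["bitscore"])
```

With a real results file, `parse_blast_tabular(open("results.tsv"))` would replace the in-memory example, where `results.tsv` is a placeholder name.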
I blasted BP_v1 against BP_v2 and vice versa and there were no similarities; however, BP_v1 against BP_v1 (plus BP_v2 against BP_v2) had reads that were very similar (90%+).

I have been having issues with Centrifuge, but after finding better documentation I realised it will need to run on the cluster with the nt database, as my laptop does not have enough space or RAM. Also, Centrifuge runs on an old version of C (the head node has an updated version), so I have to remove, reinstall and recompile it. After trying this and giving Centrifuge more RAM on the cluster, I found that the task needed redoing multiple times because even more RAM was needed.

Also, regarding Pavian: it only works with Kraken and Centrifuge outputs, so unless I can get Centrifuge to work, I won’t be able to create any diagrams.

On a side note, I’ve been asked to do a talk at the BCS Mid Wales “Show & Tell” at Aberystwyth on Friday 29th September. I have prepared a presentation and made a poster. (For reference on the time scale, the Show and Tell is on Friday 29th September and my talk was confirmed on Thursday 28th September, so I had 24 hours to finalise the poster and create a 5-minute presentation.)

Friday 29th September is my final official day; however, I was asked to go to the lab to present my poster again on Friday 6th October.

I would like to thank Amanda Clare for the supervision and guidance; Andre Soares, who was part of the IBERS research team that went to the mine, for constant help throughout; and Sam Nicholls, the creator of Goldilocks, for the assistance and support.

To read the report or see the poster, see my GitHub profile: sap218

tl;dr: I did bioinformatics and it's fun! I worked in a lab, used various software, and did my first talk.