The rapid rise of technology in the last few decades has led to huge advancements in computational science, which in turn has boosted affiliated sciences such as bioinformatics. Genome-wide association studies (GWAS) are a modern area of research that combines the power of computers with our current knowledge of the genome to deepen our understanding of how genes work.

Bioinformatics combines biology, computer science, information engineering, mathematics and statistics to explain data. As new technologies generate enormous amounts of data, we need methods to sift through it. An important and growing field within bioinformatics, genome-wide association studies attempt to establish links between genetic data and disease. After all, the better we understand our genome, the closer we will get to treating and curing genetic diseases in the future.

The Genetic Code

To understand the data that is stored in our genetic code, we first need an introduction to how genes work. The hard-to-pronounce deoxyribonucleic acid (DNA) is the basic, hereditary material found in humans and other organisms which stores all the information necessary for our body to function.

What is DNA?

This information is stored using a code made up of four chemical bases that we call adenine, guanine, cytosine and thymine. For ease of use, we refer to them by their first letter: A, G, C and T. The bases don’t come alone: they also have two structures attached to them, a sugar molecule and a phosphate molecule, as can be seen in the image. The combination of base plus sugar and phosphate is what we call a nucleotide.



This diagram shows the chemical structure of each base pair. The colors blue, black, red and green indicate the different parts that identify each base. The P-O cross shapes represent the phosphate molecules and the pentagons with the “O” represent the sugar molecules. (Genome.gov, 2020)

Our DNA consists of ~3 billion of these bases, and the order they follow, known as the sequence, determines the information used to build and maintain an organism. The bases pair in a specific way: A pairs with T and C pairs with G. When the nucleotides pair up, they form two long strands that wind into a spiral, and this is what we refer to as the “double helix”.

Imagine the double helix as a type of ladder where the base pairs (A, T, C, G) form the steps and the sugar and phosphate form the backbones of the ladder. This double helix is capable of replication – copying itself almost perfectly – and each of the strands of our ladder serves as a pattern for duplication of the sequence of bases. When cells divide, what they do is copy the DNA from the old cell into the new cell. This is how the genetic material is passed from cell to cell.
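The pairing rule described above (A with T, C with G) is what lets one strand act as a pattern for the other. The following is a minimal Python sketch of that idea. The names `PAIRS` and `complement` are illustrative choices, and the sketch ignores strand direction and the rare copying errors that occur in real cells.

```python
# Pairing rule from the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner strand, base by base.

    Simplification: real strands run in opposite directions, so tools
    usually report the reverse complement. Direction is ignored here.
    """
    return "".join(PAIRS[base] for base in strand)


template = "ATGC"
new_strand = complement(template)
print(new_strand)              # TACG
# Copying the copy gives back the original sequence.
print(complement(new_strand))  # ATGC
```

Because each strand fully determines its partner, separating the two strands and rebuilding the missing half of each yields two identical double helices.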

What is a Gene?

Because DNA is hereditary material, certain characteristics are passed down from generation to generation, but how does this work? A gene is ‘the basic physical and functional unit of heredity’, and each gene is composed of DNA base pairs. The different forms of a gene that produce different characteristics in individuals are called alleles, and they contribute to the unique features of each person. Some genes provide instructions to make proteins, while others don’t seem to have any function. We still have a long way to go to fully understand our own genetic code!



Within each of our cells, we have compact structures called chromosomes, which contain our genes. Each gene is a stretch of the two long strands of nucleotides that make up our DNA.

(Mayo Clinic, 2020)

The proteins that genes code for regulate all the processes our organism needs to work properly. There are thousands of genes in an organism, packed in compact structures that we call chromosomes. Each person has 23 pairs of chromosomes and two copies of most genes, one inherited from each parent. Even though genes are usually the same in everyone, some of them present tiny (or not-so-tiny!) differences between people.
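The idea that a child receives one copy of a gene from each parent can be sketched in a few lines of Python. This is an illustration only: the function name `offspring_genotypes` and the allele labels `A` and `a` are hypothetical, and the sketch assumes each of a parent's two copies is equally likely to be passed on.

```python
from collections import Counter
from itertools import product


def offspring_genotypes(parent1: str, parent2: str) -> dict:
    """Return the probability of each possible child genotype.

    Each parent is a two-letter string such as "Aa": one letter per
    copy of the gene. A child gets one letter from each parent.
    """
    combos = Counter(
        "".join(sorted(a + b)) for a, b in product(parent1, parent2)
    )
    total = sum(combos.values())
    return {genotype: n / total for genotype, n in combos.items()}


# Two parents who each carry one copy of both alleles.
print(offspring_genotypes("Aa", "Aa"))
# {'AA': 0.25, 'Aa': 0.5, 'aa': 0.25}
```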

Medical Genetics

Genetics is a branch of biological sciences that focuses on the study of genes, genetic variation and heredity in organisms. Inheritance consists of the passing of information via genes from parents to offspring. This is a complicated area of study because DNA-based organisms have thousands of genes, and each gene can have several alleles that present as one or more traits or ‘phenotypes’ in an individual. These traits can in turn depend on interactions with other genes and phenotypes, resulting in many entangled relationships.

Genes can also present mutations that create new phenotypes. These mutations can be beneficial, but altering the code generally leads to detrimental outcomes, including diseases that medical genetics seeks to identify and understand. In medicinal chemistry, the first step in drug design is to identify a potential target. When looking for an unknown gene that could be involved in a disease, there are different methods that researchers use to map out the set of genetic material present in an organism. This complete set of genetic material is known as the organism’s genome, and it is what genome-wide association studies, or GWAS, examine.

Genome-Wide Association Studies

Genome-wide association studies compare the genomes of many individuals in order to find associations between genetic variants (genotypes) and the traits they produce in the individual (phenotypes). Handling this much data is only possible thanks to the development of powerful computers, which allow researchers to compare millions of data points within a reasonable span of time.

This method takes hundreds of thousands to millions of genetic variants obtained from the genomes of many individuals and tests each one for association with the trait of interest. It requires not only an understanding of gene inheritance, but also an understanding of statistics, as it involves looking for patterns in large amounts of data which the researcher must interpret properly to obtain useful outcomes.
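To make the statistical side concrete, here is a minimal Python sketch of one simple way to test a single variant: a chi-square test comparing allele counts in people with the trait (cases) and without it (controls). It is not the method of any particular study. The function names and the counts are hypothetical, and the Bonferroni correction shown is just one straightforward way to account for testing many variants at once.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns the test statistic and its p-value.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-variant significance threshold when n_tests variants are tested."""
    return alpha / n_tests


# Hypothetical counts for one variant: the alternative allele is more
# common in cases (60 of 100 alleles) than in controls (40 of 100).
chi2, p = allelic_chi_square(60, 40, 40, 60)
print(chi2, p)  # chi2 is 8.0, p is about 0.0047

# With one million variants tested, the threshold is far stricter than 0.05.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(p < threshold)  # False
```

In this example the variant would look significant if it were the only one tested, but it does not survive the correction for a million tests, which is why such studies need large numbers of participants.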

This diagram shows an overview of the genome-wide association studies used to detect associations, from choosing what to study to obtaining a “map” with the possible genes of interest. (Tam et al., 2019)

The Process of GWAS

The steps of GWAS are as follows:

1. Identification of the disease or trait to be studied and selection of an appropriate population to perform the study.

2. Characterization of the genetic variants to be able to move from the statistical association to the identification of those variants and genes that are causing the disease. Note that even though a disease is sometimes caused by a single gene, it is much more common to find that a disease is regulated by the interactions between multiple genes.

3. Confirmation of the selected genes using different experimental approaches involving cell-based systems or model organisms.

Genetic variants are very heterogeneous. In our analysis, we can expect to find rare variants with a very small impact on the phenotype, rare variants with a large effect on the phenotype, or common variants with a large effect on the phenotype.
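The two properties used above, how rare a variant is and how large its effect is, can each be put into numbers. The sketch below is a hedged illustration in Python: it measures rarity as the minor allele frequency and effect size as an odds ratio, which are common choices but not the only ones. The function names and all input values are hypothetical.

```python
def minor_allele_frequency(dosages):
    """Frequency of the less common allele in a group of people.

    `dosages` holds, for each person, how many copies (0, 1 or 2)
    of the alternative allele they carry.
    """
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)


def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds of carrying the alternative allele in cases versus controls.

    A value near 1 suggests little effect; values further from 1
    suggest a larger effect on the trait.
    """
    return (case_alt * control_ref) / (case_ref * control_alt)


# Ten hypothetical people: 4 alternative alleles out of 20 in total.
print(minor_allele_frequency([0, 0, 1, 0, 2, 0, 1, 0, 0, 0]))  # 0.2

# Hypothetical allele counts in cases and controls.
print(odds_ratio(60, 40, 40, 60))  # 2.25
```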

Effectiveness of GWAS

GWAS can help us discover new biological mechanisms. GWAS loci are usually of unknown function or relevance, and the experimental procedures that follow the computational identification of a genetic variant help uncover the biological pathways and processes that regulate a disease.

A good example is the use of GWAS to assess mood instability, using the UK Biobank database as a source of information. This database stores the health information of 500,000 participants, including their medical histories. Several disorders were studied, including major depressive disorder (MDD), bipolar disorder (BD), schizophrenia, attention deficit hyperactivity disorder (ADHD), anxiety and post-traumatic stress disorder (PTSD). After applying the methodology explained above, the study managed to identify four loci that seem to play a role in mood instability.

GWAS have also been successful in identifying risk loci (fixed positions on chromosomes where a gene variation responsible for a disease is located) for diseases such as anorexia nervosa, major depressive disorder, type 2 diabetes and many others.

What does this mean? The biggest conclusion is that there is a polygenic basis for mood instability, which implies that mood issues are not due to a single genetic mutation, but are related to several genes that we could never have identified without GWAS. As computational capabilities continue to improve alongside technological growth, GWAS will play an increasingly important role in our understanding of genetic diseases. In the future, methods such as these can be combined with gene therapy and other biologic drugs to cure currently incurable genetic diseases.

Reference