Alpha-2,8-sialyltransferase 2 (ST8SIA2) is an enzyme responsible for the transfer of polysialic acid (PSA) to glycoproteins, principally the neuronal cell adhesion molecule (NCAM1), and is involved in neuronal plasticity. Variants within ST8SIA2 have previously shown association with bipolar disorder, schizophrenia and autism. In addition, altered PSA-NCAM expression in brains of patients with schizophrenia or bipolar disorder indicates a functional dysregulation of glycosylation in mental illness. To explore the role of sequence variation affecting PSA-NCAM formation, we conducted a targeted re-sequencing study of a ∼100 kb region – including the entire ST8SIA2 gene and its region of interaction with NCAM1 – in 48 Caucasian cases with bipolar disorder using the Roche 454 platform. We identified over 400 DNA variants, including 47 putative novel variants not described in dbSNP. Validation of a subset of variants via Sequenom showed high reliability of Roche 454 genotype calls (97% genotype concordance, with 80% of novel variants independently verified). We did not observe major loss-of-function mutations that would affect PSA-NCAM formation, either by ablating ST8SIA2 function or by affecting the ability of NCAM1 to be glycosylated. However, we identified 13 SNPs in the UTRs of ST8SIA2, a synonymous coding SNP in exon 5 (rs2305561, P207P) and many additional non-coding variants that may influence splicing or regulation of ST8SIA2 expression. We calculated nucleotide diversity within ST8SIA2 on specific haplotypes, finding that the diversity on the specific “risk” and “protective” haplotypes was lower than other non-disease-associated haplotypes, suggesting that putative functional variation may have arisen on a spectrum of haplotypes. We have identified common and novel variants (rs11074064, rs722645, 15∶92961050) that exist on a spectrum of haplotypes, yet are plausible candidates for conferring the effect of risk and protective haplotypes via multiple enhancer elements. A Galaxy workflow/pipeline for sequence analysis used herein is available at: https://main.g2.bx.psu.edu/u/a-shaw-neura/p/next-generation-resources .

Funding: This work was supported by the Australian National Medical and Health Research Council (project grant 630574, and program grants 510135 and 1037196) and by the Schizophrenia Research Institute, utilizing infrastructure funding from NSW Health. Genetic Repositories Australia is supported by an Australian National Health and Medical Research Council (Grant # 401184). The authors used Garvan Galaxy, funded by a Cancer Institute NSW grant (11/REG/1–10). Mr Yash Tiwari was supported by an Australian Postgraduate Award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2014 Shaw et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Recent advances in high-throughput sequencing have allowed the rapid sequencing of whole gene loci in multiple individuals [27] . We have exploited this technique to sequence the coding, intronic and flanking sequence of ST8SIA2 and its region of interaction in NCAM1, in lymphocyte DNA from individuals with bipolar disorder. We incorporated a number of methods into our analysis pipeline for quality control in variant detection and base-calling. Altogether, we have identified 396 single nucleotide variants including 42 novel single nucleotide variants across 95 kb of genomic DNA in ST8SIA2, and 21 variants, 5 of which are novel, in a 6 kb region of NCAM1. We utilised data from the 1000 Genomes Project [28] to identify variants unique to bipolar individuals, and those which are at altered frequency in bipolar cases compared to the 1000 Genomes population panels [29] with the same racial background. We examined the haplotypic background of each variant in ST8SIA2 with respect to a previously identified risk haplotype [20] and explored evidence of functional impact by comparisons with ENCODE data [30] , [31] and co-localisation with ST8SIA2-specific enhancer elements that contribute to altered target gene expression [32] . Through an integrative analysis of the above methods, we have identified and verified the most compelling candidate variants for increasing susceptibility to mental illness by modulating ST8SIA2 function.

There is growing genetic evidence of the involvement of ST8SIA2 (also known as SIAT8B or STX) in conferring risk to mental illness. We and others have previously identified the 15q25–26 genomic region as harbouring a mental illness risk gene through linkage in family cohorts with bipolar spectrum disorder [16] , bipolar disorder and schizophrenia [17] , [18] or psychosis [19] . We subsequently found evidence of disease association with variants in ST8SIA2 in both bipolar disorder and schizophrenia [20] . Other groups have shown association with schizophrenia in Japanese and Chinese cohorts in the promoter region of ST8SIA2 [21] , [22] , which was selected as a candidate gene after NCAM1 was implicated in disease risk [23] . More recently, genome-wide association studies have implicated SNPs downstream of ST8SIA2 in risk of bipolar disorder [24] , and variants in intron 2 have also been associated with autism spectrum disorder (verbal subtype) [25] . Furthermore, a 520 kb hemizygous deletion of three genes including ST8SIA2 has been identified in a patient with autism spectrum disorder [26] . Hence, characterisation of the nucleotide variation in ST8SIA2 in patients with a variety of mental illnesses is an important step in understanding the molecular mechanisms through which this gene increases disease risk.

Polysialic acid (PSA) is an important carbohydrate that plays a crucial role in neural development and plasticity [1] – [3] (reviewed in [4] , [5] ). Homopolymers of α2,8-linked sialic acid are synthesised in the Golgi by sialyltransferase enzymes, of which there are six in the human genome. Two sialyltransferases, encoded by ST8SIA2 on human chromosome 15q26.1 and ST8SIA4 on 5q21.1, catalyse the transfer of long chains (>50 residues) of PSA onto their major substrate, neuronal cell adhesion molecule 1 (NCAM1) [2] , [6] (reviewed in [5] ), to form PSA-NCAM. Expression of PSA-NCAM is particularly high in early brain development [1] . PSA addition to NCAM1 increases the hydrated volume of cells and limits the adhesive properties of NCAM1, enabling neuronal migration, dendrite formation, axon targeting and synaptic plasticity (reviewed in [4] ). Mice expressing only PSA-free NCAM after knockout of the relevant sialyltransferase genes (st8sia2 −/− ; st8sia4 −/− ), show severe malformation of major brain axon tracts, which are reminiscent of those observed in patients with schizophrenia [7] , [8] . Alterations in the expression of PSA-NCAM in the adult post-mortem brain – particularly in the amygdala and hippocampal dentate gyrus & CA1 regions – have been shown in patients with schizophrenia, bipolar disorder, major depression and drug-refractory temporal lobe epilepsy [9] – [12] . Further, expression of PSA-NCAM is modulated in rats by treatment with common antipsychotics [13] – [15] .

Results

454 Sequencing and Mapping To characterise genetic variation that may quantitatively impact formation of PSA-NCAM, we performed massively parallel sequencing on the 454 GS-FLX platform of two regions that we hypothesise are associated with ST8SIA2-related developmental pathology in mental illness: 1) a ∼95 kb region including the entire coding and intronic regions and ∼20 kb of flanking sequence of ST8SIA2 (hg19/GRCh37: chr15∶92,919,255–93,013,920 bp; Figure 1); and 2) a ∼6 kb region within NCAM1 (hg19/GRCh37: chr11∶113,100,972–113,107,571 bp; Figure 2). This latter region contains the 5th immunoglobulin-like domain and fibronectin-1 domain, and specifically the 5th and 6th glycosylation sites and acidic patch (that are critical for enzyme recognition, docking and polysialylation by ST8SIA2 to form PSA-NCAM [33]–[35]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. Coverage of ST8SIA2 gene locus by 454 and Sanger sequencing. The 96∼20 kb flanking sequence (chr15∶92,919,255–93,013,920 bp; UCSC Genome Browser hg19 http://genome.ucsc.edu) was divided into eight long-range PCR amplicons (454 Amplicons 1–8, in pink) for library preparation prior to 454 sequencing. The mean depth of coverage across all 48 samples for each long-range amplicon is shown in burgundy. A GC-rich region surrounding the promoter and exon 1 was sequenced via Sanger sequencing using 8 short overlapping amplicons (Sanger amplicons 1–8; inset). Pink dots on the GC percent or mean depth per sample tracks indicate loci where the level of GC richness, or read depth is above the range indicated. Regions which failed to amplify, or which contained simple repeats, long homopolymers (length > = 10), or short interspersed elements (SINEs) tended to have lower depth of coverage, and were inaccessible to next-generation sequencing in phase I of the 1000 genomes project (1000 G Accs strict). https://doi.org/10.1371/journal.pone.0092556.g001 PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 2. Region of NCAM1 gene covered via Roche 454 sequencing. The sequenced region spanned 6.6(chr11∶113,100,972–113,107,571 bp; hg19) and included exons 8–12 (red boxes) and less frequently used exons 8a and mini-a (grey boxes) [52]. The locations of the 5th and 6th glycosylation sites and acidic patch required for glycosylation are in exons 9, 10 and 11 and are indicated by green, blue and white lines, respectively. The locations of SNPs identified in this study are indicated with arrows. Names of known SNPs are in black text, and novel SNPs are listed as NV (novel base substitutions) or InDel (novel insertion-deletion change) followed by their base position (hg19). Two coding SNPs located in exons 8a and 11 are marked with closed boxes, with their effect on amino acid sequence shown. https://doi.org/10.1371/journal.pone.0092556.g002 We applied BWA mapping [36] to the entire genome to ensure that mapped reads were specific to regions amplified by long-range PCR, and to determine rates of target enrichment and non-specific amplification. We found that 96% of reads across all individuals mapped to a unique site within the human genome, and that 93% of reads that mapped to the human genome mapped within the target regions. The remaining mapping is likely due to some amplification of non-targeted regions by the long-range PCR primers.

Coverage Depth and uniformity of coverage are critical for sensitive SNP detection and accurate genotype calls, and are important considerations in assessing of the completeness of variant detection in targeted resequencing. Hence, we examined the mean depth of coverage across each base-pair of the ST8SIA2 region (Figure 1). We found that depth of coverage dropped substantially and locally in regions of low complexity, which included simple repeat sequences and very long homopolymers (length ≥10). These regions are known to be difficult to base call accurately due to the deficiencies in sequencing chemistry of 454 [37]. Indeed, many of these regions were found to be less accessible to sequencing according to the “strict” stringency assessment of the human genome conducted in phase I of the 1000 Genomes Project [28]. Many of these regions also coincided with the ends of primate-specific (recently evolved) Alu repeat elements, which are rich in A and T homopolymers [38]. Aside from local sequence context, a major determinant of mean depth of coverage was inclusion within a specific long-range PCR amplicon, with the NCAM amplicon and ST8SIA2 amplicons 2, 5 and 6 having the highest mean coverage (Figure 3). We assessed the depth of coverage in each sample, and also found variation between samples (Figure 3). These differences demonstrate that amplicon pooling and sample pooling steps each contributed to variation in the depth of sequence coverage within genomic regions. Overall, although some samples contained amplicons with poor coverage (Figure 3), the mean coverage for each amplicon across all samples was >10, a level sufficient for ascertainment of diploid genotypes. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 3. Heatmap of mean depth of coverage per amplicon and per sample. Amplicons are listed along the x-axis, and samples (x1-48) are listed along the y-axis. Depth of coverage is indicated according to the colour key, with white indicating <2 reads coverage and red indicating >35 reads coverage. The plot was created using R [63]. https://doi.org/10.1371/journal.pone.0092556.g003

SNP calling We used two different algorithms to call SNPs from 454-generated data: i) GS Reference Mapper (Refmapper), which is a vendor-supplied algorithm specific to the 454 platform, and calls SNPs based on alignment data from each individual independently; and ii) GATK [39], [40], which is not platform specific, and calls SNPs based on alignment data from all individuals combined. As raw GATK SNP calls require some level of post-hoc filtering, we initially used the GATK-recommended filtering criteria for whole genome SNP detection, and then increased the stringency to remove a subset of likely spurious variant calls (further details are supplied in Note S1; Figure S1). In order to maximise our ability to detect true SNPs for functional assessment, the union of both Refmapper and GATK algorithms (i.e., those SNPs that were called by at least one algorithm) formed our list of putative SNPs. We compared the locus of SNPs called using each algorithm and found that 89% of called SNPs were common to both algorithms (Table 1). Taking only the subset of putative novel SNPs in each set, we found that each algorithm alone and the intersection between the algorithms generated approximately the same proportion of putative novel variants (8–9%), and ∼60% of the putative novel variants were called in both algorithms (Table 1). Twenty one SNPs filtered out in GATK were called as genuine variants by Refmapper. A further 23 SNPs were identified by only one algorithm. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 1. Comparison of SNP detection methods from 454 data, and identification of bona fide SNPs for functional analysis. https://doi.org/10.1371/journal.pone.0092556.t001 We sequenced a GC-rich region that was not amplified by long-range PCR using bi-directional Sanger sequencing (Figure 1). SNPs were called using Consed and the Phred/Phrap/Polyphred software suite [41], [42], with manual inspection of chromatograms for SNP identification and genotyping. Non-reference SNP and genotype calls were only made if the non-reference base was called in both the forward and reverse reads. Using this methodology we identified 24 additional SNPs in the ST8SIA2 locus (19 known, 5 novel).

Genotype calling and accuracy Diploid genotype calling from high throughput sequencing relies on a minimal level of read coverage in each individual at each base position to accurately distinguish true heterozgous sites from sites with random base-calling errors. Due to variation in coverage across regions and across amplicons (Figure 3), we reasoned that a proportion of the targeted region in each individual was marginal, i.e. was covered to a depth below that required for statistically confident heterozygote calling (coverage <10). In order to use data from these regions, and other regions with low local coverage, we used the probabilistic genotype calling module within GATK. This module uses read data at all sites, even those with low coverage, in the simultaneous estimation of cohort allele frequency and calculation of genotype likelihoods. The diploid genotype with the highest likelihood is assigned as the genotype call, along with a likelihood ratio calculation of the confidence of this call [denoted Genotype Quality (GQ)]. In order to assess the overall accuracy of applying this mode of genotype calling to our 454 sequence data, we compared the GATK 454-derived genotypes with genotypes determined directly via Illumina genotyping assay for the same individuals from an earlier study [20], [43], which were assumed to be 100% accurate. The available Illumina data contained 2133 genotype calls from 50 SNPs (minor allele frequency [MAF] average 0.29; range 0.01–0.49), including 1081 (50.7%) non-homozygous-reference calls. One sample (# 10) was excluded from further analysis based on high genotype discordance between platforms (∼50%). The overall genotype discordance across the remaining individuals was low (∼2%; Table 2). We reasoned that some genotype calls may have been made at low confidence in regions with low coverage, and that these could be removed from all genotype calls to reduce discordance. Indeed we found that eliminating low confidence genotype calls (GQ <10), which are most subject to random error, reduced the discordance 2-fold at the expense of a decrease in the number of genotypes called (Table 2). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 2. Genotype concordance between 454-generated genotype calls and those directly genotyped via Illumina GoldenGate or Illumina 660 W. https://doi.org/10.1371/journal.pone.0092556.t002 Genotype calling for Sanger data was performed using Polyphred. Only two SNPs were available with known genotypes from the Illumina genotyping data for assessing the accuracy of Sanger genotype calling. We found that all known genotypes were accurately called in 44 individuals with available known genotypes (88 known genotypes, 14 non-reference).

Comparison to 1000 Genomes Project data Having ascertained a list of putative SNPs that were observed in Caucasian individuals with bipolar disorder, we were able to assess our ability to detect SNPs present in the general population by comparing our detected variants to those of Caucasian Europeans represented in phase I of the 1000 Genomes (1 kG) project (CEU and GBR populations; n = 174). We found that many rare SNPs (MAF <0.05) were found to be unique to either the bipolar cohort or 1 kG individuals (Figure 4), which is expected due to the inevitable sampling bias which occurs when sampling small numbers of subjects from a population. In contrast, 100% of common variants (MAF ≥0.1) found in the 1 kG cohort were also found in our bipolar cohort, confirming our ability to detect common variants in the population. A small number of common variants found in our cohort (n = 3) were apparently not detected in 1 kG: however, manual inspection showed that these SNPs were adjacent to indels, or low-complexity sequence, and therefore were probably filtered from the list of integrated variants in the 1 kG cohort. We excluded these putative SNPs from further analysis. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 4. Comparison of detected SNP variants with all SNPs detected in 1 kG data. The minor allele frequency (MAF) of all SNPs identified in the sequenced region from 174 Caucasian Europeans from phase I of the 1 kG project (n = 519) and 48 bipolar disorder cases (n = 412) were compared. SNPs found in bipolar individuals only (yellow) and 1 kG individuals only (light blue) were predominantly rare alleles whereas those found in both sets (green) were predominantly common. https://doi.org/10.1371/journal.pone.0092556.g004 To determine which variants might be associated with illness, we next compared the allele frequencies of all SNPs detected in ST8SIA2 and NCAM1 in the 48 bipolar cases with those reported in the 174 1 kG individuals. While the 1 kG population may contain individuals with bipolar disorder, this would be expected to be at the population rate of around 1.5%, so it can be treated as an unscreened control population. We found that two SNPs in NCAM1 (rs584427 and rs646558, representing ∼10% of all observed) and 53 SNPs in ST8SIA2 (∼14% of all observed) showed a frequency difference of 5% or more between the two populations (Table 3, Table S1). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 3. Summary of genetic variation identified in NCAM1. https://doi.org/10.1371/journal.pone.0092556.t003

Nucleotide diversity on specific ST8SIA2 haplotype backgrounds We previously identified specific haplotypes which were associated with increased or decreased disease risk, and were defined by six SNPs within the ST8SIA2 locus [20]. In order to identify potentially functional variants which lie specifically on these haplotypes, we performed haplotype phasing and imputation using the BEAGLE algorithm. The data used as input was combined from Illumina genotypes, Sanger genotypes and 454 genotype likelihoods from GATK (excluding data in regions with potential allelic bias; see Note S2). Genotypes from 1 kG phase I Caucasian European individuals (n = 174) were used as a reference panel. After phasing and imputation, 40 bipolar chromosomes were predicted to have the risk haplotype, whereas 13 bipolar chromosomes were predicted to have the protective haplotype (Figure 5). The remaining 41 bipolar chromosomes harboured neither risk nor protective haplotypes, and were designated as “other”. A large number of common variants in the bipolar individuals appear to have arisen through co-segregation of alleles in a large haplotype block between the 1st and 5th exons of ST8SIA2 (Figure 5). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 5. Haplotype phasing of all detected nucleotide variants in ST8SIA2 gene region. The gene structure is shown above, with the minor allele frequency of each SNP detected shown by the histogram immediately below the gene (red represents frequency of alternate allele). The red, green and yellow bars on the left indicate the haplotype on which the detected variation resides (risk = red; protective = green; other haplotypes = yellow). Detected variation with respect to the specific haplotype on which it resides is shown to the right of haplotype bars, where non-reference alleles are coloured in light blue (homozygous) and dark blue (heterozygous). The locations of the six SNP markers that were used to define the haplotypes are shown below the variation. The location of the PreSTIGE enhancer elements within the 454 sequenced region from human skeletal muscle myoblasts (HSMM) are indicated in orange, and from neuronal precursor cells (NPC) are indicated in purple. PreSTIGE enhancer locations were obtained from http://genetics.case.edu/prestige/). https://doi.org/10.1371/journal.pone.0092556.g005 The nucleotide diversity across the sequenced region, as measured under the assumptions of an infinite site neutral allele model, was similar across ST8SIA2 (θ = 5.42×10−4) to rates previously reported for other genes [44]. Surprisingly, the nucleotide diversity was lower on the designated risk and protective haplotypes (θ = 3.45×10−4 and 3.74×10−4) compared to other haplotypes (θ = 8.35×10−4). This may be indicative of allelic heterogeneity, i.e., that putative functional variants may have arisen on a spectrum of haplotypes, rather than restricted specifically to the risk haplotype previously associated with illness.

Analysis of putative function Having ascertained the location, diploid genotype and phase of identified SNPs, we turned our attention to the predicted functional impact of SNPs in the ST8SIA2 and NCAM1 sequenced regions, looking at the coding, transcribed and non-transcribed regions. We found no non-synonymous SNPs in coding regions of ST8SIA2, although we identified a previously known synonymous SNP in exon 5 (rs2305561, P207P). We also identified a known SNP 5 bp from the 3′ end of exon 5 (rs11637874), for which the minor T allele was predicted to improve splice donor site efficiency (BDGP score: T allele = 0.95; C allele = 0.81). Allele frequencies for these two SNPs were similar to those in the 1 kG cohort (Table S1). In NCAM1, we found a single non-synonymous SNP in exon 8a (rs117219783, G/A, G>E; Figure 2), which resulted in a non-conservative amino acid change (non-polar glycine to polar glutamic acid), although the MAF was not different to that observed in 1 kG data (MAF (A) = 0.02 vs 0.01; χ2 = 0.37, p = 0.54; Table 3). This exon is rarely used, and was reported to exist in mRNA from testis (AK314589). We also observed a synonymous variant in exon 11 of NCAM1 (rs584427, G/T, V540V), which showed some frequency distortion in the bipolar cases compared to 1 kG individuals (MAF (T) = 0.34 vs 0.45 respectively; χ2 = 3.16, p = 0.08; Table 3). Two novel insertion/deletion SNPs (at 113,101,620 and 113,104,997 bp) showed nominal allelic association (χ2 = 5.37; p = 0.02 and χ2 = 3.57; p = 0.05, respectively). No variants were observed in the glycosylation regions or the acidic patch that would affect the ability of NCAM1 to be glycosylated. The recent release of comprehensive functional genomics data [31] allowed us to assess the potential gene-regulatory impact of the non-coding variants we identified in ST8SIA2. To identify SNPs that may affect transcriptional regulation of ST8SIA2 we looked at the localisation of SNPs within DNAse hypersensitivity sites (DHSs) in human Embryonic Stem Cells (hESC) or neuronal cell lines, surveyed in the ENCODE project [31], [45]. DHSs indicate a region of open chromatin, and DHS peaks (DHSPs) within the DNAse-Seq signal indicate chromatin that is potentially accessible for transcription factor binding, which could be modulated by sequence variants. We found that 20 SNPs in total (five novel) were within DHSPs, five SNPs were within DHSPs in both hESC and neuronal cell lines (one novel), six within neuronal cell lines only (two novel) and nine within hESC cell lines only (two novel; Table 4). Consistent with enrichment for functional variants, assessment of Genomic Evolutionary Rate Profiling (GERP) scores revealed that 20% of DHSP-overlapping sites were conserved (GERP >2) compared to only 3% of non-DHS-overlapping sites. The five novel SNPs which were coincident with DHSPs (at 92937845, 92938435, 92962229, 92973414, 92984673) were present in only a single sequenced individual each. Two of these SNPs were conserved with a high GERP score (4.12 at 92937845 and 5.55 at 92973414) and one SNP was predicted to introduce a splice acceptor site 69 bp downstream of exon 2 (92973414, BDGP score = 0.98 with A allele). Despite the conservation of a high proportion of DHSP-overlapping SNPs, the majority (n = 9) of conserved SNPs were neither within DHS nor transcribed. This indicates that either their transcriptional regulatory function is not captured in the cell lines surveyed, or that they impart a conserved function through another mode of regulation. A further survey of DHS clusters across 125 additional cell lines included in the ENCODE project revealed that the majority (n = 6) were not present in DHSs of any other cell lines (data not shown), suggesting that for these SNPs an alternate mode of regulation is responsible rather than a conserved transcriptional regulatory function in another cell type. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 4. Summary of transcribed and DHSP overlapping genetic variation identified in ST8SIA2. https://doi.org/10.1371/journal.pone.0092556.t004 To examine SNPs that may affect post-transcriptional regulation, we examined SNPs that were within ST8SIA2 untranslated regions, according to Refseq annotation. These SNPs may impact RNA-binding proteins, which are known to play a crucial role in RNA localisation, stability and splicing of neuronally expressed transcripts, which is in turn critical during brain development [46]. We identified two SNPs in the 5′UTR (rs3743365, rs3743364), and 10 SNPs within the 3′UTR, including one variant rs2290492 that lies ∼350 bp past the stop codon, near predicted miRNA binding sites [47]. A large proportion of the ST8SIA2 3′UTR proximal to the stop codon is conserved and we found two SNPs in the 3′UTR with GERP >2 in conserved elements (rs139149207 and rs2290492).

Haplotype-specific functional variants Finally, we examined how SNPs segregated with haplotypes in ST8SIA2 that were associated with mental illness. We considered the 376 SNP loci genotyped by GATK or Sanger. Of these, 347 were successfully phased and thus could be assessed for whether variants segregated with the risk, the protective or other haplotypes. To look for variants that could confer the risk or protective effect of our previously identified haplotypes in isolation, we looked for variants that resided exclusively on the risk or protective haplotypes and not other haplotypes. We found that 19 variants (seven novel) lay exclusively on the risk haplotype (MAF = 0.011–0.024), and five variants (one novel) lay exclusively on the protective haplotype (MAF = 0.011–0.025). Of those on the risk haplotype, two are transcribed (rs115781738 and rs145948851), and five were within DHSPs (rs117156502, rs112804054, rs115781738, 92937845 and 92973414). The latter two of these sites were conserved with GERP >2. The minor alleles of five SNPs were found exclusively on the protective haplotype: one of these was novel (at base position 92956314) and one localised within a DHSP (rs141457407). All of the SNPs that reside exclusively on the risk or protective haplotype are rare, and therefore cannot individually account for the increased risk or protective effect. A recent study demonstrated that clusters of common disease-associated variants, located within multiple enhancer elements in cis, can act in combination to regulate gene expression and increase disease risk [32]. Alleles of such SNPs would contribute to a risk or protective effect when inherited together with other regulatory alleles in a specific haplotype, but would not necessarily have a discernible regulatory effect when inherited as part of other haplotypes. To attempt to capture this class of variant, we looked for alleles that segregated with either the risk or the protective haplotype in at least one chromosome, regardless of segregation with other haplotypes in additional chromosomes, and intersected these variants with ST8SIA2-specific PreSTIGE enhancer predictions [32], which existed for two cell lines: neuronal precursor cells (NPC) and human skeletal muscle myoblasts (HSMM), the locations of which are indicated in Figure 5. We expected that alleles having a discernible effect via this mechanism would be relatively common and present in the general population, so we used phased genotypes from 1 kG CEU and GBR population panels [28], which provide a larger population sample than our bipolar cohort. Of 348 chromosomes in the 1 kG individuals, 127 carried the risk haplotype and 42 carried the protective haplotype. Seventy-eight SNPs segregated with the ST8SIA2 6-SNP risk haplotype in ≥ 1 individual and the protective haplotype in 0 individuals, including two SNPs which were used to tag the originally identified ST8SIA2 risk haplotype. Eight SNPs were tagged by two of six SNPs constituting the risk haplotype (rs2035645 and rs4777974). Of these, three are in PreSTIGE enhancer elements (rs11074066, rs921846 and rs722645), one is also in a DHSP in hESC cells (rs921846) and one is in a foetal brain tissue DHSP cluster (rs722645). Conversely, 40 SNPs segregated with the protective haplotype in ≥ 1 individual and the risk haplotype in 0 individuals. Twenty-five SNPs were tagged by four of six SNPs constituting the protective haplotype (rs4586379, rs11637898, rs11074070 and rs3784735). Of the tagged SNPs, 13 were in PreSTIGE enhancers. Although none of these 13 were in DHSPs from hESC or neuronal cell lines, two (rs11074064 and rs11074065) were in a foetal brain tissue (DHSP) cluster.