Prime Sequencing Primers

I do a lot of molecular cloning, which means I do a lot of Sanger sequencing to make sure the recombinant DNA I physically create has all of the desired nucleotides (genes and other genetic elements) in the order I originally planned on the computer. Once in a while, I would notice a set of samples with poor sequencing read lengths and figure that the sequencing primer (common between the samples) wasn’t very good, but not think too much about it and go on with my day. Recently, I realized I could be far more quantitative and methodical about this and comprehensively analyze how long my sequencing reads are with a given primer. This way, I could throw out really bad primers and use the best-performing ones when given multiple options, so that all of my sequencing reactions are as good as I can get them (at least with the information I already have).

We prepare the sequencing reactions and do the Sanger runs ourselves (cheaper, and sometimes faster, than Genewiz or another sequencing service). All of these reactions were prepared with the same chain-termination protocol and run on either an Applied Biosystems ABI 3100 or ABI 3500 Genetic Analyzer, so they should all be roughly comparable. While the template plasmids have differed, they’ve all been either miniprepped or midiprepped DNA, so all relatively clean stuff.

So I compiled the results of the ~900 sequencing reactions I’ve run over the last 10 months or so, noting the primer name and sequencing read length for each, and loaded the data into R. (I didn’t include reactions that failed outright, since those could have failed for other reasons, such as bad or wrong template.) I subsetted on primers I had used 5 or more times (to get rid of the really, really small sample sizes) and made bar plots describing the performance of each primer. I color-coded primers that gave median read lengths >500bp as green, primers that gave median read lengths between 300 and 500bp as yellow/orange, and primers that gave median read lengths <300bp as red.
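The filtering and color-coding described above can be sketched in a few lines. This is a rough Python illustration, not the R script used for the plot: the function names, the primer names, and every read length in the example are made up, and treating exactly 300bp and 500bp as yellow/orange is an assumption about where the bin edges fall.

```python
from collections import defaultdict
from statistics import median

MIN_USES = 5  # keep primers used 5 or more times, as described in the post


def classify(median_bp):
    """Map a median read length (bp) to one of the three colour bins.

    Assumption: values of exactly 300 or 500 bp land in the middle bin.
    """
    if median_bp > 500:
        return "green"
    if median_bp >= 300:
        return "yellow/orange"
    return "red"


def summarize(reads):
    """reads: iterable of (primer_name, read_length_bp) pairs.

    Returns {primer: (n_reactions, median_bp, colour)} for primers
    with at least MIN_USES reactions.
    """
    by_primer = defaultdict(list)
    for primer, length in reads:
        by_primer[primer].append(length)
    return {
        primer: (len(lengths), median(lengths), classify(median(lengths)))
        for primer, lengths in by_primer.items()
        if len(lengths) >= MIN_USES
    }


# Made-up numbers purely for illustration; not data from the post.
example = (
    [("PRIMER_A", n) for n in (620, 580, 640, 510, 700)]
    + [("PRIMER_B", n) for n in (250, 310, 180, 220, 290)]
    + [("PRIMER_C", n) for n in (450, 470)]  # only 2 uses, so it is dropped
)

for primer, (n, med, colour) in sorted(summarize(example).items()):
    print(primer, n, med, colour)
```

The median is used rather than the mean so that one unusually short or long read does not swing a primer between bins.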

To note: I label all of my primers with my initials and a unique three digit number to easily organize/store them.

As you can see, there are clear differences in performance between primers (with many working ~2.5x better than others). For example, KAM202 (a custom primer) has been consistently terrible, KAM010 (the mCherry-Reverse primer) is consistently “ok”, and KAM027 (also a custom primer) is consistently great. Standard primers were never terrible, which makes sense, but they still varied between “ok” and “good” in my hands: KAM007 (the T3 promoter primer), KAM030 (M13-Reverse), and KAM024 (LNCX-Forward) gave reads in the “ok” range, whereas KAM008 (T7 promoter primer), KAM020 (CMV promoter primer), KAM022, and KAM023 (EGFP-C and EGFP-N, respectively) each gave >500bp per reaction.

I now have this plot sitting behind my desk, so that I can refer to it next time I’m trying to choose between sequencing primers when validating a construct… It may very well be the difference between a sequencing run long enough to assess my region of interest, and one too short to be informative.

Some of the nuts and bolts:

I normally log each set of sequencing runs in an Excel spreadsheet as I plan the run, so it’s pretty trivial for me to copy-paste into a separate spreadsheet afterwards to keep track of all the sequencing results. I then exported the table as a CSV file.
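For anyone who would rather read such an export in Python than in R, a minimal sketch follows. It assumes one row per sequencing reaction with columns named "primer" and "read_length"; those column names, the helper name load_reads, and the sample values are hypothetical, so adjust them to match the real spreadsheet.

```python
import csv
import io

# Hypothetical layout: one row per sequencing reaction. The column names
# "primer" and "read_length" are assumptions, and the values are made up.
SAMPLE_CSV = """primer,read_length
PRIMER_A,620
PRIMER_A,580
PRIMER_B,250
"""


def load_reads(handle):
    """Parse (primer, read_length_bp) pairs from an open CSV file handle."""
    reads = []
    for row in csv.DictReader(handle):
        reads.append((row["primer"].strip(), int(row["read_length"])))
    return reads


# io.StringIO stands in for a real file so the example is self-contained.
reads = load_reads(io.StringIO(SAMPLE_CSV))
print(reads)
```

With a real export, the same function would take a handle from something like open("sequencing_log.csv", newline=""), where the file name is again just a placeholder.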

I then imported the data into RStudio, and used ggplot to make the visualization. While I’m usually pretty self-conscious about my code, this one’s simple enough that I can swallow my pride and reveal how much of an amateur I am, so I’ve made it available here.

Here’s some sample data, showing how the data is structured, if you want to try running the script yourself.