> seq1 this is the description of my first sequence

AAT



> seq2 this is the description of my second sequence

ATG



> seq3 neat!

TGC



> seq4 this is a pretty simple string of base!

TTT



for line in open(fastaSeqFilePath,'r').readlines():


seq = [s.getSequence() for s in listOfSeqs];


reading sequences...

> seq1 this is the description of my first sequence

AAT

> seq2 this is the description of my second sequence

ATG

> seq3 neat!

TGC

> seq4 this is a pretty simple string of base!

TTT

finding a Pretty Short super String..

AATGTTTGC

finished.



After reading through the excellently written Dive Into Python, the next step was to write something a bit more involved than the standard 'hello world' toy programs. Combining two nascent interests, learning Python and computational biology, I decided to write some code to solve a rather simple bioinformatics problem, the Shortest Superstring problem. This sort of approach can be used as a naive means of reconstructing a DNA sequence from overlapping fragments. The problem is stated as follows: given a set of strings s_1,...,s_n, find the shortest string s such that each s_i appears as a substring of s.

Please note that this is really the first semi-substantial amount of Python I've ever written, and, as such, I'm sure there are a number of things I could have expressed more succinctly (any comments containing syntactically (or algorithmically) improved versions of the code here would be appreciated).

The first thing to consider is piping in the input. There is a wonderfully pithy standard for listing gene sequences called FASTA. Each record consists of a single '>' character followed by a one-line descriptor for the sequence, followed by the sequence itself. My FASTA file holds four records, seq1 through seq4, with the sequences AAT, ATG, TGC and TTT.

I decided to roll my own FASTA parser, to help me get the hang of I/O in Python; a single method parses the FASTA-formatted sequences out of a file. The most amazing line to me in it is:

for line in open(fastaSeqFilePath,'r').readlines():

I'm primarily a Java developer, and while I'm not jumping on the all-too-fashionable "Java sucks!" bandwagon just yet, you certainly do not get this sort of brevity in J2EE. The machinery it takes to read a file line by line in Java is admittedly rather bloated. What immediately attracts me about Python is that it expresses complex things succinctly and that it reads like pseudocode.

Note that I build FastaSequence objects as I go, passing the one-line description and the base sequence to the constructor.
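To make the parsing step concrete, here is a minimal sketch of it, not the original listing. The getSequence() accessor is named in the post; the constructor's argument order, the getDescription() accessor, and the parseFastaLines() and parseFastaFile() helpers are assumptions made for this example.

```python
class FastaSequence:
    """One FASTA record: its one-line description and its bases."""

    def __init__(self, description, sequence):
        # Argument order is an assumption; the post only says both are passed in.
        self.description = description
        self.sequence = sequence

    def getDescription(self):
        # Hypothetical accessor, added for symmetry with getSequence().
        return self.description

    def getSequence(self):
        return self.sequence


def parseFastaLines(lines):
    """Build a list of FastaSequence objects from an iterable of text lines."""
    listOfSeqs = []
    description = None
    bases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith('>'):
            # A new record starts here, so store the previous one, if any.
            if description is not None:
                listOfSeqs.append(FastaSequence(description, ''.join(bases)))
            description = line[1:].strip()
            bases = []
        else:
            # A sequence may span several lines; collect them all.
            bases.append(line)
    if description is not None:
        listOfSeqs.append(FastaSequence(description, ''.join(bases)))
    return listOfSeqs


def parseFastaFile(fastaSeqFilePath):
    """Read a FASTA file from disk and parse it."""
    with open(fastaSeqFilePath, 'r') as fastaFile:
        return parseFastaLines(fastaFile)
```

Splitting the work into a line-based parser and a thin file wrapper is a design choice for the sketch: it lets the parsing logic be exercised on a plain list of strings without touching the disk. Run on the sample file, parseFastaFile() should yield four objects whose getSequence() values are AAT, ATG, TGC and TTT.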
The class itself lives in FastaSequence.py. Now I've got a collection of FastaSequence objects; the next step is to build the superstring. Here is where I cheat a little. Finding the shortest superstring is a known NP-hard problem. While there are more sophisticated algorithms that come closer to optimal, they are quite involved, so here I employed the relatively straightforward greedy approach: find and merge the pair of sequences with the largest overlap, remove both strings from the list, and repeat until no strings remain. Unfortunately, this is not always going to be optimal, but it will be pretty good (notice the method name). In fact, there is a proven upper bound for the textbook greedy algorithm, which puts each merged string back into the list: its result is never more than 4x the length of the optimal solution, and it usually does much better.

Notice the list-mapping line:

seq = [s.getSequence() for s in listOfSeqs];

Being able to express a list transformation so naturally is again indicative of Python's pithiness (perhaps more pertinently its 'Pythiness'? Sorry.) Now, arguably loading all the sequences into objects was unnecessary (I told you that I'm primarily a Java developer!) and it's true that passing around the base strings would have been adequate here, but it's certainly possible (likely?) that you would want access to the descriptor strings somewhere down the line.

I've left a few things out: the code for findPairWithMaxOverlap(), for instance, and why it returns a 3-tuple. In that code, getMatches() returns a tuple [numChars, s_1, s_2], where numChars is the number of characters that match in s_1 and s_2, starting at s_2[0] and ending at s_2[numChars].

That's it, really. We can slap the whole thing together and run it on the sample file listed above. The output echoes the four sequences and then prints the superstring AATGTTTGC, which is the sort of output we would expect. The take-home here is that I'm very positive about Python: it is very intuitive and its syntax feels very natural.
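The greedy procedure described above can be sketched as follows. This is one reading of the description, not the original code: the names getMatches() and findPairWithMaxOverlap() and the 3-tuple shape come from the post, while prettyShortSuperstring(), the tie-breaking rule (first best pair wins), and the handling of a leftover odd string are assumptions.

```python
def getMatches(s1, s2):
    """Return (numChars, s1, s2), where numChars is the length of the
    longest suffix of s1 that is also a prefix of s2."""
    for numChars in range(min(len(s1), len(s2)), 0, -1):
        if s1.endswith(s2[:numChars]):
            return (numChars, s1, s2)
    return (0, s1, s2)


def findPairWithMaxOverlap(seqs):
    """Try every ordered pair and return the 3-tuple with the largest overlap.
    On ties, the first pair found is kept (an assumption of this sketch)."""
    best = None
    for i, s1 in enumerate(seqs):
        for j, s2 in enumerate(seqs):
            if i == j:
                continue
            candidate = getMatches(s1, s2)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


def prettyShortSuperstring(seqs):
    """Merge the best-overlapping pair, remove both strings from the list,
    and repeat; the merged pieces are concatenated in the order found."""
    remaining = list(seqs)
    pieces = []
    while len(remaining) > 1:
        numChars, s1, s2 = findPairWithMaxOverlap(remaining)
        pieces.append(s1 + s2[numChars:])  # drop the overlapping prefix of s2
        remaining.remove(s1)
        remaining.remove(s2)
    pieces.extend(remaining)  # an odd string out is appended unchanged
    return ''.join(pieces)


seq = ['AAT', 'ATG', 'TGC', 'TTT']
print(prettyShortSuperstring(seq))  # prints AATGTTTGC
```

On the sample input the sketch first merges AAT and ATG into AATG, then TTT and TGC into TTTGC, which gives the same AATGTTTGC as the run reported above. Because merged strings are not put back into the list, it misses the shorter AATGC that the textbook greedy algorithm would find for the first three sequences.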