13 April 2008, 10:39 UTCRobust topological sorting and Tarjan's algorithm in Python
The Strongly Connected Components of a directed graph are subsets of nodes such that each node within a subset can be reached from each other node. Tarjan's algorithm can identify these components efficiently (thanks njh for finding this algorithm).
By identifying strongly connected components ahead of time we can create a topological sorting algorithm that does the best it can in the presence of cycles.
Here is an implementation in Python:
Here are some things you might use this for:
- Install packages with circular dependencies in the best order you can.
- Work out in which order a set of equations must be solved, and which must be solved simultaneously.
- In a revision control system align as well as possible many versions of a file. (This is basically what I am using this for, but with DNA.)
7 April 2008, 6:27 UTCAMOS message reader/writer module for Python
(of interest to bioinformaticians, and few others)
AMOS (A Modular Open-Source Assembler) defines a standard message format for describing DNA sequence assemblies. It has a Perl module for reading these assemblies, but no Python module.
Shock! This must be rectified, forthwith!
I am currently using this module to import Newbler assemblies of 454 data into AMOS. The quality data is thoroughly b0rked. If anyone has any experience with this, please contact me.
23 February 2008, 10:51 UTCIntroducing Myrialign - align short reads to a reference genome
Monash has just bought a short read genome sequencer which is about to start throwing great gobs of data in my general direction. To deal with this looming deluge, I've started writing some bits and pieces of (mostly) Python code. These are going into a Savannah project called Myrialign.
The major component at this stage is a utility to align short reads to a reference genome, a common first step when analysing short read data. An alignment can contain any number of SNPs, insertions and deletions, up to a user specified cutoff.
One cute thing I'm doing with this is some extreme bit-parallelism. A single alignment can be writen as a logical expression. Either two things align, or they don't (if they do you go back and work out how, but this is infrequent and is therefore allowed to be slow). We could perform each of the operations required to evaluate this logical expression for each pair of sequences we want to align, but it is possible to do much better. Instead, we can pack 32 (or 64 or 128, depending on architecture) independent alignments into a machine word, and perform the operations on this. Bingo, 32x parallelism!
We may be able to do better still with vector operations. Even with bit-parallelism, we can still get pipeline stalls waiting for operations to complete. It may actually be better to work with many words worth of alignements at once. Performing a vector logical operation isn't going to stall the pipeline. When performing vector operations, the CPU will be able to easily predict what data it will need to load ahead of time. Any housekeeping gets amortized across more alignments. It will at the least be no worse to do this.
This assumes we are able to pump data in and out of memory as fast as we can operate on it. From what I have read on the memory hierachy (someone please correct me if I am wrong), while latency gets progressively worse the further down the cache hierachy one goes, throughput only starts to suffer once you hit main memory. And with vector operations, it's throughput we're interested in here. When your vectors are large enough, latency doesn't matter.
Suddenly, Python and Numpy become a competitive option.
The other cute thing: It really screams on a PlayStation 3. I'm trying to convince my department that it needs a cluster of them.
4 December 2007, 9:51 UTClol.py
Small python script to translate English into LOL spelling. You will need to install the eSpeak speech synthesizer (for phoneticization).
LOL> I have a bucket. ai hav a bukit LOL> No, they are stealing my bucket! no dei ar stilen mai bukit LOL> In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. And God saw the light, that it was good: and God divided the light from the darkness. And God called the light Day, and the darkness he called Night. enteh beginen g0d kri8ed teh hevan and deh eth and deh eth w0z widaut fOam and vOid and darknas w0z ap0n teh feis 0v teh dip and teh spirit 0v g0d muvd ap0n teh feis 0v teh wOtez and g0d sed let deh bi lait and deh w0z lait and g0d sO teh lait datit w0z gud and g0d devaidid teh lait fr0m teh darknas and g0d kOld teh lait dei and teh darknas hi kOld nait
24 October 2007, 23:32 UTCLazy lists in Python
Convert an iterator in Python into a lazily evaluated list. Surprisingly useful.
class Lazy_list:
def __init__(self, iterator):
self.data = [ ]
self.iterator = iterator
def __getitem__(self, index):
while len(self.data) <= index:
self.data.append( self.iterator.next() )
return self.data[index]
9 August 2007, 0:46 UTCSearch a large sorted text file in Python
This is a very simple way to deal with a large static database:
- Generate your data in any order, one element per line. Suggested format:
<table identifier><key><delimiter><value>
By sticking in a table identifier you can put multiple tables in the one file (or multiple indexes of the one table).
- Sort your data. If it's too large to fit in memory, use this script by Nicholas Lehuen.
- Look up data using the class below.
class Searcher:
def __init__(self, filename):
self.f = open(filename, 'rb')
self.f.seek(0,2)
self.length = self.f.tell()
def find(self, string):
low = 0
high = self.length
while low < high:
mid = (low+high)//2
p = mid
while p > 0:
self.f.seek(p)
if self.f.read(1) == '\n': break
p -= 1
line = self.f.readline()
if line < string:
low = mid+1
else:
high = mid
p = low
while p > 0:
self.f.seek(p)
if self.f.read(1) == '\n': break
p -= 1
result = [ ]
while True:
line = self.f.readline()
if not line or not line.startswith(string): break
if line[-1:] == '\n': line = line[:-1]
result.append(line[len(string):])
return result
This class can be used to quckly find all lines in a sorted file that start with a given string. Pass find a string containing <table identifier><key><delimiter> and it will return all associated values.
31 July 2007, 11:19 UTCEuclidean Distance Transform in Python
This may be useful to someone.
Algorithmic complexity doesn't seem bad, but no guarantees. To use, pass distance_transform a 2D boolean numpy array. Psyco helps. So would rewriting it in C.
def _upscan(f):
for i, fi in enumerate(f):
if fi == numpy.inf: continue
for j in xrange(1,i+1):
x = fi+j*j
if f[i-j] < x: break
f[i-j] = x
def distance_transform(bitmap):
f = numpy.where(bitmap, 0.0, numpy.inf)
for i in xrange(f.shape[0]):
_upscan(f[i,:])
_upscan(f[i,::-1])
for i in xrange(f.shape[1]):
_upscan(f[:,i])
_upscan(f[::-1,i])
numpy.sqrt(f,f)
return f

23 April 2007, 12:03 UTCPrototype-based programming in Python
In prototype-based programming there is inheritance but no instantiation. I find myself using a prototype-based style in web applications sometimes. For example I might define a "Field" object that contains the logic for displaying and editing a field in a row of a database table. I might then specialize it in derived objects to perform various kinds of validation.
In Python this can be achieved by neutering a class:
import types
class Prototype:
class __metaclass__(type):
def __new__(self, name, bases, dict):
for member, value in dict.items():
if type(value) == types.FunctionType:
dict[member] = classmethod(value)
return type.__new__(self, name, bases, dict)
__new__ = NotImplemented
All methods are converted into classmethods, and the class can not be instantiated. Use of classmethod rather than staticmethod is crucial: When classmethod is used, the method is passed the class as the first argument (much like an instance is passed "self"). This means a method defined in a superclass can call other methods which have possibly been customized in sub-classes.
One trick here: When invoking a super-class's overridden method from a sub-class, one must call Superclass.function.im_func(self, other arguments).
29 November 2006, 8:46 UTCPython VP-tree implementation
VP-trees are a data structure for performing fast nearest-neighbour queries in metric spaces.
Here's a fairly naive implementation in Python:
A Levenshtein-distance based spelling corrector is included in this program as a test.
$ python VP_tree.py /usr/share/dict/words
Load dictionary
98568 words
Construct tree
727925 comparisons
Ready to answer queries
query> mispelt
76 comparisons later... 1 misspelt
2348 comparisons later... 2 dispels
0 comparisons later... 2 misspell
0 comparisons later... 2 misspent
0 comparisons later... 2 respelt
query> correct
8 comparisons later... 0 correct
1591 comparisons later... 1 corrects
1623 comparisons later... 2 Forrest
0 comparisons later... 2 collect
0 comparisons later... 2 connect
query> fandabulous
18600 comparisons later... 3 fabulous
7333 comparisons later... 4 nebulous
0 comparisons later... 4 scandalous
14885 comparisons later... 4 pendulous
2314 comparisons later... 5 Daedalus
2 November 2006, 5:37 UTCSimple fairly robust time and date parsing in Python
For a Python based web project I'm working on, I needed some robust date and time parsing routines. I couldn't find anything suitable, so I wrote my own.
I place this software in the public domain.
Note that these routines use the Australian format of dd/mm/yyyy. Americans will have to adjust this to handle mm/dd/yyyy.
Correctly handles inputs such as:
1pm
13:30
5:05pm
2/6/1977
02/06/77
1977-06-02
2 June, 1977
June 2, 1977
2 June
June 2
If a year is not given, the current year is used.
Caveat: Doesn't handle 1st, 2nd, 3rd, etc.