pfh's blog: code section

home | blog


13 April 2008, 10:39 UTCRobust topological sorting and Tarjan's algorithm in Python
The Strongly Connected Components of a directed graph are subsets of nodes such that each node within a subset can be reached from each other node. Tarjan's algorithm can identify these components efficiently (thanks njh for finding this algorithm).

By identifying strongly connected components ahead of time we can create a topological sorting algorithm that does the best it can in the presence of cycles.

Here is an implementation in Python:

Here are some things you might use this for:

[permalink]


7 April 2008, 6:27 UTCAMOS message reader/writer module for Python
(of interest to bioinformaticians, and few others)


AMOS (A Modular Open-Source Assembler) defines a standard message format for describing DNA sequence assemblies. It has a Perl module for reading these assemblies, but no Python module.

Shock! This must be rectified, forthwith!


I am currently using this module to import Newbler assemblies of 454 data into AMOS. The quality data is thoroughly b0rked. If anyone has any experience with this, please contact me.

[permalink]


23 February 2008, 10:51 UTCIntroducing Myrialign - align short reads to a reference genome
Monash has just bought a short read genome sequencer which is about to start throwing great gobs of data in my general direction. To deal with this looming deluge, I've started writing some bits and pieces of (mostly) Python code. These are going into a Savannah project called Myrialign.

The major component at this stage is a utility to align short reads to a reference genome, a common first step when analysing short read data. An alignment can contain any number of SNPs, insertions and deletions, up to a user specified cutoff.


One cute thing I'm doing with this is some extreme bit-parallelism. A single alignment can be writen as a logical expression. Either two things align, or they don't (if they do you go back and work out how, but this is infrequent and is therefore allowed to be slow). We could perform each of the operations required to evaluate this logical expression for each pair of sequences we want to align, but it is possible to do much better. Instead, we can pack 32 (or 64 or 128, depending on architecture) independent alignments into a machine word, and perform the operations on this. Bingo, 32x parallelism!

We may be able to do better still with vector operations. Even with bit-parallelism, we can still get pipeline stalls waiting for operations to complete. It may actually be better to work with many words worth of alignements at once. Performing a vector logical operation isn't going to stall the pipeline. When performing vector operations, the CPU will be able to easily predict what data it will need to load ahead of time. Any housekeeping gets amortized across more alignments. It will at the least be no worse to do this.

This assumes we are able to pump data in and out of memory as fast as we can operate on it. From what I have read on the memory hierachy (someone please correct me if I am wrong), while latency gets progressively worse the further down the cache hierachy one goes, throughput only starts to suffer once you hit main memory. And with vector operations, it's throughput we're interested in here. When your vectors are large enough, latency doesn't matter.

Suddenly, Python and Numpy become a competitive option.


The other cute thing: It really screams on a PlayStation 3. I'm trying to convince my department that it needs a cluster of them.

[permalink]


4 December 2007, 9:51 UTClol.py

Small python script to translate English into LOL spelling. You will need to install the eSpeak speech synthesizer (for phoneticization).

LOL> I have a bucket.

ai hav a bukit

LOL> No, they are stealing my bucket!

no 
dei ar stilen mai bukit

LOL> In the beginning God created the heaven and the earth. And the earth was 
without form, and void; and darkness was upon the face of the deep. And the 
Spirit of God moved upon the face of the waters. And God said, Let there be 
light: and there was light. And God saw the light, that it was good: and God 
divided the light from the darkness. And God called the light Day, and the 
darkness he called Night.

enteh beginen g0d kri8ed teh hevan and deh eth 
and deh eth w0z widaut fOam 
and vOid 
and darknas w0z ap0n teh feis 0v teh dip 
and teh spirit 0v g0d muvd ap0n teh feis 0v teh wOtez 
and g0d sed 
let deh bi lait 
and deh w0z lait 
and g0d sO teh lait 
datit w0z gud 
and g0d devaidid teh lait fr0m teh darknas 
and g0d kOld teh lait dei 
and teh darknas hi kOld nait

[permalink]


24 October 2007, 23:32 UTCLazy lists in Python
Convert an iterator in Python into a lazily evaluated list. Surprisingly useful.

class Lazy_list:
    def __init__(self, iterator):
        self.data = [ ]
        self.iterator = iterator
    
    def __getitem__(self, index):
        while len(self.data) <= index:
            self.data.append( self.iterator.next() )
        return self.data[index]

[permalink]


9 August 2007, 0:46 UTCSearch a large sorted text file in Python
This is a very simple way to deal with a large static database:

class Searcher:
    def __init__(self, filename):
        self.f = open(filename, 'rb')
        self.f.seek(0,2)
        self.length = self.f.tell()
        
    def find(self, string):
        low = 0
        high = self.length
        while low < high:
            mid = (low+high)//2
            p = mid
            while p > 0:
                self.f.seek(p)
                if self.f.read(1) == '\n': break
                p -= 1
            line = self.f.readline()
            if line < string:
                low = mid+1
            else:
                high = mid
        
        p = low
        while p > 0:
            self.f.seek(p)
            if self.f.read(1) == '\n': break
            p -= 1
        
        result = [ ]    
        while True:
            line = self.f.readline()
            if not line or not line.startswith(string): break
            if line[-1:] == '\n': line = line[:-1]
            result.append(line[len(string):])
        return result

This class can be used to quckly find all lines in a sorted file that start with a given string. Pass find a string containing <table identifier><key><delimiter> and it will return all associated values.

[permalink]


31 July 2007, 11:19 UTCEuclidean Distance Transform in Python
This may be useful to someone.

Algorithmic complexity doesn't seem bad, but no guarantees. To use, pass distance_transform a 2D boolean numpy array. Psyco helps. So would rewriting it in C.

def _upscan(f):
    for i, fi in enumerate(f):
        if fi == numpy.inf: continue
        for j in xrange(1,i+1):
            x = fi+j*j
            if f[i-j] < x: break
            f[i-j] = x

def distance_transform(bitmap):
    f = numpy.where(bitmap, 0.0, numpy.inf)
    for i in xrange(f.shape[0]):
        _upscan(f[i,:])
        _upscan(f[i,::-1])
    for i in xrange(f.shape[1]):
        _upscan(f[:,i])
        _upscan(f[::-1,i])
    numpy.sqrt(f,f)
    return f

[permalink]


23 April 2007, 12:03 UTCPrototype-based programming in Python
In prototype-based programming there is inheritance but no instantiation. I find myself using a prototype-based style in web applications sometimes. For example I might define a "Field" object that contains the logic for displaying and editing a field in a row of a database table. I might then specialize it in derived objects to perform various kinds of validation.

In Python this can be achieved by neutering a class:

import types

class Prototype:
    class __metaclass__(type):
        def __new__(self, name, bases, dict):
            for member, value in dict.items():
                if type(value) == types.FunctionType:
                    dict[member] = classmethod(value)
            return type.__new__(self, name, bases, dict)

    __new__ = NotImplemented

All methods are converted into classmethods, and the class can not be instantiated. Use of classmethod rather than staticmethod is crucial: When classmethod is used, the method is passed the class as the first argument (much like an instance is passed "self"). This means a method defined in a superclass can call other methods which have possibly been customized in sub-classes.

One trick here: When invoking a super-class's overridden method from a sub-class, one must call Superclass.function.im_func(self, other arguments).

[permalink]


29 November 2006, 8:46 UTCPython VP-tree implementation
VP-trees are a data structure for performing fast nearest-neighbour queries in metric spaces.

Here's a fairly naive implementation in Python:

A Levenshtein-distance based spelling corrector is included in this program as a test.

$ python VP_tree.py /usr/share/dict/words
Load dictionary
98568 words

Construct tree
727925 comparisons

Ready to answer queries

query> mispelt
    76 comparisons later...     1 misspelt
  2348 comparisons later...     2 dispels
     0 comparisons later...     2 misspell
     0 comparisons later...     2 misspent
     0 comparisons later...     2 respelt

query> correct
     8 comparisons later...     0 correct
  1591 comparisons later...     1 corrects
  1623 comparisons later...     2 Forrest
     0 comparisons later...     2 collect
     0 comparisons later...     2 connect

query> fandabulous
 18600 comparisons later...     3 fabulous
  7333 comparisons later...     4 nebulous
     0 comparisons later...     4 scandalous
 14885 comparisons later...     4 pendulous
  2314 comparisons later...     5 Daedalus

[permalink]


2 November 2006, 5:37 UTCSimple fairly robust time and date parsing in Python
For a Python based web project I'm working on, I needed some robust date and time parsing routines. I couldn't find anything suitable, so I wrote my own.

I place this software in the public domain.

Note that these routines use the Australian format of dd/mm/yyyy. Americans will have to adjust this to handle mm/dd/yyyy.

Correctly handles inputs such as:

1pm 
13:30 
5:05pm

2/6/1977
02/06/77
1977-06-02
2 June, 1977
June 2, 1977
2 June
June 2

If a year is not given, the current year is used.

Caveat: Doesn't handle 1st, 2nd, 3rd, etc.

[permalink]


All older entries




[atom feed]  
[æ]