
Python, PyTables and meta-classes

Author: Marcin Swiatek, Visimatik Inc., 2007

Introduction
My task was to organize genotyping data for analysis. At the time, I was working on a whole-genome
association (WGA) study. For a programmer, the sheer size of the input datasets is one of the defining
characteristics of such projects. In this case, I had over 70 GB of compressed text to deal with.
Data for a WGA study is best pictured as a very large matrix, with columns corresponding to single
nucleotide polymorphisms (SNPs). Rows correspond to samples or, in human terms, the people who
donated their genetic material for the study. Values in the cells of this huge [1] matrix indicate the
nucleotides found at these positions. Unfortunately, genotyping data does not necessarily arrive in a
format convenient for analysis. More often, it will have a structure suitable for recording measurements
and will need to be converted.

This document and the associated source code can be downloaded from www.visimatik.com.

The toolkit
Python and numpy are my tools of choice for this type of work. For data persistence, I would normally
use pickling or XML but, given the volume of data, neither was appropriate. Due to the required form of
the data, a relational database would not fit the purpose either. Enter HDF5 and PyTables [2]. HDF5 is a
data format and a library for managing large data collections, with a niche in scientific data. PyTables is a
Python interface to HDF5 files. The structure of an HDF5 file comprises three types of entities:
groups, tables and arrays (arrays will not figure in this discussion). These entities [3] are arranged into a
tree, not unlike a file system.
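
To make the discussion concrete, here is a minimal sketch of working with such a tree through PyTables;
the file name and layout are illustrative only, not part of the project described below.

import tables

class Probe(tables.IsDescription):          # a table layout, declared as a class
    name = tables.StringCol(itemsize = 32)
    value = tables.Float64Col()

h5 = tables.openFile("example.h5", "w")                         # the root of the tree
readings = h5.createGroup(h5.root, "readings", "Raw readings")  # a group: /readings
h5.createTable(readings, "probes", Probe)                       # a table: /readings/probes
print h5    # prints the object tree, much like a directory listing
h5.close()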

The problem
Ultimately, my goal was to explore the data through various programs and scripts I was yet to write. At that
point I could not possibly know what exactly these programs would do. What I did know was how they
might be accessing their data. For instance, retrieval of parts of the main matrix, along with the indexes
describing the loaded fragments, was expected to be a frequent request. This perspective defined my first
design. My grasp of the domain was still weak at the time. Knowing this, I consciously tried to ignore my
inner prophet's predictions of how the data structures would evolve. In principle, it is better to refrain from
making generalizations if you do not have enough detailed knowledge from which to draw them.
Still, some rudimentary abstractions were necessary. To insulate computations from the technicalities of
managing the data file itself, I put together a simple encapsulation of a 'database connection'. My wrapper
consisted of two classes: TableDef defined the database structure, while Notebook was responsible for
'housekeeping chores', most importantly creating, opening and closing the file.
The implementation of __getattr__ (listing 1) shows that no attempt was made to hide the structure of data
storage: clients are allowed to access the catalog structure of the dataset directly. Likely, this would not
be allowed to stand in the long run. However, this exercise is not about the mythical long run. It is chiefly
about the createDatabase and verifyDatabase routines. The former, invoked only when a new file is created,
puts in place the basic structures required by the application. The latter asserts that the opened HDF5 file
appears to have these basics. The code in listing 1 is intended to give an idea of that first, sketchy
solution; it is provided for illustration purposes only.

[1] Whole-genome studies frequently profile more than a thousand people using several thousand SNPs.
[2] The best place to start looking for tools of this kind is SciPy, a comprehensive resource for using Python in science.
[3] In the sequel, I will often refer to these entities by the common moniker 'part'.
This system worked for me for several days, until I realized that additional data, describing relations
between SNPs on the same chromosome, was needed. The new information came in the form of several
large matrices, again with associated indexes. I needed to adjust my design and, since re-running the
computation-heavy conversion was not practical, to upgrade my data files accordingly.
As one would expect, HDF allows modifications of data structures in existing files. There are several
functions in PyTables which manipulate the data definition; I had already used some of them to create the
original structures. What could be simpler than adding similarly shaped code injecting the additions into the
model? Well, the route of small, incremental changes has inherent problems. For instance, the new code
would have to be very particular about verifying its preconditions: to check, for example, whether the entity
about to be created is already there. Managing data definitions by writing more and more imperative
definitions is bound to result in something rather ugly as the software evolves. Consider the following
argument:
Suppose I would like to add a new table, defined by:

class DistanceInChromosomes(tables.IsDescription):
    chromosome = tables.UInt8Col()
    source_name = tables.StringCol(itemsize = 256)
    array_set = tables.StringCol(itemsize = 256)

A simple conditional instruction would do splendidly:

if '/source/hapmap/files' not in dbfile:
    dbfile.createTable('/source/hapmap', 'files', TableDef.DistanceInChromosomes)

It is easy enough and it apparently works. So where is the problem?


Firstly, the definitions of the tables' structures are separated from the declaration of their position within
the hierarchy.
Secondly, adding a new structure requires changing code in three (or maybe even four) places: the creation,
verification and 'upgrade' routines, on top of the definition itself. It would be better to have just one place
to modify. Inevitably, if there are four places to update with every change, one will occasionally get
missed.
Thirdly, while generic 'upgrade and verify' code may be more difficult to write, you need to write and test
it just once. A sequence of nested if statements is simple enough to edit, but these additions will grow
more and more difficult to review and test.
On top of all these concerns: writing code of that sort is incredibly boring and makes any good
programmer suffer. In the spirit of increasing world happiness, I will propose an elegant solution to the
problem at hand.
PyTables declares table structures in an interesting way. The function createTable will accept a class
declaration [4] as the parameter describing the layout of the new table. I liked the idea for its apparent
elegance and decided to extend it into a reusable mechanism, tying these class-definitions to their
incarnations in the physical file. Not just tables, but also their place in the hierarchy, ought to be deduced
directly from the declarations. This definition will be allowed to evolve along with the application. The
necessary updates of the file structure will be taken care of as a matter of course.

The first stab

The Python idiom applied in PyTables' createTable, where a class definition is used to guide a
computation [5], requires a facility to examine that definition at runtime [6]. This facility is called
'reflection'. The next section gives a general outline of the concept.
Reflection
According to Wiktionary, reflection is:
1. the act of reflecting or the state of being reflected
2. the property of a propagated wave being thrown back from a surface (such as a mirror)
3. something, such as an image, that is reflected
The dog barked at his own reflection in the mirror.
4. careful thought or consideration
After careful reflection, I have decided not to vote for that proposition.
5. an implied criticism
It is a reflection on his character that he never wavered in his resolve.
I like to think of the computer science term 'reflection' as relating more to sense 4 than to senses 1-3. More
introspection [7] and careful consideration, less mirror-gazing. Whatever the etymology, the crux of the
concept is a program's ability to examine and possibly modify its own structure. It is a very interesting
paradigm: programs that rewrite themselves; computer science does not get much more philosophical
than that. Dynamic languages with dynamic type systems [8], like Smalltalk, Lisp and, of course, Python,
particularly encourage styles of programming that rely on reflection.
The key to a practical realization of reflection is an execution mechanism (a virtual machine) which
preserves the organization of the program as the programmer saw it. This requirement is naturally
met in interpreted languages, but the concept is not confined to that realm anymore. Java is presently the
most popular programming environment employing reflection. Both Java and its spiritual cousin C# are
compiled, statically typed languages. But, unlike C or C++, Java and C# target
execution mechanisms which do understand objects and classes. Consequently, the platform
itself can, and does, provide the necessary means.
There is a notable difference between the view of reflection in Java and in Python. In statically typed
languages, where types have to be known at compile time, one cannot evolve a type or change an object's
layout at runtime. Consequently, the reflection APIs in .NET and in the JRE [9] do not allow alterations to
the definitions of classes or functions at runtime. Owing to its dynamic type system, Python permits
programs to modify themselves during execution.

[4] The class must be derived from tables.IsDescription. Consult the PyTables manual for details.
[5] In our case, the computation results in writing something to a disk file.
[6] I can already hear my inner C++ programmer (have I already confessed to that?) arguing the merits of meta-programming, C++ style. One could imagine generating the code for creating the table from a class using macros and/or templates.
[7] Be aware that the term 'introspection' has a specific meaning in computer science. See type introspection.
[8] I do realize it sounds awful. But neither do dynamic languages have to be dynamically typed, nor does dynamic typing imply a dynamic execution model. Consult your favorite on-line source or your local computer science guru for more information.
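
To illustrate that freedom concretely: a class object in Python can be examined, and even extended, long
after its declaration has been executed. A few lines of plain Python (no PyTables involved):

class Sample(object):
    pass

print Sample.__name__, Sample.__bases__    # examine the class object itself
print 'greet' in Sample.__dict__           # False - no such attribute yet

Sample.greet = lambda self: "hello"        # modify the class at runtime
print Sample().greet()                     # prints: hello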

The introspective code

Listing 2 presents the code I wrote at first. Be very suspicious of it! It plays the role of a negative
example, as it well illustrates several pitfalls of manipulating class objects at runtime. But before the
self-flagellation starts, a word about my design objectives.
First, the entire HDF5 file structure (excluding arrays [10]) should be defined by a class declaration. This
premise begot two entities we will see in listing 2: the TableDef and GroupDescriptor classes. The former
groups the definitions of all entities in an HDF5 file, while the latter is the base class for declaring groups. In
truth, groups could be readily deduced from the tables' declarations, which from now on have to specify their
place in the hierarchy (objName). But I still thought it was better to have the option of defining them
explicitly.
Next, I wanted to handle versions of the structure. Each declared part of the file structure bears a version
number. Note that the Version attribute of TableDef and the objVersion of the classes defining tables have
different roles to play. The number associated with the top-level class is simply the version number of the
entire database; each addition of a subordinate structure should trigger an increase of the Version attribute.
The objVersion attribute of a particular table or group shall equal the version of TableDef at which that
table was added to the data model. This will ease the processing of database upgrades, even if only
partly [11].
Finally, the mechanism will be designed for reuse.
The most interesting things happen in the __init__ function of TableDef. A newly created instance walks
through its own class dictionary, reviewing all the members defined in the class' body. By design, tables
and groups are represented by nested classes derived from designated bases; the inheritance hierarchy of a
class can be accessed at runtime through its __bases__ variable. Once a table or a group declaration is
found, it is put into the list of tables or groups, along with some auxiliary information. Having these lists,
we can easily find a differential between two versions of the database, or find the correct processing order
for any manipulations. This order is defined by the nesting hierarchy implied by the values of the objName
attributes.
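
For example, the helper getHierarchyFromPath (shown in listing 2) expands the path part of an objName
into the chain of groups that must exist above it; sorting the collected records by path depth then yields a
safe creation order:

# The table declared at objName = "/source/hapmap/files" stores the path part
# "/source/hapmap"; every group along that path must exist before the table.
print getHierarchyFromPath("/source/hapmap")   # ['/source', '/source/hapmap']
print getHierarchyFromPath("")                 # [] - a node directly under the root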
This simple code has just one shortcoming: it will not work. Your own 'personal programmer's warning
bell' should be ringing by now. Have a look at what is done when a declaration of a class inheriting from
tables.IsDescription is encountered. I extract information from the objName and objVersion fields of a class
declaration and then I delete these fields! Evidently, this could work only the first time an object of the
TableDef class is created. On the next attempt, the classes declaring the tables' structures would already be
stripped of this information. An obvious mistake, easy to catch in the first code review! However, as it
turns out, this monstrosity would not work even once.

[9] Admittedly, one can think up several ways of doing these things within the framework of the .NET CLR (or the JRE), but outside the framework of statically typed Java and C#. This is certainly a real need, since software of this kind is already emerging.
[10] Arrays are different, because they do not have a structure to be defined up-front and they do not have to be declared before they are inserted. Since they confuse the picture, I have decided to conveniently exclude them from consideration for now.
[11] Neither dropping of tables nor changes to tables' structure are supported by the code. It is quite apparent, though, how additions of fields and removal of tables and groups could easily be handled in this framework (as soon as we get it to work). Data transforms are clearly out of scope of this simple example.

Unexpected complications
For now, let us ignore the impropriety of fiddling with the definition of a class in a method of its instance.
This will be dealt with in time. Why is the program not working at all? As it turns out, attempts to access
the objVersion field of the chld variable (we are still in the __init__ method of TableDef) will often result in
an AttributeError exception. This may be surprising, since all members of the class that pass the type filters
in the if statements should have this attribute. Yet closer inspection in the debugger will reveal that there is
no such field. It is as if this piece of information had been lost. It is easy to find, though: all variables
assigned in class scope were turned into members of the columns collection (see Illustration 1, below).
Overall, the object representing a class derived from IsDescription looks quite different from what you
might expect. This is because the creators of PyTables have elected to customize the class creation process
through the use of __metaclass__.
Earlier in the article, we discussed the notion of reflection. There is one important conclusion to
draw from that section, and it should be stated clearly: classes are objects, too. In Python, classes are first-
class objects, which means they can be used like any other object. In particular, they are at some
point constructed.
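
The built-in function type makes this construction step explicit; the following two declarations produce
equivalent classes:

class Point(object):                           # the usual class statement...
    dims = 2

Point = type('Point', (object,), {'dims': 2})  # ...and its explicit counterpart

print Point().dims    # prints: 2
print type(Point)     # <type 'type'> - the default metaclass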

Illustration 1: 'columns' collection (debugger screenshot)

Normally, classes are created by a call to the built-in function type(name, bases, dict). This default behavior
can easily be changed by substituting the factory of the class object: the class's metaclass. A metaclass
(itself a class, too) has to have a special method called __new__, with a signature matching that of the type
function above. All that remains is a single assignment to the class's __metaclass__ variable in the body of
its declaration:
class metaMetaBase(type):
    def __new__(cls, classname, bases, classdict):
        # examine or modify classdict here, then delegate to type.__new__
        return super(metaMetaBase, cls).__new__(cls, classname, bases, classdict)

class MetaBase:
    __metaclass__ = metaMetaBase

The substituted method can modify the created class [12] or return some other representation altogether.

[12] It is a dictionary; in Python, all objects are dictionaries in a way.
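
A complete, if toy, example of a metaclass modifying the class it creates; metaMetaBase in listing 3 does
the same kind of work in earnest:

class metaStamp(type):
    def __new__(cls, classname, bases, classdict):
        # alter the class dictionary before the class object is built
        classdict['created_by'] = 'metaStamp'
        return super(metaStamp, cls).__new__(cls, classname, bases, classdict)

class Stamped:
    __metaclass__ = metaStamp

print Stamped.created_by    # prints: metaStamp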

Stepping through the process of constructing any class derived from tables.IsDescription [13], we quickly
get to the implementation of the class metaIsDescription in description.py. There, the class dictionary is
customized. All declared fields that 'look like' field names are grouped into a single item of the class
dictionary: columns. In effect, instead of chld.objVersion, we end up with chld.columns['objVersion'].
This explains why the original code does not work and hints at what needs to be done to fix it.
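
The effect is easy to observe in the interpreter, using the SNPs declaration from listing 2 (output as seen
with the PyTables version used here):

print hasattr(TableDef.SNPs, 'objVersion')   # False - no longer a class attribute
print TableDef.SNPs.columns['objVersion']    # 1.0 - folded into 'columns' instead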
Firstly, I will either need to override the existing mechanism of creating IsDescription, or hack my way
through the nested declarations to extract objVersion and objName. Secondly, all processing of class
declarations will need to be moved from the instance code of TableDef (or whatever replaces it) to its
class code.
After some thinking, I decided to work around the nested dictionaries rather than insert another layer
of inheritance. I wanted to keep my little overlay consistent with the PyTables documentation and allow
deriving table definitions directly from IsDescription. The source code related to these changes is in listing 3
(note the lack of the usual 'suspicious code' disclaimer).

Form follows function


Now let's put it all together and examine how the system has been laid out and how it works. The
structure of an HDF5 file will be defined as a single class derived from MetaBase (see listing 4). The class
TestTableDef (already presented in listing 3) is an example of such a definition; it contains several nested
classes declaring groups and tables. Although MetaBase is the main extension point of the mechanism,
and thus its most visible part, the bulk of the 'meta' functionality has been relegated to the class object's
creator: metaMetaBase (listing 3). Upon loading the module with the definition of a class derived from
MetaBase (for instance, TestTableDef), the Python interpreter will attempt to create an object representing
that class. TestTableDef does not define its own metaclass, but its base class does. Consequently, the
__new__ function of metaMetaBase will be invoked and its return value will be used to represent the
TestTableDef class.
What metaMetaBase.__new__ does to the original representation is easy to interpret. First, notice that
it calls upon the base implementation by invoking:

the_default = super(metaMetaBase, cls).__new__(cls, classname, bases, classdict)

The variable the_default is then modified, but in a rather non-abusive way, at least compared to the high
standard set by tables.IsDescription. Information regarding groups and tables is gathered in collections,
which are made available to the class and instance routines of the class being created. It is worth
pointing out that all class objects derived from IsDescription are stripped of their objName and objVersion
attributes; these fields are not real columns and need to be removed before the declaration is passed to the
routines creating tables.
There is still some very pretty Python syntactic sugar to be appreciated here. I apply it with the intent of
making inheriting from MetaBase look as natural as possible. Arguably, the natural way to use a class is
through its instances; any recourse to the 'meta' level should remain an exception. The code of listing 5
illustrates the intended usage patterns: all access to the described system is channeled through plain
invocations of instance methods of the __metadata member variable of the Notebook class. The results of
these method calls are used by the create, upgrade and verify routines. It would probably be acceptable to
have the routines of Notebook contend with the table information as created by metaMetaBase. But this
would make the two classes logically coupled: whoever works with the Notebook class would have to
know how to interpret the values returned by calls to __metadata and, at times, how they were generated. In
a reusable mechanism, this is to be avoided. I found it best to complete the encapsulation of the
mechanism by creating 'callable' objects representing the actions of creating tables and groups (listing 6;
see their use in listing 4).

[13] Put a breakpoint on any assignment creating a column specification.

'__metaclass__' domesticated
Python's __metaclass__ idiom allows overriding the default constructor of the objects that represent
classes. This is an interesting semantic device and, although it certainly invites abuse, it can be very
helpful at times. While there are many ways to define database structures in code, the proposed approach
is elegant and, very importantly, offers strong encapsulation. Hence, classes which rearrange their own
layout in unexpected ways may still appear 'normal' to the outside world, and the use of reflection
remains contained.
A reader interested in other uses of __metaclass__ should pick up a copy of the excellent Python
Cookbook. The standard Python documentation outlines the issues related to class creation. Several
suggestions regarding potential applications of metaclasses are given there, too.

Source code examples

Listing 1: ad-hoc wrapper classes


import os
from time import time
import cPickle
import tables
# Reporting, NotebookDoesNotExist, NotebookInvalidSource and
# NotebookInvalidStructure are project-local helpers, not shown here.

class TableDef:
    class SNPs(tables.IsDescription):
        name = tables.StringCol(itemsize = 32)
        chromosome = tables.UInt8Col()
        aminor = tables.UInt8Col()
        amajor = tables.UInt8Col()

    class Individuals(tables.IsDescription):
        symbol = tables.StringCol(itemsize = 64)
        cohort = tables.UInt32Col()

    class Cohorts(tables.IsDescription):
        symbol = tables.StringCol(itemsize = 32)
        fIndivIdx = tables.UInt32Col()
        countIndiv = tables.UInt32Col()

    class Chromosomes(tables.IsDescription):
        symbol = tables.UInt8Col()
        fSNP_Idx = tables.UInt32Col()
        countSNP = tables.UInt32Col()

    class Chunk(tables.IsDescription):
        chromosome = tables.UInt8Col()
        cohort = tables.StringCol(itemsize = 8)
        chunk_name = tables.StringCol(itemsize = 256)

class Notebook:
    Version = 1.0

    def __init__(self, name, path, write_access = True,
                 feedback = Reporting.DummyFeedback()):
        """
        If the directory exists, attempt to open the file and the index files
        (if these do not exist, we have an error). If there is no directory,
        attempt to create it (and all tables that should be there).
        """
        # name will be a pickle with some basic info (just to mark the territory)
        self.__paths = {}
        self.__paths["main"] = path
        self.__paths["pickle"] = os.path.join(path, name.strip() + ".pck")
        self.__paths["HD5"] = os.path.join(path, "data.HD5")

        self.__HD5 = None
        self.Stamp = None
        self.__GUI = feedback
        self.__ReadOnly = not write_access

        try:
            if os.path.isdir(path):
                self.Stamp = cPickle.load(open(self.__paths["pickle"], "r"))
                # open read-write unless the notebook was requested read-only
                self.__HD5 = tables.openFile(self.__paths["HD5"],
                                             ["r+", "r"][self.__ReadOnly])
                self.verifyDatabase(self.__HD5)
            elif not self.__ReadOnly:
                os.mkdir(path)
                self.Stamp = (name, self.Version, time())
                cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w"))
                self.__HD5 = self.createDatabase(self.__paths["HD5"])

            if self.__HD5 is None:
                raise NotebookDoesNotExist(path, name)

        except IOError, eio:
            raise NotebookInvalidSource(eio)
        except tables.NoSuchNodeError, ensn:
            raise NotebookInvalidStructure(ensn)

    def close(self):
        if not self.__ReadOnly:
            self.Stamp = (self.Stamp[0], self.Version, time())
            cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w"))

        self.__HD5.close()
        self.__HD5 = None
        self.Stamp = None

    def __del__(self):
        if not self.__HD5 is None:
            self.__HD5.close()
            self.__HD5 = None

        if not self.Stamp is None:
            self.Stamp = (self.Stamp[0], self.Version, time())
            cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w"))
            self.Stamp = None

    def __getattr__(self, aname):
        # no attempt to hide the storage: unknown attributes are delegated
        # straight to the underlying HDF5 file object
        return getattr(self.__HD5, aname)

    def verifyDatabase(self, dbfile):
        """
        Check if this is a proper database. If it is not, feel free to throw an exception.
        """
        dbfile.isVisibleNode("/analyses")
        dbfile.isVisibleNode("/metadata")
        dbfile.isVisibleNode("/definition")
        dbfile.isVisibleNode("/source")
        dbfile.isVisibleNode("/definition/chromosomes")
        dbfile.isVisibleNode("/definition/cohorts")
        dbfile.isVisibleNode("/definition/SNPs")
        dbfile.isVisibleNode("/definition/individuals")
        dbfile.isVisibleNode("/source/imports")

    def createDatabase(self, fpath):
        """
        Create a new HD5 file and its content.
        """
        dbfile = tables.openFile(fpath, "a")
        gAgg = dbfile.createGroup(dbfile.root, 'analyses',
                                  'Information derived in analysis')
        gMeta = dbfile.createGroup(dbfile.root, 'metadata',
                                   'Metainformation about performed analyses')
        gSubj = dbfile.createGroup(dbfile.root, 'definition',
                                   'Subject data definition')
        gRaw = dbfile.createGroup(dbfile.root, 'source',
                                  'Raw source data uploaded with nbcp')

        dbfile.createTable(gSubj, 'chromosomes', TableDef.Chromosomes)
        dbfile.createTable(gSubj, 'cohorts', TableDef.Cohorts)
        dbfile.createTable(gSubj, 'individuals', TableDef.Individuals)
        dbfile.createTable(gSubj, 'SNPs', TableDef.SNPs)
        dbfile.createTable(gRaw, "imports", TableDef.Chunk)

        return dbfile

    def kill(self):
        self.close()

        assert not self.__ReadOnly

        for key in self.__paths.keys():
            if "main" != key:
                os.remove(self.__paths[key])

        os.rmdir(self.__paths["main"])

Listing 2: the first version with reflection

"""
Some work on meta-structure (handling db creation and updates)
"""
import tables

class GroupDescriptor:
    pass

def getHierarchyFromPath(all_path):
    pieces = all_path.split("/")
    if len(pieces) > 1:
        return ["/".join(pieces[:idx + 2]) for idx in xrange(len(pieces) - 1)]
    else:
        return []

class TableDef:
    Version = 1.1

    class SNPs(tables.IsDescription):
        """SNPs table"""
        objName = "/definition/SNPs"
        objVersion = 1.0

        name = tables.StringCol(itemsize = 32)
        chromosome = tables.UInt8Col()
        aminor = tables.UInt8Col()
        amajor = tables.UInt8Col()

    class Individuals(tables.IsDescription):
        objName = "/definition/individuals"
        objVersion = 1.0

        symbol = tables.StringCol(itemsize = 64)
        cohort = tables.UInt32Col()

    class Cohorts(tables.IsDescription):
        objName = "/definition/cohorts"
        objVersion = 1.0

        symbol = tables.StringCol(itemsize = 32)
        fIndivIdx = tables.UInt32Col()
        countIndiv = tables.UInt32Col()

    class Chromosomes(tables.IsDescription):
        objName = "/definition/chromosomes"
        objVersion = 1.0

        symbol = tables.UInt8Col()
        fSNP_Idx = tables.UInt32Col()
        countSNP = tables.UInt32Col()

    class Chunk(tables.IsDescription):
        objName = "/source/imports"
        objVersion = 1.0

        chromosome = tables.UInt8Col()
        cohort = tables.StringCol(itemsize = 8)
        chunk_name = tables.StringCol(itemsize = 256)

    class DistanceInChromosomes(tables.IsDescription):
        objName = "/source/hapmap/files"
        objVersion = 1.1

        chromosome = tables.UInt8Col()
        source_name = tables.StringCol(itemsize = 256)
        array_set = tables.StringCol(itemsize = 256)

    class SourceGroup(GroupDescriptor):
        """Raw source data uploaded with nbcp"""
        objName = "/source"
        objVersion = 1.0

    class AnalysesGroup(GroupDescriptor):
        """Information derived in analysis"""
        objName = "/analyses"
        objVersion = 1.0

    class MetadataGroup(GroupDescriptor):
        """Metainformation about performed analyses"""
        objName = '/metadata'
        objVersion = 1.0

    class DefinitionGroup(GroupDescriptor):
        """Subject data definition"""
        objName = '/definition'
        objVersion = 1.0

    def __init__(self):
        classOf = self.__class__
        # now list all dependent classes
        groups = []
        tbls = []
        for chld in classOf.__dict__.values():
            try:
                bases = chld.__bases__

                if GroupDescriptor in bases:
                    description = chld.__doc__
                    name_parts = chld.objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])
                    groups.append((short_name, path, count_parts, description,
                                   chld.objVersion))

                elif tables.IsDescription in bases:
                    description = chld.__doc__

                    # this code does not work. It is here for illustration purposes only
                    # DO NOT ATTEMPT TO USE
                    version = chld.objVersion
                    del chld.objVersion
                    objName = chld.objName
                    del chld.objName

                    name_parts = objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])

                    tbls.append((short_name, path, count_parts, description, version, chld))

            except AttributeError:
                pass

        # now find all groups that may be required (based on 'path' concept),
        # but are not explicitly listed
        grdct = {}
        for gr in groups:
            created_grp = "/".join([gr[1], gr[0]])
            grdct[created_grp] = gr[4]

        extra_groups = []
        for gr in groups:
            extra_groups += [(spath, gr[4]) for spath in getHierarchyFromPath(gr[1])]
        for tb in tbls:
            extra_groups += [(spath, tb[4]) for spath in getHierarchyFromPath(tb[1])]

        # walk through all extra (potential) groups. If they are not there yet,
        # add them to the list
        extra_groups.sort()
        prev_item = ""
        for gr in extra_groups:
            if gr[0] != prev_item and gr[0] not in grdct:
                prev_item = gr[0]
                name_parts = prev_item.split("/")
                count_parts = len(name_parts) - 1
                assert name_parts[0] == "" and name_parts[count_parts] != ""
                short_name = name_parts[count_parts]
                path = "/".join(name_parts[:count_parts])
                groups.append((short_name, path, count_parts, None, gr[1]))

        self.Groups = groups
        self.Tables = tbls

    def __str__(self):
        outs = "Tables:\r\n"
        for table in self.Tables:
            outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4],
                                                 type(table[5]))

        outs += "Groups:\r\n"
        for group in self.Groups:
            outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])

        return outs

if __name__ == "__main__":
    metad = TableDef()
    print metad

Listing 3: essential parts of the implementation using __metaclass__


# GroupDescriptor, getHierarchyFromPath and the tables import are as in listing 2.

class metaMetaBase(type):
    def __new__(cls, classname, bases, classdict):
        the_default = super(metaMetaBase, cls).__new__(cls, classname, bases, classdict)

        if len(bases) == 0:
            return the_default  # do not process the base class...

        groups = []
        tbls = []
        for chld in the_default.__dict__.values():
            try:
                bases = chld.__bases__

                if GroupDescriptor in bases:
                    description = chld.__doc__
                    name_parts = chld.objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])
                    groups.append((short_name, path, count_parts, description,
                                   chld.objVersion))

                elif tables.IsDescription in bases:
                    description = chld.__doc__

                    # this class has been transformed in PyTables through use of
                    # the __metaclass__ idiom; need to get to the dictionary...
                    version = chld.columns["objVersion"]
                    del chld.columns["objVersion"]
                    objName = chld.columns["objName"]
                    del chld.columns["objName"]

                    name_parts = objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])

                    tbls.append((short_name, path, count_parts, description, version, chld))

            except AttributeError:
                pass

        # now find all groups that may be required (based on 'path' concept),
        # but are not explicitly listed
        grdct = {}
        for gr in groups:
            created_grp = "/".join([gr[1], gr[0]])
            grdct[created_grp] = gr[4]

        extra_groups = []
        for gr in groups:
            extra_groups += [(spath, gr[4]) for spath in getHierarchyFromPath(gr[1])]
        for tb in tbls:
            extra_groups += [(spath, tb[4]) for spath in getHierarchyFromPath(tb[1])]

        # walk through all extra (potential) groups. If they are not there yet,
        # add them to the list
        extra_groups.sort()
        prev_item = ""
        for gr in extra_groups:
            if gr[0] != prev_item and gr[0] not in grdct:
                prev_item = gr[0]
                name_parts = prev_item.split("/")
                count_parts = len(name_parts) - 1
                assert name_parts[0] == "" and name_parts[count_parts] != ""
                short_name = name_parts[count_parts]
                path = "/".join(name_parts[:count_parts])
                groups.append((short_name, path, count_parts, None, gr[1]))

        # TODO - check if that works for more levels of inheritance
        # (than just something derived from MetaBase)
        the_default.Groups = groups
        the_default.Tables = tbls

        return the_default

class MetaBase:
    __metaclass__ = metaMetaBase

    Version = 0.0

    def __str__(self):
        outs = "Tables:\r\n"
        for table in self.Tables:
            outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4],
                                                 type(table[5]))

        outs += "Groups:\r\n"
        for group in self.Groups:
            outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])

        return outs

class TestTableDef(MetaBase):
    Version = 1.1

    class SNPs(tables.IsDescription):
        """SNPs table"""
        objName = "/definition/SNPs"
        objVersion = 1.0

        name = tables.StringCol(itemsize = 32)
        chromosome = tables.UInt8Col()
        aminor = tables.UInt8Col()
        amajor = tables.UInt8Col()

    class Individuals(tables.IsDescription):
        objName = "/definition/individuals"
        objVersion = 1.0

        symbol = tables.StringCol(itemsize = 64)
        cohort = tables.UInt32Col()

    class Cohorts(tables.IsDescription):
        objName = "/definition/cohorts"
        objVersion = 1.0

        symbol = tables.StringCol(itemsize = 32)
        fIndivIdx = tables.UInt32Col()
        countIndiv = tables.UInt32Col()

    class Chromosomes(tables.IsDescription):
        objName = "/definition/chromosomes"
        objVersion = 1.0

        symbol = tables.UInt8Col()
        fSNP_Idx = tables.UInt32Col()
        countSNP = tables.UInt32Col()

    class Chunk(tables.IsDescription):
        objName = "/source/imports"
        objVersion = 1.0

        chromosome = tables.UInt8Col()
        cohort = tables.StringCol(itemsize = 8)
        chunk_name = tables.StringCol(itemsize = 256)

    class DistanceInChromosomes(tables.IsDescription):
        objName = "/source/hapmap/files"
        objVersion = 1.1

        chromosome = tables.UInt8Col()
        source_name = tables.StringCol(itemsize = 256)
        array_set = tables.StringCol(itemsize = 256)

    class SourceGroup(GroupDescriptor):
        """Raw source data uploaded with nbcp"""
        objName = "/source"
        objVersion = 1.0

    class AnalysesGroup(GroupDescriptor):
        """Information derived in analysis"""
        objName = "/analyses"
        objVersion = 1.0

    class MetadataGroup(GroupDescriptor):
        """Metainformation about performed analyses"""
        objName = '/metadata'
        objVersion = 1.0

    class DefinitionGroup(GroupDescriptor):
        """Subject data definition"""
        objName = '/definition'
        objVersion = 1.0

Listing 4: MetaBase, the base class for structure definition


class MetaBase:

    __metaclass__ = metaMetaBase

    Version = 0.0

    def __str__(self):
        outs = "Tables:\r\n"
        for table in self.Tables:
            outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4],
                                                 type(table[5]))

        outs += "Groups:\r\n"
        for group in self.Groups:
            outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])

        return outs

    def getVersionDifferential(self, base_version):
        """Get all items that are newer than the base version.
        This is a list of callables that needs to be executed against the database."""
        lobj = []
        lobj += [tpl for tpl in self.Groups if tpl[4] > base_version]
        lobj += [tpl for tpl in self.Tables if tpl[4] > base_version]

        lobj.sort(lambda a, b: cmp(a[2], b[2]))

        return [self.callableFactoryFunction(tpl) for tpl in lobj]

    def getCreateSequence(self):
        return self.getVersionDifferential(0.0)

    def getVersionPaths(self, version):
        lobj = []
        lobj += [tpl for tpl in self.Groups if tpl[4] <= version]
        lobj += [tpl for tpl in self.Tables if tpl[4] <= version]

        lobj.sort(lambda a, b: cmp(a[2], b[2]))

        return [("%s/%s" % (tpl[1], tpl[0])) for tpl in lobj]

    def callableFactoryFunction(self, tpl):
        if len(tpl) == 5:
            return CreateGroupWrapper(tpl[1], tpl[0], tpl[3])
        elif len(tpl) == 6:
            return CreateTableWrapper(tpl[1], tpl[0], tpl[5], tpl[3])
        else:
            return None

Listing 5: The main wrapper class (Notebook)
class Notebook:
    """
    THIS IS AN ILLUSTRATION ONLY. CODE NOT RELEVANT TO THE EXAMPLE HAS BEEN LEFT OUT
    """
    def version(self):
        ver = 0.0
        if not self.__ReadOnly:
            ver = self.metadata().Version
        elif not self.Stamp is None:
            ver = self.Stamp[1]

        # kludge - past sins
        if ver == "0.1.0":
            ver = 1.0

        return ver

    def upgradeDatabase(self, db, oldstamp):
        ver = oldstamp[1]
        assert not self.__ReadOnly

        cmdlst = self.metadata().getVersionDifferential(ver)

        if len(cmdlst):
            for cmd in cmdlst:
                cmd(db)

            return oldstamp[0], self.version(), time()
        else:
            return oldstamp

    def verifyDatabase(self, dbfile):
        """
        Check if this is a proper database. If it is not, feel free to throw an exception.
        """
        lst = self.metadata().getVersionPaths(self.version())
        assert len(lst) > 0
        for path in lst:
            assert path in dbfile

    def metadata(self):
        if self.__metadata is None:
            self.__metadata = TableDef()

        return self.__metadata

    def createDatabase(self, fpath):
        """
        Create a new HD5 file and its content.
        """
        dbfile = tables.openFile(fpath, "a")
        createl = self.metadata().getCreateSequence()
        for cmd in createl:
            cmd(dbfile)

        return dbfile

Listing 6: callable wrapper classes for part metadata


class CreateTableWrapper:
    def __init__(self, path, name, defobj, dsc):
        # default to the root when no path part is given
        self.Path = ["/", path][path is not None and path != ""]
        self.Name = name
        self.DefClss = defobj
        self.Title = ["", str(dsc)][dsc is not None]

    def __call__(self, fobj):
        fobj.createTable(self.Path, self.Name, self.DefClss, self.Title)

    def __str__(self):
        return "table %s=>%s (%s)" % (self.Path, self.Name, self.Title)

class CreateGroupWrapper:
    def __init__(self, path, name, dscr):
        self.Path = ["/", path][path is not None and path != ""]
        self.Name = name
        self.Title = dscr

    def __call__(self, fobj):
        fobj.createGroup(self.Path, self.Name, self.Title)

    def __str__(self):
        return "group %s=>%s (%s)" % (self.Path, self.Name, self.Title)

