Generators, comments and the data fetish

Author: Marcin Swiatek, Visimatik Inc., 2008

I worked on a tool for navigating data hierarchies stored in HDF5 files. As usually happens, at some point I wanted to convince myself that the program would be adequate for the intended purpose [1]. To this end, I needed to exercise it against a reasonably wide set of exemplary data. I already had examples downloaded from the HDF5 web site and the datasets I had generated for my project, but I needed more still. So I contrived a tool to help me quickly populate tables in an HDF5 file of arbitrary structure with random data [2].

This is where things get awkward. Given the sheer thrill of writing test code and the comforting obviousness of several solutions to the problem, how do I convince at least one person to stay with me past this paragraph? I promise to spend little time on trivial matters, like generating random numbers or the syntax of loop statements. The article will focus on generators and reflection and suggest a practical and entertaining use of these fairly obscure, yet very useful, aspects of Python.

This document and the associated source code can be downloaded from www.visimatik.com.

The set-up
Frequently the problem statement suggests possible solutions. And while prudent programmers should treat solutions that 'invite themselves' with distrust, in this case the description is best taken at face value. The tandem routines PopulateTablesInFile and PopulateTable (Listing 1) implement a straightforward strategy: find all tables in the file and execute the prescribed number of insertions against each table. An insertion consists of an assignment (each field is assigned a new value), followed by a call to the append method of the object representing the table's row.

The only aspect that calls for comment is how the assigned values are produced. This task is delegated to callable objects I will refer to (somewhat awkwardly) as 'data makers' [3]. It is rather obvious what a 'maker' needs to do: when called, it returns a new value of the expected type. Values for all column types, except strings, are ultimately generated by calls to functions imported from the standard library module random. It seems reasonable to associate these 'makers' with data types or, more exactly, with types of types: kinds [4]. However, I decided to have the front-line 'maker holder' indexed by column descriptions, rather than just types. Column descriptors carry information about the column type and, for the purpose of this exercise, can be reduced to kinds. In the future, however, they will let me introduce another level of indirection between the dataMakers parameter of PopulateTable and the KindMakers dictionary. I plan to use it to populate columns with variates of different distributions, depending on data semantics.

PopulateTablesInFile constitutes the highest-level interface to the 'table stuffer' module (TableStuffer.py) and is used by the actual command script (TestMaker.py). There is nothing remarkable about the command line script and we will spend no time analyzing it. Structures of HDF files the script can create have been predefined using the mechanism discussed previously. These definitions can be found in Schemas/Canned.py. You will need pytables and HDF5 libraries to run the example.

[1] This process is usually called 'testing'.
[2] There is a very long argument to be made about using synthetic data for verification of algorithms and data analysis methods. In several fields I am familiar with, synthetic data is used often and to good results. However, in this text I intend to focus on some features of the programming language: Python. Building models for generators of synthetic data is out of scope of this article.
[3] I am trying hard to avoid calling these objects 'generators'. The name 'generators' might seem natural to describe a routine that generates something. But since this article is about generators in a different meaning, it is better kept this way. 'Makers' for objects serving data when called, 'generators' for...
[4] For practical purposes, the details (like representation length) of columns' types won't really matter. Makers return values in Python types, which are converted upon assignment. To keep things simple I have decided to ignore the distinction between fixed-length string columns in pytables and Python strings – the only practical situation where exact type is indeed important.
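The division of labour between makers and the population loop can be tried out without pytables. The sketch below is only a stand-in for Listing 1, not its code: the names kind_makers and populate are hypothetical, columns are described by a plain name-to-kind dictionary, and rows are ordinary dictionaries rather than pytables row objects.

```python
import random

# Hypothetical stand-in for KindMakers: one zero-argument callable per kind.
kind_makers = {
    'int': lambda: random.randint(-1000, 1000),
    'bool': lambda: random.random() > 0.5,
    'float': lambda: random.expovariate(0.1),
}

def populate(columns, count, makers=kind_makers):
    """Build 'count' rows; 'columns' maps a column name to its kind.

    Columns whose kind has no maker are skipped, in the spirit of the
    KeyError handling in PopulateTable.
    """
    # Bind each column to its maker once, before the insertion loop.
    bound = [(name, makers[kind]) for name, kind in columns.items()
             if kind in makers]
    # Every row gets a fresh value from each bound maker.
    return [dict((name, make()) for name, make in bound)
            for _ in range(count)]

# The 'opaque' kind is made up here to show a column being left alone.
rows = populate({'id': 'int', 'flag': 'bool', 'weight': 'float',
                 'blob': 'opaque'}, 3)
```

Binding makers to columns before the loop mirrors the boundflds list of PopulateTable: the lookup, and any decision to skip a column, happens once per table rather than once per row.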

First, there was Word – or on the benefits of wordy comments
When it comes to populating text fields, the simplest solution is to generate random sequences of characters. However, this approach has considerable drawbacks. While the practice may be adequate for load testing, it will not do if the data is ever to be evaluated by a human being. Chance doesn't look life-like, and a person evaluating test results based on random symbols will have considerable difficulty finding and recalling any point of reference. The usual way out of this difficulty is to take words from a file containing natural language text (literature classics always work best). However, this is a truly light-weight project and hauling a large text file around with a tiny test script just seems out of proportion.

How could I do without, then? The program itself is text, although admittedly with a limited and peculiar vocabulary. Yet a good part of any source file is written for human readers: comments and doc-strings. This will be my source of test data. Given Python's nature, it is relatively easy to access this information at runtime. The standard library module inspect offers several useful tools to get the task done. In the wordManufacture routine (Listing 2), I traverse the live graph of runtime artifacts. It is worth underscoring that this is not the graph of relations of the program's data (objects), which may refer to each other in very complex ways. Here, we will remain on the meta level, where relations between entities (such as module, type, class or method) are defined by the lexical structure of the program [5]. One could expect a graph with edges defined by relations of inheritance and containment to be free of cycles. Unfortunately, this is not the case:
    >>> import inspect, __main__
    >>> mmth = inspect.getmembers(__main__)
    >>> mnm = [t[1] for t in mmth if t[0] == '__main__']
    >>> mnm
    [<module '__main__' (built-in)>]
    >>> mnm[0] == __main__
    True
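The same kind of cycle can be reproduced without relying on how the interpreter was started. The following is a small sketch, assuming a synthetic module object named demo (a hypothetical name) that is made to refer to itself, much as __main__ does once it has been imported into its own namespace.

```python
import inspect
import types

# A hypothetical module object created on the fly for this illustration.
demo = types.ModuleType('demo')
demo.demo = demo          # the module now lists itself among its members

# Keep only the members that are modules, as a traversal would.
members = dict(inspect.getmembers(demo, inspect.ismodule))
contains_itself = members['demo'] is demo
```

A traversal that blindly follows members of type module would therefore visit demo again and again.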

The practical consequence of this observation is that the code cannot be treated as a tree. Normally, one would strive to devise an algorithm avoiding infinite loops. However, in this particular situation it made more sense for me to embrace infinity. After all, the program is supposed to generate words until the end of time.

The source code associated with a programming artifact may be inspected using appropriate routines from the inspect module. In particular, the functions getcomments and getdoc extract comments and documentation strings, the information that best suits the purpose. Each string obtained will likely contain several words: the smallest pieces of text that can be easily noticed, memorized and referenced. The object producing text data will thus return individual words.

Notice that the algorithm gathering words will have to operate on a nested structure, a graph of objects containing lists of words. A nested iteration is easy to program, but there is an additional challenge: the words need to be returned one by one, in subsequent calls. One could gather all text up front and return words from storage. This, however, requires the graph traversal problem to be addressed properly. The alternative is to encapsulate the process of data extraction in a class, which would advance the iteration 'on demand', when more data is needed. There is nothing unusual about this proposition. For instance, classes reducing a complex data structure or an algorithm to an iteration are the favorite vehicle of database access libraries. Devising a class for the task would not be difficult. The only aspect that might call for special care is the question of representing the state of nested iterations in the object's variables. Interestingly, in Python the task of finding a suitable representation can be delegated to the language itself.

[5] This is a major simplification. In reality, Python's dynamic character makes the lexical structure of the program more malleable than one may expect.
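To make the class-based alternative concrete, here is a minimal sketch of what keeping the state of a nested iteration in an object's variables might look like. It is an illustration only, not code from the article's sources: the class DocWordCursor and the sample functions are hypothetical, and the traversal is reduced to a flat list of objects whose doc strings are split into words.

```python
import inspect

class DocWordCursor(object):
    """Hand out, one per call, the words of the doc strings of 'objects'."""

    def __init__(self, objects):
        self._objects = list(objects)
        self._obj_index = 0      # position in the outer iteration
        self._words = []         # words of the object being processed
        self._word_index = 0     # position in the inner iteration

    def next_word(self):
        # Advance the outer iteration until the inner one has data left.
        while self._word_index >= len(self._words):
            if self._obj_index >= len(self._objects):
                raise StopIteration
            doc = inspect.getdoc(self._objects[self._obj_index])
            self._obj_index += 1
            self._words = doc.split() if doc else []
            self._word_index = 0
        word = self._words[self._word_index]
        self._word_index += 1
        return word

def sample_a():
    """alpha beta"""

def sample_b():
    pass            # no doc string: the cursor skips this object

def sample_c():
    """gamma"""

cursor = DocWordCursor([sample_a, sample_b, sample_c])
words = [cursor.next_word() for _ in range(3)]   # ['alpha', 'beta', 'gamma']
```

Three instance variables are needed just to remember where two nested loops stopped; every additional level of nesting would add more of this bookkeeping.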

The state of a computation
Suppose you invoke wordManufacture() as presented in Listing 2. What will the call return? Well, it is easy if you try...
    >>> p = wordManufacture()
    >>> p
    <generator object at 0x660d0>
    >>> p.next()
    '__main__'

Instead of returning a string, as one might have expected, wordManufacture() returns an object – a generator. According to Python documentation, it is enough to place a yield statement in the function's body to make the interpreter create a wholly different code execution structure and in place of a normal function produce a generator function.

I find it convenient to look at generators in a similar way as at iterators [6]. One could say an iterator represents an iteration. By the same token, a generator can be thought of as an object representing, and permitting some control over, the flow of a computation. In this context 'control' amounts to something very much akin to stepping through an iteration. However, the routine which is to be controlled through a generator needs to be written in a specific way, with explicit definition of junction points, where the generator function will communicate with the outside. In Python, these junction points are defined using the yield keyword.

Upon invocation of the method next() of a generator object, the related generator routine will execute up until the next yield in its code. A call to next() returns whatever the generator routine yields. In earlier versions of Python (pre PEP-342), the interface of a generator was exactly that of an iterator and the construct lacked the two-way communication enabled by the send method of the interface. Thus, generator functions were just a way to code some iterations more conveniently. Most examples given in literature reinforce this association.

Sidebar:
Generators in Python are now something more than just enhanced iterators. In the scenario described here, one-way communication implies that one routine 'plays back' another, as if having it perform a certain task. However, it is possible to communicate in both directions, through the send method of the generator interface and the value returned by the yield expression. This enables compositions where two or more routines collaborate on some task (the term for that is, I believe, cooperative multitasking). In other words, generator functions can be coroutines, with all associated benefits. For instance, it is a natural way of expressing several interesting algorithms.
Programming coroutines is an interesting topic in its own right, but a fairly broad one, too. It will not be discussed here; instead, I refer all readers interested in writing coroutines in Python to the already invoked PEP-342 and other resources on the web.
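The stepping behaviour, and the two-way communication mentioned above, can be observed on a toy generator. This is only a sketch: countdown is a hypothetical function, not part of the article's sources. It uses the built-in next(), available from Python 2.6 onwards, which has the same effect as the next() method shown in this article and also runs on Python 3.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1; a value passed to send() resets the count."""
    current = start
    while current > 0:
        # Junction point: execution pauses here until next() or send().
        received = yield current
        if received is not None:
            current = received      # the caller pushed a new value in
        else:
            current -= 1            # plain next(): carry on counting

gen = countdown(3)
step1 = next(gen)       # runs up to the first yield: 3
step2 = next(gen)       # resumes, counts down, yields 2
step3 = gen.send(5)     # resumes with 5 as the value of the yield: yields 5
step4 = next(gen)       # counting continues from the new value: 4
```

Calling countdown(3) executes none of the function body; the code runs only between successive calls to next() or send().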
[6] When discussing the concept, I much prefer to focus on the design pattern, rather than on its interpretation in a specific language. Oddly enough, I could not find on-line an explanation that I liked. Wikipedia does have an article, but it is poorly written, in my opinion. This one seems good, but the link just looks like it is not going to last. The seminal GoF book 'Design Patterns' brings a good discussion, but it is not available as an on-line reference.

The design presented in this article also follows that established usage pattern. However, please keep in mind that while similarities between generators and iterators are really difficult to overlook, one should not reduce one to the other. Generators and iterators are most often employed in the context of a for loop, which in Python is always about iterating through something (an iterable) using an iterator object. The loop's semantics completely occludes the use of iterators, usually to the programmer's great benefit. Owing to that, the behavior of iterators is usually of concern only in the context of writing container classes. However, this example requires placing a generator outside the usual context. Adapting an iterator or a generator to a callable interface is not complicated, but requires some care. In my example, the generator is adapted to a callable interface through a thin wrapper object. Its __call__ method explicitly invokes the generator's next() method. Notice the try-except clause surrounding this call. According to the iterator and generator contracts, the StopIteration exception is used to signal that there are no more elements in the collection (or no more computations to perform). Hence, this is something we must expect [7]; the semantics of the for construct in Python includes the exception handling, but here it is up to the programmer. The wordSmith class in Listing 2 implements the wrapper.
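The adaptation described above can be sketched in a few lines. The names below (CallableFromGenerator, three_words) are hypothetical and chosen for this illustration; the class is a generic variant of the wrapper idea, not the wordSmith class itself, and it uses the built-in next() so that it runs on current Python versions.

```python
class CallableFromGenerator(object):
    """Present a generator function as a zero-argument callable."""

    def __init__(self, factory):
        self._factory = factory     # generator function, called without arguments
        self._gen = factory()

    def __call__(self):
        try:
            return next(self._gen)
        except StopIteration:
            # The generator ran out: start a fresh one and serve its first
            # value. If the fresh generator yields nothing at all, the
            # StopIteration it raises reaches the caller.
            self._gen = self._factory()
            return next(self._gen)

def three_words():
    for word in ('alpha', 'beta', 'gamma'):
        yield word

maker = CallableFromGenerator(three_words)
served = [maker() for _ in range(5)]
# served == ['alpha', 'beta', 'gamma', 'alpha', 'beta']
```

A for loop would have swallowed the StopIteration silently; outside a loop, the wrapper has to decide what exhaustion means, and here it means 'start over'.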

Remaining business
To make sure the wordManufacture routine can be stopped, I have introduced a primitive counter into the algorithm. While it came in useful in debugging, in real life I would use a different way of extracting finite sequences from an infinite-loop generator. The standard itertools module brings a variety of tools extending the standard iterators mechanism [8]. One of them, islice, fits the purpose perfectly.
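As an illustration of the islice approach, here is a minimal sketch. The generator endless_words is a hypothetical stand-in for an infinite word source; it is not the wordManufacture routine.

```python
from itertools import count, islice

def endless_words():
    """A stand-in for an infinite word source: word1, word2, ..."""
    for n in count(1):
        yield 'word%d' % n

# islice stops after five items; the generator itself needs no counter.
first_five = list(islice(endless_words(), 5))
# first_five == ['word1', 'word2', 'word3', 'word4', 'word5']
```

With this arrangement the decision about how many items to take stays with the caller, and the generator can remain an unconditional infinite loop.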

[7] A generator function will raise StopIteration upon termination.
[8] The documentation of the itertools module also brings several examples of use of the yield statement.

Listings
Listing 1: The essential interface
import random
import time

# wordSmith, the maker for the 'string' kind, is defined in Listing 2.
KindMakers = {'string'  : wordSmith(),
              'int'     : lambda : random.randint(-1000, 1000),
              'uint'    : lambda : random.randint(0, 1000),
              'bool'    : lambda : random.random() > 0.5,
              'float'   : lambda : random.expovariate(.1),
              'complex' : lambda : random.gauss(0, 100.0) + 1j * random.expovariate(.1),
              'time'    : lambda : time.time() + random.gauss(0, 10000.0)}

class MakerFromColumn(object):
    def __init__(self, kinddct = KindMakers, stoplst = []):
        self._src = kinddct
        self._stoplist = stoplst[:]

    def __getitem__(self, key):
        # Columns named in the stop list get no maker.
        if key._v_name in self._stoplist:
            raise KeyError(key)
        return self._src[key.kind]

TypedMakers = MakerFromColumn(KindMakers, ['vID'])

def PopulateTable(table_obj, count = 1, dataMakers = TypedMakers):
    """
    The function will insert into the data table 'count' rows that
    will be generated by the supplied dataMakers.
    dataMakers object is expected to implement simple indexing operator,
    with column descriptions (tables.Col) as parameter. See the default
    value of dataMakers (TypedMakers) for an implementation.
    """
    # Bind a maker to every column that has one.
    boundflds = []
    for col_path in table_obj.colpathnames:
        try:
            data_gen = dataMakers[table_obj.coldescrs[col_path]]
            if data_gen:
                boundflds.append((col_path, data_gen))
        except KeyError:
            pass
    # Assign a fresh value to each bound field, then append the row.
    accs = table_obj.row
    for i in xrange(count):
        for (key, gen) in boundflds:
            accs[key] = gen()
        accs.append()
    table_obj.flush()

def PopulateTablesInFile(hdf5File, rowcounts = {}, defcount = 100):
    for tb in hdf5File.walkNodes("/", "Table"):
        PopulateTable(tb, rowcounts.get(tb.name, defcount))

Listing 2: The data maker for the 'string' kind
def wordManufacture(max_iter_count = -1):
    import __main__
    import inspect
    item_queue = []
    filterette = lambda itm : inspect.isclass(itm) or inspect.isroutine(itm) or inspect.ismodule(itm)
    while max_iter_count != 0:
        try:
            this_object = item_queue.pop(0)
            nextpass = [tp[1] for tp in inspect.getmembers(this_object, filterette)]
            # An object that lists itself among its members is moved
            # to the end of the batch.
            if this_object in nextpass:
                nextpass.remove(this_object)
                nextpass.append(this_object)
            item_queue += nextpass
            # Collect the words of the doc string and of the comments.
            txtlst = []
            doc_string = inspect.getdoc(this_object)
            comment_string = inspect.getcomments(this_object)
            if doc_string is not None:
                txtlst += doc_string.split()
            if comment_string is not None:
                txtlst += comment_string.split()
            for wrd in txtlst:
                yield wrd
        except IndexError:
            assert not item_queue
            item_queue.append(__main__) #da capo...
        if max_iter_count > 0:
            max_iter_count -= 1

class wordSmith(object):
    def __init__(self):
        self.__gen = wordManufacture()

    def __call__(self):
        try:
            return self.__gen.next()
        except StopIteration:
            # The generator has finished: start a new one.
            self.__gen = wordManufacture()
            return self.__gen.next()