You are on page 1of 10

Practical threaded programming with Python

Threading usage patterns
Noah Gift Software Engineer Giftcs 03 June 2008

Threaded programming in Python can be done with a minimal amount of complexity by combining threads with Queues. This article explores using threads and queues together to create simple yet effective patterns for solving problems that require concurrency.

Introduction
With Python, there is no shortage of options for concurrency, the standard library includes support for threading, processes, and asynchronous I/O. In many cases Python has removed much of the difficulty in using these various methods of concurrency by creating high-level modules such as asynchronous, threading, and subprocess. Outside of the standard library, there are third solutions such as twisted, stackless, and the processing module, to name a few. This article focuses exclusively on threading in Python, using practicle examples. There are many great resources online that document the threading API, but this article attempts to provide practicle examples of common threading usage patterns.
The Global Interpretor Lock refers to the fact that the Python interpreter is not thread safe. There is a global lock that the current thread holds to safely access Python objects. Because only one thread can aquire Python Objects/C API, the interpreter regularly releases and reacquires the lock every 100 bytecode of instructions. The frequency at which the interpreter checks for thread switching is controlled by the sys.setcheckinterval() function. In addition, the lock is released and reacquired around potentially blocking I/O operations. Please see Gil and Threading State and Threading the Global Interpreter Lock in the Resources section for more details. It is important to note that, because of the GIL, the CPU-bound applications won't be helped by threads. In Python, it is recommended to either use processes, or create a mixture of processes and threads.

It is important to first define the differences between processes and threads. Threads are different than processes in that they share state, memory, and resources. This simple difference is both a strength and a weakness for threads. On one hand, threads are lightweight and easy to communicate with, but on the other hand, they bring up a whole host of problems including
© Copyright IBM Corporation 2008 Practical threaded programming with Python Trademarks Page 1 of 10

py Thread-1 says Hello World at time: 2008-05-13 13:22:50.developerWorks® ibm. While these options exist. it would be considered a best practice to inherit from threading. we will start with a simple "Hello World" example: hello_threads_example import threading import datetime class ThreadClass(threading. The threading module does provide many synchronization primatives. Fortunately. including semaphores. events. and make threaded programming considerably safer. For most situations. If you look at the actual code.getName().Thread and because of this.start() is what actually starts the threads. you get the following output: # python hello_threads. as it creates a very natural API for threaded programming. condition variables. threading can be complicated when threads need to share data or resources.252576 Looking at this output.datetime. threading in Python is much less complex to implement than in other languages.Thread): def run(self): now = datetime. one imports the datetime module and the other imports the threading module.com/developerWorks/ deadlocks.5 or greater installed. as many examples will be using newer features of the Python language that only appear in at least Python2. there are two import statements. The class ThreadClass inherits from threading. as they effectively funnel all access to a resource to a single thread. and locks. If you notice. Hello Python threads To follow along. it is considered a best practice to instead concentrate on using queues. and sheer complexity. To get started with threads in Python. The threading module was designed with inheritance in mind. Practical threaded programming with Python Page 2 of 10 . and was actually built on top of a lower-level thread module. The last three lines of code actually call the class and start the threads. Using queues with threads As I referred to earlier. you need to define a run method that executes the code you run inside of the thread. I assume that you have Python 2. due to both the GIL and the queuing module. Queues are much easier to deal with.5.Thread. now) for i in range(2): t = ThreadClass() t.252069 Thread-2 says Hello World at time: 2008-05-13 13:22:50. you can see that you received a Hello World statement from two threads with date stamps.now() print "%s says Hello World at time: %s" % (self. and allow a cleaner and more readible design pattern. race conditions.start() If you run this example.getName() is a method that will identify the name of the thread. t. The only thing of importance to note in the run method is that self.

com".start) When you run this.com"] queue = Queue. it would take approximately 50 seconds.Queue() class ThreadUrl(threading. given the current average.com".queue. and print out the first 1024 bytes of the page.com"] start = time.com/developerWorks/ developerWorks® In the next example. you get a lot of output to standard out. "http://amazon. you create a start time value by calling time.queue = queue def run(self): while True: #grabs host from queue host = self. you will first create a program that will serially. "http://apple. or one after the other. the result of "Two and a half seconds" isn't horrible. This is a classic example of something that could be done quicker using threads.time().__init__(self) self. "http://ibm.com". Second. First.ibm. queue): threading. First.com". "http://amazon. "http://google.com".urlopen(host) print url. grab a URL of a website. let's use the urllib2 module to grab these pages one at a time. You import only two modules. the urllib2 module is what does the heavy lifting and grabs the Web pages. "http://ibm. Finally.com". and time the code: URL fetch serial import urllib2 import time hosts = ["http://yahoo. but if you had hundreds of Web pages to retrieve. Look at how creating a threaded version speeds things up: URL fetch threaded #!/usr/bin/env python import Queue import threading import urllib2 import time hosts = ["http://yahoo.40353488922 Let's look a little at this code.com".get() Practical threaded programming with Python Page 3 of 10 . as the pages are being partially printed. "http://google. But you get this at the finish: Elapsed Time: 2.time() #grabs urls of hosts and prints first 1024 bytes of page for host in hosts: url = urllib2. "http://apple. and then call it again and subtract the initial value to determine how long the program takes to execute.Thread.time() .Thread): """Threaded Url Grab""" def __init__(self.com". in looking at the speed of the program.read(1024) print "Elapsed Time: %s" % (time.

task_done() start = time. After the work is done. This creates a simple way to control the flow of the program. When the count of unfinished tasks drops to zero.start) This example has a bit more code to explain. Just a note about this pattern: By setting daemonic threads to true.setDaemon(True) t. to exit if only daemonic threads are alive.join() main() print "Elapsed Time: %s" % (time.Queue() and then populate it with data. and then exit the main program. because you can then join on the queue. 3. thanks to the use of the queuing module. Practical threaded programming with Python Page 4 of 10 .read(1024) #signals to queue job is done self. Create an instance of Queue. send a signal to the queue with queue. before exiting. Join on the queue. but it isn't that much more complicated than the first threading example. join() unblocks. 5.start() #populate queue with data for host in hosts: queue. 6. The exact process is best described in the documentation for the queue module. The count goes down whenever a consumer thread calls task_done() to indicate that the item was retrieved and all work on it is complete. 4. Pull one item out of the queue at a time. Spawn a pool of daemon threads. or wait until the queue is empty. 2.queue. and use that data inside of the thread.put(host) #wait on the queue until everything has been processed queue. The steps are described as follows: 1. it allows the main thread. the run method.Thread. or program.time() def main(): #spawn a pool of threads. as seen in the Resources: join() "Blocks until all items in the queue have been gotten and processed.urlopen(host) print url. Pass that instance of populated data into the threading class that you created from inheriting from threading. and pass them queue instance for i in range(5): t = ThreadUrl(queue) t. to do the work.developerWorks® ibm.com/developerWorks/ #grabs urls of hosts and prints first 1024 bytes of page url = urllib2.time() .task_done() that the task has been completed. which really means to wait until the queue is empty. The count of unfinished tasks goes up whenever an item is added to the queue. This pattern is a very common and recommended way to use threads with Python.

out_queue = out_queue def run(self): while True: #grabs host from queue chunk = self.queue.get() #parse the chunk soup = BeautifulSoup(chunk) print soup.Thread.__init__(self) self. and then places it into another queue.com".queue. with this module. "http://ibm.Thread. it is relatively simple to extend it by chaining additional thread pools with queues.Thread): """Threaded Url Grab""" def __init__(self.__init__(self) self. "http://amazon.read() #place chunk into out queue self. Then set up another pool of threads that join on the second queue. Multiple queues data mining websites import Queue import threading import urllib2 import time from BeautifulSoup import BeautifulSoup hosts = ["http://yahoo.com/developerWorks/ developerWorks® Working with multiple queues Because the pattern demonstrated above is so effective. out_queue): threading.put(chunk) #signals to queue job is done self. you simply printed out the first portion of a Web page. In the above example.out_queue = out_queue def run(self): while True: #grabs host from queue host = self.out_queue.ibm.findAll(['title']) Practical threaded programming with Python Page 5 of 10 . queue.urlopen(host) chunk = url. "http://apple. This next example instead returns the whole Web page that each thread grabs.com"] queue = Queue. Using just a couple of lines of code.out_queue. out_queue): threading.com".com".Queue() class ThreadUrl(threading. "http://google.Thread): """Threaded Url Grab""" def __init__(self.queue = queue self.task_done() class DatamineThread(threading. you will extract the title tag and print it out for each page you visit. and then do work on the Web page.get() #grabs urls of hosts and then grabs chunk of webpage url = urllib2.com". The work performed in this example involves parsing the Web page using a third-party Python module called Beautiful Soup.Queue() out_queue = Queue.

and to promote readable code. Computers. In this case.join() out_queue. One idea is to extract the links from each page using Beautiful Soup and then follow them.put(host) for i in range(5): dt = DatamineThread(out_queue) dt.setDaemon(True) dt.developerWorks® #signals to queue job is done self. In the run method of this class. chunk.start) ibm. and then passed that queue into the first thread pool class.join() main() print "Elapsed Time: %s" % (time.py [<title>Google</title>] [<title>Yahoo!</title>] [<title>Apple</title>] [<title>IBM United States</title>] [<title>Amazon. Practical threaded programming with Python Page 6 of 10 . you almost copy the exact same structure for the next thread pool class. In the final section. grab the Web page. Next . DVDs & more</title>] Elapsed Time: 3. you use Beautiful Soup to simply extract the title tags from each page and print them out.time() .time() def main(): #spawn a pool of threads. ThreadURL.task_done() start = time. from off of the queue in each thread.setDaemon(True) t. Apparel.75387597084 In looking at the code.start() #wait on the queue until everything has been processed queue. Books. out_queue) t.com/developerWorks/ If you run this version of the script. and then process this chunk with Beautiful Soup.start() #populate queue with data for host in hosts: queue. as you have the core for a basic search engine or data mining tool.com: Online Shopping for Electronics.out_queue. you began to explore creating a more complex processing pipeline that can serve as a model for future projects. and pass them queue instance for i in range(5): t = ThreadUrl(queue. it can be used to solve a wide number of problems by chaining queues and thread pools together. you get the following output: # python url_fetch_threaded_part2. you can see that we added another instance of a queue. This example could quite easily be turned into something more useful. While this basic pattern is relatively simple. There are quite a few excellent resources on both concurrency in general and threads in the Resources section. Summary This article explored threads in Python and demonstrated the best practice of using queues to allieviate complexity and subtle errors. DatamineThread.

com/developerWorks/ developerWorks® In closing. and that processes can be quite suitable for many situations.ibm. Please consult the Resources section for the official documentation on this. The standard library subprocess module in particular can be much simpler to deal with if you only require forking many processes and listening for a response. Practical threaded programming with Python Page 7 of 10 . it is important to point out that threads are not the solution to every problem.

developerWorks® ibm.zip Size 24KB Practical threaded programming with Python Page 8 of 10 .com/developerWorks/ Downloads Description Sample threading code for this article Name threading_code.

technical forum • AIX for Developers Forum • Cluster Systems Management • IBM Support Assistant • Performance Tools -.ibm. connect to their input/output/ error pipes. • Podcasts: Tune in and catch up with IBM technical experts.technical • More AIX and UNIX forums Practical threaded programming with Python Page 9 of 10 .com/developerWorks/ developerWorks® Resources Learn • This Thread Modulemodule provides low-level primitives for working with multiple threads. • See how Wikipedia defines a Thread. • This Threading Module constructs higher-level threading interfaces on top of the lower level thread module. and obtain their return codes • The AIX and UNIX developerWorks zone provides a wealth of information relating to all aspects of IBM® AIX® systems administration and expanding your UNIX skills. Get products and technologies • IBM trial software: Build your next development project with software for download directly from developerWorks.technical • Virtualization -. • The Subprocess Module allows you to spawn new processes. • developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts. • GIL and Threading State • Learn about Threading the Global Interpreter Lock. • See how the turn Toward Concurrency in SoftwareThe Free Lunch Is Over • Queue Module • Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. • New to AIX and UNIX? Visit the New to AIX and UNIX page to learn more. Discuss • Participate in the AIX and UNIX forums: • AIX 5L -. • Concurrency and Python • The Asyncore module provides the basic infrastructure for writing asynchronous socket service clients and servers. • The PMOTW Threading module lets you run multiple operations concurrently in the same process space.

In his free time he enjoys spending time with his wife Leah. Caltech.shtml) Trademarks (www.com.pyatl. Sony Imageworks. and his personal website is www. speaker. Disney Feature Animation.org. He has a Master's degree in CIS from Cal State Los Angeles. He is an author. playing the piano. consultant.noahgift.com/developerworks/ibm/trademarks/) Practical threaded programming with Python Page 10 of 10 . is an Apple and LPI certified SysAdmin.com/developerWorks/ About the author Noah Gift Noah Gift is the co-author of "Python For Unix and Linux" by O'Reilly. and MacTech. Noah is also the current organizer for www. and Turner Studios. Red Hat Magazine. O'Reilly. and exercising religiously.com/legal/copytrade. © Copyright IBM Corporation 2008 (www.ibm. His consulting company's website is www. in Nutritional Science from Cal Poly San Luis Obispo. and community leader.S. writing for publications such as IBM developerWorks.com. B. and their son Liam. which is the Python User Group for Atlanta.giftcs. GA. and has worked at companies such as.developerWorks® ibm.ibm.