
Martin Rubli

Building a Webcam Infrastructure for GNU/Linux

Master Thesis EPFL, Switzerland 2006

Prof. Matthias Grossglauser, Laboratory for Computer Communications and Applications, EPFL

Richard Nicolet, Logitech
Remy Zimmermann, Logitech


Martin Rubli
School of Computer and Communication Sciences, Swiss Federal Institute of Technology, Lausanne, Switzerland Logitech, Fremont, California

Revision a. All trademarks used are properties of their respective owners.

This document was set in Meridien LT and Frutiger using the LaTeX typesetting system on Debian GNU/Linux.

Abstract

In this thesis we analyze the current state of webcam support on the GNU/Linux platform. Based on the results of that analysis we develop a framework of new software components and improve the current platform with the goal of enhancing the user experience of webcam owners. Along the way we gain close insight into the components involved in streaming video from a webcam and into what today's hardware is capable of doing.

Contents

1 Introduction
2 Current state of webcam hardware
  2.1 Introduction
  2.2 Terminology
  2.3 Logitech webcams
    2.3.1 History
    2.3.2 Cameras using proprietary protocols
    2.3.3 USB Video Class cameras
  2.4 USB Video Class
    2.4.1 Introduction
    2.4.2 Device descriptor
    2.4.3 Device topology
    2.4.4 Controls
    2.4.5 Payload formats
    2.4.6 Transfer modes
  2.5 Non-Logitech cameras
3 An introduction to Linux multimedia
  3.1 Introduction
  3.2 Linux kernel multimedia support
    3.2.1 A brief history of Video4Linux
    3.2.2 Linux audio support
  3.3 Linux user mode multimedia support
    3.3.1 GStreamer
    3.3.2 NMM
  3.4 Current discussion


4 Current state of Linux webcam support
  4.1 Introduction
    4.1.1 Webcams and audio
  4.2 V4L2: Video for Linux Two
    4.2.1 Overview
    4.2.2 The API


    4.2.3 Summary
  4.3 Drivers
    4.3.1 The Philips USB Webcam driver
    4.3.2 The Spca5xx Webcam driver
    4.3.3 The QuickCam Messenger & Communicate driver
    4.3.4 The QuickCam Express driver
    4.3.5 The Linux USB Video Class driver
  4.4 Applications
    4.4.1 V4L2 applications
    4.4.2 V4L applications
    4.4.3 GStreamer applications
  4.5 Problems and design issues
    4.5.1 Kernel mode vs. user mode
    4.5.2 The Video4Linux user mode library
    4.5.3 V4L2 related problems
5 Designing the webcam infrastructure
  5.1 Introduction
  5.2 Goals
  5.3 Architecture overview
  5.4 Components
    5.4.1 Overview
    5.4.2 UVC driver
    5.4.3 V4L2
    5.4.4 GStreamer
    5.4.5 v4l2src
    5.4.6 lvfilter
    5.4.7 LVGstCap (part 1 of 3: video streaming)
    5.4.8 libwebcam
    5.4.9 libwebcampanel
    5.4.10 LVGstCap (part 2 of 3: camera controls)
    5.4.11 liblumvp
    5.4.12 LVGstCap (part 3 of 3: feature controls)
    5.4.13 lvcmdpanel
  5.5 Flashback: current problems
6 Enhancing existing components
  6.1 Linux UVC driver
    6.1.1 Multiple open
    6.1.2 UVC extension support
    6.1.3 V4L2 controls in sysfs
  6.2 Video4Linux
  6.3 GStreamer
  6.4 Bits and pieces



7 New components
  7.1 libwebcam
    7.1.1 Enumeration functions
    7.1.2 Thread-safety
  7.2 liblumvp and lvfilter
  7.3 libwebcampanel
    7.3.1 Meta information
    7.3.2 Feature controls
  7.4 Build system
  7.5 Limitations
    7.5.1 UVC driver
    7.5.2 Linux webcam framework
  7.6 Outlook
  7.7 Licensing
    7.7.1 Libraries
    7.7.2 Applications
  7.8 Distribution


8 The new webcam infrastructure at work
  8.1 LVGstCap
  8.2 lvcmdpanel
9 Conclusion
A List of Logitech webcam USB PIDs


Chapter 1

Introduction

Getting a webcam to work on Linux is a challenge on different levels. Making the system recognize the device properly sets the bar to a level that many users feel unable to cross, often out of a mostly unsubstantiated fear of compiling kernel drivers. Even once that first hurdle is cleared, the adventure has only just started. A webcam is perfectly useless without good software that takes advantage of its features, so where do users go from here?

Since the first webcams appeared on the market, they have evolved from simple devices that captured relatively poor quality videos the size of a postage stamp to high-tech devices that allow screen-filling videos to be recorded, all while applying complex real-time video processing in hardware and software.

Traditionally, Linux has been used for server installations and only in recent years has it started to conquer the desktop. This fact still shows in the form of two important differences when one compares webcam support on Linux and Windows. For one, Linux applications have primarily focused on retrieving still images from the cameras, oftentimes for "live" cameras on the Internet that update a static picture every few seconds. These programs often work in a headless environment, i.e. one that does not require a graphical user interface and a physical screen. For another, webcam manufacturers have provided little support for the Linux platform, most of which was in the form of giving technical information to the open source community without taking the opportunity to actively participate and influence the direction that webcam software takes.

This project is an attempt by Logitech to change this in order to provide Linux users with an improved webcam experience that eventually converges towards the one that Windows users enjoy today. Obviously, the timeline of such an undertaking is on the order of years due to the sheer number of components and people involved. Luckily, the scope of a Master thesis is enough to lay the foundations that are required, not only of a technical nature but also in terms of establishing discussions between the parties involved. In the course of this project, apart from presenting the newly developed

framework, we will look at many of the components that already exist today, highlighting their strengths but also their weaknesses. It was this extensive analysis that eventually led to the design of the proposed framework, in an attempt to learn from previous mistakes and raise awareness of current limitations. The latter is especially important for a platform that has to keep up with powerful and agile competitors. The foundations we laid with the Linux webcam framework make it easier for developers to base their products on a common core, which reduces development time, increases stability, and makes applications easier to maintain. All of these are key to establishing a successful multimedia platform and delivering users the experience they expect from an operating system that has officially set out to conquer the desktop.

I would like to thank first of all my supervisors at Logitech, Richard Nicolet and Remy Zimmermann, for their advice and the expertise they shared with me, but also the rest of the video driver and firmware team for their big help with various questions that kept coming up. Thanks also to Matthias Grossglauser, my supervisor at EPFL, for his guidance. A big thank you to the people in the open source community I got to work with or ask questions of. In particular this goes to Laurent Pinchart, the author of the Linux UVC driver, first of all for having written the driver, thereby letting me concentrate on the higher-level components, and second of all for the constructive collaboration in extending it. Last but not least, thanks to everybody who helped make this project happen in one way or another but whose name did not make it into this section.

Fremont, USA, September 2006

Chapter 2

Current state of webcam hardware

2.1 Introduction

The goal of this chapter is to give an overview of the webcams that are currently on the market. We will first focus on Logitech devices and devote a small section to cameras of other vendors later on. We will also give an overview of the USB Video Class, or simply UVC, specification, which is the designated standard for all future USB camera devices. The Linux webcam framework was designed primarily with UVC devices in mind and the main goal of this chapter is to present the hardware requirements of the framework. Therefore, the majority of the chapter is dedicated to UVC cameras, as devices using proprietary protocols are slowly being phased out by the manufacturers. We will nevertheless mention the most important past generations of webcams because some of them remain in broad use and it will be interesting to see how they differ in functionality.



2.2 Terminology

There are a few terms that will keep coming up in the rest of the report. Let us quickly go over some of them to avoid any terminology related confusion.

USB modes
In the context of USB we will often use the terms high-speed to denote USB 2.0 operation and full-speed for the USB 1.x case. There also exists a mode called low-speed that was designed for very low bandwidth devices like keyboards or mice. For webcams, low-speed is irrelevant.

Image resolutions
There is a number of standard resolutions that have corresponding acronyms. We will sometimes use these acronyms for readability's

sake. Table 2.1 has a list of the most common ones.

Width [px]  Height [px]  Acronym
160         120          QSIF
176         144          QCIF
320         240          QVGA (also SIF)
352         288          CIF
640         480          VGA
1024        768          XGA
1280        960          SXGA (4:3)
1280        1024         SXGA (5:4)

Table 2.1: List of standard resolutions and commonly used acronyms.


2.3 Logitech webcams

2.3.1 History

In the last years the market has seen a myriad of different webcam models and technologies. The first webcams were devices for the parallel port, allowing very limited bandwidth and a user experience that was far from the plug-and-play that users take for granted nowadays. With the advent of the Universal Serial Bus, webcams finally became comfortable and simple enough to use for the average PC user. Driver installation became simple and multiple devices could share the bus. Using a printer and a webcam at the same time was no longer a problem. One of the limitations of USB, however, was its still relatively low bandwidth: image resolutions above 320x240 pixels required compression algorithms that could send VGA images over the bus at tolerable frame rates. Higher resolution video at 25 or more frames per second only became possible when USB 2.0 was introduced. A maximum theoretical transfer rate of 480 Mb/s provides enough reserves for the next generations of webcams with multi-megapixel sensors. All recent Logitech cameras take advantage of USB 2.0, although they still work on USB 1.x controllers, albeit with a limited resolution set.
1 For some of the acronyms there exist different resolutions depending on the analog video standard they were derived from. For example, 352x288 is the PAL version of CIF whereas NTSC CIF is 352x240.


2.3.2 Cameras using proprietary protocols

From a driver point of view Logitech cameras are best distinguished by the ASIC2 they are based on. While the sensors are also an important component that the driver has to know about, such knowledge becomes less important because the firmware hides sensor-specific commands from the USB interface. In the case of UVC cameras, even the ASIC is completely abstracted by the protocol and, in the optimal case, every UVC camera works with any UVC driver, at least as far as the functionality covered by the standard is concerned. The following list shows a number of Logitech's non-UVC cameras and is therefore grouped by the ASIC family they use. We will see in chapter 4 that this categorization is useful when it comes to selecting a driver.

Vimicro 30x based
Cameras with the Vimicro 301 or 302 chips are USB 1.1 devices, in the case of the 302 with built-in audio support. They support a maximum resolution of VGA at 15 frames per second. Apart from uncompressed YUV data, they can also deliver uncompressed 8 or 9-bit RGB Bayer data or, with the help of an integrated encoder chip, JPEG frames.

- Logitech QuickCam IM
- Logitech QuickCam Connect
- Logitech QuickCam Chat
- Logitech QuickCam Messenger
- Logitech QuickCam for Notebooks
- Logitech QuickCam for Notebooks Deluxe
- Logitech QuickCam Communicate STX
- Labtec Webcam Plus
- Labtec Notebook Pro

Philips SAA8116 based
The Philips SAA8116 is also a USB 1.1 chipset that supports VGA at a maximum of 15 fps. It has built-in microphone support and delivers image data in 8, 9, or 10-bit RGB Bayer format. It can also use a proprietary YUV compression format that we will encounter again in section 4.3.1 where we talk about the Linux driver for cameras based on this chip.

- Logitech QuickCam Zoom
- Logitech QuickCam Pro 3000
- Logitech QuickCam Pro 4000
- Logitech QuickCam Orbit/Sphere3
2 The application-specific integrated circuit in a webcam is the processor designed to process the image data and communicate it to the host. 3 There also exists a model of this camera that does not use Philips ASICs but the SPCA525 described below. This model has a different USB identifier, as can be seen in the table in appendix A.

- Logitech QuickCam Pro for Notebooks
- Logitech ViewPort AV100
- Cisco VT Camera

Sunplus SPCA561 based
The Sunplus SPCA561 is a low-end USB 1.1 chipset that only supports the CIF format at up to 15 fps. The following is a list of cameras that are based on this chip:

- Logitech QuickCam Chat
- Logitech QuickCam Express
- Logitech QuickCam for Notebooks
- Labtec Webcam
- Labtec Webcam Plus


2.3.3 USB Video Class cameras

Logitech was the first webcam manufacturer to offer products that use the USB Video Class protocol, although this transition was done in two steps. It started with a first set of cameras containing the Sunplus SPCA525 chip, which supports both a proprietary protocol and the UVC standard. The USB descriptors of these cameras still announce the camera as a so-called vendor class device. This conservative approach was due to the fact that the first models did not pass all the tests required to qualify as UVC devices. As we will see later on when we talk about the Linux UVC driver in more detail, the UVC support of these cameras is still fairly complete, which is why the driver simply overrides the device class and treats them as ordinary UVC devices. The following is a complete list of these devices:

- Logitech QuickCam Fusion
- Logitech QuickCam Orbit MP/Sphere MP
- Logitech QuickCam Pro 5000
- Logitech QuickCam for Notebooks Pro
- Logitech QuickCam for Dell Notebooks (built-in camera for notebooks)
- Acer OrbiCam (built-in camera for notebooks)
- Cisco VT Camera II

Figure 2.1 shows product photos of some of these cameras. All SPCA525 based cameras are USB 2.0 compliant and include an audio chip. They support VGA at 30 fps and, depending on the sensor used, higher resolutions up to 1.3 megapixels at lower frame rates. To reduce the traffic on the bus they feature a built-in JPEG encoder to support streaming of MJPEG data in addition to uncompressed YUV.

Figure 2.1: The first Logitech webcams with UVC support: (a) QuickCam Fusion, (b) QuickCam Orbit MP, (c) QuickCam Pro 5000, (d) QuickCam for Notebooks Pro.

The next generation of Logitech webcams, scheduled for the second half of 2006, consists of pure UVC-compliant cameras. Among those are the QuickCam Ultra Vision and the 2006 model of the QuickCam Fusion.

Figure 2.2: The first pure Logitech UVC webcam: QuickCam UltraVision

All of these new cameras are supported by the Linux UVC driver and are automatically recognized because their USB descriptors mark them as USB Video Class devices, therefore eliminating the need to hardcode their product identifiers in the software.


2.4 USB Video Class


2.4.1 Introduction

We have already quickly mentioned the concept of USB device classes. Each device can either classify itself as a custom, vendor-specific device or as belonging to one of the different device classes that the USB forum has defined. There exist many device classes, with some of the best known being mass storage, HID (Human Interface Devices), printers, and audio devices. If an operating system comes with a USB class driver for a given device class, it can take advantage of most or all of the device's features without requiring the installation of a specific driver, hence greatly adding to the user's plug-and-play experience. The USB Video Class standard follows the same strategy, supporting video devices such as digital camcorders, television tuners, and webcams. It supports a variety of features that cover the most frequent use cases while allowing device manufacturers to add their own extensions. The remainder of this section gives the reader a short introduction to some of the key concepts of UVC. We will only cover what is important to understand the scope of this report and refer the interested reader to [6] for the technical details.


2.4.2 Device descriptor

USB devices are self-descriptive to a large degree, exporting all information necessary for a driver to make the device work in a so-called descriptor. While the USB standard imposes a few ground rules on what the descriptor must contain and on the format of that data, different device classes build their own class-specific descriptors on top of these. The UVC descriptor contains such information as the list of video standards, resolutions, and frame rates supported by the device, as well as a description of all the entities that the device defines. The host can retrieve all the information it needs from these descriptors and make the device's features available to applications.


2.4.3 Device topology

The functionality of UVC devices is divided up into two different entities: units and terminals. Terminals are data sources or data sinks with typical examples being a CCD sensor or a USB endpoint. Terminals only have a single pin through which they can be connected to other entities. Units, on the other hand, are intermediate entities that have at least one input and one output pin. They can be used to select one of many inputs (selector unit) or to control image attributes (processing unit). There is a special type of unit that we will talk most about in this report, the extension unit. Extension units are the means through which vendors can add features to their devices that the UVC standard does not specify. To do anything useful with the functionality that extension units provide, the host driver or application must have additional knowledge about the device because while the extension units themselves are self-descriptive, the controls they contain are not. We shall see the implications of this fact later on when we discuss the Linux UVC driver. When the driver initializes the device, it enumerates its entities and builds a graph with two terminal nodes, an input and an output terminal, and one or multiple units in between.



2.4.4 Controls

Both units and terminals contain sets of so-called controls through which a wide range of camera settings can be changed or retrieved. Table 2.2 lists a few typical examples of such controls, grouped by the entities they belong to. Note that the controls in the third column are not specified by the standard but are instead taken from the list of extension controls that the current Logitech UVC webcams provide.

Camera terminal                Processing unit         Extension units
Exposure time                  Backlight compensation  Pan/tilt reset
Lens focus                     Brightness              Firmware version
Zoom                           Contrast                LED state
Motor control (pan/tilt/roll)  Hue                     Pixel defect correction
                               Saturation
                               White balance

Table 2.2: A selection of UVC terminal and unit controls. The controls in the first two columns are defined in the standard; the availability and definition of the controls in the last column depend on the camera model.


2.4.5 Payload formats

The UVC standard defines a number of different formats for the streaming data that is to be transferred from the device to the host, such as DV, MPEG-2, MJPEG, or uncompressed. Each of these formats has its own adapted header format that the driver needs to be able to parse and process correctly. MJPEG and uncompressed are the only formats used by today's Logitech webcams and they are also currently the only ones understood by the Linux UVC driver.


2.4.6 Transfer modes

UVC devices have the choice between using bulk and isochronous data transfers. Bulk transfers guarantee that all data arrives without loss but make no similar guarantees as to bandwidth or latency. They are commonly used in file transfers where reliability is more important than speed. Isochronous transfers are used when a minimum speed is required but the loss of certain packets is tolerable. Most webcams use isochronous transfers because it is more acceptable to drop a frame than to transmit and display frames with a delay. In the case of a lost frame, the driver can simply repeat the previous frame, something that is barely noticeable by the user, whereas delayed frames are usually considered more disruptive to a video conversation.


2.5 Non-Logitech cameras

Creative WebCam
Creative has a number of webcams that work on Linux, most of them with the SPCA5xx driver. A list of supported devices can be found on the developers' website [23]. Creative also has a collection of links to drivers that work with some of their older camera models [3].


Microsoft LifeCam
In summer 2006 Microsoft entered the webcam market with two new products, the LifeCam VX-3000 and VX-6000 models. Neither of them is currently supported on Linux because they use a proprietary protocol. Further models are scheduled but none of them are reported to be UVC compliant at this time.


Chapter 3

An introduction to Linux multimedia

3.1 Introduction

This chapter gives an overview of the current state of multimedia support on GNU/Linux. We shall first look at the history of the components involved and then proceed to the more technical details. At the end of this chapter the reader should have an overview of the different multimedia components available on Linux and how they work together.


3.2 Linux kernel multimedia support

3.2.1 A brief history of Video4Linux

Video devices were available long before webcams became popular. TV tuner cards formed the first category of devices to spur the development of a multimedia framework for Linux. In 1996 a series of drivers targeted at the popular BrookTree Bt848 chipset that was used in many TV cards made it into the 2.0 kernel under the name of bttv. The driver evolved quickly to include support for radio tuners and other chipsets. Eventually, more drivers started to show up, among others the first webcam driver for the Connectix QuickCam. The next stable kernel version, Linux 2.2, was released in 1999 and included a multimedia framework called Video4Linux, or V4L for short, that provided a common API for the available video drivers. It must be said that the name is somewhat misleading in the sense that Video4Linux not only supports video devices but a whole range of related functions like radio tuners or teletext decoders. With V4L being criticized as too inflexible, work on a successor had started as early as 1998 and, after four years, was merged into version 2.5 of the official Linux kernel development tree. When version 2.6 of the kernel was released, it was the first version of Linux to officially include Video for Linux Two, or simply V4L2. Backports of V4L2 to earlier kernel versions, in particular 2.4, were developed and are still being used today. V4L and V4L2 coexisted for a long time in the Linux 2.6 series but as of July 2006 the old V4L1 API was officially deprecated and removed from the kernel. This leaves Video4Linux 2 as the sole kernel subsystem for video processing on current Linux versions.


3.2.2 Linux audio support

Linux has traditionally separated audio and video support. For one thing, audio has been around much longer than video has, and for another both subsystems have followed a rather strict separation of concerns. Even though they were developed by different teams at different times, their history is marked by somewhat similar events.

Open Sound System
The Open Sound System, or simply OSS, was originally developed not only for the Linux operating system but for a number of different Unix derivatives. While successful for a long time, its rather simple architecture suffers from a number of problems, the most serious of which, to the average user, is the inability to share a sound device between different applications. As an example, it is not possible to hear system notification sounds while an audio application is playing music in the background. The first application to claim the device blocks it for all other applications. Together with a number of non-technical reasons this eventually led to the development of ALSA, the Advanced Linux Sound Architecture.

Advanced Linux Sound Architecture
Starting with Linux 2.6, ALSA became the standard Linux sound subsystem, although OSS is still available as a deprecated option. The reason for this is the lack of ALSA audio drivers for some older sound devices. Thanks to features like allowing devices to be shared among applications, most new applications come with ALSA support built in and many existing applications are making the conversion from older audio frameworks.


3.3 Linux user mode multimedia support

The Linux kernel community tries to move as many components as possible into user space. On the one hand this approach brings a number of advantages like easier debugging, faster development, and increased stability. On the other hand, user space solutions can suffer from problems such as reduced flexibility, the lack of transparency, or lower performance due to increased overhead. Nevertheless the gains seem to outweigh the drawbacks, which is why a lot of effort has gone into the development of user space multimedia frameworks. Depending on the point of view, the fact that there is a variety of such frameworks available can be seen as a positive or negative outcome of this trend. The lack of a single common multimedia framework undoubtedly makes it more difficult for application developers to pick a basis for their software. The available choices range from simple media decoding libraries to fully grown network-oriented and pipeline-based frameworks. For the rest of this section we will present two of what we consider the most promising frameworks available today, GStreamer and NMM. The latter is still relatively young and therefore not as widespread as GStreamer, which has found its way into all current Linux distributions, albeit not always in its latest and most complete version. Both projects are available under open source licenses (LGPL and LGPL/GPL combined, respectively).

1 Note the variety in spelling. Depending on the author and the context, Video for Linux Two is also referred to as Video4Linux 2 or just Video4Linux.



3.3.1 GStreamer

GStreamer can be thought of as a rather generic multimedia layer that provides solid support for pipeline-centric primitives such as elements, pads, and buffers. It bears some resemblance to Microsoft DirectShow, which has been the center of Windows multimedia technology for many years now. The GStreamer architecture is strongly plugin-based, i.e. the core library provides basic functions like capability negotiation, routing facilities, and synchronization, while all input, processing, and output is handled by plugins that are loaded on the fly. Each plugin has an arbitrary number of so-called pads. Two elements can be linked by their pads, with the data flowing from the source pad to the sink pad. A typical pipeline consists of one or more sources that are connected via multiple processing elements to one or more sinks. Figure 3.1 shows a very simple example.

Figure 3.1: A simple GStreamer pipeline that plays an MP3 audio file on the default ALSA sink. The mad plugin decodes the MP3 data that it receives from the file source and sends the raw audio data to the ALSA sink.

Table 3.1 lists a few plugins for each category. Source elements are characterized by the fact that they only have source pads, sink elements only have sink pads, and processing elements have at least one of each.

Sources     Processing      Sinks
filesrc     audioresample   udpsink
alsasrc     identity        alsasink
v4l2src     videoflip       xvimagesink

Table 3.1: An arbitrary selection of GStreamer source, processing, and sink plugins.



NMM

NMM stands for Network-Integrated Multimedia Middleware and, as the name already suggests, it tightly integrates network resources into the process. By doing so NMM sets a counterpoint to most other multimedia frameworks, which take a machine-centric approach where input, processing, and output usually all happen on the same machine. Let us look at two common examples of how today's multimedia software interacts with the network:

1. Playback of a file residing on a file server in the network

2. Playback of an on-demand audio or video stream coming from the network

1. Playback of a network file
From the point of view of a player application, this is the easiest case because it is almost entirely transparent to the application. The main requirement is that the underlying layers (operating system or desktop environment) know how to make network resources available to their applications in a manner that resembles access to local resources as closely as possible. There are different ways this can be realized, e.g. in kernel mode or user mode, but all of these are classified under the name of a virtual file system. As an example, an application can simply open a file path such as \\192.168.0.10\media\clip.avi (UNC path for a Windows file server resource) or sftp:// (generic URL for a secure FTP resource as used by many Linux environments). The underlying layers make sure that all the usual input/output functions work the same on these files as on local files. So apart from supporting the syntax of such network paths, the burden is not on the application writer.


2. Playback of an on-demand stream
Playing back on-demand multimedia streams has been made popular by applications such as RealPlayer or Windows Media Player. The applications communicate with a streaming server via partially proprietary protocols based on UDP or TCP. The burden of flow control, loss detection, and loss recovery lies entirely on the application's shoulders. Apart from that, the client plays a rather passive role by just processing the received data locally and exercising relatively little control over the provided data flow. It is usually limited to starting or stopping the stream and jumping to a particular location within the stream. In particular, the application has no way of actively controlling remote devices, e.g. the zoom factor of the camera from which the video stream originates. Note how there is no transparency from the point of view of the streaming client. It requires deep knowledge of different network layers and protocols, which strongly reduces platform independence and interoperability.

NMM tries to escape this machine-centric view by providing an infrastructure that makes the entire network topology transparent to applications using the framework. The elements of the flow graph can be distributed within a network without requiring the application to be aware of this fact. This allows applications to access remote hardware as if it were plugged into the local computer. They can change channels on a remote TV tuner card or control the zoom level of a digital camera connected to a remote machine. The NMM framework abstracts all these controls and builds communication channels that reliably transmit data between the involved machines. The website of the NMM project [16] lists a number of impressive examples of the software's capabilities. One of them can be seen in figure 3.2. The photo is from an article that describes the setup of a video wall in detail [13].


3.4 Current discussion

Over the years many video device drivers have been developed by many different people. Each one of these developers had their own vision of what a driver should or should not do. While the V4L2 API specifies the syntax and semantics of the function calls that drivers have to implement, it does not provide much help in terms of higher-level guidance, therefore leaving room for interpretation. The classic example where different people have different opinions is the case of video formats and whether V4L2 drivers should include support for format conversion. Some devices provide uncompressed data streams whereas others offer compressed video data in addition to uncompressed formats. Not every application, however, may be able to process compressed data, which is why certain driver writers have included decompressor modules in their drivers. In the case of a decompressor-enabled driver, format conversion can occur transparently if an application asks for uncompressed data but the device provides only compressed data. This guarantees maximum compatibility and

Figure 3.2: Video wall based on NMM. It uses two laptop computers to display one half of a video each and a third system that renders the entire video.

allows applications to focus on their core business: processing or displaying video data. Other authors take the view that decompressor modules have no place in the kernel and base their opinion partly on ideological and partly on technical reasons, like the inability to use floating point mathematics in kernel space. Therefore, for an application to work with devices that provide compressed data, it has to supply its own decompressor module, possibly leading to code and bug duplication unless a common library is used to carry out such tasks. We will see the advantages and disadvantages of both approaches together with possible solutions (existing and non-existing) in more detail in the next chapter. What both sides have in common is the view that the main task of a multimedia framework is to abstract the device in a high-level manner, so that applications need as little a priori knowledge as possible of the nature, brand, and model of the device they are talking to.


Chapter 4

Current state of Linux webcam support

4.1 Introduction

In the previous chapter we saw a number of components involved in getting multimedia data from the device to the user's eyes and ears. This chapter will show how these components are linked together in order to support webcams. We will find out what exactly they do and don't do and what the interfaces between them look like. After this chapter readers should understand what is going on behind the scenes when a user opens his favorite webcam application, and they should have enough background to understand the necessity of the enhancements and additions that were part of this project.


Webcams and audio

With the advent of USB webcams vendors started including microphones in the devices. To the host system these webcams appear as two separate devices, one of them being the video part, the other being the microphone. The microphone adheres to the USB Audio Class standard and is available to every host that supplies a USB audio class driver. On Linux, this driver is called snd-usb-audio and exposes recognized device functions as ALSA devices. Due to the availability of the Linux USB audio class driver there was no particular need for us to concentrate on the audio part of current webcams as they work out of the box. For this reason, and the fact that Video4Linux does not (need to) know about the audio part of webcams, audio will only come up when it requires particular attention in the remainder of this report.



V4L2: Video for Linux Two

Video for Linux was briefly introduced in section 3.2.1, where we saw the evolution from the first video device drivers into what is today known as Video for Linux Two, or just V4L2. This section focuses on the technical aspects of this subsystem.



In a nutshell, V4L2 abstracts different video devices behind a common API that applications can use to retrieve video data without being aware of the particularities of the involved hardware. Figure 4.1 shows a schematic of the architecture.

Figure 4.1: Simplied view of the components involved when a V4L2 application displays video. The dashed arrows indicate that there are further operating system layers involved between the driver and the hardware. The gray box shows which components run in kernel space.

The full story is a little more complicated than that. For one thing, V4L2 supports not only video devices but also related subdevices like audio chips integrated on multimedia boards, teletext decoders, or remote control interfaces. The fact that these subdevices have relatively little in common makes the job of specifying a common API difficult. The following is a list of device types that are supported by V4L2 and, where available, a few examples:

- Video capture devices (TV tuners, DVB decoders, webcams)
- Video overlay devices (TV tuners)
- Raw and sliced VBI input devices (Teletext, EPG, and closed captioning decoders)
- Radio receivers (radio tuners integrated on some TV tuner cards)
- Video output devices


In addition, the V4L2 specification talks about codecs and effects, which are not real devices but virtual ones that can modify video data. However, support for these was never implemented, mostly due to disagreement about how they should be implemented, i.e. in user space or kernel space. The scope of this project covers merely the first category of the above list, video capture devices. Even though the API was originally designed with analog devices in mind, webcam drivers also fall into this category. It is also the category that has by far the greatest number of devices, drivers, and practical applications.



Due to its nature as a subsystem that communicates both with kernel space components and user space processes, V4L2 has two different interfaces, one for user space and one for kernel space.

The V4L2 user space API

Every application that wishes to use the services that V4L2 provides needs a way to communicate with the V4L2 subsystem. This communication is based on two basic mechanisms: file I/O and ioctls. Like most devices on Unix-like systems, V4L2 devices appear as so-called device nodes in a special tree within the file system. These device nodes can be read from and written to in a similar manner as ordinary files. Using the read and write system calls is one of two ways to exchange data between video devices and applications. The other one is the use of mapped memory, where kernel space buffers are mapped into an application's address space to eliminate the need to copy memory around, thereby increasing performance. Ioctls are a way for an application and a kernel space component to communicate data without the usual read and write system calls. While ioctls are not used to exchange large amounts of data, they are an ideal means to exchange control commands. In V4L2 everything that is not reading or writing of video data is accomplished through ioctls¹. The V4L2 API [5] defines more than 50 such ioctls, ranging from video format enumeration to stream control. The fact that the entire V4L2 API is based on these two relatively basic elements makes it quite simple. That simplicity does, however, come with a few caveats, as we will see later on when we discuss the shortcomings of the current Linux video architecture.

The V4L2 kernel interface

The user space API is only one half of the V4L2 subsystem. The other half consists of the driver interface that every driver that abstracts a device for V4L2 must implement.
¹ In the case of memory mapped communication, or mmap, even the readiness of buffers is communicated via ioctls.


Obviously kernel space does not know the same abstractions as user space, so in the case of the V4L2 kernel interface all exchange is done through standard function calls. When a V4L2 driver loads, it registers itself with the V4L2 subsystem and gives it a number of function addresses that are called whenever V4L2 needs something from the driver, usually in response to a user space ioctl or read/write system call. At each callback the driver carries out the requested action and returns a value indicating success or failure. The V4L2 kernel interface does not specify how drivers have to work internally because the devices that these drivers talk to are fundamentally different. While webcam drivers usually communicate with their webcams through the USB subsystem, other drivers find themselves accessing the PCI bus to which TV tuner cards are connected. Therefore, each driver depends on its own set of kernel subsystems. What makes them V4L2 drivers is the fact that they all implement a small number of V4L2 functions.



We have seen that the V4L2 subsystem itself is a rather thin layer that provides a standardized way through which video applications and video device drivers can communicate. Compared to other platforms, where the multimedia subsystems have many additional tasks like converting between formats and managing data flow, clocks, and pipelines, the V4L2 subsystem is rather low-level and focused on its core task: the exchange of video data and control commands.



This section presents four drivers that are in one way or another relevant to the Logitech QuickCam series of webcams. All of them are either V4L1 or V4L2 drivers and available as open source.


The Philips USB Webcam driver

The Philips USB Webcam Driver, or simply PWC, has a troubled history and has caused a lot of discussion and controversy in the Linux community. The original version of the driver was written by a developer known under the pseudonym Nemosoft as a project he did with the support of Philips. At the time there was no USB 2.0, so video compression had to be applied for video streams above a certain data rate. These compression algorithms were proprietary and Philips did not want to release them as open source. Therefore, the driver was split into two parts: the actual device driver (pwc), which supported the basic video modes that could be used without compression, and a decompressor module (pwcx) that attached to the driver and enabled the higher resolutions. Only the former was released in source code; the decompressor module remained available in binary form only. The pwc driver eventually made it into


the official kernel, but the pwcx module had to be downloaded and installed separately. In August 2004, the maintainer of the Linux kernel USB subsystem, Greg Kroah-Hartman, decided to remove the hook that allowed the pwcx module to hook into the video stream. The reason he gave was that the kernel is licensed under the GPL and such functionality is considered a violation of it. As a reaction, Nemosoft demanded that the pwc driver be removed entirely from the kernel because he felt that his work had been crippled and did not agree with the way the situation was handled by the kernel maintainers. Much of the history can be found in [1] and the links in the article. Only a few weeks later, Luc Saillard published a pure open source version of the driver after having reverse-engineered large parts of the original pwcx module. Ever since, the driver has been under continuous development and was even ported to V4L2. The driver works with many Philips-based webcams from different vendors, among them a number of Logitech cameras. The complete list of Logitech USB PIDs compatible with the PWC driver can be found in appendix A.


The Spca5xx Webcam driver

The name of the Spca5xx Webcam driver is a little misleading because it suggests that it only works with the Sunplus SPCA5xx series of chipsets. While that was true at one time, Michel Xhaard has developed the Spca5xx driver into one of the most versatile Linux webcam drivers in existence today. Next to the mentioned Sunplus chipsets it supports a number of others from manufacturers such as Pixart, Sonix, Vimicro, or Zoran. The (incomplete) list of supported cameras at [23] contains more than 200 cameras, and the author is working on additional chipsets. The main drawback of the Spca5xx driver is the fact that it does not support the V4L2 API yet. This limitation, and the way the driver has quickly grown over time, are the main reasons why the author has recently started rewriting the driver from scratch, this time based on V4L2 and under the name of gspca. Among the many supported cameras on the list, there is a fair number of Logitech's older camera models as well as some newer ones. Again, appendix A has a list of these devices.


The QuickCam Messenger & Communicate driver

This driver supports a relatively small number of cameras, notably a few models of the QuickCam Messenger, QuickCam Communicate, and QuickCam Express series. They are all based on the STMicroelectronics 6422 chip. The driver supports only V4L1 at the time of this writing and can be found at [14].



The QuickCam Express driver

Another relatively limited V4L1 driver, the QuickCam Express driver [19] focuses on the Logitech QuickCam Express and QuickCam Web models that contain chipsets from the STMicroelectronics 6xx series. It is still actively maintained, although there are no signs yet of a V4L2 version.


The Linux USB Video Class driver

Robot contests have been the starting point for many an open source software project. The Linux UVC driver is one of the more prominent examples. It was developed in 2005 by Laurent Pinchart because he needed support for the Logitech QuickCam for Notebooks Pro camera that he was planning to use for his robot. The project quickly earned a lot of interest from Linux users who tried to get their cameras to work. Driven by both personal and community interest, the driver has left the status of a hobby project behind and is designated to become the official UVC driver of the Linux kernel. Since this driver is one of the cornerstones of this project, we will give a basic overview of it here. Later, in section 6.1, we shall discuss extensions and changes that were made to support the Linux webcam infrastructure. The official project website can be found at [17].

Technical overview

The Linux UVC driver, or uvcvideo for short, is a Video4Linux 2 driver and a USB driver at the same time. It registers with the USB stack as a handler for devices of the UVC device class and, whenever a matching device is connected, the driver initializes the device and registers it as a V4L2 device. Let us now look at a few tasks and aspects of the UVC driver in the order they typically occur.

Device enumeration

The first task of any USB driver is to define a list of criteria for the operating system so that the latter knows which devices the driver is willing and able to handle. We saw in section 2.3.3 that some Logitech cameras do not announce themselves as UVC devices even though they are capable of the protocol. For this reason, uvcvideo includes a hard-coded list of product IDs of such devices in addition to the generic class specifier.

Device initialization

As soon as a supported device is discovered, the driver reads and parses the device's control descriptor and, if successful, sets up the internal data structures for units and terminals before it finally registers the camera with the V4L2 subsystem.
At this point, the device becomes visible to user space, usually in the form of a device node, e.g. /dev/video0.

Stream setup and streaming

If a V4L2 application requests a video stream, the driver enters the so-called probe/commit phase to negotiate the parameters of the video stream. This includes setting attributes like video data format,

frame size, and frame rate. When the driver finally receives video data from the device, it must parse the packets, check them for errors, and reassemble the raw frame data before it can send a frame to the application.

Controls

Video streaming does not only consist of receiving video data from the device; applications can also use different controls to change the settings of the camera or the properties of the video stream. These control requests must be translated from the V4L2 requests that the driver receives into UVC requests understood by the device. This process requires some mapping information because the translation is all but obvious. We will have a closer look at this problem and how it can be solved later on.

Outlook

For obvious reasons V4L2 cannot support all possible features that the UVC specification defines. The driver thus needs to take measures that allow user space applications to access such features nonetheless. In section 6.1 we shall see one such example that was realized with the help of the sysfs virtual file system and is about to be included in the project. It is safe to say that the Linux USB Video Class driver is going to be the most important Linux webcam driver for the foreseeable future. Logitech is already moving all cameras onto the UVC track and other vendors are expected to follow, given that UVC is a Windows Vista logo requirement. For Linux users this means that all these cameras will be natively supported by the Linux UVC driver.


V4L2 applications

Ekiga

Ekiga is a VoIP and video conferencing application that supports SIP and H.323, which makes it compatible not only with applications such as NetMeeting but also with conferencing hardware that supports the same standards. It comes with plugins for both V4L1 and V4L2 and is therefore able to support a large number of different webcams. Given the resemblance to other popular conferencing software, Ekiga is one of the main applications for webcams on Linux. It is licensed under the GPL; documentation, sources, and binary packages can be downloaded from [18].

luvcview

This tool was developed by the author of the Spca5xx driver with the intention of supporting some features unique to the Linux UVC driver, hence its name.

Figure 4.2: The main window of Ekiga during a call.

Thanks to its simplicity it has become one of the favorite programs for testing whether a newly installed camera works. It is based on V4L2 for video input and the SDL library for video output. The simple user interface allows basic camera controls to be manipulated, including some of the custom controls that the UVC driver provides to enable mechanical pan/tilt for the Logitech QuickCam Orbit camera series. The latest version includes a patch that was written during this project to help with debugging of camera and driver issues. It makes it easy to save the raw data received from the device into files with the help of command line options. luvcview can be downloaded from [22]. Figure 4.3 shows a screenshot of the luvcview user interface and the command line used to start it in the background.

fswebcam

This nifty application is the proof that not all webcam software needs a GUI to be useful. Being purely command-line based, it can be used to retrieve pictures from a webcam and store them in files, e.g. for uploading them to a web server at regular intervals. The fswebcam website can be found at [9].


Figure 4.3: The window of luvcview and the console used to start it in the background.


V4L applications

Camorama

Camorama is a V4L1-only application made for taking pictures either manually or at specified intervals. It can even upload the pictures to a remote web server. Camorama allows adjusting the most common camera controls and includes a number of video filters, some of which don't seem very stable, though. It can be downloaded from [11] and is part of many Linux distributions. Unfortunately, development seems to stand still at the moment. Figure 4.4 shows Camorama in action.


GStreamer applications

There are many small multimedia applications that use the GStreamer engine as a back-end but relatively few prominent ones. The most used ones are probably Amarok, the default KDE music player, and Totem, GNOME's main media player. At the moment Amarok is limited to audio, although video support is being discussed. What makes Totem interesting from the point of view of webcam users is a little webcam utility called Vanity. Unfortunately it has received very little attention from both developers and users, and it remains to be seen whether the project will be revived or even integrated into Totem.

Figure 4.4: Camorama streaming at QVGA resolution from a Logitech QuickCam Messenger camera using the Spca5xx driver.

We will see another webcam application based on GStreamer in the next chapter when we look at the software that was developed for this project. At that time we shall also see how GStreamer and V4L2 work together.


Problems and design issues

As with every architecture, there are a number of drawbacks, some of which were briefly hinted at in the previous sections. We will now look at these issues in more detail and see what their implications for webcam support on the Linux platform are. At the same time we will look at possible solutions to these problems and at how other platforms handle them.


Kernel mode vs. user mode

The discussion whether functionality X should be implemented in user mode or in kernel mode is an all-time classic in the open source community, particularly around the Linux kernel. Unfortunately these discussions are oftentimes far from conclusive, leading to slower progress in the implementation of certain


features or, in the worst case, to factually discontinued projects due to lack of consent and acceptance. Table 4.1 shows the most notable differences between kernel mode and user mode implementations of multimedia functionality. While the points are focused on webcam applications, many of them can also be applied to other domains like audio processing or even to devices completely unrelated to multimedia. In the following we will analyze these different points and present possible solutions and workarounds.

Kernel space
  + Transparency for user space
  + Direct device access
  + Device works "out of the box"
  - No floating point math
  - Complicated debugging
  - Open source only

User space
  + Simple upgrading
  + Simple debugging
  + Safer (bugs only affect one process)
  + More flexible licensing
  - No callback functions
  - Difficult to establish standard
  - Requires flexible kernel back-end

Table 4.1: Kernel space vs. user space software development

Format transparency

One of the main problems in multimedia applications is the myriad of formats that are in use. Different vendors use different compression schemes for a number of reasons: licensing and implementation costs, memory and processing power constraints, backward compatibility, and personal or corporate preference. For application developers it becomes increasingly difficult to stay current on which devices use which formats and to support them all. In some cases, as with the cameras using the PWC driver, it may even be impossible for someone to integrate certain algorithms for legal reasons. This is a strong argument for hiding the entire format conversion layer from the application, so that every application only needs to support a very small number of standard formats to remain compatible with all hardware and drivers. A typical example is the way the current Logitech webcam drivers for Windows are implemented. While the devices usually provide two formats, compressed MJPEG and uncompressed YUY2, applications get to see neither of these formats. Instead, they are offered the choice between I420 and 24-bit RGB, with the latter being especially easy to process because each pixel

is represented by a red, green, and blue 8-bit color value. These formats are provided independently of the mode in which the camera is being used. For example, if the camera is streaming in MJPEG mode and the capturing software requests RGB data, the driver uses its internal decompressor module to convert the JPEG data coming from the camera into uncompressed RGB. The capturing software is not aware of this process and does not need to have its own JPEG decoder; one nontrivial module fewer to implement. At which layer this format conversion should happen depends on a number of factors of both technical and historical nature. Traditionally, Windows and Linux have seen different attempts at multimedia frameworks, and many of them have only survived because their removal would break compatibility with older applications still relying on these APIs. If vendors and driver developers are interested in supporting these outdated frameworks, they may need to provide format filters for each one of them in the case of a proprietary streaming format. If, however, the conversion takes place in the driver itself, all frameworks can be presented with some standard format that they are guaranteed to understand. This can greatly simplify development by concentrating the effort on a single driver instead of different framework components. There are also performance considerations when deciding on which level a conversion should take place. If two or more applications want to access the video stream of a camera at the same time, they will create as many different pipelines as there are applications. If the format conversion, or any other computationally intensive process, is done in the user space framework, the same process has to be carried out in the pipeline of each application because there is no way for the applications to share the result. This has the effect of multiplying the required work, something that leads to poor scalability of the solution.
In the opposite case, where the conversion process is carried out before the stream is multiplexed, the work is done just once in the driver and all the frameworks receive the processed data as input, thereby significantly reducing the overhead associated with multiple parallel streams.

Feature transparency

Up until now our discussion has focused primarily on format conversion. There exists another category of video processing that is different in a very important way: computer vision. Computer vision is a form of image or video processing with the goal of extracting meta data that enables computers to "see" or at least recognize certain features and patterns. A few classic examples are face tracking, where the algorithm tries to keep track of the position of one or multiple faces, feature tracking, where the computer locates not only the face but features like eyes, nose, or mouth, and face recognition, where software can recognize faces it has previously memorized. To see the fundamental difference between computer vision and format conversion modules we first have to look at a basic mechanism of multimedia frameworks: pipeline graph

construction. When an application wants to play a certain media source, it should not have to know the individual filters that become part of the pipeline in order to do so. The framework should automatically build a flow graph that puts the right decoders and converters in the right order. The algorithms that do this are usually based on capability descriptors that belong to each element, combined with priorities to resolve ambiguities. For example, a decoder filter could have a capability descriptor that says "able to parse and decode .mp3 files" and "able to output uncompressed audio/x-wav data". When an application wants to play an .mp3 file, it can simply request a pipeline that has the given .mp3 file as input and delivers audio/x-wav data as output. In many cases there exist multiple graphs that are able to fulfill the given task, so the graph builder algorithm has to take decisions. Back in our example, there could be two MP3 decoders on the system, one that uses the SIMD instruction set of the CPU if available and one that uses only simple arithmetic. Let us call the first module mp3_simd and assume it has a priority of 100. The default MP3 decoder is called mp3_dec and has a lower priority of 50. Naturally, the graph builder algorithm will first try to build the graph using mp3_simd. If the current CPU supports the required SIMD instructions, the graph construction will succeed. In the opposite case, where the current machine lacks SIMD, mp3_simd can refuse to be part of the graph, but the framework will still be able to build a working graph because it can fall back to our standard decoder, mp3_dec. Imagine now an audio quality improvement filter called audio_qual that accepts uncompressed audio/x-wav data as input and outputs the same type of data. How can the application benefit from audio_qual without having to know about it?
The graph builder algorithm will always take the simplest graph possible, so it does not see an advantage in introducing an additional filter element that, from the algorithm's capability-oriented perspective, is nothing but a null operation. This problem is not easy to solve because making every audio application aware of the plugin's existence is not always practical. The case of computer vision is very similar with respect to the pipeline graph creation process. The computer vision module does not modify the data, so the input and output formats are the same and the framework does not see the need to include the element in the graph. One elegant solution to this problem is to do the processing in kernel mode in the webcam driver before the data actually reaches the pipeline source. Obviously, this approach can require a format conversion in the driver if the computer vision algorithms cannot work directly on the video format delivered by the camera. So the solution presented in the previous section becomes not only a performance advantage but a necessity to support certain features transparently for all applications.


Direct device access
Another main advantage of a kernel mode multimedia framework is that the framework has easy access to special features that the device provides. For example, a new camera model can introduce motion control for pan and tilt. If the user mode multimedia framework is not aware of this, or incapable of mapping these controls onto its primitives, applications running on top of it cannot use these features. Obviously this point also applies to kernel mode frameworks, but it is generally easier to communicate between kernel components than across the barrier between user mode and kernel mode. For an application to be able to communicate with the driver, it is not enough to use the framework API; a special side channel has to be established. The design of such a side channel can turn out to be rather complicated if future reusability is a requirement, because of the difficulty of predicting the features of upcoming devices. We will see a concrete example of this issue, and a possible solution, later on when we look at how the webcam framework developed as part of this project communicates with the device driver.

Callback
Many APIs rely on callbacks to implement certain features, as opposed to polling or waiting on handles. The advantage of this approach is that it has little impact on performance (especially compared to polling) and is much simpler because it does not require the application to use multiple threads to poll or wait. There are many cases where such notification schemes are useful:

- Notification about newly available or unplugged devices
- Notification about controls whose value has changed, possibly as a result of some device built-in automatism
- Notification about device buttons that have been pressed
- Notification about the success or failure of an action asynchronously triggered by the application (e.g. a pan or tilt request that can take some time to finish)
- Notification about non-fatal errors on the bus or in the driver

Unfortunately, current operating systems provide no way to do direct callbacks from kernel mode to user mode. Therefore, for V4L2 applications to be able to use the comfort of callback notification, a user space component would have to be introduced that wraps polling or waiting and calls the application whenever an event occurs. In chapter 7 we propose a design that does just that.

Ease of use
The Linux kernel comes with a variety of features built in, including many drivers that users of other operating systems have to download and install

separately. If a certain device works "out of the box", it makes for a good user experience because people can immediately start using the device and launch their favorite applications. Such behavior is obviously desirable because it frees users from having to compile and install the driver themselves, something that not every Linux user may be comfortable doing. On the other hand, the disadvantage of such an approach is the limited upgradeability of kernel components. Even though current distributions provide comfortable packaging of precompiled kernels, such an upgrade usually requires rebooting the machine. In comparison, upgrading a user mode application is as easy as restarting the application once the application package has been upgraded. In high-availability environments, e.g. in the case of a popular webcam streaming server, the downtime incurred by a reboot can be unacceptable.

Development aspects
For a number of reasons, programming in user mode tends to be easier than programming in kernel mode. Three of these reasons are the variety of development tools, the implications of a software bug, and the comfort of the API. Traditionally there are many more tools available for developing applications than kernel components. The simple reason is, for one, that the development of user space tools is itself easier and, for another, that the number of application developers is just much higher than that of system developers. There is a large variety of debugging tools and helper libraries out there, but almost none of them are applicable to kernel mode software. Therefore the Linux kernel mode developer has to rely mostly on kernel built-in tools. While these are very useful, they cannot compare with the comfort of the kernel debugging tools available on the Windows platform. If a problem in a kernel component occurs, the implications can be manifold.
In some cases the entire machine can freeze without so much as a single line of output that would help locate the problem. In less severe cases the kernel manages to write enough useful debug information to the system log and may even continue to run without the component in question. Nevertheless, such an isolated crash often requires a reboot of the test machine because the crashed component cannot be replaced by a new, and possibly fixed, version anymore. These circumstances inevitably call for two machines, one for development and one for testing. In user mode, an application bug is almost always limited to a single process, and trying out a new version is as easy as recompiling and relaunching the program. Finally, not all the comfort of the API that application programmers are used to is available in kernel space. Seemingly simple tasks like memory allocation, string handling, and basic mathematics can suddenly become much more complicated. One important difference is that floating point operations


are oftentimes not available in kernel mode for performance reasons2. One has to resort to algorithms that avoid floating point computations or apply tricks that are unlikely to receive a positive echo in the Linux kernel community. All of these points make the development of multimedia software in user mode much easier, an important point given the complexity that the involved algorithms and subsystems often have.

Licensing
Nothing speaks against writing closed source software for Linux. As a matter of fact, there is a large number of commercial Linux applications out there that were ported from other operating systems or written from scratch without releasing their source code. The GNU General Public License (GPL), under which the Linux kernel and most of the system software is released, does not forbid closed source applications. The situation for kernel modules, however, is more complicated than that. Since the GPL requires derived works of a GPL-licensed product to be published under the same terms, and most kernel modules are assumed to be derived works, the development of closed source kernel modules is effectively ruled out[20]. There seems, however, to be an acceptable way of including a binary module in the Linux kernel. It basically consists of having a wrapper module, itself under the GPL, that serves as a proxy for the kernel functions required by the second module. This second module can be distributed in binary-only form and does not have to adopt the kernel's license because it cannot be considered a derived work anymore. Even after sidestepping the legal issues of a binary-only kernel module, there remain a few arguments against realizing a project in such a way, notably the lack of acceptance in the community and the difficult maintenance given the large number of different kernel packages that exist. In many cases, the software would have to be recompiled for every minor upgrade and for every flavor and architecture of the supported Linux distributions.
This can drastically limit the scope of supported platforms.


The Video4Linux user mode library

One solution to most of the problems just described keeps coming up when new and missing features and design issues are discussed on the V4L mailing list: a widely available, open source, user mode library that complements the kernel part of V4L2. Such a library could take over tasks like format conversion, providing a flexible interface for more direct hardware access, and taking complexity away from today's applications. At the same time, the kernel part could concentrate entirely on providing the drivers that abstract device capabilities and on making sure that they implement the interfaces required by the V4L library.
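Format conversion is a good example of a task such a library could absorb. As an illustration, extracting the luminance plane from packed YUYV (YUV 4:2:2) data, one of the formats commonly delivered by webcams, is a simple deinterleaving step; this is only a sketch of one such conversion, not part of any existing library:

```c
#include <stddef.h>

/* Extract the Y (luminance) bytes from a packed YUYV 4:2:2 buffer,
 * yielding a grayscale image. In YUYV, two pixels occupy four bytes
 * (Y0 U Y1 V), so the luma samples sit at every even byte offset. */
void yuyv_to_gray(const unsigned char *yuyv, unsigned char *gray,
                  size_t pixels)
{
    for (size_t i = 0; i < pixels; i++)
        gray[i] = yuyv[2 * i];
}
```

A library providing a catalog of such converters would let applications request their preferred format regardless of what the camera actually delivers.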
2 Banning floating point from kernel mode allows the kernel to omit the otherwise expensive saving and restoring of floating point registers when the currently executing code is preempted.


While the approach sounds very promising and would bring the Linux multimedia platform a large step forward, nobody has found themselves willing or able to start such a project. In the meantime, other user mode frameworks like GStreamer or NMM have partly stepped into the breach. Unfortunately, since these frameworks do not primarily target V4L, they are rarely able to abstract all desirable features. The growing popularity of these multimedia architectures, in turn, makes it increasingly harder for a V4L library to become widespread and eventually the tool of choice for V4L2 front-ends. It seems fair to say that the project of a V4L user mode library died long before it even got to the stage of a draft, and it would take a fair amount of initiative to revive it.


V4L2 related problems

Video4Linux has a number of problems that have their roots partially in the legacy of V4L1 and Unix systems in general, as well as in design decisions that were made with strictly analog devices in mind. For some of them easy fixes are possible, for others solutions are more difficult.

Input and output
We saw in section 4.2.2 that V4L2 provides two different ways for applications to read and write video data: the standard read and write system calls, and memory-mapped buffers (mmap). Device input and output using the read/write interface used to be, and in some cases still is, very popular, but it is no longer the technique of choice because it does not allow meta information such as frame timestamps to be communicated alongside the data. This classic I/O-based approach, in turn, has the advantage of enabling every application that supports file I/O to work with V4L2 devices. While it would be possible for drivers to implement both techniques, some of them choose not to support read/write and mmap at the same time. The uvcvideo driver, for example, does not support the read/write protocol in favor of the more flexible mmap. The fact that the availability of either protocol depends on the driver in use erodes the usefulness of the abstraction layer that V4L is supposed to provide. To be on the safe side, an application would have to implement both protocols, again something that not all application authors choose to do. Usually their decision depends on the purpose of their tool and the hardware they have access to during development.

The legacy of ioctl
The ioctl system call was first introduced with AT&T Unix version 7 in the late seventies. It was used to exchange control data that did not fit into the stream-oriented I/O model. The operating system forwards ioctl requests directly to the driver responsible for the device. Let us look at the prototype


of the ioctl function to understand where some of the design limitations in V4L2 come from:

int ioctl(int device, int request, void *argp);

There are two properties that stick out for an interface based on this function:

1. There is only one untyped argument for passing data.
2. Every call needs a device handle.

The fact that ioctl provides only one argument for passing data between caller and callee is not a serious technical limitation in practice, and neither is its untypedness. The way this interface is used, however, prevents the compiler from doing any sort of compile-time type checking, leading to possibly hard-to-find bugs if a wrong data type is passed. For developers this also makes for a rather unintuitive interface, since even relatively simple requests require data structures to be used where a few individual arguments of basic types would be simpler. While the first point is mostly a cosmetic one, the second one imposes a more important limitation on applications: no "stateless" calls to the V4L2 subsystem are possible. Since the operating system requires a device handle to be passed with the ioctl request, the application has no choice but to open the device prior to making the ioctl call. As a consequence, this eliminates the possibility of device-independent V4L2 functions. It is easy to come up with a few occasions where such stateless functions would be desirable:

- Device enumeration. It is currently left to the application to enumerate the device nodes in the /dev directory and filter those that belong to V4L2 devices.
- Device information querying. Unless the driver supports multiple opening of the same device, something that is not trivial to implement because the associated policies have to be carefully thought through, applications have no more information than what the name of the device node itself provides. Currently this is restricted to the device type (video devices are called videoN, radio devices radioN, etc., where N is a number).
- Module enumeration. If the V4L2 system were to provide format conversion and other processing filters, applications would want to retrieve a list of the currently available modules without having to open a device first.
- System capability querying. Similarly, V4L2 capabilities whose existence is independent of a device's presence in the system could be queried without the need for the application to know which capability was introduced with which kernel version and hardcoding corresponding conditionals.
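Device enumeration today therefore boils down to name matching on /dev entries, along the lines of the helper below. The naming policy shown is a simplification for illustration; real applications may also check device major numbers or open the node and query its capabilities:

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if a /dev entry name looks like a V4L2 video device node
 * ("video" followed by a number), 0 otherwise. This mirrors the kind
 * of string matching applications resort to in the absence of a
 * stateless enumeration call. */
int is_video_node(const char *name)
{
    const char *p = name;
    if (strncmp(p, "video", 5) != 0)
        return 0;
    p += 5;
    if (*p == '\0')
        return 0;            /* a trailing number is required */
    while (*p) {
        if (!isdigit((unsigned char)*p))
            return 0;
        p++;
    }
    return 1;
}
```

An application would run this predicate over a directory listing of /dev, which is exactly the kind of fragile convention a stateless enumeration API could replace.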

It is clear that the current API was designed to blend in nicely with the Unix way of communicating between applications and system components. While this keeps the API rather simple from a technical point of view, it has to be asked whether it is worth sticking to these legacy interfaces that clearly were not, and could not at the time have been, designed to handle all the cases that come up nowadays. Especially for fast advancing areas like multimedia, a less generic but more flexible approach is often desirable.

Missing frame format enumeration
We have mentioned that the current Video4Linux API was designed mostly with analog devices in mind. Analog video devices have a certain advantage over digital ones in that they oftentimes have no constraints as to the video size and frame rate they can deliver. For digital devices, this is different. While the sensors used by digital webcams theoretically provide similar capabilities, these are hidden by the firmware to adapt to the way that digital video data is transmitted and used. So while an analog TV card may very well be capable of delivering an image 673 pixels wide and 187 pixels high, most webcams are not. Instead, they limit the supported resolutions to a finite set, most of them with a particular aspect ratio such as 4:3. Similar restrictions apply to frame rates, where multiples of 5 or 2.5 dominate. One implication of this is that at the time V4L2 was designed, there was no need to provide applications with a way to retrieve these finite sets. This has peculiar effects at times: many applications are completely unaware of the frame rate and rely on the driver to apply a default value, and the only way for V4L2 applications to enumerate frame rates is to test them one by one and check whether the driver accepts them.
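Such trial-and-error frame rate discovery can be mocked up as follows. The try_rate predicate stands in for an actual set-and-verify round trip to the driver, and the candidate list is an assumption based on the multiples of 5 and 2.5 fps that dominate in practice:

```c
/* Candidate frame rates worth probing, in tenths of a frame per
 * second (25 = 2.5 fps, 300 = 30 fps). */
static const int candidates[] = { 25, 50, 75, 100, 125, 150, 200, 250, 300 };

/* Probe each candidate with the supplied predicate (standing in for a
 * set-and-verify ioctl round trip) and record the accepted ones.
 * Returns how many candidates were accepted. */
int enumerate_rates(int (*try_rate)(int tenth_fps), int *accepted, int max)
{
    int n = 0;
    for (unsigned i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (n >= max)
            break;
        if (try_rate(candidates[i]))
            accepted[n++] = candidates[i];
    }
    return n;
}

/* Example stand-in for a driver that accepts only 15 and 30 fps. */
static int mock_accepts(int tenth_fps)
{
    return tenth_fps == 150 || tenth_fps == 300;
}
```

The obvious weakness of this scheme is that any rate outside the hardcoded candidate list remains invisible, which is precisely what a proper enumeration ioctl avoids.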
Since a one-by-one enumeration of resolutions is impossible due to the sheer number of possible value combinations, applications simply have to live with this limitation and either provide a hardcoded list of resolutions likely to be supported or have the user enter them by hand. Once a selection is made, the application can test the given resolution. To make this process less frustrating than it sounds, V4L2 drivers return the nearest valid resolution if a resolution switch fails. As an example, if an application requests 660x430, the driver would be likely to set the resolution to 640x480. We shall see in 6.2 how this severe limitation was removed by enhancing the V4L2 API.

Control value size
Another limitation that is likely to become a severe problem in the future is the structure that V4L2 uses to get and set the values of device controls:

struct v4l2_control {
        __u32 id;    /* Identifies the control. */
        __s32 value; /* New value or current value. */
};

The value field is limited to 32 bits, which is satisfactory for most simple controls but not for more complex ones. This has already given rise to the recent introduction of extended controls (see the VIDIOC_G_EXT_CTRLS, VIDIOC_S_EXT_CTRLS, and VIDIOC_TRY_EXT_CTRLS ioctls in [5]), which allow applications to group several control requests and provide some room for extension. We will come back to this issue at the beginning of chapter 5 when we discuss the goals of our webcam framework.

Lack of current documentation
The last problem we want to look at is unfortunately not limited to V4L2 but affects a wide range of software products, especially in the non-commercial and open source sector: poor documentation. The V4L2 documentation is split into two parts, an API specification for application programmers[5] and a driver writer's guide[4]. While the first one is mostly complete and up-to-date, the latter is completely outdated, of little help except for getting a first overview, and gives no guidelines on how to implement a driver and what to watch out for. The main source of information on how to write a V4L2 driver is therefore the source code of existing drivers. The lack of a reference driver does not make the choice easy, though, and some poorly written drivers exist out there. Moreover, there is little documentation available on what the V4L2 subsystem actually does and does not do. Again, delving into the source code is the best and only way to get answers. This lack of starting points for developers is likely one of the biggest problems of V4L2 at the moment. It sets the threshold for newcomers quite high and makes it hard for established developers to find common guidelines to adhere to, something that in turn prevents code sharing and modularization of common features.
As part of this project the author has tried to set a good example by properly documenting the newly added frame format enumeration features and providing a reference implementation that demonstrates their usage. One can only hope that the current developers eventually take a little time out of their schedules to document the existing code as long as the knowledge and recollection is still there.

Stream synchronization
There is one important aspect normally present in multimedia frameworks that all applications known to the author have blissfully ignored without any obviously bad consequences: synchronization of multimedia streams.

Whenever a computer processes audio and video inputs simultaneously, there is an inevitable tendency for the two streams to slowly drift apart as they are recorded. This has numerous causes, and there are different strategies to reduce the problem, many of which are explained in [12], an excellent article by the author of VirtualDub, an extremely popular video processing utility for Windows. The fact that no bad consequences can be observed with current Linux webcam software does not mean, however, that the problem does not exist on the Linux platform. The problem only becomes apparent when videos are recorded that include an audio stream, and none of the common applications seem to do that yet. Once this has changed, applications will need to figure out a way to keep the video and audio streams from drifting apart. V4L2 on its own cannot prevent this because it has no access to the audio data. Despite all these problems, Linux has a functioning platform for webcams today. It is only a matter of time and effort to resolve them one by one. The next chapter is a first step in that direction, as it provides some ideas and many real solutions.


Chapter 5

Designing the webcam infrastructure

5.1 Introduction

After having seen all the relevant requirements for operating a webcam on Linux, we can finally discuss what our webcam framework looks like. This chapter treats the ideas and goals behind the project, how we have tackled the difficulties, and why the solution looks the way it does today. We will present all the components involved in a high-level manner and save the technical details for the two following chapters. To conclude, we shall revisit the problems discussed in the previous chapters and summarize how our solution solves them and strives to avoid similar problems in the future. Before doing so, however, we need to be clear about the goals we want to achieve and set priorities. Software engineering without clear goals in mind is almost guaranteed to lose focus on the main tasks over the little things and features.



The main goal of the project, enhancing the webcam experience of Linux users, is a rather vague one and does not primarily lend itself as a template for a technical specification. It does, however, entail a number of secondary goals, or means, that fit together to achieve the primary goal. These goals are of a more concrete nature and can be broken down into technical or environmental requirements. Apart from the obvious technical challenges that need to be solved, there is another group of problems that are less immediate but must nevertheless be carefully considered: business and legal decisions. When a company ventures into open source software, conflicts inevitably arise, usually between protection


of intellectual property and publishing source code. Their consideration has played an important role in defining the infrastructure of the webcam framework, and we will return to the topic when discussing the components affected by it. Let us now look at the different goals one by one and how they were achieved.

A solution that works
As trivial as it may sound, the solution should work. Not only on a small selection of systems that happens to be supported by the developer, but on as broad a system base as possible and for as many users as possible. Nothing is more frustrating for a user than downloading a program just to find out that it does not work on their system. Unfortunately, limiting the system base to a certain degree cannot always be avoided, for practical and technical reasons. Practical reasons are mostly due to the fact that it is impossible to test the software on every system combination out there. Many different versions of the kernel can be combined with just as many different versions of the C runtime library. On the technical side there is an entire list of features that a given solution is based on and without which it cannot properly work. The size of the supported system base is therefore a tradeoff between development and testing effort on one side and satisfying as many users as possible on the other. Making this tradeoff was not particularly difficult for this project, as one of the pillars of the webcam framework already sets a quite strict technical limit. For USB 2.0 isochronous mode to work properly, a Linux kernel with version 2.6.15 or higher is strongly recommended because the USB stack of earlier versions is known to have issues that can cause errors in the communication between drivers and devices. In a similar way, certain features of Video4Linux 2 only became available in recent versions of the kernel, notably the frame format enumeration that we will see in 6.2.
This does not mean, however, that the solution does not work at all on systems that do not meet these requirements. The feature set of the webcam framework on older platforms is just smaller. Everything that does not depend on features of the UVC driver works on kernels older than 2.6.15, and a V4L2 implementation that does not provide frame format enumeration prevents only this particular feature from working.

A solution that works best, but not exclusively, with Logitech cameras
Parts of the solution we have developed are clearly optimized for the latest Logitech cameras, no need to hide this fact. Logitech has invested large amounts of money and time into developing the QuickCam hardware and software. There is a lot of intellectual property contained in the software, as well as some components licensed from third party companies. Even if Logitech wanted to distribute these features in source code form, it would not be legally possible. As a result, these components must be distributed in binary format, and they are designed to work only if a Logitech camera is present in


the system, because other cameras don't implement the necessary features. These binary components are limited to a single dynamic library that is not required for the webcam infrastructure to work. For users this means that there is some extra functionality available if they are using a Logitech camera, but nothing stops them from using the same software with any other UVC compliant camera.

Planning ahead
In the fast moving world of consumer electronics it is sometimes hard to predict where technology will lead us in a few years from now. Future webcams will have many features that today's software does not know about. It is therefore important to be prepared for such features by designing interfaces in a way that makes them easily extensible to accommodate new challenges. A typical example of this necessity is the set of values of certain camera controls. Most controls are limited to 32-bit integer values, which is enough for simple controls such as image brightness or camera tilt. One can imagine, however, that certain software supported features could need to transmit chunks of data to the camera that do not fit in 32 bits. Image processing on the host could compute a list of defective pixels that the camera should interpolate in the firmware, or it could transmit region information to help the camera use different exposure settings for foreground and background. In the provided solution we have avoided fixed-length value limitations wherever possible. Each control can have arbitrarily long values, and all fixed-length strings, often used in APIs for simplicity reasons, have been replaced by variable-length, null-terminated strings. While it is true that this approach is slightly more complicated for all involved parties, it ensures that future problems do not encounter data width bottlenecks. We have carefully planned the API in a way that puts the burden on the libraries and not on the applications and their developers.
For applications, buffer management is mostly transparent, and the enumeration API functions are no different than if fixed-width data had been used. Another example that guarantees future extensibility is the generic access to UVC extension units that we added to the UVC driver. Without such a feature, the driver would need to be updated for every new camera model, the very process that generalizing standards like UVC strive to avoid. The new sysfs interface of the UVC driver allows user mode applications generic raw access to controls provided by a camera's UVC extension units. Since these extension units are self-descriptive, the driver can retrieve all required information at runtime and need not be recompiled. There are a few other places where we have planned ahead for future extensions, such as the abstraction layers we are taking advantage of and the modularity of some of the involved modules. These examples will be explained in more detail in the rest of this chapter.
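Variable-length strings of the kind mentioned earlier are typically kept manageable for applications through a two-call pattern: one call to learn the required buffer size, a second to fetch the data. The sketch below uses invented function names and a canned value; the actual webcam framework API may look different:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a framework call that reports a control's name: if buf
 * is NULL it only reports the needed size (including the terminating
 * null byte); otherwise it copies the name into buf if it fits. */
static const char *demo_name = "Pan/Tilt (relative)";
int query_control_name(char *buf, size_t *size)
{
    size_t needed = strlen(demo_name) + 1;
    if (buf == NULL) {
        *size = needed;     /* first call: size query only */
        return 0;
    }
    if (*size < needed)
        return -1;          /* caller's buffer is too small */
    memcpy(buf, demo_name, needed);
    return 0;
}

/* Convenience wrapper: hides the size query and the allocation from
 * the application; the caller frees the result. */
char *get_control_name(void)
{
    size_t size = 0;
    char *buf;
    if (query_control_name(NULL, &size) != 0)
        return NULL;
    buf = malloc(size);
    if (buf != NULL && query_control_name(buf, &size) != 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Pushing the wrapper into the library is what keeps the burden on the library side rather than on every application, as described above.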


Dealing with current problems
A prerequisite for, and at the same time a goal of, this project was solving the problems we saw in chapter 4 in the best manner for everybody. This means that we did not want to further complicate the current situation by introducing parallel systems, but instead help solve these problems so that currently existing applications can also leverage the improvements we required for our framework. Admittedly, it may sometimes seem easier to reinvent the wheel than to improve the wheels already in place, but in the end having a single solution that suits multiple problems is preferable, because a combined effort often achieves a higher quality than two half-baked solutions do. The effects of a developer branching the software out of frustration with the direction a project is taking can be seen quite often in the open source community. The recent Mambo/Joomla dispute1 is a typical example where it is doubtful that the split has resulted in an advantage for any of the involved parties. Let us use the UVC driver as an example to illustrate the situation in the webcam context. Creating our own driver or forking the current one would have made it easier to introduce features that are interesting for Logitech, because we could have changed the interface without discussing the implications with anyone. By doing so, however, both drivers would have received less testing and it would have been harder to synchronize changes applicable to both branches. Keeping a single driver is a big win for the Linux webcam user and avoids the frustrating situation where two similar devices require two slightly different drivers.

Community acceptance
Many Linux projects with a commercial background have received a lukewarm reception from the open source community in the past, sometimes for valid reasons, sometimes out of fear and skepticism. There is no recipe for guaranteed acceptance by the Linux community, but there are a few traps one can try to avoid.
One of the traps that many companies fall into is that they strictly limit the use of their software to their own products. Obviously, for certain device classes they may not have any choice; take the example of a graphics board. Fortunately, for the scope of this project, this was relatively easy to avoid given that the webcams for which it was primarily designed adhere to the USB Video Class standard. Linux users have every interest in good UVC support, so there were very few negative reactions to Logitech's involvement. The fact that somebody was already developing a UVC driver when we started the project may also have helped convince some of the more suspicious characters out there that it was not our intent to create a software solution that was merely for Logitech's benefit. Throughout the project we have striven to add the features to the UVC driver that we depend on for the best support of our cameras in the most generic
1 The open source content management system Mambo was forked in August 2005 after the company that owned the trademark founded a non-profit organization with which many of the developers did not agree. The fork was named Joomla.


way, so that devices of other vendors can take advantage of them. A typical example of this is the support for UVC extensions. While not strictly necessary for streaming video, all additional camera features are built on top of UVC extension units. It can therefore be expected that other vendors will use the same mechanisms as Logitech, so that by the time more UVC devices appear on the market, they will already be natively supported by Linux.

Avoid the slowness of democracy
This goal may at first seem diametrically opposed to the previous point. The open source community is a democracy where everyone can contribute their opinions, concerns, and suggestions. While this often helps make sure that bad solutions never even end up being realized, it renders the process similarly slow as in politics. For a project with time constraints and full-time jobs behind it, this is less than optimal, so we had to avoid being stalled by long discussions that dissolve without yielding an actual solution. As is so often the case, it can turn out to be more fruitful to confront people with an actual piece of software that they can touch and test. Feedback becomes more concrete, the limitations become more visible, and so do the good points. If a project finds rapid acceptance with users, developers are likely to become inspired and contribute, or eventually use some of the ideas for other projects. We are confident that the webcam framework will show some of the pros as well as the cons that a user mode library brings. Maybe one day somebody will revive the project of a V4L2 user mode library and integrate parts of the webcam framework as a subset of its functionality, because that is where it would ideally lie.


Architecture overview

With a number of high-level goals in mind, we can start to translate these goals into an architecture of components and specify each component's tasks and interfaces. To start off, let us compare what the component stack looks like with the conventional approach on one side and with the webcam framework on the other. From section 4.2 we already know how V4L2 interfaces with the UVC driver on one side and the webcam application on the other (figure 5.1a). The stack is relatively simple as all data, i.e. control and video data, flows through V4L2, which does not carry out any processing itself. This approach is used by all current webcam applications and suffers from a few issues identified in section 4.5. The webcam framework positions itself between the operating system and the application that receives live video from a camera. Figure 5.1b illustrates the different subsystems involved and where the core of the webcam framework is located. We see that the webcam framework fills a relatively small spot in the entire system, but it is one of two interfaces that a webcam application interfaces



Figure 5.1: Layer schema of the components involved in a video stream with (a) the conventional approach and (b) the webcam framework in action. Note the border between user space and kernel space and how both V4L2 and sysfs have interfaces to either side.


with to communicate with the camera. This leaves the application the flexibility to choose for every task the component that performs it best: V4L2 for video streaming and related tasks such as frame format enumeration or stream setup, the webcam framework for accessing camera controls and advanced features that require more detailed information than what V4L2 provides.



Despite what the previous schema suggests, the Linux webcam framework is not a single monolithic component but a collection of different libraries with strictly separated tasks. This modularity ensures that no single component grows too complicated and that the package remains easy to maintain and use. Figure 5.2 gives an overview of the entire framework in the context of the GStreamer and Qt based webcam application, as well as a panel application. Both of these applications are provided as part of the package and can be seen in action in chapter 8.

Figure 5.2: Overview of the webcam framework kernel space and user space components. The dashed box shows the three components that use the GStreamer multimedia framework.

In the remainder of this section we will look at all of these components,


what their tasks are, and what the interfaces between them look like. While doing so we shall see how they accomplish the goals discussed above.


UVC driver

The UVC driver was already introduced in section 4.3.5, therefore we will only give a short recapitulation at this point. Its key tasks are:

- Supervise device enumeration and register the camera with the system.
- Communicate with the camera using the UVC protocol over USB.
- Verify and interpret the received data.
- Respond to V4L2 requests originating from applications.
- Provide additional interfaces for features not supported by V4L2.

It is the last of these points that makes it a key component in the webcam framework. Conventional webcam drivers oriented themselves towards the features supported by V4L2 and tried to implement these as far as possible. This was not an easy task since the specifications available to the developers were often incomplete or even had to be reverse engineered from scratch. Therefore the necessity to support features unknown to V4L2 rarely arose. With the USB Video Class standard this is completely different. The standard is publicly available and if both device manufacturers and driver engineers stick to it, compatibility comes naturally. The challenge stems from the fact that the functions described in the UVC standard are not a subset of those supported by V4L2. It is therefore impossible for a Video4Linux application to make use of the entire UVC feature spectrum without resorting to interfaces that work in parallel to the V4L2 API. For the UVC driver the sysfs virtual file system takes over this role. It provides raw access to user mode software in a generic manner, all of this in parallel to the V4L2 API, which is still used for the entire video streaming part and provides support for a fairly general subset of the camera controls.



We have seen previously that Video4Linux has two key tasks relevant to webcams:

- Make the video stream captured by the device driver available to applications.
- Provide image and camera related controls to applications.

V4L2 is good at the first point but it has some deficiencies when it comes to the second one due to its limitation of control values to 32 bits (see 4.5.3). This is why our scenario does not rely solely on V4L2 for webcam controls but uses the UVC driver's sysfs interface where necessary.

We can see from the figures that V4L2 serves as the interface between user mode and kernel mode. In user mode it takes requests from the application, which it then redirects towards the UVC driver that runs in kernel mode, and vice-versa for the replies that originate from the driver and end up in the application. Another important point is that V4L2 is not limited to talking to one application at a time. As long as the driver supports it (there is no multiplexing done on Video4Linux's part), the same device can be opened multiple times by one or more processes. This is required by the current webcam framework because the video application is not the only component to access the V4L2 device handle. We will see the different access scenarios as we go.



Parts of our webcam framework are built on top of GStreamer because, in our opinion, it is currently the most advanced multimedia framework on the Linux platform. Its integration with the GNOME desktop environment proves that it has reached a respectable degree of stability and flexibility, and Phonon, the multimedia framework of KDE 4, will have a back-end for GStreamer. Together with the ongoing intensive development that takes place, this makes it a safe choice for multimedia applications and is likely to guarantee a smooth integration into future software. Note that even though, currently, GStreamer is the only framework supported by the Linux webcam framework, plugins for different libraries like NMM can be written very easily. All that needs to be ported in such a case is the lvfilter plugin, the interface between GStreamer and liblumvp. This will become clear as we talk more about the components involved. There are three elements in the figure that take advantage of the GStreamer multimedia framework. Simply speaking, the box labeled GStreamer is the "application" as far as V4L2 is concerned. Technically speaking, only the GStreamer v4l2src plugin uses the V4L2 API; all other components use techniques provided by the GStreamer library to exchange data. Figure 5.3 visualizes this by comparing the component overview of a V4L2 application to a GStreamer application that uses a V4L2 video source.



As the name already suggests, this plugin is the source of all V4L2 data that flows through the GStreamer pipeline. It translates V4L2 device properties into pad capabilities and pipeline state changes into V4L2 commands. This is best illustrated by an example. Table 5.1 shows the functions that v4l2src uses and the V4L2 counterparts that they call. Note that v4l2src does not directly process the GStreamer state transitions but is based on the GstPushSrc plugin that wraps those and uses a callback mechanism. The capability negotiation that is carried out during stream initialization uses the information retrieved from V4L2 function calls like ENUM_FMT or

(a) Components involved when a V4L2 application displays video.

(b) Components involved when a GStreamer based application displays V4L2 video.

Figure 5.3: Component overview with and without the use of the GStreamer multimedia framework.

GStreamer    Description
start        Initialization
get_caps     Format enumeration
set_caps     Stream setup
create ...   Streaming
stop         Cleanup

Table 5.1: Translation between GStreamer and V4L2 elements and functions.


G_FMT to create a special data description format that GStreamer uses internally to check pads for compatibility. There are two so-called caps descriptors involved in our example, the pad capabilities and the fixed capabilities. The former is created by enumerating the device features during the get_caps phase. It is a set that contains the supported range of formats, resolutions, and frame rates and looks something like this:

video/x-raw-yuv, format=YUY2, width=[ 160, 1280 ], height=[ 120, 960 ], framerate=[ 5/1, 30/1 ]; image/jpeg, width=[ 160, 960 ], height=[ 120, 720 ], framerate=[ 5/1, 25/1 ]

The format is mostly self-explanatory. The camera supports two pixel formats, YUV (uncompressed) and MJPEG (compressed), and the intervals give the upper and lower limits on frame size and frame rate. Note that the section for the uncompressed format has an additional format attribute that specifies the FourCC code. This is necessary for the pipeline to identify the exact YUV format used as there are many different ones, with YUY2 being only one of them. The descriptor for the fixed capabilities is set only after the set_caps phase when the stream format has been negotiated with V4L2. This capability contains no ranges or lists but is a simple subset of the pad capabilities. After requesting an uncompressed VGA stream at 25 fps from the camera, for example, it would look as follows:

video/x-raw-yuv, format=YUY2, width=640, height=480, framerate=25/1

We can clearly see that the format chosen for the pipeline is a subset of the pad capabilities seen above. The intervals have disappeared and all attributes have fixed values now. All data that flows through the pipeline after the caps are fixed is of this format.



The Logitech video filter component, or lvfilter for short, is also realized as a GStreamer plugin. Its task is relatively simple: intercept the video stream when enabled (filter mode) and act as a no-op when disabled (pass-through mode).


We will come back to the functionality of lvfilter when we look at some of the other components, in particular liblumvp. For the moment, let lvfilter be a no-op.


LVGstCap (part 1 of 3: video streaming)

The sample webcam software provided as part of the framework is LVGstCap, the Logitech Video GStreamer Capture application. It is the third component in our schema that uses GStreamer and the only one with a user interface. LVGstCap is also the first webcam capture program to use the approach depicted in figure 5.1b, i.e. use both V4L2 and the webcam framework simultaneously to access the device. This fact remains completely transparent to the user as everything is nicely integrated into a single interface. Among others, LVGstCap provides the basic features expected from a webcam capture application:

- List the available cameras and select one.
- List the available frame formats (i.e. a combination of pixel format, image resolution, and frame rate) and select one.
- Start, stop, and freeze the video stream.
- Modify image controls (e.g. brightness, contrast, sharpness).

These features work with all webcams as long as the camera is supported by Linux and its driver works with the GStreamer v4l2src plugin. On top of this basic functionality LVGstCap supports some additional features. We will talk about them in parts 2 and 3.



The Webcam library is a cornerstone of the webcam framework in that all other new components rely on it in one way or another. Being more than only an important technical element, libwebcam realizes part of what the Video4Linux user space library was always supposed to be: an easy to use library that shields its users from many of the difficulties and problems of using the V4L2 API directly. Today libwebcam provides the following core features:

- Enumeration of all cameras available in the system.
- Provide detailed information about the detected devices.
- Wrapper for the V4L2 frame format enumeration.
- Provide unified access to V4L2 and sysfs camera controls.

In addition, the interface is prepared to handle device events ranging from newly detected cameras through control value changes to device button events. It is easy to add new features without breaking application compatibility, and the addition of new controls or events is straightforward.



The Webcam panel library takes libwebcam one step further. While libwebcam is still relatively low-level and does not interpret any of the controls or events directly, libwebcampanel does just that. It combines internal information about specific devices with the controls provided by libwebcam to provide applications with meta information and other added value. This makes it a common repository for device-specific information that would otherwise be distributed and duplicated within various applications. The core features of libwebcampanel are:

- Provide meta data that applications need to display camera information and user-friendly control elements.
- Implement a superset of libwebcam's functionality.
- Give access to the feature controls that liblumvp provides.

We can see that the main goal of libwebcampanel is making the development of generic webcam applications easier. It is for this reason that most applications will want to use libwebcampanel instead of the lower-level libwebcam. The last point of the above list will become clear when we discuss liblumvp. Before doing so, however, let us look at LVGstCap one more time to see how it uses the control meta information.


LVGstCap (part 2 of 3: camera controls)

When the user selects a device in LVGstCap, it immediately enumerates the controls that the chosen device provides and displays them in a side panel. Ordinarily, i.e. in the case of V4L2 controls, there is no additional information on the control apart from the value range and whether the control is a number, a Boolean, or a list of choices. While most controls can be made to fit in one of these categories, in practice there are a number of controls for which this representation is not quite right. Two examples are controls whose value is a bitmask and read-only controls. In the former case it seems inappropriate to present the user with an integer control that accepts values from, say, 0 to 255 when each bit has a distinct meaning. libwebcampanel might transform such a control either into a list of eight choices if the bits are mutually exclusive or split it up into eight different Boolean controls if arbitrary bit combinations are allowed. This allows LVGstCap to display the controls in a generic manner. In the case of read-only controls the user should not be allowed to change the GUI element but still be able to read its current value. Therefore, if libwebcampanel sets the read-only flag on a certain control, LVGstCap will disable user interaction with it and gray it out to make this fact visually clear to the user. We will see a few concrete examples of such cases later in chapter 7.




The name liblumvp stands for Logitech user mode video processing library. It is the only component of the webcam framework that is not open source because it contains Logitech intellectual property. liblumvp consists of a fairly simple video pipeline that passes the video data it receives through a list of plugins that can process and modify the images before they are output again. The library receives all its input from lvfilter. Whenever lvfilter is in filter mode, it sends the video data it intercepts to liblumvp and uses the (possibly modified) video buffer it receives back as its output. All of this remains transparent to the application.2 One can think of a multitude of plugins that liblumvp could include; basically it could implement all the features that Logitech QuickCam provides on Windows. This requires applications to be able to communicate with these plugins, for example to enable or disable them or change certain parameters. For this reason, the library exposes a number of controls, so-called feature controls, in a manner almost identical to how libwebcam does it. This is where the second reason for the additional layer introduced by libwebcampanel lies: it can provide applications with a list of hardware camera controls on the one hand and a list of liblumvp software controls on the other hand. Applications can handle both categories in an almost symmetric manner,3 which is just what LVGstCap does.


LVGstCap (part 3 of 3: feature controls)

LVGstCap uses libwebcampanel not only for presenting camera controls to the user but also for feature controls if liblumvp is currently enabled. When a video stream is started, the feature control list is retrieved and its control items are displayed to the user in a special tab next to the ordinary controls. The application also has access to the names of the different features that liblumvp has compiled in. This information can be used to group the controls into categories when required. When the user changes a feature control, LVGstCap communicates this to libwebcampanel, which takes care of the communication with liblumvp. We will later see that this communication is not as trivial in all cases as it may look at first. In the example of a video application that incorporates both video output and control panel in a single process, there is no need for special measures. There is, however, a case where this does not hold true: panel applications.
2 As a matter of fact, the application must explicitly include lvfilter in its GStreamer pipeline, but once the pipeline stands, its presence is transparent and needs no further attention. We will see the advantages and disadvantages of this in chapter 7.
3 The reasons why the two are not treated exactly the same are explained in chapter 7.




A panel application is a (usually simple) program that does not do any video handling itself but allows the user to control a video stream that is currently active in another application. There are a few situations where panel applications are useful:

- Allow command line tools or scripts to modify video stream parameters.
- Permit control over the video stream of an application that does not have its own control panel.
- Provide an additional way of changing controls, e.g. from a tray application.

Our webcam framework includes an example application of the first kind, a command line tool called lvcmdpanel. Figure 5.4 shows the output of the help command. Chapter 8 has a sample session to illustrate some of the commands.
lvcmdpanel 0.1

Control webcam video using the command line

Usage: lvcmdpanel [OPTIONS]... [VALUES]...

  -h, --help                Print help and exit
  -V, --version             Print version and exit
  -v, --verbose             Enable verbose output
  -d, --device=devicename   Specify the device to use
  -l, --list                List available cameras
  -c, --clist               List available controls
  -g, --get=control         Retrieve the current control value
  -s, --set=control         Set a new control value

Figure 5.4: Command line options supported by lvcmdpanel.


Flashback: current problems

In chapter 4 we discovered a number of issues that current V4L2 applications have to deal with. Let us now revisit them one by one and show how our webcam framework avoids or solves them. Note that we don't go into great technical detail here but save that for chapter 7.

Avoid kernel mode components Apart from some work on the UVC driver and V4L2 that was necessary to exploit the full feature set provided by current webcams, the entire framework consists of user mode components. This demonstrates that there are good ways to realize video processing and

related tasks in user mode today and that for most of the associated drawbacks good solutions can be found.

Direct device access While direct device access can never be achieved without the support of select kernel mode components, we tackled this problem by extending the UVC driver so that it allows user mode applications to access the full spectrum of UVC extensions. With the help of sysfs, we have developed an interface that is superior to any standard C interface in that it allows shell scripts and system commands to access the hardware in an intuitive way.

Simple API We have seen that mechanisms such as function callbacks are valuable if not indispensable for certain features like event notification. The webcam framework provides the corresponding interfaces that can be used as soon as the kernel space components implement the necessary underlying mechanisms. In addition, the enumeration APIs that our libraries provide are superior in terms of usability to those that V4L2 offers. While some V4L2 functions like frame format enumeration can require dozens of ioctl calls and the management of dynamic data structures in the client, our framework allows all enumeration data to be retrieved in two function calls. The first one returns the required buffer size and the second one returns the data in one self-contained block of memory. The complexity on the application's side is minimal and so is the overhead.

Complicated device enumeration Applications should not have to loop through the huge number of device nodes in the system and filter out the devices they can handle. This approach requires the applications to know criteria they should not have to know, like the decision whether a given device node is a video device or not. If these criteria change, all applications have to be updated, which is a big problem if certain programs are no longer maintained. This problem is solved by the device enumeration function of libwebcam.
No stateless device information querying It seems unnecessary to open a device just to retrieve its name and other information an application may want to present to its user. In the same way that listing the contents of a directory with ls does not open each single file, it would be desirable to query the device information at enumeration time. libwebcam does this by maintaining an internal list of camera devices that contains such data. It can be retrieved at any time by any application without opening a V4L2 device.

Missing frame format enumeration As we will see later on, this problem was solved by adding the missing functionality directly to V4L2, with the UVC driver being the first one to support it. To keep the API as uniform and simple as possible for application developers, libwebcam has a wrapper for frame


format enumeration that greatly reduces the complexity associated with retrieving the supported frame formats.

Lack of current documentation While we have not solved the problem of parts of the V4L2 documentation being outdated or incomplete, we did make sure that all libraries that application developers can interact with are thoroughly documented; an extensive API specification is available in HTML format. In addition, this report gives a vast amount of design and implementation background. This is a big advantage for developers who want to use parts of the webcam framework for their own applications.

The next two chapters are devoted to the more technical details of what was presented in this chapter. We will first look at the extensions and changes that were applied to currently existing components before we focus on the newly developed aspects of the webcam framework.


Chapter 6

Enhancing existing components

In order to realize the webcam framework as described in the previous chapter, a few extensions and changes to existing components were necessary. These range from small patches that correct wrong or inflexible behavior to rewrites of bigger software parts. This chapter sums up the most important of these and lists them in the order of their importance.


Linux UVC driver

With UVC devices being at the center of the Linux webcam framework, the UVC driver was the main focus of attention as far as preexisting components are concerned. The following sections describe some important changes and give an outlook on what is about to change in the near future.


Multiple open

From chapter 5 we know that multiple open is a useful feature to work around some of V4L2's limitations. Since the webcam framework relies on the camera driver being able to manage multiple simultaneously opened file handles to a given device, this was one of the most important extensions to the UVC driver. The main challenge when developing a concept for multiple device opening is permissions and priorities. As with ordinary file handles, where the operating system must make sure that readers and writers do not disrupt each other, the video subsystem must make sure that two video device handles cannot influence each other in unwanted ways. Webcam drivers that are unable to multiplex the video stream must make sure that only a single device handle is streaming at a time. While this seems easy enough to do, the problem arises because the concept of "streaming" is not clearly definable. When does streaming start? When does

it stop? There are several steps involved between when an application decides to start the video stream and when it frees the device again:

1. Open the device.
2. Set up the stream format.
3. Start the stream.
4. Stop the stream.
5. Close the device.

Drawing the line at the right place is a trade-off between preventing ill interactions on the one hand and allowing a maximum of parallel access on the other. We decided to place the boundary right before the stream setup. To this end we divided the Video4Linux functions into privileged (or streaming) ioctls and unprivileged ioctls and introduced a state machine for the device handles (figure 6.1).

Figure 6.1: The state machine for the device handles of the Linux UVC driver used to guarantee device consistency for concurrent applications. The rounded rectangles show which ioctls can be carried out in the corresponding state.

There are four different states:

Closed The first unprivileged state. While not technically a state in the software, this state serves as a visualization for all nonexistent handles that are about to spring into existence when they are opened by an application. It is also the state that all handles end up in when the application closes them.

Passive The second unprivileged state. Every handle is created in this state. It stands for the fact that the application has opened the device but has not yet made any steps towards starting the stream. Querying device information or enumerating controls can already happen in this state.

Active The first privileged state. A handle moves from passive to active when it starts setting up the video stream. Four ioctls can be identified in the UVC driver that applications use before they start streaming: TRY_FMT, S_FMT, and S_PARM for stream format setup and REQBUFS for buffer allocation. As soon as an application calls one of these functions, its handle moves into the active state, unless there is already another handle for the same device in a privileged state, in which case an error is returned.

Streaming The second privileged state. Using the STREAMON ioctl lets a handle move from active to streaming. Obviously only one handle can be in this state at a time for any given device because the driver made sure that no two handles could get into the active state in the first place.

The categorization of all ioctls into privileged and unprivileged ones not only yields the state transition events but also decides which ioctls can be used in which states. Table 6.1 contains a list of privileged ioctls. Also note that the only way for an application with a handle in a privileged state to give up its privileges is to close the handle.

ioctl       Description
S_INPUT     Select the current video input (no-op in uvcvideo).
QUERYBUF    Retrieve information about a buffer.
QBUF        Queue a video buffer.
DQBUF       Dequeue a video buffer.
STREAMON    Start streaming.
STREAMOFF   Stop streaming.

Table 6.1: Privileged ioctls in the uvcvideo state machine used for multiple open.

This schema guarantees that different device handles for the same device can perform the tasks required for panel applications and the Linux webcam framework, while ensuring that a panel application cannot stop the stream or change its attributes in a way that could endanger the video application.



UVC extension support

We saw in section 2.4, when we discussed the USB Video Class specification, that extension units are important for device manufacturers to add additional features. For this reason, UVC drivers should have an interface that allows applications to access these extension units. Otherwise, they may not be able to exploit the full range of device capabilities.

Raw extension control support through sysfs The first and obvious way to expose UVC extension controls in a generic way is to give applications raw access. Under Linux, sysfs is an ideal way to realize such an interface. Extensions and their controls are mapped to a hierarchical structure of virtual directories and files that applications can read from and write to. The files are treated like binary files, i.e. what the application writes to the file is sent as is to the device and what the application reads from the file is the same buffer that the driver has received from the device. During this whole process no interpretation of the relayed data is done on the driver's side. Let us look at a simplified example of such a sysfs directory structure:

extensions/
|-- 63610682-5070-49AB-B8CC-B3855E8D221D
|-- 63610682-5070-49AB-B8CC-B3855E8D221E
|-- 63610682-5070-49AB-B8CC-B3855E8D221F
+-- 63610682-5070-49AB-B8CC-B3855E8D2256
    |-- ctrl_1
    +-- ctrl_2
        |-- cur
        |-- def
        |-- info
        |-- len
        |-- max
        |-- min
        |-- name
        +-- res

We can see that the camera supports four different extension units, each of which is identified by a unique ID. The contents of the last one show two controls, and one of the controls has its virtual files visible. All these files correspond directly to the UVC commands of the same name. For example, the read-only files def and len map to GET_DEF and GET_LEN. In the case of the only writable file, cur, there are two corresponding UVC commands: GET_CUR and SET_CUR. Whatever is written to the cur file is wrapped within a SET_CUR command and sent to the device.
In the opposite case, where an application opens cur and reads from it, the driver creates a GET_CUR request, sends it to the device, and turns the device response into the file contents, followed by an

end-of-file marker. If an error occurs during the process, the corresponding read or write call returns an error message. While this approach works well and is supported by our extended UVC driver, there is a limitation associated with it that has to do with the way that ownership and permissions are set on these virtual files. This can lead to security issues on multi-user machines, as section 7.5.1 will show. Another problem with this approach of using raw data is that applications must know exactly what they are doing. This is undesirable in the case of generic applications because the knowledge has to be duplicated in every single one of them. The following section describes a possible way to resolve this issue.

Mapping UVC to V4L2 controls V4L2 applications cannot use the raw sysfs controls unless they include the necessary tools and knowledge. Obviously, it would be easier to just use a library like libwebcam or libwebcampanel that can wrap any sort of controls behind a simple and consistent interface, but there are situations where this may not be an option, for example in the case of applications that are no longer maintained. If such an application has functions to enumerate V4L2 controls and present them in a generic manner, then all it would take to allow the program to use UVC extension controls is a mapping between the two. Designing and implementing a flexible mechanism that can cover most of the cases to be expected in the foreseeable future is an ongoing process for which the groundwork was laid as part of this project. One of the assumptions we made was that there could be a 1:n mapping between UVC and V4L2 controls but not in the opposite direction. The rationale behind this is that V4L2 controls must already be as simple and sensible as possible since the application is in contact with them. For UVC controls, however, it is conceivable that a device would pack multiple related settings into a single control.1
If that is the case, applications should see multiple V4L2 controls without knowing that the driver maps them to one and the same UVC control in the background. Figure 6.2 gives a schema of such a mapping. The next fundamental point was the question where the mapping denitions should come from. The obvious answer is from the driver itself but with the perspective of an increasing release frequency of new UVC devices in mind this cannot be the nal answer. It would mean that new driver versions would have to be released on a very frequent basis only to update the mappings. We therefore came to the conclusion that the driver should hardcode as few control mappings as possible with the majority coming from user space. The decision on how such mappings are going to be fed to the driver has not yet been made. Two solutions seem reasonable: 1. Through sysfs. User space applications could write mapping data to a sysfs le and the driver would generate a mapping from the data. The
[1] As a matter of fact, we shall see such an example in section 7.3.


Figure 6.2: Schema of a UVC control to V4L2 control mapping. The UVC control descriptor contains information about how to locate and access the UVC control. The V4L2 control part has attributes that determine offset and length inside the UVC control as well as the properties of the V4L2 control.

main challenge here would be to find a reasonable format that is both human-readable and easily parsable by the driver. XML would be ideal for the first requirement, but a driver cannot be expected to parse XML. Binary data would be easier for the driver to parse but contradicts the philosophy of sysfs, according to which exchanged data should be human-readable. Whatever the format ends up looking like, setting up a mapping would be as easy as redirecting a configuration file to a sysfs file.

2. Through custom ioctls. For the driver side the same argument as for a binary sysfs file applies, with the difference that ioctls were designed for binary data. The drawback is that a specialized user space application, such as a control daemon, would be necessary to install the mapping data.

For the moment, we restrict ourselves to hardcoded mappings. The future will show which way turns out to be the best to manage the mapping configuration from user space.

Internally, the driver manages a global list of control descriptors with their V4L2 mappings. In addition, a device-dependent list of controls, the so-called control instances, is used to store information about each device's controls, like


the range of valid values. When a control descriptor is added, the driver loops through all devices and adds a control instance only if the device in question supports the new control.

This process required another change to the driver's architecture: the addition of a global device list. Many drivers do not need to maintain an internal list of devices because almost all APIs provide space for a custom pointer in the structures that they pass back into the driver's callback functions. Such a pointer allows for better scaling and less overhead because the driver does not have to walk any data structures to retrieve its internal state. This is indispensable in performance-critical code and helps simplify the code in any case. The Linux UVC driver also uses this technique whenever possible, but for adding and removing control mappings it must fall back to walking the device list. Luckily, this does not cause any performance problems because these are exceptional events that do not occur during streaming.

Once all the data structures are in place, the V4L2 control access functions must be rewritten to use the mappings. Laurent Pinchart is currently working on this as part of his rewrite of the control code, which fixes a number of other small problems.
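The byte-range mapping sketched in Figure 6.2 can be illustrated with a small Python model. The selector value, the dictionary layout, and the control names below are invented for this sketch and do not correspond to the Linux UVC driver's actual C structures; only the idea of cutting a V4L2 control value out of a raw UVC control payload by offset and length is taken from the text.

```python
import struct

# Hypothetical 1:n mapping: one 4-byte UVC control exposed as two
# signed 16-bit V4L2 controls at different offsets.
UVC_CONTROL = {"selector": 0x0E, "size": 4}    # illustrative values

V4L2_MAPPINGS = [
    {"name": "Pan (relative)",  "offset": 0, "length": 2, "signed": True},
    {"name": "Tilt (relative)", "offset": 2, "length": 2, "signed": True},
]

def extract_v4l2_value(payload, mapping):
    """Cut the mapped byte range out of a raw UVC control payload."""
    lo = mapping["offset"]
    hi = lo + mapping["length"]
    return int.from_bytes(payload[lo:hi], "little", signed=mapping["signed"])

def insert_v4l2_value(payload, mapping, value):
    """Write a V4L2 control value back into the shared UVC payload."""
    lo = mapping["offset"]
    payload[lo:lo + mapping["length"]] = value.to_bytes(
        mapping["length"], "little", signed=mapping["signed"])
```

Because both V4L2 controls address disjoint byte ranges of the same payload, writing one of them leaves the other untouched, which is exactly what lets the driver present one UVC control as several independent V4L2 controls.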


V4L2 controls in sysfs

In connection with the topics mentioned above, there is an interesting discussion going on about whether all V4L2 controls could be exposed through sysfs by default and in a generic manner. The idea comes from the pvrusb2 driver[10], which does just that. What originally started out as a debugging tool turned out to be a useful option for scripting the supported devices. Given the broad application scenarios and the generic nature of the feature, it would be optimal if the V4L2 core took care of automatically exposing all device controls to sysfs in addition to the V4L2 controls that are available today.

While currently not more than a point of discussion and an entry on the wish list, it is likely that Video4Linux will eventually receive such a control mapping layer. It would complete the sysfs interface of uvcvideo in a very nice manner and open the doors for entirely new tools. If such an interface became reality, libwebcam could automatically fall back to it whenever the current driver does not support opening the same device multiple times, a restriction that would otherwise prevent libwebcam from using the V4L2 controls it uses now. This switch would be completely transparent to users of libwebcam.



In section 4.5.3 we saw a number of issues that developers of software using the current version of V4L2 have to deal with. While most of them could not be fixed without breaking backwards compatibility, the most severe one, the lack of frame format enumeration as described in section 4.5.3, was relatively easy to overcome.

V4L2 currently provides a way to enumerate a device's supported pixel formats using the VIDIOC_ENUM_FMT ioctl. It does this using the standard V4L2 approach to list enumeration: the application repeatedly calls a given ioctl with an increasing index, starting at zero, and receives the corresponding list entry in return. If there are no entries left, i.e. the index is out of bounds, the driver returns the EINVAL error value. There are two fundamental problems with this approach:

Application complexity. The application cannot know in advance how many entries the list contains. Using a single dynamically allocated memory buffer is therefore out of the question unless the buffer size is chosen much bigger than the average expected size. The only reliable and scalable way is to build up a linked list within the application and add an entry for each ioctl call. This shifts complexity towards the application, something an API should avoid in order to encourage developers to use it in the first place and to discourage possibly unreliable hacks.

Non-atomicity. If the list that the application wants to enumerate does not remain static over time, there is always a chance that it changes while the application is enumerating its contents. If this happens, the received data is inevitably inconsistent, leading to unexpected behavior in the best case or crashes in the worst case. The first workaround that comes to mind is that the driver could return a special error value indicating that the data has changed and that the application should restart the enumeration. Unfortunately, this does not work because the driver has no way of knowing whether an application is currently enumerating at all. Nothing forbids an application from starting with an index other than zero or from quitting the enumeration before the driver has had a chance to return the end-of-list marker.
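The index-based convention, and the loop it forces every application to reimplement, can be sketched as follows. The "driver" below is a toy Python stand-in; only the calling convention (one entry per index, EINVAL past the end) mirrors VIDIOC_ENUM_FMT, and the format list is made up.

```python
import errno

FORMATS = ["YUYV", "MJPEG", "RGB24"]    # illustrative data

def enum_fmt(index):
    """Toy driver: return the entry at the given index or fail with EINVAL."""
    if 0 <= index < len(FORMATS):
        return FORMATS[index]
    raise OSError(errno.EINVAL, "index out of bounds")

def enumerate_formats():
    """The accumulation loop every application has to reimplement itself."""
    entries = []
    index = 0
    while True:
        try:
            entries.append(enum_fmt(index))
        except OSError as err:
            if err.errno == errno.EINVAL:
                return entries      # end-of-list marker reached
            raise
        index += 1
```

Note that nothing in this scheme tells the application whether FORMATS changed between two iterations, which is precisely the non-atomicity problem described above.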
When we decided to add frame size and frame rate enumeration, our first draft would have solved both of these problems at once. The entire list would have been returned in a single buffer, making it easy for the application to parse on the one hand and rendering the mechanism insusceptible to consistency issues on the other. The draft received little positive feedback, however, and we had to settle for a less elegant version, which we present in the remainder of this section. The advantage of the index-based approach is its obvious simplicity on the driver side. It is left up to the reader to decide whether driver simplicity justifies the above problems.

No matter which enumeration approach is chosen, an important point must be kept in mind: the attributes pixel format, frame size, and frame rate are not independent of each other. For any given pixel format there is a list of supported frame sizes, and any given combination of pixel format and frame size determines the supported frame rates. This seems to imply a certain hierarchy of these three attributes, but it is not necessarily clear what this hierarchy


should look like. Technical details, like the UVC descriptor format, suggest the following order:

1. Pixel format
2. Frame size
3. Frame rate

However, for users it may not be obvious why they should even care about the pixel format. A video stream should mainly have a large enough image and a high enough frame rate. The pixel format, and whether compression is used, is just a technicality that the application should deal with in an intelligent and transparent manner. As a result, a user might prefer to choose from a list of frame sizes first and, possibly, a list of frame rates as a function of the selected resolution. In order to keep the V4L2 frame format enumeration API consistent with the other layers, we decided to leave the hierarchy in the order mentioned above. An application can still opt to collect the entire attribute hierarchy and present it to the user in a more suitable order.

Once such a hierarchy has been established, the input and output values of each of the enumeration functions become obvious: the highest level has no dependency on lower levels, and each lower level depends only on the higher levels. This mechanism can theoretically be extended to an arbitrary number of attributes, although in practice there are limits to what can be considered a reasonable number of input values. Table 6.2 summarizes the situation for the three attributes used by webcams.

Enumeration attribute   Input parameters               Output values
Pixel format            none                           Pixel formats
Frame size              pixel format f                 Frame sizes supported for pixel format f
Frame rate              pixel format f, frame size s   Frame rates supported for pixel format f and frame size s

Table 6.2: Input and output values of the frame format enumeration functions.

As it happens, the V4L2 API already provided a function for pixel format enumeration, which means that it could be seamlessly integrated with our design for frame size and frame rate enumeration. These functions are now part of the official V4L2 API, the documentation for which can be found at [5].
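The input/output relationship of Table 6.2 can be demonstrated with a small table-driven model. The capability values below are invented; a real application would obtain them through the VIDIOC_ENUM_FMT, VIDIOC_ENUM_FRAMESIZES, and VIDIOC_ENUM_FRAMEINTERVALS ioctls.

```python
# Frame sizes depend on the pixel format; frame rates depend on the
# (pixel format, frame size) pair -- mirroring the hierarchy of Table 6.2.
CAPS = {
    "YUYV":  {(320, 240): [30], (640, 480): [15, 30]},
    "MJPEG": {(640, 480): [25, 30], (960, 720): [15]},
}

def enum_pixel_formats():
    """Level 1: no input parameters."""
    return sorted(CAPS)

def enum_frame_sizes(fmt):
    """Level 2: depends on the pixel format only."""
    return sorted(CAPS[fmt])

def enum_frame_rates(fmt, size):
    """Level 3: depends on pixel format and frame size."""
    return CAPS[fmt][size]

def collect_hierarchy():
    """Gather the whole attribute tree, as an application would do
    before presenting it to the user in a different order."""
    return {f: {s: enum_frame_rates(f, s) for s in enum_frame_sizes(f)}
            for f in enum_pixel_formats()}
```

Collecting the full tree up front is what allows an application to invert the presentation order, e.g. to offer frame sizes first, as discussed above.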




GStreamer has had V4L2 support in the form of the v4l2src plugin for a while, but it had not received any testing with webcams using the UVC driver. There is a particularity about the UVC driver that causes it not to work with a few applications: the absence of the VIDIOC_G_PARM and VIDIOC_S_PARM ioctls, which do not apply to digital devices. The GStreamer V4L2 source was one of the applications that relied on these functions being present and failed otherwise. After two small patches, however, the first to remove the above dependency and the second to fix a small bug in the frame rate detection code, the v4l2src plugin worked well with UVC webcams and proved to be a good choice as a basis for our project. In September 2006, Edgard Lima, one of the plugin's authors, added proper support for frame rate negotiation using GStreamer capabilities, which allows GStreamer applications to take full advantage of the spectrum of streaming parameters.


Bits and pieces

Especially during the first few weeks, the project involved a lot of testing and bug fixing in various applications. Some of these changes are listed below.

Ekiga
During tests with a prototype camera, a bug in the JPEG decoder of Ekiga became apparent. The JPEG standard allows an encoder to embed a customized Huffman table if it does not want to use the one defined in the standard. The decoder did not process such images properly and failed to display the image as a result. There were also two issues with unsupported ioctls and the frame rate computation, very similar to those in GStreamer's v4l2src.

Spca5xx
The Spca5xx driver already supports a large number of webcams, as we saw in section 4.3.2, and its author relies to a large part on user feedback to maintain the compatibility list. We also did some tests at Logitech with a number of our older cameras and found a few that were not recognized by the driver but would still work after patching its USB PID list.

luvcview
The luvcview tool had a problem with empty frames that can occur with certain cameras and that would make the application crash. This was fixed as part of a patch that added two different modes for capturing raw frames. One mode writes each received frame into a separate file (raw frame capturing); the other creates a single file in which it stores the complete video stream (raw frame stream capturing). The first mode can be used to easily


capture frames from the camera, although, depending on the pixel format, the data may require some post-processing, e.g. adding an image header.


Chapter 7

New components
Chapter 5 gave an overview of our webcam framework and described its goals without going into much technical detail. This chapter is dedicated to elaborating how some of these goals were achieved and implemented. It also explains the design decisions and why we chose certain solutions over others. At the same time, we show the limitations of the current solution and their implications for future extensibility. Another topic of this chapter is the licensing model of the framework, a crucial topic for any open source project. We also give an outlook on future work and opportunities.



The goals of the Webcam library, or simply libwebcam, were briefly covered in section 5.4.8. The API is described in great detail in the documentation that comes with the sources. The functions can be grouped into the following categories:

- Initialization and cleanup
- Opening and closing devices
- Device enumeration and information retrieval
- Frame format enumeration
- Control enumeration and usage
- Event enumeration and registration

The general usage is rather simple. Each application must initialize the library before it is first used. This allows the library to properly set up its internal data structures. The client can then continue by either enumerating devices or, if it already knows which device it wants to open, directly opening it. If a device was successfully opened, the library returns a handle

that the application has to use for all subsequent requests. This handle is then used for tasks such as enumerating frame formats or controls and reading or writing control values. Once the application is done, it should close the device handle and uninitialize the library to properly free any resources the library may have allocated. Let us now look at a few implementation details that application developers using libwebcam should know about.


Enumeration functions

All enumeration functions use an approach that makes it very easy for applications to retrieve the contents of the list in question. Enumeration usually takes exactly two calls: the first to determine the required buffer size, the second to fill the buffer. In the rare case where the list changes between the two calls, a third call can be necessary, but with the current implementation this situation can only arise for devices. The following pseudo code illustrates the usage of this enumeration scheme from the point of view of the application.

    buffer := NULL
    buffer_size := 0
    required_size := c_enum(buffer : NULL, size : buffer_size)
    while (required_size > buffer_size)
        buffer_size := required_size
        buffer := allocate_memory(size : buffer_size)
        required_size := c_enum(buffer : buffer, size : buffer_size)

Obviously, the syntax of the actual API looks slightly different, and applications must do proper memory management and error handling. Another aspect that makes this type of enumeration very easy for its users is that the buffer is completely self-contained. Even though the buffer can contain variable-sized data, it can be treated as an array through which the application can loop. Figure 7.1 illustrates the memory layout of such a buffer.
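The two-call pattern from the pseudo code above can be made runnable with a toy c_enum standing in for the real libwebcam function. The device names and the NUL-separated wire format are invented for this sketch; only the size-negotiation loop reflects the actual convention.

```python
DEVICES = ["video0", "video1"]    # illustrative device list

def c_enum(buffer, size):
    """Toy enumerator: report the required size and, if the supplied
    buffer is large enough, fill it with NUL-separated names."""
    data = "\0".join(DEVICES).encode()
    if buffer is not None and size >= len(data):
        buffer[:len(data)] = data
    return len(data)

def enumerate_devices():
    buffer = None
    buffer_size = 0
    required = c_enum(buffer, buffer_size)
    while required > buffer_size:        # loops a third time if the list grew
        buffer_size = required
        buffer = bytearray(buffer_size)
        required = c_enum(buffer, buffer_size)
    return bytes(buffer[:required]).decode().split("\0")
```

The while loop is what absorbs a list that changes between the two calls: if the second call reports a larger size than the first, the application simply reallocates and tries again.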



The entire library is programmed in a way that makes it safe to use from multi-threaded applications. All internal data structures are protected against simultaneous changes from different threads that could otherwise lead to inconsistent data or program errors. Since most GUI applications are multi-threaded, this spares application developers from taking additional steps to prevent multiple simultaneous calls to libwebcam functions.


Figure 7.1: Illustration of the memory block returned by a libwebcam enumeration function. The buffer contains three list items and a number of variable-sized items (strings in the example). Each list item has four words of fixed-sized data and two char pointers. The second item shows pointers to two strings in the variable-sized data area at the end of the buffer. Pointers can also be NULL, in which case there is no space reserved for them to point to. Note that only the pointers belonging to the second item are illustrated.
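The self-contained layout of Figure 7.1 can be modeled in Python. The C API stores absolute pointers into the buffer; the sketch below stores byte offsets instead, with -1 playing the role of a NULL pointer. The record layout (four data words plus two string references) follows the figure, but the encoding details are invented.

```python
import struct

RECORD = struct.Struct("<4i2i")    # four data words + two string "pointers"

def pack_items(items):
    """Build one buffer: fixed-size records first, strings at the end."""
    strings = bytearray()
    records = []
    base = RECORD.size * len(items)          # strings start after the records
    for words, texts in items:
        offsets = []
        for text in texts:
            if text is None:
                offsets.append(-1)           # NULL: no space reserved
            else:
                offsets.append(base + len(strings))
                strings += text.encode() + b"\0"
        records.append(RECORD.pack(*words, *offsets))
    return b"".join(records) + bytes(strings)

def unpack_items(buf, count):
    """Loop over the records as over an array, resolving string offsets."""
    def string_at(offset):
        if offset < 0:
            return None
        return buf[offset:buf.index(b"\0", offset)].decode()
    items = []
    for i in range(count):
        *words, off1, off2 = RECORD.unpack_from(buf, i * RECORD.size)
        items.append((tuple(words), [string_at(off1), string_at(off2)]))
    return items
```

Because every reference points back into the same block, the whole result can be handed to the application (or freed) as a single allocation, which is what makes the buffer self-contained.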



liblumvp and lvfilter

The Logitech user mode video processing library is in some ways similar to libwebcam. It also provides controls, as we have seen in section 5.4.11, and its interface is very similar when it comes to library initialization/cleanup or control enumeration. The function categories are:

- Initialization and cleanup
- Opening and closing devices
- Video stream initialization
- Feature enumeration and management
- Feature control enumeration and usage
- Video processing

In our webcam framework, liblumvp is not directly used by the application. Instead, its two clients are lvfilter, the video interception filter that delivers video data, and libwebcampanel, from which it receives commands directed at the features it provides. Nothing, however, prevents an application from using liblumvp directly, apart from the fact that this would make the application directly dependent on a library that was designed to act transparently in the background.

lvfilter hooks into the GStreamer video pipeline, where it influences the stream capability negotiation in a way that makes sure the format is understood by liblumvp. It then initializes the latter with the negotiated stream parameters and waits for the pipeline state to change. When the stream starts, it redirects all video frames through liblumvp, where they can be processed and possibly modified, before outputting them to the remaining elements in the pipeline. While lvfilter takes care of the proper initialization of liblumvp, it does not use the feature controls that liblumvp provides. Interaction with these happens through libwebcampanel, as we will see shortly.

We have mentioned that applications must explicitly make use of liblumvp by including lvfilter in their GStreamer pipeline. This has positive and negative sides. The list of drawbacks is led by the fact that it does not smoothly integrate into existing applications and that each application must test for the existence of lvfilter if it wants to use the extra features.
It is this very fact, however, that can also be seen as an opportunity. Some users do not like components that work transparently, either because they could have negative interactions that would make problems hard to debug or because they do not trust closed source libraries. Before we move on to the next topic, a few words about the two plugins that are currently available:


Mirror
The first, very simple plugin is available for any camera and lets the user mirror the image vertically and horizontally. While the first can be used to turn a webcam into an actual mirror, the second can be useful for laptops with cameras built into the top of the screen, because these are usually rotatable by 180 degrees along the upper edge and allow switching between targeting the user and what is in front of the user.

Face tracking
This module corresponds closely to what users of the QuickCam software on Windows know as the "Track two or more of us" mode of the face tracking feature. The algorithm detects people's faces and zooms in on them, so that they are better visible when the user moves away from the camera. If the camera supports mechanical pan and tilt, like the Logitech QuickCam Orbit, it does so by moving the lens head in the right direction. For other cameras the same is done digitally. This feature is only available for Logitech cameras that are UVC compatible.

In the future, more features from the Logitech QuickCam software will be made available to Linux users through similar plugins.
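The core operation of the mirror plugin can be sketched on a tiny raster image represented as rows of pixel values. The real plugin operates on packed video frames inside liblumvp; this only shows the principle.

```python
def flip_horizontal(frame):
    """Mirror around the vertical axis: reverse each row."""
    return [row[::-1] for row in frame]

def flip_vertical(frame):
    """Mirror around the horizontal axis: reverse the row order."""
    return frame[::-1]
```

Both operations are involutions, so applying the same flip twice restores the original frame, which is why the plugin can simply re-run the flip on every frame while the corresponding feature control is enabled.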



The interface of the Webcam panel library is very similar to the one provided by libwebcam. This was a design decision intended to make it easy for applications that started out using libwebcam to switch to libwebcampanel when they want more functionality.


Meta information

Section 5.4.9 gave a high-level overview of what sort of information filtering libwebcampanel adds on top of what libwebcam provides. Let us look at these in more detail.

Devices
Camera name change: The camera name string in libwebcam comes from the V4L2 driver and is usually generic. In the case of the UVC driver it is always "USB Video Class device", which is not very helpful for a user who has three different UVC cameras connected. For this reason, libwebcampanel has a built-in database of device names that it looks up using the USB vendor and product IDs. This gives the application more descriptive names like "Logitech QuickCam Fusion". If the library recognizes only the vendor but not the device ID (0x1234 in this example), it is still able to provide a somewhat useful string like "Unknown Logitech camera (0x1234)".

Controls

Control attribute modification: These modifications range from simple name changes to more complex ones, like modification of the value ranges, or completely changing the type of a control. Controls can also be made read-only or write-only.

Control deletion: A control can be hidden from the application. This can be useful in cases where a driver wrongly reports a generic control that is not supported by the hardware. The library can filter those out, stopping them from appearing in the application and confusing users.

Control splitting: A single control can be split into multiple controls. As a fictitious example, a 3D motion control could be split up into three different motion controls, one for each axis.

While the first point is pretty self-explanatory, the others deserve a few real-life examples.

Example 1: Control attribute modification
The UVC standard defines a control called Auto-exposure mode. It determines which parameters the camera changes to adapt to different lighting conditions. This control is an 8-bit wide bitmask with only four of the eight bits actually being used. The bits are mutually exclusive, leaving 1, 2, 4, and 8 as the set of legal values. However, due to the limited control description capabilities of UVC, the control is usually exported as an integer control with valid values ranging from 1 to 255. If an application uses a generic algorithm to display such a control, it might present the user with a slider or range control that can take all possible values between 1 and 255. Unfortunately, most values will have no effect because they do not represent a valid bitmask. libwebcampanel comes with enough information to avoid this situation by turning the auto-exposure mode control into a selection control that allows only four different settings, the ones defined in the UVC standard.
Now the user will see a list box, or whatever the application developer decided to use to represent a selection control, with each entry having a distinct and clear meaning and no chance for the user to accidentally select an invalid value, a major gain in usability.

Example 2: Control splitting
The Logitech QuickCam Orbit series has mechanical pan and tilt capabilities driven by two little motors. Both motors can be moved separately by a given angle. The control through which these capabilities are exposed, however, combines both values, i.e. relative pan angle and relative tilt angle, in a single 4-byte control containing a signed 2-byte integer for each. For an application, such a control is virtually unusable without the knowledge of how the control values have to be interpreted. libwebcampanel solves this problem very elegantly by splitting up the control into two separate controls: relative pan angle and relative tilt angle. It also


marks both controls as write-only, because it makes no sense to read a relative angle, and as action controls, meaning that changing the control causes a one-time action to be performed. The application can use this information, as in the example of LVGstCap, to present the user with a slider that can be dragged to either side and jumps back to the neutral position when let go of.

Obviously, most of this information is device-specific and needs to be kept up to date whenever new devices become available. It can therefore be expected that new minor versions of the library will appear rather frequently, including only minor changes. An alternative approach would be to move all device-specific information outside the library, e.g. into XML configuration files. While this would make it easier to keep the information current, it would also make it harder to describe device-specific behavior. The future will show which of these approaches is more suitable.
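The attribute modification from Example 1 can be sketched as a pair of translation functions. The mapping from bitmask values to a selection control is the mechanism described above; the menu labels themselves are illustrative and not taken from the library.

```python
# Only four of the 255 raw integer values are legal bitmask patterns.
AE_MODES = {
    1: "Manual mode",
    2: "Auto mode",
    4: "Shutter priority mode",
    8: "Aperture priority mode",
}

def menu_entries():
    """What a generic application renders as a list box."""
    return [AE_MODES[v] for v in sorted(AE_MODES)]

def menu_index_to_raw(index):
    """Translate the selected entry back to the raw bitmask value."""
    return sorted(AE_MODES)[index]

def raw_to_menu_index(raw):
    """Reject the 251 integer values that are not valid bitmasks."""
    if raw not in AE_MODES:
        raise ValueError("invalid auto-exposure bitmask: %d" % raw)
    return sorted(AE_MODES).index(raw)
```

With this translation in place, the application can only ever submit one of the four legal values, which is exactly the usability gain described in Example 1.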


Feature controls

Feature controls directly influence what goes on inside liblumvp. They can enable or disable certain features or change the way video effects operate. They differ from ordinary controls in a few ways and require a few special provisions, as we shall see now.

Controls vs. feature controls
We have previously mentioned that controls and feature controls are handled in an almost symmetrical manner. The small but important difference between the two is that ordinary controls are device-related whereas feature controls are stream-related. What this means is that the list of device controls can be queried before the application takes any steps to start the video stream; the driver, and therefore V4L2, knows about them from the very start. At that time, the GStreamer pipeline may not even be built and lvfilter and liblumvp not loaded. So in practice, a video application will probably query the camera controls right after device connection but the feature controls only when the video is about to be displayed. This timing difference would make it considerably more complicated for applications to manage a combined control list in a user-friendly manner. As a nice side effect, it becomes easy for an application to selectively support only one set of controls or to clearly separate the two sets.

Communication between client and library
There is another very important point that was left unmentioned until now and that only arises in the case of a panel application. The video stream, and therefore liblumvp, and the panel application that uses libwebcampanel run in


two different processes. This means that the application would try in vain to change feature controls: liblumvp could well be loaded into the application's address space, but it would be a second and completely independent instance. To avoid this problem, the two libraries must be able to communicate across process borders, a clear case for inter-process communication. Both libwebcampanel and liblumvp have a socket implementation over which they can transfer all requests related to feature controls. The semantics are completely identical; only the medium differs. Whenever a client opens a device using liblumvp (in our case this is done by lvfilter), the library creates a socket server thread that waits for such requests. libwebcampanel, on the other side, has a socket client that it uses to send requests to liblumvp whenever one of the feature control functions is used.

There is a possible optimization here, namely the use of the C interface instead of the IPC interface whenever both libraries run in the same process. However, the IPC implementation does not cause any noticeable delays since the amount of transmitted data remains in the order of kilobytes. We opted for the simpler solution of using the same interface in both cases, although the C version is still available and ready to use if circumstances make it seem preferable.
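A minimal sketch of this client/server split might look as follows. The wire format, a 4-byte control ID followed by a 4-byte value, is invented for the sketch; the real protocol is internal to the two libraries, and a connected socket pair stands in for whatever socket type they actually use.

```python
import socket
import struct
import threading

REQUEST = struct.Struct("<II")    # hypothetical: control ID + value

def server_loop(conn, features):
    """What liblumvp's server thread might do for a single request."""
    data = conn.recv(REQUEST.size)
    control_id, value = REQUEST.unpack(data)
    features[control_id] = value          # apply the feature-control write
    conn.sendall(b"OK")                   # acknowledge it

def set_feature_control(conn, control_id, value):
    """What a libwebcampanel feature-control write might do internally."""
    conn.sendall(REQUEST.pack(control_id, value))
    return conn.recv(16)
```

Because the request/reply semantics live entirely above the transport, the same handler could serve an in-process C call just as well, which is the optimization mentioned above.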


Build system

The build system of the webcam framework is based on the Autotools suite, the traditional choice for most Linux software. The project is mostly self-contained, with the exception of liblumvp, which has some dependencies on convenience libraries[1] outside the build tree. These convenience libraries contain some of the functionality that liblumvp plugins rely on and were ported from the corresponding Windows libraries. The directory structure of the open source part looks as follows:

    /
    +--lib
    |  +--libwebcam
    |  +--libwebcampanel
    |  +--gstlvfilter
    |  +--src
    +--lvgstcap

The top-level Makefile generated by Autotools compiles all the components, although each component can also be built and installed on its own. Generic build instructions are included in the source archive.
[1] Convenience libraries group a number of partially linked object files together. While they are not suitable for use as-is, they can be compiled into other projects in a similar way to ordinary object files.




Each solution has its trade-offs and limitations, and it is important to be aware of them. Some of them have technical reasons; others are the result of time constraints or are beyond the project's scope. This section is dedicated to making developers and users of the Linux webcam framework aware of these limitations. At the same time, it gives pointers for future work, which is the topic of the next section.


UVC driver

Even though the Linux UVC driver is stable and supports all the basic UVC features needed for video streaming and managing video controls, it is still work in progress, and much remains to be done before it implements the entire UVC standard. At the moment, however, a complete UVC driver cannot be more than a long-term goal. For one thing, the UVC standard describes many features and technologies for which no devices exist today; for another, not even Windows ships with such a driver. What is important, and this is a short-term goal that will be achieved soon, is that the driver supports the features that today's devices use. Luckily, the list of tasks to get there is now down to a relatively small number of items. A few of these are discussed below.

Support for status interrupts
The UVC standard defines a status interrupt endpoint that devices must implement if they want to take advantage of certain special features. These are:

- Hardware triggers (e.g. buttons on the camera device for functions such as still image capturing)
- Asynchronous controls (e.g. motor controls whose execution can take a considerable amount of time and upon whose completion the driver should be notified)
- AutoUpdate controls (controls whose values can change without an external set request, e.g. sensor-based controls)

When such an event occurs, the device sends a corresponding interrupt packet to the host, and the UVC driver can take the necessary action, for example updating its internal state or passing the notification on to user space applications. Currently, the Linux UVC driver has no support for status interrupts and consequently ignores these packets. While this has no influence on the video stream itself, it prevents applications from receiving device button events or from being notified when a motor control command has finished.
The latter can be quite useful for applications because they may want to prevent the user from sending further motion commands while the device is still moving.

In the context of mechanical pan/tilt, there are two other issues caused by the lack of such a notification:

1. Motion tracking. When a motion tracking algorithm, like the one used for multiple face tracking in liblumvp, issues a pan or tilt command to the camera, it must temporarily stop processing the video frames for the duration of the movement. Otherwise, the entire scene would be interpreted as being in motion due to the viewport translation that occurs. After the motion has completed, the algorithm must resynchronize. If the algorithm has no way of knowing the exact completion time, it must resort to approximations and guesswork, thereby decreasing its performance. This is what liblumvp does at the moment.

2. Keeping track of the current angle. If the hardware itself does not provide the driver with information about the current pan and tilt angles, the driver or a user space library can approximate them by keeping track of the relative motion commands it sends to the device. For this purpose, it needs to know whether a given command has succeeded and, if so, at what point in time, in order to avoid overlapping requests.

One of the reasons why the UVC driver does not currently process status interrupts is that the V4L2 API itself has no event notification support. As we saw in section 4.5.1, such a scheme is not easy to implement due to the lack of callback techniques at the disposition of kernel space components. The sysfs interface that is about to be included in the UVC driver is a first step in the direction of adding a notification scheme. Since kernel 2.6.17 it is possible to make sysfs attributes pollable (see [2] for an overview of the interface). This polling process does not impose any CPU load on the system because it is implemented with the help of the poll system call: the polling process sleeps and wakes up as soon as one of the monitored attributes changes.
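The sleeping wait behind pollable sysfs attributes can be demonstrated with a short sketch. It is shown on a pipe so that it runs anywhere; a real watcher would open the sysfs attribute itself and poll for POLLPRI | POLLERR, then re-read the file after each wakeup.

```python
import os
import select

def wait_for_change(fd, timeout_ms):
    """Sleep until fd signals activity; no CPU time is spent waiting."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)    # a sysfs watcher would use POLLPRI
    return bool(poller.poll(timeout_ms))
```

The process blocks inside poll until the kernel wakes it, which is why monitoring an attribute this way imposes no load on the system.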
For the application this incurs some extra complexity, notably the necessity of multi-threading. This is clearly a task for a library like libwebcam. The polling functionality only needs to be written once, and at the same time the notifications can be sent using a more application-friendly mechanism like callbacks. libwebcam already has an interface designed for this exact purpose. As soon as the driver is up to the task, applications will be able to register callback functions for individual events, some of them coming from the hardware, others being synthesized by the library itself.

Sysfs permissions

Another problem that still awaits resolution is finding a method to avoid giving all users arbitrary access to the controls exported to the sysfs virtual file system. Since sysfs attributes have fixed root:root ownership when the UVC driver creates them, this does not leave it much choice when it comes to defining

permissions. Modes 0660 and 0664, on the one hand, would only give the superuser write access to the sysfs attributes, and therefore to the UVC extension controls. Mode 0666, on the other hand, would permit every user to change the behavior of the attached video devices, leading to a rather undesirable situation: a guest user who happens to be logged in via SSH on a machine on which a video conference is in progress could change settings such as brightness, or even cause the camera to tilt, despite not having access to the video stream or the V4L2 interface itself.

For device nodes this problem is usually resolved by changing the group ownership to something like root:video and giving the node 0660 permissions. This still does not give fine-grained permissions to individual users, but at least a user has to be a member of the video group to be able to access the camera. A good solution would be to duplicate the ownership and permissions of the device node and apply them to the sysfs nodes. This would make sure that whoever has access to the V4L2 video device also has access to the device's UVC extensions and controls. Currently, however, such a solution does not seem feasible due to the hard-coded attribute ownership. Another approach to the problem would be to let user space handle the permissions. Even though sysfs attributes have their UID and GID set to 0 on creation, they do preserve new values when set from user space, e.g. using chmod. A user space application running with elevated privileges could therefore take care of this task.

Ongoing development

The ongoing development of the UVC driver is of course not a limitation in itself. The limitation merely stems from the fact that not all of the proposed changes have made their way into the main driver branch yet. As of the time of this writing, the author is rewriting parts of the driver to be more modular and to better adapt them to future needs.
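The user space approach could look like the following sketch, run by a privileged helper (for example from a hotplug or udev hook). This is only an illustration of the idea described above; the function name is invented and the actual sysfs attribute paths depend on the driver version.

```shell
#!/bin/sh
# Hypothetical sketch: copy the group ownership and access mode of a
# V4L2 device node onto a UVC sysfs control attribute, so that whoever
# may use the video device may also use its extension controls.
sync_sysfs_perms() {
    devnode="$1"    # e.g. /dev/video0
    attr="$2"       # sysfs attribute path (driver-dependent)
    group=$(stat -c %G "$devnode")   # group of the device node
    mode=$(stat -c %a "$devnode")    # access mode, e.g. 660
    chgrp "$group" "$attr" && chmod "$mode" "$attr"
}
```

A hook of this kind would have to run again whenever a camera is plugged in, since the attributes are recreated with root:root ownership each time.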
At the same time, he is integrating the extensions presented in 6.1 piece by piece. The latest SVN version of the UVC driver does not yet contain the sysfs interface, but it will be added as soon as the completely rewritten control management is finished. Therefore, for the time being, users who want to try out the webcam framework in its entirety, in particular the functions that require raw access to the extension units, need to use the version distributed as part of the framework. Another aspect of the current rewrite is the consolidation of some internal structures, notably the combination of the uvc_terminal and uvc_unit structs. This will simplify large parts of the control code because both entity types can contain controls. The version distributed with the framework does not properly support controls on the camera terminal. This only affects the controls related to exposure parameters and will automatically be fixed during the merge back.


Still image support

The UVC specification includes features to retrieve still images from the camera. Still images are treated differently from streaming video in that they do not have to be real-time, which gives the camera time to apply image quality enhancing algorithms and techniques. At the moment, the Linux UVC driver does not support this method at all. This is hardly a limitation because current applications are simply not prepared for such a special mode. All single frame capture applications that currently exist open a video stream and then process single frames, something that obviously works perfectly fine with the UVC driver. In the future one could, however, think of some interesting features, like the ability to read still images directly from /dev/videoX after setting a few parameters in sysfs. This would allow frame capturing with simple command line tools or amazingly simple scripts. Imagine the following, for example:

dd if=/dev/video0 of=capture.jpg

It would be fairly simple to extend the driver to support such a feature, but the priorities are clearly elsewhere at the moment.


Linux webcam framework

Missing event support

The fact that libwebcam currently lacks support for events, despite the fact that the interface is there, was already mentioned above. To give the reader an idea of what the future holds, let us look at the list of events that libwebcam and libwebcampanel could support:

- Device discovered/unplugged
- Control value changed automatically (e.g. for UVC AutoUpdate controls)
- Control value changed by client (to synchronize multiple clients of libwebcam)
- Control value change completed (for asynchronous controls)
- Other, driver-specific events
- Feature control value changed (libwebcampanel only)
- Events specific to liblumvp feature plugins (libwebcampanel only)

Again, the events supported by libwebcampanel will be a superset of those known to libwebcam, in a manner analogous to controls.


Single stream per device

The entire framework is laid out to work with only a single video stream at a time. This means that it is impossible to multiplex the stream, for example with the help of the GStreamer tee element, and control the feature plugins separately for the two substreams. This design decision was made for simple practicality; the additional work required would hardly justify the benefits. For most conceivable applications this is not a limitation, though. There are no applications today that provide multiple video windows per camera at the same time, and the possible use cases seem restricted to development and debugging purposes. There is another reason why it is unlikely that such applications will appear in the near future: the XVideo extension used on Linux to accelerate video rendering can only be used by one stream at a time, so any additional streams would have to be rendered using unaccelerated methods. In GStreamer terms this means that the slower ximagesink would have to be used instead of xvimagesink, which is the default in LVGstCap.
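To make the XVideo constraint concrete, a pipeline that duplicates a camera stream with tee might look like the following sketch (gst-launch 0.10 syntax; the device path is an assumption). Only one sink can obtain the XVideo port, so the second branch falls back to the unaccelerated ximagesink.

```shell
# Hypothetical sketch of a multiplexed pipeline; it bypasses the
# framework's feature plugins entirely and is shown for illustration.
PIPELINE="v4l2src device=/dev/video0 ! tee name=split \
    split. ! queue ! xvimagesink \
    split. ! queue ! ximagesink"

# To run it (requires GStreamer and an attached camera):
# gst-launch $PIPELINE
```

The queue elements decouple the two branches so that the slower, unaccelerated sink does not stall the accelerated one.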



Providing an outlook on the further development of the Linux webcam framework is not easy at this moment, given that it has not been published yet and has therefore received very little feedback. There are, however, a few signs that there is quite some demand out there for Linux webcam software as well as related information. For one thing, requests and responses that come up on the Linux UVC mailing list clearly show that the current software has deficits. A classic example is the fact that there are still many programs out there that do not support V4L2 but are still based on the deprecated V4L1 interface. Even V4L2 applications still use API calls that are not suitable for digital devices, clearly showing their origins in the world of TV cards. For another thing, the demand for detailed and reliable information is quite large. Linux users who want to use webcams have a number of information-related problems to overcome. Typical questions that arise are:

- What camera should I buy so that it works on Linux?
- I have camera X. Does it work on Linux? Which driver do I need? Where do I download it?
- How do I compile and install the driver? How can I verify its proper functioning?
- What applications are there? What can they do?
- What camera features are supported? What would it take to fix this?

None of these questions is easy to answer. Even though the information is present somewhere on the web, it is usually not easy to find because there

is no single point to start from. Many sites are incomplete and/or feature outdated information, making the search even harder. Providing software is thus not the only task on the to-do list of Linux webcam developers. More and better information is required, something that Logitech is taking the initiative in. Together with the webcam framework, Logitech will publish a website that is designated to become such an information portal. At the end of this chapter we will give more details about that project. In terms of software, the Linux webcam framework certainly has the potential to spur the development of great new webcam applications as well as giving new and improved tools to preexisting ones. Our hope is, on the one hand, that the broader use of the framework will bring forth further needs that can be satisfied by future versions and, on the other hand, that the project will give impulses for improving the existing components. The Linux UVC driver is one such component that is rapidly improving. As we have seen during the discussion of limitations above, new versions will create the need for libwebcam extensions. But libwebcam is not the only component that will see further improvements. Logitech will add more feature plugins to liblumvp as the framework gains momentum, the most prominent one being an algorithm for face tracking. Compared to the current motion tracker algorithm, it performs much better when there is only a single person visible in the picture.



The licensing of open source software is a complex topic, especially when combined with closed source components. There are literally hundreds of different open source licenses out there and many projects choose to use their own, adapted license, further complicating the situation.



One key point that poses constraints on the licensing of a project is the set of licenses used by the underlying components. In our case, the situation is quite easy. The only closed source component of our framework, liblumvp, uses GStreamer, which is in turn developed under the LGPL. The LGPL is considered one of the most appropriate licenses for libraries because it allows both open and closed source components to link against it. Such a licensing scheme considerably increases the number of potential users because developers of closed source applications do not need to reinvent the wheel, but can instead rely on libraries proven to be stable. For this reason libwebcam and libwebcampanel are also released under the LGPL, enabling any application to link against them and use their features. The same reasoning applies to the lvfilter GStreamer plugin. The only closed source component of the webcam framework is the liblumvp library. Some of the feature plugins contain code that Logitech has

licensed from third parties under conditions that disallow their distribution in source code form. While liblumvp is free of charge, it is covered by an end-user license agreement very similar to the one that is used for Logitech's Windows applications. There is one question that keeps coming up in Internet forums when closed source components are discussed: "Why doesn't the company want to publish the source code?" The answer is usually not that companies do not want to, but that they cannot for legal reasons. Hardware manufacturers often buy software modules from specialized companies, and these licenses do not allow the source to be made public.



All non-library code, in particular LVGstCap and lvcmdpanel, is licensed under version 2 of the GNU GPL. This allows anybody to make changes to the code and publish new versions as long as the modified source code is also made available. Table 7.1 gives an overview of the licenses used for the different components of this project. The complete text of the GPL and LGPL licenses can be found in [7] and [8].

Component        License
libwebcam        LGPL
libwebcampanel   LGPL
lvfilter         LGPL
liblumvp         Closed source
LVGstCap         GPL
lvcmdpanel       GPL
Samples          Public domain

Table 7.1: Overview of the licenses used for the Linux webcam framework components.



Making the webcam framework public and getting people to use it, test it, and provide feedback will be an important task of the upcoming months. Logitech is currently setting up a web server that is expected to go online in the last quarter of 2006 and will contain the following:

- List of drivers: Overview of the different webcam drivers available for Logitech cameras.

- Compatibility information: Which devices work with which drivers?
- FAQ: Answers to questions that frequently come up in the context of webcams.
- Downloads: All components of the Linux webcam framework (incl. sources, except for liblumvp).
- Forum: A place for users to discuss problems with each other and ask questions of Logitech developers.

The address will be announced through the appropriate channels, for example on the mailing list of the Linux UVC driver.


Chapter 8

The new webcam infrastructure at work

After the technical details it is now time to see the webcam framework in action, or at least static snapshots of this action. The user only has direct contact with the video capture application LVGstCap and the panel application lvcmdpanel. The work of the remaining components is, however, still visible, especially in the case of lvcmdpanel, whose interface is very close to that of libwebcampanel.



Figure 8.1 shows a screenshot of LVGstCap with its separation into video and control areas. The video window on the left displays the current picture streaming from the webcam, while the right-hand side contains both camera and feature controls in separate tabs. The Camera tab allows the user to change settings directly related to the image and the camera itself. All control elements are dynamically generated from the information that libwebcampanel provides. The Features tab gives control over the plugins that liblumvp contains. Currently it allows flipping the image about the horizontal and vertical axes and enabling or disabling the face tracker.



The following console transcript shows an example of how lvcmdpanel can be used.
$ lvcmdpanel -l
Listing available devices:


Figure 8.1: A screenshot of LVGstCap with the format choice menu open.

video0    Unknown Logitech camera (0x08cc)
video1    Logitech QuickCam Fusion

There are two devices in the system; one was recognized, the other one was detected as an unknown Logitech device and its USB PID is displayed instead.
$ lvcmdpanel -d video1 -c
Listing available controls for device video1:
  Power Line Frequency
  Backlight Compensation
  Gamma
  Contrast
  Brightness
$ lvcmdpanel -d video1 -cv
Listing available controls for device video1:
  Power Line Frequency
    ID      : 13,
    Type    : Choice,
    Flags   : { CAN_READ, CAN_WRITE, IS_CUSTOM },
    Values  : { Disabled[0], 50 Hz[1], 60 Hz[2] },
    Default : 2
  Backlight Compensation
    ID      : 12,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE, IS_CUSTOM },
    Values  : [ 0 .. 2, step size: 1 ],
    Default : 1


  Gamma
    ID      : 6,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE },
    Values  : [ 100 .. 220, step size: 120 ],
    Default : 220
  Contrast
    ID      : 2,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE },
    Values  : [ 0 .. 255, step size: 1 ],
    Default : 32
  Brightness
    ID      : 1,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE },
    Values  : [ 0 .. 255, step size: 1 ],
    Default : 127

The -c command line switch outputs a list of controls supported by the specified video device, in this case the second one. For the second list the verbose switch was enabled, which yields detailed information about the type of control, the accepted and default values, etc. (Note that the output was slightly shortened by leaving out a number of less interesting controls.) The final part of the transcript is easiest to follow by first starting an instance of luvcview in the background. The commands below change the brightness of the image while luvcview, or any other video application, is running.
$ lvcmdpanel -d video1 -g brightness
127
$ lvcmdpanel -d video1 -s brightness 255
$ lvcmdpanel -d video1 -g brightness
255

The current brightness value is 127, as printed by the first command. The second command changes the brightness value to the maximum of 255, and the third one shows that the value was in fact changed. The last example shows how simple it is to create scripts to automate tasks with the help of panel applications. Even writing an actual panel application is very straightforward; lvcmdpanel consists of less than 400 lines of code and already covers the basic functionality.
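Such an automation script might look like the following sketch. It uses the option names from the transcript above; the fade_brightness function and the PANEL override are assumptions for illustration (setting PANEL=echo gives a dry run without a camera attached).

```shell
#!/bin/sh
# Sketch: fade the brightness of a camera from dark to bright using a
# panel application such as lvcmdpanel (-d selects the device, -s sets
# a control value, as in the transcript above).
fade_brightness() {
    device="$1"
    panel="${PANEL:-lvcmdpanel}"    # set PANEL=echo for a dry run
    for value in 0 51 102 153 204 255; do
        "$panel" -d "$device" -s brightness "$value"
    done
}

# Example: fade_brightness video1
```

Because the V4L2 device stays open in the video application while the panel application adjusts the controls, the effect is visible live in the stream.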


Chapter 9

Jumping into work in the open source community with the support of a company at one's back is a truly gratifying job. The expression "it's the little things that count" immediately comes to mind, and the positive reactions one receives, even for small favors, are a great motivation along the way. Having been on the user side of hardware and software products for many years myself, I know how helpful the little insider tips can be. Until recently most companies were unaware of the fact that small pieces of information that seem obvious inside a product team can have a much higher value when carried outside. The success of modern media like Internet forums with employee participation and corporate blogs is a clear sign of this. Open source is in some ways similar to these media. Simple information that is given out comes back in the form of improved product support, drivers written from scratch, and, last but not least, reputation. The Logitech video team has had such a relationship with the open source community for a while, although in a rather low-profile manner, leading to little public perception. This is the first time that we have actively participated, and while it remains to be seen what the influence of the project will be, the little feedback we have received makes us confident that the project is a success and will not end here. As far as the author's personal experience is concerned, the vast majority of it was of a positive nature. I was in contact with project mailing lists, developers, and ordinary users of open source software without a strong programming background. Of these three, the last two are certainly the easiest to work with. Developers are grateful for feedback, test results, suggestions, and patches, whereas users appreciate help with questions to which the answers are not necessarily obvious. Mailing lists are a category of their own. While many fruitful discussions are held, some of them reminded me of modern politics.
What makes democracy a successful process is the fact that everybody has their say and everybody is encouraged to speak up, something that holds true for mailing lists as well. Unfortunately, the good and bad sides go hand in hand, and so mailing lists inherit

the dangers of slow decision making and standstill. Many discussions fail to reach a conclusion and silently dissolve, much to the frustration of the person who brought up the topic. If open source developers need to learn one thing, it is to see their users as customers and treat them as such. The pragmatic solution often beats the technically more elegant one in terms of utility, a fact that each developer must learn to live with. The future will show whether we are able to reach our long-term goal: a webcam experience for users that can catch up with what Windows offers nowadays. The Linux platform has undoubtedly become competitive, but in order not to lose its momentum, Linux must focus on its weaknesses, and multimedia is clearly one of them. The components are there for the most part, but they need to be consistently improved to make sure that they work together more closely. There are high hopes for KDE 4 with its new multimedia architecture, and camera support will definitely have its place in it. The moment when Linux users can plug in their webcam, start their favorite instant messenger, and have a video conference taking advantage of all the camera's features is within grasp: an opportunity not to be missed.


Appendix A

List of Logitech webcam USB PIDs

This appendix contains a list of webcams manufactured by Logitech, their USB identifiers, and the name of the driver they are reported or tested to work with. We use the following abbreviated driver names in the table:

Key         Driver
pwc         Philips USB Webcam driver (see 4.3.1)
qcexpress   QuickCam Express driver (see 4.3.4)
quickcam    QuickCam Messenger & Communicate driver (see 4.3.3)
spca5xx     Spca5xx Webcam driver (see 4.3.2)
uvcvideo    Linux USB Video Class driver (see 4.3.5)

The table below contains the following information:

1. The USB product ID as reported, for example, by lsusb. Note that the vendor ID is always 0x046D.
2. The ASIC that the camera is based on.
3. The name under which the product was released.
4. The driver by which the camera is supported. An asterisk means that the state of support for the given camera is untested, but that the camera is likely to work with the driver, given its ASIC. Possibly the driver may need patching in order to recognize the given PID. A dash means that the camera is not currently supported.


PID 0840 0850 0870

ASIC ST600 ST610 ST602

Product name Logitech QuickCam Express Logitech QuickCam Web Logitech QuickCam Express Logitech QuickCam for Notebooks Labtec WebCam Acer OrbiCam Acer OrbiCam Logitech QuickCam IM Labtec Webcam Plus Logitech QuickCam IM Logitech QuickCam Image Logitech QuickCam for Notebooks Deluxe Labtec Notebook Pro Logitech QuickCam IM Logitech QuickCam Communicate STX Logitech QuickCam for Notebooks Logitech QuickCam Pro Logitech QuickCam Pro 3000 Logitech QuickCam Pro for Notebooks Logitech QuickCam Pro 4000 Logitech QuickCam Zoom Logitech QuickCam Zoom Logitech QuickCam Orbit Logitech QuickCam Sphere Cisco VT Camera Logitech ViewPort AV100 Logitech QuickCam Pro 4000 Logitech QuickCam Zoom Logitech QuickCam Fusion Logitech QuickCam Orbit Logitech QuickCam Sphere MP Logitech QuickCam for Notebooks Pro Logitech QuickCam Pro 5000 QuickCam for Dell Notebooks Cisco VT Camera II Logitech QuickCam IM Logitech QuickCam Connect Logitech QuickCam Messenger MP

Driver qcexpress qcexpress qcexpress

0892 0896 08A0 08A2 08A4 08A7 08A9 08AA 08AC 08AD 08AE 08B0 08B1 08B2 08B3 08B4 08B5 08B6 08B7 08BD 08BE 08C1 08C2 08C3 08C5 08C6 08C7 08D9 08DA

VC321 VC321 VC301 VC302 VC301 VC302 VC302 VC302 VC301 VC302 VC302 SAA8116 SAA8116 SAA8116 SAA8116 SAA8116 SAA8116 SAA8116 SAA8116 SAA8116 SAA8116 SPCA525 SPCA525 SPCA525 SPCA525 SPCA525 SPCA525 VC302 VC302

spca5xx spca5xx spca5xx (*) spca5xx (*) spca5xx spca5xx spca5xx (*) spca5xx spca5xx pwc pwc pwc pwc pwc pwc pwc pwc pwc pwc uvcvideo uvcvideo uvcvideo uvcvideo uvcvideo uvcvideo spca5xx spca5xx


PID    ASIC       Product name                       Driver
08F0   ST6422     Logitech QuickCam Messenger        quickcam
08F1   ST6422     Logitech QuickCam Express          quickcam (*)
08F4   ST6422     Labtec WebCam                      quickcam (*)
08F5   ST6422     Logitech QuickCam Communicate      quickcam
08F6   ST6422     Logitech QuickCam Communicate      quickcam
0920   ICM532     Logitech QuickCam Express          spca5xx
0921   ICM532     Labtec WebCam                      spca5xx
0922   ICM532     Logitech QuickCam Live             spca5xx (*)
0928   SPCA561B   Logitech QuickCam Express          spca5xx
0929   SPCA561B   Labtec WebCam                      spca5xx
092A   SPCA561B   Logitech QuickCam for Notebooks    spca5xx
092B   SPCA561B   Labtec WebCam Plus                 spca5xx
092C   SPCA561B   Logitech QuickCam Chat             spca5xx
092D   SPCA561B   Logitech QuickCam Express          spca5xx (*)
092E   SPCA561B   Logitech QuickCam Chat             spca5xx (*)
092F   SPCA561B   Logitech QuickCam Express          spca5xx
09C0   SPCA525    QuickCam for Dell Notebooks        uvcvideo


[1] Jonathan Corbet. Linux loses the Philips webcam driver. LWN, 2004.
[2] Jonathan Corbet. Some upcoming sysfs enhancements. LWN, 2006.
[3] Creative. Creative Open Source: Webcam support.
[4] Bill Dirks. Video for Linux Two - Driver Writer's Guide, 1999.
[5] Bill Dirks, Michael H. Schimek, and Hans Verkuil. Video for Linux Two API Specification, 1999-2006. URL video4linux/API/V4L2_API/.
[6] USB Implementers Forum. Universal Serial Bus Device Class Definition for Video Devices. Revision 1.1 edition, 2005. URL developers/devclass_docs.
[7] Free Software Foundation. GNU General Public License, 1991.
[8] Free Software Foundation. GNU Lesser General Public License, 1999.
[9] Philip Heron. fswebcam, 2006.
[10] Mike Isely. pvrusb2 driver, 2006. URL pvrusb2/pvrusb2.html.
[11] Greg Jones and Jens Knutson. Camorama, 2005.
[12] Avery Lee. Capture timing and capture sync, 2005.
[13] Marco Lohse. Setting up a Video Wall with NMM, 2004.
[14] Christian Magnusson. QuickCam Messenger & Communicate driver for Linux, 2006.
[15] Juan Antonio Martínez. VideoForLinux: El canal del Pingüino (The Penguin Channel). TLDP-ES/LuCAS, 1998. URL Articulos-periodisticos/jantonio/video4linux/v4l_1.html.
[16] Motama and Saarland University Computer Graphics Lab. Network-Integrated Multimedia Middleware.
[17] Laurent Pinchart. Linux UVC driver, 2006. URL http://linux-uvc.
[18] Damien Sandras. Ekiga, 2006.
[19] Tuukka Toivonen and Kurt Wal. QuickCam Express Driver, 2006.
[20] Linus Torvalds. Linux GPL and binary module exception clause?, December 2003. URL 0312.0/0670.html.
[21] Dave Wilson., 2006.
[22] Michel Xhaard. luvcview, 2006. URL spca50x/Investigation/uvc/.
[23] Michel Xhaard. SPCA5xx Webcam driver, 2006. URL http://mxhaard.

[23] Michel Xhaard. SPCA5xx Webcam driver, 2006. URL http://mxhaard.