Computer Science Department

Technical Report NWU-CS-05-12
April 20, 2005
Histographs: Interactive Visualization of Complex Data with Graphs
Pin Ren, Benjamin Watson

Graphs are a widely used and understood visualization. However, they quickly break down when the visualized data is complex enough to require hundreds or thousands of graphs. Our Histographs system builds on the techniques introduced with Information Murals [7] to enable meaningful visualization of such data. Histographs map the frequency of data elements at each display location to luminance, revealing data density and trends. Our improvements include contrast-weighted histogram equalization to improve the frequency-luminance mapping, splatting to make outliers visible, a second derivative modulation to reveal changes in data trends, and the use of line integral convolution to show local data flow. Different data-to-space mappings can be implemented interactively. A linked correlation matrix display highlights inter-graph relationships. Users can zoom in on data, as well as select with shape- and correlation-based brushing. Histographs are a useful way of obtaining data overviews and revealing hidden structure in complex data sets.

Keywords: information visualization, data streams, splatting, histogram equalization, image processing, market visualization, computer system visualization

Histographs: Interactive Visualization of Complex Data with Graphs
Pin Ren*
Northwestern University

Benjamin Watson*
Northwestern University

Figure 1: NYSE TAQ trading data for May, 2004. On the left, a log(price) histograph. Dark horizontal stripes indicate more trades in the middle price range; thinner vertical stripes show increased activity as trading days close and open. On the right, trades in each stock are plotted relative to that stock's mean (SDMean(log(price))), with price trends illustrated by line integral convolution (LIC). Monthly and daily trends are much clearer.

CR Categories: H.5.2 [Information Interfaces and Presentation]: User Interfaces – Screen Design; I.3.8 [Computer Graphics]: Applications

1 GRAPHS FOR COMPLEX DATA

Visualizing large multivariate data sets is an ongoing research effort. The goal in these visualizations is finding the overall trends and hidden patterns, both by visual inspection and by interactive navigation and manipulation. Graphs are one widely used and understood visualization technique. However, if complex data is graphed (consider “stacking” a few thousand scatterplots), many data points are mapped to the same location, hiding most of the information and greatly reducing the effectiveness of the visualization.

Our aim is to make graphs useful for visualizing large, complex data sets, while retaining as much of their intuitive accessibility as possible. Our Histographs solution builds upon Jerding and Stasko’s Information Murals (IM) [7], which mapped the number of data points at a pixel to luminance, enabling effective display-space scaling of 2D plots. Histographs adapt this solution to allow effective projection (“stacking”) of a very large number of graphs into 2D (Figure 1). We describe our solution in more detail below.

2 PREVIOUS WORK

Mapping high-dimensional data sets to limited 2D display space has always been a basic visualization challenge. The scatterplot matrix [4] visualizes N-dimensional data by arranging N² graphs into an N×N matrix. The dimensions corresponding to each graph’s axes are fixed by its location in the matrix. Parallel coordinates [6] map N data dimensions to N parallel axes. Points in the data space appear as connected line segments joining points on these axes. Dimensional stacking [9] maps N-dimensional data into two dimensions by choosing two dimensions and creating graph axes for them, subtracting those dimensions from the data set, and then recursively embedding smaller graphs inside this graph using the same technique, until graphs with only one or two dimensions can be created.

Histographs address the dimensionality problem by choosing one axis from the data set and mapping it to the horizontal axis in N-1 2D graphs, each with one of the remaining N-1 dimensions mapped to its vertical axis. These graphs are then stacked. We illustrate this approach below with New York Stock Exchange (NYSE) trading data containing thousands of stocks, and a 262-dimension Windows system performance data set.

High-dimensional or large data is not only difficult to map to the display, but also difficult to visualize clearly. Often, many data points map to the same display location, obscuring one another. Zhai et al. [16] address this occlusion problem with translucence, which blends the colors of overlapping data objects. This is effective if the number of data objects at any pixel is limited. Jerding and Stasko's IM system [7] permitted large graphs to be scaled down to small display sizes by mapping the number of data points at destination pixels to luminance. Trutschl et al. [13] reposition occluded points using a smart jittering algorithm. Although none of these techniques was designed to handle a very large number of occlusions, the frequency-luminance (FL) mapping used by IM is easily adapted to this purpose. Wegman & Luo [14] and Artero et al. [1] use this approach to resolve the occlusions generated by large numbers of data records in parallel coordinates projections. Histographs use this approach to resolve the occlusions generated when high-dimensional data is projected into 2D.

Figure 2: NYSE data for Dec 1-10, 2004. Left, contrast-weighted equalization (CWE) maps frequency to luminance; right, the mapping is formed with scaling as in [1]. Note the increased overall contrast and improved visibility of outliers with CWE.
3 BASIC HISTOGRAPHS

Just as a histogram summarizes the distribution of samples using a frequency-space mapping, a histograph summarizes a large number (a “stack”) of graphs using an FL mapping. To generate a histograph, the data elements in each 2D graph Gi are organized and counted into a 2D data frequency image Fi,x×y, with resolution corresponding to display resolution. The images for all the graphs are then summed, forming a composite frequency array Fx×y = Σi Fi,x×y. This array then becomes the input for the FL mapping (see Section 4).

The data-space mapping in each graph Gi is formed by choosing one of the N dimensions in the data set and mapping it to the horizontal axis in all graphs. We call this the abscissal dimension. In all our work to date, we use time as our abscissal dimension, meaning that we treat our data as time series [2][15]. We currently use a simple linear mapping of time to the horizontal axis.

We map each of the remaining N-1 ordinal dimensions to the vertical axis on one of the N-1 graphs in the stack. Users can select interactively from various vertical mappings, including linear mappings between the Global minimum and maximum data values over all the N-1 ordinal dimensions, and between the Local minimum and maximum data values in the single ordinal dimension corresponding to the current graph Gi. Users can also transform the data interactively before it is mapped. Three particularly useful transforms are log(), deriv() and gain(). The last of these, gain(), subtracts the first data value in abscissal order in the current graph Gi from all subsequent local data values. For time series, this means subtracting the oldest value from all subsequent values (Figures 3, 5 and 10).

When each of the N-1 ordinal dimensions is measured using the same units (e.g. trades for thousands of stocks measured in US dollars), these data-space mappings are quite useful. They are less useful when units differ, as they do in our Tlab Windows performance data set. For such cases, our system includes an SDMean mapping. This operation finds a local mean µi and a local standard deviation σi for each graph Gi and the data in its matching ordinal dimension. Data is then linearly mapped over the range from µi - 4σi to µi + 4σi, with µi located at the center of Gi’s vertical axis. Data values outside this range are clamped to the range.
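The stacking and SDMean steps described above can be illustrated with a minimal sketch. It assumes NumPy, a shared array of sample times for all series, and one sample per time step; the function names `sdmean_rows` and `histograph` are hypothetical and are not taken from the Histographs system.

```python
import numpy as np

def sdmean_rows(values, height, n_sigma=4.0):
    """Map one series to pixel rows: mean at the vertical center, clamped at +/- n_sigma."""
    mu, sigma = values.mean(), values.std()
    if sigma == 0:
        sigma = 1.0  # constant series: every sample lands on the center row
    z = np.clip((values - mu) / sigma, -n_sigma, n_sigma)
    # z = +n_sigma maps to the top row (0), z = -n_sigma to the bottom row (height - 1)
    return np.rint((n_sigma - z) / (2 * n_sigma) * (height - 1)).astype(int)

def histograph(times, series_list, width, height):
    """Sum the per-graph frequency images into one composite frequency array F."""
    F = np.zeros((height, width), dtype=np.int64)
    t0, t1 = times.min(), times.max()
    # Simple linear mapping of the abscissal dimension (time) to pixel columns
    cols = np.rint((times - t0) / (t1 - t0) * (width - 1)).astype(int)
    for values in series_list:
        rows = sdmean_rows(values, height)
        np.add.at(F, (rows, cols), 1)  # unbuffered add, so repeated pixels are all counted
    return F

times = np.arange(5.0)
series = [np.array([0.0, 1.0, 2.0, 3.0, 4.0]), np.array([4.0, 3.0, 2.0, 1.0, 0.0])]
F = histograph(times, series, width=5, height=5)
```

In this toy example the rising and falling series cross at their common mean, so the center pixel of `F` receives a count of two. The resulting array is what an FL mapping would take as input.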
The SDMean mapping places main trends at the middle of each graph’s vertical dimension, and outliers above and below it (Figures 1, 2, 6-9 and 11-12).

To allow flexible, interactive zooming, we construct a temporal hierarchy on the data. This hierarchy is built bottom-up during precomputation from leaves that sample time regularly. Data points need not be so regularly spaced, so when a leaf contains multiple points, the points are aggregated using a simple or weighted (e.g. by trading volume) average. This process continues recursively, with children being aggregated into their parents until the entire hierarchy is filled.

Two data sets drive our work and illustrate this paper. The NYSE TAQ data set records every trade made on that Wall Street market, to a temporal resolution of one second. Each data record includes the ticker symbol of the traded stock, the time and date of the trade, as well as trade price and volume. The Tlab Windows system monitoring data measures the performance of a small cluster of PCs in a departmental student lab. The data includes 29 different performance measures for each of the nine machines in the cluster, recorded at a 1Hz rate for two months in 2001. The measures include processor user time, memory usage, and the number of sent TCP packets.

Figure 3: gain (and loss) in stock price for Dec 7, 2004. On the left, CWE defines the luminance mapping; on the right, histogram equalization does. CWE has higher overall contrast while preserving visibility of outliers and high-frequency data features.

Figure 4: NYSE data for Dec 1, 2004. Left, a histograph with splatting. Right, the same histograph without splatting. Note the increased visibility of splatted outliers.

4 FREQUENCY-LUMINANCE MAPPINGS

The FL mapping is central to the utility of histographs, and should effectively reveal variation in data density across the graph stack, even when data in certain regions of the stack is quite sparse.

4.1 Contrast-Weighted Histogram Equalization

In an attempt to maximize contrast and minimize quantization, visualizations (e.g. [1]) typically map data to luminance by equating the lowest and highest data values to the lowest and highest luminance values, and using a linear mapping between them. Unfortunately, if the highest and lowest data values are outliers, this linear mapping causes uneven distribution of actual data values across luminance, resulting in poor contrast (Figure 2). In image processing, histogram equalization (HE) [11] addresses a similar problem by accepting an input luminance image and producing a mapping of each distinct luminance in that image to a new luminance.
The differences between output luminances are proportional to the frequency with which each original luminance occurs in the input image. To apply HE to our FL mapping problem, we replace the input luminance image with an image-sized array of data frequencies. HE then eliminates sensitivity to data outliers, but can overemphasize small differences in data frequency, simply because they are common (Figure 3).

Our solution is to adjust HE’s mapping to reflect the contrast between input values, rather than the frequency with which they occur. We call the resulting algorithm contrast-weighted equalization (CWE). When applied to FL mapping, CWE accepts an image-sized array of data frequencies Fw×h and outputs a mapping FL|F|, where |F| is the number of distinct data frequencies in Fw×h. At completion, FL|F| contains an increasing set of device-independent luminances in the range [0,1]. As a first step, it produces an intermediate array of summed contrasts C|F|:

    DistinctF|F| = a list of all distinct frequencies in Fw×h
    Set all elements in C|F| to 0
    For each element F in Fw×h do
        For each element N of F’s 8 neighbors do
            C|F|[F] += |F - N|
        End for
    End for
    FL|F|[0] = 0
    SumC = 0
    TotalC = sum of elements in C|F|
    For i = 1 to |F|-1 do
        SumC += C|F|[DistinctF|F|[i]]
        FL|F|[i] = SumC/TotalC
    End for
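For readers who prefer runnable code, the pseudocode above might be transcribed as follows. This is a sketch under stated assumptions, not the system's implementation: it uses NumPy, assumes the distinct frequencies are held in increasing order, skips neighbors that fall outside the image, and uses the hypothetical name `cwe_mapping`.

```python
import numpy as np

def cwe_mapping(F):
    """Return (distinct, fl): sorted distinct frequencies and their luminances in [0, 1]."""
    F = np.asarray(F, dtype=np.int64)
    distinct = np.unique(F)                 # assumption: distinct frequencies sorted ascending
    index = np.searchsorted(distinct, F)    # slot of each pixel's frequency in `distinct`
    C = np.zeros(len(distinct), dtype=np.float64)
    h, w = F.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Shifted windows pair each pixel with its (dy, dx) neighbor;
            # neighbors outside the image are skipped (an assumption of this sketch).
            ys, ye = max(0, -dy), min(h, h - dy)
            xs, xe = max(0, -dx), min(w, w - dx)
            center = F[ys:ye, xs:xe]
            neighbor = F[ys + dy:ye + dy, xs + dx:xe + dx]
            np.add.at(C, index[ys:ye, xs:xe], np.abs(center - neighbor))
    total = C.sum()
    fl = np.zeros(len(distinct))
    if total > 0:
        # FL[0] stays 0; FL[i] is the running sum of contrasts for entries 1..i
        fl[1:] = np.cumsum(C[1:]) / total
    return distinct, fl

F_demo = np.array([[0, 0, 0],
                   [0, 5, 0],
                   [0, 0, 0]])
distinct_demo, fl_demo = cwe_mapping(F_demo)
```

Because the loop in the pseudocode starts at i = 1, the contrast accumulated for the lowest frequency never enters the running sum. The sketch follows this literally, so the top luminance can fall below 1.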

The results of CWE can be seen in Figures 2 and 3. Note that if the entire inner loop in the first half of this pseudocode is replaced by the simple operation C|F|[F] += 1, then the above is equivalent to HE.

To ensure that all data frequencies in our histographs are visible, we often find it useful to transform the FL mapping to a device-dependent luminance range with a non-zero minimum. When the number of data frequencies |F| in the input is less than the device’s luminance resolution, we ensure that each input frequency maps to a distinct output luminance.

4.2 Splatting to Increase Visibility of Isolated Data

Even with a good FL mapping based on CWE, isolated data elements can be hard to see (Figure 4). This occurs when data elements represent outliers, or after user zooming has reduced data density in the current histograph. Such isolated data elements can be important precisely because they are unusual or isolated, and should be visible even when the FL mapping quite appropriately makes them low-contrast features in the visualization. To address this problem, we add lower spatial frequencies to isolated data elements using splatting, increasing the visual salience of these elements without distorting the FL mapping. These goals are quite different from those of van Liere and de Leeuw [10], who use splatting uniformly throughout their graphs to blur and reveal global structure.

To focus splatting on low-contrast, spatially isolated data elements, our splats are adaptive both to the number of data elements nearby, and to the luminance of the data elements themselves. We begin with a simple neighborhood search around each pixel Pxy containing data elements to find the radius rk that defines the circular neighborhood around the pixel that contains exactly k other data elements (we currently use k = 4). The splat is then defined by the exponential function

$$0.75 \cdot \mathrm{lum}(P_{xy}) \cdot \exp\left(\frac{\mathrm{Dist}(P_{xy}, Q_{xy})}{r_k} \cdot \log\bigl(1 - \mathrm{lum}(P_{xy})\bigr)\right)$$

where lum() returns the luminance of a pixel, Dist() the distance between two pixels, and Qxy is the pixel being shaded. Note that the exponential falls off both as a function of the local sparseness of data elements rk, and the luminance of the splatted pixel Pxy. When two splats overlap, the maximum luminance is applied; splats are not additive. Splats never affect the luminance of pixels containing data points.

Figure 5: gain in stock price in May 2004, colored to reveal local price trends. Red shows falling prices, while green shows rising prices. High saturation indicates a broader trend. This histograph reveals many events in the middle of the trading day.

Figure 6: trading for Dec 1, 2004. LIC reveals pricing trends in the lower price range between 10 and 11AM.

Figure 7: Tlab Windows monitoring data for May 1-4, 2001, visualized relative to each parameter’s mean. Left, data frequency determines luminance; right, frequency is modulated by the 2nd derivative, highlighting several system events.

5 VISUALIZING HIGHER-ORDER TRENDS

By interpolating between data points, line graphs visualize an approximation of slope, giving viewers a sense of trend and flow. As dense scatterplots, simple histographs do not visualize these higher-order data characteristics. We have implemented a number of improvements to address these shortcomings.
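Returning briefly to the adaptive splat of Section 4.2, the following sketch shows one way the splat function could be evaluated. It rests on several assumptions: it uses NumPy, it reads the log term in the splat formula as log(1 - lum(Pxy)), it counts neighboring data pixels rather than individual data elements when finding rk, and it falls back to a fixed radius when fewer than k neighbors exist. The name `splat` and the `fallback_radius` parameter are hypothetical. The brute-force neighbor search is for clarity only.

```python
import numpy as np

def splat(lum, k=4, peak=0.75, fallback_radius=32.0):
    """Spread dim, isolated data pixels. lum is a 2D array in [0, 1]; 0 means no data."""
    lum = np.asarray(lum, dtype=float)
    h, w = lum.shape
    ys, xs = np.nonzero(lum)               # pixels that contain data
    yy, xx = np.mgrid[0:h, 0:w]
    layer = np.zeros_like(lum)
    for py, px in zip(ys, xs):
        # Sorted distances to all data pixels; entry 0 is the pixel itself
        d = np.sort(np.hypot(ys - py, xs - px))
        r_k = d[k] if len(d) > k else fallback_radius
        p = min(lum[py, px], 1.0 - 1e-6)   # keep log(1 - p) finite
        dist = np.hypot(yy - py, xx - px)
        value = peak * p * np.exp(dist / r_k * np.log1p(-p))
        layer = np.maximum(layer, value)   # overlapping splats take the maximum
    # Pixels that already contain data keep their original luminance
    return np.where(lum > 0, lum, layer)

lum_demo = np.zeros((9, 9))
lum_demo[4, 4] = 0.2
out_demo = splat(lum_demo)
```

With a single dim data pixel, the output keeps that pixel at its original luminance and surrounds it with a slowly decaying halo whose peak is 0.75 times the pixel's luminance.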

Figure 9: Tlab Windows monitoring data for May 2001 with a linked correlation matrix view on the right. Green indicates positive correlation, red negative correlation, with correlation strength mapped to color luminance. Each matrix row or column shows correlations of one system parameter to all others. Brushing a correlated range in the matrix image selects similar data streams in parameter view.

Figure 10: correlation brushing on stock data.

5.1 Visualizing Slope

First, we map slope to chroma (Figure 5). Since each pixel will typically cover data points from many graphs Gi, each with its own local first derivative, we in fact visualize mean slope. We map a negative mean first derivative to red, and a positive mean first derivative to green. Saturation increases with the absolute magnitude of these derivatives. We approximate the derivative by differencing consecutive (temporally adjacent) data elements.

Exploiting the notion of “flow” in graphs, we can also treat mean histograph slope as a vector, and visualize the mean slope vectors at each pixel using line integral convolution (LIC) [3] (Figure 6). We use the grayscale image created by the simple, unenhanced histograph as the necessary texture input for LIC. We find that assigning positive [1 0] vectors (pointing forward in time) to histograph pixels without data enhances this visualization, and we use a constant integral length of 25 in LIC.

5.2 Visualizing Second Derivative

Sudden changes in data trends or graph “flow” indicate important events and singularities in the data. Even if slope is visualized, these events can be hard to identify in dense histographs. To highlight these events, we modulate the FL mapping by multiplying frequency with the mean second derivative at each pixel before the mapping is constructed by CWE. Figure 7 shows how this second derivative modulation highlights sudden changes in computer system usage in a piecewise linear visualization of the Tlab data. Figure 8 shows sudden changes in NYSE stock price trends that straddle important market announcements or events.

Like first derivatives, second derivatives in our visualization are aggregated values summarizing the different second derivatives at corresponding pixels in each stacked graph Gi. However, unlike the first derivative, second derivatives are produced with filtering across the temporal dimension to make data events more visible. To find second derivatives, we first filter each of the zero-order data values in every graph Gi by averaging the five temporally adjacent surrounding values. We next construct first derivatives at each pixel in each graph Gi by differencing consecutive filtered, zero-order values. We then filter these first derivatives as we did the zero-order data, generate second derivatives from filtered first derivatives, and finally filter second derivatives as we did zero-order data and first derivatives.

Figure 8: 2nd derivative modulation on trading data, revealing strongly changing price trends before and after events such as market closings, openings and important announcements.

Figure 11: shape-based selection, with saturation mapped to the proportion of streams selected.

Figure 12: two selected baskets of stocks.

6 VISUALIZING GRAPH RELATIONSHIPS

To help users form and test hypotheses about relationships between individual graphs or dimensions in a histograph, we show a linked correlation matrix view (Figures 9 and 10). In this (N-1)×(N-1) matrix, each row and column represents one graph or dimension, and displays its correlations to all other dimensions. In each matrix cell, red indicates negative correlations and green positive correlations, while luminance increases with the absolute magnitude of the correlation. We calculate correlations using standard statistical techniques, interpolating to fill gaps in the data when necessary.

To effectively cluster correlated graphs and dimensions in the matrix, we apply the correlation matrix ordering technique described by Friendly [5]. This ordering is computed by projecting each graph or dimension into the 2D space described by the first two eigenvectors of the correlation matrix, and visiting those projected dimensions in angular order.

Users can interactively specify a time range over which to construct the matrix and examine correlations. Users can also select correlated streams in the linked histograph by brushing on the correlation matrix view. We discuss those features in more detail below.
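The angular ordering of the correlation matrix described above can be sketched in a few lines. The sketch assumes NumPy and a symmetric correlation matrix R. Starting the linear order just after the largest angular gap is a choice made here for illustration, and the name `angular_order` is hypothetical.

```python
import numpy as np

def angular_order(R):
    """Order variables by the angle of their loadings on the top two eigenvectors of R."""
    R = np.asarray(R, dtype=float)
    vals, vecs = np.linalg.eigh(R)       # eigh returns eigenvalues in ascending order
    e1, e2 = vecs[:, -1], vecs[:, -2]    # eigenvectors of the two largest eigenvalues
    angles = np.arctan2(e2, e1)          # angle of each variable in the (e1, e2) plane
    order = np.argsort(angles)
    # Unfold the circular order: begin just after the largest gap between neighbors
    sorted_angles = angles[order]
    gaps = np.diff(np.append(sorted_angles, sorted_angles[0] + 2 * np.pi))
    start = (int(np.argmax(gaps)) + 1) % len(order)
    return np.roll(order, -start)

# Toy matrix: variables 0 and 2 are strongly correlated, as are variables 1 and 3
R_demo = np.array([[1.0, 0.1, 0.9, 0.1],
                   [0.1, 1.0, 0.1, 0.9],
                   [0.9, 0.1, 1.0, 0.1],
                   [0.1, 0.9, 0.1, 1.0]])
order_demo = angular_order(R_demo)
```

In the toy matrix, each strongly correlated pair ends up in adjacent positions, which is the clustering effect the ordering is meant to produce in the matrix view.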
7 SUPPORTED INTERACTIONS

Interaction is a crucial part of any visualization system, especially those designed for data exploration. As we have described above, users can interactively change the data transformation and data-space mappings. FL mapping enhancements such as CWE and splatting can be turned on and off at will. Users can toggle visualization of higher-order trends, and a linked correlation view. We have also implemented several other interactions.

With click and query, users can click on a histograph pixel to reveal a pop-up menu showing textual information about the data from different dimensions aggregated by this pixel. For example, with the NYSE data, the information revealed includes ticker symbol, price, and slope. Users can also use the pop-up menu to perform a dimension-by-dimension (graph-by-graph) selection operation.

Our annotation interaction allows users to add textual annotation linked to any point in the histograph. Only a red dot marks the annotation left behind; clicking on it later reveals the text.

Users often want to experiment with clustering several graphs/dimensions by selecting streams with similar behaviors. With shape-based selection (Figures 11 and 12), users can select graphs that have similar shape. With a lasso interaction, users define a target shape. In the temporal range spanned by the shape, if “most” (currently 90%) of a graphed dimension’s data falls inside the lasso shape, it will be selected. Target shapes can also be dragged around the histograph to select different time and ordinal data ranges.

7.1 Linking and Brushing

Linking and brushing interactions are an important element of our interface. In the linked correlation matrix view, users can brush using a horizontal stroke or instead use more precise sliders. The corresponding set of dimensions/graphs in the linked histographs view will be selected for further attention (Figures 9 and 10). Similar sliders may be used in the histographs view, causing a recalculation of the correlation matrix to reflect any change in the selected time range. Users can easily create a duplicate, linked histograph view to which alternate data transforms and mappings can be applied, allowing users to see different aspects of the same data.
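The shape-based selection rule from Section 7 (a graph is selected when at least 90% of its samples within the lasso's time span fall inside the lasso) can be sketched as follows. The sketch assumes that the lasso is a simple polygon in (time, value) coordinates and that each graph is a list of (time, value) samples. The names `point_in_polygon` and `select_by_shape` are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def select_by_shape(graphs, lasso, threshold=0.9):
    """Return names of graphs whose samples in the lasso's time span mostly lie inside it."""
    t_min = min(t for t, _ in lasso)
    t_max = max(t for t, _ in lasso)
    selected = []
    for name, samples in graphs.items():
        in_span = [(t, v) for t, v in samples if t_min <= t <= t_max]
        if not in_span:
            continue  # the graph has no data in the lasso's temporal range
        hits = sum(point_in_polygon(t, v, lasso) for t, v in in_span)
        if hits / len(in_span) >= threshold:
            selected.append(name)
    return selected

lasso_demo = [(0, 0), (10, 0), (10, 10), (0, 10)]
graphs_demo = {
    "A": [(t, 5.0) for t in range(1, 10)],                       # stays inside the lasso
    "B": [(t, 5.0 if t < 5 else 20.0) for t in range(1, 10)],    # leaves the lasso halfway
    "C": [(t, 5.0) for t in range(20, 30)],                      # outside the time span
}
selected_demo = select_by_shape(graphs_demo, lasso_demo)
```

Dragging the target shape, as described in Section 7, would amount to translating the lasso vertices and rerunning the same test.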
Any sort of selection in one view also selects and highlights the same set of graphs/dimensions in all linked views. Interesting graph selections can be saved for further analysis and comparison by dubbing the selection a “cluster” and assigning it a color. Multiple clusters can be viewed at once for close comparison and testing (Figure 12). Selections can also be used to split the histograph into two histographs, one containing the selected graphs and dimensions, and the other containing the unselected data.

7.2 Zooming

Having seen an overview, users can zoom in on interesting data features, obtaining more detailed and finer scale information about the data set [12]. To zoom, users can draw a rectangular region of any size anywhere in the histograph, simultaneously selecting a time and ordinal data (e.g. price) range. For more precision, sliders may also be used (Figures 13 and 14). When the selection is complete, a new (unlinked) histograph appears, displaying the selected time/data range, and any graphs/dimensions that intersect this range. Zooms may also be performed on zoomed views.

To maintain a visual correspondence between a zoomed view and its parent context, we do not create a new FL mapping for the zoomed view. Instead, we reuse the parent’s FL mapping. Because zooming will usually result in a smaller range of data frequencies than the parent contains, we scale the values in the zoomed view’s frequency image Fx×y to match the parent’s frequency range. Users can discard this common FL mapping at any time and create a unique FL mapping for the zoomed view with CWE.

Figure 13: a data-space zoom into a smaller price range.

Figure 14: a data-space zoom into a smaller time range.

8 APPLICATION EXAMPLES

Our development of this system is driven by two applications and their related data sets. Our finance collaborators are interested in two uses of the system: macroeconomic analysis of market segments or portfolios for use in education and training, and a close research analysis of intraday trading patterns. We have come much farther in serving the first use than the second. In particular, overviews of long time periods (e.g. one month, Figure 1) reveal the large-scale trends nicely. The modulated view in Figure 8 highlights interesting and significant market events and patterns well. Correlation views like that in Figure 10 provide a valuable clustering analysis.

On the other hand, while intra-day views such as Figures 3, 4 and 6 are quite interesting, they do not yet provide all the features required to support active research. In particular, we need to add views of price volatility into our system. This will likely require the addition of price minimum and maximum fields to our data structures. We would also like to add much more support for hypothesis or model formation and testing. Users should be able to describe a model and see it tested visually and interactively against a real data set, zooming in on interesting successes and failures.

Our computer systems collaborators are at the earliest stages of their research, trying simply to understand what information is available to them in dense system traces like those in our Tlab data. For now the bar is quite low: visualization of any kind will be important to them as they explore their data, particularly event-focused views like Figure 7 that may help them identify system failures or intrusions, and correlation views like Figure 9 that identify relationships within the traces and reduce the data space they are mining. Many systems traces are piecewise linear; we are considering special views and projections designed for this sort of data.

9 CONCLUSION AND FUTURE WORK

Histographs are a new graph-based technique for visualizing and interacting with large, high-dimensional data sets. N-dimensional data is mapped to a stack of 2D graphs, which are then projected and composited into a single histograph view using techniques introduced by the Information Murals system [7]. We build on these techniques by describing a new contrast-weighted equalization for mapping data density or frequency to luminance, an application of line integral convolution to reveal data trends and flow, and a linked correlation view to show dimensional relationships. We have demonstrated the use of histographs in two applied domains: NYSE stock trades, and system traces in a local PC cluster.

While we believe histographs can be used with a wide range of high-dimensional data, our work to date has focused on data that contains a temporal dimension, leading us to focus on techniques for time series. We would like to expand our system to include interaction techniques specialized for this type of data [2][15]. We also would like to expand to non-temporal and non-numeric data sets.

The histographs system was built with an index card metaphor in mind: one graph on each index card, with cards stacked or unstacked. We would like to further develop this metaphor.

10 ACKNOWLEDGEMENTS

Our thanks to our collaborators Torben Anderson of Northwestern Finance, and to Peter Dinda of Northwestern Computer Science’s Systems Group.

11 REFERENCES

[1] A.O. Artero, M.C.F. de Oliveira & H. Levkowitz. 2004. Uncovering clusters in crowded parallel coordinates visualizations. Proc. IEEE Information Visualization, 81-88.
[2] R. Bade, S. Schlechtweg & S. Miksch. 2004. Connecting time-oriented data and information to a coherent interactive visualization. Proc. ACM CHI, 105-112.
[3] B. Cabral & L.C. Leedom. 1993. Imaging vector fields using line integral convolution. Proc. ACM SIGGRAPH, 263-270.
[4] W.S. Cleveland & R. McGill. 1984. The many faces of a scatterplot. J. American Statistical Association, 79, 807-822.
[5] M. Friendly. 2002. Corrgrams: exploratory display for correlation matrices. American Statistician, 56, 4, 316-324.
[6] A. Inselberg & B. Dimsdale. 1990. Parallel coordinates: A tool for visualizing multi-dimensional geometry. Proc. IEEE Visualization, 361-378.
[7] D.F. Jerding & J.T. Stasko. 1998. The Information Mural: a technique for displaying and navigating large information spaces. IEEE Trans. Visualization & Computer Graphics, 4, 3, 257-271.

[8] D.A. Keim. 2000. Designing pixel-oriented visualization techniques: theory and applications. IEEE Trans. Visualization & Computer Graphics, 6, 1, 59-78.
[9] J. LeBlanc, M.O. Ward & N. Wittels. 1990. Exploring N-dimensional databases. Proc. IEEE Visualization, 230-237.
[10] R. van Liere & W.C. de Leeuw. 2003. GraphSplatting: Visualizing graphs as continuous fields. IEEE Trans. Visualization & Computer Graphics, 9, 2, 206-212.
[11] J.S. Lim. 1990. Two-Dimensional Signal and Image Processing. Prentice-Hall: Upper Saddle River, NJ.
[12] B. Shneiderman. 1996. The eyes have it: a task by data type taxonomy for information visualizations. Proc. IEEE Visual Languages, 336-343.
[13] M. Trutschl, G.G. Grinstein & U. Cvek. 2003. Intelligently resolving point occlusion. Proc. IEEE Information Visualization, 131-136.
[14] E.J. Wegman & Q. Luo. 1997. High dimensional clustering using parallel coordinates and the grand tour. Computing Science and Statistics, Vol. 28, 361-368.
[15] J.J. van Wijk & E.R. Van Selow. 1999. Cluster and calendar based visualization of time series data. Proc. IEEE Information Visualization, 4-9.
[16] S. Zhai, W. Buxton & P. Milgram. 1996. The partial-occlusion effect: utilizing semitransparency in 3d human-computer interaction. ACM Trans. Computer-Human Interaction, 3, 3, 254-284.
