
Name: Sugeerth Murugesan
Major: Computer Science
Course: ECS 199, Summer Session II 2013, 5 units
Instructor: Bernd Hamann

VISUALIZING WORK ASSIGNMENT DATA IN AN ADAPTIVE MESH REFINEMENT LIBRARY: FURTHER EXTENSIONS OF THE SYSTEM

Objective: Enormous amounts of data are generated daily, and the lack of effective tools to analyze the collected data reduces the ability of data scientists to gain insights [7]. The only reasonable approach to analyzing such huge amounts of data is through effective and efficient visualization, and the ability to support real-time interaction is an essential aspect of any visualization tool. The goal of this project is to support visualization of high-resolution data sets by designing, implementing and testing methods that allow data scientists to view visualizations in real time. This project extends my work done at UC Davis during Summer Session I: we add further extensions to the system developed earlier in spring and in Summer Session I, namely the Patch Module in the performance visualization tool Boxfish. The operators we implement in this project include context visualization [5], parallelizing and optimizing the data processing parts of the prototype, and projecting regions of interest through different levels of the AMR hierarchy.

AMR (Adaptive Mesh Refinement) [4] is a process by which the cells in a physical space are refined only in the areas of complex activity; at a particular instant, the space is subdivided into smaller grid cells. In a High Performance Computing (HPC) scenario, the simulated domain (application domain) is the physical space and the hardware domain is the physical hardware of the supercomputer. Parts of the application domain are mapped to the hardware domain; we visualize, and optimize the visualizations of, the physical domain.
We optimize our prototype by:
1) Clearly identifying the parts of the non-parallel implementation that consume the relatively largest amounts of resources (computing, storage and number of function calls).
2) Understanding the behavior of the application by running the non-parallel implementation with three different resolutions of the same type:
   a. Initial resolution = 1024 core run dataset (Figure 2(a))
   b. Twice the initial resolution = 2 x 1024 core run dataset (Figure 2(b))
   c. Four times the initial resolution = 4 x 1024 core run dataset (Figure 2(c))

3) Applying our optimization and parallelization techniques to the parts that are most expensive in terms of computation, storage and number of function calls.

The methods used to realize the overall software re-design through optimization and parallelization are:
1) Code profiling: profiling the current execution pattern of the prototype and finding the costly, data-intensive operations.
2) Code optimizing: optimizing the data-intensive operations found during code profiling.
3) Code parallelizing: parallelizing the expensive operations identified in the code optimizing phase.

1. Code Profiling:
The Python libraries used for the analysis are:
cProfile: reports the amount of time in CPU seconds and the number of function calls each part (function) of the program took.
Pympler: reports the current size of the data structures in the code; for example, the size of the "Patch dict" in the 4 x 1024 core run dataset is 1.44 MB.
PyMetrics: determines the number of basic independent paths of a code region, which gives a basic idea of which parts of the code need the most attention (in terms of parallelizing and exploiting more resources from the CPU).
PyCuda: lets the programmer access Nvidia's CUDA parallel computation API from Python [10]; it exposes the power of CUDA programming through the driver API.

1.2 Finding McCabe's Cyclomatic Complexity: McCabe's cyclomatic complexity [7] analysis identifies the important parts of the code that need attention: those that traverse the largest number of independent execution paths. This forms a basis for finding which parts of the code consume the largest amounts of resources (CPU seconds taken to execute, number of function calls made). The parts of the code (applicable to our prototype) analyzed in the following sections are defined below:
SetPatchSize(): (function responsible for the implosion operators) Sets the size of the patches and announces the change to the Patch module. As soon as the patch sizes are updated, the display lists are updated to reflect the change in the rendering of the patches.
ChangeValue_for_transparency(): (function to change the degree of opacity of patches) Sets the chosen value when the opacity slider is updated.
ChangeValue_for_Level_slider(): (function to change the level of interest) Visualizes the desired level of interest. The function is connected to a level slider with which the user can restrict the visualization to certain levels of refinement.
UpdateHighlights(): (function that propagates any change in the selection) Given a list of the patch ids to be highlighted, displays a dialogue box that quantifies the data-attributes of the patches of interest.
Neighbour_Change(): (function that finds the neighbors of the patch of interest) Returns the neighbors of the patch of interest in a list; does not return patches that were already sliced off using the Slice and Dice operation.
Process_Planes(): (function that initializes values for slice planes) Initializes the planes data structure (hash map) to render during the Slice and Dice mode.
Magnify(): (function that returns patches in the vicinity of the patch of interest) Finds the distance between the patch of interest and its neighbors; plays a crucial role in creating the context visualization.
Range(): Calculates the distance between the patch of interest and the entire simulated domain, and also calculates where exactly the slice planes are to be placed in the simulated domain.
DoPick(): Function responsible for the selection of patches and slice planes.
Plane_highlight(): Function responsible for the selection of a particular plane in the simulated domain.
Highlight_Drawing(): Function responsible for the visualization of multiple patches of interest, the level of interest, the magnification visualization and the highlighting of patch neighbors.

The graph in Figure 1 plots the different parts of the prototype against the number of independent paths (McCabe's complexity) in the prototype.
Figure 3 is a graph representing the relation between the number of function calls made and the CPU seconds taken to respond for each part of the prototype.
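As a sketch of how this per-function profiling can be gathered, the standard-library cProfile and pstats modules report exactly the two quantities used here (number of calls and cumulative CPU time). The function below is a toy stand-in, not the prototype's actual Neighbour_Change() code:

```python
import cProfile
import io
import pstats

def neighbour_change(centers, point):
    # Toy stand-in for the prototype's Neighbour_Change(): distance from a
    # point of interest to every patch center (city-block metric).
    return [sum(abs(c - p) for c, p in zip(center, point)) for center in centers]

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    neighbour_change([(0, 0, 0), (1, 2, 3), (4, 5, 6)], (1, 1, 1))
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()  # per-function call counts and CPU seconds
```

Sorting by cumulative time surfaces the hot spots that the later optimization and parallelization phases target.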

2. Inference Drawn and Optimization:

With this brief analysis of the code, we applied several techniques for optimizing and parallelizing our non-parallel implementation. The goal is to minimize the cost of expensive operations in the prototype to create a real-time visualization. We implemented the following operations to improve the overall performance of the software:
1. Replacement of the Euclidean distance formula with the Manhattan (city block) distance. Average improvement:
   a. Neighbour_Change(): 37.34%

2. The operational numpy data structures are converted from float32 to int32. Average improvement in performance:
   a. Neighbour_Change(): 25.73%
   b. DrawCubes(): 2.74%

Please note that the improvement in performance is calculated as

   Percent improvement = ((Old time - New time) / Old time) x 100

3. Reducing on-the-fly operations to improve the overall process of rendering, accomplished by storing the calculated color values in a numpy data structure. Average change in performance:
   a. DrawCubes(): ~25x faster
   b. Highlight_Drawing(): ~2x faster
   c. Range(): ~5x slower
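The percent-improvement figures quoted throughout this report are relative reductions in CPU time, which can be computed as in this small helper:

```python
def percent_improvement(old_time, new_time):
    # Relative reduction in CPU time, expressed as a percentage of the old time.
    return (old_time - new_time) / old_time * 100.0

# An operation that drops from 100 ms to 75 ms is a 25% improvement;
# a negative result (as for Range() above) means the change slowed things down.
```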

2.1 Manhattan Distance: To reduce computational time, the Manhattan distance is introduced to replace the Euclidean distance formula. The Manhattan distance is the number of city blocks between the two points of interest. According to the graph, there is on average a 37.34% decrease in the amount of time taken by the CPU to respond. The purpose of finding distances in the Neighbour_Change() function is to highlight the patches that lie in the region of interest. Restructuring the code to accommodate the Manhattan distance also improved the overall computational time.

2.2 Integer Conversion: The conversion of floating-point data structures to integer data structures resulted in the following decreases in the computational time taken by the CPU:
   a. Neighbour_Change(): 25.73%
   b. DrawCubes(): 2.74%

The conversion is a downsampling process in which the data is rounded to the nearest integer, limiting the precision of the float values. The data structures rounded to integers are:
   self.Distance = distance between the regions of interest
   self.maxval, self.minval = maximum and minimum values of the color map
   self.Slice = slice plane values along the x, y and z axes

The conversion of float values to integer values in the DrawCubes() function resulted in a 2.74% decrease in computational time, whereas in Neighbour_Change() it resulted in a 25.73% decrease.
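The metric substitution can be sketched with numpy on int32 arrays (toy data, not the prototype's patch centers); the Manhattan version avoids the per-patch squaring and square root:

```python
import numpy as np

def euclidean(centers, p):
    # Original metric: needs squaring and a square root per patch.
    return np.sqrt(((centers - p) ** 2).sum(axis=1))

def manhattan(centers, p):
    # City-block metric: absolute differences only, cheaper per patch.
    return np.abs(centers - p).sum(axis=1)

centers = np.array([[0, 0, 0], [3, 4, 0], [1, 1, 1]], dtype=np.int32)
poi = np.array([0, 0, 0], dtype=np.int32)
d_city = manhattan(centers, poi)
d_eucl = euclidean(centers, poi)
```

Since the distances are only used to rank patches by proximity to the region of interest, the cheaper metric is an acceptable substitute.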

2.3 Reducing On-the-Fly Computations: The purpose of reducing the on-the-fly computations is to decrease the initial time taken to render the patches. Earlier, patches were retrieved from the Python dict and, using the value of the dropped data-attribute, the color corresponding to each patch was computed on every render. The optimization technique we apply now is to store the color attributes (R, G, B, A) in a list of tuples, indexed through a hashing mechanism, so that the color corresponding to a particular patch-id can be retrieved directly. This method greatly reduced the start-up render time. The function Range() calculates the interval of the slice planes along the x, y and z axes and the total number of patches; it also retrieves the value from the dataset and computes the color. The 4-element (R, G, B, A) tuple is derived from the normalized attribute value

   Normalized_value = (Current_value - Min_value) / (Max_value - Min_value)

where Max_value = maximum value of the dropped data attribute, Min_value = minimum value of the dropped data attribute, and Current_value = value of the individual patch. The computed 4-tuple RGBA value is stored in a list; the data structure used is a list of tuples. As a result, the number of computations in DrawCubes() as well as Highlight_Drawing() decreases significantly.

2.4 Testing Datasets with the OpenGL Renderer (GeForce GT 420M/PCIe/SSE2): With the optimizations in the serial implementation of the code, we tested the behavior of the module with a GeForce GT 420M graphics card, which has a memory speed of 800 MHz and 1 GB of memory. The computer used has an Intel i5 architecture with 4 CPUs and 2 threads per core, 4 GB of memory and a clock rate of 2.67 GHz. We see a ~10x improvement in the render time. Comparing the frames per second (FPS) across the three datasets, the FPS is almost constant for levels 1 and 2 of the AMR mesh, but as the level of refinement increases, the FPS decreases.
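The precomputation step can be sketched as follows; the normalization matches the equation above, but the blue-to-red ramp is a placeholder, since the source does not specify Boxfish's actual color map:

```python
def precompute_colors(values, min_value, max_value):
    # One (R, G, B, A) tuple per patch, computed once at load time so that
    # DrawCubes() and Highlight_Drawing() only index the list by patch-id.
    span = float(max_value - min_value) or 1.0  # guard against a flat attribute
    colors = []
    for v in values:
        t = (v - min_value) / span               # Normalized_value in [0, 1]
        colors.append((t, 0.0, 1.0 - t, 1.0))    # placeholder blue-to-red ramp
    return colors

colors = precompute_colors([0, 5, 10], 0, 10)
```

At render time each draw call then reduces to a constant-time list lookup instead of a per-frame color computation.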

3. Code Parallelization:
The functions parallelized using PyCuda are: 1) computing and storing the individual RGBA color schemes for every patch, and 2) calculating the Manhattan distance from a particular point of interest. These operations are examples of an embarrassingly parallel problem, as no dependency exists between the compute points. For this type of operation, we chose the PyCuda library to exploit the benefits of GPU computing; the purpose is to reduce the overall computation time by utilizing the GPU for the expensive parts of the code.

Computing and storing the individual RGBA color scheme for every patch. Implementation details: The data structure self.values is an n-dimensional numpy array in the Patch module. It contains the values of the data-attributes dropped onto the rendering scene, e.g., max-hops-extents, owner-extents and MPI ranks. The data structure is divided into a number of threads and blocks to support the computation by PyCuda. A set of simultaneous threads is launched, each computing the color value of one patch and storing it in a memory location; the number of threads is analogous to the number of patches. To compute the number of blocks required to calculate the color attributes of the patches, the following equation is used:

   Number of blocks = Number of patches / Threads per block (integer division)

After a study of the existing hardware and the configuration of threads and blocks in the GPU, the configuration of 1024 threads per block with Number of blocks + 1 as the number of blocks was one of the optimal configurations. The configurations of threads per block taken into consideration were:
1) 1024x1x1
2) 64x4x4
3) 32x32x1
4) 512x2x1

From this analysis, the configuration of 32x32x1 threads per block yielded an approximately 36.6% improvement in computation time over the average computational time of the other configurations. The time measured here is the computation time only and does not include the communication time, which in this context refers to the time involved in transferring data to and from the GPU and the time involved in converting lists to numpy arrays. The graph in Figure 10 compares the average computational time on the CPU (Python looping time) with that on the GPU (time required to compute on the GPU). The scalar values are:
1) CPU = 0.119815 s
2) GPU = 0.00078925 s
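The launch-geometry arithmetic described above (one thread per patch, integer division plus one extra block for the remainder) can be sketched as:

```python
def launch_config(num_patches, threads_per_block=1024):
    # One thread per patch; integer division plus one extra block so the
    # remainder patches are covered (threads past num_patches do nothing).
    num_blocks = num_patches // threads_per_block + 1
    return num_blocks, threads_per_block

blocks, threads = launch_config(4096)   # 5 blocks of 1024 threads
```

Inside the kernel, each thread guards against its global index exceeding the patch count, since the last block is usually only partially filled.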

Figure 10. Amount of time taken by the GPU and CPU to compute color. If we take into account the time required to transfer data to and from the GPU (the overhead of transferring data, computing values, and storing and retrieving the computed values from a data structure), the initial data processing time surprisingly increases by 33.41%.

Calculating the Manhattan Distance from a particular point of interest:

The Manhattan (city block) distance is computed when the highlight or Magnify operations are initiated. It calculates the distance between the center of interest and the centers of all the patches in the vicinity. The procedure followed in the color-value case, finding the number of threads and blocks for maximum performance, is used here as well to compute the optimal configuration. As the CUDA kernel in the Python implementation only computes on numpy arrays, all the centers of the patches are stored in a numpy array and sent to the kernel for computation. The computed values are stored in another numpy array: the distance from the patch of interest to each other patch is stored at its respective index. The computation is done in parallel; the operation performs the same function on multiple threads.

Parallel implementation vs. serial implementation: The graph plots the total time (communication time + computational time) for the three datasets. It is interesting to see that as the datasets become more complex, the computational time increases in the case of the serial implementation, but in the case of the parallel implementation the computational time remains more or less constant. The serial implementation leads to more Python loops and iterations for a given dataset; the parallel implementation, on the other hand, sends simultaneous threads to the GPU to compute.
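A sketch of this kernel is shown below. The CUDA C source is the kind of string a PyCuda SourceModule would compile; the kernel name and memory layout are hypothetical, not taken from the Boxfish prototype. The CPU reference function shows what each GPU thread computes at its own index:

```python
import numpy as np

# CUDA C source a PyCuda SourceModule could compile (sketch only; names and
# layout are assumptions, with centers stored as a flat x,y,z,x,y,z,... array).
KERNEL_SRC = """
__global__ void manhattan(const float *centers, float *out,
                          float px, float py, float pz, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fabsf(centers[3*i]     - px)
               + fabsf(centers[3*i + 1] - py)
               + fabsf(centers[3*i + 2] - pz);
}
"""

def manhattan_reference(centers, point):
    # CPU reference for what each GPU thread computes at its own index i.
    return np.abs(centers - point).sum(axis=1)

centers = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]], dtype=np.float32)
dists = manhattan_reference(centers, np.array([1.0, 1.0, 1.0], dtype=np.float32))
```

Because every output index is independent, no synchronization between threads is needed, which is what makes the operation embarrassingly parallel.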

Figure 11. Parallel vs. serial behaviour with initial render time. In Figure 11 we find a kink in the measurement of the serial implementation of the prototype. This is probably because the datasets taken into consideration do not grow linearly, i.e., the 1024 core-run, 2 x 1024 core-run and 4 x 1024 core-run datasets. As a result of the non-linear increase in resolution, there is a "kink" in the graph.

Other Features Implemented:

Context Visualization: We develop a context [5] in the existing visualization to emphasize the region of interest. Certain data subsets are viewed in more detail than regions that are shown only for context [5]. We develop an intuitive feature that uses different graphic resources such as opacity, color and space to succinctly create visualizations of a focused region of interest. The reason for introducing such a visualization is that, for very large datasets, every detail cannot be shown explicitly to the user. We apply this in our software by rendering the patches of least interest as a wireframe. Figure 8 represents a visualization with multiple regions of interest. The patch colored blue is the focus patch; the visualization also highlights the neighbors of the patch of interest in red. The patch colored yellow was a previous region of interest chosen by the user; to indicate this, the original color of the patch is preserved. In Figure 8 we also see that the simulated domain is characterized by horizontal slice planes, which cut away unwanted parts of the simulated domain. Figure 9 represents a visualization where the region of interest is exhibited across the refinement levels of a 3-D AMR dataset. We define the context of the visualization using the wireframe to focus on the regions of interest across all levels.

Ability to Take Screenshots: With the press of the key "z", the user can take a cropped screenshot of the current visualization. Using the QFileDialog option in Qt, the software pops up a dialogue box to enable saving the image currently seen by the user. The crop function crops the desired pixels from the image and returns the result; it finds the minimum_x, minimum_y, maximum_x and maximum_y values to initiate the cropping.
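The bounding-box computation behind the crop can be sketched as below; the pixel-coordinate input format is an assumption for illustration, not the prototype's actual data structure:

```python
def crop_box(pixels):
    # Bounding box (minimum_x, minimum_y, maximum_x, maximum_y) of the
    # occupied pixels; these are the values handed to the image crop call.
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)

box = crop_box([(10, 20), (50, 5), (30, 40)])
```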

CONCLUSION AND FUTURE WORK

In this project, we have applied various concepts of parallel computing and information visualization to create visualizations that scale to real time. The optimization and parallelization of the current prototype have resulted in an average ~25x improvement in the initial rendering time. We have also developed intuitive context visualization techniques that have increased the clarity of the visualization of the region of interest. This project has also provided me the opportunity to exploit the computational power of the GPU, and the results of the performance evaluation are very interesting and promising. Many aspects of the tool could still be improved in the near future. It would be important for the user to be able to view multiple localized visualizations of the region of interest while keeping in mind the global reference of the simulated domain. Although the current toolset supports real-time rendering of patches up to 8 times the initial resolution of the 1024 core run dataset, the rendering is platform dependent and relative to each machine; implementing platform-independent software would be a challenging direction.

References
[1] Bhatele, A., Gamblin, G.T., Isaacs, K.E., Gunney, B.T.N., Schulz, M.W.J., Bremer, P.-T. and Hamann, B. (2012), Novel views of performance data to analyze large-scale adaptive applications, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.
[2] Isaacs, K.E., Landge, A.G., Gamblin, G.T., Bremer, P.-T., Pascucci, V. and Hamann, B. (2012), Exploring performance data with Boxfish, electronic poster presentation, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 13 pages.
[3] Bhatele, A., Gamblin, G.T., Langer, S.H., Bremer, P.-T., Draeger, E.W., Hamann, B., Isaacs, K.E., Landge, A.G., Levine, J.A., Pascucci, V., Schulz, M.W.J. and Still, C.H. (2012), Mapping applications with collectives over sub-communicators on torus networks, in: Hollingsworth, J.K., ed., Proceedings of Supercomputing 2012 (SC12), ACM/IEEE, ACM Press, New York, New York, 11 pages.
[4] Berger, M. and Colella, P. (1989), Local adaptive mesh refinement for shock hydrodynamics, Journal of Computational Physics, 82:64-84. Lawrence Livermore National Laboratory, Technical Report No. UCRL-97196.
[5] Hauser, H., Generalizing Focus+Context Visualization, VRVis Research Center, Vienna, Austria.
[6] E-mails and conversations involving Prof. Bernd Hamann and Katherine E. Isaacs, Department of Computer Science, University of California, Davis.
[7] McCabe, T., A Complexity Measure, IEEE Transactions on Software Engineering, Vol. SE-2, No. 4, December 1976.
[8] Klöckner, A., Computer Science, University of Illinois at Urbana-Champaign.
[9] Chapter 3 Viewing, Chapter 4 Color, Chapter 5 Lighting.
[10] Boxfish Documentation, User Guide and Developer Guide.
[11]
[12] Hornung, R.D. and Kohn, S.R. (2002), Managing application complexity in the SAMRAI object-oriented framework, Concurrency and Computation: Practice and Experience, vol. 14, no. 5, pp. 347-368.
[13] Gunney, B.T., Wissink, A.M. and Hysom, D.A. (2006), Parallel clustering algorithms for structured AMR, Journal of Parallel and Distributed Computing, vol. 66, no. 11, pp. 1419-1430.
[14] PyCuda examples.
[15] Luebke, D., Owens, J., Roberts, M. and Lee, C.-H., Introduction to Parallel Programming.
[16] Klöckner, A., GPU Metaprogramming using PyCUDA: Methods & Applications, Division of Applied Mathematics, Brown University, Nvidia GTC, October 2, 2009.
[17] Doleisch, H., Gasser, M. and Hauser, H., Interactive Feature Specification for Focus+Context Visualization of Complex Simulation Data, VRVis Research Center, Vienna, Austria.

Appendix A: Figures

Figure.1 The different parts of the prototype plotted against the individual percentage that each part contributes to the overall McCabe's complexity.

Figure.2 (a) Dataset 1: 1024 core run dataset:

Figure.2 (b) Dataset 2: 2 x 1024 core run dataset:

Figure.2 (c) Dataset 3: 4 x 1024 core run dataset:

Figure.3 (a) Represents the total contribution of a particular part of the prototype to the overall percentage. The two parameters taken into consideration are the number of function calls and the CPU seconds taken to respond for dataset 1.

Figure.3 (b) Percent contribution in dataset 2.

Figure.3 (c) Percent contribution in dataset 3.

Figure.4 The graph plots the improvement in rendering time when the optimization technique of converting the Euclidean distance to the Manhattan distance is applied.

Figure.5 (a) The graph plots the improvement in response time when the optimization technique of converting floating point values to integer values is applied to DrawCubes().

Figure.5 (b) The graph plots the improvement in response time when the optimization technique of converting floating point values to integer values is applied to Neighbour_Change().

Figure.6 (a) The graph plots the improvement in response time when the optimization technique of reducing on-the-fly operations is applied to DrawCubes().

Figure.6 (b) The graph plots the delay in response time when the optimization technique of reducing on-the-fly operations is applied to Range().

Figure.7 The graph represents the response time (in milliseconds) taken to render the basic visualization by two contrasting graphics cards (Nvidia GeForce GT 420M and Intel Ironlake Mobile).

Figure 8.
A Visualization that demonstrates the concept of context visualization.

Figure 9.
Visualizations that demonstrate the patch linking between every refinement level of the AMR dataset. The wire framework represents context visualization.