
Graphics Processing Units (GPU) Research:


Research in the area of the GPU is focused on harnessing the power of the GPU for visualization, simulation, image and information processing and analysis, and, of course, for graphics. As the speed and efficiency of graphics processing units (GPUs) grow at rates even faster than those of conventional central processing units (CPUs), there is a growing consensus that the streaming architecture embodied in most modern graphics processors has inherent advantages in scalability. Virtually all contemporary processor manufacturers are exploring opportunities for GPU improvement, and there is an emerging set of standards and tools that can take advantage of these processors for more general-purpose computing than graphics.

There has been considerable progress in implementing the hardware and the supporting infrastructure for GPU programming and streaming architectures. Extensions to the C language, such as CUDA, provide the needed capabilities for operating on standard C data structures, and there is an apparent convergence on open standards (e.g., OpenCL) for general-purpose computing on streaming architectures.

Such streaming architectures have significant implications for algorithm design, especially when applied to general-purpose tasks beyond the graphics manipulations for which they were initially designed. They offer massively parallel single instruction, multiple data (SIMD) processing of several dozen streams across multiple blocks, with fast access to shared memory, but with restrictions such as limited local memory size and relatively slow communication between blocks.
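
As a concrete illustration (ours, not from any particular paper), a minimal CUDA kernel shows the model described above: every thread executes the same instruction stream, a block of threads shares a small on-chip memory, and synchronization is available within a block but not across blocks:

    __global__ void scaleKernel(const float* in, float* out, int n, float a)
    {
        __shared__ float tile[256];                 // fast on-chip memory, one copy per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];       // stage the value through shared memory
        __syncthreads();                            // barrier: synchronizes this block only
        if (i < n) out[i] = a * tile[threadIdx.x];  // every thread runs the same instructions
    }

    // Launched with a block size matching the tile, e.g.:
    // scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);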

In the area of scientific computing, the SCI Institute is actively pursuing algorithms for streaming architectures that typically satisfy three criteria: i) they do not impose a particular update sequence on the solution, ii) they rely primarily on synchronous updates, and iii) they have a high computational density (amount of computation per unit of data) in the absence of global communication (e.g., to/from the host or between blocks).
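
A Jacobi-style relaxation step is a textbook example of an update with all three properties; the sketch below (a hypothetical illustration, not SCI Institute code) imposes no update order, is fully synchronous via ping-pong buffers, and touches only a few neighboring values per thread:

    // Hypothetical illustration: one synchronous Jacobi relaxation step on an
    // nx-by-ny grid. Reading only uOld and writing only uNew makes the update
    // order irrelevant and the sweep fully synchronous.
    __global__ void jacobiStep(const float* uOld, float* uNew, int nx, int ny)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x > nx - 2 || y > ny - 2) return;  // boundary stays fixed
        int i = y * nx + x;
        uNew[i] = 0.25f * (uOld[i - 1] + uOld[i + 1] +           // four neighbors,
                           uOld[i - nx] + uOld[i + nx]);         // previous iterate only
    }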

Parallel Breadth First Search on GPU Clusters
Z. Fu, H.K. Dasari, M. Berzins, B. Thompson. In Proceedings of the IEEE BigData 2014 Conference, Washington, DC, October 2014.

Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low-level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.
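
For context, one level of a level-synchronous BFS on a single GPU commonly looks like the sketch below (our illustration, not the paper's implementation); in the cluster setting studied here, frontier vertices owned by remote partitions would additionally be exchanged over the network between levels, which is where the communication burden arises:

    // Hedged sketch of one level of a level-synchronous BFS (a common GPU
    // formulation, not the paper's code). The graph is stored in CSR form;
    // depth[] is initialized to -1 everywhere except the source vertex.
    __global__ void bfsLevel(const int* rowPtr, const int* colIdx, int* depth,
                             const int* frontier, int frontierSize,
                             int* nextFrontier, int* nextSize, int level)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontierSize) return;
        int v = frontier[t];                         // one thread per frontier vertex
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            // Claim each unvisited neighbor exactly once.
            if (atomicCAS(&depth[u], -1, level + 1) == -1) {
                int slot = atomicAdd(nextSize, 1);
                nextFrontier[slot] = u;              // enqueue for the next level
            }
        }
    }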



 


An Analysis of Scalable GPU-Based Ray-Guided Volume Rendering
T. Fogal, A. Schiewe, J. Krueger. In 2013 IEEE Symposium on Large Data Analysis and Visualization (LDAV), 2013.

Volume rendering continues to be a critical method for analyzing large-scale scalar fields, in disciplines as diverse as biomedical engineering and computational fluid dynamics. Commodity desktop hardware has struggled to keep pace with data size increases, challenging modern visualization software to deliver responsive interactions for O(N³) algorithms such as volume rendering. We target the data type common in these domains: regularly structured data.

In this work, we demonstrate that the major limitation of most volume rendering approaches is their inability to switch the data sampling rate (and thus data size) quickly. Using a volume renderer inspired by recent work, we demonstrate that the actual amount of visualizable data for a scene is typically bounded well below the memory available on a commodity GPU. Our instrumented renderer is used to investigate design decisions typically swept under the rug in volume rendering literature. The renderer is freely available, with binaries for all major platforms as well as full source code, to encourage reproduction and comparison with future research.
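
For reference, the inner loop of a typical GPU volume renderer is sketched below (illustrative only, not the paper's instrumented renderer); the step size, and with it the effective data sampling rate, is precisely the quantity the paper argues a renderer must be able to change quickly:

    // Illustrative front-to-back ray-marching loop (not the paper's renderer).
    __device__ float4 marchRay(cudaTextureObject_t volume, float3 origin,
                               float3 dir, float tEnd, float step)
    {
        float4 accum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        for (float t = 0.0f; t < tEnd && accum.w < 0.99f; t += step) {
            float3 p = make_float3(origin.x + t * dir.x,
                                   origin.y + t * dir.y,
                                   origin.z + t * dir.z);
            float s  = tex3D<float>(volume, p.x, p.y, p.z);  // sample the scalar field
            float4 c = make_float4(s, s, s, s * 0.05f);      // toy transfer function
            accum.x += (1.0f - accum.w) * c.w * c.x;         // front-to-back compositing
            accum.y += (1.0f - accum.w) * c.w * c.y;
            accum.z += (1.0f - accum.w) * c.w * c.z;
            accum.w += (1.0f - accum.w) * c.w;               // enables early termination
        }
        return accum;
    }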




A Generalized Malfatti Problem
C.-S. Chiang, C. Hoffmann, P. Rosen. In Computational Geometry: Theory and Applications, Vol. 45, No. 8, pp. 425--435. 2012.

Malfatti's problem, first published in 1803, is commonly understood to ask for three circles to be fitted into a given triangle such that they are mutually tangent, externally, and such that each circle is tangent to a pair of the triangle's sides. There are many solutions based on geometric constructions, as well as generalizations in which the triangle sides are assumed to be circular arcs. A generalization that asks to fit six circles into the triangle, tangent to each other and to the triangle sides, has been considered a good example of a problem that requires sophisticated numerical iteration to solve by computer. We analyze this problem and show how to solve it quickly.
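
For orientation (standard circle geometry, not the paper's derivation), every admissible configuration is pinned down by two kinds of tangency constraints, which is what makes a numerical treatment natural. Writing c_i and r_i for the center and radius of circle i, and \ell_k for the line supporting the k-th side of the triangle:

    \| c_i - c_j \| = r_i + r_j               (circles i and j externally tangent)
    \operatorname{dist}(c_i, \ell_k) = r_i    (circle i tangent to side \ell_k)

For six circles this yields a small system of nonlinear equations in the centers and radii, the kind of system the paper shows how to solve quickly.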




A Note on Circle Packing
Y.-J. Ahn, C. Hoffmann, P. Rosen. In Journal of Zhejiang University SCIENCE C, Vol. 13, No. 8, pp. 559--564. 2012.

The problem of packing circles into a domain of prescribed topology is considered. The circles need not have equal radii. The Collins-Stephenson algorithm computes such a circle packing. This algorithm is parallelized in two different ways and its performance is reported for a triangular, planar domain test case. The implementation uses the highly parallel graphics processing unit (GPU) on commodity hardware. The speedups so achieved are discussed based on a number of experiments.
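
The core of the Collins-Stephenson iteration is naturally data-parallel: each interior circle's radius is adjusted until the angles subtended by its "flower" of neighboring circles sum to 2π. A hedged CUDA sketch of one such update step (our illustration; the paper's two parallelization strategies differ in detail):

    // One radius-update step, one thread per interior vertex. nbr[] lists each
    // vertex's petal neighbors in cyclic order, indexed through nbrOffset[].
    // All names are illustrative, not the authors' implementation.
    __global__ void radiusUpdate(const int* nbrOffset, const int* nbr,
                                 const float* rIn, float* rOut, int nInterior)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= nInterior) return;
        float rv = rIn[v], angleSum = 0.0f;
        int start = nbrOffset[v], end = nbrOffset[v + 1];
        for (int k = start; k < end; ++k) {
            float ru = rIn[nbr[k]];
            float rw = rIn[nbr[(k + 1 < end) ? k + 1 : start]];  // wrap the flower
            // Angle at v in the triangle of mutually tangent circles (v,u,w),
            // by the law of cosines on side lengths rv+ru, rv+rw, ru+rw.
            float a = rv + ru, b = rv + rw, c = ru + rw;
            angleSum += acosf((a * a + b * b - c * c) / (2.0f * a * b));
        }
        // Grow the circle if the petals wrap past 2*pi, shrink it otherwise;
        // a simple multiplicative correction stands in for the exact update.
        rOut[v] = rv * (angleSum / 6.2831853f);
    }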



 
Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System
A. Humphrey, Q. Meng, M. Berzins, T. Harman. In Proceedings of the first conference of the Extreme Science and Engineering Discovery Environment (XSEDE'12), No. 4, pp. 4:1--4:8. 2012.
DOI: 10.1145/2335755.2335791

The Uintah Computational Framework was developed to provide an environment for solving fluid-structure interaction problems on structured adaptive grids on large-scale, long-running, data-intensive problems. Uintah uses a combination of fluid-flow solvers and particle-based methods for solids, together with a novel asynchronous task-based approach with fully automated load balancing. Uintah demonstrates excellent weak and strong scalability at full machine capacity on XSEDE resources such as Ranger and Kraken, and through the use of a hybrid memory approach based on a combination of MPI and Pthreads, Uintah now runs on up to 262k cores on the DOE Jaguar system. In order to extend Uintah to heterogeneous systems, with ever-increasing CPU core counts and additional on-node GPUs, a new dynamic CPU-GPU task scheduler is designed and evaluated in this study. This new scheduler enables Uintah to fully exploit these architectures with support for asynchronous, out-of-order scheduling of both CPU and GPU computational tasks. A new runtime system has also been implemented with an added multi-stage queuing architecture for efficient scheduling of CPU and GPU tasks. This new runtime system automatically handles the details of asynchronous memory copies to and from the GPU and introduces a novel method of pre-fetching and preparing GPU memory prior to GPU task execution. In this study this new design is examined in the context of a developing, hierarchical GPU-based ray tracing radiation transport model that provides Uintah with additional capabilities for heat transfer and electromagnetic wave propagation. The capabilities of this new scheduler design are tested by running at large scale on the modern heterogeneous systems, Keeneland and TitanDev, with up to 360 and 960 GPUs respectively. On these systems, we demonstrate significant speedups per GPU against a standard CPU core for our radiation problem.
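
The pre-fetching idea generalizes beyond Uintah. In plain CUDA it amounts to issuing a task's host-to-device copies on a dedicated copy stream and making the compute stream wait on an event, as in this hedged sketch (names and structure are ours, not Uintah's runtime):

    #include <cuda_runtime.h>

    __global__ void taskKernel(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;            // stand-in for the task's real work
    }

    // Copy a task's inputs on a dedicated copy stream, then let the compute
    // stream wait on an event so the kernel starts only once the data is
    // resident, without ever blocking the host.
    void prefetchAndRun(const float* hostPinned, float* dev, int n,
                        cudaStream_t copyStream, cudaStream_t computeStream,
                        cudaEvent_t inputsReady)
    {
        cudaMemcpyAsync(dev, hostPinned, n * sizeof(float),
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(inputsReady, copyStream);           // marks end of the copy
        cudaStreamWaitEvent(computeStream, inputsReady, 0); // GPU-side dependency
        taskKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dev, n);
    }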



 



GLuRay: Ray Tracing in Scientific Visualization Applications using OpenGL Interception
C. Brownlee, T. Fogal, C.D. Hansen. In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization (2012), Edited by H. Childs and T. Kuhlen and F. Marton, pp. 41--50. 2012.
DOI: 10.2312/EGPGV/EGPGV12/041-050

Ray tracing in scientific visualization allows for substantial gains in performance and rendering quality with large-scale polygonal datasets compared to brute-force rasterization; however, implementing new rendering architectures into existing tools is often costly and time consuming. This paper presents a library, GLuRay, which intercepts OpenGL calls from many common visualization applications and renders them with the CPU ray tracer Manta without modification to the underlying visualization tool. Rendering polygonal models such as isosurfaces can be done identically to an OpenGL implementation using the provided material and camera properties, or superior rendering can be achieved using enhanced settings such as dielectric materials or pinhole cameras with depth-of-field effects. Comparative benchmarks were conducted on the Texas Advanced Computing Center's Longhorn cluster using the popular visualization packages ParaView, VisIt, EnSight, and VAPOR. Through the parallel rendering package ParaView, scaling up to 64 nodes is demonstrated. Our tests show that using OpenGL interception to accelerate and enhance visualization programs provides a viable enhancement to existing tools, with little overhead and no code modification, while allowing for the creation of publication-quality renderings using advanced effects and greatly improved large-scale software rendering performance within tools that scientists are currently using.
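
The interception mechanism itself is a standard Linux technique: preload a shared library that exports OpenGL entry points, resolve the real functions with dlsym(RTLD_NEXT, ...), and divert or forward each call. A minimal hedged sketch (GLuRay's actual implementation is far more complete):

    // Build: g++ -shared -fPIC intercept.cpp -o libintercept.so -ldl
    // Run:   LD_PRELOAD=./libintercept.so <visualization tool>
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for RTLD_NEXT
    #endif
    #include <dlfcn.h>
    #include <GL/gl.h>
    #include <cstdio>

    extern "C" void glDrawArrays(GLenum mode, GLint first, GLsizei count)
    {
        using Fn = void (*)(GLenum, GLint, GLsizei);
        static Fn real = (Fn)dlsym(RTLD_NEXT, "glDrawArrays");  // real driver entry
        // A ray tracer could capture the currently bound vertex arrays here
        // instead of letting OpenGL rasterize them; this sketch just forwards.
        std::fprintf(stderr, "intercepted glDrawArrays(count=%d)\n", (int)count);
        real(mode, first, count);
    }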




A Study of Ray Tracing Large-scale Scientific Data in Parallel Visualization Applications
C. Brownlee, J. Patchett, L.-T. Lo, D. DeMarle, C. Mitchell, J. Ahrens, C.D. Hansen. In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization (2012), Edited by H. Childs and T. Kuhlen and F. Marton, pp. 51--60. 2012.

Large-scale analysis and visualization is becoming increasingly important as supercomputers and their simulations produce larger and larger data. These large data sizes are pushing the limits of traditional rendering algorithms and tools, thus motivating a study exploring these limits and their possible resolutions through alternative rendering algorithms. In order to better understand real-world performance with large data, this paper presents a detailed timing study on a large cluster with the widely used visualization tools ParaView and VisIt. The software ray tracer Manta was integrated into these programs in order to show that improved performance could be attained with software ray tracing on a distributed-memory, GPU-enabled, parallel visualization resource. Using the Texas Advanced Computing Center's Longhorn cluster, which has multi-core CPUs and GPUs, with large-scale polygonal data, we find multi-core CPU ray tracing to be significantly faster than both software rasterization and hardware-accelerated rasterization in existing scientific visualization tools with large data.




Dynamic particle system for mesh extraction on the GPU
M. Kim, G. Chen, C.D. Hansen. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, London, England, GPGPU-5, ACM, New York, NY, USA pp. 38--46. 2012.
ISBN: 978-1-4503-1233-2
DOI: 10.1145/2159430.215943

Extracting isosurfaces represented as high-quality meshes from three-dimensional scalar fields is needed for many important applications, particularly visualization and numerical simulations. One recent advance for extracting high-quality meshes for isosurface computation is based on a dynamic particle system. Unfortunately, this state-of-the-art particle placement technique requires a significant amount of time to produce a satisfactory mesh. To address this issue, we study the parallelism inherent in the particle placement and make use of CUDA, a parallel programming platform for the GPU, to significantly improve the performance of particle placement. This paper describes the curvature-dependent sampling method used to extract high-quality meshes and describes its implementation using CUDA on the GPU.
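
As a rough sketch of the data-parallel step involved (our illustration, not the paper's code; the curvature-dependent inter-particle interactions are omitted), each CUDA thread can re-project its particle onto the isosurface with a Newton step along the field gradient:

    __device__ float fieldAt(cudaTextureObject_t vol, float3 p)
    {
        return tex3D<float>(vol, p.x, p.y, p.z);
    }

    __device__ float3 gradAt(cudaTextureObject_t vol, float3 p)
    {
        const float h = 1.0f;                            // one voxel
        return make_float3(
            (fieldAt(vol, make_float3(p.x + h, p.y, p.z)) - fieldAt(vol, make_float3(p.x - h, p.y, p.z))) * 0.5f,
            (fieldAt(vol, make_float3(p.x, p.y + h, p.z)) - fieldAt(vol, make_float3(p.x, p.y - h, p.z))) * 0.5f,
            (fieldAt(vol, make_float3(p.x, p.y, p.z + h)) - fieldAt(vol, make_float3(p.x, p.y, p.z - h))) * 0.5f);
    }

    // One relaxation step: every thread re-projects its particle onto the
    // level set F(p) = isovalue.
    __global__ void projectParticles(float3* pos, int n,
                                     cudaTextureObject_t vol, float isovalue)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float3 p  = pos[i];
        float  f  = fieldAt(vol, p) - isovalue;          // residual to the level set
        float3 g  = gradAt(vol, p);                      // central-difference gradient
        float  gg = g.x * g.x + g.y * g.y + g.z * g.z + 1e-8f;
        pos[i] = make_float3(p.x - f * g.x / gg,         // p <- p - f * grad / |grad|^2
                             p.y - f * g.y / gg,
                             p.z - f * g.z / gg);
    }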




Topological Analysis and Visualization of Cyclical Behavior in Memory Reference Traces
A.N.M. Imroz Choudhury, B. Wang, P. Rosen, V. Pascucci. In Proceedings of the IEEE Pacific Visualization Symposium (PacificVis 2012), pp. 9--16. 2012.
DOI: 10.1109/PacificVis.2012.6183557

We demonstrate the application of topological analysis techniques to the rather unexpected domain of software visualization. We collect a memory reference trace from a running program, recasting the linear flow of trace records as a high-dimensional point cloud in a metric space. We use topological persistence to automatically detect significant circular structures in the point cloud, which represent recurrent or cyclical runtime program behaviors. We visualize such recurrences using radial plots to display their time evolution, offering multi-scale visual insights, and detecting potential candidates for memory performance optimization. We then present several case studies to demonstrate some key insights obtained using our techniques.
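
As a rough illustration of the first step (our simplification; the paper's construction and metric are more refined), a linear trace can be recast as a point cloud by sliding a window over it, after which persistence is computed on the resulting points:

    #include <cstdint>
    #include <vector>

    // Each window of w records becomes one point in R^w; cyclical runtime
    // behavior then shows up as loops in the resulting point cloud.
    std::vector<std::vector<double>>
    embedTrace(const std::vector<std::uint64_t>& trace, std::size_t w, std::size_t stride)
    {
        std::vector<std::vector<double>> cloud;
        for (std::size_t i = 0; i + w <= trace.size(); i += stride) {
            std::vector<double> point(trace.begin() + i, trace.begin() + i + w);
            cloud.push_back(std::move(point));   // one point per trace window
        }
        return cloud;
    }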



 
The Challenges of Writing Portable, Correct and High Performance Libraries for GPUs
M. Leeser, D. Yablonski, D.H. Brooks, L.S. King. In Computer Architecture News, Vol. 39, No. 4, pp. 2--7. 2011.
DOI: 10.1145/2082156.2082158

Graphics Processing Units (GPUs) are widely used to accelerate scientific applications. Many successes have been reported with speedups of two or three orders of magnitude over serial implementations of the same algorithms. These speedups typically pertain to a specific implementation with fixed parameters mapped to a specific hardware implementation. The implementations are not designed to be easily ported to other GPUs, even from the same manufacturer. When target hardware changes, the application must be re-optimized. In this paper we address a different problem. We aim to deliver working, efficient GPU code in a library that is downloaded and run by many different users. The issue is to deliver efficiency independent of the individual user parameters and without a priori knowledge of the hardware the user will employ. This problem requires a different set of tradeoffs than finding the best runtime for a single solution. Solutions must be adaptable to a range of different parameters both to solve users' problems and to make the best use of the target hardware. Another issue is the integration of GPUs into a Problem Solving Environment (PSE) where the use of a GPU is almost invisible from the perspective of the user. Ease of use and smooth interactions with the existing user interface are important to our approach. We illustrate our solution with the incorporation of GPU processing into the SCIRun biomedical PSE developed at the Scientific Computing and Imaging (SCI) Institute at the University of Utah. SCIRun allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms in the scientist's simulations. Specifically, we target the linear solver module, including Conjugate Gradient, Jacobi, and MinRes solvers for sparse matrices.
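
A sparse matrix-vector product is the kernel shared by all three solvers named above; a minimal CSR formulation (illustrative, not the library's code) looks like this:

    // Sparse matrix-vector product y = A*x with A in CSR format, one thread
    // per row.
    __global__ void spmvCsr(int nRows, const int* rowPtr, const int* colIdx,
                            const float* val, const float* x, float* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];   // accumulate this row's nonzeros
        y[row] = sum;
    }

Even this simple one-thread-per-row mapping exhibits the portability problem the paper raises: it is efficient for matrices with short, uniform rows, while other matrices and other GPUs favor, say, a warp-per-row variant, so no single fixed mapping serves every user well.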




Multi-scale Unbiased Diffeomorphic Atlas Construction on Multi-GPUs
L.K. Ha, J. Krueger, S. Joshi, C.T. Silva. Vol. 1, Ch. 10, Morgan Kaufmann, pp. 42. 2011.

In this chapter, we present a high-performance multi-scale 3D image processing framework that exploits the parallel processing power of multiple graphics processing units (multi-GPUs) for medical image analysis. We developed GPU algorithms and data structures that can be applied to a wide range of 3D image processing applications and that efficiently exploit the computational power and massive bandwidth offered by modern GPUs. Our framework helps scientists solve computationally intensive problems which previously required supercomputing power. To demonstrate the effectiveness of our framework and to compare it to existing techniques, we focus our discussion on atlas construction - an application central to understanding the development of the brain and the progression of brain diseases.
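
The underlying multi-GPU pattern can be sketched in a few lines of CUDA (illustrative only, with a trivial stand-in kernel): give each device a slice of the image stack, launch on all devices asynchronously, then synchronize:

    #include <cuda_runtime.h>
    #include <vector>

    __global__ void smooth(float* img, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) img[i] *= 0.5f;             // stand-in for a real image operator
    }

    // devSlices[g] must have been allocated on device g (e.g., via cudaMalloc
    // after cudaSetDevice(g)); each GPU then processes its slice concurrently.
    void processAcrossGpus(const std::vector<float*>& devSlices, int sliceLen)
    {
        int nGpus = 0;
        cudaGetDeviceCount(&nGpus);
        for (int g = 0; g < nGpus; ++g) {
            cudaSetDevice(g);                  // route the next launch to GPU g
            smooth<<<(sliceLen + 255) / 256, 256>>>(devSlices[g], sliceLen);
        }
        for (int g = 0; g < nGpus; ++g) {      // wait for every device to finish
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
    }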




Parallel Visualization on Large Clusters using MapReduce
H.T. Vo, J. Bronson, B. Summa, J.L.D. Comba, J. Freire, B. Howe, V. Pascucci, C.T. Silva. In Proceedings of the 2011 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), pp. 81--88. 2011.

Large-scale visualization systems are typically designed to efficiently "push" datasets through the graphics hardware. However, exploratory visualization systems are increasingly expected to support scalable data manipulation, restructuring, and querying capabilities in addition to core visualization algorithms. We posit that new emerging abstractions for parallel data processing, in particular computing clouds, can be leveraged to support large-scale data exploration through visualization. In this paper, we take a first step in evaluating the suitability of the MapReduce framework to implement large-scale visualization techniques. MapReduce is a lightweight, scalable, general-purpose parallel data processing framework increasingly popular in the context of cloud computing. Specifically, we implement and evaluate a representative suite of visualization tasks (mesh rendering, isosurface extraction, and mesh simplification) as MapReduce programs, and report quantitative performance results applying these algorithms to realistic datasets. For example, we perform isosurface extraction of up to 16 isovalues for volumes composed of 27 billion voxels, simplification of meshes with 30 GB of data, and subsequent rendering with image resolutions up to 80000² pixels. Our results indicate that the parallel scalability, ease of use, ease of access to computing resources, and fault-tolerance of MapReduce offer a promising foundation for a combined data manipulation and data visualization system deployed in a public cloud or a local commodity cluster.
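
To see the shape of such a formulation, consider the rendering task: "map" rasterizes triangles into (pixel, fragment) pairs, and "reduce" keeps the nearest fragment per pixel, i.e., z-buffering expressed as a keyed reduction. A toy, self-contained C++ sketch of the grouping and reduction (far simpler than the paper's programs):

    #include <cstdint>
    #include <map>
    #include <vector>

    struct Fragment { float depth; std::uint32_t color; };

    int main()
    {
        // Pretend these pairs were emitted by mapping over two triangles.
        std::vector<std::pair<std::uint64_t, Fragment>> emitted = {
            {42, {0.7f, 0xff0000ffu}}, {42, {0.3f, 0x00ff00ffu}}, {43, {0.5f, 0x0000ffffu}},
        };
        std::map<std::uint64_t, Fragment> framebuffer;   // shuffle + reduce
        for (const auto& [pixel, frag] : emitted) {
            auto it = framebuffer.find(pixel);
            if (it == framebuffer.end() || frag.depth < it->second.depth)
                framebuffer[pixel] = frag;               // keep the nearest fragment
        }
        return 0;                                        // pixel 42 keeps depth 0.3
    }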




Hybrid CPU-GPU Solver for Gradient Domain Processing of Massive Images
S. Philip, B. Summa, P-T Bremer, V. Pascucci. In Proceedings of 2011 International Conference on Parallel and Distributed Systems (ICPADS), pp. 244--251. 2011.

Gradient domain processing is a computationally expensive image processing technique. Its use for processing massive images, gigapixels or terapixels in size, can take several hours with serial techniques. To address this challenge, parallel algorithms are being developed to make this class of techniques applicable to the largest images available, with running times that are more acceptable to users. To this end we target the most ubiquitous form of computing power available today: small or medium scale clusters of commodity hardware. Such clusters are continuously increasing in scale, not only in the number of nodes, but also in the amount of parallelism available within each node in the form of multicore CPUs and GPUs. In this paper we present a hybrid parallel implementation of gradient domain processing for seamless stitching of gigapixel panoramas that utilizes MPI, threading, and a CUDA-based GPU component. We demonstrate the performance and scalability of our implementation by presenting results from two GPU clusters processing two large data sets.
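
For background, the standard gradient-domain formulation (not specific to this paper's parallelization) seeks the image I whose gradient best matches a target gradient field G:

    \min_I \int_\Omega \| \nabla I - G \|^2 \, dA \quad \Longrightarrow \quad \nabla^2 I = \nabla \cdot G

The minimizer satisfies the Poisson equation on the right, a sparse linear system with one unknown per pixel; at gigapixel scale this is exactly the workload the hybrid MPI/threading/CUDA implementation above is designed to handle.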




Abstract Visualization of Runtime Memory Behavior
A.N.M. Imroz Choudhury, P. Rosen. In 6th IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT 2011), pp. 22--29. 2011.

We present a system for visualizing memory reference traces, the records of the memory transactions performed by a program at runtime. The visualization consists of a structured layout representing the levels of a cache and a set of data glyphs representing the pieces of data in memory being operated on during application runtime. The data glyphs move in response to events generated by a cache simulator, indicating their changing residency in the various levels of the memory hierarchy. Within the levels, the glyphs arrange themselves into higher-order shapes representing the structure of the cache levels, including the composition of their associative cache sets and eviction ordering. We make careful use of different visual channels, including structure, motion, color, and size, to convey salient events as they occur. Our abstract visualization provides a high-level, global view of memory behavior, while giving insight about important events that may help students or software engineers to better understand their software’s performance and behavior.



 
Optimal Multi-Image Processing Streaming Framework on Parallel Heterogeneous Systems
L.K. Ha, J. Krueger, J. Comba, S. Joshi, C.T. Silva. In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization 2011 (Best Paper Award), pp. 1--10. 2011.
DOI: 10.2312/EGPGV/EGPGV11/001-010

Atlas construction is an important technique in medical image analysis that plays a central role in understanding the variability of brain anatomy. The construction often requires applying image processing operations to multiple images (often hundreds of volumetric datasets), which is challenging in terms of both computational power and memory requirements. In this paper we introduce MIP, a Multi-Image Processing streaming framework that harnesses the processing power of heterogeneous CPU/GPU systems. In MIP we introduce specially designed streaming algorithms and data structures that provide an optimal solution for out-of-core multi-image processing problems, both in terms of memory usage and computational efficiency. MIP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, with computationally intensive problems, the MIP out-of-core solution can achieve the same performance as the in-core solution. We demonstrate the efficiency of the MIP framework on synthetic and real datasets.
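
The asynchronous mechanism referred to above is, at its core, double buffering: with two CUDA streams, the upload of chunk i+1 overlaps the processing of chunk i. A minimal sketch (ours, not the MIP framework itself):

    #include <cuda_runtime.h>

    __global__ void process(float* chunk, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] += 1.0f;           // stand-in for the real image operator
    }

    // hostPinned must be page-locked (cudaHostAlloc) for copies to overlap
    // compute; a real pipeline would also copy results back to the host.
    void streamChunks(const float* hostPinned, int nChunks, int chunkLen)
    {
        cudaStream_t s[2];
        float* dev[2];
        for (int b = 0; b < 2; ++b) {
            cudaStreamCreate(&s[b]);
            cudaMalloc(&dev[b], chunkLen * sizeof(float));
        }
        for (int c = 0; c < nChunks; ++c) {
            int b = c & 1;                     // alternate buffers and streams
            cudaMemcpyAsync(dev[b], hostPinned + (size_t)c * chunkLen,
                            chunkLen * sizeof(float), cudaMemcpyHostToDevice, s[b]);
            process<<<(chunkLen + 255) / 256, 256, 0, s[b]>>>(dev[b], chunkLen);
        }
        cudaDeviceSynchronize();               // drain both pipelines
        for (int b = 0; b < 2; ++b) { cudaFree(dev[b]); cudaStreamDestroy(s[b]); }
    }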



 