
SCI Publications

2016


A. V. P. Grosset, M. Prasad, C. Christensen, A. Knoll, C. Hansen. “TOD-Tree: Task-Overlapped Direct send Tree Image Compositing for Hybrid MPI Parallelism and GPUs,” In IEEE Transactions on Visualization and Computer Graphics, IEEE, pp. 1--1. 2016.
DOI: 10.1109/tvcg.2016.2542069

ABSTRACT

Modern supercomputers have thousands of nodes, each with CPUs and/or GPUs capable of several teraflops. However, the network connecting these nodes is relatively slow, on the order of gigabits per second. For time-critical workloads such as interactive visualization, the bottleneck is no longer computation but communication. In this paper, we present an image compositing algorithm that works on both CPU-only and GPU-accelerated supercomputers and focuses on communication avoidance and overlapping communication with computation at the expense of evenly balancing the workload. The algorithm has three stages: a parallel direct send stage, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting on the Stampede and Edison supercomputers, show strong scaling results and explain how we generally achieve better performance than these two algorithms. We developed a GPU-based image compositing algorithm where we use CUDA kernels for computation and GPU Direct RDMA for inter-node GPU communication. We tested the algorithm on the Piz Daint GPU-accelerated supercomputer and show that we achieve performance on par with CPUs. Lastly, we introduce a workflow in which both rendering and compositing are done on the GPU.


2015


M. Kim, C.D. Hansen. “GPU Surface Extraction with the Closest Point Embedding,” In Proceedings of IS&T/SPIE Visualization and Data Analysis, February, 2015.

ABSTRACT

Isosurface extraction is a fundamental technique used for both surface reconstruction and mesh generation. One method to extract well-formed isosurfaces is a particle system; unfortunately, particle systems can be slow. In this paper, we introduce an enhanced parallel particle system that uses the closest point embedding as the surface representation to speed up the particle system for isosurface extraction. The closest point embedding is used in the Closest Point Method (CPM), a technique that uses a standard three-dimensional numerical PDE solver on two-dimensional embedded surfaces. To fully take advantage of the closest point embedding, it is coupled with a Barnes-Hut tree code on the GPU. This new technique produces well-formed, conformal unstructured triangular and tetrahedral meshes from labeled multi-material volume datasets. Further, this new parallel implementation of the particle system is faster than any known methods for conformal multi-material mesh extraction. The resulting speed-ups gained in this implementation can reduce the time from labeled data to mesh from hours to minutes and benefit users, such as bioengineers, who employ triangular and tetrahedral meshes.

Keywords: scalar field methods, GPGPU, curvature based, scientific visualization



B. Peterson, H. K. Dasari, A. Humphrey, J.C. Sutherland, T. Saad, M. Berzins. “Reducing overhead in the Uintah framework to support short-lived tasks on GPU-heterogeneous architectures,” In Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC'15), ACM, pp. 4:1-4:8. 2015.
DOI: 10.1145/2830018.2830023


2014


Z. Fu, H.K. Dasari, M. Berzins, B. Thompson. “Parallel Breadth First Search on GPU Clusters,” SCI Technical Report, No. UUSCI-2014-002, SCI Institute, University of Utah, 2014.

ABSTRACT

Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low-level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.

Keywords: GPU cluster, MPI, BFS, graph, parallel graph algorithm
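
For readers unfamiliar with the BFS kernel mentioned in the abstract above, the following is a minimal single-process sketch of level-synchronous breadth-first search in Python. It is not the authors' GPU or MPI implementation; the function name `bfs_levels` and the dictionary-based adjacency format are assumptions made only for illustration. It shows the frontier-by-frontier expansion that distributed BFS codes typically parallelize.

```python
def bfs_levels(adjacency, source):
    """Level-synchronous BFS (illustrative sketch, not the paper's code).

    adjacency: dict mapping each vertex to an iterable of its neighbors
               (hypothetical input format chosen for this example).
    Returns a dict mapping every reachable vertex to its BFS level.
    """
    levels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In a parallel setting, the vertices of one frontier are
        # expanded concurrently; here they are visited sequentially.
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v not in levels:
                    levels[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
    print(bfs_levels(graph, 0))
```

Each pass over `frontier` corresponds to one BFS level, so the number of iterations of the outer loop equals the depth of the search tree plus one.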



Z. Fu, H.K. Dasari, M. Berzins, B. Thompson. “Parallel Breadth First Search on GPU Clusters,” In Proceedings of the IEEE BigData 2014 Conference, Washington DC, October, 2014.

ABSTRACT

Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low-level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.

Keywords: GPU cluster, MPI, BFS, graph, parallel graph algorithm



Qingyu Meng. “Large-Scale Distributed Runtime System for DAG-Based Computational Framework,” Note: Ph.D. in Computer Science, advisor Martin Berzins, School of Computing, University of Utah, August, 2014.

ABSTRACT

Recent trends in high performance computing present larger and more diverse computers using multicore nodes possibly with accelerators and/or coprocessors and reduced memory. These changes pose formidable challenges for applications code to attain scalability. Software frameworks that execute machine-independent applications code using a runtime system that shields users from architectural complexities offer a portable solution for easy programming. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. However, the original Uintah code had limited scalability as tasks were run in a predefined order based solely on static analysis of the task graph and used only message passing interface (MPI) for parallelism. By using a new hybrid multithread and MPI runtime system, this research has made it possible for Uintah to scale to 700K central processing unit (CPU) cores when solving challenging fluid-structure interaction problems. Those problems often involve moving objects with adaptive mesh refinement and thus with highly variable and unpredictable work patterns. This research has also demonstrated an ability to run capability jobs on the heterogeneous systems with Nvidia graphics processing unit (GPU) accelerators or Intel Xeon Phi coprocessors. The new runtime system for Uintah executes directed acyclic graphs of computational tasks with a scalable asynchronous and dynamic runtime system for multicore CPUs and/or accelerators/coprocessors on a node. Uintah's clear separation between application and runtime code has led to scalability increases without significant changes to application code. This research concludes that the adaptive directed acyclic graph (DAG)-based approach provides a very powerful abstraction for solving challenging multiscale multiphysics engineering problems. 
Excellent scalability with regard to the different processors and communications performance is achieved on some of the largest and most powerful computers available today.



N.P. Singh, J. Hinkle, S. Joshi, P.T. Fletcher. “An Efficient Parallel Algorithm for Hierarchical Geodesic Models in Diffeomorphisms,” In Proceedings of the 2014 IEEE International Symposium on Biomedical Imaging (ISBI), pp. (accepted). 2014.

ABSTRACT

We present a novel algorithm for computing hierarchical geodesic models (HGMs) for diffeomorphic longitudinal shape analysis. The proposed algorithm exploits the inherent parallelism arising out of the independence in the contributions of individual geodesics to the group geodesic. The previous serial implementation severely limits the use of HGMs to very small population sizes due to computation time and massive memory requirements. The conventional method makes it impossible to estimate the parameters of HGMs on large datasets due to limited memory available onboard current GPU computing devices. The proposed parallel algorithm easily scales to solve HGMs on a large collection of 3D images of several individuals. We demonstrate its effectiveness on longitudinal datasets of synthetically generated shapes and 3D magnetic resonance brain images (MRI).

Keywords: LDDMM, HGM, Vector Momentum, Diffeomorphisms, Longitudinal Analysis


2013


T. Fogal, A. Schiewe, J. Krüger. “An Analysis of Scalable GPU-Based Ray-Guided Volume Rendering,” In 2013 IEEE Symposium on Large Data Analysis and Visualization (LDAV), 2013.

ABSTRACT

Volume rendering continues to be a critical method for analyzing large-scale scalar fields, in disciplines as diverse as biomedical engineering and computational fluid dynamics. Commodity desktop hardware has struggled to keep pace with data size increases, challenging modern visualization software to deliver responsive interactions for O(N^3) algorithms such as volume rendering. We target the data type common in these domains: regularly-structured data.

In this work, we demonstrate that the major limitation of most volume rendering approaches is their inability to switch the data sampling rate (and thus data size) quickly. Using a volume renderer inspired by recent work, we demonstrate that the actual amount of visualizable data for a scene is typically bound considerably lower than the memory available on a commodity GPU. Our instrumented renderer is used to investigate design decisions typically swept under the rug in volume rendering literature. The renderer is freely available, with binaries for all major platforms as well as full source code, to encourage reproduction and comparison with future research.



A. Knoll, I. Wald, P. Navratil, M. E. Papka, K. P. Gaither. “Ray Tracing and Volume Rendering Large Molecular Data on Multi-core and Many-core Architectures,” In Proc. 8th International Workshop on Ultrascale Visualization at SC13 (Ultravis), 2013.

ABSTRACT

Visualizing large molecular data requires efficient means of rendering millions of data elements that combine glyphs, geometry and volumetric techniques. The geometric and volumetric loads challenge traditional rasterization-based visualization methods. Ray casting presents a scalable and memory-efficient alternative, but modern techniques typically rely on GPU-based acceleration to achieve interactive rendering rates. In this paper, we present bnsView, a molecular visualization ray tracing framework that delivers fast volume rendering and ball-and-stick ray casting on both multi-core CPUs and many-core Intel® Xeon Phi™ co-processors, implemented in a SPMD language that generates efficient SIMD vector code for multiple platforms without source modification. We show that our approach running on co-processors is competitive with similar techniques running on GPU accelerators, and we demonstrate large-scale parallel remote visualization from TACC's Stampede supercomputer to large-format display walls using this system.



Q. Meng, A. Humphrey, J. Schmidt, M. Berzins. “Preliminary Experiences with the Uintah Framework on Intel Xeon Phi and Stampede,” SCI Technical Report, No. UUSCI-2013-002, SCI Institute, University of Utah, 2013.

ABSTRACT

In this work, we describe our preliminary experiences on the Stampede system in the context of the Uintah Computational Framework. Uintah was developed to provide an environment for solving a broad class of fluid-structure interaction problems on structured adaptive grids. Uintah uses a combination of fluid-flow solvers and particle-based methods, together with a novel asynchronous task-based approach and fully automated load balancing. While we have designed scalable Uintah runtime systems for large CPU core counts, the emergence of heterogeneous systems presents considerable challenges in terms of effectively utilizing additional on-node accelerators and co-processors, deep memory hierarchies, as well as managing multiple levels of parallelism. Our recent work has addressed the emergence of heterogeneous CPU/GPU systems with the design of a Unified heterogeneous runtime system, enabling Uintah to fully exploit these architectures with support for asynchronous, out-of-order scheduling of both CPU and GPU computational tasks. Using this design, Uintah has run at full scale on the Keeneland System and TitanDev. With the release of the Intel Xeon Phi co-processor and the recent availability of the Stampede system, we show that Uintah may be modified to utilize such a co-processor based system. We also explore the different usage models provided by the Xeon Phi with the aim of understanding portability of a general purpose framework like Uintah to this architecture. These usage models range from the pragma based offload model to the more complex symmetric model, utilizing all co-processor and host CPU cores simultaneously. We provide preliminary results of the various usage models for a challenging adaptive mesh refinement problem, as well as a detailed account of our experience adapting Uintah to run on the Stampede system.
Our conclusion is that while the Stampede system is easy to use, obtaining high performance from the Xeon Phi co-processors requires a substantial but different investment to that needed for GPU-based systems.

Keywords: Uintah, hybrid parallelism, scalability, parallel, adaptive, MIC, Xeon Phi, heterogeneous systems, Stampede, co-processor



Q. Meng, A. Humphrey, J. Schmidt, M. Berzins. “Investigating Applications Portability with the Uintah DAG-based Runtime System on PetaScale Supercomputers,” In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 96:1--96:12. 2013.
ISBN: 978-1-4503-2378-9
DOI: 10.1145/2503210.2503250

ABSTRACT

Present trends in high performance computing present formidable challenges for applications code using multicore nodes possibly with accelerators and/or co-processors and reduced memory while still attaining scalability. Software frameworks that execute machine-independent applications code using a runtime system that shields users from architectural complexities offer a possible solution. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. Uintah executes directed acyclic graphs of computational tasks with a scalable asynchronous and dynamic runtime system for CPU cores and/or accelerators/co-processors on a node. Uintah's clear separation between application and runtime code has led to scalability increases of 1000x without significant changes to application code. This methodology is tested on three leading Top500 machines: OLCF Titan, TACC Stampede and ALCF Mira, using three diverse and challenging applications problems. This investigation of scalability with regard to the different processors and communications performance leads to the overall conclusion that the adaptive DAG-based approach provides a very powerful abstraction for solving challenging multi-scale multi-physics engineering problems on some of the largest and most powerful computers available today.

Keywords: Blue Gene/Q, GPU, Xeon Phi, adaptive, application, co-processor, heterogeneous systems, hybrid parallelism, parallel, scalability, software, uintah, NETL



Q. Meng, A. Humphrey, J. Schmidt, M. Berzins. “Preliminary Experiences with the Uintah Framework on Intel Xeon Phi and Stampede,” In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery (XSEDE 2013), San Diego, California, pp. 48:1--48:8. 2013.
DOI: 10.1145/2484762.2484779

ABSTRACT

In this work, we describe our preliminary experiences on the Stampede system in the context of the Uintah Computational Framework. Uintah was developed to provide an environment for solving a broad class of fluid-structure interaction problems on structured adaptive grids. Uintah uses a combination of fluid-flow solvers and particle-based methods, together with a novel asynchronous task-based approach and fully automated load balancing. While we have designed scalable Uintah runtime systems for large CPU core counts, the emergence of heterogeneous systems presents considerable challenges in terms of effectively utilizing additional on-node accelerators and co-processors, deep memory hierarchies, as well as managing multiple levels of parallelism. Our recent work has addressed the emergence of heterogeneous CPU/GPU systems with the design of a Unified heterogeneous runtime system, enabling Uintah to fully exploit these architectures with support for asynchronous, out-of-order scheduling of both CPU and GPU computational tasks. Using this design, Uintah has run at full scale on the Keeneland System and TitanDev. With the release of the Intel Xeon Phi co-processor and the recent availability of the Stampede system, we show that Uintah may be modified to utilize such a co-processor based system. We also explore the different usage models provided by the Xeon Phi with the aim of understanding portability of a general purpose framework like Uintah to this architecture. These usage models range from the pragma based offload model to the more complex symmetric model, utilizing all co-processor and host CPU cores simultaneously. We provide preliminary results of the various usage models for a challenging adaptive mesh refinement problem, as well as a detailed account of our experience adapting Uintah to run on the Stampede system. 
Our conclusion is that while the Stampede system is easy to use, obtaining high performance from the Xeon Phi co-processors requires a substantial but different investment to that needed for GPU-based systems.

Keywords: MIC, Xeon Phi, adaptive, co-processor, heterogeneous systems, hybrid parallelism, parallel, scalability, stampede, uintah, c-safe



Y. Wan, H. Otsuna, C.D. Hansen. “Synthetic Brainbows,” In Computer Graphics Forum, Vol. 32, No. 3pt4, Wiley-Blackwell, pp. 471--480. June, 2013.
DOI: 10.1111/cgf.12134

ABSTRACT

Brainbow is a genetic engineering technique that randomly colorizes cells. Biological samples processed with this technique and imaged with confocal microscopy have distinctive colors for individual cells. Complex cellular structures can then be easily visualized. However, the complexity of the Brainbow technique limits its applications. In practice, most confocal microscopy scans use different fluorescence staining with typically at most three distinct cellular structures. These structures are often packed and obscure each other in rendered images, making analysis difficult. In this paper, we leverage a process known as GPU framebuffer feedback loops to synthesize Brainbow-like images. In addition, we incorporate ID shuffling and Monte-Carlo sampling into our technique, so that it can be applied to single-channel confocal microscopy data. The synthesized Brainbow images are presented to domain experts with positive feedback. A user survey demonstrates that our synthetic Brainbow technique improves visualizations of volume data with complex structures for biologists.



Y. Wan. “Fluorender, An Interactive Tool for Confocal Microscopy Data Visualization and Analysis,” Note: Ph.D. Thesis, School of Computing, University of Utah, June, 2013.

ABSTRACT

Confocal microscopy has become a popular imaging technique in biology research in recent years. It is often used to study three-dimensional (3D) structures of biological samples. Confocal data are commonly multi-channel, with each channel resulting from a different fluorescent staining. This technique also results in finely detailed structures in 3D, such as neuron fibers. Despite the plethora of volume rendering techniques that have been available for many years, there is a demand from biologists for a flexible tool that allows interactive visualization and analysis of multi-channel confocal data. Together with biologists, we have designed and developed FluoRender. It incorporates volume rendering techniques such as a two-dimensional (2D) transfer function and multi-channel intermixing. Rendering results can be enhanced through tone-mappings and overlays. To facilitate analyses of confocal data, FluoRender provides interactive operations for extracting complex structures. Furthermore, we developed the Synthetic Brainbow technique, which takes advantage of the asynchronous behavior in Graphics Processing Unit (GPU) framebuffer loops and generates random colorizations for different structures in single-channel confocal data. The results from our Synthetic Brainbows, when applied to a sequence of developing cells, can then be used for tracking the movements of these cells. Finally, we present an application of FluoRender in the workflow of constructing anatomical atlases.

Keywords: confocal microscopy, visualization, software



L. Zhou, C.D. Hansen. “Interactive rendering and efficient querying for large multivariate seismic volumes on consumer level PCs,” In Proceedings of the 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), pp. 117--118. 2013.
DOI: 10.1109/LDAV.2013.6675167

ABSTRACT

We present a volume visualization method that allows interactive rendering and efficient querying of large multivariate seismic volume data on consumer level PCs. The volume rendering pipeline utilizes a virtual memory structure that supports out-of-core multivariate multi-resolution data and a GPU-based ray caster that allows interactive multivariate transfer function design. A Gaussian mixture model representation is precomputed, and nearly interactive querying is achieved by testing the Gaussian functions against user-defined transfer functions on the GPU at runtime. Finally, the method has been tested on a multivariate 3D seismic dataset that is larger than the main memory of the testing machine.


2012


Y.-J. Ahn, C. Hoffmann, P. Rosen. “A Note on Circle Packing,” In Journal of Zhejiang University SCIENCE C, Vol. 13, No. 8, pp. 559--564. 2012.

ABSTRACT

The problem of packing circles into a domain of prescribed topology is considered. The circles need not have equal radii. The Collins-Stephenson algorithm computes such a circle packing. This algorithm is parallelized in two different ways and its performance is reported for a triangular, planar domain test case. The implementation uses the highly parallel graphics processing unit (GPU) on commodity hardware. The speedups so achieved are discussed based on a number of experiments.

Keywords: Circle packing, Algorithm performance, Parallel computation, Graphics processing unit (GPU)



C. Brownlee, J. Patchett, L.-T. Lo, D. DeMarle, C. Mitchell, J. Ahrens, C.D. Hansen. “A Study of Ray Tracing Large-scale Scientific Data in Parallel Visualization Applications,” In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization (2012), Edited by H. Childs and T. Kuhlen and F. Marton, pp. 51--60. 2012.

ABSTRACT

Large-scale analysis and visualization is becoming increasingly important as supercomputers and their simulations produce larger and larger data. These large data sizes are pushing the limits of traditional rendering algorithms and tools, thus motivating a study exploring these limits and their possible resolutions through alternative rendering algorithms. In order to better understand real-world performance with large data, this paper presents a detailed timing study on a large cluster with the widely used visualization tools ParaView and VisIt. The software ray tracer Manta was integrated into these programs in order to show that improved performance could be attained with software ray tracing on a distributed memory, GPU enabled, parallel visualization resource. Using the Texas Advanced Computing Center’s Longhorn cluster which has multi-core CPUs and GPUs with large-scale polygonal data, we find multi-core CPU ray tracing to be significantly faster than both software rasterization and hardware-accelerated rasterization in existing scientific visualization tools with large data.

Keywords: kaust, scidac



C.-S. Chiang, C. Hoffmann, P. Rosen. “A Generalized Malfatti Problem,” In Computational Geometry: Theory and Applications, Vol. 45, No. 8, pp. 425--435. 2012.

ABSTRACT

Malfatti's problem, first published in 1803, is commonly understood to ask for three circles to be fitted into a given triangle such that they are externally tangent to each other and such that each circle is tangent to a pair of the triangle's sides. There are many solutions based on geometric constructions, as well as generalizations in which the triangle sides are assumed to be circle arcs. A generalization that asks to fit six circles into the triangle, tangent to each other and to the triangle sides, has been considered a good example of a problem that requires sophisticated numerical iteration to solve by computer. We analyze this problem and show how to solve it quickly.

Keywords: Malfatti's problem, circle packing, geometric constraint solving, GPU programming



A. Duchowski, M. Price, M.D. Meyer, P. Orero. “Aggregate Gaze Visualization with Real-Time Heatmaps,” In Proceedings of the ACM Symposium on Eye Tracking Research and Applications (ETRA), pp. 13--20. 2012.
DOI: 10.1145/2168556.2168558

ABSTRACT

A GPU implementation is given for real-time visualization of aggregate eye movements (gaze) via heatmaps. Parallelization of the algorithm leads to substantial speedup over its CPU-based implementation and, for the first time, allows real-time rendering of heatmaps atop video. GLSL shader colorization allows the choice of color ramps. Several luminance-based color maps are advocated as alternatives to the popular rainbow color map, considered inappropriate (harmful) for depiction of (relative) gaze distributions.
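
As a rough illustration of the heatmap accumulation that the abstract above describes, the following CPU-only Python sketch sums a Gaussian kernel around each gaze point and normalizes the result. It is not the authors' GLSL implementation; the function name `gaze_heatmap`, the `sigma` parameter, and the choice of an isotropic Gaussian kernel are assumptions for this example. The per-pixel inner loop is the part that maps naturally onto one GPU fragment or thread per pixel.

```python
import math


def gaze_heatmap(points, width, height, sigma=1.0):
    """Accumulate a normalized gaze heatmap (illustrative sketch only).

    points: iterable of (x, y) gaze positions in pixel coordinates.
    sigma:  standard deviation of the assumed Gaussian kernel, in pixels.
    Returns a height x width list of lists with values in [0, 1].
    """
    heat = [[0.0] * width for _ in range(height)]
    for px, py in points:
        # Every pixel is independent of the others, which is why this
        # step parallelizes well on a GPU.
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma * sigma))
    peak = max(max(row) for row in heat)
    if peak > 0.0:
        heat = [[value / peak for value in row] for row in heat]
    return heat


if __name__ == "__main__":
    for row in gaze_heatmap([(2, 2)], 5, 5):
        print(" ".join(f"{value:.2f}" for value in row))
```

The normalized intensities can then be mapped through any color ramp, for instance one of the luminance-based maps the abstract advocates.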



L.K. Ha, J. Krüger, J.L.D. Comba, C.T. Silva, S. Joshi. “ISP: An Optimal Out-of-Core Image-Set Processing Streaming Architecture for Parallel Heterogeneous Systems,” In IEEE Transactions on Visualization and Computer Graphics (TVCG), Vol. 18, No. 6, pp. 838--851. 2012.
DOI: 10.1109/TVCG.2012.32

ABSTRACT

Image population analysis is the class of statistical methods that plays a central role in understanding the development, evolution and disease of a population. However, these techniques often require excessive computational power and memory that are compounded with a large number of volumetric inputs. Restricted access to supercomputing power limits its influence in general research and practical applications. In this paper we introduce ISP, an Image-Set Processing streaming framework that harnesses the processing power of commodity heterogeneous CPU/GPU systems and attempts to solve this computational problem. In ISP we introduce specially-designed streaming algorithms and data structures that provide an optimal solution for out-of-core multi-image processing problems both in terms of memory usage and computational efficiency. ISP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, with computationally intensive problems, the ISP out-of-core solution can achieve the same performance as the in-core solution. We demonstrate the efficiency of the ISP framework on synthetic and real datasets.