
SCI Publications

2019


T. Athawale, C. R. Johnson. “Probabilistic Asymptotic Decider for Topological Ambiguity Resolution in Level-Set Extraction for Uncertain 2D Data,” In IEEE Transactions on Visualization and Computer Graphics, Vol. 25, No. 1, IEEE, pp. 1163-1172. Jan, 2019.
DOI: 10.1109/TVCG.2018.2864505

ABSTRACT

We present a framework for the analysis of uncertainty in isocontour extraction. The marching squares (MS) algorithm for isocontour reconstruction generates a linear topology that is consistent with hyperbolic curves of a piecewise bilinear interpolation. The saddle points of the bilinear interpolant cause topological ambiguity in isocontour extraction. The midpoint decider and the asymptotic decider are well-known mathematical techniques for resolving topological ambiguities. The latter technique investigates the data values at the cell saddle points for ambiguity resolution. The uncertainty in data, however, leads to uncertainty in underlying bilinear interpolation functions for the MS algorithm, and hence, their saddle points. In our work, we study the behavior of the asymptotic decider when data at grid vertices is uncertain. First, we derive closed-form distributions characterizing variations in the saddle point values for uncertain bilinear interpolants. The derivation assumes uniform and nonparametric noise models, and it exploits the concept of ratio distribution for analytic formulations. Next, the probabilistic asymptotic decider is devised for ambiguity resolution in uncertain data using distributions of the saddle point values derived in the first step. Finally, the confidence in probabilistic topological decisions is visualized using a colormapping technique. We demonstrate the higher accuracy and stability of the probabilistic asymptotic decider in uncertain data with regard to existing decision frameworks, such as deciders in the mean field and the probabilistic midpoint decider, through the isocontour visualization of synthetic and real datasets.
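As a rough illustration of the ideas in this abstract, the sketch below implements the classic asymptotic decider for one bilinear cell and then estimates the probability of each topological choice when corner values carry uniform noise. The estimate uses plain Monte Carlo sampling, which is a stand-in chosen here for brevity; it is not the closed-form ratio-distribution derivation the paper describes. All function names and the corner ordering are assumptions made for this example.

```python
import random

def saddle_value(f00, f10, f01, f11):
    """Value of the bilinear interpolant at its saddle point.

    Corners are f(0,0), f(1,0), f(0,1), f(1,1) on the unit cell.
    """
    denom = f00 + f11 - f10 - f01
    if denom == 0:
        raise ValueError("interpolant is not hyperbolic; no saddle point")
    return (f00 * f11 - f10 * f01) / denom

def is_ambiguous(f00, f10, f01, f11, iso):
    """True when diagonal corners agree and adjacent corners disagree."""
    a, b, c, d = f00 > iso, f10 > iso, f01 > iso, f11 > iso
    return a == d and b == c and a != b

def asymptotic_decider(f00, f10, f01, f11, iso):
    """'join_high' connects the corners above iso; 'join_low' separates them."""
    return "join_high" if saddle_value(f00, f10, f01, f11) > iso else "join_low"

def join_high_probability(means, half_width, iso, trials=10000, seed=0):
    """Monte Carlo estimate (an assumption of this sketch, not the paper's
    closed form) of P('join_high') over ambiguous samples, with each corner
    drawn uniformly from [mean - half_width, mean + half_width]."""
    rng = random.Random(seed)
    hits = ambiguous = 0
    for _ in range(trials):
        corners = [rng.uniform(m - half_width, m + half_width) for m in means]
        if not is_ambiguous(*corners, iso):
            continue  # the decider only applies to ambiguous cells
        ambiguous += 1
        hits += asymptotic_decider(*corners, iso) == "join_high"
    return hits / ambiguous if ambiguous else float("nan")
```

For example, `join_high_probability((1.0, 0.0, 0.0, 1.0), 0.2, 0.5)` returns the fraction of noisy realizations in which the two high corners end up connected, which is the kind of per-cell confidence value that could then be color-mapped.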



A. Gyulassy, P.-T. Bremer, V. Pascucci. “Shared-Memory Parallel Computation of Morse-Smale Complexes with Improved Accuracy,” In IEEE Transactions on Visualization and Computer Graphics, Vol. 25, No. 1, IEEE, pp. 1183--1192. Jan, 2019.
DOI: 10.1109/tvcg.2018.2864848

ABSTRACT

Topological techniques have proven to be a powerful tool in the analysis and visualization of large-scale scientific data. In particular, the Morse-Smale complex and its various components provide a rich framework for robust feature definition and computation. Consequently, there now exist a number of approaches to compute Morse-Smale complexes for large-scale data in parallel. However, existing techniques are based on discrete concepts which produce the correct topological structure but are known to introduce grid artifacts in the resulting geometry. Here, we present a new approach that combines parallel streamline computation with combinatorial methods to construct a high-quality discrete Morse-Smale complex. In addition to being invariant to the orientation of the underlying grid, this algorithm allows users to selectively build a subset of features using high-quality geometry. In particular, a user may specifically select which ascending/descending manifolds are reconstructed with improved accuracy, focusing computational effort where it matters for subsequent analysis. This approach computes Morse-Smale complexes for larger data than previously feasible with significant speedups. We demonstrate and validate our approach using several examples from a variety of different scientific domains, and evaluate the performance of our method.



M. Han, I. Wald, W. Usher, Q. Wu, F. Wang, V. Pascucci, C. D. Hansen, C. R. Johnson. “Ray Tracing Generalized Tube Primitives: Method and Applications,” In Computer Graphics Forum, Vol. 38, No. 3, John Wiley & Sons Ltd., 2019.

ABSTRACT

We present a general high-performance technique for ray tracing generalized tube primitives. Our technique efficiently supports tube primitives with fixed and varying radii, general acyclic graph structures with bifurcations, and correct transparency with interior surface removal. Such tube primitives are widely used in scientific visualization to represent diffusion tensor imaging tractographies, neuron morphologies, and scalar or vector fields of 3D flow. We implement our approach within the OSPRay ray tracing framework, and evaluate it on a range of interactive visualization use cases of fixed- and varying-radius streamlines, pathlines, complex neuron morphologies, and brain tractographies. Our proposed approach provides interactive, high-quality rendering, with low memory overhead.



D. Hoang, P. Klacansky, H. Bhatia, P.-T. Bremer, P. Lindstrom, V. Pascucci. “A Study of the Trade-off Between Reducing Precision and Reducing Resolution for Data Analysis and Visualization,” In IEEE Transactions on Visualization and Computer Graphics, Vol. 25, No. 1, IEEE, pp. 1193--1203. Jan, 2019.
DOI: 10.1109/tvcg.2018.2864853

ABSTRACT

There currently exist two dominant strategies to reduce data sizes in analysis and visualization: reducing the precision of the data, e.g., through quantization, or reducing its resolution, e.g., by subsampling. Both have advantages and disadvantages and both face fundamental limits at which the reduced information ceases to be useful. The paper explores the additional gains that could be achieved by combining both strategies. In particular, we present a common framework that allows us to study the trade-off in reducing precision and/or resolution in a principled manner. We represent data reduction schemes as progressive streams of bits and study how various bit orderings such as by resolution, by precision, etc., impact the resulting approximation error across a variety of data sets as well as analysis tasks. Furthermore, we compute streams that are optimized for different tasks to serve as lower bounds on the achievable error. Scientific data management systems can use the results presented in this paper as guidance on how to store and stream data to make efficient use of the limited storage and bandwidth in practice.
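The two reduction strategies named in this abstract can be compared on a toy 1D signal. The sketch below is only an illustration under simple assumptions (unsigned integer samples, truncation of low-order bits for precision, nearest-neighbor hold for resolution); it is not the paper's bit-stream framework, and the function names are hypothetical.

```python
import math

def reduce_precision(values, bits_kept, total_bits=8):
    """Keep only the top `bits_kept` bits of each unsigned integer sample."""
    drop = total_bits - bits_kept
    return [(v >> drop) << drop for v in values]

def reduce_resolution(values, stride):
    """Keep every `stride`-th sample and hold it over the skipped positions."""
    return [values[(i // stride) * stride] for i in range(len(values))]

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# A smooth 8-bit test signal (assumed data, for illustration only).
signal = [int(127.5 + 127.5 * math.sin(2 * math.pi * i / 64)) for i in range(64)]

# Both variants spend half of the original bit budget.
half_precision = reduce_precision(signal, bits_kept=4)
half_resolution = reduce_resolution(signal, stride=2)
print("RMSE, 4 of 8 bits kept:", rmse(signal, half_precision))
print("RMSE, every 2nd sample:", rmse(signal, half_resolution))
```

Which variant gives the lower error depends on the signal: smooth data tends to tolerate subsampling, while noisy data tends to tolerate quantization, which is the kind of trade-off the abstract refers to.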



J. K. Holmen, B. Peterson, A. Humphrey, D. Sunderland, O. H. Diaz-Ibarra, J. N. Thornock, M. Berzins. “Portably Improving Uintah's Readiness for Exascale Systems Through the Use of Kokkos,” SCI Institute, 2019.

ABSTRACT

Uncertainty and diversity in future HPC systems, including those for exascale, make portable codebases desirable. To ease future ports, the Uintah Computational Framework has adopted the Kokkos C++ Performance Portability Library. This paper describes infrastructure advancements and performance improvements using partitioning functionality recently added to Kokkos within Uintah's MPI+Kokkos hybrid parallelism approach. Results are presented for two challenging calculations that have been refactored to support Kokkos::OpenMP and Kokkos::Cuda back-ends. These results demonstrate performance improvements up to (i) 2.66x when refactoring for portability, (ii) 81.59x when adding loop-level parallelism via Kokkos back-ends, and (iii) 2.63x when more efficiently using a node. Good strong-scaling characteristics to 442,368 threads across 1728 Knights Landing processors are also shown. These improvements have been achieved with little added overhead (sub-millisecond, consuming up to 0.18% of per-timestep time). Kokkos adoption and refactoring lessons are also discussed.



W. Usher, I. Wald, J. Amstutz, J. Gunther, C. Brownlee, V. Pascucci. “Scalable Ray Tracing Using the Distributed FrameBuffer,” In Eurographics Conference on Visualization (EuroVis) 2019, Vol. 38, No. 3, 2019.

ABSTRACT

Image- and data-parallel rendering across multiple nodes on high-performance computing systems is widely used in visualization to provide higher frame rates, support large data sets, and render data in situ. Specifically for in situ visualization, reducing bottlenecks incurred by the visualization and compositing is of key concern to reduce the overall simulation runtime. Moreover, prior algorithms have been designed to support either image- or data-parallel rendering and impose restrictions on the data distribution, requiring different implementations for each configuration. In this paper, we introduce the Distributed FrameBuffer, an asynchronous image-processing framework for multi-node rendering. We demonstrate that our approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distributions. By building on this framework, we extend the open-source ray tracing library OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications.



F. Wang, I. Wald, Q. Wu, W. Usher, C. R. Johnson. “CPU Isosurface Ray Tracing of Adaptive Mesh Refinement Data,” In IEEE Transactions on Visualization and Computer Graphics, Vol. 25, No. 1, IEEE, pp. 1142-1151. Jan, 2019.
DOI: 10.1109/TVCG.2018.2864850

ABSTRACT

Adaptive mesh refinement (AMR) is a key technology for large-scale simulations that allows for adaptively changing the simulation mesh resolution, resulting in significant computational and storage savings. However, visualizing such AMR data poses a significant challenge due to the difficulties introduced by the hierarchical representation when reconstructing continuous field values. In this paper, we detail a comprehensive solution for interactive isosurface rendering of block-structured AMR data. We contribute a novel reconstruction strategy—the octant method—which is continuous, adaptive and simple to implement. Furthermore, we present a generally applicable hybrid implicit isosurface ray-tracing method, which provides better rendering quality and performance than the built-in sampling-based approach in OSPRay. Finally, we integrate our octant method and hybrid isosurface geometry into OSPRay as a module, providing the ability to create high-quality interactive visualizations combining volume and isosurface representations of BS-AMR data. We evaluate the rendering performance, memory consumption and quality of our method on two gigascale block-structured AMR datasets.


2018


T.A.J. Ouermi, R. M. Kirby, M. Berzins. “Performance Optimization Strategies for WRF Physics Schemes Used in Weather Modeling,” In International Journal of Networking and Computing, Vol. 8, No. 2, IJNC, pp. 301--327. 2018.
DOI: 10.15803/ijnc.8.2_301

ABSTRACT

Performance optimization in the petascale era, and beyond it in the exascale era, has required and will continue to require modifications of legacy codes to take advantage of new architectures with large core counts and SIMD units. The Numerical Weather Prediction (NWP) physics codes considered here are optimized using thread-local structures of arrays (SOA). High-level and low-level optimization strategies are applied to the WRF Single-Moment 6-Class Microphysics Scheme (WSM6) and Global Forecast System (GFS) physics codes used in the NEPTUNE forecast code. By building on previous work optimizing WSM6 on the Intel Knights Landing (KNL), it is shown how to further optimize WSM6, GFS physics, and GFS radiation on Intel KNL, Haswell, and potentially on future micro-architectures with many cores and SIMD vector units. The optimization techniques used herein employ thread-local structures of arrays (SOA), an OpenMP directive, OMP SIMD, and minor code transformations to enable better utilization of SIMD units, increase parallelism, improve locality, and reduce memory traffic. The optimized versions of WSM6, GFS physics, and GFS radiation run 70x, 27x, and 23x faster (respectively) on KNL and 26x, 18x, and 30x faster (respectively) on Haswell than their respective original serial versions. Although this work targets WRF physics schemes, the findings are transferable to other performance optimization contexts and provide insight into the optimization of codes with complex physical models for present and near-future architectures with many cores and vector units.



S. Petruzza, A. Gyulassy, V. Pascucci, P. T. Bremer. “A Task-Based Abstraction Layer for User Productivity and Performance Portability in Post-Moore’s Era Supercomputing,” In 3RD INTERNATIONAL WORKSHOP ON POST-MOORE’S ERA SUPERCOMPUTING (PMES), 2018.

ABSTRACT

The proliferation of heterogeneous computing architectures in current and future supercomputing systems dramatically increases the complexity of software development and exacerbates the divergence of software stacks. Currently, task-based runtimes attempt to alleviate these impediments; however, their effective use requires expertise and deep integration that does not facilitate reuse and portability. We propose to introduce a task-based abstraction layer that separates the definition of the algorithm from the runtime-specific implementation, while maintaining performance portability.



W. Usher, P. Klacansky, F. Federer, P.-T. Bremer, A. Knoll, J. Yarch, A. Angelucci, V. Pascucci. “A virtual reality visualization tool for neuron tracing,” In IEEE Transactions on Visualization and Computer Graphics, Vol. 24, No. 1, IEEE, pp. 994--1003. Jan, 2018.
DOI: 10.1109/tvcg.2017.2744079

ABSTRACT

Tracing neurons in large-scale microscopy data is crucial to establishing a wiring diagram of the brain, which is needed to understand how neural circuits in the brain process information and generate behavior. Automatic techniques often fail for large and complex datasets, and connectomics researchers may spend weeks or months manually tracing neurons using 2D image stacks. We present a design study of a new virtual reality (VR) system, developed in collaboration with trained neuroanatomists, to trace neurons in microscope scans of the visual cortex of primates. We hypothesize that using consumer-grade VR technology to interact with neurons directly in 3D will help neuroscientists better resolve complex cases and enable them to trace neurons faster and with less physical and mental strain. We discuss both the design process and technical challenges in developing an interactive system to navigate and manipulate terabyte-sized image volumes in VR. Using a number of different datasets, we demonstrate that, compared to widely used commercial software, consumer-grade VR presents a promising alternative for scientists.



W. Usher, S. Rizzi, I. Wald, J. Amstutz, J. Insley, V. Vishwanath, N. Ferrier, M. E. Papka, V. Pascucci. “libIS: A Lightweight Library for Flexible In Transit Visualization,” In Proceedings of the Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, ACM Press, 2018.
DOI: 10.1145/3281464.3281466

ABSTRACT

As simulations grow in scale, the need for in situ analysis methods to handle the large data produced grows correspondingly. One desirable approach to in situ visualization is in transit visualization. By decoupling the simulation and visualization code, in transit approaches alleviate common difficulties with regard to the scalability of the analysis, ease of integration, usability, and impact on the simulation. We present libIS, a lightweight, flexible library which lowers the bar for using in transit visualization. Our library works on the concept of abstract regions of space containing data, which are transferred from the simulation to the visualization clients upon request, using a client-server model. We also provide a SENSEI analysis adaptor, which allows for transparent deployment of in transit visualization. We demonstrate the flexibility of our approach on batch analysis and interactive visualization use cases on different HPC resources.


2017


J. K. Holmen, A. Humphrey, D. Sunderland, M. Berzins. “Improving Uintah's Scalability Through the Use of Portable Kokkos-Based Data Parallel Tasks,” In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC17, No. 27, pp. 27:1--27:8. 2017.
ISBN: 978-1-4503-5272-7
DOI: 10.1145/3093338.3093388

ABSTRACT

The University of Utah's Carbon Capture Multidisciplinary Simulation Center (CCMSC) is using the Uintah Computational Framework to predict performance of a 1000 MWe ultra-supercritical clean coal boiler. The center aims to utilize the Intel Xeon Phi-based DOE systems, Theta and Aurora, through the Aurora Early Science Program by using the Kokkos C++ library to enable node-level performance portability. This paper describes infrastructure advancements and portability improvements made possible by our integration of Kokkos within Uintah. Scalability results are presented that compare serial and data parallel task execution models for a challenging radiative heat transfer calculation, central to the center's predictive boiler simulations. These results demonstrate both good strong-scaling characteristics to 256 Knights Landing (KNL) processors on the NSF Stampede system, and show the KNL-based calculation to compete with prior GPU-based results for the same calculation.



S. Kumar, D. Hoang, S. Petruzza, J. Edwards, V. Pascucci. “Reducing network congestion and synchronization overhead during aggregation of hierarchical data,” In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), IEEE, Dec, 2017.
DOI: 10.1109/hipc.2017.00034

ABSTRACT

Hierarchical data representations have been shown to be effective tools for coping with large-scale scientific data. Writing hierarchical data on supercomputers, however, is challenging as it often involves all-to-one communication during aggregation of low-resolution data which tends to span the entire network domain, resulting in several bottlenecks. We introduce the concept of indexing templates, which succinctly describe data organization and can be used to alter movement of data in beneficial ways. We present two techniques, domain partitioning and localized aggregation, that leverage indexing templates to alleviate congestion and synchronization overheads during data aggregation. We report experimental results that show significant I/O speedup using our proposed schemes on two of today's fastest supercomputers, Mira and Shaheen II, using the Uintah and S3D simulation frameworks.



T.A.J. Ouermi, A. Knoll, R.M. Kirby, M. Berzins. “OpenMP 4 Fortran Modernization of WSM6 for KNL,” In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC17, No. 12, ACM, pp. 12:1--12:8. 2017.
ISBN: 978-1-4503-5272-7
DOI: 10.1145/3093338.3093387

ABSTRACT

Parallel code portability in the petascale era requires modifying existing codes to support new architectures with large core counts and SIMD vector units. OpenMP is a well established and increasingly supported vehicle for portable parallelization. As architectures mature and compiler OpenMP implementations evolve, best practices for code modernization change as well. In this paper, we examine the impact of newer OpenMP features (in particular OMP SIMD) on the Intel Xeon Phi Knights Landing (KNL) architecture, applied in optimizing loops in the single moment 6-class microphysics module (WSM6) in the US Navy's NEPTUNE code. We find that with functioning OMP SIMD constructs, low thread invocation overhead on KNL and reduced penalty for unaligned access compared to previous architectures, one can leverage OpenMP 4 to achieve reasonable scalability with relatively minor reorganization of a production physics code.



T.A.J. Ouermi, A. Knoll, R.M. Kirby, M. Berzins. “Optimization Strategies for WRF Single-Moment 6-Class Microphysics Scheme (WSM6) on Intel Microarchitectures,” In Proceedings of the fifth international symposium on computing and networking (CANDAR 17). Awarded Best Paper, IEEE, 2017.

ABSTRACT

Optimizations in the petascale era require modifications of existing codes to take advantage of new architectures with large core counts and SIMD vector units. This paper examines high-level and low-level optimization strategies for numerical weather prediction (NWP) codes. These strategies employ thread-local structures of arrays (SOA) and an OpenMP directive such as OMP SIMD. These optimization approaches are applied to the Weather Research and Forecasting single-moment 6-class microphysics scheme (WSM6) in the US Navy NEPTUNE system. The results of this study indicate that the high-level approach with SOA and low-level OMP SIMD improves thread and vector parallelism by increasing data and temporal locality. The modified version of WSM6 runs 70x faster than the original serial code. This improvement is about 23.3x faster than the performance achieved by Ouermi et al., and 14.9x faster than the performance achieved by Michalakes et al.



S. Petruzza, A. Venkat, A. Gyulassy, G. Scorzelli, F. Federer, A. Angelucci, V. Pascucci, P. T. Bremer. “ISAVS: Interactive Scalable Analysis and Visualization System,” In ACM SIGGRAPH Asia 2017 Symposium on Visualization, ACM Press, 2017.
DOI: 10.1145/3139295.3139299

ABSTRACT

Modern science is inundated with ever increasing data sizes as computational capabilities and image acquisition techniques continue to improve. For example, simulations are tackling ever larger domains with higher fidelity, and high-throughput microscopy techniques generate larger data that are fundamental to gather biologically and medically relevant insights. As the image sizes exceed memory, and even sometimes local disk space, each step in a scientific workflow is impacted. Current software solutions enable data exploration with limited interactivity for visualization and analytic tasks. Furthermore, analysis on HPC systems often requires complex hand-written parallel implementations of algorithms that suffer from poor portability and maintainability. We present a software infrastructure that simplifies end-to-end visualization and analysis of massive data. First, a hierarchical streaming data access layer enables interactive exploration of remote data, with fast data fetching to test analytics on subsets of the data. Second, a library simplifies the process of developing new analytics algorithms, allowing users to rapidly prototype new approaches and deploy them in an HPC setting. Third, a scalable runtime system automates mapping analysis algorithms to whatever computational hardware is available, reducing the complexity of developing scaling algorithms. We demonstrate the usability and performance of our system using a use case from neuroscience: filtering, registration, and visualization of tera-scale microscopy data. We evaluate the performance of our system using a leadership-class supercomputer, Shaheen II.



W. Usher, J. Amstutz, C. Brownlee, A. Knoll, I. Wald. “Progressive CPU Volume Rendering with Sample Accumulation,” In Eurographics Symposium on Parallel Graphics and Visualization, Edited by Alexandru Telea and Janine Bennett, The Eurographics Association, 2017.
ISBN: 978-3-03868-034-5
ISSN: 1727-348X
DOI: 10.2312/pgv.20171090

ABSTRACT

We present a new method for progressive volume rendering by accumulating object-space samples over successively rendered frames. Existing methods for progressive refinement either use image space methods or average pixels over frames, which can blur features or integrate incorrectly with respect to depth. Our approach stores samples along each ray, accumulates new samples each frame into a buffer, and progressively interleaves and integrates these samples. Though this process requires additional memory, it ensures interactivity and is well suited for CPU architectures with large memory and cache. This approach also extends well to distributed rendering in cluster environments. We implement this technique in Intel's open source OSPRay CPU ray tracing framework and demonstrate that it is particularly useful for rendering volumetric data with costly sampling functions.
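The accumulation idea in this abstract can be sketched in one dimension: keep a sorted per-ray buffer of samples, interleave a few new samples each frame, and re-integrate over everything stored so far. The code below is a toy under stated assumptions (a ray parameterized on [0, 1], a base-2 radical-inverse offset per frame, piecewise-constant integration); the class and function names are hypothetical, and this is not the OSPRay implementation.

```python
import bisect

def radical_inverse(n):
    """Base-2 van der Corput value of n: 1 -> 0.5, 2 -> 0.25, 3 -> 0.75, ..."""
    result, weight = 0.0, 0.5
    while n:
        result += weight * (n & 1)
        n >>= 1
        weight *= 0.5
    return result

class RaySampleBuffer:
    """Sorted (t, value) samples accumulated along one ray, t in [0, 1]."""

    def __init__(self):
        self.samples = []

    def add_frame(self, frame_index, field, samples_per_frame=4):
        """Interleave this frame's samples with those of earlier frames."""
        offset = radical_inverse(frame_index + 1)
        for i in range(samples_per_frame):
            t = (i + offset) / samples_per_frame
            bisect.insort(self.samples, (t, field(t)))

    def integrate(self):
        """Each sample covers the span up to the midpoints of its neighbors."""
        total = 0.0
        last = len(self.samples) - 1
        for i, (t, value) in enumerate(self.samples):
            left = 0.0 if i == 0 else 0.5 * (self.samples[i - 1][0] + t)
            right = 1.0 if i == last else 0.5 * (t + self.samples[i + 1][0])
            total += value * (right - left)
        return total
```

Calling `add_frame` once per rendered frame with an expensive `field` function spreads the sampling cost over time, while `integrate` always uses every sample gathered so far rather than averaging per-frame pixel results.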



I. Wald, C. Brownlee, W. Usher, A. Knoll. “CPU Volume Rendering of Adaptive Mesh Refinement Data,” In ACM SIGGRAPH Asia 2017 Symposium on Visualization, ACM Press, 2017.
DOI: 10.1145/3139295.3139305

ABSTRACT

Adaptive Mesh Refinement (AMR) methods are widespread in scientific computing, and visualizing the resulting data with efficient and accurate rendering methods can be vital for enabling interactive data exploration. In this work, we detail a comprehensive solution for directly volume rendering block-structured (Berger-Colella) AMR data in the OSPRay interactive CPU ray tracing framework. In particular, we contribute a general method for representing and traversing AMR data using a kd-tree structure, and four different reconstruction options, one of which in particular (the basis function approach) is novel compared to existing methods. We demonstrate our system on two types of block-structured AMR data and compressed scalar field data, and show how it can be easily used in existing production-ready applications through a prototypical integration in the widely used visualization program ParaView.