Developing software tools for science has always been a central vision of the SCI Institute.

Scientific Computing

Numerical simulation of real-world phenomena provides fertile ground for building interdisciplinary relationships. The SCI Institute has a long tradition of building these relationships in a win-win fashion – a win for the theoretical and algorithmic development of numerical modeling and simulation techniques and a win for the discipline-specific science of interest. High-order and adaptive methods, uncertainty quantification, complexity analysis, and parallelization are just some of the topics being investigated by SCI faculty. These areas of computing are being applied to a wide variety of engineering applications ranging from fluid mechanics and solid mechanics to bioelectricity.


Scientific Computing Faculty:

Martin Berzins: Parallel Computing, GPUs
Mike Kirby: Finite Element Methods, Uncertainty Quantification, GPUs
Valerio Pascucci: Scientific Data Management
Chris Johnson: Problem Solving Environments
Ross Whitaker: GPUs
Chuck Hansen: GPUs

Publications in Scientific Computing:


Scalable Data Management of the Uintah Simulation Framework for Next-Generation Engineering Problems with Radiation
S. Kumar, A. Humphrey, W. Usher, S. Petruzza, B. Peterson, J. A. Schmidt, D. Harris, B. Isaac, J. Thornock, T. Harman, V. Pascucci, M. Berzins. In Supercomputing Frontiers, Springer International Publishing, pp. 219--240. 2018.
ISBN: 978-3-319-69953-0
DOI: 10.1007/978-3-319-69953-0_13

The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on data management-related problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used for the modeling of a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes, we faced both parallel I/O and visualization-related challenges, e.g., the default file-per-process I/O approach of Uintah did not scale on Mira. In this paper we present a simple-to-implement, restructuring-based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes. The goal is to distribute the dataset such that each process holds a larger chunk of data, which is then written to a file independently. This approach finds a middle ground between two of the most common parallel I/O schemes--file per process I/O and shared file I/O--in terms of both the total number of generated files, and the extent of communication involved during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, like discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira.
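The aggregation middle ground described above is easy to picture in miniature. Below is a hedged sketch using mpi4py (an illustration of the restructuring idea, not the Uintah implementation); the group size, the toy payload, and the file naming are assumptions for the example.

```python
# Sketch of restructuring-based parallel I/O: ranks are split into
# groups, each group gathers its patches onto one aggregator rank,
# and each aggregator writes a single larger file.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

GROUP_SIZE = 4                        # assumption: ranks per aggregation group
color = rank // GROUP_SIZE            # which group this rank belongs to
group = comm.Split(color=color, key=rank)

# Stand-in for a simulation patch owned by this rank.
local_data = np.full(1024, rank, dtype=np.float64)

# Restructure: gather the group's patches onto one aggregator rank.
chunks = group.gather(local_data, root=0)

if group.Get_rank() == 0:
    # One file per group: fewer files than file-per-process,
    # less contention than a single shared file.
    np.concatenate(chunks).tofile(f"restructured_{color:05d}.raw")
```

Run under MPI (e.g. `mpirun -np 16 python restructure_io.py`, a hypothetical script name); every four ranks then emit one file instead of four.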



Nonlinear stability and time step selection for the MPM method
M. Berzins. In Computational Particle Mechanics, Jan, 2018.
ISSN: 2196-4386
DOI: 10.1007/s40571-018-0182-y

The Material Point Method (MPM) has been developed from the Particle in Cell (PIC) method over the last 25 years and has proved its worth in solving many challenging problems involving large deformations. Nevertheless, there are many open questions regarding the theoretical properties of MPM. For example, while Fourier methods, as applied to PIC, may provide useful insight, the non-linear nature of MPM makes it necessary to use a full non-linear stability analysis to determine a stable time step for MPM. In order to begin to address this, the stability analysis of Spigler and Vianello is adapted to MPM and used to derive a stable time step bound for a model problem. This bound is contrasted against traditional speed-of-sound and CFL bounds and shown to be a realistic stability bound for a model problem.
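For context, the traditional bounds the paper contrasts against have the familiar linearized form below (textbook notation, not the paper's derived nonlinear bound):

```latex
% Standard linearized CFL / speed-of-sound time step bound:
\[
  \Delta t \;\le\; C\,\frac{\Delta x}{c + |v|},
  \qquad c = \sqrt{E/\rho}, \quad 0 < C \le 1,
\]
% where $\Delta x$ is the grid spacing, $v$ the material velocity,
% $c$ the 1D elastic wave speed, and $C$ a Courant number.
```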



Research and Education in Computational Science and Engineering
Subtitled “Report from a workshop sponsored by the Society for Industrial and Applied Mathematics (SIAM) and the European Exascale Software Initiative (EESI-2), August 4-6, 2014, Breckenridge, Colorado,” U. Ruede, K. Willcox, L. C. McInnes, H. De Sterck, G. Biros, H. Bungartz, J. Corones, E. Cramer, J. Crowley, O. Ghattas, M. Gunzburger, M. Hanke, R. Harrison, M. Heroux, J. Hesthaven, P. Jimack, C. Johnson, K. E. Jordan, D. E. Keyes, R. Krause, V. Kumar, S. Mayer, J. Meza, K. M. Mørken, J. T. Oden, L. Petzold, P. Raghavan, S. M. Shontz, A. Trefethen, P. Turner, V. Voevodin, B. Wohlmuth, C. S. Woodward. arXiv preprint arXiv:1610.02608, 2018.

This report presents challenges, opportunities and directions for computational science and engineering (CSE) research and education for the next decade. Over the past two decades the field of CSE has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers with algorithmic inventions and software systems that transcend disciplines and scales. CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments—including the architectural complexity of extreme-scale computing, the data revolution and increased attention to data-driven discovery, and the specialization required to follow the applications to new frontiers—is redefining the scope and reach of the CSE endeavor. With these many current and expanding opportunities for the CSE field, there is a growing demand for CSE graduates and a need to expand CSE educational offerings. This need includes CSE programs at both the undergraduate and graduate levels, as well as continuing education and professional development programs, exploiting the synergy between computational science and data science. Yet, as institutions consider new and evolving educational programs, it is essential to consider the broader research challenges and opportunities that provide the context for CSE education and workforce development.



Toward parallel CFA with datalog, MPI, and CUDA
T. Gilray, S. Kumar. In Scheme and Functional Programming Workshop, 2017.

We present our recent experience working to design parallel functional control-flow analysis (CFA) using an encoding in Datalog and underlying relational algebra implemented for SIMD coprocessors and supercomputers. Control-flow analysis statically models the possible propagations of data and control through a target program, finitely obtaining a bound on reachable expressions and environments and on possible return and argument values. We used Soufflé, a parallel CPU-based Datalog implementation from Oracle Labs, and worked toward a new MPI-based distributed hash join implementation and an extension of the GPU-based relational algebra library RedFox.

In this paper, we provide introductions to functional flow analysis, Datalog, MPI, and CUDA, explaining the total process we are working on to bring these components together in an analysis pipeline toward the end of scaling functional program analyses by extracting their intrinsic parallelism in a principled manner.
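The relational-algebra core being parallelized is a join-until-fixpoint loop. The toy Python sketch below (my illustration, not Soufflé or RedFox code) runs semi-naive Datalog evaluation of transitive closure, the canonical example of the iterated join that the paper maps onto MPI and CUDA:

```python
# Semi-naive evaluation of: path(x, z) :- edge(x, y), path(y, z).
# Only newly derived facts ("delta") participate in the next join.
edge = {(1, 2), (2, 3), (3, 4)}           # toy input relation
path = set(edge)                          # path(x, y) :- edge(x, y).
delta = set(edge)

while delta:
    # Join edge(x, y) with the newly derived path(y, z) facts.
    derived = {(x, z) for (x, y) in edge for (y2, z) in delta if y == y2}
    delta = derived - path                # keep only genuinely new facts
    path |= delta

print(sorted(path))                       # transitive closure of edge
```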



Reducing Network Congestion and Synchronization Overhead During Aggregation of Hierarchical Data
S. Kumar, D. Hoang, S. Petruzza, J. Edwards, V. Pascucci. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp. 223-232. Dec, 2017.
DOI: 10.1109/HiPC.2017.00034

Hierarchical data representations have been shown to be effective tools for coping with large-scale scientific data. Writing hierarchical data on supercomputers, however, is challenging as it often involves all-to-one communication during aggregation of low-resolution data which tends to span the entire network domain, resulting in several bottlenecks. We introduce the concept of indexing templates, which succinctly describe data organization and can be used to alter movement of data in beneficial ways. We present two techniques, domain partitioning and localized aggregation, that leverage indexing templates to alleviate congestion and synchronization overheads during data aggregation. We report experimental results that show significant I/O speedup using our proposed schemes on two of today's fastest supercomputers, Mira and Shaheen II, using the Uintah and S3D simulation frameworks.



Effectively Subsampled Quadratures for Least Squares Polynomial Approximations
P. Seshadri, A. Narayan, S. Mahadevan. In SIAM/ASA Journal on Uncertainty Quantification, pp. 1003--1023. Jan, 2017.

This paper proposes a new deterministic sampling strategy for constructing polynomial chaos approximations for expensive physics simulation models. The proposed approach, effectively subsampled quadratures, involves sparsely subsampling an existing tensor grid using QR column pivoting. For polynomial interpolation using hyperbolic or total order sets, we then solve the resulting square least squares problem. For polynomial approximation, we use a column pruning heuristic that removes columns based on the highest total orders and then solves the tall least squares problem. While we provide bounds on the condition number of such tall submatrices, it is difficult to ascertain how column pruning affects solution accuracy as this is problem specific. We conclude with numerical experiments on an analytical function and a model piston problem that show the efficacy of our approach compared with randomized subsampling. We also show an example where this method fails.
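The subsampling step admits a compact sketch. The Python snippet below is a one-dimensional toy under assumed choices of rule, basis, and test function (not the paper's code): pivoted QR on the transpose of the weighted design matrix ranks quadrature points, and the leading pivots define the square least squares system.

```python
# QR-with-column-pivoting subsampling of a Gauss-Legendre rule.
import numpy as np
from scipy.linalg import qr

n_quad, degree = 50, 9                            # rule size, poly degree
x, w = np.polynomial.legendre.leggauss(n_quad)    # full Gauss-Legendre rule

A = np.polynomial.legendre.legvander(x, degree)   # n_quad x (degree + 1)
Aw = np.sqrt(w)[:, None] * A                      # rows weighted by sqrt(w)

# Pivoted QR on Aw.T ranks quadrature points by how much they add.
_, _, piv = qr(Aw.T, pivoting=True)
rows = piv[: degree + 1]                          # keep a square subsystem

f = np.exp(x)                                     # toy "expensive model"
coef = np.linalg.solve(Aw[rows], (np.sqrt(w) * f)[rows])
print("max error on full grid:", np.abs(A @ coef - f).max())
```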



Sequential data assimilation with multiple nonlinear models and applications to subsurface flow
L. Yang, A. Narayan, P. Wang. In Journal of Computational Physics, Vol. 346, pp. 356--368. Oct, 2017.
ISSN: 0021-9991
DOI: 10.1016/j.jcp.2017.06.026

Complex systems are often described with competing models. Such divergence of interpretation on the system may stem from model fidelity, mathematical simplicity, and, more generally, our limited knowledge of the underlying processes. Meanwhile, available but limited observations of system state can further complicate one's prediction choices. Over the years, data assimilation techniques, such as the Kalman filter, have become essential tools for improved system estimation by incorporating both model forecasts and measurements; but their potential to mitigate the impacts of the aforementioned model-form uncertainty has yet to be developed. Based on an earlier study of the multi-model Kalman filter, we propose a novel framework to assimilate multiple models with observation data for nonlinear systems, using the extended Kalman filter, the ensemble Kalman filter, and the particle filter, respectively. Through numerical examples of subsurface flow, we demonstrate that the new assimilation framework provides an effective and improved forecast of system behaviour.
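As a concrete reference for one of the assimilation schemes employed, here is a minimal stochastic (perturbed-observation) ensemble Kalman filter analysis step; the linear observation operator, noise levels, and sizes are illustrative assumptions, not the paper's subsurface flow setup.

```python
# One ensemble Kalman filter analysis step with perturbed observations.
import numpy as np

rng = np.random.default_rng(0)
n_state, n_obs, n_ens = 8, 3, 50            # illustrative sizes

X = rng.normal(size=(n_state, n_ens))       # forecast ensemble (columns)
H = np.zeros((n_obs, n_state))              # linear observation operator
H[[0, 1, 2], [0, 3, 6]] = 1.0               # observe three state components
R = 0.1 * np.eye(n_obs)                     # observation error covariance
y = rng.normal(size=n_obs)                  # the observation

# Sample covariance terms from ensemble anomalies.
A = X - X.mean(axis=1, keepdims=True)
PHt = A @ (H @ A).T / (n_ens - 1)           # P H^T
S = H @ PHt + R                             # innovation covariance
K = PHt @ np.linalg.inv(S)                  # Kalman gain

# Perturbed observations: each member assimilates a noisy copy of y.
Y = y[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, size=n_ens).T
X_analysis = X + K @ (Y - H @ X)
print(X_analysis.mean(axis=1))              # analysis ensemble mean
```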



A Christoffel function weighted least squares algorithm for collocation approximations
A. Narayan, J. Jakeman, T. Zhou. In Mathematics of Computation, Vol. 86, No. 306, pp. 1913--1947. 2017.
ISSN: 0025-5718, 1088-6842
DOI: 10.1090/mcom/3192

We propose, theoretically investigate, and numerically validate an algorithm for the Monte Carlo solution of least-squares polynomial approximation problems in a collocation framework. Our method is motivated by generalized Polynomial Chaos approximation in uncertainty quantification where a polynomial approximation is formed from a combination of orthogonal polynomials. A standard Monte Carlo approach would draw samples according to the density of orthogonality. Our proposed algorithm samples with respect to the equilibrium measure of the parametric domain, and subsequently solves a weighted least-squares problem, with weights given by evaluations of the Christoffel function. We present theoretical analysis to motivate the algorithm, and numerical results that show our method is superior to standard Monte Carlo methods in many situations of interest.
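A one-dimensional sketch of the sampling-plus-weighting recipe follows (my own normalizations and toy target, not the paper's code): draw samples from the arcsine equilibrium measure on [-1, 1], weight each by the Christoffel function of the polynomial space, and solve the weighted least squares problem.

```python
# Christoffel-weighted least squares with equilibrium-measure sampling.
import numpy as np

rng = np.random.default_rng(1)
n_basis, n_samples = 10, 200

# Samples from the equilibrium (arcsine) measure on [-1, 1].
x = np.cos(np.pi * rng.random(n_samples))

# Legendre basis, orthonormal w.r.t. the uniform probability measure.
V = np.polynomial.legendre.legvander(x, n_basis - 1)
V *= np.sqrt(2.0 * np.arange(n_basis) + 1.0)

# Christoffel-function weights for the sampled points.
w = n_basis / np.sum(V**2, axis=1)

f = np.sin(np.pi * x)                     # toy function to approximate
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(sw[:, None] * V, sw * f, rcond=None)
```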



Offline-Enhanced Reduced Basis Method through adaptive construction of the Surrogate Parameter Domain
J. Jiang, Y. Chen, A. Narayan. In Journal of Scientific Computing, Vol. 73, No. 2-3, pp. 853--875. 2017.
ISSN: 0885-7474, 1573-7691
DOI: 10.1007/s10915-017-0551-3

The Reduced Basis Method (RBM) is a popular certified model reduction approach for solving parametrized partial differential equations. One critical stage of the offline portion of the algorithm is a greedy algorithm, requiring maximization of an error estimate over parameter space. In practice this maximization is usually performed by replacing the parameter domain continuum with a discrete "training" set. When the dimension of parameter space is large, it is necessary to significantly increase the size of this training set in order to effectively search parameter space. Large training sets diminish the attractiveness of RBM algorithms since this proportionally increases the cost of the offline phase.

In this work we propose novel strategies for offline RBM algorithms that mitigate the computational difficulty of maximizing error estimates over a training set. The main idea is to identify a subset of the training set, a "surrogate parameter domain" (SPD), on which to perform greedy algorithms. The SPDs we construct are much smaller in size than the full training set, yet our examples suggest that they are accurate enough to represent the solution manifold of interest at the current offline RBM iteration. We propose two algorithms to construct the SPD: our first algorithm, the Successive Maximization Method (SMM), is inspired by inverse transform sampling for non-standard univariate probability distributions. The second constructs an SPD by identifying pivots in the Cholesky decomposition of an approximate error correlation matrix. We demonstrate the algorithm through numerical experiments, showing that the algorithm is capable of accelerating offline RBM procedures without degrading accuracy, assuming that the solution manifold has low Kolmogorov width.
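The second construction reduces to a greedy pivoted Cholesky factorization. The sketch below (run on a synthetic "error correlation" matrix, not the paper's implementation) shows how the pivot order selects candidate parameters:

```python
# Greedy partial pivoted Cholesky: the pivots picked here would play
# the role of the surrogate parameter domain (SPD) candidates.
import numpy as np

def pivoted_cholesky(K, m):
    """Return the first m pivot indices of symmetric PSD matrix K."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # running Schur-complement diagonal
    L = np.zeros((n, m))
    piv = []
    for j in range(m):
        i = int(np.argmax(d))             # largest remaining diagonal entry
        piv.append(i)
        L[:, j] = (K[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(d[i])
        d -= L[:, j] ** 2
    return piv

# Synthetic PSD "error correlation" matrix over 100 training parameters.
rng = np.random.default_rng(2)
B = rng.normal(size=(100, 10))
K = B @ B.T
print(pivoted_cholesky(K, 5))             # 5 surrogate-domain candidates
```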



A Generalized Sampling and Preconditioning Scheme for Sparse Approximation of Polynomial Chaos Expansions
J. Jakeman, A. Narayan, T. Zhou. In SIAM Journal on Scientific Computing, Vol. 39, No. 3, SIAM, pp. A1114--A1144. Jan, 2017.
ISSN: 1064-8275
DOI: 10.1137/16M1063885

In this paper we propose an algorithm for recovering sparse orthogonal polynomials using stochastic collocation. Our approach is motivated by the desire to use generalized polynomial chaos expansions (PCE) to quantify uncertainty in models subject to uncertain input parameters. The standard sampling approach for recovering sparse polynomials is to use Monte Carlo (MC) sampling of the density of orthogonality. However, MC methods result in poor function recovery when the polynomial degree is high. Here we propose a general algorithm that can be applied to any admissible weight function on a bounded domain and a wide class of exponential weight functions defined on unbounded domains. Our proposed algorithm samples with respect to the weighted equilibrium measure of the parametric domain, and subsequently solves a preconditioned ℓ1-minimization problem, where the weights of the diagonal preconditioning matrix are given by evaluations of the Christoffel function. We present theoretical analysis to motivate the algorithm, and numerical results that show our method is superior to standard Monte Carlo methods in many situations of interest. Numerical examples are also provided that demonstrate that our proposed Christoffel Sparse Approximation algorithm leads to comparable or improved accuracy even when compared with Legendre and Hermite specific algorithms.



Stochastic Collocation Methods via L1 Minimization Using Randomized Quadratures
L. Guo, A. Narayan, T. Zhou, Y. Chen. In SIAM Journal on Scientific Computing, Vol. 39, No. 1, pp. A333--A359. Jan, 2017.
ISSN: 1064-8275
DOI: 10.1137/16M1059680

In this work, we discuss the problem of approximating a multivariate function via an ℓ1 minimization method, using a randomly chosen sub-grid of the corresponding tensor grid of Gaussian points. The independent variables of the function are assumed to be random variables, and thus the framework provides a non-intrusive way to construct the generalized polynomial chaos expansions, stemming from the motivating application of Uncertainty Quantification (UQ). We provide theoretical analysis on the validity of the approach. The framework covers both bounded measures, such as the uniform and Chebyshev measures, and unbounded measures, including the Gaussian measure. Several numerical examples are given to confirm the theoretical results.



Numerical Computation of Weil-Petersson Geodesics in the Universal Teichmüller Space
M. Feiszli, A. Narayan. In SIAM Journal on Imaging Sciences, Vol. 10, No. 3, SIAM, pp. 1322--1345. Jan, 2017.
DOI: 10.1137/15M1043947

We propose an optimization algorithm for computing geodesics on the universal Teichmüller space T(1) in the Weil-Petersson (WP) metric. Another realization for T(1) is the space of planar shapes, modulo translation and scale, and thus our algorithm addresses a fundamental problem in computer vision: compute the distance between two given shapes. The identification of smooth shapes with elements on T(1) allows us to represent a shape as a diffeomorphism on S1. Then, given two diffeomorphisms on S1 (i.e., two shapes we want to connect with a flow), we formulate a discretized WP energy, and the resulting problem is a boundary-value minimization problem. We numerically solve this problem, providing several examples of geodesic flow on the space of shapes and verifying mathematical properties of T(1). Our algorithm is more general than the application here in the sense that it can be used to compute geodesics on any other Riemannian manifold.



An Orthogonality Property of the Legendre Polynomials
L. Bos, A. Narayan, N. Levenberg, F. Piazzon. In Constructive Approximation, Vol. 45, No. 1, pp. 65--81. Feb, 2017.
ISSN: 0176-4276, 1432-0940
DOI: 10.1007/s00365-015-9321-3

We give a remarkable additional orthogonality property of the classical Legendre polynomials on the real interval [−1,1]: polynomials up to degree n from this family are mutually orthogonal under the arcsine measure weighted by the degree-n normalized Christoffel function.
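The property is easy to probe numerically. The snippet below uses my own normalization choices (Legendre polynomials orthonormal with respect to the uniform probability measure, and Gauss-Chebyshev quadrature for the arcsine measure) and prints a Gram matrix that should come out close to diagonal if the statement holds in this normalization:

```python
# Numerical check (not a proof) of the weighted-arcsine orthogonality.
import numpy as np

n = 5                                        # highest Legendre degree tested
m = 4000                                     # Gauss-Chebyshev quadrature size

# Gauss-Chebyshev nodes integrate against the arcsine density
# 1/(pi*sqrt(1-x^2)) with equal weights 1/m.
k = np.arange(1, m + 1)
x = np.cos((2 * k - 1) * np.pi / (2 * m))

V = np.polynomial.legendre.legvander(x, n)   # P_0..P_n at the nodes
Vo = V * np.sqrt(2 * np.arange(n + 1) + 1)   # orthonormal w.r.t. uniform
lam = (n + 1) / np.sum(Vo**2, axis=1)        # normalized Christoffel function

G = (Vo * lam[:, None]).T @ Vo / m           # Gram matrix, weighted arcsine
print(np.round(G, 6))                        # near-diagonal if property holds
```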



Nonlinear Stability of the MPM Method
M. Berzins. In V International Conference on Particle-based Methods – Fundamentals and Applications. PARTICLES 2017, Edited by P. Wriggers, M. Bischoff, E. Oñate, D.R.J. Owen, & T. Zohdi, pp. 671--682. 2017.

The Material Point Method (MPM) has been very successful in providing solutions to many challenging problems involving large deformations. The nonlinear nature of MPM makes it necessary to use a full nonlinear stability analysis to determine a stable timestep. The stability analysis of Spigler and Vianello is adapted to MPM and used to derive a stable timestep bound for a model problem. This bound is contrasted against a traditional CFL bound.



Optimization Strategies for WRF Single-Moment 6-Class Microphysics Scheme (WSM6) on Intel Microarchitectures
T.A.J. Ouermi, A. Knoll, R.M. Kirby, M. Berzins. In Proceedings of the Fifth International Symposium on Computing and Networking (CANDAR 17), IEEE, 2017. Awarded Best Paper.

Optimizations in the petascale era require modifications of existing codes to take advantage of new architectures with large core counts and SIMD vector units. This paper examines high-level and low-level optimization strategies for numerical weather prediction (NWP) codes. These strategies employ thread-local structures of arrays (SOA) and an OpenMP directive such as OMP SIMD. These optimization approaches are applied to the Weather Research and Forecasting (WRF) single-moment 6-class microphysics scheme (WSM6) in the US Navy NEPTUNE system. The results of this study indicate that the high-level approach with SOA and low-level OMP SIMD improves thread and vector parallelism by increasing data and temporal locality. The modified version of WSM6 runs 70x faster than the original serial code. This improvement is about 23.3x faster than the performance achieved by Ouermi et al., and 14.9x faster than the performance achieved by Michalakes et al.



Addressing Global Data Dependencies in Heterogeneous Asynchronous Runtime Systems on GPUs
B. Peterson, A. Humphrey, J. Schmidt, M. Berzins. In Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware - ESPM2'17, ACM, 2017. Awarded Best Paper.
DOI: 10.1145/3152041.3152082

Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. Using a combined data store and task scheduler redesign reduces data dependency duplication ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes which must be processed efficiently at large scale.



Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social Networks
A. Bhatele, J. Yeom, N. Jain, C. J. Kuhlman, Y. Livnat, K. R. Bisset, L. V. Kale, M. V. Marathe. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May, 2017.
DOI: 10.1109/ccgrid.2017.141

Controlling the spread of infectious diseases in large populations is an important societal challenge. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. We describe EpiSimdemics, a highly scalable, parallel code written in Charm++ that uses agent-based modeling to simulate disease spreads over large, realistic, co-evolving interaction networks. We present a new parallel implementation of EpiSimdemics that achieves unprecedented strong and weak scaling on different architectures — Blue Waters, Cori and Mira. EpiSimdemics achieves five times greater speedup than the second fastest parallel code in this field. This unprecedented scaling is an important step to support the long term vision of real-time epidemic science. Finally, we demonstrate the capabilities of EpiSimdemics by simulating the spread of influenza over a realistic synthetic social contact network spanning the continental United States (∼280 million nodes and 5.8 billion social contacts).



Progressive CPU Volume Rendering with Sample Accumulation
W. Usher, J. Amstutz, C. Brownlee, A. Knoll, I. Wald. In Eurographics Symposium on Parallel Graphics and Visualization, Edited by Alexandru Telea and Janine Bennett, The Eurographics Association, 2017.
ISBN: 978-3-03868-034-5
ISSN: 1727-348X
DOI: 10.2312/pgv.20171090

We present a new method for progressive volume rendering by accumulating object-space samples over successively rendered frames. Existing methods for progressive refinement either use image space methods or average pixels over frames, which can blur features or integrate incorrectly with respect to depth. Our approach stores samples along each ray, accumulates new samples each frame into a buffer, and progressively interleaves and integrates these samples. Though this process requires additional memory, it ensures interactivity and is well suited for CPU architectures with large memory and cache. This approach also extends well to distributed rendering in cluster environments. We implement this technique in Intel's open source OSPRay CPU ray tracing framework and demonstrate that it is particularly useful for rendering volumetric data with costly sampling functions.
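The accumulate-and-reintegrate loop can be sketched for a single ray (an illustration of the idea, not OSPRay code): each frame contributes jittered object-space samples that are merged into a persistent per-ray buffer, which is then re-integrated front to back with spacing-aware opacity.

```python
# Progressive per-ray sample accumulation with front-to-back compositing.
import numpy as np

rng = np.random.default_rng(3)

def extinction(t):
    """Toy scalar field along one ray: a blob centered at t = 0.5."""
    return 4.0 * np.exp(-(t - 0.5) ** 2 / 0.02)

samples = []                                       # persistent per-ray buffer

def render_frame(n_new=16):
    # Accumulate this frame's jittered object-space samples.
    t_new = (np.arange(n_new) + rng.random(n_new)) / n_new
    samples.extend(t_new.tolist())
    samples.sort()                                 # keep depth order

    # Re-integrate front to back; opacity uses the current sample
    # spacing, so the estimate sharpens as the buffer fills in.
    ts = np.asarray(samples)
    dt = np.diff(ts, append=1.0)                   # spacing to next sample
    alpha = 1.0 - np.exp(-extinction(ts) * dt)
    T = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    return float(np.sum(T * alpha))                # ray opacity estimate

for frame in range(4):
    print(f"frame {frame}: opacity ~ {render_frame():.4f}")
```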



Pathways for Theoretical Advances in Visualization
M. Chen, G. Grinstein, C. R. Johnson, J. Kennedy, M. Tory. In IEEE Computer Graphics and Applications, IEEE, pp. 103--112. July, 2017.

More than a decade ago, Chris Johnson proposed the "Theory of Visualization" as one of the top research problems in visualization. Since then, there have been several theory-focused events, including three workshops and three panels at IEEE Visualization (VIS) Conferences. Together, these events have produced a set of convincing arguments.