## Martin BerzinsParallel ComputingGPUs |
## Mike KirbyFinite Element MethodsUncertainty Quantification GPUs |
## Valerio PascucciScientific Data Management |
## Chris JohnsonProblem Solving Environments |
## Ross WhitakerGPUs |

## Chuck HansenGPUs |

Scalable Data Management of the Uintah Simulation Framework for Next-Generation Engineering Problems with RadiationS. Kumar, A. Humphrey, W. Usher, S. Petruzza, B. Peterson, J. A. Schmidt, D. Harris, B. Isaac, J. Thornock, T. Harman, V. Pascucci,, M. Berzins. In Supercomputing Frontiers, Springer International Publishing, pp. 219--240. 2018. ISBN: 978-3-319-69953-0 DOI: 10.1007/978-3-319-69953-0_13 The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on data management related problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used for the modeling of a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes, we faced both parallel I/O and visualization related challenges, e.g., the default file-per-process I/O approach of Uintah did not scale on Mira. In this paper we present a simple to implement, restructuring based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes. The goal is to distribute the dataset such that each process holds a larger chunk of data, which is then written to a file independently. This approach finds a middle ground between two of the most common parallel I/O schemes--file per process I/O and shared file I/O--in terms of both the total number of generated files, and the extent of communication involved during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, like discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira. |

Scalable Data Management of the Uintah Simulation Framework for Next-Generation Engineering Problems with RadiationS. Kumar, A. Humphrey, W. Usher, S. Petruzza, B. Peterson, J. A. Schmidt, D. Harris, B. Isaac, J. Thornock, T. Harman, V. Pascucci,, M. Berzins. In Supercomputing Frontiers, Springer International Publishing, pp. 219--240. 2018. ISBN: 978-3-319-69953-0 DOI: 10.1007/978-3-319-69953-0_13 The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on data management related problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used for the modeling of a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes, we faced both parallel I/O and visualization related challenges, e.g., the default file-per-process I/O approach of Uintah did not scale on Mira. In this paper we present a simple to implement, restructuring based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes. The goal is to distribute the dataset such that each process holds a larger chunk of data, which is then written to a file independently. This approach finds a middle ground between two of the most common parallel I/O schemes--file per process I/O and shared file I/O--in terms of both the total number of generated files, and the extent of communication involved during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, like discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira. |

Nonlinear stability and time step selection for the MPM methodM. Berzins. In Computational Particle Mechanics, Jan, 2018. ISSN: 2196-4386 DOI: 10.1007/s40571-018-0182-y The Material Point Method (MPM) has been developed from the Particle in Cell (PIC) method over the last 25 years and has proved its worth in solving many challenging problems involving large deformations. Nevertheless there are many open questions regarding the theoretical properties of MPM. For example in while Fourier methods, as applied to PIC may provide useful insight, the non-linear nature of MPM makes it necessary to use a full non-linear stability analysis to determine a stable time step for MPM. In order to begin to address this the stability analysis of Spigler and Vianello is adapted to MPM and used to derive a stable time step bound for a model problem. This bound is contrasted against traditional Speed of sound and CFL bounds and shown to be a realistic stability bound for a model problem. |

Reducing Network Congestion and Synchronization Overhead During Aggregation of Hierarchical DataS. Kumar, D. Hoang, S. Petruzza, J. Edwards, V. Pascucci. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp. 223-232. Dec, 2017. DOI: 10.1109/HiPC.2017.00034 Hierarchical data representations have been shown to be effective tools for coping with large-scale scientific data. Writing hierarchical data on supercomputers, however, is challenging as it often involves all-to-one communication during aggregation of low-resolution data which tends to span the entire network domain, resulting in several bottlenecks. We introduce the concept of indexing templates, which succinctly describe data organization and can be used to alter movement of data in beneficial ways. We present two techniques, domain partitioning and localized aggregation, that leverage indexing templates to alleviate congestion and synchronization overheads during data aggregation. We report experimental results that show significant I/O speedup using our proposed schemes on two of today's fastest supercomputers, Mira and Shaheen II, using the Uintah and S3D simulation frameworks. |

Sequential data assimilation with multiple nonlinear models and applications to subsurface flowL. Yang, A. Narayan, P. Wang. In Journal of Computational Physics, Vol. 346, pp. 356--368. Oct, 2017. ISSN: 0021-9991 DOI: 10.1016/j.jcp.2017.06.026 Complex systems are often described with competing models. Such divergence of interpretation on the system may stem from model fidelity, mathematical simplicity, and more generally, our limited knowledge of the underlying processes. Meanwhile, available but limited observations of system state could further complicates one's prediction choices. Over the years, data assimilation techniques, such as the Kalman filter, have become essential tools for improved system estimation by incorporating both models forecast and measurement; but its potential to mitigate the impacts of aforementioned model-form uncertainty has yet to be developed. Based on an earlier study of Multi-model Kalman filter, we propose a novel framework to assimilate multiple models with observation data for nonlinear systems, using extended Kalman filter, ensemble Kalman filter and particle filter, respectively. Through numerical examples of subsurface flow, we demonstrate that the new assimilation framework provides an effective and improved forecast of system behaviour. |

A Christoffel function weighted least squares algorithm for collocation approximationsA. Narayan, J. Jakeman, T. Zhou. In Mathematics of Computation, Vol. 86, No. 306, pp. 1913--1947. 2017. ISSN: 0025-5718, 1088-6842 DOI: 10.1090/mcom/3192 We propose, theoretically investigate, and numerically validate an algorithm for the Monte Carlo solution of least-squares polynomial approximation problems in a collocation frame- work. Our method is motivated by generalized Polynomial Chaos approximation in uncertainty quantification where a polynomial approximation is formed from a combination of orthogonal polynomials. A standard Monte Carlo approach would draw samples according to the density of orthogonality. Our proposed algorithm samples with respect to the equilibrium measure of the parametric domain, and subsequently solves a weighted least-squares problem, with weights given by evaluations of the Christoffel function. We present theoretical analysis to motivate the algorithm, and numerical results that show our method is superior to standard Monte Carlo methods in many situations of interest. |

Offline-Enhanced Reduced Basis Method through adaptive construction of the Surrogate Parameter DomainJ. Jiang, Y. Chen, A. Narayan. In Journal of Scientific Computing, Vol. 73, No. 2-3, pp. 853--875. 2017. ISSN: 0885-7474, 1573-7691 DOI: 10.1007/s10915-017-0551-3 The Reduced Basis Method (RBM) is a popular certified model reduction approach for solving parametrized partial differential equations. One critical stage of the \textitoffline portion of the algorithm is a greedy algorithm, requiring maximization of an error estimate over parameter space. In practice this maximization is usually performed by replacing the parameter domain continuum with a discrete "training" set. When the dimension of parameter space is large, it is necessary to significantly increase the size of this training set in order to effectively search parameter space. Large training sets diminish the attractiveness of RBM algorithms since this proportionally increases the cost of the offline phase. |

A Generalized Sampling and Preconditioning Scheme for Sparse Approximation of Polynomial Chaos ExpansionsJ. Jakeman, A. Narayan, T. Zhou. In SIAM Journal on Scientific Computing, Vol. 39, No. 3, SIAM, pp. A1114--A1144. Jan, 2017. ISSN: 1064-8275 DOI: 10.1137/16M1063885 In this paper we propose an algorithm for recovering sparse orthogonal polynomials using stochastic collocation. Our approach is motivated by the desire to use generalized polynomial chaos expansions (PCE) to quantify uncertainty in models subject to uncertain input parameters. The standard sampling approach for recovering sparse polynomials is to use Monte Carlo (MC) sampling of the density of orthogonality. However MC methods result in poor function recovery when the polynomial degree is high. Here we propose a general algorithm that can be applied to any admissible weight function on a bounded domain and a wide class of exponential weight functions defined on unbounded domains. Our proposed algorithm samples with respect to the weighted equilibrium measure of the parametric domain, and subsequently solves a preconditioned ℓ1-minimization problem, where the weights of the diagonal preconditioning matrix are given by evaluations of the Christoffel function. We present theoretical analysis to motivate the algorithm, and numerical results that show our method is superior to standard Monte Carlo methods in many situations of interest. Numerical examples are also provided that demonstrate that our proposed Christoffel Sparse Approximation algorithm leads to comparable or improved accuracy even when compared with Legendre and Hermite specific algorithms. |

Stochastic Collocation Methods via L1 Minimization Using Randomized QuadraturesL. Guo, A. Narayan, T. Zhou, Y. Chen. In SIAM Journal on Scientific Computing, Vol. 39, No. 1, pp. A333--A359. Jan, 2017. ISSN: 1064-8275 DOI: 10.1137/16M1059680 In this work, we discuss the problem of approximating a multivariate function via ℓ1 minimization method, using a random chosen sub-grid of the corresponding tensor grid of Gaussian points. The independent variables of the function are assumed to be random variables, and thus, the framework provides a non-intrusive way to construct the generalized polynomial chaos expansions, stemming from the motivating application of Uncertainty Quantification (UQ). We provide theoretical analysis on the validity of the approach. The framework includes both the bounded measures such as the uniform and the Chebyshev measure, and the unbounded measures which include the Gaussian measure. Several numerical examples are given to confirm the theoretical results. |

Numerical Computation of Weil-Peterson Geodesics in the Universal Teichmueller SpaceM. Feiszli, A. Narayan. In SIAM Journal on Imaging Sciences, Vol. 10, No. 3, SIAM, pp. 1322--1345. Jan, 2017. DOI: 10.1137/15M1043947 We propose an optimization algorithm for computing geodesics on the universal Teichm\"uller space T(1) in the Weil-Petersson (WP) metric. Another realization for T(1) is the space of planar shapes, modulo translation and scale, and thus our algorithm addresses a fundamental problem in computer vision: compute the distance between two given shapes. The identification of smooth shapes with elements on T(1) allows us to represent a shape as a diffeomorphism on S1. Then given two diffeomorphisms on S1 (i.e., two shapes we want connect with a flow), we formulate a discretized WP energy and the resulting problem is a boundary-value minimization problem. We numerically solve this problem, providing several examples of geodesic flow on the space of shapes, and verifying mathematical properties of T(1). Our algorithm is more general than the application here in the sense that it can be used to compute geodesics on any other Riemannian manifold. |

An Orthogonality Property of the Legendre PolynomialsL. Bos, A. Narayan, N. Levenberg, F. Piazzon. In Constructive Approximation, Vol. 45, No. 1, pp. 65--81. Feb, 2017. ISSN: 0176-4276, 1432-0940 DOI: 10.1007/s00365-015-9321-3 We give a remarkable additional othogonality property of the classical Legendre polynomials on the real interval [−1,1]: polynomials up to degree n from this family are mutually orthogonal under the arcsine measure weighted by the degree-n normalized Christoffel function |

Nonlinear Stability of the MPM MethodM. Berzins. In V International Conference on Particle-based Methods – Fundamentals and Applications. PARTICLES 2017, Edited by P. Wriggers, M. Bischoff, E. O˜nate, D.R.J. Owen, & T. Zohdi, pp. 671--682. 2017. The Material Point Method (MPM) has been very successful in providing solutions to many challenging problems involving large deformations. The nonlinear nature of MPM makes it necessary to use a full nonlinear stability analysis to determine a stable timestep. The stability analysis of Spigler and Vianello is adapted to MPM and used to derive a stable timestep bound for a model problem. This bound is contrasted against a traditional CFL bound. |

Addressing Global Data Dependencies in Heterogeneous Asynchronous Runtime Systems on GPUs. Awarded Best PaperB. Peterson, A. Humphrey, J. Schmidt, M. Berzins. In Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware - ESPM2'17, ACM, 2017. DOI: 10.1145/3152041.3152082 Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. Using a combined data store and task scheduler redesign reduces data dependency duplication ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes which must be processed efficiently at large scale. |

Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social NetworksA. Bhatele, J. Yeom, N. Jain, C. J. Kuhlman, Y. Livnat, K. R. Bisset, L. V. Kale, M. V. Marathe. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May, 2017. DOI: 10.1109/ccgrid.2017.141 Controlling the spread of infectious diseases in large populations is an important societal challenge. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. We describe EpiSimdemics, a highly scalable, parallel code written in Charm++ that uses agent-based modeling to simulate disease spreads over large, realistic, co-evolving interaction networks. We present a new parallel implementation of EpiSimdemics that achieves unprecedented strong and weak scaling on different architectures — Blue Waters, Cori and Mira. EpiSimdemics achieves five times greater speedup than the second fastest parallel code in this field. This unprecedented scaling is an important step to support the long term vision of real-time epidemic science. Finally, we demonstrate the capabilities of EpiSimdemics by simulating the spread of influenza over a realistic synthetic social contact network spanning the continental United States (∼280 million nodes and 5.8 billion social contacts). |

Progressive CPU Volume Rendering with Sample AccumulationW. Usher, J. Amstutz, C. Brownlee, A. Knoll, I. Wald . In Eurographics Symposium on Parallel Graphics and Visualization, Edited by Alexandru Telea and Janine Bennett, The Eurographics Association, 2017. ISBN: 978-3-03868-034-5 ISSN: 1727-348X DOI: 10.2312/pgv.20171090 We present a new method for progressive volume rendering by accumulating object-space samples over successively rendered frames. Existing methods for progressive refinement either use image space methods or average pixels over frames, which can blur features or integrate incorrectly with respect to depth. Our approach stores samples along each ray, accumulates new samples each frame into a buffer, and progressively interleaves and integrates these samples. Though this process requires additional memory, it ensures interactivity and is well suited for CPU architectures with large memory and cache. This approach also extends well to distributed rendering in cluster environments. We implement this technique in Intel's open source OSPRay CPU ray tracing framework and demonstrate that it is particularly useful for rendering volumetric data with costly sampling functions. |