Leveraging 31 Million Google Street View Images to Characterize Built Environments and Examine County Health Outcomes |
Q. C. Nguyen, J. M. Keralis, P. Dwivedi, A. E. Ng, M. Javanmardi, S. Khanna, Y. Huang, K. D. Brunisholz, A. Kumar, T. Tasdizen. In Public Health Reports, Vol. 136, No. 2, SAGE Publications, pp. 201-211. 2021.
Methods: We leveraged computer vision and Google Street View images accessed from December 15, 2017, through July 17, 2018, to detect features of the built environment (presence of a crosswalk, non–single-family home, single-lane roads, and visible utility wires) for 2916 US counties. We used multivariate linear regression models to determine associations between features of the built environment and county-level health outcomes (prevalence of adult obesity, prevalence of diabetes, physical inactivity, frequent physical and mental distress, poor or fair self-rated health, and premature death [in years of potential life lost]).
Results: Compared with counties with the least number of crosswalks, counties with the most crosswalks were associated with decreases of 1.3%, 2.7%, and 1.3% of adult obesity, physical inactivity, and fair or poor self-rated health, respectively, and 477 fewer years of potential life lost before age 75 (per 100 000 population). The presence of non–single-family homes was associated with lower levels of all health outcomes except for premature death. The presence of single-lane roads was associated with an increase in physical inactivity, frequent physical distress, and fair or poor self-rated health. Visible utility wires were associated with increases in adult obesity, diabetes, physical and mental distress, and fair or poor self-rated health.
Conclusions: The use of computer vision and big data image sources makes possible national studies of the built environm
Understanding a program's resiliency through error propagation|
Z. Li, H. Menon, K. Mohror, P. T. Bremer, Y. Livnat, V. Pascucci. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, pp. 362-373. 2021.
Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), a serious problem in which a fault introduces an error into an HPC simulation that is not readily detected. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this solution does not provide any feedback on the code regions without samples.
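To make the notion of a silent error concrete, the following minimal sketch flips a single bit in the IEEE 754 double representation of a value. It is an illustration only, not the paper's fault injection framework; the helper name `flip_bit` is hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Hypothetical helper for illustration; bit 0 is the least significant
    mantissa bit and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the requested bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    x = 1.0
    for bit in (0, 52, 62, 63):
        print(bit, flip_bit(x, bit))
```

A flip in a low mantissa bit changes the value by a relative amount near machine precision and would likely go unnoticed, whereas a flip in the top exponent bit of 1.0 produces infinity, which is easy to detect. Which bits and which code regions matter for a given application is the kind of question the abstract above is concerned with.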
Blueprint: Cyberinfrastructure Center of Excellence|
Subtitled arXiv, E. Deelman, A. Mandal, A. P. Murillo, J. Nabrzyski, V. Pascucci, R. Ricci, I. Baldin, S. Sons, L. Christopherson, C. Vardeman, R. F. da Silva, J. Wyngaard, S. Petruzza, M. Rynge, K. Vahi, W. R. Whitcup, J. Drake, E. Scott. 2021.
In 2018, NSF funded an effort to pilot a Cyberinfrastructure Center of Excellence (CI CoE or Center) that would serve the cyberinfrastructure (CI) needs of the NSF Major Facilities (MFs) and large projects with advanced CI architectures. The goal of the CI CoE Pilot project (Pilot) effort was to develop a model and a blueprint for such a CoE by engaging with the MFs, understanding their CI needs, understanding the contributions the MFs are making to the CI community, and exploring opportunities for building a broader CI community. This document summarizes the results of community engagements conducted during the first two years of the project and describes the identified CI needs of the MFs. To better understand MFs' CI, the Pilot has developed and validated a model of the MF data lifecycle that follows the data generation and management within a facility and gained an understanding of how this model captures the fundamental stages that the facilities' data passes through from the scientific instruments to the principal investigators and their teams, to the broader collaborations and the public. The Pilot also aimed to understand what CI workforce development challenges the MFs face while designing, constructing, and operating their CI and what solutions they are exploring and adopting within their projects. Based on the needs of the MFs in the data lifecycle and workforce development areas, this document outlines a blueprint for a CI CoE that will learn about and share the CI solutions designed, developed, and/or adopted by the MFs, provide expertise to the largest NSF projects with advanced and complex CI architectures, and foster a …
Symplectic Time Integration Methods for the Material Point Method, Experiments, Analysis and Order Reduction|
M. Berzins. In WCCM-ECCOMAS2020 virtual Conference, January, 2021.
The provision of appropriate time integration methods for the Material Point Method (MPM) involves considering stability, accuracy and energy conservation. A class of methods that addresses many of these issues is the widely-used symplectic time integration methods. Such methods have good conservation properties and have the potential to achieve high accuracy. In this work we build on the work in  and consider high order methods for the time integration of the Material Point Method. The results of practical experiments show that while high order methods in both space and time have good accuracy initially, unless the problem has relatively little particle movement, the accuracy of the methods at later times is closer to that of low order methods. A theoretical analysis explains these results as being similar to the stage error found in Runge-Kutta methods, though in this case the stage error arises from the MPM differentiations and interpolations from particles to grid and back again, particularly in cases in which there are many grid crossings.
A Heterogeneous MPI+PPL Task Scheduling Approach for Asynchronous Many-Task Runtime Systems|
J. K. Holmen, D. Sahasrabudhe, M. Berzins. In Proceedings of the Practice and Experience in Advanced Research Computing 2021 on Sustainability, Success and Impact (PEARC21), ACM, 2021.
Asynchronous many-task runtime systems and MPI+X hybrid parallelism approaches have shown promise for helping manage the increasing complexity of nodes in current and emerging high performance computing (HPC) systems, including those for exascale. The increasing architectural diversity, however, poses challenges for large legacy runtime systems emphasizing broad support for major HPC systems. Performance portability layers (PPL) have shown promise for helping manage this diversity. This paper describes a heterogeneous MPI+PPL task scheduling approach for combining these promising solutions with additional consideration for parallel third party libraries facing similar challenges to help prepare such a runtime for the diverse heterogeneous systems accompanying exascale computing. This approach is demonstrated using a heterogeneous MPI+Kokkos task scheduler and the accompanying portable abstractions implemented in the Uintah Computational Framework, an asynchronous many-task runtime system, with additional consideration for hypre, a parallel third party library. Results are shown for two challenging problems executing workloads representative of typical Uintah applications. These results show performance improvements up to 4.4x when using this scheduler and the accompanying portable abstractions to port a previously MPI-Only problem to Kokkos::OpenMP and Kokkos::CUDA to improve multi-socket, multi-device node use. Good strong-scaling to 1,024 NVIDIA V100 GPUs and 512 IBM POWER9 processors is also shown using MPI+Kokkos::OpenMP+Kokkos::CUDA at scale.
Logically Parallel Communication for Fast MPI+Threads Communication|
R. Zambre, D. Sahasrabudhe, H. Zhou, M. Berzins, A. Chandramowlishwaran, P. Balaji. In Proceedings of the Transactions on Parallel and Distributed Computing, IEEE, April, 2021.
Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional “MPI everywhere” approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe a slower performance with MPI+threads primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI’s ordering constraints to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard’s semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
Optimizing the Hypre solver for manycore and GPU architectures|
D. Sahasrabudhe, R. Zambre, A. Chandramowlishwaran, M. Berzins. In Journal of Computational Science, Springer International Publishing, pp. 101279. 2020.
The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but the use of OpenMP with Hypre leads to at least 2x slowdown due to OpenMP overheads. The proposed solution uses the MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead and performs as fast or (up to 1.44x) faster than Hypre’s MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. The profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using Cuda-aware MPI, resulting in an overall speedup of 1.16–1.44x compared to the baseline GPU implementation.
Report from the NSF Workshop on Smart Cyberinfrastructure 2020|
V. Pascucci, I. Altintas, J. Fortes, I. Foster, H. Gu, S. Hariri, D. Stanzione, M. Taufer, X. Zhao. NSF, 2020.
Machine learning and other Artificial Intelligence technologies (all indicated in the following as AI) used within a modern, smart cyberinfrastructure have become critical new avenues for discovery and validation in data-driven science and engineering disciplines of all kinds. We can expect many landmark discoveries and new lines of productive research to be enabled through AI analysis of the rapidly growing treasure trove of scientific data. AI-based techniques have been applied in many fields of science and engineering, including remote sensing, cosmology, energy, cancer research, IT systems management, and machine design and control, but the lack of proper integration with the current NSF-supported cyberinfrastructure is limiting their potential. Recent events due to the COVID-19 pandemic have highlighted how cyberinfrastructure is a crucial enabler of modern research, with massive simulations and data management capabilities [8-10], but these events have also emphasized how the lack of proper integration with AI technology remains a major limiting factor for the advancement of science and engineering, especially when any kind of rapid response is needed.
A Terminology for In Situ Visualization and Analysis Systems|
H. Childs, S. D. Ahern, J. Ahrens, A. C. Bauer, J. Bennett, E. W. Bethel, P. Bremer, E. Brugger, J. Cottam, M. Dorier, S. Dutta, J. M. Favre, T. Fogal, S. Frey, C. Garth, B. Geveci, W. F. Godoy, C. D. Hansen, C. Harrison, B. Hentschel, J. Insley, C. R. Johnson, S. Klasky, A. Knoll, J. Kress, M. Larsen, J. Lofstead, K. Ma, P. Malakar, J. Meredith, K. Moreland, P. Navratil, P. O’Leary, M. Parashar, V. Pascucci, J. Patchett, T. Peterka, S. Petruzza, N. Podhorszki, D. Pugmire, M. Rasquin, S. Rizzi, D. H. Rogers, S. Sane, F. Sauer, R. Sisneros, H. Shen, W. Usher, R. Vickery, V. Vishwanath, I. Wald, R. Wang, G. H. Weber, B. Whitlock, M. Wolf, H. Yu, S. B. Ziegeler. In International Journal of High Performance Computing Applications, Vol. 34, No. 6, pp. 676–691. 2020.
The term “in situ processing” has evolved over the last decade to mean both a specific strategy for visualizing and analyzing data and an umbrella term for a processing paradigm. The resulting confusion makes it difficult for visualization and analysis scientists to communicate with each other and with their stakeholders. To address this problem, a group of over fifty experts convened with the goal of standardizing terminology. This paper summarizes their findings and proposes a new terminology for describing in situ systems. An important finding from this group was that in situ systems are best described via multiple, distinct axes: integration type, proximity, access, division of execution, operation controls, and output type. This paper discusses these axes, evaluates existing systems within the axes, and explores how currently used terms relate to the axes.
Numerical Testing of a New Positivity-Preserving Interpolation Algorithm|
Subtitled arXiv, T. A. J. Ouermi, R. M. Kirby, M. Berzins. 2020.
An important component of a number of computational modeling algorithms is an interpolation method that preserves the positivity of the function being interpolated. This report describes the numerical testing of a new positivity-preserving algorithm that is designed to be used when interpolating from a solution defined on one grid to a different spatial grid. The motivating application is a numerical weather prediction (NWP) code that uses spectral elements as the discretization choice for its dynamics core and Cartesian product meshes for the evaluation of its physics routines. This combination of spectral elements, which use nonuniformly spaced quadrature/collocation points, and uniformly spaced Cartesian meshes, combined with the desire to maintain positivity when moving between them, necessitates our work. This new approach is evaluated against several typical algorithms in use on a range of test problems in one or more space dimensions. The results obtained show that the new method is competitive in terms of observed accuracy while at the same time preserving the underlying positivity of the functions being interpolated.
Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation|
D. Sahasrabudhe, M. Berzins. In Computational Science – ICCS 2020, 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I, Springer International Publishing, pp. 175-190. 2020.
The solution of large-scale combustion problems with codes such as the Arches component of Uintah on next generation computer architectures requires the use of a many-core and multi-core threaded approach and/or GPUs to achieve performance. Such codes often use a low-Mach number approximation that requires the iterative solution of a large system of linear equations at every time step. While the discretization routines in such a code can be improved by the use of, say, OpenMP or CUDA approaches, it is important that the linear solver be able to perform well too. For Uintah, the Hypre iterative solver has proved able to solve such systems in a scalable way. The use of Hypre with OpenMP, however, leads to at least 2x slowdowns due to OpenMP overheads. This behavior is analyzed, and a solution using the MPI Endpoints approach is implemented within Hypre, where each team of threads acts as a different MPI rank. This approach minimized OpenMP synchronization overhead, avoided slowdowns, performed as fast or (up to 1.5x) faster than Hypre’s MPI-only version, and allowed the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre showed the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI. An overall speedup of 1.26x to 1.44x was observed compared to the baseline GPU implementation.
Distributed Resources for the Earth System Grid Advanced Management (DREAM), Final Report|
L. Cinquini, S. Petruzza, Jason J. Boutte, S. Ames, G. Abdulla, V. Balaji, R. Ferraro, A. Radhakrishnan, L. Carriere, T. Maxwell, G. Scorzelli, V. Pascucci. 2020.
The DREAM project was funded more than 3 years ago to design and implement a next-generation ESGF (Earth System Grid Federation) architecture which would be suitable for managing and accessing data and services resources in a distributed and scalable environment. In particular, the project intended to focus on the computing and visualization capabilities of the stack, which at the time were rather primitive. At the beginning, the team had the general notion that a better ESGF architecture could be built by modularizing each component and redefining its interaction with other components by defining and exposing a well-defined API. Although this was still the high level principle that guided the work, the DREAM project was able to accomplish its goals by leveraging new practices in IT that started just about 3 or 4 years ago: the advent of containerization technologies (specifically, Docker), the development of frameworks to manage containers at scale (Docker Swarm and Kubernetes), and their application to the commercial Cloud. Thanks to these new technologies, DREAM was able to improve the ESGF architecture (including its computing and visualization services) to a level of deployability and scalability beyond the original expectations.
CPU Ray Tracing of Tree-Based Adaptive Mesh Refinement Data|
F. Wang, N. Marshak, W. Usher, C. Burstedde, A. Knoll, T. Heister, C. R. Johnson. In Eurographics Conference on Visualization (EuroVis) 2020, Vol. 39, No. 3, 2020.
Adaptive mesh refinement (AMR) techniques allow for representing a simulation’s computation domain in an adaptive fashion. Although these techniques have found widespread adoption in high-performance computing simulations, visualizing their data output interactively and without cracks or artifacts remains challenging. In this paper, we present an efficient solution for direct volume rendering and hybrid implicit isosurface ray tracing of tree-based AMR (TB-AMR) data. We propose a novel reconstruction strategy, Generalized Trilinear Interpolation (GTI), to interpolate across AMR level boundaries without cracks or discontinuities in the surface normal. We employ a general sparse octree structure supporting a wide range of AMR data, and use it to accelerate volume rendering, hybrid implicit isosurface rendering and value queries. We demonstrate that our approach achieves artifact-free isosurface and volume rendering and provides higher quality output images compared to existing methods at interactive rendering rates.
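For orientation, the sketch below shows ordinary trilinear interpolation inside a single uniform cell. It is not the Generalized Trilinear Interpolation (GTI) proposed in the paper; the function name and the corner layout `c[i][j][k]` are assumptions made for this example. Applying this per-cell scheme independently on neighboring cells of different sizes can give mismatched values on the shared face, which is one common source of the cracks the abstract refers to.

```python
def trilinear(c, x, y, z):
    """Standard trilinear interpolation in a unit cell.

    c[i][j][k] is the sample at corner (i, j, k) with i, j, k in {0, 1};
    (x, y, z) are local coordinates in [0, 1]. Illustrative layout only.
    """
    # Interpolate along x on the four cell edges parallel to the x axis.
    c00 = c[0][0][0] * (1 - x) + c[1][0][0] * x
    c01 = c[0][0][1] * (1 - x) + c[1][0][1] * x
    c10 = c[0][1][0] * (1 - x) + c[1][1][0] * x
    c11 = c[0][1][1] * (1 - x) + c[1][1][1] * x
    # Interpolate along y, then along z.
    c0 = c00 * (1 - y) + c10 * y
    c1 = c01 * (1 - y) + c11 * y
    return c0 * (1 - z) + c1 * z


if __name__ == "__main__":
    # Corner samples of the linear field f = 1 + 2x + 3y + 4z.
    corners = [[[1 + 2 * i + 3 * j + 4 * k for k in (0, 1)]
                for j in (0, 1)] for i in (0, 1)]
    print(trilinear(corners, 0.25, 0.5, 0.75))
```

Trilinear interpolation reproduces any linear field exactly, so sampling a known linear function at the corners gives a quick sanity check.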
A convected particle least square interpolation material point method|
Q. A. Tran, W. Sołowski, M. Berzins, J. Guilkey. In International Journal for Numerical Methods in Engineering, Wiley, October, 2019.
Applying the convected particle domain interpolation (CPDI) to the material point method has many advantages over the original material point method, including significantly improved accuracy. However, in the large deformation regime, the CPDI still may not retain the expected convergence rate. The paper proposes an enhanced CPDI formulation based on a least squares reconstruction technique. The convected particle least square interpolation (CPLS) material point method assumes that the velocity field inside the material point domain is nonconstant. This velocity field in the material point domain is mapped to the background grid nodes with a moving least squares reconstruction. In this paper, we apply the improved moving least squares method to avoid the instability of the conventional moving least squares method due to a singular matrix. The proposed algorithm can improve the convergence rate, as illustrated by numerical examples using the method of manufactured solutions.
In situ visualization of performance metrics in multiple domains|
A. Sanderson, A. Humphrey, J. Schmidt, R. Sisneros, M. Papka. In 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), IEEE, Nov, 2019.
As application scientists develop and deploy simulation codes onto leadership-class computing resources, there is a need to instrument these codes to better understand performance and to use these resources efficiently. This instrumentation may come from independent third-party tools that generate and store performance metrics or from custom instrumentation tools built directly into the application. The metrics collected are then available for visual analysis, typically in the domain in which they were collected. In this paper, we introduce an approach to visualize and analyze the performance metrics in situ in the context of the machine, application, and communication domains (MAC model) using a single visualization tool. This visualization model provides a holistic view of the application performance in the context of the resources where it is executing.
A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures|
D. Sahasrabudhe, E. T. Phipps, S. Rajamanickam, M. Berzins. In Sixth Workshop on Accelerator Programming Using Directives (WACCPD), 2019.
As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. The unified, portable Single Instruction Multiple Data (SIMD) primitive proposed in this work allows intrinsics-based vectorization on CPUs and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code that is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against a hard-to-auto-vectorize baseline, to measure the overhead against an efficiently vectorized baseline, and to evaluate a new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across CPUs and GPUs without any performance degradation being observed experimentally.
An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes|
J. K. Holmen, B. Peterson, M. Berzins. In 2nd International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), In conjunction with SC19, 2019.
Diversity among supported architectures in current and emerging high performance computing systems, including those for exascale, makes portable codebases desirable. Portability of a codebase can be improved using a performance portability layer to provide access to multiple underlying programming models through a single interface. Direct adoption of a performance portability layer, however, poses challenges for large pre-existing software frameworks that may need to preserve legacy code and/or adopt other programming models in the future. This paper describes an approach for indirect adoption that introduces a framework-specific portability layer between the application developer and the adopted performance portability layer to help improve legacy code support and long-term portability for future architectures and programming models. This intermediate layer uses loop-level, application-level, and build-level components to ease adoption of a performance portability layer in large legacy codebases. Results are shown for two challenging case studies using this approach to make portable use of OpenMP and CUDA via Kokkos in an asynchronous many-task runtime system, Uintah. These results show performance improvements up to 2.7x when refactoring for portability and 2.6x when more efficiently using a node. Good strong-scaling to 442,368 threads across 1,728 Knights Landing processors is also shown using MPI+Kokkos at scale.
Time Integration Errors and Energy Conservation Properties of the Stormer Verlet Method Applied to MPM|
M. Berzins. In Proceedings of VI International Conference on Particle-based Methods – Fundamentals and Applications, Barcelona, Edited by E. Oñate, M. Bischoff, D.R.J. Owen, P. Wriggers & T. Zohdi, PARTICLES 2019, pp. 555-566. October, 2019.
The success of the Material Point Method (MPM) in solving many challenging problems nevertheless raises some open questions regarding the fundamental properties of the method, such as energy conservation, which has since been addressed by Bardenhagen and by Love and Sulsky. Similarly, while low order symplectic time integration techniques are used with MPM, higher order methods have not been used. For this reason the Stormer Verlet method, a popular and widely-used symplectic method, is applied to MPM. Both the time integration error and the energy conservation properties of this method applied to MPM are considered. The method is shown to have locally third order accuracy of energy conservation in time. This is in contrast to the locally second order accuracy in energy conservation of the methods that are used in many MPM calculations. This third order accuracy is demonstrated both locally and globally on a standard MPM test example.
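As a point of reference for the Stormer Verlet scheme named above, here is a minimal sketch of the method in its kick-drift-kick form, applied to a unit-mass harmonic oscillator rather than to MPM. The test problem, function names, and parameters are assumptions chosen for illustration and do not come from the paper.

```python
def stormer_verlet_step(q, p, dt, force):
    """Advance position q and momentum p (unit mass) by one step of size dt."""
    p_half = p + 0.5 * dt * force(q)          # half kick
    q_new = q + dt * p_half                   # full drift
    p_new = p_half + 0.5 * dt * force(q_new)  # half kick with updated force
    return q_new, p_new


def max_energy_error(dt, t_end, q0=1.0, p0=0.0):
    """Largest deviation from the initial energy for the oscillator q'' = -q."""
    def force(q):
        return -q

    def energy(q, p):
        return 0.5 * p * p + 0.5 * q * q

    e0 = energy(q0, p0)
    q, p = q0, p0
    worst = 0.0
    for _ in range(round(t_end / dt)):
        q, p = stormer_verlet_step(q, p, dt, force)
        worst = max(worst, abs(energy(q, p) - e0))
    return worst


if __name__ == "__main__":
    for dt in (0.1, 0.05):
        print(dt, max_energy_error(dt, 100.0))
```

On this toy problem the energy error stays bounded over long runs instead of drifting, and halving the step size should reduce the maximum error by roughly a factor of four. How this behavior carries over to MPM, with its particle-to-grid mappings, is what the paper examines.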
An improved moving least squares method for the Material Point Method|
Q. Tran, M. Berzins, W. Solowski. In Proceedings of the 2nd International Conference on the Material Point Method for Modelling Soil-Water-Structure Interaction (MPM 2019), 2019.
The paper presents an improved moving least squares reconstruction technique for the Material Point Method. The moving least squares (MLS) reconstruction can improve spatial accuracy in simulations involving large deformations. However, the MLS algorithm relies on computing the inverse of the moment matrix. This is both expensive and potentially unstable when there are not enough material points to reconstruct the high-order least squares function, which leads to a singular or an ill-conditioned matrix. The formulation presented can overcome this limitation while retaining the same order of accuracy as the conventional moving least squares reconstruction. Numerical experiments demonstrate the improvements in accuracy in comparison with the original Material Point Method and the Convected Particles Domain Interpolation method.
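The role of the moment matrix can be seen in a small one-dimensional sketch of conventional MLS with a linear basis. This illustrates the standard technique only, not the improved formulation from the paper; the hat weight function, the support radius `h`, and all function names are assumptions made for this example.

```python
import numpy as np


def moment_matrix(xs, x0, h):
    """Build the MLS moment matrix at x0 for the linear basis [1, x - x0].

    Uses a hat weight w = max(0, 1 - |x - x0| / h), an illustrative choice.
    Returns the matrix together with the basis values and weights.
    """
    xs = np.asarray(xs, dtype=float)
    w = np.maximum(0.0, 1.0 - np.abs(xs - x0) / h)
    basis = np.stack([np.ones_like(xs), xs - x0], axis=1)
    return (basis * w[:, None]).T @ basis, basis, w


def mls_value(xs, fs, x0, h):
    """Reconstruct f(x0) from scattered samples by solving M a = b."""
    m, basis, w = moment_matrix(xs, x0, h)
    b = basis.T @ (w * np.asarray(fs, dtype=float))
    # The first coefficient is the reconstructed value at x0.
    return np.linalg.solve(m, b)[0]


if __name__ == "__main__":
    xs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    print(mls_value(xs, 2.0 + 3.0 * xs, x0=0.3, h=0.6))
    # With a single point in the support, the 2x2 moment matrix loses rank.
    m_single, _, _ = moment_matrix([0.3], x0=0.3, h=0.6)
    print(np.linalg.matrix_rank(m_single))
```

With enough points inside the support, the linear basis reproduces a linear function. With only one point, the moment matrix is rank deficient and the solve is no longer well posed, which is the situation the abstract describes as arising when there are not enough material points.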
An Evaluation of An Asynchronous Task Based Dataflow Approach For Uintah|
A. Humphrey, M. Berzins. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 652-657. July, 2019.
The challenge of running complex physics code on the largest computers available has led to dataflow paradigms being explored. While such approaches are often applied at smaller scales, the challenge of extreme-scale data flow computing remains. The Uintah dataflow framework has consistently used dataflow computing at the largest scales on complex physics applications. At present Uintah contains two main dataflow models. Both are based upon asynchronous communication. One uses a static graph-based approach with asynchronous communication and the other uses a more dynamic approach that was introduced almost a decade ago. Subsequent changes within the Uintah runtime system, combined with many more large scale experiments, have necessitated a reevaluation of these two approaches, comparing them in the context of large scale problems. While the static approach has worked well for some large-scale simulations, the dynamic approach is seen to offer performance improvements over the static case for a challenging fluid-structure interaction problem at large scale that involves fluid flow and a moving solid represented using a particle method on an adaptive mesh.