Scientific Data Management

The Information Management group has been building new cyberinfrastructure that streamlines the creation, execution and sharing of complex visualizations, data mining and other large-scale data analysis applications. We developed VisTrails, a new open-source scientific workflow and provenance management system designed to manage the rapidly evolving workflows common in exploratory applications. VisTrails provides novel mechanisms for capturing and interacting with provenance that greatly simplify the data exploration process. The system has been downloaded over 8,000 times since its beta release in January 2007. VisTrails has been adopted as part of the cyberinfrastructure in large scientific projects, as well as a teaching and learning tool in graduate and undergraduate courses, both in the U.S. and abroad.

There has been explosive growth in the volume of structured information on the Web. This information often resides in the hidden (or deep) Web, stored in databases and exposed only through queries over Web forms. A recent study by Google estimates that there are several million such form interfaces. However, the high-quality information in online databases can be hard to find: it is out of reach for traditional search engines, whose indexes include only content in the surface Web.

Our group is combining techniques from machine learning, information retrieval and databases to build infrastructure that automates, to a large extent, the process of discovering and organizing hidden-Web data sources, a necessary step toward large-scale retrieval and integration of Web information. This infrastructure will enable people and applications to more easily find the right databases and, consequently, the hidden information they are seeking on the Web. We have used our hidden-Web infrastructure to build DeepPeep, a new search engine for Web forms.

Rethinking Abstractions for Big Data: Why, Where, How, and What
M. Hall, R.M. Kirby, F. Li, M.D. Meyer, V. Pascucci, J.M. Phillips, R. Ricci, J. Van der Merwe, S. Venkatasubramanian. In Cornell University Library, No. arXiv:1306.3295, 2013.

Big data refers to large and complex data sets that, under existing approaches, exceed the capacity and capability of current compute platforms, systems software, analytical tools and human understanding [7]. Numerous lessons on the scalability of big data can already be found in asymptotic analysis of algorithms and from the high-performance computing (HPC) and applications communities. However, scale is only one aspect of current big data trends; fundamentally, current and emerging problems in big data are a result of unprecedented complexity: in the structure of the data and how to analyze it, in dealing with unreliability and redundancy, in addressing the human factors of comprehending complex data sets, in formulating meaningful analyses, and in managing the dense, power-hungry data centers that house big data.

The computer science solution to complexity is finding the right abstractions, those that hide as much triviality as possible while revealing the essence of the problem that is being addressed. The "big data challenge" has disrupted computer science by stressing to the very limits the familiar abstractions which define the relevant subfields in data analysis, data management and the underlying parallel systems. Efficient processing of big data has shifted systems towards increasingly heterogeneous and specialized units, with resilience and energy becoming important considerations. The design and analysis of algorithms must now incorporate emerging costs in communicating data driven by IO costs, distributed data, and the growing energy cost of these operations. Data analysis representations as structural patterns and visualizations surpass human visual bandwidth, structures studied at small scale are rare at large scale, and large-scale high-dimensional phenomena cannot be reproduced at small scale.

As a result, not enough of these challenges are revealed by isolating abstractions in a traditional software stack or standard algorithmic and analytical techniques, and attempts to address complexity either oversimplify or require low-level management of details. The authors believe that the abstractions for big data need to be rethought, and this reorganization needs to evolve and be sustained through continued cross-disciplinary collaboration.

In what follows, we first consider the question of why big data and why now. We then describe the where (big data systems), the how (big data algorithms), and the what (big data analytics) challenges that we believe are central and must be addressed as the research community develops these new abstractions. We equate the biggest challenges that span these areas of big data with big mythological creatures, namely cyclops, that should be conquered.

Synergistic Challenges in Data-Intensive Science and Exascale Computing
J. Chen, A. Choudhary, S. Feldman, B. Hendrickson, C.R. Johnson, R. Mount, V. Sarkar, V. White, D. Williams. DOE ASCAC Data Subcommittee Report, Department of Energy Office of Science, March, 2013.

The ASCAC Subcommittee on Synergistic Challenges in Data-Intensive Science and Exascale Computing has reviewed current practice and future plans in multiple science domains in the context of the challenges facing both Big Data and Exascale Computing. The review drew from public presentations, workshop reports and expert testimony. Data-intensive research activities are increasing in all domains of science, and exascale computing is a key enabler of these activities. We briefly summarize below the key findings and recommendations from this report from the perspective of identifying investments that are most likely to positively impact both data-intensive science goals and exascale computing goals.

Scientific Discovery at the Exascale: Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis, and Visualization
S. Ahern, A. Shoshani, K.L. Ma, A. Choudhary, T. Critchlow, S. Klasky, V. Pascucci. Department of Energy, February, 2011.

DEFOG: A System for Data-Backed Visual Composition
L. Lins, D. Koop, J. Freire, C.T. Silva. SCI Technical Report, No. UUSCI-2011-003, SCI Institute, University of Utah, 2011.

Collaborative Monitoring and Analysis for Simulation Scientists
R. Tchoua, S. Klasky, N. Podhorszki, B. Grimm, A. Khan, E. Santos, C.T. Silva, P. Mouallem, M. Vouk. In Proceedings of The 2010 International Symposium on Collaborative Technologies and Systems (CTS 2010), pp. 235--244. 2010.
DOI: 10.1109/CTS.2010.5478506

Collaboratively monitoring and analyzing large-scale simulations from petascale computers is an important area of research and development within the scientific community. This paper addresses these issues when teams of colleagues from different research areas work together to help understand the complex data generated from these simulations. In particular, we address the issues when geographically diverse teams of disparate researchers work together to understand the complex science being simulated on high-performance computers. Most application scientists want to focus on the sciences and spend a minimum amount of time learning new tools or adopting new techniques to monitor and analyze their simulation data. The challenge of eSimMon, our web-based system, is to decrease or eliminate some of the hurdles on the scientists' path to scientific discovery, and allow these collaborations to flourish.

VisMashup: Streamlining the Creation of Custom Visualization Applications
E. Santos, L. Lins, J.P. Ahrens, J. Freire, C.T. Silva. In IEEE Transactions on Visualization and Computer Graphics, Proceedings of the 2009 IEEE Visualization Conference, Vol. 15, No. 6, pp. 1539--1546. Sept/Oct, 2009.
DOI: 10.1109/TVCG.2009.195

Provenance Management: Challenges and Opportunities
J. Freire. In Datenbanksysteme in Business, Technologie und Web (BTW), pp. 4. 2009.

Using Workflow Medleys to Streamline Exploratory Tasks
E. Santos, D. Koop, H.T. Vo, E. Anderson, J. Freire, C.T. Silva. In 21st International Conference on Scientific and Statistical Database Management (SSDBM), pp. 292--301. 2009.

A First Study on Strategies for Generating Workflow Snippets
T. Ellkvist, L. Stromback, L. Lins, J. Freire. In Proceedings of the ACM SIGMOD International Workshop on Keyword Search on Structured Data (KEYS), pp. 15--20. 2009.
ISBN: 978-1-60558-570-3

Using Mediation to Achieve Provenance Interoperability
T. Ellkvist, D. Koop, J. Freire, C.T. Silva, L. Stromback. In Proceedings of the IEEE International Workshop on Scientific Workflows, pp. 291--298. 2009.
ISBN: 978-0-7695-3708-5