
Scientific Data Management

The Information Management group has been working on building new cyberinfrastructure that streamlines the creation, execution and sharing of complex visualizations, data mining and other large-scale data analysis applications. We developed VisTrails (www.vistrails.org), a new open-source scientific workflow and provenance management system designed to manage the rapidly evolving workflows common in exploratory applications. VisTrails provides novel mechanisms for capturing and interacting with provenance that greatly simplify the data exploration process. The system has been downloaded over 8,000 times since its beta release in January 2007. VisTrails has been adopted as part of the cyberinfrastructure in large scientific projects, and as a teaching and learning tool in graduate and undergraduate courses, both in the U.S. and abroad.

There has been explosive growth in the volume of structured information on the Web. This information often resides in the hidden (or deep) Web, stored in databases and exposed only through queries over Web forms. A recent study by Google estimates that there are several million such form interfaces. However, the high-quality information in online databases can be hard to find: it is out of reach for traditional search engines, whose indexes include only content in the surface Web.

Our group is combining techniques from machine learning, information retrieval and databases to build infrastructure that automates, to a large extent, the process of discovering and organizing hidden-Web data sources, a necessary step toward large-scale retrieval and integration of Web information. This infrastructure will enable people and applications to more easily find the right databases and, consequently, the hidden information they are seeking on the Web. We have used our hidden-Web infrastructure to build DeepPeep (www.deeppeep.org), a new search engine for Web forms.

Collaborative Monitoring and Analysis for Simulation Scientists. R. Tchoua, S. Klasky, N. Podhorszki, B. Grimm, A. Khan, E. Santos, C.T. Silva, P. Mouallem, M. Vouk. In Proceedings of The 2010 International Symposium on Collaborative Technologies and Systems (CTS 2010), (accepted). 2010.

Collaboratively monitoring and analyzing large-scale simulations from petascale computers is an important area of research and development within the scientific community. This paper addresses the issues that arise when teams of colleagues from different research areas work together to understand the complex data generated by these simulations. In particular, we address the issues that arise when geographically dispersed teams of researchers from different disciplines work together to understand the complex science being simulated on high-performance computers. Most application scientists want to focus on the science and spend a minimum amount of time learning new tools or adopting new techniques to monitor and analyze their simulation data. The challenge of eSimMon, our web-based system, is to decrease or eliminate some of the hurdles on the scientists’ path to scientific discovery, and allow these collaborations to flourish.


A First Study on Strategies for Generating Workflow Snippets. T. Ellkvist, L. Stromback, L. Lins, J. Freire. In Proceedings of the ACM SIGMOD International Workshop on Keyword Search on Structured Data (KEYS), pp. 15--20. 2009. ISBN: 978-1-60558-570-3

Workflows are increasingly being used to specify computational tasks, from simulations and data analysis to the creation of Web mashups. Recently, a number of public workflow repositories have become available, for example, myExperiment for scientific workflows, and Yahoo! Pipes. Workflow collections are also commonplace in many scientific projects. Having such collections opens up new opportunities for knowledge sharing and re-use. But for this to become a reality, mechanisms are needed that help users explore these collections and locate useful workflows. Although there has been work on querying workflows, not much attention has been given to presenting query results. In this paper, we take a first look at the requirements for workflow snippets and study alternative techniques for deriving concise, yet informative snippets.


Using Mediation to Achieve Provenance Interoperability. T. Ellkvist, D. Koop, J. Freire, C.T. Silva, L. Stromback. In Proceedings of the IEEE International Workshop on Scientific Workflows, pp. 291--298. 2009. ISBN: 978-0-7695-3708-5

Provenance is essential in scientific experiments. It contains information that is key to preserving data, to determining its quality and authorship, and to reproducing as well as validating the results. In complex experiments and analyses, where multiple tools are used to derive data products, provenance captured by these tools must be combined in order to determine the complete lineage of the derived products. In this paper, we describe a mediator-based architecture for integrating provenance information from multiple sources. This architecture contains two key components: a global mediated schema that is general and capable of representing provenance information expressed in different models; and a new system-independent query API that is general and able to express complex queries over provenance information from different sources. We also present a case study where we show how this model was applied to integrate provenance from three provenance-enabled systems and discuss the issues involved in this integration process.
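
The two components named in this abstract, a global schema and a system-independent query interface, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Derivation` record, the two wrapper functions, the `Mediator` class and the sample records are hypothetical and are not the schema or API described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One record in a hypothetical global schema:
    `output` was derived from `source` by `process` in `system`."""
    output: str
    process: str
    source: str
    system: str


def wrap_system_a(records):
    # Hypothetical system A stores dicts with keys "out", "module", "in".
    return [Derivation(r["out"], r["module"], r["in"], "A") for r in records]


def wrap_system_b(records):
    # Hypothetical system B stores (input, step, output) tuples.
    return [Derivation(out, step, inp, "B") for inp, step, out in records]


class Mediator:
    """Answers lineage queries over records already mapped to the global schema."""

    def __init__(self, derivations):
        self._by_output = {}
        for d in derivations:
            self._by_output.setdefault(d.output, []).append(d)

    def lineage(self, product):
        """Return every derivation upstream of `product`, across all systems."""
        seen, stack, result = set(), [product], []
        while stack:
            item = stack.pop()
            for d in self._by_output.get(item, []):
                if d not in seen:
                    seen.add(d)
                    result.append(d)
                    stack.append(d.source)
        return result


# Two systems each captured one step of the same analysis.
system_a = [{"out": "mesh.vtk", "module": "simulate", "in": "params.txt"}]
system_b = [("mesh.vtk", "render", "plot.png")]

mediator = Mediator(wrap_system_a(system_a) + wrap_system_b(system_b))
steps = [d.process for d in mediator.lineage("plot.png")]
systems = {d.system for d in mediator.lineage("plot.png")}
```

In this sketch the lineage of `plot.png` spans both systems even though neither one recorded the full chain, which is the kind of question a mediated view is meant to answer.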


Provenance Management: Challenges and Opportunities. J. Freire. In Datenbanksysteme in Business, Technologie und Web (BTW), pp. 4. 2009.

Computing has been an enormous accelerator to science and industry alike and it has led to an information explosion in many different fields. The unprecedented volume of data acquired from sensors, derived by simulations and data analysis processes, accumulated in warehouses, and often shared on the Web, has given rise to a new field of research: provenance management. Provenance (also referred to as audit trail, lineage, and pedigree) captures information about the steps used to generate a given data product. Such information provides important documentation that is key to preserving data, to determining its quality and authorship, and to understanding, reproducing, and validating results. Provenance solutions are needed in many different domains and applications, from environmental science and physics simulations, to business processes and data integration in warehouses. In this talk, we survey recent research results and outline challenges involved in building provenance management systems. We also discuss emerging applications that are enabled by provenance and outline open problems and new directions for database-related research.
 

Using Workflow Medleys to Streamline Exploratory Tasks. E. Santos, D. Koop, H.T. Vo, E. Anderson, J. Freire, C.T. Silva. In 21st International Conference on Scientific and Statistical Database Management (SSDBM), pp. 292--301. 2009.

To analyze and understand the growing wealth of scientific data, complex workflows need to be assembled, often requiring the combination of loosely coupled resources, specialized libraries, distributed computing infrastructure, and Web services. However, constructing these workflows is a non-trivial task, especially for users who do not have programming expertise. This problem is compounded for exploratory tasks, where the workflows need to be iteratively refined. In this paper, we introduce workflow medleys, a new approach for manipulating collections of workflows. We propose a workflow manipulation language that includes operations that are common in exploratory tasks and present a visual interface designed for this language. We briefly discuss how medleys have been applied in two (real) applications.
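
The idea of manipulating a collection of workflows as a unit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary representation of a workflow and the names `set_parameter` and `apply_to_medley` are hypothetical, and they do not reproduce the manipulation language proposed in the paper.

```python
from copy import deepcopy


# Assumption: a workflow is modelled as {module_name: {parameter: value}}.
def set_parameter(workflow, module, parameter, value):
    """Return a copy of `workflow` with one parameter changed.
    Workflows that lack `module` are returned unchanged."""
    updated = deepcopy(workflow)
    if module in updated:
        updated[module][parameter] = value
    return updated


def apply_to_medley(medley, operation, *args):
    """Apply the same operation to every workflow in the collection."""
    return {name: operation(wf, *args) for name, wf in medley.items()}


medley = {
    "isosurface": {"Reader": {"file": "a.vtk"}, "Contour": {"value": 0.5}},
    "volume": {"Reader": {"file": "a.vtk"}, "Render": {"opacity": 0.3}},
}

# One refinement step, applied to both workflows at once.
refined = apply_to_medley(medley, set_parameter, "Reader", "file", "b.vtk")
```

Returning copies rather than editing in place keeps the earlier collection available, which suits exploratory work where a refinement may need to be undone.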


User-Driven Application Development. E. Santos, L. Lins, J. Ahrens, J. Freire, C.T. Silva. In IEEE Transactions on Visualization and Computer Graphics, Proceedings of the 2009 IEEE Visualization Conference. Sept/Oct, 2009.

Visualization is essential for understanding the increasing volumes of digital data. However, the process required to create insightful visualizations is involved and time-consuming. Although several visualization tools are available, including tools with sophisticated visual interfaces, they are out of reach for users who have little or no knowledge of visualization techniques or who do not have programming expertise. In this paper, we propose VISMASHUP, a new framework for streamlining the creation of customized visualization applications. Because these applications can be customized for very specific tasks, they can hide much of the complexity in a visualization specification and make it easier for users to explore visualizations by manipulating a small set of parameters. We describe the framework and how it supports the various tasks a designer needs to carry out to develop an application, from mining and exploring a set of visualization specifications (pipelines), to the creation of simplified views of the pipelines, and the automatic generation of the application and its interface. We also describe the implementation of the system and demonstrate its use in two real application scenarios.