Scientific discovery is undergoing a profound transformation from traditional hypothesis-driven methods to data-intensive methods, driven by massive data acquisition enabled by high-throughput instruments and sensors, increased reliance on analysis to explore extant data, and the growing recognition that sharing data is a vital form of scientific communication complementing traditional publications. Consider a researcher studying the mechanisms by which stem cells differentiate to form the structure of a human kidney. The researcher wonders: “What happens if I try this?” “What approaches have others tried?” “What data is available?” “What if things worked like this?” Such questions are easily posed but frequently require laborious effort to answer, due to the large “semantic gap” between how humans think and the technical realities of data and software.
In his seminal 1960 paper “Man-Computer Symbiosis,” J. C. R. Licklider observed that… “my choices of what to attempt and what not to attempt [are] determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability.” Unfortunately, this remains the case today, in 2015, if not more so: the rise of data-intensive methods is revealing the limitations of current systems in supporting discovery, and innovation at times grinds to a halt. As data size and complexity grow, the challenges are magnified. These complexities can have a profound impact on the ability to achieve scientific transformation: in the best case significantly slowing progress, and in the worst case causing potential advances to go unrecognized.
Surprisingly, there are examples in the commercial sector where the complexity of managing and manipulating data has been radically reduced. Consider popular services for organizing pictures, which streamline data ingest from cameras, automatically extract relevant metadata, perform quality control to determine whether a picture is in focus, and apply deep learning to extract information about the content and context of a picture so as to enable sophisticated organization and navigation of the data. Of course, working with scientific data is significantly more complicated, due to the size, heterogeneity, and complexity of the data in domains such as the biosciences. However, we argue that if we could organize and manage scientific data as easily as we organize our vacation pictures, it could have a transformational impact on our ability to rapidly discover new knowledge from our exploding volume of scientific data.
In this talk, I will explore these ideas within the context of discovery over diverse and complex biomedical data. I will describe our work on a new class of systems, which we call Biomedical Digital Asset Management systems, inspired by the common approaches we currently use to organize the data in the rest of our lives, such as music and photos. I will describe our architectural approach to building these types of systems and illustrate their application in a number of challenging domains, such as neuroscience, craniofacial development, stem cell research, and molecular biology.
Posted by: Nathan Galli