$1.6 million grant supports creation of large dataset repository
By STEPHANIE DUTCHEN
Developing a database to host large biomedical datasets from around the world, such as the lattice light-sheet microscopy information shown above, would allow scientists to more effectively assess study results and accelerate progress. Video: Betzig Lab, HHMI/Janelia Research Campus, and Mimori-Kiyosue Lab, RIKEN Center for Developmental Biology
Doing science these days involves generating data—often a lot of data.
Doing excellent science requires making the data behind a published study available so others can validate or challenge the results.
However, several obstacles stand in the way of this ideal. One is that scientists in many fields, including structural biology, lack a central repository into which they can upload their data.
“If someone wanted a dataset I collected when I was in graduate school, I know it’s on a shelf in a box in the lab I was in. At least, it was there when I left,” said Pete Meyer, a research computing specialist and X-ray crystallographer at Harvard Medical School. “Replicate that among all structural biologists or all X-ray crystallographers and you get a sense of the scale of the problem.”
With no central repository to hold them, researchers estimate that the datasets behind hundreds of thousands of structural biology experiments have essentially disappeared.
On Oct. 1, researchers at HMS and Harvard University received a three-year, $1.6 million grant from the Leona M. and Harry B. Helmsley Charitable Trust to help solve the problem by developing a global open-source system that can manage large biomedical datasets.
“It’s like a community Dropbox,” said co-principal investigator Piotr Sliz, associate professor of biological chemistry and molecular pharmacology at HMS. “By collecting data in one place where people can find it, access it and analyze it, we will be better able to reproduce the entire workflow described in a paper, stimulate the development of new methods, teach and train new scientists and accelerate the growth of the field.”
“There has been a push from funding agencies and journals to make primary data public when possible,” said co-principal investigator Mercè Crosas, director of data science at the Institute for Quantitative Social Science at Harvard. “When you have a mandate but no solution, then people are lost. We have the infrastructure to provide a user-friendly solution.”
The endeavor expands on the Dataverse, an open-source, web-based application for storing and sharing research data, led by Crosas. The Dataverse was originally designed for the social sciences and will now be augmented to better accommodate bigger datasets from structural biology, cell biology and other fields.
An X-ray diffraction dataset (https://data.sbgrid.org/dataset/14) used to determine the three-dimensional model of a crystallized protein. Video: Pete Meyer
“It’s a nice merge of what the community needs—access to data—with software development, standards and best practices to provide a framework,” said Crosas.
The project also harnesses the power of hundreds of structural biology laboratories around the world that belong to the SBGrid Consortium, convened by Sliz.
Just as a dataset loses its usefulness if it’s not accessible, even the most beautifully designed database won’t do much good if nobody uses it. Sliz hopes that introducing the expanded Dataverse to the SBGrid community will ensure that it is quickly populated with datasets and adopted by others.