A few weeks ago I attended the Festival of Genomics in London, and amongst the various interesting topics discussed, the age-old question of biomedical data integration featured heavily and clearly remains a priority. All this talk of the richness and diversity of biomedical datasets reminded me of the 2010 paper, “The $1,000 genome, the $100,000 analysis?”, and how our collective thinking has moved on from data generation to the challenges associated with analysis.
No one disputes that this vast amount of data puts us in a unique position to find diagnoses and treatments that will make precision medicine a reality. Equally, no one is under the illusion that getting there can be addressed solely with the “$1,000 genome” catchphrase. The key to harnessing the value in the avalanche of data generated by these assays is to be strategic about implementing an analysis infrastructure that accounts for the complexity and volume of the data in a manner that can be easily understood by decision makers.
Of course, this problem is not exclusive to the life sciences; it comes with the territory of producing large and complex datasets, regardless of the industry. As early as the 1990s, the Research Institute for Advanced Computer Science at NASA’s Ames Research Center was collecting so much complex data that its Director, Peter J. Denning, commented, “The imperative to save all the bits forces us into an impossible situation: the rate and volume of information flow overwhelm our networks, storage devices and retrieval systems, as well as the human capacity for comprehension.” This doesn’t sound too dissimilar to the problem we are facing in the life sciences, and regardless of how cheap human genome sequencing gets, the ‘human capacity for comprehension’ needs to be aided so that we can develop a coherent clinical sense out of the data.
Thanks to the pioneering work that has been done in other industries, we are equipped with some great technological advances to deal with this problem. A clear indication was the diversity of the vendors exhibiting at the Festival of Genomics: companies that would perhaps look more at home at an IT convention most definitely see a growing market. Some of them are uniquely placed to help us, whether or not their initial offering was in the life sciences. Utilising these advances will no doubt improve the clinical utility of biomedical data and help us march towards our goal of precision medicine. However, before we can leverage them to mine biomedical data effectively (and invest heavily in an infrastructure), we need to be aware of some critical points in the data life cycle:
Collaborative data storage and security: As the availability of computational resources becomes more challenging, how do we store the data: on-premise or in the cloud? How can we control infrastructure costs whilst improving scalability in the process?
Facilitating rapid transfer and data processing: How can we integrate tools that allow for processing and storage of extremely large data sets to support distributed research?
Access to public or legacy databases: How can we create an infrastructure that is able to access the vast amount of data in the public domain and is able to integrate flexible data types?
Searching and analysing data in real time: How can we improve the searching and querying of data to ensure a seamless flow of information from the data lake to the end user?
Enriching or curating your data: How can we integrate data (structured and unstructured) from different sources to add context to our data for a deeper and more meaningful analysis?
User-friendly applications to analyse and explore data: How can we choose a data exploration platform that not only has the ability to scale up to perform complex analytics (for data scientists and bioinformaticians) but is also able to disseminate the analysis in a manner that is user-friendly for the end users (such as clinicians or biologists) who will make a clinical (or physiological) inference from the actual data?
I believe it is critical to address these pain points in the data life cycle. Moreover, as biologists, clinicians, data scientists, bioinformaticians and IT all work together to address the different aspects of the biomedical data cycle, a better understanding of each other’s requirements (and challenges) is also needed. In subsequent blogs, I intend to explore how current technologies and their applications can be used to solve these key pain points and enhance the clinical utility of translational research data.
Stay Tuned! In the meantime, I leave you with this white paper.
Published on LinkedIn on March 1, 2018