Drug discovery is hard (nine out of ten drug candidates fail), time-consuming (typically 10 to 15 years), and expensive (Tufts’ 2016 estimate: $2.87 billion). But things are getting better, right? In 2017, although the EMA approved only 35 new active substances, FDA drug approvals hit a 21-year high: 46 new molecular entities, the highest number since 1996. This was a mix of 29 small molecules and, demonstrating their increasing therapeutic importance, 17 biologics (nine antibodies, five peptides, two enzymes, and an antibody-drug conjugate). But the FDA counted only 33% of the 46 approvals as new classes of compound; the rest came from older classes, which probably entered the R&D pipeline 15 to 20 years ago.
Is this bumper crop of approvals some reflection of major advances in drug discovery techniques and technology that primed the R&D pipeline at the turn of the century? Or is it just an artifact of the FDA approval process and timeline? Hard to say either way, but in the long game of drug development, scientists and researchers will be keen to jump on any improvements that can be made now.
What contributes to the threefold challenge that makes drug discovery and development hard, time-consuming, and expensive? Surely the plethora of “latest things” – personalized and translational medicine, biomarkers, the cloud, AI, NLP, CRISPR, data lakes, etc. – will lead to better drugs sooner and more cheaply? At the highest level, probably. But down in the trenches, researchers and their IT and data scientist colleagues are engaged in an ever-increasing daily struggle: to develop and run more complex assays, to capture and manage larger volumes of variable and disparate data, to handle a mix of small molecule and biologic entities, and then to make sense of this data deluge and draw conclusions and insights. Often they must do all this with inflexible, hard-to-maintain home-grown or legacy systems that can no longer keep pace.
Let’s look at some of these challenges in more detail.
Informatics systems built on a traditional RDBMS require expensive database administrators just to keep them functioning, and much time and budget have to be devoted to fixing issues and keeping up with software and system upgrades. This leaves little or no time to make enhancements, to adjust the system to incorporate a new assay, or to manage and index a novel data type. As a result, IT staff are slow to deliver even the simplest requested change, which may spur researchers to go rogue and revert to using spreadsheets and sneakernet to capture and share data.
The Data Scientist’s inbox
Organizing and indexing the variety and volume of data and datatypes generated in modern drug discovery research is an ongoing challenge. Scientists want timely and complete access to the data, with reasonable response times to searches, and easy-to-use display forms and tables.
Older legacy informatics systems did a reasonable job of capturing, indexing, linking, and presenting structured data on basic chemistry, physical properties, and bioassays, but at the cost of devising, setting up, and maintaining an unwieldy array of underlying files and forms. Extending a bioassay to capture additional data, reading in a completely new instrument data file, or linking two previously disconnected data elements all require modifications to the underlying data schema and forms, and add to the growing backlog of unaddressed enhancement tasks in the data scientist’s inbox.
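To make the schema point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `assay_result`, the columns `ic50_nm` and `hill_slope`, and the compound IDs are hypothetical, chosen only for illustration; they do not describe any particular legacy system. The sketch shows that a fixed relational schema rejects a newly captured assay value until someone alters the table.

```python
import sqlite3

# In-memory database standing in for a legacy assay store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assay_result (compound_id TEXT, ic50_nm REAL)")
conn.execute("INSERT INTO assay_result VALUES (?, ?)", ("CMP-001", 12.5))


def insert_with_hill_slope(connection):
    """Try to store a result that carries an extra measured value.

    Returns True if the insert succeeds, False if the schema rejects it.
    """
    try:
        connection.execute(
            "INSERT INTO assay_result (compound_id, ic50_nm, hill_slope) "
            "VALUES (?, ?, ?)",
            ("CMP-002", 8.0, 1.1),
        )
        return True
    except sqlite3.OperationalError:
        # The column does not exist yet, so the new data cannot be captured.
        return False


accepted_before = insert_with_hill_slope(conn)

# The schema change that would land in the data scientist's inbox.
conn.execute("ALTER TABLE assay_result ADD COLUMN hill_slope REAL")

accepted_after = insert_with_hill_slope(conn)
row_count = conn.execute("SELECT COUNT(*) FROM assay_result").fetchone()[0]
```

In a real deployment the same change would also ripple into the forms and reports built on the table, which is where much of the backlog described above tends to accumulate.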
In addition to managing well-structured data, scientists increasingly want combined access to unstructured data such as text contained in comments or written reports. Legacy systems have very limited capabilities to incorporate and index such material in a usable way, so potentially valuable information is ignored when making decisions or drawing insights.
Lack of tools for meaningful exploration
Faced with the research data deluge, scientists want to get to just the right data in the right format, with the right tools on hand for visualization and analysis. But the challenge is to know what data exists, where, and in what format. Legacy systems often provide data catalogs to help find what is available, and offer simple, brute-force search tools, but response times are often inadequate, and hit lists contain far too few or far too many results to be useful. Iterative searches may help to focus a hit set on a lead series or assay type of interest, but the searcher is often left trying to make sense of a series of slightly different hit lists, using cumbersome list logic operations to arrive at the intersection list that satisfies all the specified substructure, dose-response, and physical property range parameters.
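The list logic described above amounts to set intersection across hit lists. The following is a minimal sketch in Python; the compound IDs and the helper name `intersect_hit_lists` are hypothetical, and the three sets stand in for the results of separate substructure, dose-response, and property-range searches.

```python
# Hypothetical hit lists: compound IDs returned by three separate searches.
substructure_hits = {"CMP-001", "CMP-002", "CMP-003", "CMP-005"}
dose_response_hits = {"CMP-002", "CMP-003", "CMP-004", "CMP-005"}
property_range_hits = {"CMP-003", "CMP-005", "CMP-006"}


def intersect_hit_lists(*hit_lists):
    """Return the sorted compound IDs that appear in every hit list."""
    if not hit_lists:
        return []
    common = set(hit_lists[0]).intersection(*hit_lists[1:])
    return sorted(common)


combined = intersect_hit_lists(
    substructure_hits, dose_response_hits, property_range_hits
)
print(combined)  # ['CMP-003', 'CMP-005']
```

The operation itself is simple; the burden described in the text comes from having to run it by hand, repeatedly, across lists produced by separate searches.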
Once a tractable hit set is available, the researcher is then challenged to locate and use the appropriate tools to explore structure-activity relationships (SARs), develop and test hypotheses, and identify promising candidates for more detailed evaluation. Such tools are often hard to find, and each may come with its own idiosyncratic user interface and a steep learning curve. Time is also spent designing and tweaking display forms to present the data in the best way, and every change slows down decision making. Knowing which tools and forms to use, in what order, and on which sets of data can be frustrating, and can lead to incomplete or misleading analyses or conclusions.
In the area of SAR and bioSAR, underlying chemical structure and biosequence intelligence is a key requirement for meaningful exploration and analysis. These capabilities are often available only in separate applications with different user interfaces, when ideally they should be accessible through a unified chemistry/biosequence search and display application, supported by a full range of substructure and sequence analysis and display tools.
Lab, section, and therapeutic area managers are all challenged to help discover, develop, and deliver better drugs faster and more cheaply. They want their R&D teams to be working at peak efficiency, with the best tools available to meet current and future demands. This first requires the foundation of a future-proof, flexible, and extensible platform. Next, any system built on the platform must be able to intelligently and flexibly handle all types of R&D data, structured or unstructured, now and in the future. Research scientists can then exploit this well-managed data with tools that guide them through effective and timely search and retrieval, analysis workflows, and advanced SAR visual analytics. This will lead to better science and a faster path from insight to action.