How Poor Data is Holding Back AI in R&D

by Kevin Cramer, Founder, CEO, CTO, Sapio Sciences

There is no doubt that artificial intelligence (AI) is affecting biopharma R&D at an accelerating pace. In just the last 18 months, we have seen major announcements from companies of all sizes whose mission is to accelerate scientific discovery. For example, Insilico Medicine announced the successful AI-driven design of a novel drug candidate1; Microsoft announced BioGPT, a specialized large language model (LLM) trained on biomedical texts2; and Google DeepMind researchers earned a Nobel Prize for their groundbreaking work in protein structure prediction.3

AI is often hailed as a groundbreaking technology that will transform the landscape of drug discovery. From speeding up the identification of drug targets to predicting protein structures, the potential applications seem limitless. Despite this promise and some encouraging advances, AI has not yet been the revolutionary or democratizing force many anticipated in drug discovery, leaving scientists and researchers questioning when and how it will truly affect their research.

While AI is helping scientists to accelerate research to some extent, a number of barriers exist that currently limit AI’s potential in drug discovery and its widespread integration across labs.

AI in drug discovery

There have unquestionably been a number of bright spots in recent years. In 2020, DeepMind's AlphaFold2 demonstrated the ability to predict protein structures with atomic accuracy comparable to experimental structures in a majority of cases4, and BenevolentAI’s knowledge graph was used to search for approved drugs that could help in the fight against COVID-19, identifying baricitinib as a means of preventing the virus from infecting lung cells.5 By 2024, AlphaFold3 had moved beyond proteins to a broad spectrum of biomolecules, including DNA, RNA, and even small molecules.6

Although breakthroughs like DeepMind's AlphaFold and BenevolentAI’s work on COVID-19 have demonstrated AI’s potential, these represent isolated cases rather than widespread success stories. For most researchers, AI still feels more like an abstract tool than something they can practically apply in everyday lab work. In fact, a recent survey by Elsevier found that while more than three out of four researchers expect to use AI within the next two to five years, over 80% raised concerns about the potential for AI to cause critical errors.7

One reason for the lack of trust in generalized AI tools is their tendency to hallucinate. Put simply, a hallucination occurs when a model generates information that is false, nonsensical, or unverifiable while presenting it as fact, which is obviously a serious problem at any stage of the drug discovery process. This lack of trust slows the uptake of AI in the lab, and a major cause is that most AI models are trained on generalized, incomplete, or non-representative datasets. Many AI tools depend on public data, which is often fragmented, outdated, or insufficiently detailed for real laboratory settings. In fact, Elsevier’s survey found that 91% of researchers want the LLMs behind AI tools to be based solely on high-quality, trusted sources, removing the bias or erroneous conclusions that current AI tools can suffer from.

AI’s Achilles’ heel

The LLMs that underpin AI tools are trained on massive amounts of data. They are very good at generating results based on what they saw during training, but public datasets lack the depth required for meaningful scientific insights when used for biopharma research. They exclude proprietary data, unpublished or confidential research, and information of a commercially sensitive nature. This lack of depth in the available data can severely impact the intelligence of the models and the scientific validity of their responses.

Crucially, many AI models lack access to real-time data, such as the findings from experiments that are still in progress or readings from the wide range of devices and instruments used in today’s labs. This compromises their ability to adapt to new findings and limits their effectiveness.

While new lab technologies and devices have allowed researchers to capture data at an ever-increasing pace, data unification often remains a critical missing piece. This means that while key data and information exist, they are often siloed across disparate systems and left without context.

Breaking down data barriers

The amount of data generated during the drug discovery process is enormous. From large screens to identify novel targets to clinical trials, researchers accumulate vast amounts of information from various sources. In genomics research, the volumes of data are staggering, with estimates of between 2 and 40 exabytes of data being created within the next decade.8 To put that into perspective, 40 exabytes is approximately 40 billion hours of HD video, or 20 trillion books.
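
The comparison above can be sanity-checked with back-of-the-envelope arithmetic. The short Python sketch below assumes decimal units (1 exabyte = 10^18 bytes) and works backward from the quoted figures to the per-item sizes they imply. The function name is illustrative only.

```python
EXABYTE = 10**18  # bytes, decimal (SI) definition; an assumption of this sketch


def implied_size_bytes(total_exabytes: float, item_count: float) -> float:
    """Return the average size per item implied by a total volume and a count."""
    return total_exabytes * EXABYTE / item_count


# Figures quoted in the text for the upper estimate of 40 exabytes.
per_video_hour = implied_size_bytes(40, 40e9)  # 40 billion hours of HD video
per_book = implied_size_bytes(40, 20e12)       # 20 trillion books

print(f"Implied size per hour of HD video: {per_video_hour / 1e9:.1f} GB")
print(f"Implied size per book: {per_book / 1e6:.1f} MB")
```

Under these assumptions, the comparison implies roughly 1 GB per hour of video and 2 MB per book, so it is best read as an order-of-magnitude illustration rather than a precise conversion.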

For AI to reach its full potential in drug discovery, it must be able to access and integrate these huge datasets from across the entire research process, including structured data such as experimental measurements and results and unstructured data like research notes, lab notebooks, experimental comments, and published research. By providing the LLMs that power AI systems with both types of data and ensuring the information encompasses core scientific principles, we can improve the accuracy and relevance of the output, reducing the likelihood of errors and hallucinations.
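
As a rough illustration of what combining the two data types might look like, the Python sketch below joins structured assay measurements with free-text notebook entries on a shared sample identifier and renders one context block per sample that could be supplied to an LLM. Every name here (the record classes, the field names, the sample ID, and the values) is hypothetical and chosen for illustration; none of it is drawn from a specific LIMS or ELN product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Measurement:  # structured data, e.g. an instrument or assay result
    sample_id: str
    assay: str
    value: float
    unit: str


@dataclass
class Note:  # unstructured data, e.g. a lab notebook comment
    sample_id: str
    text: str


def build_context(measurements, notes):
    """Group both data types by sample and return one text block per sample."""
    grouped = defaultdict(lambda: {"measurements": [], "notes": []})
    for m in measurements:
        grouped[m.sample_id]["measurements"].append(m)
    for n in notes:
        grouped[n.sample_id]["notes"].append(n)

    context = {}
    for sample_id, data in sorted(grouped.items()):
        lines = [f"Sample {sample_id}"]
        lines += [f"- {m.assay}: {m.value} {m.unit}" for m in data["measurements"]]
        lines += [f"- note: {n.text}" for n in data["notes"]]
        context[sample_id] = "\n".join(lines)
    return context


# Made-up example records, for illustration only.
measurements = [Measurement("S-001", "IC50", 12.5, "nM")]
notes = [Note("S-001", "Precipitate observed at highest concentration.")]
context = build_context(measurements, notes)
print(context["S-001"])
```

Keeping the note next to the measurement is the point of the exercise: the numeric result alone would not tell a model that the reading may have been affected by the precipitate.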

Unfortunately, integrating data stored in various formats and across different platforms is not a straightforward task. Traditional laboratory information management system (LIMS) and electronic laboratory notebook (ELN) solutions struggle to capture the nuanced insights or volumes of information being generated both in the lab and “in silico.” However, modern lab informatics platforms offer LIMS and ELNs that have been developed with AI as a core capability, maximizing the benefits of AI for scientists and overcoming issues relating to poor data collection and integration. Specifically, modern LIMS are data-agnostic: they collect, categorize, and manage all the data being generated, not just select subsets.

Similarly, modern ELNs allow scientists to deploy an extensive array of experiments, assays, and templates easily and precisely, facilitating real-time data sharing and collaboration among researchers and fostering improved partnerships across the scientific community.

For AI to become an effective and trusted tool for scientists at all stages of the discovery journey, it needs seamless access to all these sources of information, securely, in real-time, and in a way that is easy to process.

Recent advances in lab informatics platforms, such as integration with technologies like AWS Bedrock and NVIDIA BioNeMo, will allow researchers to unlock new ways to interact with scientific data and accelerate the pace of discovery. These integrations provide a foundation for developing specific drug development applications and give access to proprietary information, as well as open-source data, in a secure, scalable, and science-aware way.

Overcoming usability barriers in AI adoption

Science and research are hard, but the technology that scientists use to conduct research does not need to be. For AI to start making a real impact in drug discovery, it also needs to be accessible to scientists across all disciplines and levels, not just to expert data scientists.

The usability of any technology plays a crucial role in its adoption. Traditional user interfaces in lab informatics tools have been clunky and menu- or form-driven, and they differ between LIMS and ELNs. However, we are beginning to see AI, specifically generative AI (GenAI) and natural language processing (NLP), make lab informatics more user-friendly. By enabling a voice-first user experience, AI is helping scientists build workflows and experiments simply by telling the technology what they want.

This is an early step toward adopting virtual lab assistants, AI-powered agents9 capable of managing significant parts of the research workflow, making decisions, planning actions, and learning from experience. From a research perspective, this means a virtual assistant that can find and use various applications, techniques, or workflows to move the process of drug discovery along without direct human interaction.

New era of drug discovery

While AI has made notable strides in drug discovery, its full potential has yet to be realized. Challenges like limited access to real-time, high-quality data and difficulties with integrating AI into existing workflows have slowed progress.

Despite the current challenges, the future of AI in drug discovery is promising as the technology continues to evolve and becomes better equipped to handle the huge volumes and complexities of lab data. By breaking down the barriers that currently hold AI back, we can usher in a new era of drug discovery that is faster, more efficient, and more effective.

Over the next 12-18 months, we will see AI start to revolutionize the early stages of drug discovery, with a surge in “in silico” experimentation and exploratory work being done by autonomous AI to identify viable targets and candidates for further research. By automating tasks such as target identification, candidate exploration, and data analysis, AI can enable researchers to concentrate on the discovery process's more complex challenges.

As we move forward, it is clear that the future of drug discovery lies in a hybrid approach that combines the best of human expertise and AI capabilities. While AI might not yet have lived up to its initial hype in drug discovery, its potential remains enormous.

About the author

At the helm of Sapio Sciences, Kevin is driven by his dedication to accelerating the drug R&D process. His vision encompasses a unified platform designed to streamline the creation of large and small molecules, assess their efficacy, manage data capture at scale, and harness the power of data visualization. A well-established authority in the lab informatics field, Kevin’s professional footprint extends to his contributions to various papers in Nature Genetics, where he applied statistical and machine learning methodologies to the interpretation of genetic data. His educational background includes a Bachelor’s Degree in Information Systems from York College, PA.

References 

  1. New milestone in AI drug discovery: First generative AI drug begins phase II trials with patients. Insilico Medicine. 2024. insilico.com/blog/first_phase2
  2. Luo, R. et al. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics. 2022; 23(6), bbac409. doi.org/10.1093/bib/bbac409
  3. The Nobel Prize in Chemistry 2024. Nobel Prize. 2024. www.nobelprize.org/prizes/chemistry/2024/press-release/
  4. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 595, 583-589. doi.org/10.1038/s41586-021-03819-2
  5. Richardson, P. et al. Baricitinib as potential treatment for 2019-nCoV acute respiratory disease. The Lancet. 2020; 395(10223), e30-e31. doi.org/10.1016/S0140-6736(20)30304-4
  6. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024; 630, 493-500. doi.org/10.1038/s41586-024-07487-w
  7. Are corporate researchers ready to embrace GenAI? Elsevier. 2024. www.elsevier.com/industry/corporate-researchers-attitudes-toward-ai
  8. Artificial intelligence, machine learning and genomics. National Human Genome Research Institute. 2022. www.genome.gov/about-genomics/educational-resources/fact-sheets/artificial-intelligence-machine-learning-and-genomics
  9. Agentic AI: The next big breakthrough that’s transforming business and technology. Forbes. 2024. www.forbes.com/sites/bernardmarr/2024/09/06/agentic-ai-the-next-big-breakthrough-thats-transforming-business-and-technology/

 
