
by Richard Lee, Director, Core Technology and Capabilities, ACD/Labs
Data is the driver for scientific discovery, and regardless of industry sector, R&D organizations generate increasingly complex and voluminous datasets. As traditional methods of analysis are being pushed to their limits, organizations are looking to artificial intelligence (AI) and machine learning (ML) to uncover insights, streamline processes, and accelerate innovation.
Companies in finance, retail, and manufacturing are already using AI/ML successfully to detect fraud, predict market trends, personalize customer interactions, and optimize operations and supply chains. The biggest difference between those sectors and chemical R&D is in the data: the numerical data they rely on is easier to aggregate and normalize than the complex data generated in chemical R&D. While scientists are eager to leverage AI/ML for efficiency gains and deeper insights, scattered, heterogeneous, and often inaccessible data continue to hinder progress. Without clean, standardized, and well-integrated datasets, even the most sophisticated AI algorithms will struggle to deliver meaningful results.
The Analytical Data Heterogeneity Problem
AI/ML models require high-quality, well-structured data, yet much of the analytical data used to make risk-mitigating and go/no-go decisions in R&D remains fragmented, siloed, and locked in proprietary vendor formats. In conversations with individual scientists, this can appear to be merely an aggregation problem for data science purposes. Scientists in the lab have built workaround processes for accessing their own data and that of close colleagues. The issues become apparent when they try to access data from past projects or from a team they don’t often work with, or when they try to look across projects.
An ACD/Labs survey found that analytical data are typically managed and stored in a variety of systems including Microsoft Office applications (80%), instrument software (70%), in-house software (34%), LIMS (24%), and ELNs (13%). These systems are often used in combination throughout the lifecycle of analytical data, during which data may be abstracted, leading to transcription errors and frustration. Layer on top of this the need for standardized, normalized data for AI/ML, and you are left with a daunting challenge.
Analytical chemistry techniques such as nuclear magnetic resonance (NMR), liquid chromatography (LC), and mass spectrometry (MS) generate diverse datasets that are often stored in incompatible formats. This fragmentation hinders the seamless integration of data across systems, limiting AI/ML’s potential. Furthermore, because multiple data types and chemical information are used in combination to reach decisions, assembling data in context is critical to extracting maximum value from analytical data.
Standardization (harmonizing data into consistent formats) and digitalized data assembly provide a solution by enabling interoperability, improving accessibility, and ensuring data can be effectively used in AI/ML workflows.
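As a minimal sketch of what harmonization can look like in practice, the Python below maps records from two instrument exports onto one shared schema. Everything named here is an assumption made for illustration: the `Peak` schema, the `harmonize` function, the `vendor_a` and `vendor_b` labels, and the field names are hypothetical and do not describe any real vendor format or ACD/Labs product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """Hypothetical shared schema: one peak, retention time in seconds."""
    sample_id: str
    technique: str
    retention_time_s: float
    area: float
    source_format: str  # kept so the origin of each record stays traceable


def harmonize(record: dict, source_format: str) -> Peak:
    """Map a raw export record onto the shared Peak schema.

    Both input layouts are invented examples, not real vendor formats.
    """
    if source_format == "vendor_a":
        # Assumed layout A: retention time in minutes, values stored as text.
        return Peak(
            sample_id=record["SampleID"],
            technique=record["Technique"].upper(),
            retention_time_s=float(record["RT_min"]) * 60.0,
            area=float(record["Area"]),
            source_format=source_format,
        )
    if source_format == "vendor_b":
        # Assumed layout B: already in seconds, but with different key names.
        return Peak(
            sample_id=record["sample"],
            technique=record["method"].upper(),
            retention_time_s=float(record["retention_time_s"]),
            area=float(record["peak_area"]),
            source_format=source_format,
        )
    raise ValueError(f"Unsupported source format: {source_format!r}")


raw_a = {"SampleID": "S-001", "Technique": "lc", "RT_min": "2.5", "Area": "1200"}
raw_b = {"sample": "S-001", "method": "LC",
         "retention_time_s": 151.0, "peak_area": 1185.0}

peaks = [harmonize(raw_a, "vendor_a"), harmonize(raw_b, "vendor_b")]
for peak in peaks:
    print(peak)
```

Once both records share the same field names and units, they can be compared or fed to a model directly, and the retained `source_format` field keeps each value traceable to its origin.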
The Role of Standardization in AI-Driven Innovation
Data standardization is a strategic necessity for organizations aiming to leverage AI effectively, and it offers added benefits for scientists. Standardized data facilitates:
- Data Accessibility: Standardized data can be collected into a single repository, such as a database or data lake, for both human and machine access. Even without a central repository, a common format makes data exchange and collaboration easier.
- Enhanced Data Integration: Datasets from multiple sources can be easily combined and compared for a more comprehensive understanding of chemical processes and experimental outcomes.
- Consistent, Quality Data: Comparison of data from different instruments and consistent data analysis become possible; and while standardization itself does not improve data quality, it permits the identification of inconsistencies that can be addressed.
- Improved AI/ML Performance: Machine-readable, structured data ensure AI models are trained on high-quality inputs, leading to more reliable predictions and insights.
- Regulatory Compliance: Standardized data support traceability and auditability, making it easier to meet industry regulations and quality control requirements.
However, achieving standardization is not straightforward. R&D organizations must navigate the complexities of balancing open and proprietary data formats. Open formats, often developed by standards bodies, promote long-term usability and accessibility, while proprietary formats can offer specialized functionalities tailored to advanced instrumentation. The best approach often involves adopting multi-format compatibility, preserving metadata integrity, and ensuring flexibility to accommodate both legacy and emerging data formats.
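One way to picture multi-format compatibility is a small registry of readers, one per format, that all return the same structure and carry the original metadata along. The sketch below is only an illustration under stated assumptions: the `.opentxt` and `.legacycsv` extensions, the two toy file layouts, and the `register_reader` and `load` helpers are hypothetical and are not real standards or library APIs.

```python
import os
from typing import Callable, Dict

Reader = Callable[[str], dict]
READERS: Dict[str, Reader] = {}  # maps a file extension to its reader


def register_reader(extension: str) -> Callable[[Reader], Reader]:
    """Register a reader so new or legacy formats can be added without
    changing the loading code."""
    def decorator(func: Reader) -> Reader:
        READERS[extension.lower()] = func
        return func
    return decorator


@register_reader(".opentxt")
def read_open_text(payload: str) -> dict:
    """Toy open layout (assumed): one 'key=value' pair per line."""
    fields = dict(line.split("=", 1) for line in payload.splitlines() if "=" in line)
    values = [float(v) for v in fields.pop("values").split(",")]
    return {"values": values, "metadata": fields}


@register_reader(".legacycsv")
def read_legacy_csv(payload: str) -> dict:
    """Toy legacy layout (assumed): a comment line, then one row of numbers."""
    comment, data = payload.splitlines()[:2]
    values = [float(v) for v in data.split(",")]
    return {"values": values, "metadata": {"comment": comment.lstrip("# ")}}


def load(filename: str, payload: str) -> dict:
    """Dispatch on the file extension and keep the original file name."""
    extension = os.path.splitext(filename)[1].lower()
    reader = READERS.get(extension)
    if reader is None:
        raise ValueError(f"No reader registered for {extension!r}")
    result = reader(payload)
    result["metadata"]["original_file"] = filename
    return result


open_run = load("run1.opentxt", "instrument=NMR-1\nvalues=1.0,2.5")
legacy_run = load("old_run.LEGACYCSV", "# acquired on LC-7\n0.1,0.2")
print(open_run)
print(legacy_run)
```

The design choice illustrated here is that support for an emerging format becomes one more registered reader, while downstream code only ever sees the common `values` and `metadata` structure.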
AI is the Answer... What’s the Question?
Enthusiasm for AI/ML is at an all-time high, particularly at the executive level. There is an urgent mandate to “do something with AI,” driven by competitive pressure and a desire to modernize R&D and manufacturing pipelines. However, this enthusiasm often outpaces the development of concrete, actionable strategies. While executives seek transformative outcomes, such as better asset utilization, reduced instrument downtime, and accelerated innovation, scientists at the bench are often left without clear direction or tools for practical implementation.
Organizations typically envision using AI/ML to optimize capital expenditure (e.g., determining which instruments to replace or redeploy), anticipate maintenance cycles to reduce downtime, and gain operational insights from aggregated laboratory data. These use cases rely heavily on harmonized and abstracted datasets that can feed enterprise-level analytics and forecasting models.
Meanwhile, scientists express a different set of needs. In polls conducted during ACD/Labs’ recent webinars, the most popular use cases among scientists were leveraging AI/ML to explore experimental design space more efficiently, build their own predictive models, create interpretive models, and reduce the time burden associated with preparing reports. Achieving these goals requires access to highly contextualized data complete with metadata, sample provenance, method parameters, and interpretation-ready results, not just raw numbers.
Organizations must take a more deliberate and use-case-driven approach to deciding what AI/ML should be applied to, and how it will be adopted. Especially when these technologies are new to an organization, success depends on identifying clear, narrowly defined problems where AI/ML can add value, and establishing realistic goals for implementation. Without such focus, efforts risk becoming an exploratory initiative with no practical outcome. Effective AI/ML integration requires not just data readiness, but also organizational readiness—a shared understanding of what problems are being solved, who benefits, and how success will be measured.
Overcoming Barriers to AI/ML Adoption
Once data is standardized and there is a clear use case for AI, another hurdle remains: the expertise required to develop and implement AI/ML models. Many R&D organizations lack the specialized data science skills needed to build AI-driven solutions. Bridging this skills gap requires a multi-faceted approach:
- Training and Upskilling: Scientists and researchers must build foundational data literacy to critically assess AI-generated results and ensure scientific rigor.
- Interdisciplinary Collaboration: AI/ML initiatives benefit from close collaboration between domain experts, data scientists, and IT professionals.
- Strategic Implementation: Not every scientific problem requires AI, and traditional statistical methods may sometimes be just as effective. Organizations should adopt a balanced approach, using AI where it provides clear benefits while maintaining human expertise in decision-making.
AI as an Augmentative Tool
While AI/ML offers exciting possibilities, these technologies do not replace scientific expertise. AI models rely on historical data and pattern recognition, but they lack the creativity, intuition, and contextual understanding that human scientists bring to discovery and innovation.
For example, AI applications have demonstrated the ability to explore new molecular spaces within hours. However, in real-world applications, the most promising results often emerge from brainstorming sessions among experienced scientists, who leverage AI as a tool rather than a decision-maker. The future of R&D will be shaped by those who understand how to harness these technologies effectively while maintaining the fundamental principles of research.
A New Data-Driven Future
The transformative potential of AI/ML in R&D lies not just in the models themselves but in the quality and accessibility of the data they rely on. Standardization plays a crucial role in unlocking the full value of analytical data, enabling scientists to accelerate discovery, enhance collaboration, and drive innovation.
As AI and ML continue to evolve, organizations that prioritize data standardization, interdisciplinary collaboration, and scientific rigor will be best positioned to realize the benefits of these technologies. The key to success is not just adopting AI, but integrating it into a robust data ecosystem that supports discovery, experimentation, and innovation.
By embracing a balanced approach—leveraging AI where beneficial while preserving human expertise—scientists can navigate the evolving R&D landscape with confidence. AI/ML is not just a technological trend; it is a transformative force that, when harnessed correctly, has the potential to redefine the future of scientific discovery.
About the author
Richard obtained his Ph.D. from McMaster University, Canada, where he focused on strategies for metabolite identification and metabolomics studies. He has been with ACD/Labs since 2012. During this time he has been responsible for the inception and development of MetaSense®, software that supports metabolite identification, and more recently he has been ushering in new technology development, laying the foundations for the next generation of ACD/Labs software.