Breaking Down Data Silos with Centralized Data Management

Most scientists follow a five-step data-management process over the life cycle of an experiment: plan, execute, analyze, process and report. Along the way, they interact with a variety of tools, including Word, Excel, a laboratory information management system (LIMS), document management, inventory, instrument software, data reduction and report generation systems. Some of these are highly valuable, highly automated solutions; many are not. Generally, for every automated process there is at least one manual process to copy, paste, transform and load data. Each time this is done there is a risk of a mistake, which can lead to inaccurate data reporting. To mitigate this risk, laboratories conduct a manual review of the process to ensure data quality.

A Data Labyrinth

Use of personal or shared Excel spreadsheets is a classic example of poor data management and a common use case across the research spectrum, from small biotech to large pharma. Similarly, using email as a prioritization or per-experiment communication medium detaches critical data from record-keeping systems, leaving important contextual information uncaptured.

Biotech companies typically operate with small teams that use file-based mechanisms to manage their critical reagents and entities. They tend to use spreadsheets for freezer management, reagent management and as a “database” of results (IC50s, hematology, etc.). This practice can cause significant data-management issues over time. Auditing of changes is a particular challenge in that context, as is record retention. For example, when an older data point is changed it can be difficult to ascertain whether the change was intentional and who made it. Without such auditing, it can be impossible to identify the source of, or reason for, a change.
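
To make the contrast concrete, the short Python sketch below shows the kind of attributed, append-only change history that a shared spreadsheet cannot provide. It is illustrative only; the names used here (AuditedStore, ChangeRecord, the example compound ID) are hypothetical and not tied to any particular product.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ChangeRecord:
    """One audit-trail entry: who changed what, when, and why."""
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    changed_by: str
    reason: str
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class AuditedStore:
    """Hypothetical record store that refuses unattributed edits."""

    def __init__(self):
        self._records = {}       # record_id -> dict of field values
        self._audit_trail = []   # append-only list of ChangeRecord

    def update(self, record_id, field_name, new_value, changed_by, reason):
        record = self._records.setdefault(record_id, {})
        old_value = record.get(field_name)
        record[field_name] = new_value
        self._audit_trail.append(
            ChangeRecord(record_id, field_name, old_value, new_value, changed_by, reason))

    def history(self, record_id):
        return [c for c in self._audit_trail if c.record_id == record_id]

store = AuditedStore()
store.update("CMPD-0042", "IC50_nM", 18.5, changed_by="asmith", reason="initial assay result")
store.update("CMPD-0042", "IC50_nM", 12.1, changed_by="bjones", reason="re-run after QC flag")
for change in store.history("CMPD-0042"):
    print(change.changed_by, change.field_name, change.old_value, "->", change.new_value)

Every value change carries its author, timestamp and rationale, which is exactly the information that goes missing when a spreadsheet cell is simply overwritten.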

Another typical scenario is companies working with external partners, such as contract research organizations (CROs), to synthesize or produce compounds, guide sequences, proteins or cell lines. Outbound requests are typically shared as files attached to emails sent to the CRO. Once the development work is completed, a data package and physical materials are sent back to the sponsor. The data then need to be quality checked and imported into internal systems, and the path from first data received to clean data can take more than two weeks. This impedes the discovery process in several ways: quality checks are performed only after the project is complete, data are received in bulk rather than on an item-by-item basis, and collaboration is limited to troubleshooting over email and video or conference calls with partners who may be up to 12 time zones away.
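
One way to shorten that cycle is to quality check each delivered item as it arrives rather than waiting for the full package. The Python sketch below illustrates the idea under assumed acceptance rules; the required fields and the 95% purity threshold are invented for the example, not taken from any specific agreement.

import csv
import io

# Hypothetical minimum fields a returned compound record must carry
# before it is accepted into internal systems.
REQUIRED_FIELDS = ("compound_id", "batch_id", "purity_pct", "quantity_mg")

def check_item(row):
    """Return a list of QC problems for one delivered record (empty list = clean)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not row.get(name, "").strip():
            problems.append(f"missing {name}")
    try:
        if row.get("purity_pct") and float(row["purity_pct"]) < 95.0:
            problems.append("purity below 95% acceptance threshold")
    except ValueError:
        problems.append("purity_pct is not numeric")
    return problems

# Example delivery; in practice each row would be checked as it is received,
# so problems surface while the CRO is still working on that item.
delivery = io.StringIO(
    "compound_id,batch_id,purity_pct,quantity_mg\n"
    "CMPD-0101,B1,98.2,25\n"
    "CMPD-0102,B1,,30\n")
for row in csv.DictReader(delivery):
    issues = check_item(row)
    print(row["compound_id"], "ACCEPT" if not issues else "QUERY CRO: " + "; ".join(issues))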

These are not the only significant data issues faced by lab scientists. At the end of an experiment or project, all data is typically printed and pasted into a paper notebook. Secondary data (instrument output, Excel spreadsheets, etc.) are stored in physical folders and referenced in the paper notebook. These paper records are managed under a long-term data retention policy. The electronic files are often moved to a file share, where they are stored by date and name to allow for easy retrieval and backed up routinely by IT teams.

Experimental results are shared with colleagues via reports and project meetings. Scientists validate and reference these original experiments in their own work, repeating the processes described above. This cycle creates a network of experiments that reference other experiments, each of which has its own data footprint.

In the end, data ends up in three places: final “positive” results move forward into project reports, paper records are placed in long-term physical storage, and electronic data is warehoused on shared drives with no direct connection to the original experimental record. These challenges make project management a difficult and time-consuming task.

Asserting Method over Madness

Most enterprise data solutions are organized either around instrument file types (a scientific data management system, or SDMS) or around samples (a laboratory information management system, or LIMS). While both can help provide a complete laboratory solution, each has significant drawbacks. An SDMS provides an automated means of intelligent data backup: instrument data are stored as files in a structure that makes it possible to analyze the output at any time, and most systems can capture metadata that associates each file with the appropriate instrument, scientist, project, and so on. But while these systems automate the capture of instrument files, they are limited in the amount of context and insight they provide.
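
The core SDMS idea, capturing a raw file together with the metadata needed to find it again, can be sketched in a few lines of Python. This is a simplified illustration only; the function name and the JSON “sidecar” convention are assumptions for the example, not the behavior of any particular SDMS product.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def archive_instrument_file(raw_path, instrument, scientist, project, archive_dir):
    """Copy an instrument output file into an archive and write a metadata sidecar.

    The sidecar (plain JSON here) is what later allows the raw file to be
    associated with the right instrument, scientist and project."""
    raw_path, archive_dir = Path(raw_path), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    data = raw_path.read_bytes()
    stored = archive_dir / raw_path.name
    stored.write_bytes(data)

    metadata = {
        "original_name": raw_path.name,
        "sha256": hashlib.sha256(data).hexdigest(),   # integrity check for the archived copy
        "instrument": instrument,
        "scientist": scientist,
        "project": project,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = archive_dir / (raw_path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return stored, sidecar

What the sidecar cannot capture is the experimental context around the file, which is exactly the limitation described above.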

LIMS and assay-management tools generally organize data either by molecule or by sample. While this connects the data with the sample, the details of how the samples were prepared, and exactly what was used to prepare them, are lost.

The missing element in both instances is context sensitivity. Bringing the entire experimental process into a single, integrated scientific workflow provides several significant advantages:

●        Experimental work can be prioritized by project and team leaders. This reduces the need for meetings, emails and other ad hoc ways of communicating status. Tracking who prioritized which items and when they did it allows for accurate reporting of key project metrics.

●        Integrated workflows enable real-time feedback to the user. On-screen notifications provide the scientist with information about the availability of equipment, sample preparation and instrument data. This streamlining greatly increases individual productivity. When these workflows are fully integrated, stepwise execution of a process can also be enforced without separate documentation or outside monitoring. For example, results cannot be entered for a sample until the sample preparation is properly documented and completed.

●        Search results can also be informed by the context of experiments. Because review and approval of results is a built-in part of the experimental process, it is trivial to screen out results that have not yet been finalized, while still letting the user include those data when they are of interest. In loosely integrated systems, even these simple queries can require complex reports that make it difficult to toggle between data sets. A minimal sketch of this behavior, together with the stepwise enforcement described above, follows this list.
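
The Python sketch below illustrates both points under assumed record states: results cannot be entered until sample preparation is documented, and a default search screens out results that have not been approved. The class and function names are hypothetical.

class WorkflowError(Exception):
    pass

class Experiment:
    """Hypothetical experiment record whose steps must be completed in order."""

    def __init__(self, experiment_id):
        self.experiment_id = experiment_id
        self.prep_documented = False
        self.results = []
        self.approved = False

    def document_prep(self, details):
        self.prep_details = details
        self.prep_documented = True

    def enter_result(self, value):
        # Stepwise enforcement: no results until the preparation is documented.
        if not self.prep_documented:
            raise WorkflowError("document sample preparation before entering results")
        self.results.append(value)

    def approve(self):
        self.approved = True

def search_results(experiments, include_unapproved=False):
    """Context-aware search: unapproved results are screened out by default."""
    return [(e.experiment_id, e.results)
            for e in experiments
            if e.results and (e.approved or include_unapproved)]

exp = Experiment("EXP-001")
exp.document_prep("buffer lot L123, diluted 1:10")
exp.enter_result(42.0)
print(search_results([exp]))                           # [] until the experiment is approved
print(search_results([exp], include_unapproved=True))  # visible only on explicit request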

There are many other advantages, such as the ability to assign tasks to a specific individual or to an entire team to draw from. Re-assigning tasks when a particular analyst is unavailable becomes a simple bulk update that is audited for reporting purposes, and updates to any process, data or priority are likewise tracked and audited.

Bringing All the Data Together

Enterprise data management should be more than file management or results management. The goal is to relate all research information in a single view to improve collaboration, productivity and decision support. By relating all data, from sample identification, location and container management through to results and experiment management, a genealogy is generated from conception to filing. This genealogy allows researchers to “follow the data” from the first experiment and understand how and why decisions were made. No data is lost and, more importantly, the context of the data is maintained.
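
As a rough illustration of what “follow the data” can mean in practice, the sketch below records parent-child links between samples, experiments and results and walks them backwards from any item. The graph structure and identifiers are assumptions for the example, not a description of any specific system.

from collections import defaultdict

class Genealogy:
    """Hypothetical lineage graph linking samples, experiments and results."""

    def __init__(self):
        self._parents = defaultdict(list)  # child id -> list of parent ids

    def link(self, parent_id, child_id):
        self._parents[child_id].append(parent_id)

    def trace(self, item_id):
        """Follow the data backwards from an item to everything it was derived from."""
        lineage, queue = [], [item_id]
        while queue:
            current = queue.pop()
            for parent in self._parents.get(current, []):
                lineage.append((current, parent))
                queue.append(parent)
        return lineage

g = Genealogy()
g.link("SAMPLE-7", "EXP-001")        # the experiment consumed the sample
g.link("EXP-001", "RESULT-IC50-9")   # the result came from the experiment
for child, parent in g.trace("RESULT-IC50-9"):
    print(f"{child} <- {parent}")    # RESULT-IC50-9 <- EXP-001, then EXP-001 <- SAMPLE-7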

With the changing paradigm of drug development, it is critical to ensure that collaborators’ data are captured, quality checked and made available as part of the complete data set. This allows for better scientific decision making and increased research productivity.

At the same time, the permissions model should allow collaborator data to be added to the system without compromising the security of internally-generated data or results submitted by another collaborator. This allows for high-quality data at the time of creation regardless of how or where the data are generated.
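
A minimal sketch of such a permissions model is shown below: each record carries the organization that generated it plus an explicit share list, and a reader sees only its own data and what has been shared with it. The record fields and organization names are invented for illustration.

from dataclasses import dataclass

@dataclass
class Record:
    record_id: str
    owner_org: str            # organization that generated the data
    shared_with: tuple = ()   # organizations explicitly granted read access

def visible_records(records, requesting_org):
    """Return only the records the requesting organization may read:
    its own data plus anything explicitly shared with it."""
    return [r for r in records
            if r.owner_org == requesting_org or requesting_org in r.shared_with]

records = [
    Record("RES-1", owner_org="sponsor"),
    Record("RES-2", owner_org="cro_a", shared_with=("sponsor",)),
    Record("RES-3", owner_org="cro_b", shared_with=("sponsor",)),
]
# Each collaborator can contribute data but sees only its own records;
# the sponsor sees everything shared with it.
print([r.record_id for r in visible_records(records, "cro_a")])    # ['RES-2']
print([r.record_id for r in visible_records(records, "sponsor")])  # ['RES-1', 'RES-2', 'RES-3']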

A well-designed process guides the scientist through a prioritized, structured data-collection workflow that increases productivity and gathers results into a structured repository intuitively, without specialized training. The result is a fully audited, curated data set that any user can query to generate reports against an ad hoc set of criteria. Streamlined data collection leads to streamlined, accurate report generation, which increases team productivity as well as the reproducibility of results.

Unification of these processes into an auditable, permissions-based electronic informatics system significantly reduces human error, increases visibility and data retention and provides a robust auditing function.
