
Identifying Patients for Clinical Trial

Drug companies have a major incentive to reduce R&D spending, both to free up funds for additional ventures and to be able to offer lower prices for their products.

Machine learning and AI can help introduce efficiencies to the clinical trial process in two ways. First, they can more quickly and precisely identify patients who would be a good fit for a particular trial, either through advanced analysis of medical records using natural language processing (NLP) or by exploring geographically and symptomatically distinct patient groups at scale. Second, these techniques can examine the interactions between potential trial members’ specific biomarkers and current medications to predict the drug’s interactions and side effects, helping to avoid potential complications.

Identifying Compounds Faster

The estimated cost of drug development for U.S. biopharmaceutical companies is nearly $1 billion per drug. Pharmaceutical companies can leverage machine learning techniques not only to comb through literature and journal publications using NLP but also to pre-screen for the most promising compounds, so that researchers can prioritize their time.

GxP Compliance for Pharmaceuticals: A Merck and Dataiku collaborative effort 

Dataiku has been qualified by Merck's internal audit and review process as supporting GxP compliance, enabling Merck’s use of Dataiku’s collaborative data science platform in production with regulated data.

The pharmaceutical industry’s highly regulated environment can be tremendously demanding for organizations navigating the space. One of the industry’s biggest ongoing concerns is the safety of its products and the integrity of the data used to make product-related safety decisions. GxP, a collection of regulations for good practices in pharmaceuticals, aims to address this concern.


In October 2020, Dataiku was qualified by Merck’s internal audit and review process as supporting GxP compliance, which has enabled Merck’s use of Dataiku’s collaborative data science platform in production with regulated data. 

Pfizer: Leveraging analytics & AI to scale initiatives and achieve results
The regulatory environment has become more challenging, demanding far more extensive testing before drugs can go to market.

Dataiku supports agility in organizations’ data efforts via collaborative, elastic, and responsible AI, all at enterprise scale. It helps streamline the pharmaceutical R&D process and enables robust NLP for clinical trial patient selection and compound identification. The platform offers a central, collaborative environment for the major steps, including:


Pre-processing data: Cleaning usually involves breaking the text into words or chunks of words (tokenization), removing words that carry little meaning on their own (stop words such as a, the, an), making the data more uniform (for example, changing all words to lowercase), and grouping words into pre-defined categories such as the names of persons (entity extraction). Done manually, this process can take an inordinate amount of time, but Dataiku makes data cleaning and prep easy.


Vectorization (or “embedding”): After pre-processing, the non-numerical data is transformed into numerical data, since it’s much faster for a computer to operate on vectors. Dataiku leverages popular neural embedding methods such as word2vec to make this a breeze.


Testing: Once a baseline project has been created, Dataiku enables users to test its prediction accuracy using cross-validation, a model validation technique that repeatedly divides the data into training and testing subsets. In each round, the model is built using the training subset and then tested on the held-out testing subset, and the results are combined to see whether the model generalizes.
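
The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not Dataiku's API or Pfizer's pipeline: the function names, the stop-word list, and the toy notes and labels are all hypothetical. It swaps word2vec for a much simpler bag-of-words count vector, uses a nearest-centroid classifier as a stand-in model, and leaves out entity extraction.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "and", "with", "for", "in", "on"}

def preprocess(text):
    """Lowercase the text, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(docs):
    """Sorted list of every distinct token seen in the given tokenized docs."""
    return sorted({tok for doc in docs for tok in doc})

def vectorize(tokens, vocab):
    """Bag-of-words count vector (a simple stand-in for word2vec embeddings)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def k_fold_indices(n, k):
    """Yield (train, test) index lists; item i goes to test fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def predict(vec, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(sorted(centroids), key=dist)

def cross_validate(texts, labels, k=3):
    """Average accuracy of the nearest-centroid model over k folds."""
    docs = [preprocess(t) for t in texts]
    correct = 0
    for train, test in k_fold_indices(len(docs), k):
        # The vocabulary is built from training data only, to avoid leakage.
        vocab = build_vocab([docs[i] for i in train])
        centroids = {
            label: centroid([vectorize(docs[i], vocab)
                             for i in train if labels[i] == label])
            for label in {labels[i] for i in train}
        }
        correct += sum(predict(vectorize(docs[i], vocab), centroids) == labels[i]
                       for i in test)
    return correct / len(docs)

# Made-up notes and labels, purely for illustration.
NOTES = [
    "Adult patient with type 2 diabetes and stable kidney function",
    "Patient with severe kidney failure on dialysis",
    "Type 2 diabetes, stable, no kidney complications",
    "Kidney failure and dialysis three times a week",
    "Stable adult with diabetes",
    "Severe failure of the kidney, dialysis ongoing",
]
LABELS = ["eligible", "ineligible"] * 3

if __name__ == "__main__":
    print("cross-validated accuracy:", cross_validate(NOTES, LABELS, k=3))
```

Building the vocabulary inside each fold keeps information from the held-out notes out of the training step, which is the point of validating on unseen data.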

In the read more link below, Mr. Chris Kakkanatt, Data Science Senior Director at Pfizer, speaks about the techniques and elements employed to create a culture of collaboration and co-creation around analytics and the steps the company has taken to achieve a human-centric, AI-driven transformation.