Adoption of AI/ML into Small Molecule Discovery

Last week the Journal of Chemical Information and Modeling (JCIM) published a reflection on the adoption of AI by medicinal chemists. The viewpoint, written by Alex Zhavoronkov of Insilico Medicine, asked the question “[how can we] advance artificial intelligence for drug discovery?” The article considers the cost and challenges of AI adoption and ultimately proposes a competition between human medicinal chemistry experts and AI-generated approaches that would assess each stage of the drug discovery pipeline. While I believe in the spirit of competition and Alex’s desire to further AI/ML in small molecule discovery, I suspect such a contest may be difficult to organize (especially with respect to intellectual property) and to fund. However, I am aligned with Alex on the importance of finding a way to rapidly integrate leading-edge AI/ML approaches into drug discovery. In this post I outline some of my thoughts on what it will take for AI/ML to be adopted more broadly by medicinal chemists and throughout the discovery process.


In the past couple of years the promise of AI-assisted drug discovery has been championed by many as a way to find better lead molecules. I have seen many chemists go from skeptical to curious, and a few luckier individuals with large data sets become downright excited. Certainly the increase in tangible published results has helped fuel this change of heart; however, in my experience education can be an even greater instrument for change. Medicinal chemists want to understand how the algorithms work, why one algorithm is chosen over another, what the data requirements are, and, most importantly, how confident they should be in the predictions. While I believe in the power of machine learning to augment the way we do discovery, it is important to be realistic about what we can achieve right now. Having discussions with chemists and biologists about what is feasible can go a long way toward widening the adoption of AI across the discovery pipeline. It is best to set expectations based on the data available to a project and to understand what additional data can be generated during the life of the project.
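As one toy illustration of the “how confident should I be?” conversation (a sketch of my own, not a prescription): an ensemble model such as a random forest over Morgan fingerprints returns both a prediction and a crude measure of how much the ensemble disagrees, which is often easier to discuss with a project team than a bare probability. The SMILES and labels below are placeholders.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles_list):
    """Morgan fingerprints (radius 2, 2048 bits) as a numpy feature matrix."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        fps.append(np.array(list(fp)))
    return np.array(fps)

# Placeholder training data: SMILES plus binary activity labels from a project.
train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]
train_labels = [0, 1, 1, 0]

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(featurize(train_smiles), train_labels)

# Per-tree votes give a rough sense of ensemble disagreement: a wide spread
# flags a prediction the team should treat with extra caution.
query = featurize(["CC(=O)Oc1ccccc1C(=O)O"])
votes = np.array([tree.predict(query)[0] for tree in model.estimators_])
print("predicted P(active):", model.predict_proba(query)[0, 1])
print("tree vote spread:   ", votes.std())
```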

Part of the argument for a competition made in the JCIM correspondence highlights the cost of building technology platforms and the tension between publication and the loss of intellectual property. It is certainly easy to sympathize with Alex on this point! Constructing an AI/ML platform is a complex engineering feat: it requires computational resources and is ultimately, in part, a research project in its own right. At Cognitive Dataworks we try to be very strategic about what we choose to build and how we use computational resources. Our goal is to build tools that chemists can interact with (in as close to real time as possible) and that are backed by automated machine learning. All of our methods are containerized and ready to deploy, so that parallel workflows can be launched in the cloud and return results quickly. It is my belief that speed and practicality will make AI/ML ubiquitous in discovery. It is important to remember that AI/ML methods are another tool we can use to make decisions: we are not trying to replace chemists, we are trying to augment the way they accomplish their goals.
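To make the “containerized, parallel, fast” idea concrete, here is a minimal sketch of the pattern (not our actual platform code): fan out several containerized model runs and gather the results as they finish. The image name, model identifiers, and input path are all hypothetical, and in a real cloud deployment the local `docker run` call would typically be replaced by a batch or workflow-service submission.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model identifiers handled by a single containerized entry point.
MODELS = ["random_forest", "gradient_boosting", "message_passing_nn"]

def run_model(model_id: str) -> str:
    """Launch one throwaway container for one model and capture its output."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "cdw/qsar-model",                   # hypothetical image name
         "--model", model_id,
         "--input", "/data/compounds.csv"],  # hypothetical mounted input
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Run the containers side by side and report as each one completes.
with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    for model_id, output in zip(MODELS, pool.map(run_model, MODELS)):
        print(f"{model_id} finished: {output[:80]!r}")
```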

As is true of most technology, AI/ML has the greatest impact when it is part of the decision-making process from the beginning. Launching a project that will utilize AI/ML requires a slightly different mindset than the typical lead discovery or lead optimization cycle, and it becomes a bit more challenging when AI/ML is asked to rescue a program. I am particularly interested in what will become possible once we begin to use AI/ML to choose chemical matter for the preliminary screen. Consider the common question of “how diverse is my library?” and the task of comparing one screening library to another. Which library is more diverse? What chemical diversity metric should I use? While it is true we have made great strides in producing algorithms for prediction as well as for the generation of compounds, I believe we need to focus on leveraging AI/ML to design novel chemical libraries for initial target screening. These new libraries should provide maximum information gain when used in downstream AI/ML methods. This will require us to rethink, a bit, how we approach the discovery process. Can we use related targets to fine-tune a custom screening deck? Can we accept screening compounds that would make awful leads in order to leverage that information in a predictive model? We certainly take similar gambles when we choose to screen libraries focused on CNS targets or kinases. Of course this will require more acceptance from medicinal chemists than before, but I believe it will be a stepping stone that reinforces the use of AI/ML.
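The answer to “which library is more diverse?” depends entirely on the metric chosen, which is exactly why the question is worth automating. As one deliberately simple illustration (not a recommendation of this particular metric), the sketch below compares two toy libraries by mean pairwise Tanimoto distance over Morgan fingerprints using RDKit; the SMILES lists stand in for real screening decks of thousands of compounds.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprints(smiles_list):
    """Morgan fingerprints (radius 2, 2048 bits) for a list of SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

def mean_pairwise_distance(fps):
    """Average pairwise Tanimoto distance; higher means a more diverse set."""
    dists = []
    for i in range(len(fps) - 1):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
        dists.extend(1.0 - s for s in sims)
    return sum(dists) / len(dists)

# Toy stand-ins for two screening libraries.
library_a = fingerprints(["CCO", "CCN", "CCC", "CCOC"])
library_b = fingerprints(["c1ccccc1", "CC(=O)O", "C1CCNCC1", "O=C(O)c1ccccc1"])

print("library A mean pairwise distance:", round(mean_pairwise_distance(library_a), 3))
print("library B mean pairwise distance:", round(mean_pairwise_distance(library_b), 3))
```

A scaffold-counting or cell-coverage metric would likely rank the same two libraries differently, which is the point: the metric is a design decision, not a given.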

Finally, how can we talk about modern AI/ML methods such as deep learning without discussing their data requirements? These methods have been shown to be data hungry. In fact, as I write this post, work by Cognitive Dataworks, Google Accelerated Science, and X-Chem Pharmaceuticals using DNA-encoded libraries (DEL) has just been published in J. Med. Chem., showing the power of using machine learning and DEL together to find lead molecules. However, DEL chemistry and the types of targets that can be screened with DEL methods have limitations, so it is important to find alternative approaches that require less data and fewer computational resources. Recently a group of researchers at MIT used a method called “chemprop” with a primary screening set of ~2,500 compounds to discover a novel broad-spectrum antibiotic. Coupling a method such as chemprop with AI/ML-designed focused libraries could result in a rapid and far less expensive way to find lead molecules.
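For readers who want to try this “small primary screen → model → score a larger library” loop, here is a hedged sketch using chemprop’s (version 1) command-line entry points, driven from Python. The CSV file names are placeholders, and the flags shown are the commonly documented chemprop v1 ones; check `chemprop_train --help` against your installed version before relying on them.

```python
import subprocess

# train.csv (placeholder): a "smiles" column plus a binary activity label,
# e.g. results from a ~2,500-compound primary screen.
subprocess.run(
    ["chemprop_train",
     "--data_path", "train.csv",
     "--dataset_type", "classification",
     "--save_dir", "chemprop_checkpoints"],
    check=True,
)

# Score a larger, untested library (placeholder file) with the trained checkpoints.
subprocess.run(
    ["chemprop_predict",
     "--test_path", "library.csv",
     "--checkpoint_dir", "chemprop_checkpoints",
     "--preds_path", "library_scored.csv"],
    check=True,
)
```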


While it is clear that adoption of AI/ML in small molecule discovery programs is not widespread at this time, I believe the industry is starting to embrace this new technology where it can. Certainly AI/ML is being used to repurpose drugs (look at the response to COVID-19!), to decode biology using phenotypic screens, and to personalize cancer treatment based on a patient’s genome. While I believe the ideas I have outlined in this post will be important in the spread of AI/ML through discovery, ultimately it is success with difficult targets that will translate into adoption.