In Search Of

Precise Drug Candidates

"But biology and computer science – life and computation – are related.
I am confident that at their interface great discoveries await those who seek them."
- Leonard Adleman, computer scientist

The Problem with Genomics and Drug Discovery Today

Nearly 20 years ago, we sequenced the first human genome. After $3 billion in investment and 13 years of research, we had finally cracked the code of life: all 3 billion A's, C's, T's and G's that encode a human being. It promised a future of personalized medicine.


Unfortunately, we're still far from achieving personalized medicine and curing genetic disease. And while the cost to sequence a human genome has fallen from eight figures to only $47, our ability to interpret the genome has not kept pace. Why? Because biology is fundamentally hard to read. Humans cannot conceptualize the hundreds or possibly thousands of mutations that result in complex diseases such as cancer.


Inspired by this and the research being done by companies like Deep Genomics, I decided to try to tackle this problem exactly one year ago. With the rise of more powerful computing and a rapid increase in genomic data, machine learning can be used to help increase our understanding of the human genome.


So much of the drug discovery process and genomics research involves brute-force, repetitive experiments and recognizing patterns in data, a task machine learning models are fundamentally designed for. With better models for understanding our genome, we could cut the time it takes to discover targets for rare disease by an estimated 50-70%.


Out of this came Project De Novo.

The Status Quo

To find the motif sequence (binding region) for a transcription factor protein, computational biologists typically employ the following:


Position Weight Matrices

PWMs are commonly used to represent patterns in DNA sequences. They show the most common conserved bases across a genome. A PWM has one row for each symbol of the alphabet: 4 rows for nucleotides in DNA sequences or 20 rows for amino acids in protein sequences. It also has one column for each position in the pattern.
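To make the row-and-column layout concrete, here is a minimal sketch of building a log-odds PWM from a handful of aligned binding sites and scanning a sequence with it. The toy sites, the pseudocount of 1, the uniform background of 0.25, and the function names `build_pwm`, `score` and `best_hit` are all illustrative assumptions, not part of Project De Novo.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM: one row per base, one column per motif position."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = {base: [0.0] * length for base in ALPHABET}
    for pos in range(length):
        column = [site[pos] for site in sites]
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            pwm[base][pos] = math.log2(freq / background)
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(pwm[base][pos] for pos, base in enumerate(window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    length = len(pwm["A"])
    hits = [(score(pwm, sequence[i:i + length]), i)
            for i in range(len(sequence) - length + 1)]
    return max(hits)

# Toy aligned sites, made up for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_pwm(sites)
best_score, best_pos = best_hit(pwm, "GGCGTATAATGCC")
```

Any window scoring above a chosen threshold is reported as a candidate binding site, so the choice of threshold directly controls how many hits are returned.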


Another important task in drug discovery is creating generative models for designing new molecules. Recently, natural language processing has emerged as a popular approach to this problem, using SMILES strings.


Recurrent Neural Networks and LSTM cells

Using RNNs built from LSTM cells, which mitigate the vanishing gradient problem, a model can learn to generate molecular drug candidates as SMILES (Simplified Molecular Input Line Entry Specification) strings, much as language models learn to generate text.
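As a rough illustration of what "treating molecules like text" means, the sketch below turns SMILES strings into token ids and next-token training pairs, which is the kind of input a recurrent model would consume. The tokenizer pattern, the `<start>` and `<end>` markers, and the helper names are assumptions made for this example; the network itself is left out.

```python
import re

# Keep two-letter halogens and bracketed atoms together; everything else is one character.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|.")

def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)

def build_vocab(smiles_list):
    """Map every token seen in the data, plus start/end markers, to an integer id."""
    tokens = {tok for smiles in smiles_list for tok in tokenize(smiles)}
    vocab = ["<start>", "<end>"] + sorted(tokens)
    return {tok: idx for idx, tok in enumerate(vocab)}

def next_token_pairs(smiles, vocab):
    """Return (inputs, targets): the model sees inputs[t] and must predict targets[t]."""
    ids = [vocab["<start>"]] + [vocab[t] for t in tokenize(smiles)] + [vocab["<end>"]]
    return ids[:-1], ids[1:]

# Ethanol, benzene and acetyl chloride as a tiny illustrative dataset.
dataset = ["CCO", "c1ccccc1", "CC(=O)Cl"]
vocab = build_vocab(dataset)
inputs, targets = next_token_pairs("CC(=O)Cl", vocab)
```

At generation time the trained model would be fed `<start>` and sampled one token at a time until it emits `<end>`.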

Learn More About Project De Novo

The Problem

Two main problems with the approaches outlined above remain unaddressed.


1. PWMs have many false positives

Specific types of PWMs are known to overestimate the importance of the different bases in a motif, likely due to an unequal number of sub-sequences (known as k-mers). This leads to a much higher number of false positives.

Additionally, PWMs are a dated method of finding motifs, having been introduced over 30 years ago. New methods that take advantage of increasing compute power and the abundance of genomic data are invaluable.


2. Language modelling of molecules using RNNs misses key chemical information

SMILES strings, the chemical representation used to train most language models, can miss key information such as stereochemistry: unless the optional chirality markers are written out, two enantiomers share the same string. They also do not represent molecules and their bonds natively as a graph, using characters such as brackets to signify structure instead. In the past year, transformer architectures have emerged as incredibly strong at learning from text as well, which shows the potential of using models such as BERT or GPT-2 over the RNN-LSTM models of the past.
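A small example of the stereochemistry issue: SMILES can mark a chiral centre with `@` or `@@`, but once those markers are dropped the two mirror-image forms of a molecule become the same string. The `strip_stereo` helper below is a crude string-level stand-in written for this illustration, not a function from any cheminformatics library.

```python
def strip_stereo(smiles):
    """Drop tetrahedral chirality markers ('@' and '@@') from a SMILES string."""
    return smiles.replace("@", "")

# The two enantiomers of alanine, written with explicit chirality markers.
alanine_one = "C[C@H](N)C(=O)O"
alanine_two = "C[C@@H](N)C(=O)O"

distinct_with_markers = alanine_one != alanine_two
identical_without = strip_stereo(alanine_one) == strip_stereo(alanine_two)
```

A model trained only on strings without these markers has no way to tell the two enantiomers apart.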

The Solution

Project De Novo uses convolutional neural networks to understand transcription factor binding patterns in A549 lung epithelial cells. Trained on one-hot encoded ChIP-seq data, which gives signals for binding strength across an entire genome, the model is able to learn motifs and to flag potentially disease-causing regulatory variants that can negatively impact gene expression.
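The sketch below shows the two ingredients named here in their simplest form: one-hot encoding a DNA sequence and sliding a single convolutional filter over it. The filter is hand-written to respond to "TATA" and the bias of -3 is arbitrary; in a trained CNN both would be learned from the ChIP-seq data. Nothing here is taken from the project's actual model.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode DNA as rows of 4 values; unknown bases (such as N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]

def conv1d(encoded, kernel, bias=0.0):
    """Slide one filter (width x 4) along the sequence and apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        outputs.append(max(0.0, total))
    return outputs

# Hypothetical filter: the one-hot pattern of "TATA" used directly as weights.
kernel = one_hot("TATA")
activations = conv1d(one_hot("GGCTATAAGC"), kernel, bias=-3.0)

# Global max pooling: keep the strongest response and where it occurred.
peak = max(activations)
peak_position = activations.index(peak)
```

Each learned filter plays a role similar to a PWM, but a network stacks many of them and learns how their responses combine.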

The model achieves an accuracy of 90.5%, surpassing traditional approaches based on Position Weight Matrices (PWMs) by nearly 20%. Last October, I gave a talk on using CNNs for learning TF-binding patterns at the Re-Work Deep Learning Summit in Montreal. Check it out here.


Project De Novo also uses graph convolutional networks, reinforcement learning, and transformers to learn and generate molecules, and is working on implementations of two different approaches. The first uses a graph convolutional network for goal-directed graph generation through reinforcement learning, based on research from the Pande Lab. The second uses transformers, an emerging and widely successful language-model architecture, to generate SMILES strings.
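For readers unfamiliar with graph convolutions, here is a minimal sketch of a single layer applied to a molecule stored as a graph. It assumes mean aggregation over each atom and its bonded neighbours followed by a linear map and ReLU, which is one common variant and not necessarily the one used in the cited research. The policy network and reinforcement learning loop are left out, and the identity weights stand in for learned parameters.

```python
def graph_conv(adjacency, features, weights):
    """One graph convolution: average each atom with its neighbours, then linear map + ReLU."""
    n_atoms = len(adjacency)
    n_in = len(features[0])
    n_out = len(weights[0])
    hidden = []
    for i in range(n_atoms):
        # Include the atom itself (self-loop) alongside its bonded neighbours.
        neighbours = [j for j in range(n_atoms) if adjacency[i][j] or j == i]
        pooled = [sum(features[j][k] for j in neighbours) / len(neighbours)
                  for k in range(n_in)]
        hidden.append([max(0.0, sum(pooled[k] * weights[k][m] for k in range(n_in)))
                       for m in range(n_out)])
    return hidden

# Ethanol (SMILES "CCO") as a graph: atoms 0=C, 1=C, 2=O; bonds 0-1 and 1-2.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
features = [[1.0, 0.0],   # one-hot element type: [is_carbon, is_oxygen]
            [1.0, 0.0],
            [0.0, 1.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0]]    # identity matrix as a placeholder for learned weights
hidden = graph_conv(adjacency, features, weights)
```

After one layer, the middle carbon's features already reflect that it is bonded to an oxygen, information a character-level model has to infer from string syntax.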


Scientific Open-Source Library

To help researchers accelerate the process of discovering motifs and generating molecules for a specific dataset, our deep learning library offers a straight-out-of-the-box approach for writing, executing, training, and deploying valuable learning algorithms. Just fork the repo, load your dataset, and you're good to go.


This library will include, but is not limited to:

  • Protein Sequence Motif Discovery using CNN
  • Goal-Directed Molecular Graph Generation using a Policy Network
  • Generating SMILES strings using NLP

Currently the project is still under construction. Contact us for more info and updates on the project!

GitHub Library