There exist two main problems within the current approaches outlined above that remain unaddressed.
1. PWMs produce many false positives
Specific types of PWMs are known to overestimate the importance of the different bases in a motif, likely due to an unequal number of sub-sequences (known as k-mers) in the training data. This leads to a much higher false-positive rate.
Additionally, PWMs are an archaic method of finding motifs, having been originally introduced over 30 years ago. New methods that take advantage of increasing compute power, as well as the abundance of genomic data, are invaluable.
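To make the false-positive problem concrete, the following is a minimal sketch of PWM scanning: a log-odds matrix is slid along a sequence and every window scoring above a threshold is reported as a hit. The matrix values and threshold here are illustrative, not taken from any real motif database; with a permissive threshold, many spurious windows pass.

```python
# Hypothetical 4-position motif PWM: log-odds scores per base per position
# (illustrative values only, not from a real motif database).
pwm = {
    "A": [ 1.2, -0.5, -1.0,  0.8],
    "C": [-0.7,  1.1, -0.4, -1.2],
    "G": [-1.0, -0.9,  1.3, -0.3],
    "T": [-0.2,  0.4, -0.8,  0.9],
}

def score_kmer(kmer, pwm):
    """Sum the per-position log-odds scores for a k-mer."""
    return sum(pwm[base][i] for i, base in enumerate(kmer))

def scan(sequence, pwm, threshold):
    """Slide the PWM along a sequence and report every window whose
    score meets the threshold. Lowering the threshold inflates the
    number of hits -- the false-positive problem described above."""
    k = len(next(iter(pwm.values())))
    return [(i, sequence[i:i + k])
            for i in range(len(sequence) - k + 1)
            if score_kmer(sequence[i:i + k], pwm) >= threshold]

hits = scan("ACGTACGTTAGC", pwm, threshold=2.0)
```

Because each position is scored independently, the PWM cannot model dependencies between bases, which is one reason unequal k-mer counts in the training data distort the scores.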
2. Language modelling of molecules using RNNs misses key chemical information
SMILES strings, the type of chemical representation used to train most language models, miss key information such as the presence of enantiomers. They also do not natively represent molecules and their bonds, instead using characters such as brackets and ring-closure digits to signify them. In the past year, transformer architectures have emerged as remarkably strong at learning from text as well, which shows the potential of models such as BERT or GPT-2 over the RNN-LSTM models of the past.
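The enantiomer problem above can be illustrated directly: in SMILES, mirror-image molecules are distinguished only by the `@`/`@@` chirality markers, so any pipeline that drops or fails to learn these characters conflates distinct compounds. The sketch below uses L- and D-alanine as an example and a deliberately naive character-level tokenizer of the kind often fed to RNN language models; it is not a full SMILES parser.

```python
import re

def strip_stereo(smiles):
    """Remove chirality markers (@ and @@) -- the information that
    distinguishes enantiomers in SMILES. Illustrative only."""
    return smiles.replace("@", "")

# L-alanine and D-alanine are enantiomers: same atoms and bonds,
# differing only in the @@ vs @ chirality annotation.
l_alanine = "N[C@@H](C)C(=O)O"
d_alanine = "N[C@H](C)C(=O)O"

# Without the stereo markers, the two mirror-image molecules
# collapse into an identical string -- the model cannot tell them apart.
same_when_stripped = strip_stereo(l_alanine) == strip_stereo(d_alanine)

# Bonds and rings are also encoded indirectly: benzene's aromatic ring
# is closed by the matching digit "1", not by any explicit bond token.
benzene = "c1ccccc1"
tokens = re.findall(r"\[[^\]]+\]|Br|Cl|.", benzene)  # naive tokenization
```

Note how the tokenizer must already special-case bracketed atoms and two-letter elements like `Br` and `Cl`; a character-level RNN sees ring-closure digits and parentheses as ordinary symbols, with no built-in notion of the molecular graph they encode.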