Can Machines Learn Chemistry?
By Bryan Kim
Cover Image: Stocker, S., Csányi, G., Reuter, K., & Margraf, J. T. (2020). Machine learning in Chemical Reaction Space. Nature Communications, 11(1). https://doi.org/10.1038/s41467-020-19267-x.
Although frequently employed as a tech buzzword, machine learning (ML) simply refers to using data to fit statistical trends that can then be used to make predictions about new data. A basic example of ML is linear regression: taking a set of data points and drawing the line that best fits their presumably linear trend. If you have ever done that, congratulations, you’ve done machine learning!
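For the curious, that line-drawing exercise can be sketched in a few lines of Python. This is only a minimal illustration of ordinary least squares: the function name `fit_line` and the concentration and absorbance numbers are invented for the example and do not come from any study discussed here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements (made up for illustration only).
concentrations = [1.0, 2.0, 3.0, 4.0, 5.0]
absorbances = [0.11, 0.19, 0.32, 0.39, 0.51]

slope, intercept = fit_line(concentrations, absorbances)

# "Learning" is the fit; "prediction" is using the line on a new input.
predicted = slope * 6.0 + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction at 6.0={predicted:.3f}")
```

The fit is the "training" step, and evaluating the line at a concentration that was never measured is the "prediction" step. More elaborate ML models follow the same two-step pattern with far more adjustable parameters.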
However, ML can be tricky, and sometimes ill-suited, to implement in the chemical and physical sciences. This clashes with the popular perception of ML as an esoteric, all-powerful technology, and it highlights how improper use of ML can hinder quality science.
One of the principal tenets of ML is that a good predictive model needs a lot of data. “A lot” is subjective, but Anthony Yu-Tang Wang, a chemistry researcher at the Technical University of Berlin, recommends a dataset “on the order of thousands or more.” That’s why ML thrives in data-rich domains, such as internet traffic data generated by billions of website users. Meanwhile, in the chemistry world, consider an ML model that aims to predict microscopy images just from a molecule’s chemical composition (without needing a microscope at all). To train this model, one would need to go into the lab, synthesize each sample by hand, and image it, and every sample requires substantial time and resources. Without an unprecedentedly large collaboration, it is practically impossible to produce the number of microscopy images needed to train even basic ML models. This is referred to as the small dataset problem: it is like attempting linear regression with only two data points.
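The two-data-point analogy can be made concrete with a small simulation. The sketch below is an illustration under invented assumptions, not a result from any cited paper: the true slope of 2.0, the noise level, the number of trials, and the helper names `fit_slope` and `mean_slope_error` are all hypothetical. It repeatedly fits a line to noisy data and reports how far, on average, the fitted slope lands from the true one.

```python
import random


def fit_slope(xs, ys):
    """Least-squares slope of the best-fit line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def mean_slope_error(n_points, trials=200, true_slope=2.0, noise=1.0, seed=0):
    """Average absolute error of the fitted slope over many noisy datasets."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        # Hypothetical "experiment": a linear trend plus Gaussian noise.
        xs = [rng.uniform(0.0, 10.0) for _ in range(n_points)]
        ys = [true_slope * x + rng.gauss(0.0, noise) for x in xs]
        total += abs(fit_slope(xs, ys) - true_slope)
    return total / trials


small_err = mean_slope_error(n_points=2)   # the "two data points" case
large_err = mean_slope_error(n_points=50)  # a modestly larger dataset
print(f"mean slope error with 2 points:  {small_err:.3f}")
print(f"mean slope error with 50 points: {large_err:.3f}")
```

With only two points, the line passes exactly through both, so any measurement noise is absorbed straight into the fit. With more points, the noise tends to average out and the fitted slope settles near the true trend.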
For smaller datasets, Wang recommends simpler ML models, such as regression (albeit with more complexity than a simple linear regression). A chemist whose project has a sufficiently large dataset would presumably like to move beyond simple regression to more complex, state-of-the-art ML models like neural networks. Yet even with abundant data, another problem arises from an additional tenet of ML: the “model interpretability vs. predictive power trade-off.” The more advanced the model, the harder it becomes to understand why it works. Beyond the inputs and outputs, the inner workings of the model become an unintelligible “black box” because of the complex, seemingly patternless mathematical operations it performs to make predictions. This poses a problem for the chemist by severely limiting scientific understanding of the meaningful relationships between input and output.
So, is ML even viable for chemistry? Sometimes. ML is most useful when the dataset is too large and complex for a human mind to find patterns in it. For example, a “good” ML chemistry project could be screening and down-selecting candidate materials from a large pool of known compounds. Researchers at Carnegie Mellon University and MIT screened over 12,000 inorganic solids to find candidate materials for suppressing dendrite formation in lithium batteries. But such a project merely suggests possible materials of interest; it does not replace entire scientific procedures, as ML is often portrayed as being able to do. ML is a powerful tool, but not every chemistry problem requires an ML solution. Just like any new technology, ML must be viewed with some amount of skepticism.
Although the realistic scope of ML in chemistry is smaller than one might hope, that doesn’t diminish the potential of this subfield. For instance, the Open Catalyst Project, an open competition organized by Facebook AI Research (FAIR) and Carnegie Mellon University’s Department of Chemical Engineering, attempts to build an ML model that best approximates electrocatalysis results from a more traditional computational chemistry method called density-functional theory (which is itself computationally expensive and time-consuming). The competition’s introductory paper suggests that current models “are not yet learning fundamental physical representations,” that is, the actual chemical principles that dictate the density-functional theory simulations. Since such ML models are black boxes that offer no rhyme or reason for how they arrive at their results (beyond mathematical optimization), it is questionable whether machines can ever truly “learn” chemistry.
Ahmad, Z., Xie, T., Maheshwari, C., Grossman, J. C., & Viswanathan, V. (2018). Machine learning enabled computational screening of inorganic solid electrolytes for suppression of dendrite formation in lithium metal anodes. ACS Central Science, 4(8), 996–1006. https://doi.org/10.1021/acscentsci.8b00229.
Chanussot, L., Das, A., Goyal, S., Lavril, T., Shuaibi, M., Riviere, M., Tran, K., Heras-Domingo, J., Ho, C., Hu, W., Palizhati, A., Sriram, A., Wood, B., Yoon, J., Parikh, D., Zitnick, L., & Ulissi, Z. (2021). The Open Catalyst 2020 (OC20) Dataset and Community Challenges. ACS Catalysis, 11(10), 6059–6072. https://doi.org/10.1021/acscatal.0c04525.
Stocker, S., Csányi, G., Reuter, K., & Margraf, J. T. (2020). Machine learning in Chemical Reaction Space. Nature Communications, 11(1). https://doi.org/10.1038/s41467-020-19267-x.
Wang, A. Y., Murdock, R. J., Kauwe, S. K., Oliynyk, A. O., Gurlo, A., Brgoch, J., Persson, K. A., & Sparks, T. D. (2020). Machine Learning for Materials Scientists: An Introductory Guide toward Best Practices. Chemistry of Materials, 32(12), 4954–4965. https://doi.org/10.1021/acs.chemmater.0c01907.