UBC Undergraduate Research

Comparing SMILES and SELFIES Embeddings and ECFP4 Fingerprints for Molecular Similarity Search

Karmarkar, Amey

Abstract

We compare SMILES and SELFIES embeddings generated by two transformer models (ChemBERTa and SELFormer) against 1024-bit ECFP4 fingerprints for molecular similarity search. For each of thirty chemically diverse query molecules, we retrieve the top-k nearest neighbours for k ∈ {5, 8, 10, 15, 20, 25, 35}, ranking by cosine similarity for the embedding representations and by the Tanimoto coefficient for the fingerprints. Retrieved hits are evaluated on chemical similarity (Hamming distance) and structural similarity (graph edit distance, GED). Statistical significance is assessed with one-way ANOVA followed by Tukey’s HSD test. Our results show that ECFP4 with Tanimoto ranking yields both the lowest mean Hamming distance (mean = 12.5 at k = 35) and the lowest mean GED (mean = 18.9). SELFIES embeddings consistently outperform SMILES embeddings but remain inferior to fingerprints.
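The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the author's pipeline: the abstract does not name the tooling, so the example uses only the Python standard library, toy 8-bit fingerprints stored as integers instead of 1024-bit ECFP4 vectors, and toy 2-dimensional embeddings instead of ChemBERTa or SELFormer outputs. The function names (`cosine_similarity`, `tanimoto`, `hamming`, `top_k`) and all data values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def tanimoto(a, b):
    """Tanimoto coefficient of two bit fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption: two empty fingerprints score 0
    return bin(a & b).count("1") / union


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def top_k(query, library, similarity, k):
    """Indices of the k library items most similar to the query."""
    scored = [(similarity(query, item), idx) for idx, item in enumerate(library)]
    # Sort by descending similarity; break ties by library order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for ECFP4 fingerprints (hypothetical values).
fingerprints = [0b10110010, 0b10110000, 0b01001101, 0b10000010]
query_fp = 0b10110010

# Toy stand-ins for transformer embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query_emb = [1.0, 0.0]

fp_hits = top_k(query_fp, fingerprints, tanimoto, k=2)
emb_hits = top_k(query_emb, embeddings, cosine_similarity, k=2)

# Evaluate fingerprint hits by Hamming distance to the query.
fp_hit_distances = [hamming(query_fp, fingerprints[i]) for i in fp_hits]
```

The same `top_k` function serves both representations because only the similarity measure changes, which mirrors the comparison in the abstract. Graph edit distance and the ANOVA/Tukey analysis are omitted here, since they require a molecular graph library and a statistics package.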

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International