Open Collections
UBC Undergraduate Research
Comparing SMILES and SELFIES Embeddings and ECFP4 Fingerprints for Molecular Similarity Search
Karmarkar, Amey
Abstract
We compare SMILES and SELFIES embeddings generated by two transformer models (ChemBERTa, SELFormer) against 1024-bit ECFP4 fingerprints for molecular similarity search. Using thirty chemically diverse query molecules, we retrieve the top-k nearest neighbours, k ∈ {5, 8, 10, 15, 20, 25, 35}, via cosine similarity for all embedding representations and the Tanimoto coefficient for fingerprints. Hits are evaluated on chemical similarity (Hamming distance) and structural similarity (Graph Edit Distance, GED). Statistical significance is assessed with one-way ANOVA and Tukey’s HSD. Our results confirm that ECFP4/Tanimoto yields the lowest mean Hamming distances (mean = 12.5 at k = 35) as well as the lowest mean GEDs (mean = 18.9). SELFIES embeddings consistently outperform SMILES embeddings yet remain inferior to fingerprints.
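The retrieval step described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the author's code: the function names, the toy data, and the choice to represent each fingerprint as a set of on-bit indices are all assumptions made here for brevity, and the abstract does not say which representation the Hamming distance is computed on.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return len(fp_a & fp_b) / union


def hamming(fp_a, fp_b):
    """Hamming distance: number of bit positions where the fingerprints differ."""
    return len(fp_a ^ fp_b)


def top_k(query, library, similarity, k):
    """Indices of the k library items most similar to the query (ties by index)."""
    scored = [(similarity(query, item), idx) for idx, item in enumerate(library)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Hypothetical toy data: four fingerprints and three 2-d embeddings.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3}, {7, 8}, {1, 2, 9, 10}]
query_fp = {1, 2, 3, 5}
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query_emb = [1.0, 0.1]

fp_hits = top_k(query_fp, fingerprints, tanimoto, k=2)
emb_hits = top_k(query_emb, embeddings, cosine_similarity, k=2)
fp_hit_distances = [hamming(query_fp, fingerprints[i]) for i in fp_hits]
```

In a real pipeline the fingerprints would come from a cheminformatics toolkit and the embeddings from the transformer models; only the ranking and scoring logic is shown here.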
Item Metadata
| Field | Value |
|---|---|
| Title | Comparing SMILES and SELFIES Embeddings and ECFP4 Fingerprints for Molecular Similarity Search |
| Creator | Karmarkar, Amey |
| Date Issued | 2025-08 |
| Genre | |
| Type | |
| Language | eng |
| Series | |
| Date Available | 2025-08-12 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0449638 |
| URI | |
| Affiliation | |
| Peer Review Status | Unreviewed |
| Scholarly Level | Undergraduate |
| Rights URI | |
| Aggregated Source Repository | DSpace |