Systematic Evaluation of Feature Representations Improves Cancer-Associated sORF Prediction in Non-coding RNA

Poster Abstract: ALEXANDRE ROSSI PASCHOAL, Group Leader, Rosalind Franklin Institute

Abstract

Background: Short open reading frames (sORFs) within non-coding RNAs (ncRNAs) have emerged as a hidden layer of gene regulation, encoding small peptides that represent a new class of cancer regulators with diagnostic and therapeutic potential. However, linking sORFs to specific cancer types remains challenging and requires computational approaches for accurate prediction. Recently, the CoraL framework introduced the first computational approach for predicting cancer-associated peptides, focusing primarily on model architecture while overlooking how feature extraction strategies influence predictive accuracy. We present a systematic evaluation of machine learning models and feature extraction approaches to predict cancer-associated sORFs across 15 cancer types. We benchmarked three classical algorithms (Random Forest, SVM, and MLP) combined with three feature extraction methods: k-mer frequency, Word2Vec embeddings, and genomic language model (gLM)-based embeddings. To our knowledge, this is the first study applying gLM-derived embeddings to the prediction of cancer-associated sORFs in ncRNA. Our results show that classical machine learning models with appropriate feature extraction outperform the CoraL baseline across all cancer types, achieving up to 10% higher accuracy in some datasets. Interestingly, k-mer features consistently outperformed gLM embeddings used without fine-tuning, highlighting that the effectiveness of a representation method depends on the characteristics of the data. Additionally, we observed that the way sequences are tokenized, such as the choice of k-mer length, can affect performance: longer k-mers (e.g., k=7) sometimes reduced accuracy for Random Forest but had a smaller effect on MLP.

Conclusions: Our findings suggest that appropriate feature engineering can provide greater improvements than increasing model complexity.
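
Illustration: the k-mer frequency representation named in the abstract can be sketched as below. This is a minimal example under stated assumptions, not the pipeline used in the study: the function name kmer_frequencies, the mapping of U to T, the skipping of windows that contain other symbols, and the toy sequence are all hypothetical choices made for illustration.

```python
from itertools import product

# Assumption: sequences are reduced to the DNA alphabet (RNA 'U' is read as 'T').
ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies as a list of length 4**k.

    Positions follow the lexicographic order of all k-mers over ACGT.
    Windows containing any other symbol (e.g. 'N') are skipped.
    """
    seq = sequence.upper().replace("U", "T")
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}

    counts = [0] * len(vocab)
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:
            counts[index[kmer]] += 1

    total = sum(counts)
    if total == 0:
        # Sequence shorter than k, or no valid window: all-zero vector.
        return [0.0] * len(vocab)
    return [c / total for c in counts]


if __name__ == "__main__":
    toy_sorf = "AUGGCUUAA"  # hypothetical toy sequence, not from the study data
    for k in (3, 5, 7):
        vec = kmer_frequencies(toy_sorf, k)
        nonzero = sum(1 for v in vec if v > 0)
        print(f"k={k}: vector length={len(vec)}, non-zero entries={nonzero}")
```

The vector length grows as 4^k while a short sORF contributes only a handful of windows, so the representation becomes very sparse at larger k. That sparsity may be relevant to the k=7 observation, although the abstract does not state a cause. The resulting vectors could be passed to any of the classical models mentioned (Random Forest, SVM, MLP).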