# Reconstructing a latent representation of gene expression from genomic alterations to improve clinical utility of real-world clinicogenomics data

### Authors

Sunil Kumar, Felicia Kuperwaser, Dillon Tracy, Jeff Sherman, Emily Vucic and Maayan Baron

### Abstract # 3519

## BACKGROUND

- Patient datasets that combine clinical and molecular information are ideal for studying tumor biology and developing robust machine learning (ML) models for predicting outcome and treatment response. These data, however, rarely exist in real-world settings or in sufficient quantities within research contexts.
- Large publicly available datasets like The Cancer Genome Atlas (TCGA), which provide multi-omic profiles for diverse cancer types, have greatly facilitated development of novel therapies and personalized medicines. However, the absence of patient outcome data tied to treatment limits the applicability of these data for understanding and modeling treatment response.
- Real-world clinicogenomics cohorts, such as AACR Project GENIE, are by contrast typically very rich in clinical annotations, including treatment regimens and outcome measures. Their tumor molecular profiles, however, are sparse, rarely exceeding a few hundred genes profiled.

## METHODS

We developed an ML model (Mut2Ex) that reconstructs tumor gene expression profiles from the genetic information available on commercial next-generation sequencing panels. It uses a regression-adapted Principal Label Space Transformation (PLST), along with embeddings of minimal clinical information (OncoTree code, sex and stage) generated by a language model (Fig. 1). Mut2Ex was trained on ~1200 DepMap cell lines across 26 cancer types to reconstruct whole-transcriptome mRNA expression profiles. These profiles were generated for ~10,000 tumors from TCGA and ~180,000 tumors from AACR Project GENIE and applied to a variety of clinical tasks.
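The regression-adapted PLST described in the Methods can be illustrated with a minimal sketch. It is not the Mut2Ex implementation: the function names, the ridge penalty, the number of latent components and the synthetic data are all assumptions made for illustration. The idea shown is the general PLST recipe: compress the expression matrix onto its top principal directions, regress those latent codes on the input features (here a binary matrix standing in for one-hot alterations concatenated with clinical embeddings), then decode predictions back to gene space.

```python
import numpy as np

def fit_plst_regressor(X, Y, n_components=3, ridge=1.0):
    """PLST-style fit (illustrative): SVD-compress Y, then ridge-regress the codes on X."""
    y_mean = Y.mean(axis=0)
    # Principal directions of the centered expression matrix.
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    V = Vt[:n_components].T                  # (n_genes, k) decoding basis
    Z = (Y - y_mean) @ V                     # latent codes, zero-mean by construction
    # Closed-form ridge regression from centered inputs to latent codes.
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(X.shape[1]), Xc.T @ Z)
    return {"V": V, "W": W, "x_mean": x_mean, "y_mean": y_mean}

def predict_expression(model, X):
    """Predict latent codes from X, then decode them back to gene space."""
    Z_hat = (X - model["x_mean"]) @ model["W"]
    return Z_hat @ model["V"].T + model["y_mean"]

# Synthetic stand-in data (assumption): 200 samples, 20 binary alteration
# features, 50 genes driven by 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
Y = X @ rng.normal(size=(20, 3)) @ rng.normal(size=(3, 50))
Y += 0.1 * rng.normal(size=Y.shape)

X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

model = fit_plst_regressor(X_train, Y_train, n_components=3, ridge=1.0)
Y_hat = predict_expression(model, X_test)

# Per-sample Pearson correlation between reconstructed and true profiles.
per_sample_r = np.array(
    [np.corrcoef(Y_hat[i], Y_test[i])[0, 1] for i in range(len(Y_test))]
)
mean_r = float(per_sample_r.mean())
print(f"mean per-sample correlation on held-out samples: {mean_r:.3f}")
```

Regressing onto a handful of latent codes rather than onto every gene keeps the number of fitted targets small, which is the main appeal of a label-space transformation when the output (a whole transcriptome) is far wider than the input panel.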
## RESULTS

- Reconstructed mRNA expression by Mut2Ex was highly correlated with true expression in cell lines (r = 0.9342; 95% CI 0.9328-0.9357; N = 164). Compared with true expression, reconstructed profiles recapitulated sub-clusters within cancer types, PAM50 subtyping in breast tumors, survival signatures in colorectal tumors and multiple oncogenic signatures in a pan-cancer manner.
- Analysis of reconstructed expression for AACR Project GENIE tumors revealed expected enrichment of known driver genes within expression subtypes and enrichment of oncogenic signatures associated with distinct clinical outcomes in a cancer-type-specific manner.

Figure panel captions:

- Boxplots of the number of mutations per sample for each cancer type.
- Same as above, but limited to the 220 genes and hotspot mutations that are the input for our model.

**Fig. 1** (panel labels: Input, Open Source & Proprietary, Output). Clinical information (OncoTree code, sex and stage) was embedded using a language model designed for biomedical text mining tasks, called BioBERT. Corresponding one-hot encoded patient tumor hotspot mutations and high-level copy number alterations (amplifications or homozygous deletions) for a set of n≈220 genes commonly profiled on multi-gene commercial next-generation sequencing (NGS) panels were input into an adapted Principal Label Space Transformation model to reconstruct an mRNA transcriptome (n=18,969 genes) for a tumor sample. Reconstructed expression profiles can be applied to downstream analyses or mRNA-based clinical tasks, augmenting the utility of real-world data (RWD) cohorts.

Zephyr AI's machine learning (ML) method reconstructs transcriptomes with high accuracy across multiple tumor types. Our expression reconstruction model, trained on the DepMap dataset, used 720 cell lines for training and 164 for testing. We compared the model's reconstructed expression profiles (18,969 genes) to actual expression in 26 cancer subtypes (Fig. 2A). Including clinical features significantly improved accuracy at sample and gene levels (Fig. 2B, P<10,
effect sizes of 1.18 and 1.02, respectively). There was a strong positive correlation between reconstructed and true expression, especially for highly variable genes (Fig. 2C, left panel, r=0.66, P<10), suggesting that variability enhances model learning and prediction.

### Zephyr AI’s Reconstruction Model is Robust Across Diverse Commercial NGS Panels

A t-distributed stochastic neighbor embedding (t-SNE) plot of reconstructed expression profiles shows that the model output is robust across genomic inputs from various commercial NGS providers and assays, with no distinct clustering by assay type (Fig. 3D), while capturing salient clinical and biological features including cancer type (Fig. 3E) and expression patterns of key cancer genes (Fig. 3F).

### Evaluating Clinical Utility of Zephyr AI's Reconstructed Expression Model in Breast Cancer

To assess the clinical utility of reconstructed expression, we applied our method to 564 breast cancer samples (Fig. 4A), using only the features specified in Fig. 1. We compared the predictive performance of reconstructed expression with that of true expression and of mutations for four clinical classifications: stage, HER2 status, ER status, and PAM50 status. Reconstructed expression performed comparably to true expression and outperformed DNA alterations alone in all tasks (Fig. 4B).

Reconstructed expression profiles were generated for 272 colon adenocarcinoma tumors. Oncotype DX signatures were derived using genes from either real RNA sequencing or reconstructed profiles. Overall survival (OS) was compared between patients with high and low risk scores (RS) from real expression (Fig. 5A) and reconstructed expression (Fig. 5B). Real expression-based signatures showed a 12-month survival increase for low-RS patients, whereas reconstructed expression-based signatures showed a 27-month increase. Although a high correlation (r = 0.65, p-value < 10) was observed between risk scores from real and reconstructed expression (Fig.
5C), discrepancies may contribute to improved survival outcomes in some patients. Indeed, gene set enrichment analysis revealed that high RS from reconstructed expression is associated with STK33 and BMI pathways, whereas high RS from real expression is linked to fatty acid metabolism and AKT/MTOR signaling (Fig. 5D).

## ACKNOWLEDGEMENTS

The authors express their gratitude to the Zephyr AI science, engineering, data and business development teams for invaluable technical support and discussion. We also acknowledge the contributions of the authors and organizations cited, with special thanks to AACR Project GENIE, TCGA and the Cancer Dependency Map for essential data resources. We extend our appreciation to Candy Zhu and Jasmine Chu for their valuable assistance in designing this poster.

## CONCLUSION