# Generative Bayesian Networks for Augmentation of Molecular Data from Commercial Genetic Panels

**Dillon Tracy, Jeff Sherman, Maayan Baron**

## SUMMARY

- We introduce a generative Bayesian network method that synthesizes annotated patient feature profiles from limited real-world molecular data covering a constrained set of genes, focusing on somatic mutations in lung and breast cancer.

## Abstract #7373

- This approach addresses the challenges posed by tumor data that are widely available clinically yet molecularly sparse, enhancing the value of established real-world clinicogenomic datasets and potentially advancing precision oncology through personalized treatment guidance, enriched data analysis, and novel biomarker identification.

## BACKGROUND

- The number of genes on commercial NGS panels continues to increase over time, reflecting the discovery of more biomarkers in cancer research and the translation of these discoveries into clinical practice.

- The resulting molecular sparsity is most pronounced for earlier assays, leaving real-world clinicogenomic databases rich in longitudinal clinical follow-up but restricted in their applicability to research pursuits such as biomarker discovery.


- We hypothesized that by modeling the joint distribution of both observed and unobserved molecular features in a large tumor cohort using a Bayesian network and Gibbs sampling, we could effectively infer and synthesize comprehensive mutational profiles for tumors with otherwise limited data from commercial NGS panels (Fig. 1).
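The idea of clamping observed genes and sampling the unobserved ones can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary mutation states, a logistic conditional for each node given its parents, and hypothetical bias terms. The edge weights reuse the SOX10 parent coefficients shown on the poster purely as illustrative numbers, since the poster does not state how those coefficients enter the model. The names `gibbs_augment` and `node_prob` are invented for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def node_prob(node, state, parents, bias):
    """P(node = 1 | parents), using an assumed logistic conditional."""
    z = bias[node] + sum(w * state[p] for p, w in parents.get(node, {}).items())
    return sigmoid(z)


def gibbs_augment(observed, genes, parents, bias, n_samples=200, burn_in=50, seed=0):
    """Sample unobserved genes conditioned on observed ones via Gibbs sweeps."""
    rng = random.Random(seed)

    # Invert the parent map so each node knows its children (Markov blanket).
    children = {g: [] for g in genes}
    for child, pa in parents.items():
        for p in pa:
            children[p].append(child)

    hidden = [g for g in genes if g not in observed]
    state = dict(observed)
    for g in hidden:
        state[g] = rng.randint(0, 1)  # random initial value for hidden genes

    samples = []
    for sweep in range(burn_in + n_samples):
        for g in hidden:
            weight = {}
            for value in (0, 1):
                state[g] = value
                # Term for the node given its own parents.
                p_self = node_prob(g, state, parents, bias)
                w = p_self if value == 1 else 1.0 - p_self
                # Terms for each child given the node's candidate value.
                for c in children[g]:
                    p_c = node_prob(c, state, parents, bias)
                    w *= p_c if state[c] == 1 else 1.0 - p_c
                weight[value] = w
            p_one = weight[1] / (weight[0] + weight[1])
            state[g] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            samples.append(dict(state))
    return samples


# Illustrative network: four parents of SOX10 (weights borrowed from the poster's table).
genes = ["FLI1", "IRF2", "NFE2", "PTPRO", "SOX10"]
parents = {"SOX10": {"FLI1": 0.645, "IRF2": 0.779, "NFE2": 0.473, "PTPRO": 0.779}}
bias = {g: -1.5 for g in genes}        # hypothetical baseline log-odds
observed = {"FLI1": 1, "IRF2": 0}      # genes covered by the limited panel

samples = gibbs_augment(observed, genes, parents, bias)
sox10_rate = sum(s["SOX10"] for s in samples) / len(samples)
print(f"Estimated P(SOX10 mutated | observed) = {sox10_rate:.2f}")
```

Each retained sample is one synthetic profile. Observed genes never change, so every profile stays consistent with the actual panel result while the remaining genes vary according to the modeled dependencies.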

## Characterizing Drug Response Using Generated Patient Mutational Data

One application of this type of generative model is in downstream modeling and biomarker discovery. Using an internal drug response prediction model, we found that variations in augmented profiles typically induced only small perturbations in modeled drug response (Fig. 4). Interestingly, when outputs were discordant between limited and expanded actual gene panel inputs, predictions from synthetic data were more concordant with results from the expanded panels (Fig. 5). Moreover, synthetic data enables

| parent | child | coeff |
| --- | --- | --- |
| FLI1 | SOX10 | 0.645 |
| IRF2 | SOX10 | 0.779 |
| NFE2 | SOX10 | 0.473 |
| PTPRO | SOX10 | 0.779 |

Panel A: Marginal mutation probabilities by gene, TCGA LUSC cohort.

Figure panel label: GENIE-DFCI-004310-577

Figure 4. Consistency in drug sensitivity predictions using real and synthesized mutational profiles for fulvestrant in a BRCA patient. Fulvestrant drug response prediction scores were generated for a single BRCA patient using 5000 synthetic profiles (metapanel, 757g) generated from an actual 190-gene set. For comparison, predictions from the same drug response model were obtained using actual data from a 190-gene panel (panel, 190g) and a 757-gene panel (envelope, 757g). Predictions from real and synthesized profiles were highly consistent (higher AUC = increased resistance).

Figure 5. Tamoxifen response predictions from synthetic and expanded gene panels are concordant. Tamoxifen response scores were obtained for an additional BRCA patient using synthetic 757-gene profiles (metapanel, 757g) generated from an actual 190-gene mutation panel result for this patient. Predictions were discordant between the limited panel (panel, 190g; predicted sensitive) and the expanded actual panel (envelope, 757g; predicted resistant), highlighting the effect of larger gene panels on predictive power. Notably, the insensitive peak in the predicted response distribution for the synthetic profiles (metapanel, 757g) closely matched the prediction from the actual 757-gene panel, supporting the robustness of our method. The solid line marks the sensitive/insensitive boundary for the binary classifier whose feature importance appears in the SHAP analysis (right). Positive and red SHAP values indicate that REL mutations are linked to tamoxifen sensitivity.
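The comparison described in the figure captions can be summarized numerically. The following is a minimal sketch under stated assumptions, not the analysis used on the poster: the function name `concordance_summary`, the decision boundary, and all scores are hypothetical, and it assumes that higher scores mean greater resistance, following the convention noted for Figure 4.

```python
def concordance_summary(synthetic_scores, limited_score, expanded_score, boundary):
    """Compare binary calls from synthetic profiles with actual-panel calls."""

    def call(score):
        # Assumed convention: scores at or above the boundary are "insensitive".
        return "insensitive" if score >= boundary else "sensitive"

    n = len(synthetic_scores)
    frac_insensitive = sum(s >= boundary for s in synthetic_scores) / n
    majority = "insensitive" if frac_insensitive >= 0.5 else "sensitive"
    return {
        "limited_call": call(limited_score),
        "expanded_call": call(expanded_score),
        "synthetic_majority_call": majority,
        "fraction_synthetic_insensitive": frac_insensitive,
        "synthetic_agrees_with_expanded": majority == call(expanded_score),
    }


# Made-up scores for illustration only; these are not values from the poster.
summary = concordance_summary(
    synthetic_scores=[0.2, 0.7, 0.8, 0.75, 0.3],
    limited_score=0.25,
    expanded_score=0.72,
    boundary=0.5,
)
print(summary)
```

A majority vote is only one possible summary. Because the synthetic predictions form a distribution, reporting the fraction on each side of the boundary preserves more information than a single call.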

- Additionally, these enhanced datasets hold the potential to facilitate development of machine learning models, utilizing vast amounts of real-world data to address diverse questions that support the advancement of precision medicine.

- The Bayesian network method for synthesizing patient genetic profiles tackles the issue of limited molecular data in real-world clinical settings, significantly enhancing real-world clinicogenomic datasets that typically lack molecular detail but have extensive clinical follow-up.

1. Wang L, Audenaert P, Michoel T. High-Dimensional Bayesian Network Inference From Systems Genetics Data Using Genetic Node Ordering. Front Genet. 2019 Dec 20;10:1196. doi: 10.3389/fgene.2019.01196. PMID: 31921278; PMCID: PMC6933017.
2. Lee J, Choi MK, Song IS. Recent Advances in Doxorubicin Formulation to Enhance Pharmacokinetics and Tumor Targeting. Pharmaceuticals (Basel). 2023 May 29;16(6):802. doi: 10.3390/ph16060802. PMID: 37375753; PMCID: PMC10301446.
4. Malash I, Mansour O, Gaafar R, et al. Her2/EGFR-PDGFR pathway aberrations associated with tamoxifen response in metastatic breast cancer patients. J Egypt Natl Canc Inst. 2022;34:31. [https://doi.org/10.1186/s43046-022-00132-5](https://doi.org/10.1186/s43046-022-00132-5)
5. Chouhan S, Singh S, Athavale D, Ramteke P, Vanuopadath M, Nair BG, Nair SS, Bhat MK. Sensitization of hepatocellular carcinoma cells towards doxorubicin and sorafenib is facilitated by glucose dependent alterations in reactive oxygen species, P-glycoprotein and DKK4. J Biosci. 2020;45:97. PMID: 32713860.
6. Williams MM, Cook RS. Bcl-2 family proteins in breast development and cancer: could Mcl-1 targeting overcome therapeutic resistance? Oncotarget. 2015 Feb 28;6(6):3519-30. doi: 10.18632/oncotarget.2792. PMID: 25784482; PMCID: PMC4414133.
