1. Network construction by integrating heterogeneous genomic data

An AF-specific functional gene network (FGN) was built by integrating AF RNA-seq gene expression datasets with several other genomic interaction datasets using the XGBoost algorithm. The method is briefly described below.

1.1 Functional genomic data and gold standard of functionally related gene pairs

The integrated heterogeneous functional genomic data consist predominantly of gene expression datasets obtained from GEO, together with general omics data. We downloaded the AF expression data from GEO. In each dataset, lowly expressed genes (FPKM < 0.1 in more than 90% of samples, the same criterion used in [1]) were removed, yielding the qualified gene expression data. Because the co-function probability (CFP) between two genes in the network is defined per gene pair, the features that characterize it must also be pairwise. For each dataset, the Pearson correlation coefficient (PCC) between every pair of genes was calculated and used as a feature.
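The filtering and PCC steps for a single dataset can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names, the gene identifiers, and the toy FPKM matrix are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def filter_low_expression(fpkm, threshold=0.1, max_low_fraction=0.9):
    """Remove genes (rows) with FPKM < threshold in more than
    max_low_fraction of the samples (columns)."""
    low_fraction = (fpkm < threshold).mean(axis=1)
    keep = low_fraction <= max_low_fraction
    return fpkm[keep], keep

def pairwise_pcc(expr, gene_ids):
    """Return {(gene_a, gene_b): PCC} for every unordered gene pair."""
    corr = np.corrcoef(expr)  # rows are treated as variables (genes)
    rows, cols = np.triu_indices(len(gene_ids), k=1)
    return {(gene_ids[i], gene_ids[j]): float(corr[i, j])
            for i, j in zip(rows, cols)}

# Hypothetical toy dataset: 4 genes x 10 samples.
genes = ["G1", "G2", "G3", "G4"]
fpkm = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0],  # tracks G1
    [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],       # mirrors G1
    [0.0] * 10,                                                # never expressed
])

expr, keep = filter_low_expression(fpkm)
kept_genes = [g for g, k in zip(genes, keep) if k]
pcc = pairwise_pcc(expr, kept_genes)
```

In this toy example G4 is removed by the filter, and each remaining gene pair receives one PCC value. Repeating the computation for every dataset gives one PCC feature per dataset for each gene pair.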

In addition, we considered six pairwise genomic features obtained from the GIANT website[3]: protein-protein interactions from MINT, IntAct, and BioGRID; chemical and genetic perturbations; shared 3′ UTR microRNA binding motifs from MSigDB; and co-occurrence of transcription factor binding sites originally from JASPAR. Together, the per-dataset PCC features and these six genomic features form the feature set for each gene pair.

We then generated functionally related (positive) and unrelated (negative) gene pairs using the method established in our previous work[1, 4, 5]. Briefly, gene pairs that are co-annotated to the same Gene Ontology (GO) biological process or Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway are considered positives. For GO terms, only annotations with experimental evidence codes (EXP, IDA, IPI, IMP, IGI and IEP) were used, for quality control. The gene pairs generated from GO and KEGG were combined, and redundant pairs were removed. Negatives were generated using the approach established in previous work[1, 4].
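The construction of positive pairs can be sketched as follows. The input layout, the function name, and the term, pathway, and gene identifiers are hypothetical and chosen only for illustration. Negative pairs are not sketched here because their generation follows [1, 4] and is not detailed in this section.

```python
from itertools import combinations

# Experimental GO evidence codes retained for quality control.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def positive_pairs(go_annotations, kegg_pathways):
    """Build the set of co-annotated gene pairs.

    go_annotations: iterable of (gene, go_term, evidence_code) tuples
    kegg_pathways:  dict mapping pathway id -> iterable of genes
    """
    gene_sets = {}
    for gene, term, code in go_annotations:
        if code in EXPERIMENTAL_CODES:  # drop non-experimental annotations
            gene_sets.setdefault(("GO", term), set()).add(gene)
    for pathway, members in kegg_pathways.items():
        gene_sets.setdefault(("KEGG", pathway), set()).update(members)

    pairs = set()  # a set removes pairs that occur in both GO and KEGG
    for members in gene_sets.values():
        # Sorting gives each pair one canonical order, e.g. (A, B) not (B, A).
        pairs.update(combinations(sorted(members), 2))
    return pairs

# Hypothetical toy annotations.
go = [
    ("GENE_A", "TERM_X", "IDA"),
    ("GENE_B", "TERM_X", "IMP"),
    ("GENE_C", "TERM_X", "IEA"),  # electronic annotation, excluded
]
kegg = {"PATHWAY_Y": ["GENE_A", "GENE_B", "GENE_D"]}

positives = positive_pairs(go, kegg)
```

Here the pair (GENE_A, GENE_B) is supported by both GO and KEGG but appears only once in the result, and GENE_C contributes no pairs because its only annotation lacks an experimental evidence code.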

1.2 Genomic data integration with XGBoost

Considering the limitations of the naïve Bayesian classifier used in previous work, the XGBoost algorithm was adopted to build the network prediction model because it does not assume feature independence, is suitable for nonlinear modelling, scales to large datasets, and has proved powerful in many studies[6-9]. An XGBoost model was trained using the pairwise features and the labelled positive and negative gene pairs as input. The trained model was then used to predict the CFP for all possible gene pairs, resulting in the AF-specific FGN.

References

[1] Lin C-X, Li H-D, Deng C, Liu W, Erhardt S, Wu F-X, et al. An integrated brain-specific network identifies genes associated with neuropathologic and clinical traits of Alzheimer's disease. Briefings in Bioinformatics. 2021:bbab522. https://doi.org/10.1093/bib/bbab522.
[2] Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A. 2003;100:8348-53.
[3] Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47:569-576.
[4] Li H-D, Bai T, Sandford E, Burmeister M, Guan Y. BaiHui: cross-species brain-specific network built with hundreds of hand-curated datasets. Bioinformatics. 2018;35:2486-2488.
[5] Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLOS Comput Biol. 2008;4:e1000165.
[6] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). New York, NY: Association for Computing Machinery; 2016. p. 785-794.
[7] Kim N, Kim HK, Lee S, Seo JH, Choi JW, Park J, et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nature Biotechnology. 2020;38:1328-1336.
[8] Lewis JE, Kemp ML. Integration of machine learning and genome-scale metabolic modeling identifies multi-omics biomarkers for radiation resistance. Nature Communications. 2021;12:2700.
[9] Cichońska A, Ravikumar B, Allaway RJ, Wan F, Park S, Isayev O, et al. Crowdsourced mapping of unexplored target space of kinase inhibitors. Nature Communications. 2021;12:3307.