Each tissue-wise functional gene network (FGN) is built by integrating tissue RNA-seq gene expression datasets along with several other genomic interaction datasets using the state-of-the-art XGBoost algorithm. The method is briefly described below.
The integrated heterogeneous functional genomic data are dominantly RNA-seq gene expression, which are obtained from the the Digital Expression Explorer 2 (DEE2) database. In DEE2, the RNA-seq data are organized into datasets, with each dataset contains multiple samples. We downloaded the human RNA-seq expression data measured with Fragments per Kilobase of exon model per million mapped reads (FPKM) along with the description about tissue origins from this database. We manually read the description of every individual sample and extracted the tissue origin or cell line. In this way, we assigned a tissue type or cell line to each sample. Because for each dataset the Pearson correlation coefficient (PCC) between every pair of genes needs to be calculated as features and PCC could be spurious if the number of samples is small, we retained only the datasets containing at least 10 samples following the practice in our previous work[1]. In each dataset, lowly expressed genes (FPKM < 0.1 in more than 90% samples, which is the same criteria used in [1]) were removed. This procedure was applied to the data of each tissue. Finally, we obtained RNA-seq gene expression data for 50 tissues or cell lines. Because the co-function probability (CFP) between two genes in the network is gene-pair-wise, the features to characterize gene pairs need to be pairwise. For each RNA-seq dataset, the PCC between every gene pair was calculated and used as the feature.
In addition, we also considered six pairwise genomic features, which are obtained from the GIANT website[3], including protein-protein interaction from MINT, IntAct, and BioGRID, the chemical and genetic perturbations, shared 3’ UTR microRNA binding motif from MSigDB, and co-occurrence of transcription factor binding sites originally from Jaspar. Finally, a number of features are obtained for each tissue.
We then generated functionally related (positives) and unrelated (negatives) gene pairs using the same method established in our previous work[1, 4, 5]. Briefly, gene pairs that are co-annotated to the same Gene Ontology (GO) biological process or Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway are considered as positives. For GO terms, only the annotation with experimental evidence codes (EXP, IDA, IPI, IMP, IGI and IEP) were used for quality control. The gene pairs generated from GO and KEGG were combined, and redundant ones were removed. Negatives were generated using the same approach as established in the work[1, 4].
Considering the limitation of the naïve Bayesian classifier as used in previous work, the state-of-the-art XGBoost algorithm was adopted to build the network prediction model because it does not assume feature independence, is suitable for nonlinear modelling, scalable to large datasets, and has been proved powerful in many studies[6-9]. An XGBoost model was learned using the features and the labelled positive or negative gene pair as input. The learned model was then used to predict CFP for all possible gene pairs, resulting in the FGN for each tissue or cell line (called TissueNexus for simplicity).
[1] Li HD, Menon R, Govindarajoo B, Panwar B, Zhang Y, Omenn GS, et al. Functional Networks of Highest-Connected Splice Isoforms: From The Chromosome 17 Human Proteome Project. J Proteome Res. 2015;14:3484-91.
[2] Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A. 2003;100:8348-53.
[3] Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47:569-576.
[4] Li H-D, Bai T, Sandford E, Burmeister M, Guan Y. BaiHui: cross-species brain-specific network built with hundreds of hand-curated datasets. Bioinformatics. 2018;35:2486-2488.
[5] Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLOS Comput Biol. 2008;4:e1000165.
[6] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Association for Computing Machinery, New York, NY, United States; 2016. p. 785–794.
[7] Kim N, Kim HK, Lee S, Seo JH, Choi JW, Park J, et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nature Biotechnology. 2020;38:1328-1336.
[8] Lewis JE, Kemp ML. Integration of machine learning and genome-scale metabolic modeling identifies multi-omics biomarkers for radiation resistance. Nature Communications. 2021;12:2700.
[9] Cichońska A, Ravikumar B, Allaway RJ, Wan F, Park S, Isayev O, et al. Crowdsourced mapping of unexplored target space of kinase inhibitors. Nature Communications. 2021;12:3307.
We quantified the contribution of each individual dataset to the functional relationship between two genes using the established method (described on https://humanbase.readthedocs.io/en/latest/functional-networks.html#evidence). Briefly, it measures the increment of the co-functional probability between a pair of genes (denoted by Q) conditioned on feature data D.
Q = p(y=1|D) - p(y=1)                  (1)
, where p(y=1) and p(y=1|D) represent the prior and posterior probability that a given gene pair is functionally related (i.e., being positive) conditioned on given data D. The percentage of positive gene pairs is used as prior probability and the posterior probability can be calcualted using Bayesian formula.
In this way, we have calculated the contribution of each dataset to each edge, thus providing a means to evaluate whether a given edge is supported by RNA-seq data or by tissue agnostic databases or by both.
The GOTermFinder software is used to perform enrichment analysis. The input data is the Gene Ontology database (2021-09-01 version) and the human gene annotation file (2021-09-01 version), which are available from the GO website.
We used the "Curated gene-disease associations" downloaded from the DisGeNET database (version 5.0) for disease enrichment analysis. DisGeNET (v5.0) contains 561,119 gene-disease associations (GDAs), between 17,074 genes and 20,370 diseases, disorders, traits, and clinical or abnormal human phenotypes", as stated by DisGeNET.
Using the DisGeNET data, for each disease/traits, we performed enrichment analysis and evaluated significance of the enrichment using hypergeometric distribution which is commonly used in enrichment anlaysis such as in GO enrichment. The resulting p-value is corrected for multiple tests using the Bonferroni method.
Copyright @ Hunan Provincial Key Lab on Bioinformatics
School of Computer Science and Engineering, Central South University, Changsha, Hunan Province, P.R. China