TissueNexus

1. Network construction by integrating heterogeneous genomic data

Each tissue-specific functional gene network (FGN) is built by integrating tissue-specific RNA-seq gene expression datasets along with several other genomic interaction datasets using the state-of-the-art XGBoost algorithm. The method is briefly described below.

1.1 Functional genomic data and gold standard of functionally related gene pairs.

The integrated heterogeneous functional genomic data are dominantly RNA-seq gene expression, which are obtained from the the Digital Expression Explorer 2 (DEE2) database. In DEE2, the RNA-seq data are organized into datasets, with each dataset contains multiple samples. We downloaded the human RNA-seq expression data measured with Fragments per Kilobase of exon model per million mapped reads (FPKM) along with the description about tissue origins from this database. We manually read the description of every individual sample and extracted the tissue origin or cell line. In this way, we assigned a tissue type or cell line to each sample. Because for each dataset the Pearson correlation coefficient (PCC) between every pair of genes needs to be calculated as features and PCC could be spurious if the number of samples is small, we retained only the datasets containing at least 10 samples following the practice in our previous work[1]. In each dataset, lowly expressed genes (FPKM < 0.1 in more than 90% samples, which is the same criteria used in [1]) were removed. This procedure was applied to the data of each tissue. Finally, we obtained RNA-seq gene expression data for 50 tissues or cell lines. Because the co-function probability (CFP) between two genes in the network is gene-pair-wise, the features to characterize gene pairs need to be pairwise. For each RNA-seq dataset, the PCC between every gene pair was calculated and used as a feature. Because the PCC in different datasets may have different distributions and therefore not comparable, we followed the established method in previous studies[1, 2] and normalized the PCC of each dataset using Fisher’s z-transformation as follows:

z = 0.5ln[(1+r)/(1-r)] (1)

, where r and z represent the original and transformed PCC, respectively. The transformed PCC is approximately normally distributed and are comparable across datasets.

In addition, we also considered seven pairwise genomic features. Six features are obtained from the GIANT website[3], including protein-protein interaction from MINT, IntAct, and BioGRID, the chemical and genetic perturbations, shared 3’ UTR microRNA binding motif from MSigDB, and co-occurrence of transcription factor binding sites originally from Jaspar. Another one is the number of shared phenotypes between genes, which is calculated based on the gene-phenotype annotation data from the Mouse Genome Informatics database (http://www.informatics.jax.org/) where each gene is annotated to one or more phenotypes. Finally, a number of features are obtained for each tissue.

We then generated functionally related (positives) and unrelated (negatives) gene pairs using the same method established in our previous work[1, 4, 5]. Briefly, gene pairs that are co-annotated to the same Gene Ontology (GO) biological process or Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway are considered as positives. For GO terms, only the annotation with experimental evidence codes (EXP, IDA, IPI, IMP, IGI and IEP) were used for quality control. The gene pairs generated from GO and KEGG were combined, and redundant ones were removed. Negatives were generated using the same approach as established in the work[1, 4].

1.2 Genomic data integration with XGBoost

Considering the limitation of the naïve Bayesian classifier as used in previous work, the state-of-the-art XGBoost algorithm was adopted to build the network prediction model because it does not assume feature independence, is suitable for nonlinear modelling, scalable to large datasets, and has been proved powerful in many studies[6-9]. An XGBoost model was learned using the features and the labelled positive or negative gene pair as input. The learned model was then used to predict CFP for all possible gene pairs, resulting in the FGN for each tissue or cell line (called TissueNexus for simplicity).

Reference

[1] Li HD, Menon R, Govindarajoo B, Panwar B, Zhang Y, Omenn GS, et al. Functional Networks of Highest-Connected Splice Isoforms: From The Chromosome 17 Human Proteome Project. J Proteome Res. 2015;14:3484-91.
[2] Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A. 2003;100:8348-53.
[3] Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47:569-576.
[4] Li H-D, Bai T, Sandford E, Burmeister M, Guan Y. BaiHui: cross-species brain-specific network built with hundreds of hand-curated datasets. Bioinformatics. 2018;35:2486-2488.
[5] Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLOS Comput Biol. 2008;4:e1000165.
[6] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Association for Computing Machinery, New York, NY, United States; 2016. p. 785–794.
[7] Kim N, Kim HK, Lee S, Seo JH, Choi JW, Park J, et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nature Biotechnology. 2020;38:1328-1336.
[8] Lewis JE, Kemp ML. Integration of machine learning and genome-scale metabolic modeling identifies multi-omics biomarkers for radiation resistance. Nature Communications. 2021;12:2700.
[9] Cichońska A, Ravikumar B, Allaway RJ, Wan F, Park S, Isayev O, et al. Crowdsourced mapping of unexplored target space of kinase inhibitors. Nature Communications. 2021;12:3307.

2. Disease/trait enrichment

We used the "Curated gene-disease associations" downloaded from the DisGeNET database (version 5.0) for disease enrichment analysis. DisGeNET (v5.0) contains 561,119 gene-disease associations (GDAs), between 17,074 genes and 20,370 diseases, disorders, traits, and clinical or abnormal human phenotypes", as stated by DisGeNET.

Using the DisGeNET data, for each disease/traits, we performed enrichment analysis and evaluated significance of the enrichment using hypergeometric distribution which is commonly used in enrichment anlaysis such as in GO enrichment. The resulting p-value is corrected for multiple tests using the Bonferroni method.

School of Computer Science and Engineering, Central South University, Changsha, Hunan Province, P.R. China