Is the Average Shortest Path Length of Gene Set a Reflection of their Biological Relatedness?

When a set of genes is identified as related to a disease, say through gene expression analysis, it is common to examine the average distance among their protein products in the human interactome as a measure of the biological relatedness of these genes. The reasoning is that genes associated with a disease tend to be functionally related, and that functionally related genes are closely connected to each other in the interactome. Typically, the average shortest path length (ASPL) of disease genes* is compared to the ASPL of randomly selected genes or to the ASPL in a randomly permuted network. We examined whether the ASPL of a set of genes is indeed a good measure of biological relatedness or whether it is simply a characteristic of the degree distribution of those genes. We examined the ASPL of gene sets for several disease and pathway associations and compared them to the ASPL of three types of control: uniform random selection from the entire proteome, degree-matched random selection, and random permutation of the network. We found that disease-associated genes and their degree-matched random genes have comparable ASPL. In other words, ASPL is a characteristic of the degree of the genes and the network topology, and not of functional coherence.


Background
Protein-protein interaction (PPI) networks are undirected graphs that denote physical interactions between proteins. They are scale-free networks, i.e., the number of interactions of a protein follows a power-law distribution, a property that has been observed repeatedly over the years, as shown in Figure 1 [1]. Graph theory has been applied to PPI networks to analyze and predict biological phenomena using topological features such as clustering coefficients, the presence of highly connected hubs, modular organization, etc. [2,3].
Average shortest path length (ASPL), a topological feature of scale-free networks, is a commonly used metric for the analysis of disease-associated genes and is used to demonstrate whether the genes are functionally cohesive. The intuition behind this is that genes that are functionally related are going to be closer to each other in the interactome. Typically, a low ASPL is reported between candidate genes compared to randomly selected genes, along with an empirical p-value of statistical significance [4][5][6][7][8][9]. ASPL is commonly used in gene prioritization [4,10]. It has been used to study pathways in complex diseases, to connect putative causal genes to target genes, and to extract disease-associated network modules [11].
* Although referred to as 'genes' in the context of disease associations, it is the protein products that are represented in the interactome. Thus, throughout the paper, 'ASPL among genes' refers to 'ASPL among their protein products'.
However, it has been shown mathematically that in scale-free networks, the ASPL of the network converges to a value that depends solely on the number of nodes and the average degree of the nodes [12]. This property has been shown empirically to hold for the PPI network of yeast: comparing the ASPL of the yeast PPI network to the ASPL of randomly generated degree-matched networks showed that they converged to the same value [13].
This leads us to the question: is the ASPL of a subset of genes a reflection of their biological relatedness or of the degree of those genes? In this work, we compared the ASPL of typical candidate gene sets, namely those that are associated with diseases, with the ASPL of randomly selected genes and with the ASPL in a randomly permuted network. We also performed experiments to study the effect of hub proteins on the ASPL of the candidate gene sets.

Methods
We collected human protein-protein interactions from two databases: BioGRID and HPRD [14,15]. We filtered the data set to only include direct interactions (MI:0407 in the PSI MI Ontology). This yielded 77,534 non-redundant interactions among 13,217 genes.
Pathways were acquired from REACTOME and diseases from KEGG [16,17]. We filtered the annotated gene sets to include only those with a minimum of 10 genes. This yielded 1,040 pathways and 283 diseases among 7,305 and 5,396 genes respectively. We also analyzed six traits studied in Genome Wide Association Studies (GWAS): Type 2 diabetes, celiac disease, bipolar disorder, obesity, Alzheimer's disease and Crohn's disease. We refer to the genes associated with a given annotation as 'candidate genes'.
The average length of the shortest path between every pair of candidate genes was computed. When there was no path between two genes, we defined the shortest path length to be 5 more than the largest distance observed between any pair of genes in the network, thus setting it to 17. We used NetworkX, a Python library, for computing ASPL [18].
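The pairwise computation described above can be sketched as follows. This is a minimal illustration in plain Python (breadth-first search over an adjacency dictionary) rather than the NetworkX code used in the study, which is not shown in the text. The function names, the toy network, and the treatment of genes absent from the network as isolated nodes are assumptions made for illustration; the value 17 for unreachable pairs is taken from the text.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Return hop counts from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Genes missing from `adj` are treated as isolated (assumption).
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def gene_set_aspl(adj, genes, no_path_length=17):
    """Mean shortest path length over all unordered pairs in `genes`.

    Pairs with no connecting path contribute `no_path_length`.
    """
    genes = list(dict.fromkeys(genes))  # drop duplicates, keep order
    if len(genes) < 2:
        raise ValueError("need at least two genes")
    dist_from = {g: bfs_distances(adj, g) for g in genes}
    lengths = [dist_from[a].get(b, no_path_length)
               for a, b in combinations(genes, 2)]
    return sum(lengths) / len(lengths)


# Hypothetical toy network: a chain A-B-C plus an isolated node D.
toy_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
connected_aspl = gene_set_aspl(toy_adj, ["A", "B", "C"])   # (1 + 2 + 1) / 3
penalized_aspl = gene_set_aspl(toy_adj, ["A", "C", "D"])   # (2 + 17 + 17) / 3
```

Because unreachable pairs are assigned a fixed large value, a gene set containing even a few genes without known interactions will have a markedly higher ASPL, which is relevant when the comparison set is drawn uniformly from all genes.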
We compared the ASPL of candidate genes in the human interactome with that in a randomly permuted network. We generated the permuted network by swapping the edges in the graph while still keeping the degree of each node fixed.
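A degree-preserving permutation of this kind can be sketched as a sequence of double-edge swaps: two edges a-b and c-d are replaced by a-d and c-b whenever this creates neither a self-loop nor a duplicate edge. The code below is an illustrative plain-Python sketch, not the authors' implementation; the function names, the number of swaps, and the toy edge list are assumptions. NetworkX offers a `double_edge_swap` function for the same kind of rewiring, although the text does not say whether it was used.

```python
import random
from collections import Counter


def degree_counts(edges):
    """Degree of every node in an undirected edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def degree_preserving_shuffle(edges, n_swaps, seed=None, max_tries_factor=100):
    """Rewire a simple undirected graph while keeping every node's degree.

    Assumes `edges` has no self-loops and no duplicate edges.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_swaps and tries < n_swaps * max_tries_factor:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:      # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:   # shared endpoint would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:   # would duplicate an edge
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Hypothetical toy graph: a 10-node ring with two chords.
toy_edges = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
shuffled = degree_preserving_shuffle(toy_edges, n_swaps=20, seed=1)
```

Each accepted swap removes one edge from each of the four nodes involved and gives one back, so the degree sequence of the permuted network is identical to that of the original.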
We also compared the ASPL of candidate genes with that of randomly selected genes in the human interactome. We employed two methods to sample genes randomly:

• Uniform sampling: Genes were sampled uniformly at random from the total of 19,676 human genes in the HPRD database.
• Degree-matched random sampling: Random genes were sampled with the constraint that they have (nearly) the same degree distribution as the candidate genes. We binned all the genes according to their degree in the interactome using varied bin sizes (Table 1), and then sampled the same number of genes from each bin as the number of candidate genes in that bin, while ensuring that candidate genes are not included in the sample.

Table 2 provides the general architecture of the human PPI network with overall statistics about its size, density and connectivity. The interactome has 30 connected components. There are only 13,217 human genes with known interactions, leaving thousands of genes without any known interactions. The ASPL values for each of the sampling methods are reported in Table 3.
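The degree-matched sampling procedure described above can be sketched as follows. This is an illustrative sketch only: the bin boundaries are hypothetical placeholders (the actual bins are those of Table 1), and the function names and toy degree table are assumptions.

```python
import bisect
import random
from collections import defaultdict


def degree_bin(degree, bin_edges):
    """Index of the bin that `degree` falls into, given sorted lower edges."""
    return bisect.bisect_right(bin_edges, degree)


def degree_matched_sample(degree_of, candidates, bin_edges, seed=None):
    """Sample non-candidate genes with the same per-bin counts as `candidates`."""
    rng = random.Random(seed)
    candidate_set = set(candidates)

    # Pool of eligible (non-candidate) genes in each degree bin.
    pool = defaultdict(list)
    for gene, deg in degree_of.items():
        if gene not in candidate_set:
            pool[degree_bin(deg, bin_edges)].append(gene)

    # Number of candidate genes falling in each bin.
    needed = defaultdict(int)
    for gene in candidate_set:
        needed[degree_bin(degree_of[gene], bin_edges)] += 1

    sample = []
    for b in sorted(needed):
        if len(pool[b]) < needed[b]:
            raise ValueError(f"not enough non-candidate genes in bin {b}")
        sample.extend(rng.sample(pool[b], needed[b]))
    return sample


# Hypothetical degrees and bin edges, for illustration only.
toy_degree_of = {f"g{i}": i % 12 for i in range(120)}
toy_bins = [1, 2, 5, 10]
toy_candidates = ["g3", "g15", "g30"]
matched = degree_matched_sample(toy_degree_of, toy_candidates, toy_bins, seed=0)
```

Wider bins make it easier to find enough non-candidate genes at high degree, at the cost of a looser match to the candidate degree distribution.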

Results
When compared with random genes selected by uniform sampling, the disease- or pathway-associated genes always had a significantly lower ASPL, suggesting that they were much more closely connected in the network. However, when compared with random genes that were constrained to have the same degree as the candidate genes, there was no discernible difference in their ASPL. This was also true when the comparison was made to the candidate genes in a randomly permuted network. Thus, we consistently observed that gene sets with the same degree distribution, whether in the given network or in another network with the same global degree distribution, had similar ASPLs.
For each of the three methods, we computed the ASPL for 1000 gene sets for each annotation. We performed a two-sided t-test, which showed that all of the traits had statistically significant p-values when compared against the ASPL of random gene lists.
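One way such a comparison could be set up is sketched below. The text does not specify which variant of the t-test was used, so this sketch assumes a one-sample test of the ASPL values from the random gene sets against the single ASPL value of the candidate set, and it approximates the t distribution by the standard normal, which is reasonable only for large samples such as the 1000 sets used here. The function name and the numbers are hypothetical.

```python
import math
import statistics


def one_sample_t(null_aspls, candidate_aspl):
    """Two-sided one-sample t statistic with a normal-approximation p-value.

    Tests whether the mean ASPL of the random gene sets differs from the
    candidate ASPL. Assumes a large sample so that t is close to normal.
    """
    n = len(null_aspls)
    mean = statistics.mean(null_aspls)
    sd = statistics.stdev(null_aspls)
    if sd == 0:
        raise ValueError("random-set ASPL values have zero variance")
    t = (mean - candidate_aspl) / (sd / math.sqrt(n))
    p = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical values: 1000 random-set ASPLs around 4 and a candidate ASPL of 3.
toy_null = [4.0, 4.2, 3.8, 4.1, 3.9] * 200
t_stat, p_value = one_sample_t(toy_null, 3.0)
```

With 1000 random sets the standard error of the mean is very small, so even modest differences in ASPL yield small p-values; the size of the difference is therefore worth reporting alongside the p-value.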
Next, we considered the possibility that the presence of hub proteins may influence the ASPL values of the selected gene sets. To account for this, we carried out the same analysis as before, but after removing hub genes from the gene sets. First we removed hub genes with degree >50, and then hub genes with degree >25. Even after removing these hub genes from the candidate and random gene sets, the degree-matched random genes had ASPL comparable to that of the candidate genes (Figure 2).
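The hub-removal step amounts to filtering each gene set by degree before recomputing the ASPL. A minimal sketch, with a hypothetical function name and toy degrees, and assuming that genes absent from the degree table have degree 0:

```python
def drop_hubs(genes, degree_of, max_degree):
    """Keep only genes whose degree does not exceed `max_degree`."""
    return [g for g in genes if degree_of.get(g, 0) <= max_degree]


# Hypothetical degrees, for illustration only.
toy_degrees = {"g1": 120, "g2": 3, "g3": 30}
without_50 = drop_hubs(["g1", "g2", "g3"], toy_degrees, max_degree=50)
without_25 = drop_hubs(["g1", "g2", "g3"], toy_degrees, max_degree=25)
```

The same filter would be applied to both the candidate set and each random set, using thresholds of 50 and 25 as in the text.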

Conclusions
ASPL is a metric commonly used in systems biology to show that a candidate set of genes is functionally related. In this paper, we discuss the bias that the method may hold due to the incomplete nature of PPI networks and also because some genes may be better studied than others due to their association with diseases. We performed an experiment using three different sampling methods and compared the ASPL of the sampled genes with that of a candidate set of genes known to be associated with diseases and pathways. We found that the ASPL of the degree-matched random genes and the ASPL of genes in the permuted network converge to the same value as that of the candidate genes. We examined whether the presence of hub genes in a candidate gene set, and therefore of corresponding genes in the degree-matched random set, was the reason for the comparable ASPL values, but found that the ASPL of degree-matched random genes remained similar to that of candidate genes even after removing hub genes.
We conclude that the ASPL is more a characteristic of the network topology than of biological cohesiveness. That is, even a randomly selected set of genes can be as closely connected as biologically related genes (e.g., genes involved in a disease), as long as the random genes have the same degree distribution as the candidate genes.

Figure 2 ASPL after removing hub genes: (a) without genes with degree greater than 50; (b) without genes with degree greater than 25.
The range of degrees a node may have: There are only 13,217 human genes with known interactions from HPRD and BioGRID, leaving most of the genome without any known interactions.
Table 3 Comparison of shortest path distances: Average shortest path lengths are shown for different gene sets, computed in the human interactome network (columns C-E) and in the randomly permuted network (column F). Distances have been rounded to the closest integer value so that the similarities are easy to see.

KEGG Diseases
Type I diabetes mellitus