Identifying cancer type specific oncogenes and tumor suppressors using limited size data

Cancer is a complex and heterogeneous genetic disease. Different mutations and dysregulated molecular mechanisms alter the pathways that lead to cell proliferation. In this paper, we explore a method which classifies genes into oncogenes (ONGs) and tumor suppressors. We optimize this method to identify specific ONGs and tumor suppressors for breast cancer, lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and colon adenocarcinoma (COAD), using data from The Cancer Genome Atlas (TCGA). A set of genes was previously classified as ONGs and tumor suppressors across multiple cancer types (Science 2013). Each gene was assigned an ONG score and a tumor suppressor score based on the frequency of its driver mutations across all variants from the Catalogue of Somatic Mutations in Cancer (COSMIC). We evaluate and optimize this approach within different cancer types from TCGA. We are able to determine known driver genes for each of the four cancer types. After establishing the baseline parameters for each cancer type, we identify new driver genes for each cancer type, and the molecular pathways that are highly affected by them. Our methodology is general and can be applied to different cancer subtypes to identify specific driver genes and improve personalized therapy.


Introduction
Cancer is a complex disease driven by different genetic, genomic or epigenetic mechanisms. A cancer driver gene is activated by driver mutations, but may also contain

This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution 4.0 (CC-BY) License. Further distribution of this work is permitted, provided the original work is properly cited.

Classifying genes into oncogenes or tumor suppressors
The 20/20 rule proposed in Ref. 2 is a heuristic algorithm to classify genes into oncogenes or tumor suppressors. The rule takes into account both the mutation categories and their frequencies. First, for a given gene, the total number of variants is computed. Then, each gene is assigned an ONG score and a tumor suppressor gene (TSG) score, which are computed based on the frequency of gain-of-function or loss-of-function mutations, respectively. Gain-of-function mutations include missense mutations or in-frame indels that recur at the same amino acid position, while loss-of-function mutations include nonstop, nonsense and frame-shift indels. 2 For each gene, the ONG score is the frequency of gain-of-function mutations out of the total number of variants, while the TSG score is the frequency of all loss-of-function mutations out of the total number of variants. To validate the method, the authors in Ref. 2 used data from COSMIC, 13 which is a large collection of somatic mutations from multiple studies of different cancers. They computed the ONG and TSG scores for 125 known cancer drivers and concluded that if the ONG score is greater than 20%, then the gene is an oncogene. Similarly, if the TSG score is higher than 20%, then the gene is a tumor suppressor. The 20/20 rule is illustrated in Fig. 1 for an ONG and a tumor suppressor. Some genes may be assigned both ONG and TSG scores greater than 20%. For example, the tumor suppressor TP53 presents similarly elevated values for both ONG and TSG scores. In the case of TP53, missense mutations generally drive the loss-of-function of the gene, 2,21-24 not its oncogenic activity as considered by the 20/20 rule. However, the 20/20 rule is a heuristic that was successfully validated for most of the well known ONGs and tumor suppressors. 2 In this paper, we will use the scores for the 125 known ONGs and tumor suppressors 2 as a baseline to further explore and optimize the rule in reduced size data sets of different cancer types.
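The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in Ref. 2 or in this paper: the mutation-type labels, the function names `score_gene` and `classify`, and the reading of "recurrent" as "at least `min_recurrence` gain-type variants at the same amino acid position" are all hypothetical choices made for the example.

```python
from collections import Counter

# Mutation-type labels are illustrative assumptions, not COSMIC/TCGA field values.
GAIN_TYPES = {"missense", "inframe_indel"}
LOSS_TYPES = {"nonsense", "nonstop", "frameshift_indel"}


def score_gene(variants, min_recurrence=2):
    """Return (ong_score, tsg_score) for one gene.

    variants: list of (amino_acid_position, mutation_type) tuples.
    A missense or in-frame variant counts as gain-of-function only if its
    position carries at least `min_recurrence` such variants (assumption).
    """
    total = len(variants)
    if total == 0:
        return 0.0, 0.0
    per_position = Counter(pos for pos, kind in variants if kind in GAIN_TYPES)
    gain = sum(n for n in per_position.values() if n >= min_recurrence)
    loss = sum(1 for _, kind in variants if kind in LOSS_TYPES)
    return gain / total, loss / total


def classify(ong, tsg, threshold=0.20):
    """Apply the default 20% cutoff to both scores."""
    is_ong, is_tsg = ong > threshold, tsg > threshold
    if is_ong and is_tsg:
        return "both"
    if is_ong:
        return "oncogene"
    if is_tsg:
        return "tumor suppressor"
    return "neither"


# Toy gene: six recurrent missense hits at one position, one isolated
# missense, one nonsense and two silent variants (made-up data).
toy_gene = [(12, "missense")] * 6 + [
    (61, "missense"), (100, "nonsense"), (5, "silent"), (7, "silent"),
]
ong, tsg = score_gene(toy_gene)
print(ong, tsg, classify(ong, tsg))
```

Both scores share the same denominator (all variants of the gene), so silent and non-recurrent missense variants lower both scores without counting toward either class.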
The results of this paper are based upon data generated by the TCGA Research Network. Publicly available somatic mutation data (level 2) were downloaded from TCGA 14 for four cancer types: breast cancer (BRCA), 15 LUAD, 16 LUSC 17 and COAD. 18

Testing the 20/20 rule
First, we compared our implementation of the 20/20 rule using COSMIC mutation data v72 against the published values from Ref. 2. The authors in Ref. 2 used v61; however, COSMIC has since been updated and is currently available for download as v72.
As expected, we obtained high correlation values (above 0.9) for both TSG and ONG scores across the 125 driver genes in Ref. 2 (Fig. 2). We explored different thresholds for the recurrence level of gain-of-function mutations: more than 2, 5, 10, 100 and 200 variants at the same amino acid position (Figs. 2(b)-2(f)). The highest correlation for the ONG scores is obtained for a recurrence threshold of 2. We will further use this threshold in our analysis.
We did not obtain a perfect correlation of 1 between our computed scores and the previously published values due to the differences in the COSMIC database versions. Moreover, the authors in Ref. 2 may have filtered out some of the cancer studies from COSMIC, while we included all the currently available data. However, the correlation values are sufficiently high (0.92 for TSG scores and 0.95 for ONG scores) to validate our implementation of the 20/20 rule. Next, we will evaluate the rule in four cancer specific data sets.

Optimizing the 20/20 rule
We hypothesized that the 20% frequency threshold of the 20/20 rule, which indicates the active state of a gene, 2 could be optimized for each data set. Active cancer genes are specific to each type of cancer, therefore this threshold may be different from one type of cancer to another. This percentage may also depend on the dimensions of the data set and may change when a reduced sample size is used. Moreover, the gene activity may be defined by different thresholds for ONGs than for tumor suppressors.
Therefore, we considered the ONG and TSG scores obtained by the 20/20 rule as baseline, since they were computed and validated on large scale data such as COSMIC. Then, we optimized the rule for each cancer type independently. We tested all possible frequency thresholds from 1% to 100% and filtered out the inactive cancer genes from the set of 125 genes. We computed the correlation of the remaining genes with the published scores and compared the results with those obtained using the default threshold of 20%. We then identified the ONG and TSG frequency thresholds which produced the highest correlation with the selected baseline genes.
To make sure we kept enough genes to estimate the correlation, we restricted our filtering so that at least 10% of the baseline genes were retained. Correlation values on fewer than 12 genes were not considered. Moreover, we only included in the analysis those genes with a total number of variants higher than the background (the mean number of variants per gene across all genes in the data set). This minimum value is around five variants in all four data sets.
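The threshold search described above can be sketched as follows. This is a simplified, hypothetical version for a single score type (ONG or TSG): the function names, the use of the Pearson coefficient (the text does not name the correlation measure), and the strict "greater than" comparison are assumptions, and the gene names and numbers in the usage example are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def optimize_threshold(computed, published, variant_counts, min_genes=12):
    """Return (threshold_percent, correlation, active_genes) or None.

    computed / published: gene -> score in [0, 1] for the baseline genes.
    variant_counts: gene -> number of variants, for ALL genes in the data set.
    """
    # Background = mean number of variants per gene across the data set.
    background = sum(variant_counts.values()) / len(variant_counts)
    eligible = [g for g in published
                if g in computed and variant_counts.get(g, 0) > background]
    best = None
    for pct in range(1, 101):
        active = [g for g in eligible if computed[g] > pct / 100]
        if len(active) < min_genes:
            break  # higher thresholds can only keep fewer genes
        r = pearson([computed[g] for g in active],
                    [published[g] for g in active])
        if not math.isnan(r) and (best is None or r > best[1]):
            best = (pct, r, active)
    return best


# Made-up toy data; min_genes is lowered so the example stays small.
computed = {"G1": 0.80, "G2": 0.60, "G3": 0.425, "G4": 0.225, "G5": 0.10}
published = {"G1": 0.85, "G2": 0.55, "G3": 0.45, "G4": 0.90, "G5": 0.50}
counts = {"G1": 30, "G2": 20, "G3": 15, "G4": 12, "G5": 10,
          "BG1": 1, "BG2": 1, "BG3": 2}
print(optimize_threshold(computed, published, counts, min_genes=3))
```

In the toy data, G5 falls below the background and is never considered, and G4 (whose computed score disagrees with the published one) is dropped once the threshold exceeds its score, which is what raises the correlation.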

Results of the optimized 20/20 rule in each data set
We optimized the 20/20 rule in four different cancer types: TCGA breast cancer, LUAD, LUSC and COAD. For each data set, we selected the baseline genes which validate best. We improved the correlation with the published scores by selecting the most representative baseline genes. As a result, we obtained correlation coefficients of around 0.9 for both ONG and TSG scores, which are greater than the ones obtained by applying the default 20/20 rule in each data set. For each of the four cancer types, the selected baseline genes are well known cancer genes that have been previously associated with that cancer type:
breast invasive carcinoma (BRCA), Fig. 3; the genes selected as the baseline (PIK3CA, KRAS, SF3B1, RET, ERBB2, TP53, MAP3K1, GATA3, CDH1, RUNX1, RB1, ARID1A) have been previously found as mutated in breast cancer 15,26-28 ;
LUAD, Fig. 4; the genes selected as the baseline (EGFR, BRAF, KRAS, PIK3CA, U2AF1, CTNNB1, ARID1A, RB1, NF1, SETD2, STK11, SMARCA4, ATM) have been previously found as mutated in LUAD 16 ;
LUSC, Fig. 5; the genes selected as the baseline (EGFR, PIK3CA, HRAS, NFE2L2, TP53, PTEN, RB1, TSC1, MLL2, SMAD4, CDKN2A, FUBP1) have been previously found as mutated in LUSC (except for FUBP1) 17 ;
COAD, Fig. 6; the genes selected as the baseline (PIK3CA, KRAS, BRAF, NRAS, AR, TP53, APC, SOX9, FAM123B, ARID1A, CASP8, RNF43) have been previously found as mutated in COAD. 18,29,30
Using the optimized frequency thresholds, we identified novel cancer type specific ONGs and tumor suppressors, which are available in Supplementary file 1. The correlation coefficients between the computed and the published scores, along with the optimized thresholds, are shown for each TCGA data set in Table 1.
Table 1 also provides information about the four TCGA data sets. Note that the TCGA data sets are much reduced in size compared to the entire COSMIC data that was used to test the default 20/20 rule. COSMIC v72 provides a total number of 292 571 samples and over four million variants, while each TCGA data set provides hundreds of patients and thousands of variants.
Moreover, we included in Table 1 a comparison between the driver genes obtained by the optimized 20/20 rule and MuSiC 6 or MutSig. 10 These two methods, combined with manual curation, were used by the TCGA papers to identify significantly mutated genes (MuSiC for BRCA 15 and MutSig for LUAD, 16 LUSC 17 and COAD 18 ). Both MuSiC and MutSig analyze the mutations of each gene to identify genes that were mutated more often than expected by chance, compared to background. They do not take into account the semantics of the mutation variants, such as gain-of-function or loss-of-function activity. Therefore, we do not expect to obtain identical results by the 20/20 rule compared to MuSiC or MutSig. In addition, the rule provides a metric to classify the driver genes into ONGs or tumor suppressors.

Gene drivers are cancer type specific
Figure 7 shows the number of active driver genes identified in each cancer type. Interestingly, most of the genes we found to be important ONGs or tumor suppressors are specific to each cancer type. However, we found the PIK3CA gene, which is known to be an active oncogene in many cancers, to present a high activity in all four cancer types. Moreover, EGFR is highly active in both lung cancer types (LUSC and LUAD), as expected. TP53 is a gene known to present both oncogenic and tumor suppressor activities, most of the time contributing to cancer as a tumor suppressor with loss-of-function. TP53 is an exception to the 20/20 rule because missense mutations in TP53 often cause its loss-of-function instead of gain-of-function as considered by the rule. This is also shown in Ref. 2, where the ONG score is higher than the TSG score for TP53. As expected, we found TP53 to have elevated scores in BRCA, LUSC and COAD. However, based on its tumor biology and the fact that it is a known exception to the 20/20 rule, we classified TP53 as a gene with loss-of-function activity. Based on the ONG/TSG score classification, we found seven oncogenes (PIK3CA, EGFR, KRAS, BRAF, KRTAP5-5, KRTAP9-1, KCNN3) and nine tumor suppressors with loss-of-function (RB1, ARID1A, SOX9, RNF43, JPH4, HLA-A, RASA1, ATAD5, TP53) to be active in more than one cancer type. All of the other identified driver genes may be cancer specific, as shown in Fig. 7.
Therefore, we identified driver genes by assessing the oncogenic and tumor suppressor activity within each data set separately, even under the limitation of the sample size. Optimizing the 20/20 rule within each data set plays an important role in identifying the most relevant cancer type specific oncogenes and tumor suppressors. Moreover, we found the most frequently mutated cancer pathways in each cancer type.
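The kind of overlap counting summarized in Fig. 7 can be illustrated with a short sketch. As input it uses only the baseline gene lists quoted in the results section, not the full driver lists from Supplementary file 1, so the output is the overlap among baseline genes only; the function name `shared_genes` is a hypothetical helper.

```python
from collections import Counter

# Baseline genes per cancer type, as listed in the results section.
baseline = {
    "BRCA": ["PIK3CA", "KRAS", "SF3B1", "RET", "ERBB2", "TP53", "MAP3K1",
             "GATA3", "CDH1", "RUNX1", "RB1", "ARID1A"],
    "LUAD": ["EGFR", "BRAF", "KRAS", "PIK3CA", "U2AF1", "CTNNB1", "ARID1A",
             "RB1", "NF1", "SETD2", "STK11", "SMARCA4", "ATM"],
    "LUSC": ["EGFR", "PIK3CA", "HRAS", "NFE2L2", "TP53", "PTEN", "RB1",
             "TSC1", "MLL2", "SMAD4", "CDKN2A", "FUBP1"],
    "COAD": ["PIK3CA", "KRAS", "BRAF", "NRAS", "AR", "TP53", "APC", "SOX9",
             "FAM123B", "ARID1A", "CASP8", "RNF43"],
}


def shared_genes(genes_by_cancer, min_types=2):
    """Map each gene found in at least `min_types` cancer types to its count."""
    counts = Counter(g for genes in genes_by_cancer.values() for g in set(genes))
    return {g: n for g, n in counts.items() if n >= min_types}


shared = shared_genes(baseline)
for gene in sorted(shared, key=lambda g: (-shared[g], g)):
    print(gene, shared[gene])
```

Genes absent from the returned mapping appear in a single list only, which corresponds to the cancer type specific portion of each set.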

Limitations of the Optimized 20/20 rule
The 20/20 rule proposed in Ref. 2 provides a way to classify genes into ONGs or tumor suppressors based on the frequency of their gain-of-function or loss-of-function variants. This method is a heuristic and it is based on the idea that the gain-of-function mutations generally include missense/in-frame indels, while the loss-of-function mutations generally include nonsense/nonstop/frame-shift indels. The rule has been shown to correctly classify most of the known ONGs and tumor suppressors (120 of 125 genes, 96%), with few exceptions such as TP53, NOTCH1, FBXW7, PAX5 and TRAF7. 2 To explain these exceptions, the authors in Ref. 2 point out that it is less likely for ONGs to harbor stop codons. Therefore, if a gene has a TSG score > 5% and the ONG score is also elevated (ONG score > 20%), it is more likely to be a tumor suppressor than an ONG. This assumption is based on the fact that ONGs rarely harbor stop codons and therefore some of the missense mutations may actually be loss-of-function mutations. This idea could be used for the optimized rule as well. In the case of elevated values for both ONG and TSG scores, the gene is more likely to be a tumor suppressor. In this paper, we provided both ONG and TSG scores for each gene. The final status of a gene could be determined based on the ONG/TSG values combined with further assessment of certain types of mutations or other knowledge from literature. However, besides the known tumor suppressor TP53, we have not found other exceptions to the optimized rule in the TCGA data. Most genes had either a high ONG score and a low TSG score, or a high TSG score and a low ONG score (Supplementary file 1). NOTCH1, PAX5 and TRAF7 have not been selected as driver genes in any of the four TCGA data sets; FBXW7 was detected in LUAD and it was correctly classified as a tumor suppressor (TSG = 0.6 and ONG = 0). By improving the thresholds of the 20/20 rule in each cancer type, we selected those ONGs and tumor suppressors based on previously validated data. It is possible that some true ONGs or tumor suppressors may be missed by our approach because it relies on prior knowledge. However, the optimized rule increases confidence in the newly identified oncogenes and tumor suppressors, by tuning the selection on previously validated genes.
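The tie-break described above (an elevated ONG score together with a TSG score above 5% points to a tumor suppressor) can be written as a small decision function. This is a sketch under assumptions: the function name, the label strings and the order of the checks are illustrative, and the per-cancer optimized thresholds would replace the 20% defaults.

```python
def resolve_status(ong, tsg, ong_threshold=0.20, tsg_threshold=0.20,
                   tsg_override=0.05):
    """Assign a final status from the ONG and TSG scores (fractions in [0, 1])."""
    # Oncogenes rarely harbor stop codons, so an elevated ONG score combined
    # with even a modest TSG score is read as loss-of-function.
    if ong > ong_threshold and tsg > tsg_override:
        return "tumor suppressor"
    if ong > ong_threshold:
        return "oncogene"
    if tsg > tsg_threshold:
        return "tumor suppressor"
    return "unclassified"


# FBXW7 in LUAD, with the scores reported in the text (TSG = 0.6, ONG = 0).
print(resolve_status(ong=0.0, tsg=0.6))
```

A gene such as TP53, with both scores elevated, is resolved as a tumor suppressor by the first branch, in line with the manual classification described in the text.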

Conclusions
In this paper, we evaluate the 20/20 rule for classifying ONGs and tumor suppressors (TSGs) proposed in Ref. 2, and optimize it for limited size data sets of specific cancer types. To the authors' knowledge, there are no other general methods in the literature which are able to classify genes into ONGs and tumor suppressors. Most of the existing methods are focused on identifying driver mutations but they do not provide an overall ONG/TSG status for each gene. Although the 20/20 rule is a general approximation of the absolute gene status, the authors in Ref. 2 demonstrate its value and validate it for most known driver genes. The impact of this approach is to identify novel ONGs and tumor suppressors that can be used to further understand cancer mechanisms and improve targeted therapies.
Therefore, in this paper we propose a new way of optimizing the 20/20 rule by comparing the results with the baseline gene scores. The scores of well known cancer drivers were validated using a large sample size from the COSMIC database, therefore these genes were used as baseline. We show that the best ONG and TSG frequency thresholds are not always equal to 20%, as proposed by Ref. 2, and may depend on the cancer type and data size. We conclude that the 20/20 rule validates in reduced size data sets if it is properly tuned.
We validated our approach by comparing the active baseline driver genes in each data set with known cancer genes from the literature. By using the optimized rule, new ONGs and tumor suppressors were identified in each of the four cancer types (breast cancer, LUAD, LUSC and COAD). Moreover, we found that these driver genes are implicated in cancer type specific pathways. This explains part of the heterogeneity between cancer types, showing that different mutations may lead to cancer development through different molecular mechanisms.
We plan to further investigate the activity of these genes and the pathway-level mechanisms by which they contribute to the initiation and proliferation of cancer. In vitro experiments would be necessary to confirm the activity of these genes in each cancer type.

Fig. 1. Computation of ONG/TSG scores based on the method proposed in Ref. 2. Missense or in-frame indels are represented in green, while nonstop, nonsense and frame-shift indels are represented in blue. (a) Example of an oncogene: the frequency of gain-of-function mutations (recurrent missense/in-frame indels) is greater than 20% (ONG > 20%), while the frequency of loss-of-function mutations (nonstop/nonsense/frame-shift indels) is lower than 20% (TSG < 20%). (b) Example of a tumor suppressor: the frequency of loss-of-function mutations (nonstop/nonsense/frame-shift indels) is greater than 20% (TSG > 20%), while the frequency of gain-of-function mutations (recurrent missense/in-frame indels) is lower than 20% (ONG < 20%).

Fig. 3. Comparison between the optimized rule and the default 20/20 rule in TCGA BRCA data set for the cancer genes in Ref. 2 (green illustrates a higher ONG score, while blue, a higher TSG score). Note that we exclude from the analysis those genes with a lower number of variants than the background (minimum six variants in this case). The correlation between the computed scores and the published scores is higher for the optimized rule. (a) Loss-of-function (TSG) gene scores computed in TCGA BRCA by the optimized rule. (b) Loss-of-function (TSG) gene scores computed in TCGA BRCA by the default 20/20 rule. (c) Gain-of-function (ONG) gene scores computed in TCGA BRCA by the optimized rule. (d) Gain-of-function (ONG) gene scores computed in TCGA BRCA by the default 20/20 rule.

Fig. 4. Comparison between the optimized rule and the default 20/20 rule in TCGA LUAD data set for the cancer genes in Ref. 2 (green illustrates a higher ONG score, while blue, a higher TSG score). Note that we exclude from the analysis those genes with a lower number of variants than the background (minimum five variants in this case). The correlation between the computed scores and the published scores is higher for the optimized rule. (a) Loss-of-function (TSG) gene scores computed in TCGA LUAD by the optimized rule. (b) Loss-of-function (TSG) gene scores computed in TCGA LUAD by the default 20/20 rule. (c) Gain-of-function (ONG) gene scores computed in TCGA LUAD by the optimized rule. (d) Gain-of-function (ONG) gene scores computed in TCGA LUAD by the default 20/20 rule.

Fig. 5. Comparison between the optimized rule and the default 20/20 rule in TCGA LUSC data set for the cancer genes in Ref. 2 (green illustrates a higher ONG score, while blue, a higher TSG score). Note that we exclude from the analysis those genes with a lower number of variants than the background (minimum five variants in this case). The correlation between the computed scores and the published scores is higher for the optimized rule. (a) Loss-of-function (TSG) gene scores computed in TCGA LUSC by the optimized rule. (b) Loss-of-function (TSG) gene scores computed in TCGA LUSC by the default 20/20 rule. (c) Gain-of-function (ONG) gene scores computed in TCGA LUSC by the optimized rule. (d) Gain-of-function (ONG) gene scores computed in TCGA LUSC by the default 20/20 rule.

Fig. 6. Comparison between the optimized rule and the default 20/20 rule in TCGA COAD data set for the cancer genes in Ref. 2 (green illustrates a higher ONG score, while blue, a higher TSG score). Note that we exclude from the analysis those genes with a lower number of variants than the background (minimum seven variants in this case). The correlation between the computed scores and the published scores is higher for the optimized rule. (a) Loss-of-function (TSG) gene scores computed in TCGA COAD by the optimized rule. (b) Loss-of-function (TSG) gene scores computed in TCGA COAD by the default 20/20 rule. (c) Gain-of-function (ONG) gene scores computed in TCGA COAD by the optimized rule. (d) Gain-of-function (ONG) gene scores computed in TCGA COAD by the default 20/20 rule.

Fig. 7. The overlap of the cancer genes between the four cancer types. (a) Very few active ONGs are present in multiple cancer types. Most active ONGs are found to be cancer type specific. PIK3CA is the only active oncogene in all four cancer types. (b) Very few tumor suppressors present loss-of-function in multiple cancer types. Most tumor suppressors with loss-of-function are found to be cancer type specific. Note that TP53 is an exception to the 20/20 rule because most missense mutations in TP53 produce the gene's loss-of-function, not gain-of-function as considered by the rule. Therefore, TP53 is classified as a gene with loss-of-function activity.