IntOGen plus - Release notes
In this release we have increased the number of cohorts (and samples), made small updates to the IntOGen pipeline and updated key third-party dependencies. As a result, the list of driver genes has increased modestly (N=618). The vast majority of previous driver genes are still in this list, which also includes previously unidentified drivers, while a few others (mostly identified in a single cohort in the previous release) now fail to pass the thresholds set in the pipeline.
Here we describe in detail all changes carried out in this release: the new cohorts and samples incorporated, the updates to the pipeline, and how these changes affect the discovery of driver genes.
More cohorts, more samples
In this new release we have incorporated 48 new cohorts into the analysis (Figure 1), representing 4943 new samples (Table 1 and Table 2).
Some previously included cohorts have increased the number of samples they contribute to the analysis. This is the case for cohorts sequenced by the Hartwig Medical Foundation (release downloaded on 21/10/2021) and ICGC (latest release, from 2019). After careful examination, we have excluded an ICGC cohort (LINC_JP_WGS_ICGC) which was completely included within the HC_PCAWG cohort (now renamed PCAWG_WGS_LIVER_HCC) (Figure 1).
New cohorts have been contributed by some already included data sources, such as TARGET, from which we obtained 2 new cohorts with Neuroblastoma (NB) and Osteosarcoma (OS) samples.
Due to changes in the sample annotation of the TCGA_WXS_AML cohort, 95 more samples of this cohort were processed by the pipeline.
To standardize the cohort nomenclature, cohort names have been updated, applying, in most cases, the following pattern:
| Cancer type | Samples release 2023 | Samples release 2020 | Number of new samples | Increase (%) |
|---|---|---|---|---|
| Lung Neuroendocrine Tumor | 30 | 10 | 20 | 200.00% |
| Cutaneous Squamous Cell Carcinoma | 128 | 49 | 79 | 161.22% |
| Renal Clear Cell Carcinoma | 948 | 620 | 328 | 52.90% |
| Non-Small Cell Lung Cancer | 564 | 384 | 180 | 46.88% |
| Cervical Squamous Cell Carcinoma | 467 | 341 | 126 | 36.95% |
| Well-Differentiated Thyroid Cancer | 852 | 684 | 168 | 24.56% |
| Lung Squamous Cell Carcinoma | 703 | 584 | 119 | 20.38% |
| Data source | Samples release 2023 | Samples release 2020 | Number of new samples |
|---|---|---|---|
Several updates have been introduced in the IntOGen pipeline:
- bgparsers replaced with OpenVariant:
- Seed to run OncodriveFML, OncodriveCLUSTL, smRegions and dNdScv added:
- Ensembl VEP updated from v92 to v101:
- MSKCC oncotree updated to version 2021:
We have developed OpenVariant, a new Python package to parse the input files. It annotates all SNVs and indels from the mutation data files and transforms the data into a standardized format, guided by a YAML file prepared by the user.
Importantly, the previous bgparsers tool had a bug in the annotation of indels, leading to some indels not being processed. This bug has been corrected in OpenVariant. As a consequence, there are, overall, more mutations annotated per sample.
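To illustrate what transforming heterogeneous mutation files into a standardized format means, here is a minimal sketch using only the Python standard library. It does not use the OpenVariant API: the `COLUMN_MAP` dictionary, the `standardize` function, the column names and the toy rows are all hypothetical and stand in for the user-written YAML annotation and the real parser.

```python
import csv
import io

# Hypothetical mapping from one cohort's column names to standardized field names.
# In the pipeline, this kind of information is supplied in a user-written YAML file.
COLUMN_MAP = {
    "Chromosome": "CHROMOSOME",
    "Start_Position": "POSITION",
    "Reference_Allele": "REF",
    "Tumor_Seq_Allele2": "ALT",
    "Tumor_Sample_Barcode": "SAMPLE",
}


def standardize(handle, column_map, delimiter="\t"):
    """Yield one dict per input row, keeping only mapped columns under standardized names."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        yield {new: row[old] for old, new in column_map.items()}


# Toy input with one SNV and one deletion (illustrative values only).
raw = io.StringIO(
    "Chromosome\tStart_Position\tReference_Allele\tTumor_Seq_Allele2\tTumor_Sample_Barcode\n"
    "1\t1000\tC\tT\tS1\n"
    "2\t2000\tC\t-\tS2\n"
)
records = list(standardize(raw, COLUMN_MAP))
for record in records:
    print(record)
```

Because the mapping is data rather than code, supporting a new input layout in a scheme like this only requires a new mapping, not a new parser.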
We have added a seed to OncodriveFML, OncodriveCLUSTL, smRegions and dNdScv to reduce the variability in the calculation of gene p-values across IntOGen runs. Due to difficulties in the implementation, we could not add it to the remaining three methods run by the pipeline.
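The effect of a seed on a sampling-based p-value can be shown with a generic Monte Carlo sketch. This is not the implementation used by any of the methods above; `empirical_pvalue`, its parameters and the toy background scores are hypothetical and only illustrate why fixing the seed makes repeated runs agree.

```python
import random


def empirical_pvalue(observed, background, n_draws=1000, sample_size=5, seed=None):
    """Estimate P(mean of a random background sample >= observed) by Monte Carlo sampling."""
    rng = random.Random(seed)  # private generator; seed=None gives a different stream per call
    hits = 0
    for _ in range(n_draws):
        draw = [rng.choice(background) for _ in range(sample_size)]
        if sum(draw) / sample_size >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_draws + 1)


background_scores = [0.1, 0.2, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9]

# Same seed: the two estimates are identical.
p1 = empirical_pvalue(0.7, background_scores, seed=42)
p2 = empirical_pvalue(0.7, background_scores, seed=42)
print(p1 == p2)

# No seed: estimates can differ slightly from call to call.
print(empirical_pvalue(0.7, background_scores), empirical_pvalue(0.7, background_scores))
```

With an unseeded generator, a gene whose p-value sits near a significance threshold can fall on either side of it from run to run, which is the variability the seed is meant to reduce.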
We updated the Ensembl VEP version to 101. This version may contain important gene annotation differences, and therefore the number of mutations annotated may change substantially from the previous release.
We have updated the oncotree to the latest version released by MSKCC (November 2021). As a result, the number of cancer type nodes has increased from 82 to 889.
We have applied some modifications to the MSKCC Oncotree:
NON_SOLID node added after
SOLID node added after TISSUE and before the rest of tissues
ALL node added after
We have updated the Cancer Gene Census (CGC) gene list to v95. This version distinguishes between Tier1 and Tier2 CGC genes. In the IntOGen pipeline, both tiers are merged into a single list of CGC genes.
We have unified all gene SYMBOLs according to the latest version of the HUGO Gene Nomenclature Committee.
The CancerMine database, which gathers information on literature evidence for cancer genes, was updated.
CADD, a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome, was updated from version 1.4 to version 1.6.
In the previous release the pipeline used a list of ‘Known Artifacts’ and a separate blacklist of genes to rule out possible false positives in the post-processing. In this release, the blacklist genes are considered ‘Known Artifacts’ and, together with the ‘Warning artifacts’ (a list of genes suspected to be artifacts), are all excluded in the post-processing step. Moreover, the olfactory receptor gene list from the HORDE database used as a filter was updated to Build #44c (30 July 2019).
In this new release we include a list of unfiltered drivers as one of the outputs of the IntOGen pipeline. This file annotates all the filters applied in the post-processing step: from the output of the combination to the final set of driver genes.
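The idea of keeping every candidate and recording which filter removed it can be sketched as follows. This is an illustration only: the function name, the filter labels, the record layout and the placeholder gene names are hypothetical and do not reflect the actual format of the unfiltered drivers file.

```python
def annotate_candidates(candidates, known_artifacts, warning_artifacts, olfactory_receptors):
    """Return one record per candidate gene with the filter that excluded it, or 'PASS'."""
    annotated = []
    for gene in candidates:
        if gene in known_artifacts:
            status = "known_artifact"
        elif gene in warning_artifacts:
            status = "warning_artifact"
        elif gene in olfactory_receptors:
            status = "olfactory_receptor"
        else:
            status = "PASS"
        annotated.append({"gene": gene, "filter": status})
    return annotated


# Placeholder gene names, not real annotations.
unfiltered = annotate_candidates(
    candidates=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    known_artifacts={"GENE_B"},
    warning_artifacts={"GENE_C"},
    olfactory_receptors={"GENE_D"},
)

# The final driver list is the subset of records that passed every filter.
drivers = [record["gene"] for record in unfiltered if record["filter"] == "PASS"]
print(unfiltered)
print(drivers)
```

Keeping the excluded genes in the output, rather than dropping them silently, makes it possible to trace why a given gene is absent from the final list.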
Along with the IntOGen output, we have included all the files needed to run BoostDM. Specifically, two steps were added to the pipeline:
- DriversSaturation: all possible mutations for a given gene are generated and returned as independent annotated files per gene.
- filterMNVs: one file containing all positions detected as MNVs.
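As a rough illustration of what generating all possible mutations involves, the sketch below enumerates every single-nucleotide substitution over a short sequence. It is not the DriversSaturation code: the `saturate` function, the restriction to substitutions, the coordinate handling and the toy sequence are assumptions, and the real step also annotates each mutation.

```python
BASES = "ACGT"


def saturate(sequence, start=1):
    """Yield (position, ref, alt) for every possible single-nucleotide substitution."""
    for offset, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            continue  # skip ambiguous bases such as N
        for alt in BASES:
            if alt != ref:
                yield (start + offset, ref, alt)


# A 3-base toy sequence starting at hypothetical coordinate 100.
mutations = list(saturate("ATG", start=100))
print(len(mutations))  # 3 positions x 3 alternative bases = 9
for mutation in mutations:
    print(mutation)
```

Each position contributes three substitutions, so the number of records grows linearly with the length of the sequence being saturated.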
In the new release, several aspects have improved: there are more cohorts and tumor types in which a gene is identified as a driver, and more drivers and cohorts per tumor type (Figure 2).
The new release contains a total of 618 driver genes, 50 more than the previous release. There are 7 new tumor types: two of them are from two cohorts with re-assigned tumor type
HARTWIG_PANCREAS and from
and five of them are new;
PSCC (Penile Squamous Cell Carcinoma,
MCC (Merkel Cell Carcinoma,
COAD (Colon Adenocarcinoma,
MLYM (Malignant Lymphoma,
| IntOGen run | Total number of mutations | Total number of samples | Number of cancer types | Number of cohorts | Number of drivers |
|---|---|---|---|---|---|
Among these 618 driver genes, 133 are new, while 82 genes included in the previous release are no longer identified as drivers (Figure 3). This is due to several reasons:
- Different mutation annotations:
  - Increase in the number of processed indels (see above)
  - New Ensembl gene annotation, where the annotation of mutations per gene may have changed with respect to the previous release
- Different bona fide cancer gene / artifact lists:
  - CGC genes: the combination step of the pipeline depends on the list of bona fide cancer genes obtained from CGC.
  - CancerMine database: this database, which annotates literature-gathered information, is used to provide evidence of the involvement of genes in tumorigenesis.
  - Lists of artifacts: in this release we filter out, in the post-processing step, all known artifacts and suspect/warning artifacts. The previous release filtered out only the known artifacts. These are annotated accordingly in the unfiltered drivers file.
- The final ranking q-value list:
  - Implementation of a seed: every time we run IntOGen, the final ranking of q-values may change, due to the heuristic calculation of the background model in different methods. This means that genes that fall close to the driver threshold may switch between driver and passenger status across runs. In this release, we included a seed option for 4 methods (those that allowed it) to reduce the variability across runs. This problem is not fully solved, as there are still 3 methods where the background calculation cannot be fixed with a seed.
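The counts of new and lost drivers correspond to set differences between the driver lists of two releases. A minimal sketch, with a hypothetical `compare_releases` helper and placeholder gene names rather than real release data:

```python
def compare_releases(previous, current):
    """Split driver genes into new, lost and retained between two releases."""
    previous, current = set(previous), set(current)
    return {
        "new": sorted(current - previous),       # drivers only in the current release
        "lost": sorted(previous - current),      # drivers only in the previous release
        "retained": sorted(current & previous),  # drivers in both releases
    }


# Placeholder gene lists for illustration.
previous_drivers = ["GENE_A", "GENE_B", "GENE_C"]
current_drivers = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]

summary = compare_releases(previous_drivers, current_drivers)
print(summary)
```

A comparison like this assumes both lists use the same gene symbols, so it should be run after symbols have been unified.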
We provide a table summarizing the reasons why genes previously identified as drivers do not appear in the current release (See here). For further inquiries about why a driver does not appear in this release, please send a request to (firstname.lastname@example.org) and we will try to provide a more detailed explanation.