IntOGen plus - Release notes

What's new:
Release v2023.05.31

In this release we have increased the number of cohorts (and samples), made small updates to the IntOGen pipeline and updated key third-party dependencies. As a result, the list of driver genes has increased modestly (N=618). The vast majority of previous driver genes are still in this list, which also includes previously unidentified drivers, while a few others (mostly identified in a single cohort in the previous release) now fail to pass the thresholds set in the pipeline.

Here we describe in detail all changes carried out in this release: the new cohorts and samples incorporated, the updates to the pipeline, and how these changes affect the discovery of driver genes.

More cohorts, more samples

In this new release we have incorporated 48 new cohorts into the analysis (Figure 1), representing 4,943 new samples (Table 1 and Table 2).

Some previously included cohorts have increased the number of samples they contribute to the analysis. This is the case of cohorts sequenced by the Hartwig Medical Foundation (release downloaded on 21/10/2021) and ICGC (latest release from 2019 downloaded). After careful examination, we have excluded an ICGC cohort (LINC_JP_WGS_ICGC) which was completely included within the HC_PCAWG cohort (now re-named PCAWG_WGS_LIVER_HCC) (Figure 1).

New cohorts have been contributed by some already included data sources, such as TARGET, from which we obtained 2 new cohorts with Neuroblastoma (NB) and Osteosarcoma (OS) samples.

We also included datasets from new genomic data sources, such as CPTAC and CGCI.

Due to changes in the sample annotation of the TCGA_WXS_AML cohort, 95 more samples of this cohort were processed by the pipeline.

To standardize the cohort nomenclature, cohort names have been updated, applying in most cases the following pattern:

{PROJECT}_{PLATFORM}_{ONCOTREE_CODE}_{OTHER_INFO}_{YEAR}

Examples: CPTAC_WXS_BRCA_2020, ICGC_WXS_AML_LAML_CN_2019, HARTWIG_WGS_NSCLC_2020
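
As an illustration of the naming pattern above, here is a minimal Python sketch that splits a cohort name into its fields. It is not part of the IntOGen pipeline: the `CohortName` class and `parse_cohort_name` function are hypothetical. It assumes that the first three underscore-separated tokens are PROJECT, PLATFORM and ONCOTREE_CODE, that a trailing four-digit token is the YEAR, and that anything in between is OTHER_INFO. Names that do not follow the pattern (it is applied "in most cases") may parse incorrectly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CohortName:
    project: str
    platform: str
    oncotree_code: str
    other_info: Optional[str]  # None when the name has no extra tokens
    year: Optional[int]        # None when the name has no trailing year


def parse_cohort_name(name: str) -> CohortName:
    """Split a cohort name following {PROJECT}_{PLATFORM}_{ONCOTREE_CODE}_{OTHER_INFO}_{YEAR}."""
    parts = name.split("_")
    if len(parts) < 3:
        raise ValueError(f"Cohort name has fewer than 3 fields: {name!r}")
    project, platform, oncotree_code = parts[:3]
    rest = parts[3:]
    year = None
    # Assumption: a trailing 4-digit token is the year.
    if rest and len(rest[-1]) == 4 and rest[-1].isdigit():
        year = int(rest.pop())
    # Whatever remains between the Oncotree code and the year is OTHER_INFO.
    other_info = "_".join(rest) or None
    return CohortName(project, platform, oncotree_code, other_info, year)


for example in ("CPTAC_WXS_BRCA_2020", "ICGC_WXS_AML_LAML_CN_2019", "HARTWIG_WGS_NSCLC_2020"):
    print(parse_cohort_name(example))
```

Under these assumptions, ICGC_WXS_AML_LAML_CN_2019 yields OTHER_INFO "LAML_CN", while the other two examples have no OTHER_INFO field.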

Table 1 - Increase in the number of samples in many tumor types. This table shows 16 tumor types that have increased more than 20% in the number of samples.
Cancer type                         Samples release 2023  Samples release 2020  Number of new samples  Increase (%)
Burkitt Lymphoma                    87                    15                    72                     480.00%
Lung Neuroendocrine Tumor           30                    10                    20                     200.00%
Pleural Mesothelioma                337                   117                   220                    188.03%
Cutaneous Squamous Cell Carcinoma   128                   49                    79                     161.22%
Nasopharyngeal Carcinoma            221                   103                   118                    114.56%
Angiosarcoma                        71                    34                    37                     108.82%
Lymphoid Neoplasm                   48                    25                    23                     92.00%
Lung Adenocarcinoma                 1,215                 756                   459                    60.71%
Renal Clear Cell Carcinoma          948                   620                   328                    52.90%
Osteosarcoma                        553                   363                   190                    52.34%
Non-Small Cell Lung Cancer          564                   384                   180                    46.88%
Cervical Squamous Cell Carcinoma    467                   341                   126                    36.95%
Well-Differentiated Thyroid Cancer  852                   684                   168                    24.56%
Stomach Adenocarcinoma              858                   707                   151                    21.36%
Esophageal Adenocarcinoma           1,139                 945                   194                    20.53%
Lung Squamous Cell Carcinoma        703                   584                   119                    20.38%
Others                              24,798                22,339                2,459                  11.01%
Total                               33,019                28,076                4,943                  17.61%

Table 2 - Number of new samples per data source. STJUDE, PCAWG and PEDCBIOP have the same number of samples as in the previous release. This new release has more samples on TCGA, TARGET, HARTWIG, CBIOP, ICGC and other projects. Two new sequencing projects are added: CPTAC and CGCI.
Data source  Samples release 2023  Samples release 2020  Number of new samples
STJUDE       622                   622                   0
PCAWG        2,554                 2,554                 0
PEDCBIOP     1,087                 1,087                 0
TCGA         10,105                10,010                95
TARGET       365                   246                   119
CGCI         192                   0                     192
HARTWIG      4,386                 3,742                 644
OTHER        2,915                 2,257                 658
CBIOP        4,566                 3,570                 996
CPTAC        1,076                 0                     1,076
ICGC         5,151                 3,988                 1,163
Total        33,019                28,076                4,943
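
The "Number of new samples" and "Increase (%)" columns of Table 1, and the totals of Table 2, can be reproduced with simple arithmetic. The following Python sketch is only an illustration: the function name `sample_increase` is hypothetical, and the figures are copied from the tables above. It assumes the percentage is computed relative to the 2020 release and rounded to two decimals.

```python
def sample_increase(samples_2023, samples_2020):
    """Return (new samples, percent increase relative to the 2020 release).

    The percentage is None when the data source had no samples in 2020,
    since a relative increase is undefined in that case.
    """
    new_samples = samples_2023 - samples_2020
    if samples_2020 == 0:
        return new_samples, None
    return new_samples, round(100 * new_samples / samples_2020, 2)


# Two rows of Table 1: (samples in release 2023, samples in release 2020).
print(sample_increase(87, 15))        # Burkitt Lymphoma
print(sample_increase(33019, 28076))  # Total

# Table 2: new samples per data source; they should add up to the total.
new_by_source = {
    "STJUDE": 0, "PCAWG": 0, "PEDCBIOP": 0, "TCGA": 95, "TARGET": 119,
    "CGCI": 192, "HARTWIG": 644, "OTHER": 658, "CBIOP": 996,
    "CPTAC": 1076, "ICGC": 1163,
}
print(sum(new_by_source.values()))
```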

Figure 1 - New cohorts in the IntOGen release 2023. There are 48 new cohorts and 1 excluded cohort (LINC_JP_WGS_ICGC).

Updated pipeline

Several updates have been introduced in the IntOGen pipeline:

  1. bgparsers replaced with OpenVariant:
     We have developed OpenVariant, a new Python package to parse the input files. This new method annotates all SNVs and indels from the mutation data files and transforms the data into a standardized format, by reading a YAML file prepared by the user.

     Importantly, the previous bgparsers tool had a bug in the annotation of indels, leading to some indels not being processed. This bug has been corrected in OpenVariant. As a consequence, there are, overall, more mutations annotated per sample.

  2. Seed to run OncodriveFML, OncodriveCLUSTL, smRegions and dNdScv added:
     We have added the use of a seed in these methods to reduce the variability in the calculation of the p-value of genes across IntOGen runs. Due to difficulties in the implementation, we could not add it to the remaining three methods run by the pipeline.

  3. Ensembl VEP updated from v92 to v101:
     We updated the Ensembl VEP version to 101. This version may contain important gene annotation differences, and therefore the number of mutations annotated may change substantially from the previous release.

  4. MSKCC oncotree updated to version 2021:
     We have updated the oncotree to the latest version released by MSKCC (November 2021). As a result, the number of cancer type nodes has increased from 82 to 889.

     We have applied some modifications to the MSKCC Oncotree:

     • NON_SOLID node added after TISSUE and before MYELOID and LYMPHOID
     • SOLID node added after TISSUE and before the rest of tissues
     • ALL node added after LNM and before TLL and BLL

  5. CGC genes updated to v95:
     We have updated the Cancer Gene Census gene list to v95. This version distinguishes between Tier1 and Tier2 CGC genes. In the IntOGen pipeline, both tiers are considered as a single list of CGC genes.

  6. Hugo Symbols updated to the 20/01/2022 release:
     We have unified all gene SYMBOLs according to the latest version of the HUGO Gene Nomenclature Committee.

  7. CancerMine database updated to 07/12/2021 release:
     The CancerMine database, which gathers information on literature evidence for cancer genes, was updated.

  8. CADD update:
     CADD, a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome, was updated from version 1.4 to version 1.6.

  9. List of frequent artifacts updated:
     In the previous release the pipeline used a list of ‘Known Artifacts’ and a separate blacklist of genes to rule out possible false positives in the post-processing. In this release, the blacklist genes are considered ‘Known Artifacts’ and, together with the ‘Warning artifacts’ (a list of genes suspected to be artifacts), are all excluded in the post-processing step. Moreover, the olfactory receptor gene list from the HORDE database used as a filter was updated to Build #44c (30 July 2019).

  10. File with unfiltered drivers included:
      In this new release we include a list of unfiltered drivers as one of the outputs of the IntOGen pipeline. This file annotates all the filters applied in the post-processing step: from the output of the combination to the final set of driver genes. [unfiltered_drivers.tsv]

  11. BoostDM connection prepared:
      We have included, along with the IntOGen output, all the files needed to run BoostDM. Specifically, two steps were added to the pipeline:

      • DriversSaturation: all possible mutations for a given gene are generated and returned as independent annotated files per gene. [{gene}.vep.gz]
      • filterMNVs: one file containing all positions detected as MNVs. [mnvs.tsv.gz]
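
The three modifications to the MSKCC Oncotree described above (the NON_SOLID, SOLID and ALL nodes) each amount to inserting an intermediate node between a parent and some of its children. The Python sketch below illustrates this on a toy subset of the tree. It is not the pipeline's code: the child-to-parent dictionary representation, the `insert_between` and `lineage` helpers, and the choice of LUNG and BREAST as example solid tissues are all assumptions made for illustration.

```python
def insert_between(parent_of, new_node, parent, children):
    """Insert new_node under parent and re-attach the given children to it."""
    for child in children:
        if parent_of.get(child) != parent:
            raise ValueError(f"{child} is not a direct child of {parent}")
    parent_of[new_node] = parent
    for child in children:
        parent_of[child] = new_node


def lineage(parent_of, node):
    """Return the path from the root down to node."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path[::-1]


# Toy subset of the tree, stored as child -> parent (TISSUE is the root).
parent_of = {
    "MYELOID": "TISSUE",
    "LYMPHOID": "TISSUE",
    "LUNG": "TISSUE",
    "BREAST": "TISSUE",
    "LNM": "LYMPHOID",
    "TLL": "LNM",
    "BLL": "LNM",
}

# NON_SOLID: after TISSUE, before MYELOID and LYMPHOID.
insert_between(parent_of, "NON_SOLID", "TISSUE", ["MYELOID", "LYMPHOID"])

# SOLID: after TISSUE, before every remaining tissue.
remaining = [n for n, p in parent_of.items() if p == "TISSUE" and n != "NON_SOLID"]
insert_between(parent_of, "SOLID", "TISSUE", remaining)

# ALL: after LNM, before TLL and BLL.
insert_between(parent_of, "ALL", "LNM", ["TLL", "BLL"])

print(" > ".join(lineage(parent_of, "BLL")))
print(" > ".join(lineage(parent_of, "LUNG")))
```

In this toy tree, BLL ends up under TISSUE > NON_SOLID > LYMPHOID > LNM > ALL, and LUNG under TISSUE > SOLID.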

New results

In the new release, several aspects have been improved: there are more cohorts and tumor types in which a gene is identified as a driver, and more drivers and cohorts per tumor type (Figure 2).

Figure 2 - Release 2020 vs Release 2023 - A. Number of cohorts in which each gene is called driver (first 80 driver genes). Overall, the number of cohorts in which a gene is a driver has increased, with TP53 showing the largest increase.
Figure 2 - Release 2020 vs Release 2023 - B. Number of cohorts per cancer type (first 80 cancer types).
Figure 2 - Release 2020 vs Release 2023 - C. Number of drivers per cancer type (first 80 cancer types).
Figure 2 - Release 2020 vs Release 2023 - D. Number of cancer types in which each gene is called driver (first 80 driver genes).

The new release contains a total of 618 driver genes, 50 more than the previous release. There are 7 new tumor types: two of them come from cohorts whose tumor type was re-assigned (from PAAD to PANCREAS in HARTWIG_PANCREAS and from RCCC to RCC in HARTWIG_KIDNEY_RENAL_CELL) and five of them are new, including PSCC (Penile Squamous Cell Carcinoma, HARTWIG_WGS_PSCC_2020), MCC (Merkel Cell Carcinoma, HARTWIG_WGS_MCC_2020), COAD (Colon Adenocarcinoma, CPTAC_WXS_COAD_2020) and MLYM (Malignant Lymphoma, ICGC_WGS_MLYM_MALY_DE_PED_2019).

Table 3 - Comparison of the counts between releases. This new release contains ~50 million more mutations, ~5,000 more samples, 7 more tumor types and 50 more drivers.
IntOGen run   Total number of mutations  Total number of samples  Number of cancer types  Number of cohorts  Number of drivers
Release 2023  252,486,809                33,019                   73                      266                618
Release 2020  203,003,747                28,076                   66                      221                568

Figure 3 - New drivers appear in this new release, but many other driver genes are not called in this new release

Among these 618 driver genes, 133 are new, while 82 included in the previous release are no longer identified as drivers (Figure 3). This is due to several reasons:

  • Different mutation annotations:
    • Increase in the number of processed indels (see above)
    • New Ensembl gene annotation, where annotation of mutations per gene may have changed with respect to the previous release
  • Different bona fide cancer gene / artifacts lists:
    • CGC genes: The step of combination in the pipeline depends on the list of bona fide cancer genes obtained from CGC.
    • CancerMine database: This database, which annotates literature-gathered information, is used to provide evidence of the involvement of genes in tumorigenesis.
    • Lists of artifacts: In this release we filter out, in the post-processing step, all known artifacts and suspect/warning artifacts. The previous release filtered out only the known artifacts. These are annotated accordingly in the unfiltered drivers file.
  • The final ranking q-value list:
    • Implementation of a seed: Every time we run IntOGen, the final ranking of q-values may change, due to the heuristic calculation of the background model in different methods. This means that genes that usually fall close to the driver threshold may switch between driver and passenger status across runs. In this release, we included a seed option for 4 methods (those that allowed it) to reduce the variability across runs. This problem is still not solved, as there are still 3 methods where the background calculation cannot be fixed with a seed.
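
To illustrate why a seed reduces run-to-run variability, the following Python sketch computes an empirical p-value from a randomly sampled background. It is a generic illustration, not the implementation used by OncodriveFML, OncodriveCLUSTL, smRegions or dNdScv: the function `empirical_pvalue`, its parameters and the toy background scores are all assumptions.

```python
import random


def empirical_pvalue(observed_mean, background_scores, n_mutations,
                     n_simulations=1000, seed=None):
    """Empirical p-value of an observed mean score against random samples.

    With seed=None the random generator is initialised from system entropy,
    so repeated calls can return slightly different p-values. With a fixed
    seed, the same simulations are drawn every time.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_simulations):
        # Draw a random set of scores of the same size as the observed one.
        sample = rng.choices(background_scores, k=n_mutations)
        if sum(sample) / n_mutations >= observed_mean:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_simulations + 1)


# Toy background: scores evenly spread between 0 and 0.99.
background = [i / 100 for i in range(100)]

# Unseeded runs may differ from one another.
print(empirical_pvalue(0.7, background, n_mutations=5))
print(empirical_pvalue(0.7, background, n_mutations=5))

# Seeded runs are identical.
p_first = empirical_pvalue(0.7, background, n_mutations=5, seed=42)
p_second = empirical_pvalue(0.7, background, n_mutations=5, seed=42)
print(p_first == p_second)
```

A gene whose p-value lies close to the significance threshold can cross it in one unseeded run and not in another, which is the effect described above.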

We provide a table summarizing why genes previously identified as drivers do not appear in the current release (see here). For further inquiries about why a given driver does not appear in this release, please send a request to bbglab@irbbarcelona.org and we will try to provide a more detailed explanation.