IntOGen - Cancer Mutations Browser

IntOGen plus - Release notes

What's new:

Release v2024.09.20

In this release we have re-organised the distribution of the Hartwig samples into cohorts using the DOID annotation. We have also adopted MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts as the base to map mutations to protein coordinates. In addition to that, we updated the CGC annotation to the latest CGC v99, and carried out other small fixes. The current results show a modest increase in the number of driver genes (N=633 vs 619). The vast majority (N=583) of driver genes in the previous release are still in the Compendium, which also includes previously unidentified drivers. Conversely, N=36 genes in the previous release –mostly identified in a single cohort– are now absent from the Compendium. Most of these differences are directly caused by the change of reference transcript to the MANE, which resulted in the inclusion or exclusion of mutations overlapping the protein coding sequence. We show some examples of both cases.

Know more..

Here we describe in detail all changes carried out in this release: the new Hartwig cohorts organization, the updates on the pipeline, and how these changes affect the discovery of driver genes.

Update Hartwig dataset to v2023

We have updated the Hartwig dataset with the data release of September 2023. We have changed the tumor type annotation for each sample in Hartwig: instead of a manual annotation using the primary tumor type descriptions, we are now using the DOID identification to annotate the corresponding Oncotree code. In the case of tumor types with less than 20 samples, we re-assigned the tumor type to a parent node of the Oncotree, up until we can group at least 20 samples. This tumor type re-annotation of the entire Hartwig dataset modifies the current cohorts and new cohorts appear.

Figure 1 - Tumor type re-annotation on Hartwig Medical Foundation data. In the ribbon plot, the left column corresponds to the classification of Hartwig samples according to the previous IntOGen release, while the right column illustrates their classification in the current release. The width of the ribbons connecting nodes of the two columns is proportional to the number of samples shared between the two.

**Figure 2 - Number of cohorts comparison between IntOGen releases 2023 and 2024.**

**Table 1 - New cohorts and lost cohorts due to the new Hartwig Medical Foundation data organization**
New Cohorts		Lost Cohorts
Cohort name	Tumor type	Cohort name	Tumor type
HARTWIG_WGS_BLADDER_2023	Bladder/Urinary Tract	HARTWIG_WGS_BCC_2023	Basal Cell Carcinoma
HARTWIG_WGS_CEAD_2023	Cervical Adenocarcinoma	HARTWIG_WGS_CESC_2023	Cervical Squamous Cell Carcinoma
HARTWIG_WGS_COAD_2023	Colon Adenocarcinoma	HARTWIG_WGS_CSCC_2023	Cutaneous Squamous Cell Carcinoma
HARTWIG_WGS_EGC_2023	Esophagogastric Adenocarcinoma	HARTWIG_WGS_GBM_2023	Glioblastoma Multiforme
HARTWIG_WGS_GBC_2023	Gallbladder Cancer	HARTWIG_WGS_GINET_2023	Gastrointestinal Neuroendocrine Tumors
HARTWIG_WGS_GB_2023	Glioblastoma	HARTWIG_WGS_LNET_2023	Lung Neuroendocrine Tumor
HARTWIG_WGS_GIST_2023	Gastrointestinal Stromal Tumor	HARTWIG_WGS_LNM_2023	Lymphoid Neoplasm
HARTWIG_WGS_LIPO_2023	Liposarcoma	HARTWIG_WGS_MCC_2023	Merkel Cell Carcinoma
HARTWIG_WGS_LMS_2023	Leiomyosarcoma	HARTWIG_WGS_MEL_2023	Melanoma
HARTWIG_WGS_LUNG_2023	Lung	HARTWIG_WGS_OS_2023	Osteosarcoma
HARTWIG_WGS_MT_2023	Malignant Tumor	HARTWIG_WGS_PANET_2023	Pancreatic Neuroendocrine Tumor
HARTWIG_WGS_NETNOS_2023	Neuroendocrine Tumor, NOS	HARTWIG_WGS_PRAD_2023	Prostate Adenocarcinoma
HARTWIG_WGS_PROSTATE_2023	Prostate	HARTWIG_WGS_PSCC_2023	Penile Squamous Cell Carcinoma
HARTWIG_WGS_READ_2023	Rectal Adenocarcinoma	HARTWIG_WGS_SBC_2023	Small Bowel Cancer
HARTWIG_WGS_SACA_2023	Salivary Carcinoma	HARTWIG_WGS_STAD_2023	Stomach Adenocarcinoma
HARTWIG_WGS_SARCNOS_2023	Sarcoma, NOS	HARTWIG_WGS_THYM_2023	Thymoma
HARTWIG_WGS_SIC_2023	Small Intestinal Carcinoma	HARTWIG_WGS_VULVA_2023	Vulva/Vagina
HARTWIG_WGS_SKCM_2023	Cutaneous Melanoma
HARTWIG_WGS_SKIN_2023	Skin
HARTWIG_WGS_SOFT_TISSUE_2023	Soft Tissue
HARTWIG_WGS_STOMACH_2023	Esophagus/Stomach
HARTWIG_WGS_UTUC_2023	Upper Tract Urothelial Carcinoma
Show more

**Figure 3 - Number of cohorts per tumor type.** Shown are only tumor types with change in the number of samples between releases 2023 and 2024.

Updated pipeline

Several updates and important bugfixes have been introduced in the IntOGen pipeline:

Ensembl VEP v111

We have updated the Ensembl VEP from v101 to v111. This version differs substantially from the previous one used in IntOGen. Key differences are the redefinition of canonical transcripts and different annotations for some genes and mutations. Therefore, the number of mutations annotated to a gene may change from that in the previous release.

Use of MANE Transcripts

We have switched to using MANE transcript in our pipeline, to align with common standards used in the clinics.

Impact on datasets

Redefinition of the transcript impacts the regions used in methods like OncodriveFML, OncodriveCLUSTLand and smRegions, when computing the profile and when building coding sequence regions for dNdS.

Impact on the pipeline

VepStep. The process now runs with –mane flag, that includes the column MANE_SELECT.
ProcessVEPoutput step. The filtering out of the transcript is now based on the newly added MANE_SELECT column. The step only retains transcripts with a Refseq ID equivalent in the MANE_SELECT column. Moreover we updated the list of accepted consequence type to match with the new VEP v111.

CGC v99

We have updated the Cancer Gene Census gene list to v99, since COSMIC dropped support for releases prior v96.

cBaSE v1.1 update

Former version of cBaSE had a major bugfix when analyzing cohorts with low TMB that would call every gene a driver. This has been solved by updating from v1.0 to v1.1.

BoostDM connection (for further details in BoostDM updates please visit BoostDM release notes)

DriversSaturation step

The regions are now defined as the CDS according to the MANE SELECT transcripts, including stop codons, plus 5bp segments flanking each CDS segment.

ParseProfile step

Profile output by ComputeProfile step is parsed in a 96-channel for every trinucleotide context, to ensure consistency and compatibility with BoostDM run.

Renaming of Steps (Signature → ComputeProfile)

We renamed the Signature step into a more self-explainable name: ComputeProfile, in order to avoid confusion and misunderstanding.

New results

We have obtained a new set of 633 driver genes. The list of driver genes may change with respect to the previous release of the Compendium due to several reasons:

The change of canonical transcript for the MANE select transcript.
The update of COSMIC CGC to v99
The final ranking q-value list. As explained in the previous release, the final ranking of q-values may change due to heuristic calculation of the background model in different methods.

**Table 2 - Comparison of the counts between releases.** This new release contains ~5 million more mutations; ~200 more samples and 14 more tumor types, and 14 more drivers.
Intogen run	Total number of mutations	Total number of samples	Number of cancer types	Number of cohorts	Number of drivers
Release 2024	257,898,749	33,218	87	271	633
Release 2023	252,486,809	33,019	73	266	618

**Figure 4 - Drivers comparison between Release 2023 and Release 2024.** There are 50 drivers newly annotated in this release. Other 36 genes are no longer annotated as drivers. Most drivers (583) in the previous release remain.

Cohorts per driver Drivers per cancer type Cancer types per driver

**Figure 5 - Release 2023 vs Release 2024** - A. Number of cohorts in which each gene is called driver (first 80 driver genes). Overall, the number of cohorts in which a gene is a driver has increase, being TP53 the gene with the highest increase.

**Figure 5 - Release 2023 vs Release 2024** - C. Number of drivers per cancer type (first 80 cancer types).

**Figure 5 - Release 2023 vs Release 2024** - D. Number of cancer types in which each gene is called driver (first 80 driver genes).

After further inspection on the new driver genes and the genes no longer called drivers, we observe that an important number (12 new drivers) change the status in the CGC database and also present a different canonical transcript (16 new drivers and 13 lost drivers). Below are some examples (Table 4 and 5, Figure 5 to 7).

**Table 3 - Number of new and lost drivers that change the CGC status and/or the canonical transcript.** No lost genes change the status on the CGC database.
Different Transcript	Different ranking in CGC	Number of New Drivers	Number of Lost Drivers
True	True	4	/
True	False	12	13
False	True	8	/
False	False	39	30

**Table 4 - List of the 15 new drivers that either change the canonical transcript or the CGC status.**
Gene	Transcript (v111)	Transcript (v101)	Different Transcript	Cohort name	Gene in CGC v95	Gene in CGC v99	Different CGC ranking
NACA	ENST00000454682	ENST00000550952	True	PCAWG_WGS_SKIN_MELANOMA, TCGA_WXS_MEL, HARTWIG_WGS_BLCA_2023, CBIOP_WXS_BCC_UNIGE_2016_UNTREAT	True	True	False
PRKD1	ENST00000331968	ENST00000415220	True	CBIOP_WXS_PRAD_MSKCC_DCFI_2018_PRY, HARTWIG_WGS_COADREAD_2023, PCAWG_WGS_BILIARY_ADENOCA	True	False	True
NTRK2	ENST00000277120	ENST00000277120	False	PEDCBIOP_WXS_MBL_SHH_PRY, ICGC_WXS_CLLSLL_CLLE_ES_2019, CBIOP_WXS_ANGS_TREATED_2020	True	False	True
AFF3	ENST00000672756	ENST00000409579	True	CBIOP_WXS_LUAD_VALLEN_2018, ICGC_WXS_WDTC_THCA_SA_2019	True	True	False
CTNNA1	ENST00000302763	ENST00000302763	False	ICGC_WXS_COADREAD_COCA_CN_CASPOINT_2019, HARTWIG_WGS_CEAD_2023	True	False	True
RAD50	ENST00000378823	ENST00000378823	False	CPTAC_WXS_GBM_2020	True	False	True
NCOA1	ENST00000348332	ENST00000406961	True	CPTAC_WXS_PAAD_2021	True	True	False
RAP1B	ENST00000250559	ENST00000250559	False	HARTWIG_WGS_BLCA_2023	True	False	True
TMPRSS2	ENST00000332149	ENST00000398585	True	OTHER_WGS_PRAD_CPGEA_2021	True	True	False
CRLF2	ENST00000400841	ENST00000381566	True	TCGA_WXS_LUAD	True	True	False
ERG	ENST00000288319	ENST00000417133	True	CBIOP_WXS_ANGS_TREATED_2020	True	True	False
GSK3B	ENST00000264235	ENST00000316626	True	CBIOP_WGS_STAD_ONCOSG_2018	True	False	True
HGF	ENST00000222390	ENST00000222390	False	CBIOP_WXS_SCLC_CLCGP	True	False	True
HIF1A	ENST00000337138	ENST00000539097	True	ICGC_WXS_WDTC_THCA_SA_2019	True	True	False
ZCCHC8	ENST00000633063	ENST00000543897	True	CBIOP_WXS_EGC_TMUCIH_2015	True	True	False

**Table 5 - List of the 10 lost drivers that change the canonical transcript.**
Gene	Transcript (v111)	Transcript (v101)	Different Transcript	Cohort name	Gene in CGC v95	Gene in CGC v99	Different CGC ranking
ETNK1	ENST00000266517	ENST00000671733	True	ICGC_WXS_NPC_ORCA_IN_2019, OTHER_WGS_MM_NATLEUK2018	True	True	False
BCL2L12	ENST00000246784	ENST00000616144	True	CBIOP_WXS_CSCC_UCSF_2018, TCGA_WXS_MEL	True	True	False
CRNKL1	ENST00000536226	ENST00000377340	True	CBIOP_WXS_CSCC_UCSF_2018, TCGA_WXS_MEL	True	True	False
TFEB	ENST00000373033	ENST00000358871	True	OTHER_WGS_PLMESO_MESOMICS_2023	True	True	False
PDE4DIP	ENST00000695795	ENST00000369356	True	TCGA_WXS_OVT	True	True	False
SLC35E2A	NaN	ENST00000643905	True	ICGC_WGS_LUSC_LUSC_KR_2019	False	False	False
S100A7	ENST00000368723	ENST00000368722	True	HARTWIG_WGS_OVT_2023	True	True	False
PTPRU	ENST00000373779	ENST00000345512	True	TCGA_WXS_LUAD	False	False	False
FANCD2	ENST00000675286	ENST00000287647	True	ICGC_WXS_COADREAD_COCA_CN_VARSCAN_2019	True	True	False
CREM	ENST00000685392	ENST00000345491	True	CPTAC_WXS_GBM_2020	False	False	False

Here we show some examples of genes that change the driver status likely because they change the canonical transcript: a new driver, NACA, presented with a longer MANE transcript, containing a region with a cluster of mutations (Figure 5). On the other hand, BCL2L12 and CRNKL1 are two genes with a shorter transcript, losing a region that contains a cluster of mutations (Figure 6 and 7).

New Driver: NACA Lost Driver: BCL2L12 Lost Driver: CRNKL1

**Figure 5 - NACA gene mutations distribution.** *Top*. Along the updated MANE transcript (Ensembl v111). *Bottom*. Along the previous canonical transcript (Ensembl v101).

**Figure 6 - BCL2L12 gene mutations distribution.** *Top*. Along the updated MANE transcript (Ensembl v111). *Bottom*. Along the previous canonical transcript (Ensembl v101).

**Figure 7 - CRNKL1 gene mutations distribution.** *Top*. Along the updated MANE transcript (ensembl v111).Bottom.Along the previous canonical transcript (ensembl v101).

v2023.05.31

In this release we have increased the number of cohorts (and samples), we have made small updates to the IntOGen pipeline and we have updated key third-party dependencies. As a result, the list of driver genes has increased modestly (N=618). The vast majority of previous driver genes are still in this list, which also include previously unidentified drivers, while few others –mostly identified in a single cohort in the previous release– now fail to pass the thresholds set in the pipeline.

Know more..

Here we describe in detail all changes carried out in this release: the new cohorts and samples incorporated, the updates on the pipeline, and how these changes affect the discovery of driver genes.

More cohorts, more samples

In this new release we have incorporated 48 new cohorts to the analysis (Figure 1) representing 4943 new samples (Table 1 and Table 2).

Some previously included cohorts have increased the number of samples they contribute to the analysis. This is the case of cohorts sequenced by the Hartwig Medical Foundation (release downloaded on 21/10/2021) and ICGC (latest release from 2019 downloaded). After careful examination, we have excluded an ICGC cohort (LINC_JP_WGS_ICGC) which was completely included within the HC_PCAWG cohort (now re-named PCAWG_WGS_LIVER_HCC) (Figure 1).

New cohorts have been contributed by some already included data sources, such as TARGET, from which we obtained 2 new cohorts with Neuroblastoma (NB) and Osteosarcoma (OS) samples.

We also included datasets from new genomic data sources, such as CPTAC and CGCI.

Due to changes in the sample annotation of the TCGA_WXS_AML cohort, 95 more samples of this cohort were processed by the pipeline.

In order to standardize the cohort nomenclatures, cohort names have been updated, applying -in most cases- the following pattern:

{PROJECT}_{PLATFORM}_{ONCOTREE_CODE}_{OTHER_INFO}_{YEAR}

Examples: CPTAC_WXS_BRCA_2020, ICGC_WXS_AML_LAML_CN_2019, HARTWIG_WGS_NSCLC_2020

**Table 1 - Increase in the number of samples in many tumor types.** This table shows 16 tumor types that have increased more than 20% in the number of samples.
Cancer type	Samples release 2023	Samples release 2020	Number of new samples	Increase (%)
Burkitt Lymphoma	87	15	72	480.00%
Lung Neuroendocrine Tumor	30	10	20	200.00%
Pleural Mesothelioma	337	117	220	188.03%
Cutaneous Squamous Cell Carcinoma	128	49	79	161.22%
Nasopharyngeal Carcinoma	221	103	118	114.56%
Angiosarcoma	71	34	37	108.82%
Lymphoid Neoplasm	48	25	23	92.00%
Lung Adenocarcinoma	1,215	756	459	60.71%
Renal Clear Cell Carcinoma	948	620	328	52.90%
Osteosarcoma	553	363	190	52.34%
Non-Small Cell Lung Cancer	564	384	180	46.88%
Cervical Squamous Cell Carcinoma	467	341	126	36.95%
Well-Differentiated Thyroid Cancer	852	684	168	24.56%
Stomach Adenocarcinoma	858	707	151	21.36%
Esophageal Adenocarcinoma	1,139	945	194	20.53%
Lung Squamous Cell Carcinoma	703	584	119	20.38%
Others	24,798	22,339	2,459	11.01%
Total	33,019	28,076	4,943	17.61%

**Table 2 - Number of new samples per data source.** STJUDE, PCAWG and PEDCBIOP have the same number of samples as in the previous release. This new release has more samples on TCGA, TARGET, HARTWIG, CBIOP, ICGC and other projects. Two new sequencing projects are added: CPTAC and CGCI.
Data source	Samples release 2023	Samples release 2020	Number of new samples
STJUDE	622	622	0
PCAWG	2,554	2,554	0
PEDCBIOP	1,087	1,087	0
TCGA	10,105	10,010	95
TARGET	365	246	119
CGCI	192	0	192
HARTWIG	4,386	3,742	644
OTHER	2,915	2,257	658
CBIOP	4,566	3,570	996
CPTAC	1,076	0	1,076
ICGC	5,151	3,988	1,163
Total	33,019	28,076	4,943

**Figure 1 - New cohorts in the Intogen release 2023.** There are 48 new cohorts and 1 excluded cohort (`LINC_JP_WGS_ICGC`).

Updated pipeline

Several updates have been introduced in the IntOGen pipeline:

bgparsers replaced with OpenVariant:

We have developed OpenVariant, a new python package to parse the input files. This new method annotates all SNVs and indels from the mutation data files and transforms the data in a standardized format, by reading a yaml file prepared by the user.

Importantly, the previous bgparsers tool had a bug in the annotation of indels, leading to some indels not being processed. This bug has been corrected in OpenVariant. As a consequence, there are, overall, more mutations annotated per sample.

Seed to run OncodriveFML, OncodriveCLUSTL, smRegions and dNdScv added:

We have added the use of a seed in these methods to reduce the variability in the calculation of the p-value of genes across IntOGen runs. Due to difficulties in the implementation, we could not add it in the remaining three methods run by the pipeline.

Ensembl VEP updated from v92 to v101:

We updated the Ensembl VEP version to 101. This version may contain important gene annotation differences, and therefore the number of mutations annotated may change substantially from the previous release.

MSKCC oncotree updated to version 2021:

We have updated the oncotree from the latest version released by MSKCC (November 2021). As a result, the number of cancer type nodes has increased from 82 to 889.

We have applied some modifications to the MSKCC Oncotree:

NON_SOLID node added after TISSUE and before MYELOID and LYMPHOID
SOLID node added after TISSUE and before the rest of tissues
ALL node added after LNM and before TLL and BLL

CGC genes updated to v95:

We have updated the Cancer Gene Census gene list with the v95. This version distinguishes between Tier1 and Tier2 CGC genes. In the IntOGen pipeline, both tiers are considered as a unique list of CGC genes.

Hugo Symbols updated to the 20/01/2022 release:

We have unified all gene SYMBOLs according to the latest version of the HUGO Gene Nomenclature Committee.

CancerMine database updated to 07/12/2021 release:

The CancerMine database, which gathers information on literature evidence for cancer genes, was updated.

CADD update:

CADD, a tool for scoring the deleteriousness of single nucleotide variants - as well as insertion/deletions variants in the human genome, was updated from version 1.4 to version 1.6.

List of frequent artifacts updated:

In the previous release the pipeline was using a list of ‘Known Artifacts’ and a separated blacklist of genes, to rule out possible false positives in the post-processing. In this release, the blacklist genes are considered ‘Known Artifacts’, and, together with the ‘Warning artifacts’ (list of genes suspected to be artifacts) are all excluded in the post-processing step. Moreover, the olfactory receptors genes list from HORDE database used as a filter was updated to to Build #44c (30/July 2019).

File with unfiltered drivers included:

In this new release we include a list of unfiltered drivers as one of the outputs of the IntOGen pipeline. This file annotates all the filters applied to the post-processing step: from the output of the combination to the final set of driver genes. [unfiltered_drivers.tsv]

BoostDM connection prepared:

We have included along with the IntOGen output, all the files needed to run BoostDM. Specifically, two steps in the pipeline were added:

DriversSaturation: all possible mutations for a given gene are generated and returned as independent annotated files per gene. [{gene}.vep.gz]
filterMNVs: one file containing all positions detected as MNVs. [mnvs.tsv.gz]

New results

In the new release, several aspects have been improved: there are more cohorts and tumor type in which a gene is identified as driver; and more drivers and cohorts per tumor type (Figure 2).

Cohorts per driver Cohorts per cancer type Drivers per cancer type Cancer types per driver

**Figure 2 - Release 2020 vs Release 2023** - A. Number of cohorts in which each gene is called driver (first 80 driver genes). Overall, the number of cohorts in which a gene is a driver has increase, being TP53 the gene with the highest increase.

**Figure 2 - Release 2020 vs Release 2023** - B. Number of cohorts per cancer type (first 80 cancer types).

**Figure 2 - Release 2020 vs Release 2023** - C. Number of drivers per cancer type (first 80 cancer types).

**Figure 2 - Release 2020 vs Release 2023** - D. Number of cancer types in which each gene is called driver (first 80 driver genes).

The new release contains a total of 618 driver genes, 50 more than the previous release. There are 7 new tumor types: two of them are from two cohorts with re-assigned tumor type (from PAAD to PANCREAS on HARTWIG_PANCREAS and from RCCC to RCC on HARTWIG_KIDNEY_RENAL_CELL) and five of them are new; PSCC (Penil Squamous Cell Carcinoma, HARTWIG_WGS_PSCC_2020), MCC (Merkel Cell Carcinoma, HARTWIG_WGS_MCC_2020), COAD (Colon Adenocarcinoma, CPTAC_WXS_COAD_2020) and MLYM (Malignant Lymphoma, ICGC_WGS_MLYM_MALY_DE_PED_2019)

**Table 3 - Comparison of the counts between releases.** This new release contains ~50 million more mutations; ~5,000 more samples and 7 more tumor types, and 50 more drivers.
Intogen run	Total number of mutations	Total number of samples	Number of cancer types	Number of cohorts	Number of drivers
Release 2023	252,486,809	33,019	73	266	618
Release 2020	203,003,747	28,076	66	221	568

**Figure 3 - New drivers appear in this new release, but many other driver genes are not called in this new release**

Among these 618 driver genes, 133 that are new, while 82 included in the previous release are no longer identified as drivers (Figure 3). This is due to several reasons:

Different mutation annotations:

Increase in the number of processed indels (see above)
New Ensembl gene annotation, where annotation of mutations per gene may have changed with respect to the previous release

Different bona fide cancer gene / artifacts lists:

CGC genes: The step of combination in the pipeline depends on the list of bona fide cancer genes obtained from CGC.
CancerMine database: This database, that annotates literature-gathered information, is used to provide evidence of the involvement of genes in tumorigenesis.
Lists of artifacts: in this release we filter out in the post-processing step all known artifacts and suspect/warning artifacts. The previous release was just filtering the known artifacts. These are accordingly annotated in the unfiltered drivers file.

The final ranking q-value list:

Implementation of a seed: Everytime we run Intogen, the final ranking of q-values may change, due to heuristic calculation of the background model in different methods. This means that genes that are usually found close to the driver threshold may have driver or passenger status across runs. In this release, we included a seed option for 4 methods (those that allowed it), to reduce the variability across runs. This problem is still not solved, as there are still 3 methods where the background calculation cannot be fixed with a seed.

We provide a table summarizing the reason why genes previously identified as drivers do not appear in the current release (See here). For further inquiries on the reason why a driver is not appearing in this release, please send a request to (bbglab@irbbarcelona.org) and we will try to provide a more detailed explanation.