TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages
TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages
- 1 About
- 2 Abstract
- 3 Introduction
- 4 Methods
- 5 Conclusion
- Author contributions
- Competing interests
- Grant information
- Acknowledgements
- References
1 About
This workflow is based on the article: TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages (Silva et al. 2016). Due to time and space contraints we downloaded only a subset of the data, for a real analysis please use all data available. The data used in the examples are available in the package TCGAWorkflowData
.
1.1 Installation
To be able to execute all the steps of this workflow please install it with the following code:
source("https://bioconductor.org/biocLite.R")
deps <- c("pathview", "clusterProfiler", "ELMER", "DO.db","GO.db",
"ComplexHeatmap", "EDASeq", "TCGAbiolinks", "AnnotationHub",
"gaia","ChIPseeker", "minet","BSgenome.Hsapiens.UCSC.hg19",
"MotifDb", "MotIV", "rGADEM", "motifStack", "RTCGAToolbox")
for(pkg in deps) if (!pkg %in% installed.packages()) biocLite(pkg, dependencies = TRUE)
deps <- c("devtools","DT","pbapply","readr","circlize")
for(pkg in deps) if (!pkg %in% installed.packages()) install.packages(pkg,dependencies = TRUE)
devtools::install_github("BioinformaticsFMRP/TCGAWorkflowData")
devtools::install_github("BioinformaticsFMRP/TCGAWorkflow", dependencies = TRUE)
1.2 Loading packages
At the beginning of each section, the packages required to execute the code will be loaded. However the following packages are required for all sections.
- TCGAWorkflowData: this package contains the data necessary to execute each of the analysis steps. This is a subset of the downloaded to make the example faster. For a real analysis, please use all the data available.
- DT: we will use it to visualize the results
library(TCGAWorkflowData)
library(DT)
2 Abstract
Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no one comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. A need to create an integration of these different analyses was recently proposed. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and by harnessing several key Bioconductor packages, we describe how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer.
To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiform or GBM).
All the package landing pages used in this workflow can be found through the biocViews interface.
Keywords: Epigenomics, Genomics, Cancer, non-coding, TCGA, ENCODE, Roadmap, Bioinformatics.
3 Introduction
Cancer is a complex genetic disease spanning multiple molecular events such as point mutations, structural variations, translocations and activation of epigenetic and transcriptional signatures and networks. The effects of these events take place at different spatial and temporal scales with interlayer communications and feedback mechanisms creating a highly complex dynamic system. To gain insight into the biology of tumours most of the research in cancer genomics is aimed at the integration of the observations at multiple molecular scales and the analysis of their interplay. Even if many tumors share similar recurrent genomic events, their relationships with the observed phenotype are often not understood. For example, although we know that the majority of the most aggressive form of brain tumours such as glioma harbour the mutation of a single gene (IDH), the mechanistic explanation of the activation of its characteristic epigenetic and transcriptional signatures are still far to be well characterized. Moreover, network-based strategies have recently emerged as an effective framework for the discovery functional disease drivers that act as main regulators of cancer phenotypes. Here we describe a comprehensice workflow that integrates many Bioconductor packages in order to analyze and integrate the molteplicity of molecular observation layers in large scale cancer dataset.
Indeed, recent technological developments allowed the deposition of large amounts of genomic and epigenomic data, such as gene expression, DNA methylation, and genomic localization of transcription factors, into freely available public international consortia like The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap) (Hawkins, Hon, and Ren 2010). An overview of the three consortia is described below:
The Cancer Genome Atlas (TCGA): The TCGA consortium, which is a National Institute of Health (NIH) initiative, makes publicly available molecular and clinical information for more than 30 types of human cancers including exome (variant analysis), single nucleotide polymorphism (SNP), DNA methylation, transcriptome (mRNA), microRNA (miRNA) and proteome . Sample types available at TCGA are: primary solid tumors, recurrent solid tumors, blood derived normal and tumor, metastatic, and solid tissue normal (Weinstein et al. 2013).
The Encyclopedia of DNA Elements (ENCODE): Found in 2003 by the National Human Genome Research Institute (NHGRI), the project aims to build a comprehensive list of functional elements that have an active role in the genome, including regulatory elements that govern gene expression. Biosamples includes immortalized cell lines, tissues, primary cells and stem cells (Consortium and others 2011).
The NIH Roadmap Epigenomics Mapping Consortium: This was launched with the goal of producing a public resource of human epigenomic data in order to analyze biology and disease-oriented research. Roadmap maps DNA methylation, histone modifications, chromatin accessibility, and small RNA transcripts in stem cells and primary ex vivo tissues (Fingerman et al. 2011; Bernstein et al. 2010).
Briefly, these three consortia provide large scale epigenomic data onto a variety of microarrays and next-generation sequencing (NGS) platforms. Each consortium encompasses specific types of biological information on specific type of tissue or cell and when analyzed together, it provides an invaluable opportunity for research laboratories to better understand the developmental progression of normal cells to cancer state at the molecular level and importantly, correlate these phenotypes with tissue of origins.
Although there exists a wealth of possibilities (Kannan et al. 2015) in accessing cancer associated data, Bioconductor represent the most comprehensive set of open source, updated and integrated professional tools for the statistical analysis of large scale genomic data. Thus, we propose our workflow within Bioconductor to describe how to download, process, analyze and integrate cancer data to understand specific cancer-related specific questions. However, there is no tool that solves the issue of integration in a comprehensive sequence and mutation information, epigenomic state and gene expression within the context of gene regulatory networks to identify oncogenic drivers and characterize altered pathways during cancer progression. Therefore, our workflow presents several Bioconductor packages to work with genomic and epigenomics data.
4 Methods
4.1 Access to the data
TCGA data is accessible via the the NCI Genomic Data Commons (GDC) data portal, GDC Legacy Archive and the Broad Institute’s GDAC Firehose. The GDC Data Portal provides access to the subset of TCGA data that has been harmonized against GRCh38 (hg38) using GDC Bioinformatics Pipelines which provides methods to the standardization of biospecimen and clinical data, the re-alignment of DNA and RNA sequence data against a common reference genome build GRCh38, and the generation of derived data. Whereas the GDC Legacy Archive provides access to an unmodified copy of data that was previously stored in CGHub(Wilks et al. 2014) and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), in which uses as references GRCh37 (hg19) and GRCh36 (hg18).
The previously stored data in CGHub, TCGA Data Portal and Broad Institute’s GDAC Firehose, were provided as different levels or tiers that were defined in terms of a specific combination of both processing level (raw, normalized, integrated) and access level (controlled or open access). Level 1 indicated raw and controlled data, level 2 indicated processed and controlled data, level 3 indicated Segmented or Interpreted Data and open access and level 4 indicated region of interest and open access data. While the TCGA data portal provided level 1 to 3 data, Firehose only provides level 3 and 4. An explanation of the different levels can be found at TCGA Wiki. However, the GDC data portal no longer uses this based classification model in levels. Instead a new data model was created, its documentation can be found in GDC documentation. In this new model, data can be open or controlled access. While the GDC open access data does not require authentication or authorization to access it and generally includes high level genomic data that is not individually identifiable, as well as most clinical and all biospecimen data elements, the GDC controlled access data requires dbGaP authorization and eRA Commons authentication and generally includes individually identifiable data such as low level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements. The process to obtain access to controlled data is found in GDC web site.
Finally, the data provided by GDC data portal and GDC Legacy Archive can be accessed using Bioconductor package TCGAbiolinks, while the data provided by Firehose can be accessed by Bioconductor package RTCGAToolbox.
The next steps describes how one could use TCGAbiolinks & RTCGAToolbox to download clinical, genomics, transcriptomics, epigenomics data, as well as subtype information and GISTIC results (i.e., identified genes targeted by somatic copy-number alterations (SCNAs) that drive cancer growth). All the data used in this workflow has as reference the Genome Reference Consortium human genome (build 37 - hg19).
4.1.1 Downloading data from TCGA data portal
The Bioconductor package TCGAbiolinks (Colaprico et al. 2016) has three main functions GDCquery, GDCdownload and GDCprepare that should sequentially be used to respectively search, download and load the data as an R object.
GDCquery uses GDC API to search the data for a given project and data category and filters the results by samples, sample type, file type and others features if requested by the user. This function returns a object with a summary table with the results found (samples, files and other useful information) and the arguments used in the query. The most important GDCquery arguments are project which receives a GDC project (TCGA-USC, TCGA-LGG, TARGET-AML, etc), data.category which receives a data category (Transcriptome Profiling, Copy Number Variation, DNA methylation, Gene expression, etc), data.type which receives a data type (Gene expression quantification, Isoform Expression Quantification, miRNA Expression Quantification, Copy Number Segment, Masked Copy Number Segment, etc), workflow.type, which receives a GDC workflow type (HTSeq - Counts, HTSeq - FPKM-UQ, HTSeq - FPKM), legacy, which selects to use the legacy database or the harmonized database, file.type, which receives a file type for the searches in the legacy database (hg18.seg, hg19.seg, nocnv_,hg18.seg, nocnv_hg19.seg, rsem.genes.results, rsem.genes.normalized_results, etc) and platform, which receives a the platform for the searches in the legacy database (HumanMethylation27, Genome_Wide_SNP_6, IlluminaHiSeq_RNASeqV2, etc). A complete list of possible entries for arguments can be found in the TCGAbiolinks vignette. Listing 1 shows an example of this function.
After the search step, the user will be able to download the data using the GDCdownload function which can use either the GDC API to download the samples, or the gdc client tools. The downloaded data will be saved in a directory with the project name and a sub-folder with the data.category, for example “TCGA-GBM/DNA_methylation”.
Finally, GDCprepare transforms the downloaded data into a summarizedExperiment object (Huber et al. 2015) or a data frame. If SummarizedExperiment is set to TRUE, TCGAbiolinks will add to the object sub-type information, which was defined by The Cancer Genome Atlas (TCGA) Research Network reports (the full list of papers can be seen in TCGAquery_subtype section in TCGAbiolinks vignette), and clinical information. Listing 1 shows how to use these functions to download DNA methylation and gene expression data from the GDC legacy database and 2 shows how to download copy number variation from harmonized data portal. Other examples, that access the harmonized data can be found in the TCGAbiolinks vignette.
library(TCGAbiolinks)
# Obs: The data in the legacy database has been aligned to hg19
query.met.gbm <- GDCquery(project = "TCGA-GBM",
legacy = TRUE,
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450",
barcode = c("TCGA-76-4926-01B-01D-1481-05", "TCGA-28-5211-01C-11D-1844-05"))
GDCdownload(query.met.gbm)
met.gbm.450 <- GDCprepare(query = query.met.gbm,
save = TRUE,
save.filename = "gbmDNAmet450k.rda",
summarizedExperiment = TRUE)
query.met.lgg <- GDCquery(project = "TCGA-LGG",
legacy = TRUE,
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450",
barcode = c("TCGA-HT-7879-01A-11D-2399-05", "TCGA-HT-8113-01A-11D-2399-05"))
GDCdownload(query.met.lgg)
met.lgg.450 <- GDCprepare(query = query.met.lgg,
save = TRUE,
save.filename = "lggDNAmet450k.rda",
summarizedExperiment = TRUE)
met.gbm.lgg <- SummarizedExperiment::cbind(met.lgg.450, met.gbm.450)
query.exp.lgg <- GDCquery(project = "TCGA-LGG",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
sample.type = "Primary solid Tumor")
GDCdownload(query.exp.lgg)
exp.lgg <- GDCprepare(query = query.exp.lgg, save = TRUE, save.filename = "lggExp.rda")
query.exp.gbm <- GDCquery(project = "TCGA-GBM",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
sample.type = "Primary solid Tumor")
GDCdownload(query.exp.gbm)
exp.gbm <- GDCprepare(query = query.exp.gbm, save = TRUE, save.filename = "gbmExp.rda")
exp.gbm.lgg <- SummarizedExperiment::cbind(exp.lgg, exp.gbm)
#-----------------------------------------------------------------------------
# Data.category: Copy number variation aligned to hg38
#-----------------------------------------------------------------------------
query <- GDCquery(project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Copy Number Segment",
barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)
query <- GDCquery("TCGA-ACC",
"Copy Number Variation",
data.type = "Masked Copy Number Segment",
sample.type = c("Primary solid Tumor")) # see the barcodes with getResults(query)$cases
GDCdownload(query)
data <- GDCprepare(query)
If a SummarizedExperiment object was chosen, the data can be accessed with three different accessors: assay for the data information, rowRanges to gets the range of values in each row and colData to get the sample information (patient, batch, sample type, etc) (Huber et al. 2015; Morgan M and H., n.d.). An example is shown in listing below.
library(SummarizedExperiment)
# Load object from TCGAWorkflowData package
# THis object will be created in the further sections,
data(GBMIllumina_HiSeq)
# get expression matrix
data <- assay(gbm.exp)
data[1:5,1:5]
## TCGA-28-1753-01A-01R-1850-01 TCGA-06-0749-01A-01R-1849-01
## A1BG|1 176.60 603.41
## A2M|2 23071.89 23619.77
## NAT1|9 51.00 104.00
## NAT2|10 11.00 17.00
## SERPINA3|12 8742.00 137493.00
## TCGA-28-1747-01C-01R-1850-01 TCGA-27-1834-01A-01R-1850-01
## A1BG|1 173.63 419.63
## A2M|2 41407.90 32596.30
## NAT1|9 156.00 148.00
## NAT2|10 23.00 26.00
## SERPINA3|12 127501.00 23065.00
## TCGA-28-5220-01A-01R-1850-01
## A1BG|1 217.89
## A2M|2 15286.62
## NAT1|9 79.00
## NAT2|10 22.00
## SERPINA3|12 4444.00
# get genes information
genes.info <- rowRanges(gbm.exp)
genes.info
## GRanges object with 21022 ranges and 4 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## A1BG|1 chr19 [58856544, 58864865] - |
## A2M|2 chr12 [ 9220260, 9268825] - |
## NAT1|9 chr8 [18027986, 18081198] + |
## NAT2|10 chr8 [18248755, 18258728] + |
## SERPINA3|12 chr14 [95058395, 95090983] + |
## ... ... ... ... .
## NCRNA00182|100302692 chrX [ 73183790, 73513409] - |
## TMED7-TICAM2|100302736 chr5 [114914339, 114961858] - |
## TMED7-TICAM2|100302736 chr5 [114949205, 114968689] - |
## TMED7-TICAM2|100302736 chr5 [114914339, 114961876] - |
## LOC100303728|100303728 chrX [118599997, 118603061] - |
## gene_id entrezgene ensembl_gene_id
## <character> <numeric> <character>
## A1BG|1 A1BG 1 ENSG00000121410
## A2M|2 A2M 2 ENSG00000175899
## NAT1|9 NAT1 9 ENSG00000171428
## NAT2|10 NAT2 10 ENSG00000156006
## SERPINA3|12 RP11-986E7.7 12 ENSG00000273259
## ... ... ... ...
## NCRNA00182|100302692 FTX 100302692 ENSG00000230590
## TMED7-TICAM2|100302736 TMED7-TICAM2 100302736 ENSG00000251201
## TMED7-TICAM2|100302736 TMED7 100302736 ENSG00000134970
## TMED7-TICAM2|100302736 TICAM2 100302736 ENSG00000243414
## LOC100303728|100303728 SLC25A5-AS1 100303728 ENSG00000224281
## transcript_id.transcript_id_TCGA-28-1753-01A-01R-1850-01
## <character>
## A1BG|1 uc002qsd.3,uc002qsf.1
## A2M|2 uc001qvj.1,uc001qvk.1,uc009zgk.1
## NAT1|9 uc003wyq.2,uc003wyr.2,uc003wys.2,uc003wyt.2,uc003wyu.2,uc003wyv.2,uc010ltc.2,uc010ltd.2,uc011kyl.1
## NAT2|10 uc003wyw.1
## SERPINA3|12 uc001ydo.3,uc001ydp.2,uc001ydq.2,uc001ydr.2,uc001yds.2,uc010avf.1,uc010avg.2
## ... ...
## NCRNA00182|100302692 uc004ebr.1,uc010nlq.1
## TMED7-TICAM2|100302736 uc003krd.2,uc003kre.2
## TMED7-TICAM2|100302736 uc003krd.2,uc003kre.2
## TMED7-TICAM2|100302736 uc003krd.2,uc003kre.2
## LOC100303728|100303728 uc004ere.1,uc004erg.1
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
# get sample information
sample.info <- colData(gbm.exp)
head(sample.info)
## DataFrame with 6 rows and 106 columns
## patient barcode
## <character> <character>
## TCGA-28-1753-01A-01R-1850-01 TCGA-28-1753 TCGA-28-1753-01A-01R-1850-01
## TCGA-06-0749-01A-01R-1849-01 TCGA-06-0749 TCGA-06-0749-01A-01R-1849-01
## TCGA-28-1747-01C-01R-1850-01 TCGA-28-1747 TCGA-28-1747-01C-01R-1850-01
## TCGA-27-1834-01A-01R-1850-01 TCGA-27-1834 TCGA-27-1834-01A-01R-1850-01
## TCGA-28-5220-01A-01R-1850-01 TCGA-28-5220 TCGA-28-5220-01A-01R-1850-01
## TCGA-06-0132-01A-02R-1849-01 TCGA-06-0132 TCGA-06-0132-01A-02R-1849-01
## sample shortLetterCode
## <character> <character>
## TCGA-28-1753-01A-01R-1850-01 TCGA-28-1753-01A TP
## TCGA-06-0749-01A-01R-1849-01 TCGA-06-0749-01A TP
## TCGA-28-1747-01C-01R-1850-01 TCGA-28-1747-01C TP
## TCGA-27-1834-01A-01R-1850-01 TCGA-27-1834-01A TP
## TCGA-28-5220-01A-01R-1850-01 TCGA-28-5220-01A TP
## TCGA-06-0132-01A-02R-1849-01 TCGA-06-0132-01A TP
## definition classification_of_tumor
## <character> <character>
## TCGA-28-1753-01A-01R-1850-01 Primary solid Tumor not reported
## TCGA-06-0749-01A-01R-1849-01 Primary solid Tumor not reported
## TCGA-28-1747-01C-01R-1850-01 Primary solid Tumor not reported
## TCGA-27-1834-01A-01R-1850-01 Primary solid Tumor not reported
## TCGA-28-5220-01A-01R-1850-01 Primary solid Tumor not reported
## TCGA-06-0132-01A-02R-1849-01 Primary solid Tumor not reported
## last_known_disease_status
## <character>
## TCGA-28-1753-01A-01R-1850-01 not reported
## TCGA-06-0749-01A-01R-1849-01 not reported
## TCGA-28-1747-01C-01R-1850-01 not reported
## TCGA-27-1834-01A-01R-1850-01 not reported
## TCGA-28-5220-01A-01R-1850-01 not reported
## TCGA-06-0132-01A-02R-1849-01 not reported
## updated_datetime.x
## <character>
## TCGA-28-1753-01A-01R-1850-01 2016-09-02T19:11:19.658754-05:00
## TCGA-06-0749-01A-01R-1849-01 2016-09-02T19:01:55.836912-05:00
## TCGA-28-1747-01C-01R-1850-01 2016-09-02T19:11:33.850859-05:00
## TCGA-27-1834-01A-01R-1850-01 2016-09-02T19:12:57.685339-05:00
## TCGA-28-5220-01A-01R-1850-01 2016-09-02T19:08:48.447713-05:00
## TCGA-06-0132-01A-02R-1849-01 2016-09-02T19:13:19.722175-05:00
## primary_diagnosis tumor_stage
## <character> <character>
## TCGA-28-1753-01A-01R-1850-01 c71.9 not reported
## TCGA-06-0749-01A-01R-1849-01 c71.9 not reported
## TCGA-28-1747-01C-01R-1850-01 c71.9 not reported
## TCGA-27-1834-01A-01R-1850-01 c71.9 not reported
## TCGA-28-5220-01A-01R-1850-01 c71.9 not reported
## TCGA-06-0132-01A-02R-1849-01 c71.9 not reported
## age_at_diagnosis vital_status morphology
## <numeric> <character> <character>
## TCGA-28-1753-01A-01R-1850-01 19366 alive 9440/3
## TCGA-06-0749-01A-01R-1849-01 18571 dead 9440/3
## TCGA-28-1747-01C-01R-1850-01 16301 dead 9440/3
## TCGA-27-1834-01A-01R-1850-01 20568 dead 9440/3
## TCGA-28-5220-01A-01R-1850-01 24622 dead 9440/3
## TCGA-06-0132-01A-02R-1849-01 18125 dead 9440/3
## days_to_death
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 82
## TCGA-28-1747-01C-01R-1850-01 77
## TCGA-27-1834-01A-01R-1850-01 1233
## TCGA-28-5220-01A-01R-1850-01 388
## TCGA-06-0132-01A-02R-1849-01 771
## days_to_last_known_disease_status
## <logical>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## days_to_last_follow_up state.x
## <numeric> <logical>
## TCGA-28-1753-01A-01R-1850-01 37 NA
## TCGA-06-0749-01A-01R-1849-01 66 NA
## TCGA-28-1747-01C-01R-1850-01 77 NA
## TCGA-27-1834-01A-01R-1850-01 1008 NA
## TCGA-28-5220-01A-01R-1850-01 319 NA
## TCGA-06-0132-01A-02R-1849-01 570 NA
## days_to_recurrence
## <logical>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## diagnosis_id
## <character>
## TCGA-28-1753-01A-01R-1850-01 a0282570-3e64-5c3a-b073-9e1c169f33fc
## TCGA-06-0749-01A-01R-1849-01 8af027e1-3ad6-5221-b295-95c87f1cd184
## TCGA-28-1747-01C-01R-1850-01 b6d50cb3-6104-5cd7-a485-e1825e05aed9
## TCGA-27-1834-01A-01R-1850-01 621f8adb-470e-5586-aa0a-3845262bbfca
## TCGA-28-5220-01A-01R-1850-01 13be312a-2b28-5e0f-83ce-412c38852e5f
## TCGA-06-0132-01A-02R-1849-01 6d74ecf4-7ce5-566f-8389-d24e1617ba27
## tumor_grade
## <character>
## TCGA-28-1753-01A-01R-1850-01 not reported
## TCGA-06-0749-01A-01R-1849-01 not reported
## TCGA-28-1747-01C-01R-1850-01 not reported
## TCGA-27-1834-01A-01R-1850-01 not reported
## TCGA-28-5220-01A-01R-1850-01 not reported
## TCGA-06-0132-01A-02R-1849-01 not reported
## treatments
## <list>
## TCGA-28-1753-01A-01R-1850-01 NA,2016-09-02T19:11:19.658754-05:00,NA,...
## TCGA-06-0749-01A-01R-1849-01 NA,2016-09-02T19:01:55.836912-05:00,NA,...
## TCGA-28-1747-01C-01R-1850-01 NA,2016-09-02T19:11:33.850859-05:00,NA,...
## TCGA-27-1834-01A-01R-1850-01 NA,2016-09-02T19:12:57.685339-05:00,NA,...
## TCGA-28-5220-01A-01R-1850-01 NA,2016-09-02T19:08:48.447713-05:00,NA,...
## TCGA-06-0132-01A-02R-1849-01 NA,2016-09-02T19:13:19.722175-05:00,NA,...
## tissue_or_organ_of_origin days_to_birth
## <character> <numeric>
## TCGA-28-1753-01A-01R-1850-01 c71.9 -19366
## TCGA-06-0749-01A-01R-1849-01 c71.9 -18571
## TCGA-28-1747-01C-01R-1850-01 c71.9 -16301
## TCGA-27-1834-01A-01R-1850-01 c71.9 -20568
## TCGA-28-5220-01A-01R-1850-01 c71.9 -24622
## TCGA-06-0132-01A-02R-1849-01 c71.9 -18125
## progression_or_recurrence prior_malignancy
## <character> <character>
## TCGA-28-1753-01A-01R-1850-01 not reported not reported
## TCGA-06-0749-01A-01R-1849-01 not reported not reported
## TCGA-28-1747-01C-01R-1850-01 not reported not reported
## TCGA-27-1834-01A-01R-1850-01 not reported not reported
## TCGA-28-5220-01A-01R-1850-01 not reported not reported
## TCGA-06-0132-01A-02R-1849-01 not reported not reported
## site_of_resection_or_biopsy
## <character>
## TCGA-28-1753-01A-01R-1850-01 c71.9
## TCGA-06-0749-01A-01R-1849-01 c71.9
## TCGA-28-1747-01C-01R-1850-01 c71.9
## TCGA-27-1834-01A-01R-1850-01 c71.9
## TCGA-28-5220-01A-01R-1850-01 c71.9
## TCGA-06-0132-01A-02R-1849-01 c71.9
## created_datetime.x cigarettes_per_day
## <logical> <logical>
## TCGA-28-1753-01A-01R-1850-01 NA NA
## TCGA-06-0749-01A-01R-1849-01 NA NA
## TCGA-28-1747-01C-01R-1850-01 NA NA
## TCGA-27-1834-01A-01R-1850-01 NA NA
## TCGA-28-5220-01A-01R-1850-01 NA NA
## TCGA-06-0132-01A-02R-1849-01 NA NA
## weight updated_datetime.y
## <logical> <character>
## TCGA-28-1753-01A-01R-1850-01 NA 2016-09-02T19:11:19.658754-05:00
## TCGA-06-0749-01A-01R-1849-01 NA 2016-09-02T19:01:55.836912-05:00
## TCGA-28-1747-01C-01R-1850-01 NA 2016-09-02T19:11:33.850859-05:00
## TCGA-27-1834-01A-01R-1850-01 NA 2016-09-02T19:12:57.685339-05:00
## TCGA-28-5220-01A-01R-1850-01 NA 2016-09-02T19:08:48.447713-05:00
## TCGA-06-0132-01A-02R-1849-01 NA 2016-09-02T19:13:19.722175-05:00
## alcohol_history alcohol_intensity bmi
## <logical> <logical> <logical>
## TCGA-28-1753-01A-01R-1850-01 NA NA NA
## TCGA-06-0749-01A-01R-1849-01 NA NA NA
## TCGA-28-1747-01C-01R-1850-01 NA NA NA
## TCGA-27-1834-01A-01R-1850-01 NA NA NA
## TCGA-28-5220-01A-01R-1850-01 NA NA NA
## TCGA-06-0132-01A-02R-1849-01 NA NA NA
## years_smoked height created_datetime.y
## <logical> <logical> <logical>
## TCGA-28-1753-01A-01R-1850-01 NA NA NA
## TCGA-06-0749-01A-01R-1849-01 NA NA NA
## TCGA-28-1747-01C-01R-1850-01 NA NA NA
## TCGA-27-1834-01A-01R-1850-01 NA NA NA
## TCGA-28-5220-01A-01R-1850-01 NA NA NA
## TCGA-06-0132-01A-02R-1849-01 NA NA NA
## state.y
## <logical>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## exposure_id
## <character>
## TCGA-28-1753-01A-01R-1850-01 49a9d8e0-0826-5d48-8f94-5d740a11bb39
## TCGA-06-0749-01A-01R-1849-01 fe432d1c-259e-50d9-93a3-5cb54bce231b
## TCGA-28-1747-01C-01R-1850-01 130208d0-1c81-5e9c-a086-7faaa39534f7
## TCGA-27-1834-01A-01R-1850-01 88e1cef8-191c-59bf-8951-71eb48743b0f
## TCGA-28-5220-01A-01R-1850-01 e557f7ac-95a5-5cfa-9110-304a7cc3f320
## TCGA-06-0132-01A-02R-1849-01 323344b4-7ed5-597a-824c-46d8422a31de
## updated_datetime
## <character>
## TCGA-28-1753-01A-01R-1850-01 2016-09-02T19:11:19.658754-05:00
## TCGA-06-0749-01A-01R-1849-01 2016-09-02T19:01:55.836912-05:00
## TCGA-28-1747-01C-01R-1850-01 2016-09-02T19:11:33.850859-05:00
## TCGA-27-1834-01A-01R-1850-01 2016-09-02T19:12:57.685339-05:00
## TCGA-28-5220-01A-01R-1850-01 2016-09-02T19:08:48.447713-05:00
## TCGA-06-0132-01A-02R-1849-01 2016-09-02T19:13:19.722175-05:00
## created_datetime gender state
## <logical> <character> <logical>
## TCGA-28-1753-01A-01R-1850-01 NA male NA
## TCGA-06-0749-01A-01R-1849-01 NA male NA
## TCGA-28-1747-01C-01R-1850-01 NA male NA
## TCGA-27-1834-01A-01R-1850-01 NA male NA
## TCGA-28-5220-01A-01R-1850-01 NA male NA
## TCGA-06-0132-01A-02R-1849-01 NA male NA
## year_of_birth race
## <integer> <character>
## TCGA-28-1753-01A-01R-1850-01 1956 white
## TCGA-06-0749-01A-01R-1849-01 1958 black or african american
## TCGA-28-1747-01C-01R-1850-01 1964 white
## TCGA-27-1834-01A-01R-1850-01 1950 white
## TCGA-28-5220-01A-01R-1850-01 1943 white
## TCGA-06-0132-01A-02R-1849-01 1958 white
## demographic_id
## <character>
## TCGA-28-1753-01A-01R-1850-01 51ee4685-9d81-5e77-a3d8-73b6fc8f60be
## TCGA-06-0749-01A-01R-1849-01 383ca7c3-aed0-545f-9939-f8092f2894b1
## TCGA-28-1747-01C-01R-1850-01 16269cae-c6f3-5c91-997b-15a045259f11
## TCGA-27-1834-01A-01R-1850-01 d21c4b05-e2d5-5f3f-84f6-1dd0b2c2f9ab
## TCGA-28-5220-01A-01R-1850-01 dcb6fe26-48c9-510f-9b19-bdda92036b66
## TCGA-06-0132-01A-02R-1849-01 1ad48cdc-c76d-5770-af41-1fcefef18284
## ethnicity year_of_death
## <character> <integer>
## TCGA-28-1753-01A-01R-1850-01 hispanic or latino NA
## TCGA-06-0749-01A-01R-1849-01 not hispanic or latino 2008
## TCGA-28-1747-01C-01R-1850-01 not hispanic or latino 2008
## TCGA-27-1834-01A-01R-1850-01 not hispanic or latino 2009
## TCGA-28-5220-01A-01R-1850-01 not reported NA
## TCGA-06-0132-01A-02R-1849-01 not hispanic or latino 2009
## bcr_patient_barcode dbgap_accession_number
## <character> <logical>
## TCGA-28-1753-01A-01R-1850-01 TCGA-28-1753 NA
## TCGA-06-0749-01A-01R-1849-01 TCGA-06-0749 NA
## TCGA-28-1747-01C-01R-1850-01 TCGA-28-1747 NA
## TCGA-27-1834-01A-01R-1850-01 TCGA-27-1834 NA
## TCGA-28-5220-01A-01R-1850-01 TCGA-28-5220 NA
## TCGA-06-0132-01A-02R-1849-01 TCGA-06-0132 NA
## disease_type released state.1
## <character> <logical> <character>
## TCGA-28-1753-01A-01R-1850-01 Glioblastoma Multiforme TRUE legacy
## TCGA-06-0749-01A-01R-1849-01 Glioblastoma Multiforme TRUE legacy
## TCGA-28-1747-01C-01R-1850-01 Glioblastoma Multiforme TRUE legacy
## TCGA-27-1834-01A-01R-1850-01 Glioblastoma Multiforme TRUE legacy
## TCGA-28-5220-01A-01R-1850-01 Glioblastoma Multiforme TRUE legacy
## TCGA-06-0132-01A-02R-1849-01 Glioblastoma Multiforme TRUE legacy
## primary_site project_id
## <character> <character>
## TCGA-28-1753-01A-01R-1850-01 Brain TCGA-GBM
## TCGA-06-0749-01A-01R-1849-01 Brain TCGA-GBM
## TCGA-28-1747-01C-01R-1850-01 Brain TCGA-GBM
## TCGA-27-1834-01A-01R-1850-01 Brain TCGA-GBM
## TCGA-28-5220-01A-01R-1850-01 Brain TCGA-GBM
## TCGA-06-0132-01A-02R-1849-01 Brain TCGA-GBM
## name subtype_patient
## <character> <factor>
## TCGA-28-1753-01A-01R-1850-01 Glioblastoma Multiforme TCGA-28-1753
## TCGA-06-0749-01A-01R-1849-01 Glioblastoma Multiforme TCGA-06-0749
## TCGA-28-1747-01C-01R-1850-01 Glioblastoma Multiforme TCGA-28-1747
## TCGA-27-1834-01A-01R-1850-01 Glioblastoma Multiforme TCGA-27-1834
## TCGA-28-5220-01A-01R-1850-01 Glioblastoma Multiforme TCGA-28-5220
## TCGA-06-0132-01A-02R-1849-01 Glioblastoma Multiforme TCGA-06-0132
## subtype_Tissue.source.site
## <factor>
## TCGA-28-1753-01A-01R-1850-01 Cedars Sinai
## TCGA-06-0749-01A-01R-1849-01 Henry Ford Hospital
## TCGA-28-1747-01C-01R-1850-01 Cedars Sinai
## TCGA-27-1834-01A-01R-1850-01 Milan - Italy, Fondazione IRCCS Instituto Neuroligico C. Besta
## TCGA-28-5220-01A-01R-1850-01 Cedars Sinai
## TCGA-06-0132-01A-02R-1849-01 Henry Ford Hospital
## subtype_Study subtype_BCR
## <factor> <factor>
## TCGA-28-1753-01A-01R-1850-01 Glioblastoma multiforme IGC
## TCGA-06-0749-01A-01R-1849-01 Glioblastoma multiforme IGC
## TCGA-28-1747-01C-01R-1850-01 Glioblastoma multiforme IGC
## TCGA-27-1834-01A-01R-1850-01 Glioblastoma multiforme IGC
## TCGA-28-5220-01A-01R-1850-01 Glioblastoma multiforme IGC
## TCGA-06-0132-01A-02R-1849-01 Glioblastoma multiforme IGC
## subtype_Whole.exome subtype_Whole.genome
## <factor> <factor>
## TCGA-28-1753-01A-01R-1850-01 Yes No
## TCGA-06-0749-01A-01R-1849-01 Yes No
## TCGA-28-1747-01C-01R-1850-01 Yes No
## TCGA-27-1834-01A-01R-1850-01 Yes No
## TCGA-28-5220-01A-01R-1850-01 Yes No
## TCGA-06-0132-01A-02R-1849-01 Yes No
## subtype_RNAseq subtype_SNP6 subtype_U133a
## <factor> <factor> <factor>
## TCGA-28-1753-01A-01R-1850-01 Yes Yes Yes
## TCGA-06-0749-01A-01R-1849-01 Yes Yes Yes
## TCGA-28-1747-01C-01R-1850-01 Yes Yes Yes
## TCGA-27-1834-01A-01R-1850-01 Yes Yes Yes
## TCGA-28-5220-01A-01R-1850-01 Yes Yes Yes
## TCGA-06-0132-01A-02R-1849-01 Yes Yes Yes
## subtype_HM450 subtype_HM27 subtype_RPPA
## <factor> <factor> <factor>
## TCGA-28-1753-01A-01R-1850-01 No Yes Yes
## TCGA-06-0749-01A-01R-1849-01 No No No
## TCGA-28-1747-01C-01R-1850-01 No Yes Yes
## TCGA-27-1834-01A-01R-1850-01 No Yes Yes
## TCGA-28-5220-01A-01R-1850-01 Yes No No
## TCGA-06-0132-01A-02R-1849-01 No No No
## subtype_Histology subtype_Grade
## <factor> <factor>
## TCGA-28-1753-01A-01R-1850-01 glioblastoma G4
## TCGA-06-0749-01A-01R-1849-01 glioblastoma G4
## TCGA-28-1747-01C-01R-1850-01 glioblastoma G4
## TCGA-27-1834-01A-01R-1850-01 glioblastoma G4
## TCGA-28-5220-01A-01R-1850-01 glioblastoma G4
## TCGA-06-0132-01A-02R-1849-01 glioblastoma G4
## subtype_Age..years.at.diagnosis.
## <integer>
## TCGA-28-1753-01A-01R-1850-01 53
## TCGA-06-0749-01A-01R-1849-01 50
## TCGA-28-1747-01C-01R-1850-01 44
## TCGA-27-1834-01A-01R-1850-01 56
## TCGA-28-5220-01A-01R-1850-01 67
## TCGA-06-0132-01A-02R-1849-01 49
## subtype_Gender subtype_Survival..months.
## <factor> <numeric>
## TCGA-28-1753-01A-01R-1850-01 male 1.215631
## TCGA-06-0749-01A-01R-1849-01 male 2.694102
## TCGA-28-1747-01C-01R-1850-01 male 2.529827
## TCGA-27-1834-01A-01R-1850-01 male 40.510092
## TCGA-28-5220-01A-01R-1850-01 male 10.480713
## TCGA-06-0132-01A-02R-1849-01 male 25.331128
## subtype_Vital.status..1.dead.
## <integer>
## TCGA-28-1753-01A-01R-1850-01 0
## TCGA-06-0749-01A-01R-1849-01 1
## TCGA-28-1747-01C-01R-1850-01 1
## TCGA-27-1834-01A-01R-1850-01 1
## TCGA-28-5220-01A-01R-1850-01 0
## TCGA-06-0132-01A-02R-1849-01 1
## subtype_Karnofsky.Performance.Score
## <integer>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 80
## TCGA-28-5220-01A-01R-1850-01 90
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_Mutation.Count
## <integer>
## TCGA-28-1753-01A-01R-1850-01 45
## TCGA-06-0749-01A-01R-1849-01 49
## TCGA-28-1747-01C-01R-1850-01 41
## TCGA-27-1834-01A-01R-1850-01 41
## TCGA-28-5220-01A-01R-1850-01 35
## TCGA-06-0132-01A-02R-1849-01 23
## subtype_Percent.aneuploidy subtype_IDH.status
## <numeric> <factor>
## TCGA-28-1753-01A-01R-1850-01 0.1271441 WT
## TCGA-06-0749-01A-01R-1849-01 0.1562720 WT
## TCGA-28-1747-01C-01R-1850-01 0.1091594 WT
## TCGA-27-1834-01A-01R-1850-01 0.5039493 WT
## TCGA-28-5220-01A-01R-1850-01 0.6261002 WT
## TCGA-06-0132-01A-02R-1849-01 NA WT
## subtype_X1p.19q.codeletion
## <factor>
## TCGA-28-1753-01A-01R-1850-01 non-codel
## TCGA-06-0749-01A-01R-1849-01 non-codel
## TCGA-28-1747-01C-01R-1850-01 non-codel
## TCGA-27-1834-01A-01R-1850-01 non-codel
## TCGA-28-5220-01A-01R-1850-01 non-codel
## TCGA-06-0132-01A-02R-1849-01 non-codel
## subtype_IDH.codel.subtype
## <factor>
## TCGA-28-1753-01A-01R-1850-01 IDHwt
## TCGA-06-0749-01A-01R-1849-01 IDHwt
## TCGA-28-1747-01C-01R-1850-01 IDHwt
## TCGA-27-1834-01A-01R-1850-01 IDHwt
## TCGA-28-5220-01A-01R-1850-01 IDHwt
## TCGA-06-0132-01A-02R-1849-01 IDHwt
## subtype_MGMT.promoter.status
## <factor>
## TCGA-28-1753-01A-01R-1850-01 Unmethylated
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 Methylated
## TCGA-27-1834-01A-01R-1850-01 Methylated
## TCGA-28-5220-01A-01R-1850-01 Unmethylated
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_Chr.7.gain.Chr.10.loss
## <factor>
## TCGA-28-1753-01A-01R-1850-01 Gain chr 7 & loss chr 10
## TCGA-06-0749-01A-01R-1849-01 Gain chr 7 & loss chr 10
## TCGA-28-1747-01C-01R-1850-01 No combined CNA
## TCGA-27-1834-01A-01R-1850-01 Gain chr 7 & loss chr 10
## TCGA-28-5220-01A-01R-1850-01 Gain chr 7 & loss chr 10
## TCGA-06-0132-01A-02R-1849-01 No combined CNA
## subtype_Chr.19.20.co.gain
## <factor>
## TCGA-28-1753-01A-01R-1850-01 No chr 19/20 gain
## TCGA-06-0749-01A-01R-1849-01 No chr 19/20 gain
## TCGA-28-1747-01C-01R-1850-01 No chr 19/20 gain
## TCGA-27-1834-01A-01R-1850-01 No chr 19/20 gain
## TCGA-28-5220-01A-01R-1850-01 No chr 19/20 gain
## TCGA-06-0132-01A-02R-1849-01 No chr 19/20 gain
## subtype_TERT.promoter.status
## <factor>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_TERT.expression..log2.
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 5.727920
## TCGA-06-0749-01A-01R-1849-01 2.584963
## TCGA-28-1747-01C-01R-1850-01 5.554589
## TCGA-27-1834-01A-01R-1850-01 1.000000
## TCGA-28-5220-01A-01R-1850-01 3.700440
## TCGA-06-0132-01A-02R-1849-01 0.000000
## subtype_TERT.expression.status
## <factor>
## TCGA-28-1753-01A-01R-1850-01 Expressed
## TCGA-06-0749-01A-01R-1849-01 Expressed
## TCGA-28-1747-01C-01R-1850-01 Expressed
## TCGA-27-1834-01A-01R-1850-01 Not expressed
## TCGA-28-5220-01A-01R-1850-01 Expressed
## TCGA-06-0132-01A-02R-1849-01 Not expressed
## subtype_ATRX.status subtype_DAXX.status
## <factor> <factor>
## TCGA-28-1753-01A-01R-1850-01 WT WT
## TCGA-06-0749-01A-01R-1849-01 WT WT
## TCGA-28-1747-01C-01R-1850-01 WT WT
## TCGA-27-1834-01A-01R-1850-01 WT WT
## TCGA-28-5220-01A-01R-1850-01 WT WT
## TCGA-06-0132-01A-02R-1849-01 WT WT
## subtype_Telomere.Maintenance
## <factor>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_BRAF.V600E.status
## <factor>
## TCGA-28-1753-01A-01R-1850-01 WT
## TCGA-06-0749-01A-01R-1849-01 WT
## TCGA-28-1747-01C-01R-1850-01 WT
## TCGA-27-1834-01A-01R-1850-01 WT
## TCGA-28-5220-01A-01R-1850-01 WT
## TCGA-06-0132-01A-02R-1849-01 WT
## subtype_BRAF.KIAA1549.fusion
## <factor>
## TCGA-28-1753-01A-01R-1850-01 WT
## TCGA-06-0749-01A-01R-1849-01 WT
## TCGA-28-1747-01C-01R-1850-01 WT
## TCGA-27-1834-01A-01R-1850-01 WT
## TCGA-28-5220-01A-01R-1850-01 WT
## TCGA-06-0132-01A-02R-1849-01 WT
## subtype_ABSOLUTE.purity
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 0.62
## TCGA-06-0749-01A-01R-1849-01 0.57
## TCGA-28-1747-01C-01R-1850-01 0.80
## TCGA-27-1834-01A-01R-1850-01 0.76
## TCGA-28-5220-01A-01R-1850-01 0.89
## TCGA-06-0132-01A-02R-1849-01 0.27
## subtype_ABSOLUTE.ploidy
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 2.00
## TCGA-06-0749-01A-01R-1849-01 1.95
## TCGA-28-1747-01C-01R-1850-01 1.96
## TCGA-27-1834-01A-01R-1850-01 1.91
## TCGA-28-5220-01A-01R-1850-01 1.83
## TCGA-06-0132-01A-02R-1849-01 1.95
## subtype_ESTIMATE.stromal.score
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 2509.6100
## TCGA-06-0749-01A-01R-1849-01 981.3623
## TCGA-28-1747-01C-01R-1850-01 850.3133
## TCGA-27-1834-01A-01R-1850-01 1005.6140
## TCGA-28-5220-01A-01R-1850-01 440.0687
## TCGA-06-0132-01A-02R-1849-01 2673.8230
## subtype_ESTIMATE.immune.score
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 2893.989
## TCGA-06-0749-01A-01R-1849-01 2495.874
## TCGA-28-1747-01C-01R-1850-01 2574.148
## TCGA-27-1834-01A-01R-1850-01 2487.626
## TCGA-28-5220-01A-01R-1850-01 1904.311
## TCGA-06-0132-01A-02R-1849-01 3265.700
## subtype_ESTIMATE.combined.score
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 5403.599
## TCGA-06-0749-01A-01R-1849-01 3477.237
## TCGA-28-1747-01C-01R-1850-01 3424.462
## TCGA-27-1834-01A-01R-1850-01 3493.239
## TCGA-28-5220-01A-01R-1850-01 2344.380
## TCGA-06-0132-01A-02R-1849-01 5939.523
## subtype_Original.Subtype
## <factor>
## TCGA-28-1753-01A-01R-1850-01 Mesenchymal
## TCGA-06-0749-01A-01R-1849-01 Neural
## TCGA-28-1747-01C-01R-1850-01 Classical
## TCGA-27-1834-01A-01R-1850-01 Neural
## TCGA-28-5220-01A-01R-1850-01 Classical
## TCGA-06-0132-01A-02R-1849-01 Neural
## subtype_Transcriptome.Subtype
## <factor>
## TCGA-28-1753-01A-01R-1850-01 ME
## TCGA-06-0749-01A-01R-1849-01 NE
## TCGA-28-1747-01C-01R-1850-01 CL
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NE
## subtype_Pan.Glioma.RNA.Expression.Cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 LGr4
## TCGA-06-0749-01A-01R-1849-01 LGr4
## TCGA-28-1747-01C-01R-1850-01 LGr4
## TCGA-27-1834-01A-01R-1850-01 LGr4
## TCGA-28-5220-01A-01R-1850-01 LGr4
## TCGA-06-0132-01A-02R-1849-01 LGr2
## subtype_IDH.specific.RNA.Expression.Cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 IDHwt-R3
## TCGA-06-0749-01A-01R-1849-01 IDHwt-R4
## TCGA-28-1747-01C-01R-1850-01 IDHwt-R3
## TCGA-27-1834-01A-01R-1850-01 IDHwt-R2
## TCGA-28-5220-01A-01R-1850-01 IDHwt-R2
## TCGA-06-0132-01A-02R-1849-01 IDHwt-R4
## subtype_Pan.Glioma.DNA.Methylation.Cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 LGm5
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 LGm5
## TCGA-27-1834-01A-01R-1850-01 LGm5
## TCGA-28-5220-01A-01R-1850-01 LGm5
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_IDH.specific.DNA.Methylation.Cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 IDHwt-K2
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 IDHwt-K2
## TCGA-27-1834-01A-01R-1850-01 IDHwt-K2
## TCGA-28-5220-01A-01R-1850-01 IDHwt-K2
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_Supervised.DNA.Methylation.Cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 Mesenchymal-like
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 Mesenchymal-like
## TCGA-27-1834-01A-01R-1850-01 Mesenchymal-like
## TCGA-28-5220-01A-01R-1850-01 Mesenchymal-like
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_Random.Forest.Sturm.Cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 RTK II 'Classic'
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_RPPA.cluster
## <factor>
## TCGA-28-1753-01A-01R-1850-01 K1
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 K1
## TCGA-27-1834-01A-01R-1850-01 K1
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_Telomere.length.estimate.in.blood.normal..Kb.
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
## subtype_Telomere.length.estimate.in.tumor..Kb.
## <numeric>
## TCGA-28-1753-01A-01R-1850-01 NA
## TCGA-06-0749-01A-01R-1849-01 NA
## TCGA-28-1747-01C-01R-1850-01 NA
## TCGA-27-1834-01A-01R-1850-01 NA
## TCGA-28-5220-01A-01R-1850-01 NA
## TCGA-06-0132-01A-02R-1849-01 NA
The clinical data can be obtained using TCGAbiolinks through two methods. The first one will download only the indexed GDC clinical data which includes diagnoses (vital status, days to death, age at diagnosis, days to last follow up, days to recurrence), treatments (days to treatment, treatment id, therapeutic agents, treatment intent type), demographic (gender, race, ethnicity) and exposures (cigarettes per day, weight, height, alcohol history) information. This indexed clinical data can be obtained using the function GDCquery_clinical which can be used as described in listing below. This function has two arguments project (“TCGA-GBM”,“TARGET-AML”,etc) and type (“Clinical” or “Biospecimen”). The second method will download the xml files with all clinical data for the patient and retrieve the desired information from it. This will give access to all clinical data available which includes patient (tumor tissue site, histological type, gender, vital status, days to birth, days to last follow up, etc), drug (days to drug therapy start, days to drug therapy end, therapy types, drug name), radiation (days to radiation therapy start, days to radiation therapy end, radiation type, radiation dosage ), new tumor event (days to new tumor event after initial treatment, new neoplasm event type, additional pharmaceutical therapy), follow up (primary therapy outcome success, follow up treatment success, vital status, days to last follow up, date of form completion), stage event (pathologic stage, tnm categories), admin (batch number, project code, disease code, Biospecimen Core Resource).
# get indexed clinical patient data for GBM samples
gbm_clin <- GDCquery_clinic(project = "TCGA-GBM", type = "Clinical")
# get indexed clinical patient data for LGG samples
lgg_clin <- GDCquery_clinic(project = "TCGA-LGG", type = "Clinical")
# Bind the results, as the columns might not be the same,
# we will will plyr rbind.fill, to have all columns from both files
clinical <- plyr::rbind.fill(gbm_clin,lgg_clin)
head(clinical)
## submitter_id classification_of_tumor last_known_disease_status
## 1 TCGA-02-0001 not reported not reported
## 2 TCGA-02-0003 not reported not reported
## 3 TCGA-02-0004 not reported not reported
## 4 TCGA-02-0006 not reported not reported
## 5 TCGA-02-0007 not reported not reported
## 6 TCGA-02-0009 not reported not reported
## updated_datetime primary_diagnosis tumor_stage
## 1 2017-03-04T16:44:35.784223-06:00 c71.9 not reported
## 2 2017-03-04T16:44:35.784223-06:00 c71.9 not reported
## 3 2017-03-04T16:44:35.784223-06:00 c71.9 not reported
## 4 2017-03-04T16:44:35.784223-06:00 c71.9 not reported
## 5 2017-03-04T16:44:35.784223-06:00 c71.9 not reported
## 6 2017-03-04T16:44:35.784223-06:00 c71.9 not reported
## age_at_diagnosis vital_status morphology days_to_death
## 1 16179 dead 9440/3 358
## 2 18341 dead 9440/3 144
## 3 21617 dead 9440/3 345
## 4 20516 dead 9440/3 558
## 5 14806 dead 9440/3 705
## 6 22457 dead 9440/3 322
## days_to_last_known_disease_status created_datetime state
## 1 NA NA live
## 2 NA NA live
## 3 NA NA live
## 4 NA NA live
## 5 NA NA live
## 6 NA NA live
## days_to_recurrence diagnosis_id tumor_grade
## 1 NA e99b166c-a599-5d2e-9534-f660e3b18a84 not reported
## 2 NA 6b9e2d98-e6cd-5c0b-8e38-1a851c07ac64 not reported
## 3 NA 7a2d4e0b-7492-5f85-ba22-311d363d97e7 not reported
## 4 NA 4023ae86-3936-5648-90ec-e7cdc64ee012 not reported
## 5 NA bbdde045-6918-52da-8a68-00a350b73c51 not reported
## 6 NA 5eea7161-1261-5481-a941-b89eb1c4cf70 not reported
## tissue_or_organ_of_origin days_to_birth progression_or_recurrence
## 1 c71.9 -16179 not reported
## 2 c71.9 -18341 not reported
## 3 c71.9 -21617 not reported
## 4 c71.9 -20516 not reported
## 5 c71.9 -14806 not reported
## 6 c71.9 -22457 not reported
## prior_malignancy site_of_resection_or_biopsy days_to_last_follow_up
## 1 not reported c71.9 279
## 2 not reported c71.9 144
## 3 not reported c71.9 345
## 4 not reported c71.9 558
## 5 not reported c71.9 705
## 6 not reported c71.9 322
## cigarettes_per_day weight alcohol_history alcohol_intensity bmi
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## years_smoked exposure_id height gender
## 1 NA 5a04d603-02a9-527e-b4da-24e450cbdf87 NA female
## 2 NA ad77c843-36de-58bb-83d7-37437cb2e7f2 NA male
## 3 NA f836e994-6e80-5ff1-9fcc-1c2c320391da NA male
## 4 NA 3e2c7703-2ed3-5555-9c1d-0301c684c7a5 NA female
## 5 NA 7ca8fa74-2ff8-562a-94ae-682a73a31902 NA female
## 6 NA 8b18c7a4-a6ea-5cd4-91b0-634d62400b45 NA female
## year_of_birth race demographic_id
## 1 1958 white 3a9037c5-e664-542c-9306-2201e017e523
## 2 1953 white b3d375be-0fe4-527a-8b7e-a537551a61a5
## 3 1943 white 10aedf8e-4e75-54ca-adeb-a0f792b95d8e
## 4 1946 white 8a707557-d498-5ff6-9700-f2625b944b95
## 5 1962 white a167b83d-15ec-5061-b53b-fb62e672db3c
## 6 1942 white 9ec62946-28d0-5316-888e-c096184e797a
## ethnicity year_of_death therapeutic_agents
## 1 not hispanic or latino 2002 NA
## 2 not hispanic or latino 2003 NA
## 3 not hispanic or latino 2002 NA
## 4 not hispanic or latino 2003 NA
## 5 not hispanic or latino 2003 NA
## 6 not hispanic or latino 2003 NA
## treatment_id days_to_treatment
## 1 17db47ee-d5ad-53e4-a686-d28b5412c1ec NA
## 2 75c15d67-5fb5-5cae-8712-bbfb14d4de2d NA
## 3 fd026dcf-374c-5c56-9a6b-3c4c37fa6199 NA
## 4 ca94eeb8-8ead-5268-ac42-08fd1b5163b2 NA
## 5 5cbaf906-94c7-5396-b6b5-e88c18b9817a NA
## 6 c4622b4d-bed7-509b-b507-50210e438303 NA
## treatment_intent_type treatment_or_therapy bcr_patient_barcode disease
## 1 NA NA TCGA-02-0001 GBM
## 2 NA NA TCGA-02-0003 GBM
## 3 NA NA TCGA-02-0004 GBM
## 4 NA NA TCGA-02-0006 GBM
## 5 NA NA TCGA-02-0007 GBM
## 6 NA NA TCGA-02-0009 GBM
# Fetch clinical data directly from the clinical XML files.
# if barcode is not set, it will consider all samples.
# We only set it to make the example faster
query <- GDCquery(project = "TCGA-GBM",
data.category = "Clinical",
barcode = c("TCGA-08-0516","TCGA-02-0317"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
head(clinical)
## additional_studies tumor_tissue_site histological_type
## 1 NA Brain Untreated primary (de novo) GBM
## 2 NA Brain Untreated primary (de novo) GBM
## prior_glioma gender vital_status days_to_birth days_to_death
## 1 NO MALE Dead -5309 596
## 2 NO MALE Dead -14636 372
## days_to_last_followup race_list bcr_patient_barcode tissue_source_site
## 1 561 WHITE TCGA-08-0516 8
## 2 372 WHITE TCGA-02-0317 2
## patient_id bcr_patient_uuid
## 1 516 b0ebaa85-476c-4f44-8b23-5bab107454a9
## 2 317 6879e9da-dacf-48b3-8e95-760796f96a1b
## informed_consent_verified icd_o_3_site icd_o_3_histology icd_10
## 1 YES C71.9 9440/3 C71.9
## 2 YES C71.9 9440/3 C71.9
## tissue_prospective_collection_indicator
## 1 NA
## 2 NA
## tissue_retrospective_collection_indicator
## 1 NA
## 2 NA
## days_to_initial_pathologic_diagnosis age_at_initial_pathologic_diagnosis
## 1 0 14
## 2 0 40
## year_of_initial_pathologic_diagnosis person_neoplasm_cancer_status
## 1 2001 WITH TUMOR
## 2 1994 WITH TUMOR
## day_of_form_completion month_of_form_completion year_of_form_completion
## 1 11 12 2008
## 2 17 12 2008
## ethnicity other_dx history_of_neoadjuvant_treatment
## 1 NOT HISPANIC OR LATINO NA No
## 2 NOT HISPANIC OR LATINO NA No
## initial_pathologic_diagnosis_method init_pathology_dx_method_other
## 1 Tumor resection NA
## 2 Tumor resection NA
## anatomic_neoplasm_subdivision radiation_therapy postoperative_rx_tx
## 1 NA NA NA
## 2 NA NA NA
## primary_therapy_outcome_success karnofsky_performance_score
## 1 NA NA
## 2 NA 80
## eastern_cancer_oncology_group performance_status_scale_timing
## 1 NA NA
## 2 NA NA
## has_new_tumor_events_information has_drugs_information
## 1 NO NO
## 2 NO YES
## has_radiations_information has_follow_ups_information
## 1 YES YES
## 2 YES YES
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
head(clinical.drug)
## tx_on_clinical_trial regimen_number bcr_drug_barcode
## 1 NA NA TCGA-02-0317-D22476
## bcr_drug_uuid total_dose total_dose_units
## 1 91d4a74b-253a-403c-a8b0-d11fe903c32b NA NA
## prescribed_dose prescribed_dose_units number_cycles
## 1 NA NA NA
## days_to_drug_therapy_start days_to_drug_therapy_end therapy_types
## 1 23 176 Chemotherapy
## drug_name clinical_trail_drug_classification regimen_indication
## 1 anti neopastons NA ADJUVANT
## regimen_indication_notes route_of_administrations therapy_ongoing
## 1 NA NA NA
## measure_of_response day_of_form_completion month_of_form_completion
## 1 NA 27 3
## year_of_form_completion bcr_patient_barcode
## 1 2009 TCGA-02-0317
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
head(clinical.radiation)
## bcr_radiation_barcode bcr_radiation_uuid
## 1 TCGA-08-0516-R23304 e7390ac0-7d2f-4452-83d8-304919b45c17
## 2 TCGA-08-0516-R23305 50499afc-f3d0-4194-a2e5-a15271c22251
## 3 TCGA-02-0317-R22477 264bf3d4-7479-4d24-814b-0188b17a7526
## days_to_radiation_therapy_start days_to_radiation_therapy_end
## 1 10 40
## 2 171 171
## 3 NA NA
## radiation_type radiation_type_notes radiation_dosage units numfractions
## 1 EXTERNAL BEAM NA NA NA NA
## 2 IMPLANTS NA NA NA NA
## 3 EXTERNAL BEAM NA NA NA NA
## anatomic_treatment_site regimen_indication regimen_indication_notes
## 1 Primary Tumor Field ADJUVANT NA
## 2 PROGRESSION NA
## 3 Primary Tumor Field ADJUVANT NA
## radiation_treatment_ongoing course_number measure_of_response
## 1 NO NA NA
## 2 NO NA NA
## 3 NA NA
## day_of_form_completion month_of_form_completion year_of_form_completion
## 1 11 12 2008
## 2 11 12 2008
## 3 27 3 2009
## bcr_patient_barcode
## 1 TCGA-08-0516
## 2 TCGA-08-0516
## 3 TCGA-02-0317
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")
head(clinical.admin)
## bcr file_uuid
## 1 Nationwide Children's Hospital 8C65AA78-D60C-4367-92F6-9BAD27E33AFC
## 2 Nationwide Children's Hospital 21DE62FD-3CD7-413D-8F3B-E76DE3F138B3
## batch_number project_code disease_code day_of_dcc_upload
## 1 5.89.0 TCGA GBM 31
## 2 5.89.0 TCGA GBM 31
## month_of_dcc_upload year_of_dcc_upload patient_withdrawal
## 1 3 2016 false
## 2 3 2016 false
## bcr_patient_barcode
## 1 TCGA-08-0516
## 2 TCGA-02-0317
Mutation information is stored in two types of Mutation Annotation Format (MAF): Protected and Somatic (or Public) MAF files, which are derived from the GDC annotated VCF files. Annotated VCF files often have variants reported on multiple transcripts whereas the protected MAF (*protected.maf) only reports the most critically affected one and the Somatic MAFs (*somatic.maf) are further processed to remove low quality and potential germline variants. To download Somatic MAFs data using TCGAbiolinks, GDCquery_maf function is provided (see listing below).
LGGmut <- GDCquery_Maf(tumor = "LGG", pipelines = "mutect2")
data(mafMutect2LGGGBM)
head(LGGmut)
## # A tibble: 6 x 114
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
## <chr> <int> <chr> <chr> <chr> <int>
## 1 FBN2 2201 BI GRCh38 chr5 128305070
## 2 RP11-368M16.8 0 BI GRCh38 chr7 57653425
## 3 LIMK1 3984 BI GRCh38 chr7 74105957
## 4 PDGFRL 5157 BI GRCh38 chr8 17589503
## 5 ACER2 340485 BI GRCh38 chr9 19435026
## 6 KCNT1 57582 BI GRCh38 chr9 135784546
## # ... with 108 more variables: End_Position <int>, Strand <chr>,
## # Variant_Classification <chr>, Variant_Type <chr>,
## # Reference_Allele <chr>, Tumor_Seq_Allele1 <chr>,
## # Tumor_Seq_Allele2 <chr>, dbSNP_RS <chr>, dbSNP_Val_Status <chr>,
## # Tumor_Sample_Barcode <chr>, Matched_Norm_Sample_Barcode <chr>,
## # Match_Norm_Seq_Allele1 <chr>, Match_Norm_Seq_Allele2 <chr>,
## # Tumor_Validation_Allele1 <chr>, Tumor_Validation_Allele2 <chr>,
## # Match_Norm_Validation_Allele1 <chr>,
## # Match_Norm_Validation_Allele2 <chr>, Verification_Status <chr>,
## # Validation_Status <chr>, Mutation_Status <chr>,
## # Sequencing_Phase <chr>, Sequence_Source <chr>,
## # Validation_Method <chr>, Score <chr>, BAM_File <chr>, Sequencer <chr>,
## # Tumor_Sample_UUID <chr>, Matched_Norm_Sample_UUID <chr>, HGVSc <chr>,
## # HGVSp <chr>, HGVSp_Short <chr>, Transcript_ID <chr>,
## # Exon_Number <chr>, t_depth <int>, t_ref_count <int>,
## # t_alt_count <int>, n_depth <int>, n_ref_count <chr>,
## # n_alt_count <chr>, all_effects <chr>, Allele <chr>, Gene <chr>,
## # Feature <chr>, Feature_type <chr>, Consequence <chr>,
## # cDNA_position <chr>, CDS_position <chr>, Protein_position <chr>,
## # Amino_acids <chr>, Codons <chr>, Existing_variation <chr>,
## # ALLELE_NUM <int>, DISTANCE <chr>, TRANSCRIPT_STRAND <int>,
## # SYMBOL <chr>, SYMBOL_SOURCE <chr>, HGNC_ID <chr>, BIOTYPE <chr>,
## # CANONICAL <chr>, CCDS <chr>, ENSP <chr>, SWISSPROT <chr>,
## # TREMBL <chr>, UNIPARC <chr>, RefSeq <chr>, SIFT <chr>, PolyPhen <chr>,
## # EXON <chr>, INTRON <chr>, DOMAINS <chr>, GMAF <chr>, AFR_MAF <chr>,
## # AMR_MAF <chr>, ASN_MAF <chr>, EAS_MAF <chr>, EUR_MAF <chr>,
## # SAS_MAF <chr>, AA_MAF <chr>, EA_MAF <chr>, CLIN_SIG <chr>,
## # SOMATIC <dbl>, PUBMED <dbl>, MOTIF_NAME <chr>, MOTIF_POS <chr>,
## # HIGH_INF_POS <chr>, MOTIF_SCORE_CHANGE <chr>, IMPACT <chr>,
## # PICK <int>, VARIANT_CLASS <chr>, TSL <int>, HGVS_OFFSET <int>,
## # PHENO <dbl>, MINIMISED <int>, ExAC_AF <dbl>, ExAC_AF_AFR <int>,
## # ExAC_AF_AMR <int>, ExAC_AF_EAS <int>, ExAC_AF_FIN <int>,
## # ExAC_AF_NFE <dbl>, ExAC_AF_OTH <int>, ...
Finally, the Cancer Genome Atlas (TCGA) Research Network has reported integrated genome-wide studies of various diseases (ACC (Zheng et al. 2016), BRCA (C. G. A. Network and others 2012b), COAD (C. G. A. Network and others 2012a), GBM (Ceccarelli et al. 2016), HNSC (C. G. A. Network and others 2015a), KICH (Davis et al. 2014), KIRC (Network and others 2013), KIRP (Network and others 2016), LGG (Ceccarelli et al. 2016), LUAD (Network and others 2014c), LUSC (Network and others 2012), PRAD (Network and others 2015), READ (C. G. A. Network and others 2012a), SKCM (C. G. A. Network and others 2015b), STAD (Network and others 2014a), THCA (Network and others 2014d) and UCEC (Network and others 2014b)) which classified them in different subtypes. This classification can be retrieved using the TCGAquery_subtype function or by accessing the samples information in the SummarizedExperiment object that created by the GDCprepare function, which automatically incorporates it into the object.
gbm.subtypes <- TCGAquery_subtype(tumor = "gbm")
head(gbm.subtypes)
## patient Tissue.source.site Study BCR
## 517 TCGA-02-0001 MD Anderson Cancer Center Glioblastoma multiforme IGC
## 518 TCGA-02-0003 MD Anderson Cancer Center Glioblastoma multiforme IGC
## 519 TCGA-02-0004 MD Anderson Cancer Center Glioblastoma multiforme IGC
## 520 TCGA-02-0006 MD Anderson Cancer Center Glioblastoma multiforme IGC
## 521 TCGA-02-0007 MD Anderson Cancer Center Glioblastoma multiforme IGC
## 522 TCGA-02-0009 MD Anderson Cancer Center Glioblastoma multiforme IGC
## Whole.exome Whole.genome RNAseq SNP6 U133a HM450 HM27 RPPA
## 517 No No No Yes Yes No Yes No
## 518 Yes No No Yes Yes No Yes Yes
## 519 No No No No Yes No No Yes
## 520 No No No Yes No No Yes No
## 521 No No No Yes Yes No Yes No
## 522 No No No Yes Yes No Yes No
## Histology Grade Age..years.at.diagnosis. Gender Survival..months.
## 517 glioblastoma G4 44 female 11.762054
## 518 glioblastoma G4 50 male 4.731106
## 519 glioblastoma G4 59 male 11.334941
## 520 glioblastoma G4 56 female 18.333034
## 521 glioblastoma G4 40 female 23.162705
## 522 glioblastoma G4 61 female 10.579278
## Vital.status..1.dead. Karnofsky.Performance.Score Mutation.Count
## 517 1 80 NA
## 518 1 100 52
## 519 1 80 NA
## 520 1 80 NA
## 521 1 80 NA
## 522 1 80 NA
## Percent.aneuploidy IDH.status X1p.19q.codeletion IDH.codel.subtype
## 517 0.4249000 WT non-codel IDHwt
## 518 0.1479882 WT non-codel IDHwt
## 519 0.1178814 WT <NA> <NA>
## 520 0.2390546 WT non-codel IDHwt
## 521 0.2574537 WT non-codel IDHwt
## 522 0.1596193 WT non-codel IDHwt
## MGMT.promoter.status Chr.7.gain.Chr.10.loss Chr.19.20.co.gain
## 517 Unmethylated Gain chr 7 & loss chr 10 No chr 19/20 gain
## 518 Unmethylated Gain chr 7 & loss chr 10 No chr 19/20 gain
## 519 <NA> <NA> <NA>
## 520 Unmethylated Gain chr 7 & loss chr 10 No chr 19/20 gain
## 521 Unmethylated No combined CNA No chr 19/20 gain
## 522 Unmethylated Gain chr 7 & loss chr 10 No chr 19/20 gain
## TERT.promoter.status TERT.expression..log2. TERT.expression.status
## 517 <NA> NA <NA>
## 518 <NA> NA <NA>
## 519 <NA> NA <NA>
## 520 <NA> NA <NA>
## 521 <NA> NA <NA>
## 522 <NA> NA <NA>
## ATRX.status DAXX.status Telomere.Maintenance BRAF.V600E.status
## 517 <NA> <NA> <NA> <NA>
## 518 WT WT <NA> WT
## 519 <NA> <NA> <NA> <NA>
## 520 <NA> <NA> <NA> <NA>
## 521 <NA> <NA> <NA> <NA>
## 522 <NA> <NA> <NA> <NA>
## BRAF.KIAA1549.fusion ABSOLUTE.purity ABSOLUTE.ploidy
## 517 <NA> 0.49 3.47
## 518 <NA> 0.90 1.96
## 519 <NA> NA NA
## 520 <NA> 0.60 1.93
## 521 <NA> 0.93 1.89
## 522 <NA> 0.90 1.99
## ESTIMATE.stromal.score ESTIMATE.immune.score ESTIMATE.combined.score
## 517 NA NA NA
## 518 NA NA NA
## 519 NA NA NA
## 520 NA NA NA
## 521 NA NA NA
## 522 NA NA NA
## Original.Subtype Transcriptome.Subtype
## 517 Classical NE
## 518 Proneural PN
## 519 Mesenchymal ME
## 520 Mesenchymal ME
## 521 Proneural <NA>
## 522 Classical CL
## Pan.Glioma.RNA.Expression.Cluster IDH.specific.RNA.Expression.Cluster
## 517 LGr4 <NA>
## 518 LGr4 <NA>
## 519 LGr4 <NA>
## 520 <NA> <NA>
## 521 unclassified <NA>
## 522 LGr4 <NA>
## Pan.Glioma.DNA.Methylation.Cluster
## 517 LGm5
## 518 LGm5
## 519 <NA>
## 520 LGm5
## 521 LGm4
## 522 LGm4
## IDH.specific.DNA.Methylation.Cluster
## 517 IDHwt-K2
## 518 IDHwt-K2
## 519 <NA>
## 520 IDHwt-K2
## 521 IDHwt-K1
## 522 IDHwt-K1
## Supervised.DNA.Methylation.Cluster Random.Forest.Sturm.Cluster
## 517 Mesenchymal-like <NA>
## 518 Mesenchymal-like <NA>
## 519 <NA> <NA>
## 520 Mesenchymal-like <NA>
## 521 Classic-like <NA>
## 522 Classic-like <NA>
## RPPA.cluster Telomere.length.estimate.in.blood.normal..Kb.
## 517 <NA> NA
## 518 K1 NA
## 519 K1 NA
## 520 <NA> NA
## 521 <NA> NA
## 522 <NA> NA
## Telomere.length.estimate.in.tumor..Kb.
## 517 NA
## 518 NA
## 519 NA
## 520 NA
## 521 NA
## 522 NA
4.1.2 Downloading data from Broad TCGA GDAC
The Bioconductor package RTCGAToolbox (Samur 2014) provides access to Firehose Level 3 and 4 data through the function getFirehoseData. The following arguments allows users to select the version and tumor type of interest:
dataset - Tumor to download. A complete list of possibilities can be viewed with getFirehoseDatasets function.
runDate - Stddata run dates. Dates can be viewed with getFirehoseRunningDates function.
gistic2_Date - Analyze run dates. Dates can viewed with getFirehoseAnalyzeDates function.
These arguments can be used to select the data type to download: RNAseq_Gene, Clinic, miRNASeq_Gene, ccRNAseq2_Gene_Norm, CNA_SNP, CNV_SNP, CNA_Seq, CNA_CGH, Methylation, Mutation, mRNA_Array , miRNA_Array, and RPPA.
By default, RTCGAToolbox allows users to download up to 500 MB worth of data. To increase the size of the download, users are encouraged to use fileSizeLimit argument. An example is found in listingbelow. The getData function allow users to access the downloaded data (see lines 22-24 of listing below as a S4Vector object.
library(RTCGAToolbox)
# Get the last run dates
lastRunDate <- getFirehoseRunningDates()[1]
# get DNA methylation data, RNAseq2 and clinical data for GBM
gbm.data <- getFirehoseData(dataset = "GBM",
runDate = lastRunDate, gistic2_Date = getFirehoseAnalyzeDates(1),
Methylation = FALSE, Clinic = TRUE,
RNAseq2_Gene_Norm = FALSE, Mutation = TRUE,
fileSizeLimit = 10000)
gbm.mut <- getData(gbm.data,"Mutations")
gbm.clin <- getData(gbm.data,"Clinical")
Finnaly, RTCGAToolbox can access level 4 data, which can be handy when the user requires GISTIC results. GISTIC is used to identify genes targeted by somatic copy-number alterations (SCNAs) (Mermel et al. 2011).
# Download GISTIC results
lastAnalyseDate <- getFirehoseAnalyzeDates(1)
gistic <- getFirehoseData("GBM",gistic2_Date = lastAnalyseDate)
# get GISTIC results
gistic.allbygene <- getData(gistic,type = "GISTIC", CN = "All")
gistic.thresholedbygene <- getData(gistic,type = "GISTIC", CN = "Thresholed")
data(GBMGistic)
gistic.allbygene[1:5,1:5]
## Gene.Symbol Locus.ID Cytoband TCGA.02.0001.01C.01D.0182.01
## 1 ACAP3 116983 1p36.33 0.242
## 2 ACTRT2 140625 1p36.32 0.242
## 3 AGRN 375790 1p36.33 0.242
## 4 ANKRD65 441869 1p36.33 0.242
## 5 ATAD3A 55210 1p36.33 0.242
## TCGA.02.0003.01A.01D.0182.01
## 1 0.007
## 2 0.007
## 3 0.007
## 4 0.007
## 5 0.007
gistic.thresholedbygene[1:5,1:5]
## Gene.Symbol Locus.ID Cytoband TCGA.02.0001.01C.01D.0182.01
## 1 ACAP3 116983 1p36.33 1
## 2 ACTRT2 140625 1p36.32 1
## 3 AGRN 375790 1p36.33 1
## 4 ANKRD65 441869 1p36.33 1
## 5 ATAD3A 55210 1p36.33 1
## TCGA.02.0003.01A.01D.0182.01
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
4.2 Genomic analysis
Copy number variations (CNVs) have a critical role in cancer development and progression. A chromosomal segment can be deleted or amplified as a result of genomic rearrangements, such as deletions, duplications, insertions and translocations. CNVs are genomic regions greater than 1 kb with an alteration of copy number between two conditions (e.g., Tumor versus Normal).
TCGA collects copy number data and allows the CNV profiling of cancer. Tumor and paired-normal DNA samples were analysed for CNV detection using microarray- and sequencing-based technologies. Level 3 processed data are the aberrant regions along the genome resulting from CNV segmentation, and they are available for all copy number technologies.
In this section, we will show how to analyze CNV level 3 data from TCGA to identify recurrent alterations in cancer genome. We analyzed GBM segmented CNV from SNP array (Affymetrix Genome-Wide Human SNP Array 6.0).
Pre-Processing Data
The only CNV platform available for GBM in TCGA is “Affymetrix Genome-Wide Human SNP Array 6.0”. Using TCGAbiolinks, we queried for CNV SNP6 level 3 data for 20 primary solid tumor samples in the legacy database. Data for selected samples were downloaded and prepared in a rse objects (RangedSummarizedExperiment).
library(TCGAbiolinks)
# Select common CN technology available for GBM and LGG
#############################
## CNV data pre-processing ##
#############################
query.gbm.nocnv <- GDCquery(project = "TCGA-GBM",
data.category = "Copy number variation",
legacy = TRUE,
file.type = "nocnv_hg19.seg",
sample.type = c("Primary solid Tumor"))
# to reduce time we will select only 20 samples
query.gbm.nocnv$results[[1]] <- query.gbm.nocnv$results[[1]][1:20,]
GDCdownload(query.gbm.nocnv, chunks.per.download = 100)
gbm.nocnv <- GDCprepare(query.gbm.nocnv, save = TRUE, save.filename = "GBMnocnvhg19.rda")
4.2.1 Identification of recurrent CNV in cancer
Cancer related CNV have to be present in many of the analyzed genomes. The most significant recurrent CNV were identified using GAIA (Morganella, Pagnotta, and Ceccarelli, n.d.), an iterative procedure where a statistical hypothesis framework is extended to take into account within-sample homogeneity. GAIA is based on a conservative permutation test allowing the estimation of the probability distribution of the contemporary mutations expected for non-driver markers.
Segmented data retrieved from TCGA were used to generate a matrix including all needed information about the observed aberrant regions. Furthermore, GAIA requires genomic probes metadata (specific for each CNV technology), that can be downloaded from broadinstitute website.
# -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-==--=--==---=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-==--=--==--
# Retrieve probes meta file from broadinstitute website for hg19
# For hg38 analysis please take a look on:
# https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files
# File: SNP6 GRCh38 Liftover Probeset File for Copy Number Variation Analysis
# -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-==--=--==---=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-==--=--==--
gdac.root <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/"
file <- paste0(gdac.root, "genome.info.6.0_hg19.na31_minus_frequent_nan_probes_sorted_2.1.txt")
if(!file.exists(basename(file))) downloader::download(file, basename(file))
markersMatrix <- readr::read_tsv(basename(file), col_names = FALSE, col_types = "ccn", progress = FALSE)
save(markersMatrix, file = "markersMatrix.rda", compress = "xz")
cancer <- "GBM"
message(paste0("Starting ", cancer))
# get objects created above
data(GBMnocnvhg19)
data(markersMatrix)
# Add label (0 for loss, 1 for gain)
cnvMatrix <- cbind(cnvMatrix,Label=NA)
cnvMatrix[cnvMatrix[,"Segment_Mean"] < -0.3,"Label"] <- 0
cnvMatrix[cnvMatrix[,"Segment_Mean"] > 0.3,"Label"] <- 1
cnvMatrix <- cnvMatrix[!is.na(cnvMatrix$Label),]
# Remove "Segment_Mean" and change col.names
cnvMatrix <- cnvMatrix[,-6]
colnames(cnvMatrix) <- c("Sample.Name", "Chromosome", "Start", "End", "Num.of.Markers", "Aberration")
# Substitute Chromosomes "X" and "Y" with "23" and "24"
cnvMatrix[cnvMatrix$Chromosome == "X","Chromosome"] <- 23
cnvMatrix[cnvMatrix$Chromosome == "Y","Chromosome"] <- 24
cnvMatrix$Chromosome <- as.integer(cnvMatrix$Chromosome)
# Recurrent CNV identification with GAIA
colnames(markersMatrix) <- c("Probe.Name", "Chromosome", "Start")
unique(markersMatrix$Chromosome)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "X" "Y"
markersMatrix[markersMatrix$Chromosome == "X","Chromosome"] <- 23
markersMatrix[markersMatrix$Chromosome == "Y","Chromosome"] <- 24
markersMatrix$Chromosome <- as.integer(markersMatrix$Chromosome)
markerID <- paste(markersMatrix$Chromosome,markersMatrix$Start, sep = ":")
# Removed duplicates
markersMatrix <- markersMatrix[!duplicated(markerID),]
# Filter markersMatrix for common CNV
markerID <- paste(markersMatrix$Chromosome,markersMatrix$Start, sep = ":")
file <- "ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/CNV.hg19.bypos.111213.txt"
if(!file.exists(basename(file))) downloader::download(file, basename(file))
commonCNV <- readr::read_tsv(basename(file), progress = FALSE)
commonID <- paste(commonCNV$Chromosome,commonCNV$Start, sep = ":")
markersMatrix_fil <- markersMatrix[!markerID %in% commonID,]
library(gaia)
set.seed(200)
markers_obj <- load_markers(as.data.frame(markersMatrix_fil))
nbsamples <- length(unique(cnvMatrix$Sample))
cnv_obj <- load_cnv(cnvMatrix, markers_obj, nbsamples)
suppressWarnings({
results <- runGAIA(cnv_obj,
markers_obj,
output_file_name = paste0("GAIA_",cancer,"_flt.txt"),
aberrations = -1, # -1 to all aberrations
chromosomes = 9, # -1 to all chromosomes
approximation = TRUE, # Set to TRUE to speed up the time requirements
num_iterations = 5000, # Reduced to speed up the time requirements
threshold = 0.25)
})
# Set q-value threshold
# Use a smalled value for your analysis. We set this as high values
# due to the small number of samples which did not reproduced
# results with smaller q-values
threshold <- 0.3
# Plot the results
RecCNV <- t(apply(results,1,as.numeric))
colnames(RecCNV) <- colnames(results)
RecCNV <- cbind(RecCNV, score = 0)
minval <- format(min(RecCNV[RecCNV[,"q-value"] != 0,"q-value"]), scientific = FALSE)
minval <- substring(minval,1, nchar(minval) - 1)
RecCNV[RecCNV[,"q-value"] == 0,"q-value"] <- as.numeric(minval)
RecCNV[,"score"] <- sapply(RecCNV[,"q-value"],function(x) -log10(as.numeric(x)))
RecCNV[RecCNV[,"q-value"] == as.numeric(minval),]
## Chromosome Aberration Kind Region Start [bp] Region End [bp]
## 1 9 0 18259823 18387361
## 2 9 0 19290143 19485455
## 3 9 0 19535498 26257238
## 4 9 0 26433917 26725609
## 5 9 0 26834232 27009100
## Region Size [bp] q-value score
## 1 127539 0.00875417 2.057785
## 2 195313 0.00875417 2.057785
## 3 6721741 0.00875417 2.057785
## 4 291693 0.00875417 2.057785
## 5 174869 0.00875417 2.057785
gaiaCNVplot(RecCNV,threshold)
save(results, RecCNV, threshold, file = paste0(cancer,"_CNV_results.rda"))
Recurrent amplifications and deletions were identified for GBM, and represented in chromosomal overview plots by a statistical score (\(-log_{10}\) corrected p-value for amplifications and \(log_{10}\) corrected p-value for deletions). Genomic regions identified as significantly altered in copy number (corrected p-value < \(10^{-4}\)) were then annotated to report amplified and deleted genes potentially related with cancer.
4.2.2 Gene annotation of recurrent CNV
The aberrant recurrent genomic regions in cancer, as identified by GAIA, have to be annotated to verify which genes are significantly amplified or deleted. Using biomaRt we retrieved the genomic ranges of all human genes and we compared them with significant aberrant regions to select full length genes.
library(GenomicRanges)
# Get gene information from GENCODE using biomart
genes <- TCGAbiolinks:::get.GRCh.bioMart(genome = "hg19")
genes <- genes[genes$external_gene_id != "" & genes$chromosome_name %in% c(1:22,"X","Y"),]
genes[genes$chromosome_name == "X", "chromosome_name"] <- 23
genes[genes$chromosome_name == "Y", "chromosome_name"] <- 24
genes$chromosome_name <- sapply(genes$chromosome_name,as.integer)
genes <- genes[order(genes$start_position),]
genes <- genes[order(genes$chromosome_name),]
genes <- genes[,c("external_gene_id", "chromosome_name", "start_position","end_position")]
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
save(genes_GR,genes,file = "genes_GR.rda", compress = "xz")
##############################
## Recurrent CNV annotation ##
##############################
# Get gene information from GENCODE using biomart
data(genes_GR) # downloaded in the previous step (available in TCGAWorkflowData)
load(paste0(cancer,"_CNV_results.rda"))
sCNV <- RecCNV[RecCNV[,"q-value"] <= threshold,c(1:4,6)]
sCNV <- sCNV[order(sCNV[,3]),]
sCNV <- sCNV[order(sCNV[,1]),]
colnames(sCNV) <- c("Chr","Aberration","Start","End","q-value")
sCNV_GR <- makeGRangesFromDataFrame(sCNV,keep.extra.columns = TRUE)
hits <- findOverlaps(genes_GR, sCNV_GR, type = "within")
sCNV_ann <- cbind(sCNV[subjectHits(hits),],genes[queryHits(hits),])
AberrantRegion <- paste0(sCNV_ann[,1],":",sCNV_ann[,3],"-",sCNV_ann[,4])
GeneRegion <- paste0(sCNV_ann[,7],":",sCNV_ann[,8],"-",sCNV_ann[,9])
AmpDel_genes <- cbind(sCNV_ann[,c(6,2,5)],AberrantRegion,GeneRegion)
AmpDel_genes[AmpDel_genes[,2] == 0,2] <- "Del"
AmpDel_genes[AmpDel_genes[,2] == 1,2] <- "Amp"
rownames(AmpDel_genes) <- NULL
save(RecCNV, AmpDel_genes, file = paste0(cancer,"_CNV_results.rda"))
GeneSymbol | Aberration | q-value | AberrantRegion | GeneRegion |
---|---|---|---|---|
C9orf92 | Del | 0.0087542 | 9:16000812-18259760 | 9:16203933-16276311 |
BNC2 | Del | 0.0087542 | 9:16000812-18259760 | 9:16409501-16870841 |
RP11-183I6.2 | Del | 0.0087542 | 9:16000812-18259760 | 9:16473186-16476235 |
RP11-62F24.1 | Del | 0.0087542 | 9:16000812-18259760 | 9:16625674-16626388 |
RP11-62F24.2 | Del | 0.0087542 | 9:16000812-18259760 | 9:16726812-16727522 |
RP11-335H2.2 | Del | 0.0087542 | 9:16000812-18259760 | 9:16775571-16775843 |
4.2.3 Visualizing multiple genomic alteration events
In order to visualize multiple genomic alteration events we recommend using OncoPrint plot which is provided by Bioconductor package complexHeatmap (Z., n.d.). The listing below shows how to download mutation data using GDCquery_maf (line 4), then we filtered the genes to obtain genes with mutations found among glioma specific pathways (lines 6 - 12). The following steps prepared the data into a matrix to fit oncoPrint function. We defined SNPs as blue, insertions as green and deletions as red. The upper barplot indicates the number of genetic mutation per patient, while the right barplot shows the number of genetic mutations per gene. Also, it is possible to add annotations to rows or columns. For the columns, an insertion made at the top will remove the barplot. The final result for adding the annotation to the bottom is highlighted in the figure .
LGGmut <- GDCquery_Maf(tumor = "LGG", pipelines = "mutect2")
GBMmut <- GDCquery_Maf(tumor = "GBM", pipelines = "mutect2")
mut <- plyr::rbind.fill(LGGmut, GBMmut)
save(mut,file ="mafMutect2LGGGBM.rda")
library(ComplexHeatmap)
# recovering data from TCGAWorkflowData package.
data(mafMutect2LGGGBM)
# Filtering mutations in gliomas
EA_pathways <- TCGAbiolinks:::listEA_pathways
Glioma_pathways <- EA_pathways[grep("glioma", tolower(EA_pathways$Pathway)),]
Glioma_signaling <- Glioma_pathways[Glioma_pathways$Pathway == "Glioma Signaling",]
Glioma_signaling_genes <- unlist(strsplit(as.character(Glioma_signaling$Molecules),","))[1:10]
mut <- mut[mut$Hugo_Symbol %in% Glioma_signaling_genes,]
clinical <- plyr::rbind.fill(gbm_clin,lgg_clin)
TCGAvisualize_oncoprint(mut = mut,
gene = Glioma_signaling_genes,
annotation = clinical[, c("bcr_patient_barcode","disease","vital_status","ethnicity")],
annotation.position = "bottom",
rm.empty.columns = FALSE)
Oncoprint for LGG samples. Blue defines SNP, green defines insertions and red defines deletions. The upper barplot shows the number of these genetic mutation for each patient, while the right barplot shows the number of genetic mutations for each gene. The bottom bar shows the group of each sample.
Overview of genomic alterations by circos plot
Genomic alterations in cancer, including CNV and mutations, can be represented in an effective overview plot named circos. We used circlize CRAN package to represent significant CNV (resulting from GAIA analysis) and recurrent mutations (selecting curated genetic variations retrieved from TCGA that are identified in at least two tumor samples) in LGG. Circos plot can illustrate molecular alterations genome-wide or only in one or more selected chromosomes.
LGGmut <- GDCquery_Maf(tumor = "LGG", pipelines = "mutect2")
###############################################
## Genomic aberration overview - Circos plot ##
###############################################
# Retrieve curated mutations for selected cancer (e.g. "LGG")
data(mafMutect2LGGGBM)
# Select only potentially damaging mutations
LGGmut <- LGGmut[LGGmut$Variant_Classification %in% c("Missense_Mutation",
"Nonsense_Mutation",
"Nonstop_Mutation",
"Frame_Shift_Del",
"Frame_Shift_Ins"),]
# Select recurrent mutations (identified in at least two samples)
mut.id <- paste0(LGGmut$Chromosome,":",LGGmut$Start_Position,"-",LGGmut$End_Position,"|",LGGmut$Reference_Allele)
mut <- cbind(mut.id, LGGmut)
# Prepare selected mutations data for circos plot
s.mut <- mut[mut$mut.id %in% unique(mut.id[duplicated(mut.id)]),]
s.mut <- s.mut[,c("Chromosome","Start_Position","End_Position","Variant_Classification","Hugo_Symbol")]
s.mut <- unique(s.mut)
typeNames <- unique(s.mut[,4])
type <- c(4:1)
names(type) <- typeNames[1:4]
Type <- type[s.mut[,4]]
s.mut <- cbind(s.mut,Type)
s.mut <- s.mut[,c(1:3,6,4,5)]
# Load recurrent CNV data for selected cancer (e.g. "LGG")
load("GBM_CNV_results.rda")
# Prepare selected sample CNV data for circos plot
s.cnv <- as.data.frame(RecCNV[RecCNV[,"q-value"] <= threshold,c(1:4,6)])
s.cnv <- s.cnv[,c(1,3,4,2)]
s.cnv[s.cnv$Chromosome == 23,"Chromosome"] <- "X"
s.cnv[s.cnv$Chromosome == 24,"Chromosome"] <- "Y"
Chromosome <- paste0("chr",s.cnv[,1])
s.cnv <- cbind(Chromosome, s.cnv[,-1])
s.cnv[,1] <- as.character(s.cnv[,1])
s.cnv[,4] <- as.character(s.cnv[,4])
s.cnv <- cbind(s.cnv,CNV=1)
colnames(s.cnv) <- c("Chromosome","Start_position","End_position","Aberration_Kind","CNV")
library(circlize)
# Draw genomic circos plot
par(mar=c(1,1,1,1), cex=1)
circos.initializeWithIdeogram()
# Add CNV results
colors <- c("forestgreen","firebrick")
names(colors) <- c(0,1)
circos.genomicTrackPlotRegion(s.cnv, ylim = c(0,1.2),
panel.fun = function(region, value, ...) {
circos.genomicRect(region, value, ytop.column = 2, ybottom = 0,
col = colors[value[[1]]],
border="white")
cell.xlim = get.cell.meta.data("cell.xlim")
circos.lines(cell.xlim, c(0, 0), lty = 2, col = "#00000040")
})
# Add mutation results
colors <- c("blue","green","red","gold")
names(colors) <- typeNames[1:4]
circos.genomicTrackPlotRegion(s.mut, ylim = c(1.2,4.2),
panel.fun = function(region, value, ...) {
circos.genomicPoints(region, value, cex = 0.8, pch = 16, col = colors[value[[2]]], ...)
})
circos.clear()
legend(-0.2, 0.2, bty="n", y.intersp=1, c("Amp","Del"), pch=15,
col=c("firebrick","forestgreen"), title="CNVs", text.font=1, cex=0.4, title.adj=0)
legend(-0.2, 0, bty="n", y.intersp=1, names(colors), pch=16,
col=colors, title="Mutations", text.font=1, cex=0.4, title.adj=0)
par(mar=c(1,1,1,1),cex=1.5)
circos.par("start.degree" = 90, canvas.xlim = c(0, 1), canvas.ylim = c(0, 1),
gap.degree = 270, cell.padding = c(0, 0, 0, 0), track.margin = c(0.005, 0.005))
circos.initializeWithIdeogram(chromosome.index = "chr17")
circos.par(cell.padding = c(0, 0, 0, 0))
# Add CNV results
colors <- c("forestgreen","firebrick")
names(colors) <- c(0,1)
circos.genomicTrackPlotRegion(s.cnv, ylim = c(0,1.2),
panel.fun = function(region, value, ...) {
circos.genomicRect(region, value, ytop.column = 2, ybottom = 0,
col = colors[value[[1]]],
border="white")
cell.xlim = get.cell.meta.data("cell.xlim")
circos.lines(cell.xlim, c(0, 0), lty = 2, col = "#00000040")
})
# Add mutation results representing single genes
genes.mut <- paste0(s.mut$Hugo_Symbol,"-",s.mut$Type)
s.mutt <- cbind(s.mut,genes.mut)
n.mut <- table(genes.mut)
idx <- !duplicated(s.mutt$genes.mut)
s.mutt <- s.mutt[idx,]
s.mutt <- cbind(s.mutt,num=n.mut[s.mutt$genes.mut])
genes.num <- paste0(s.mutt$Hugo_Symbol," (",s.mutt$num.Freq,")")
s.mutt <- cbind(s.mutt[,-c(6:8)],genes.num)
s.mutt[,6] <- as.character(s.mutt[,6])
s.mutt[,4] <- s.mutt[,4]/2
s.mutt$num.Freq <- NULL
colors <- c("blue","green","red","gold")
names(colors) <- typeNames[1:4]
circos.genomicTrackPlotRegion(s.mutt, ylim = c(0.3,2.2), track.height = 0.05,
panel.fun = function(region, value, ...) {
circos.genomicPoints(region, value, cex = 0.4, pch = 16, col = colors[value[[2]]], ...)
})
circos.genomicTrackPlotRegion(s.mutt, ylim = c(0, 1), track.height = 0.1, bg.border = NA)
i_track = get.cell.meta.data("track.index")
circos.genomicTrackPlotRegion(s.mutt, ylim = c(0,1),
panel.fun = function(region, value, ...) {
circos.genomicText(region, value,
y = 1,
labels.column = 3,
col = colors[value[[2]]],
facing = "clockwise", adj = c(1, 0.5),
posTransform = posTransform.text, cex = 0.4, niceFacing = TRUE)
}, track.height = 0.1, bg.border = NA)
circos.genomicPosTransformLines(s.mutt,
posTransform = function(region, value)
posTransform.text(region,
y = 1,
labels = value[[3]],
cex = 0.4, track.index = i_track+1),
direction = "inside", track.index = i_track)
circos.clear()
legend(0.25, 0.2, bty="n", y.intersp=1, c("Amp","Del"), pch=15,
col=c("firebrick","forestgreen"), title="CNVs", text.font=1, cex=0.4, title.adj=0)
legend(0, 0.2, bty="n", y.intersp=1, names(colors), pch=16,
col=colors, title="Mutations", text.font=1, cex=0.4, title.adj=0)
4.3 Transcriptomic analysis
4.3.1 Pre-Processing Data
The LGG and GBM data used for following transcriptomic analysis were downloaded using TCGAbiolinks. We downloaded only primary solid tumor (TP) samples, which resulted in 516 LGG samples and 156 GBM samples, then prepared it in two separate rse objects (RangedSummarizedExperiment) saving them as an R object with a filename including both the name of the cancer and the name of the plaftorm used for gene expression data (see listing below).
query <- GDCquery(project = "TCGA-GBM",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
sample.type = c("Primary solid Tumor"),
legacy = TRUE)
# We will use only 20 samples to make the example faster
query$results[[1]] <- query$results[[1]][1:20,]
GDCdownload(query)
gbm.exp <- GDCprepare(query,
save = TRUE,
summarizedExperiment = TRUE,
save.filename = "GBMIllumina_HiSeq.rda")
query <- GDCquery(project = "TCGA-LGG",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
sample.type = c("Primary solid Tumor"),
legacy = TRUE)
# We will use only 20 samples to make the example faster
query$results[[1]] <- query$results[[1]][1:20,]
GDCdownload(query)
lgg.exp <- GDCprepare(query,
save = TRUE,
summarizedExperiment = TRUE,
save.filename = "LGGIllumina_HiSeq.rda")
To pre-process the data, first, we searched for possible outliers using the TCGAanalyze_Preprocessing function, which performs an Array Array Intensity correlation AAIC (lines 14-17 and 26-29 of listing below). In this way we defined a square symmetric matrix of pearson correlation among all samples in each cancer type (LGG or GBM). This matrix found 0 samples with low correlation (cor.cut = 0.6) that can be identified as possible outliers.
Second, using the TCGAanalyze_Normalization function, which encompasses the functions of the EDASeq package, we normalized mRNA transcripts.
This function implements Within-lane normalization procedures to adjust for GC-content effect (or other gene-level effects) on read counts: loess robust local regression, global-scaling, and full-quantile normalization (Risso et al. 2011) and between-lane normalization procedures to adjust for distributional differences between lanes (e.g., sequencing depth): global-scaling and full-quantile normalization (Bullard et al. 2010).
data("LGGIllumina_HiSeq")
data("GBMIllumina_HiSeq")
dataPrep_LGG <- TCGAanalyze_Preprocessing(object = lgg.exp,
cor.cut = 0.6,
datatype = "raw_count",
filename = "LGG_IlluminaHiSeq_RNASeqV2.png")
dataPrep_GBM <- TCGAanalyze_Preprocessing(object = gbm.exp,
cor.cut = 0.6,
datatype = "raw_count",
filename = "GBM_IlluminaHiSeq_RNASeqV2.png")
dataNorm <- TCGAanalyze_Normalization(tabDF = cbind(dataPrep_LGG, dataPrep_GBM),
geneInfo = TCGAbiolinks::geneInfo,
method = "gcContent") #18323 672
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
method = "quantile",
qnt.cut = 0.25) # 13742 672
save(dataFilt, file = paste0("LGG_GBM_Norm_IlluminaHiSeq.rda"))
dataFiltLGG <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% lgg_clin$bcr_patient_barcode)
dataFiltGBM <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% gbm_clin$bcr_patient_barcode)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG,
mat2 = dataFiltGBM,
Cond1type = "LGG",
Cond2type = "GBM",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT")
# Number of differentially expressed genes (DEG)
nrow(dataDEGs)
[1] 2535
4.3.2 EA: enrichment analysis
In order to understand the underlying biological process of DEGs we performed an enrichment analysis using TCGAanalyze_EA_complete function.
#------------------- 4.2 EA: enrichment analysis --------------------
ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", RegulonList = rownames(dataDEGs))
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
filename = NULL,
GOBPTab = ansEA$ResBP,
GOCCTab = ansEA$ResCC,
GOMFTab = ansEA$ResMF,
PathTab = ansEA$ResPat,
nRGTab = rownames(dataDEGs),
nBar = 20)
The plot shows canonical pathways significantly overrepresented (enriched) by the DEGs (differentially expressed genes) with the number of genes for the main categories of three ontologies (GO:biological process, GO:cellular component, and GO:molecular function, respectively). The most statistically significant canonical pathways identified in DEGs list are listed according to their p value corrected FDR (-Log) (colored bars) and the ratio of list genes found in each pathway over the total number of genes in that pathway (ratio, red line).]
TCGAanalyze_EAbarplot outputs a bar chart as shown in figure with the number of genes for the main categories of three ontologies (i.e., GO biological process, GO cellular component, and GO molecular function).
The Figure shows canonical pathways significantly overrepresented (enriched) by the DEGs. The most statistically significant canonical pathways identified in the DEGs are ranked according to their p-value corrected FDR (-Log10) (colored bars) and the ratio of list genes found in each pathway over the total number of genes in that pathway (ratio, red line).
4.3.3 PEA: Pathways enrichment analysis
To verify if the genes found have a specific role in a pathway, the Bioconductor package pathview (Luo and Brouwer 2013) can be used. Listing below shows an example how to use it. It can receive, for example, a named vector of gene with the expression level, the pathway.id which can be found in KEGG database, the species (’hsa’ for Homo sapiens) and the limits for the gene expression.
library(SummarizedExperiment)
GenelistComplete <- rownames(assay(lgg.exp,1))
# DEGs TopTable
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"LGG","GBM",
dataFilt[,colnames(dataFiltLGG)],
dataFilt[,colnames(dataFiltGBM)])
dataDEGsFiltLevel$GeneID <- 0
library(clusterProfiler)
# Converting Gene symbol to geneID
eg = as.data.frame(bitr(dataDEGsFiltLevel$mRNA,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb="org.Hs.eg.db"))
eg <- eg[!duplicated(eg$SYMBOL),]
dataDEGsFiltLevel <- dataDEGsFiltLevel[dataDEGsFiltLevel$mRNA %in% eg$SYMBOL,]
dataDEGsFiltLevel <- dataDEGsFiltLevel[order(dataDEGsFiltLevel$mRNA,decreasing=FALSE),]
eg <- eg[order(eg$SYMBOL,decreasing=FALSE),]
# table(eg$SYMBOL == dataDEGsFiltLevel$mRNA) should be TRUE
all(eg$SYMBOL == dataDEGsFiltLevel$mRNA)
[1] TRUE
dataDEGsFiltLevel$GeneID <- eg$ENTREZID
dataDEGsFiltLevel_sub <- subset(dataDEGsFiltLevel, select = c("GeneID", "logFC"))
genelistDEGs <- as.numeric(dataDEGsFiltLevel_sub$logFC)
names(genelistDEGs) <- dataDEGsFiltLevel_sub$GeneID
library(pathview)
# pathway.id: hsa05214 is the glioma pathway
# limit: sets the limit for gene expression legend and color
hsa05214 <- pathview::pathview(
gene.data = genelistDEGs,
pathway.id = "hsa05214",
species = "hsa",
limit = list(gene=as.integer(max(abs(genelistDEGs)))))
The red genes are up-regulated and the green genes are down-regulated in the LGG samples compared to GBM.
4.3.4 Inference of gene regulatory networks
Starting with the set of differentially expressed genes, we infer gene regulatory networks using the following state-of-the art inference algorithms: ARACNE (Margolin et al. 2006), CLR (Faith et al. 2007), MRNET (Meyer et al. 2007) and C3NET (Altay and Emmert-Streib 2010). These methods are based on mutual inference and use different heuristics to infer the edges in the network. These methods have been made available via Bioconductor/CRAN packages (MINET (Meyer, Lafitte, and Bontempi 2008) and c3net, (Altay and Emmert-Streib 2010) respectively). Many gene regulatory interactions have been experimentally validated and published. These ’known’ interactions can be accessed using different tools and databases such as BioGrid (Stark et al. 2006) or GeneMANIA (Montojo et al. 2010). However, this knowledge is far from complete and in most cases only contains a small subset of the real interactome. The quality assessment of the inferred networks can be carried out by comparing the inferred interactions to those that have been validated. This comparison results in a confusion matrix as presented in Table below.
validated | not validated/non-existing | |
---|---|---|
inferred | TP | FP |
not inferred | FN | TN |
Different quality measures can then be computed such as the false positive rate \[fpr=\frac{FP}{FP+TN},\] the true positive rate (also called recall) \[tpr=\frac{TP}{TP+FN}\] and the precision \[p=\frac{TP}{TP+FP}.\] The performance of an algorithm can then be summarized using ROC (false positive rate versus true positive rate) or PR (precision versus recall) curves.
A weakness of this type of comparison is that an edge that is not present in the set of known interactions can either mean that an experimental validation has been tried and did not show any regulatory mechanism or (more likely) has not yet been attempted.
In the following, we ran the nce on i) the 2,901 differentially expressed genes identified in Section “Transcriptomic analysis”.
Retrieving known interactions
We obtained a set of known interactions from the BioGrid database.
There are 3,941 unique interactions between the 2,901 differentially expressed genes.
Using differentially expressed genes from TCGAbiolinks workflow
We start this analysis by inferring one gene set for the LGG data.
### read biogrid info (available in TCGAWorkflowData as "biogrid")
### Check last version in https://thebiogrid.org/download.php
file <- "http://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.4.146/BIOGRID-ALL-3.4.146.tab2.zip"
if(!file.exists(gsub("zip","txt",basename(file)))){
downloader::download(file,basename(file))
unzip(basename(file),junkpaths =TRUE)
}
tmp.biogrid <- read.csv(gsub("zip","txt",basename(file)), header=TRUE, sep="\t", stringsAsFactors=FALSE)
### plot details (colors & symbols)
mycols <- c('#e41a1c','#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628')
### load network inference libraries
library(minet)
library(c3net)
### deferentially identified genes using TCGAbiolinks
# we will use only a subset (first 50 genes) of it to make the example faster
names.genes.de <- rownames(dataDEGs)[1:30]
data(biogrid)
net.biogrid.de <- getAdjacencyBiogrid(tmp.biogrid, names.genes.de)
mydata <- dataFiltLGG[names.genes.de, ]
### infer networks
t.mydata <- t(mydata)
net.aracne <- minet(t.mydata, method = "aracne")
net.mrnet <- minet(t.mydata)
net.clr <- minet(t.mydata, method = "clr")
net.c3net <- c3net(mydata)
### validate compared to biogrid network
tmp.val <- list(validate(net.aracne, net.biogrid.de),
validate(net.mrnet, net.biogrid.de),
validate(net.clr, net.biogrid.de),
validate(net.c3net, net.biogrid.de))
### plot roc and compute auc for the different networks
dev1 <- show.roc(tmp.val[[1]],cex=0.3,col=mycols[1],type="l")
res.auc <- auc.roc(tmp.val[[1]])
for(count in 2:length(tmp.val)){
show.roc(tmp.val[[count]],device=dev1,cex=0.3,col=mycols[count],type="l")
res.auc <- c(res.auc, auc.roc(tmp.val[[count]]))
}
legend("bottomright", legend=paste(c("aracne","mrnet","clr","c3net"), signif(res.auc,4), sep=": "),
col=mycols[1:length(tmp.val)],lty=1, bty="n" )
# Please, uncomment this line to produce the pdf files.
# dev.copy2pdf(width=8,height=8,device = dev1, file = paste0("roc_biogrid_",cancertype,".pdf"))
In Figure above, the obtained ROC curve and the corresponding area under curve (AUC) are presented. It can be observed that CLR and MRNET perform best when comparing the inferred network with known interactions from the BioGrid database.
4.4 Epigenetic analysis
The DNA methylation is an important component in numerous cellular processes, such as embryonic development, genomic imprinting, X-chromosome inactivation, and preservation of chromosome stability (Phillips 2008).
In mammals DNA methylation is found sparsely but globally, distributed in definite CpG sequences throughout the entire genome; however, there is an exception. CpG islands (CGIs) which are short interspersed DNA sequences that are enriched for GC. These islands are normally found in sites of transcription initiation and their methylation can lead to gene silencing (Deaton and Bird 2011).
Thus, the investigation of the DNA methylation is crucial to understanding regulatory gene networks in cancer as the DNA methylation represses transcription (Robertson 2005). Therefore, the DMR (Differentially Methylation Region) detection can help us investigate regulatory gene networks.
This section describes the analysis of DNA methylation using the Bioconductor package TCGAbiolinks (Colaprico et al. 2016). For this analysis, and due to the time required to perform it, we selected only 10 LGG samples and 10 GBM samples that have both DNA methylation data from Infinium HumanMethylation450 and gene expression from Illumina HiSeq 2000 RNA Sequencing Version 2 analysis (lines 1-56 of the listing below describes how to make the data acquisition). We started by checking the mean DNA methylation of different groups of samples, then performed a DMR in which we search for regions of possible biological significance, (e.g., regions that are methylated in one group and unmethylated in the other). After finding these regions, they can be visualized using heatmaps.
4.4.1 Visualizing the mean DNA methylation of each patient
It should be highlighted that some pre-processing of the DNA methylation data was done. The DNA methylation data from the 450k platform has three types of probes cg (CpG loci) , ch (non-CpG loci) and rs (SNP assay). The last type of probe can be used for sample identification and tracking and should be excluded for differential methylation analysis according to the ilumina manual. Therefore, the rs probes were removed (see listing below lines 68). Also, probes in chromosomes X, Y were removed to eliminate potential artifacts originating from the presence of a different proportion of males and females (Marabita et al. 2013). The last pre-processing steps were to remove probes with at least one NA value (see listing below lines 65).
After this pre-processing step and using the function TCGAvisualize_meanMethylation function, we can look at the mean DNA methylation of each patient in each group. It receives as argument a SummarizedExperiment object with the DNA methylation data, and the arguments groupCol and subgroupCol which should be two columns from the sample information matrix of the SummarizedExperiment object (accessed by the colData function) (see listing below lines 70-74).
#----------------------------
# Obtaining DNA methylation
#----------------------------
# Samples
lgg.samples <- matchedMetExp("TCGA-LGG", n = 10)
gbm.samples <- matchedMetExp("TCGA-GBM", n = 10)
samples <- c(lgg.samples,gbm.samples)
#-----------------------------------
# 1 - Methylation
# ----------------------------------
# For methylation it is quicker in this case to download the tar.gz file
# and get the samples we want instead of downloading files by files
query <- GDCquery(project = c("TCGA-LGG","TCGA-GBM"),
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450",
legacy = TRUE,
barcode = samples)
GDCdownload(query)
met <- GDCprepare(query, save = FALSE)
# We will use only chr9 to make the example faster
met <- subset(met,subset = as.character(seqnames(met)) %in% c("chr9"))
# This data is avaliable in the package
save(met, file = "met20SamplesGBMLGGchr9.rda")
data(met20SamplesGBMLGGchr9)
#----------------------------
# Mean methylation
#----------------------------
# Plot a barplot for the groups in the disease column in the
# summarizedExperiment object
# remove probes with NA (similar to na.omit)
met <- subset(met,subset = (rowSums(is.na(assay(met))) == 0))
TCGAvisualize_meanMethylation(met,
groupCol = "disease_type",
group.legend = "Groups",
filename = NULL,
print.pvalue = TRUE)
## groups Mean Median Max Min
## 1 Brain Lower Grade Glioma 0.5159918 0.5320007 0.5610239 0.4509977
## 2 Glioblastoma Multiforme 0.4600086 0.4780980 0.5186596 0.3060219
## Brain Lower Grade Glioma Glioblastoma Multiforme
## Brain Lower Grade Glioma NA 0.02181625
## Glioblastoma Multiforme 0.02181625 NA
The figure above illustrates a mean DNA methylation plot for each sample in the GBM group (140 samples) and a mean DNA methylation for each sample in the LGG group. Genome-wide view of the data highlights a difference between the groups of tumors.
4.4.2 Searching for differentially methylated CpG sites
The next step is to define differentially methylated CpG sites between the two groups. This can be done using the TCGAanalyze_DMR function (see listing below). The DNA methylation data (level 3) is presented in the form of beta-values that uses a scale ranging from 0.0 (probes completely unmethylated ) up to 1.0 (probes completely methylated).
To find these differentially methylated CpG sites, first, the function calculates the difference between the mean DNA methylation (mean of the beta-values) of each group for each probe. Second, it test for differential expression between two groups using the wilcoxon test adjusting by the Benjamini-Hochberg method. Arguments of TCGAanalyze_DMR was set to require a minimum absolute beta-values difference of 0.15 and an adjusted p-value of less than \(0.05\).
After these tests, a volcano plot (x-axis: difference of mean DNA methylation, y-axis: statistical significance) is created to help users identify the differentially methylated CpG sites and return the object with the results in the rowRanges.
#------- Searching for differentially methylated CpG sites ----------
met <- TCGAanalyze_DMR(met,
groupCol = "disease_type", # a column in the colData matrix
group1 = "Glioblastoma Multiforme", # a type of the disease type column
group2 = "Brain Lower Grade Glioma", # a type of the disease column
p.cut = 0.05,
diffmean.cut = 0.15,
save = FALSE,
legend = "State",
plot.filename = "LGG_GBM_metvolcano.png",
cores = 1
)
Figure below shows the volcano plot produced by listing below. This plot aids the user in selecting relevant thresholds, as we search for candidate biological DMRs.
To visualize the level of DNA methylation of these probes across all samples, we use heatmaps that can be generate by the Bioconductor package complexHeatmap (Z., n.d.). To create a heatmap using the complexHeatmap package, the user should provide at least one matrix with the DNA methylation levels. Also, annotation layers can be added and placed at the bottom, top, left side and right side of the heatmap to provide additional metadata description. The listing below shows the code to produce the heatmap of a DNA methylation datum.
#--------------------------
# DNA Methylation heatmap
#-------------------------
library(ComplexHeatmap)
clinical <- plyr::rbind.fill(gbm_clin,lgg_clin)
# get the probes that are Hypermethylated or Hypomethylated
# met is the same object of the section 'DNA methylation analysis'
status.col <- "status.Glioblastoma.Multiforme.Brain.Lower.Grade.Glioma"
sig.met <- met[values(met)[,status.col] %in% c("Hypermethylated","Hypomethylated"),]
# top annotation, which sampples are LGG and GBM
# We will add clinical data as annotation of the samples
# we will sort the clinical data to have the same order of the DNA methylation matrix
clinical.order <- clinical[match(substr(colnames(sig.met),1,12),clinical$bcr_patient_barcode),]
ta = HeatmapAnnotation(df = clinical.order[,c("disease","gender","vital_status","race")],
col = list(disease = c("LGG" = "grey", "GBM" = "black"),
gender = c("male" = "blue","female" = "pink")))
# row annotation: add the status for LGG in relation to GBM
# For exmaple: status.gbm.lgg Hypomethyated means that the
# mean DNA methylation of probes for lgg are hypomethylated
# compared to GBM ones.
ra = rowAnnotation(df = values(sig.met)[status.col],
col = list("status.Glioblastoma.Multiforme.Brain.Lower.Grade.Glioma" =
c("Hypomethylated" = "orange",
"Hypermethylated" = "darkgreen")),
width = unit(1, "cm"))
heatmap <- Heatmap(assay(sig.met),
name = "DNA methylation",
col = matlab::jet.colors(200),
show_row_names = FALSE,
cluster_rows = TRUE,
cluster_columns = FALSE,
show_column_names = FALSE,
bottom_annotation = ta,
column_title = "DNA Methylation")
# Save to pdf
png("heatmap.png",width = 600, height = 400)
draw(heatmap, annotation_legend_side = "bottom")
dev.off()
4.4.3 Motif analysis
Motif discovery is the attempt to extract small sequence signals hidden within largely non-functional intergenic sequences. These small sequence nucleotide signals (6-15 bp) might have a biological significance as they can be used to control the expression of genes. These sequences are called Regulatory motifs. The Bioconductor package rGADEM (Droit et al. 2015; Li 2009) provides an efficient de novo motif discovery algorithm for large-scale genomic sequence data.
The user may be interested in looking for unique signatures in the regions defined by ‘differentially methylated’ to identify candidate transcription factors that could bind to these elements affected by the accumulation or absence of DNA methylation. For this analysis we use a sequence of 100 bases before and after the probe location (See lines 6-8 in the listing below). An object will be returned which contains all relevant information about your motif analysis (i.e., sequence consensus, pwm, chromosome, p-value, etc).
Using Bioconductor package motifStack (Ou et al. 2013) it is possible to generate a graphic representation of multiple motifs with different similarity scores.
library(rGADEM)
library(BSgenome.Hsapiens.UCSC.hg19)
library(motifStack)
probes <- rowRanges(met)[values(met)[,status.col] %in% c("Hypermethylated" ,"Hypomethylated") ,]
# Get hypo/hyper methylated probes and make a 200bp window
# surrounding each probe.
sequence <- RangedData(space = as.character(seqnames(probes)),
IRanges(start = start(ranges(probes)) - 100,
end = start(ranges(probes)) + 100), strand = "*")
#look for motifs
gadem <- GADEM(sequence, verbose = FALSE, genome = Hsapiens)
top 3 4, 5-mers: 20 40 60 top 3 4, 5-mers: 20 40 60
# How many motifs were found?
nMotifs(gadem)
[1] 3
# get the number of occurences
nOccurrences(gadem)
[1] 134 197 128
# view all sequences consensus
consensus(gadem)
[1] “nTTTyTyTTTyyyTy” “sCCCynsCCyCnsCCsCnCCm” “srGsCwGGGs”
# Print motif
pwm <- getPWM(gadem)
pfm <- new("pfm",mat=pwm[[1]],name="Novel Site 1")
plotMotifLogo(pfm)
# Number of instances of motif 1?
length(gadem@motifList[[1]]@alignList)
[1] 134
After rGADEM returns it’s results, the user can use MotIV package (Mercier and Gottardo 2014; Mahony, Auron, and Benos 2007; Mahony and Benos 2007; Mercier et al. 2011) to start the motif matching analysis (line 4 of listing below).
library(MotIV)
# motif matching analysis
analysis.jaspar <- motifMatch(pwm)
Ungapped Alignment
Scores read
Database read
Motif matches : 5
summary(analysis.jaspar)
Number of input motifs : 3
Input motifs names : CTTCCTGTCTCCTTC CCCCAGGCCTCACCGGAGGCC GAGCCGGGGG
Number of matches per motif: 5
Matches repartition :
EWSR1-FLI1 INSM1 SP1 ESR1 IRF1 IRF2 2 2 2 1 1 1 Klf4 NFATC2 Pax4 PLAG1 RREB1 SPI1 1 1 1 1 1 1 Arguments used : -metric name : PCC -alignment : SWU
plot(analysis.jaspar, ncol=2, top=5, rev=FALSE, main="", bysim=TRUE, cex=0.4)
# visualize the quality of the results looking into the alignments
# E-value give an estimation of the match.
alignment <- viewAlignments(analysis.jaspar)
print(alignment[[1]])
IRF1 EWSR1-FLI1 IRF2
seq “-NTTTTTYTTTYYYTN” “–NTTTTTYTTTYYYTN-” “——NTTTTTYTTTYYYTN” match “GGTTTCRNTTTN—-” “CCTTCCTTCCTTCCTTCC” “NNWWNRSTTTCRCTTTCS—” evalue “1.1304e-04” “2.0133e-03” “8.4348e-03”
NFATC2 SPI1
seq “NTTTTTYTTTYYYTN” “NTTTTTYTTTYYYTN” match “–TTTTCCA——” “—–ACTTCCT—” evalue “2.3439e-02” “3.9846e-02”
4.5 Integrative (Epigenomic & Transcriptomic) analysis
Recent studies have shown that providing a deep integrative analysis can aid researchers in identifying and extracting biological insight from high through put data (Phillips 2008; Shi et al. 2014; Rhodes and Chinnaiyan 2005). In this section, we will introduce a Bioconductor package called ELMER to identify regulatory enhancers using gene expression + DNA methylation data + motif analysis. In addition, we show how to integrate the results from the previous sections with important epigenomic data derived from both the ENCODE and Roadmap.
4.5.1 Integration of DNA methylation & gene expression data
After finding differentially methylated CpG sites, one might ask whether nearby genes also undergo a change in expression either an increase or a decrease. DNA methylation at promoters of genes has been shown to be associated with silencing of the respective gene. The starburst plot is proposed to combine information from two volcano plots and is applicable for studies of DNA methylation and gene expression (Noushmehr et al. 2010). Even though, being desirable that both gene expression and DNA methylation data are from the same samples, the starburst plot can be applied as a meta-analysis tool, combining data from different samples (Siegmund 2011). We used the TCGAvisualize_starburst function to create a starburst plot. The \(log_{10}\) (FDR-corrected P value) for DNA methylation is plotted on the x axis, and for gene expression on the y axis, for each gene. The horizontal black dashed line shows the FDR-adjusted P value of \(10^{-2}\) for the expression data and the vertical ones shows the FDR-adjusted P value of \(10^{-2}\) for the DNA methylation data. While the argument met.p.cut and exp.p.cut controls the black dashed lines, the arguments diffmean.cut and logFC.cut will be used to highlight the genes that respects these parameters (circled genes in Figure below). For the example below we set higher p.cuts trying to get the most significant list of pair gene/probes. But for the next sections we will use exp.p.cut = 0.01 and logFC.cut = 1 as the previous sections.
#------------------- Starburst plot ------------------------------
starburst <- TCGAvisualize_starburst(met, # DNA methylation with results
dataDEGs, # DEG results
group1 = "Glioblastoma Multiforme",
group2 = "Brain Lower Grade Glioma",
filename = "starburst.png",
met.p.cut = 0.05,
exp.p.cut = 10^-2,
diffmean.cut = 0.15,
logFC.cut = 1,
width = 15,height = 10,
names = TRUE)
Due to the time constraints, with the reduced number of samples, genes and probes, the result showed previously might not be very descriptive. A results with a large data set can be found in TCGAbiolinks vignette.
4.5.2 ChIP-seq analysis
ChIP-seq is used primarily to determine how transcription factors and other chromatin-associated proteins influence phenotype-affecting mechanisms. Determining how proteins interact with DNA to regulate gene expression is essential for fully understanding many biological processes and disease states. The aim is to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation. A summary of the association of each histone mark is shown in the table below.
Histone marks | Role |
---|---|
Histone H3 lysine 4 trimethylation (H3K4me3) | Promoter regions (Heintzman et al. 2007,Bernstein et al. (2005)) |
Histone H3 lysine 4 monomethylation (H3K4me1) | Enhancer regions (Heintzman et al. 2007) |
Histone H3 lysine 36 trimethylation (H3K36me3) | Transcribed regions |
Histone H3 lysine 27 trimethylation (H3K27me3) | Polycomb repression (Bonasio, Tu, and Reinberg 2010) |
Histone H3 lysine 9 trimethylation (H3K9me3) | Heterochromatin regions (Peters et al. 2003) |
Histone H3 acetylated at lysine 27 (H3K27ac) | Increase activation of genomic elements (Heintzman et al. 2009,Rada-Iglesias et al. (2011),Creyghton et al. (2010)) |
Histone H3 lysine 9 acetylation (H3K9ac) | Transcriptional activation (Nishida et al. 2006) |
Besides, ChIP-seq data exists in the ROADMAP database and can be obtained through the AnnotationHub package (Morgan M and S., n.d.) or from Roadmap web portal. The table below shows the description for all the roadmap files that are available through AnnotationHub.
File | Description |
---|---|
fc.signal.bigwig | Bigwig File containing fold enrichment signal tracks |
pval.signal.bigwig | Bigwig File containing -log10(p-value) signal tracks |
hotspot.fdr0.01.broad.bed.gz | Broad domains on enrichment for DNase-seq for consolidated epigenomes |
hotspot.broad.bed.gz | Broad domains on enrichment for DNase-seq for consolidated epigenomes |
broadPeak.gz | Broad ChIP-seq peaks for consolidated epigenomes |
gappedPeak.gz | Gapped ChIP-seq peaks for consolidated epigenomes |
narrowPeak.gz | Narrow ChIP-seq peaks for consolidated epigenomes |
hotspot.fdr0.01.peaks.bed.gz | Narrow DNasePeaks for consolidated epigenomes |
hotspot.all.peaks.bed.gz | Narrow DNasePeaks for consolidated epigenomes |
.macs2.narrowPeak.gz | Narrow DNasePeaks for consolidated epigenomes |
coreMarks_mnemonics.bed.gz | 15 state chromatin segmentations |
mCRF_FractionalMethylation.bigwig | MeDIP/MRE(mCRF) fractional methylation calls |
RRBS_FractionalMethylation.bigwig | RRBS fractional methylation calls |
WGBS_FractionalMethylation.bigwig | Whole genome bisulphite fractional methylation calls |
After obtaining the ChIP-seq data, we can then identify overlapping regions with the regions identified in the starburst plot. The narrowPeak files are the ones selected for this step. For a complete pipeline with Chip-seq data, Bioconductor provides excellent tutorials to work with ChIP-seq and we encourage our readers to review the following article (Aleksandra Pekowska 2015). The first step shown in listing below is to download the chip-seq data. The function query received as argument the annotationHub database (ah) and a list of keywords to be used for searching the data, EpigenomeRoadmap is selecting the roadmap database, consolidated is selecting only the consolidate epigenomes, brain is selecting the brain samples, E068 is one of the epigenomes for brain (keywords can be seen in the summary table) and narrowPeak is selecting the type of file. The data downloaded is a processed data from an integrative Analysis of 111 reference human epigenomes (Kundaje et al. 2015).
library(ChIPseeker)
library(pbapply)
library(ggplot2)
#------------------ Working with ChipSeq data ---------------
# Step 1: download histone marks for a brain and non-brain samples.
#------------------------------------------------------------
# loading annotation hub database
library(AnnotationHub)
setAnnotationHubOption("CACHE", "/liftrroot/")
ah = AnnotationHub()
# Searching for brain consolidated epigenomes in the roadmap database
bpChipEpi_brain <- query(ah , c("EpigenomeRoadMap", "narrowPeak", "chip", "consolidated","brain","E068"))
# Get chip-seq data
histone.marks <- pblapply(names(bpChipEpi_brain), function(x) {ah[[x]]})
names(histone.marks) <- names(bpChipEpi_brain)
# OBS: histone.marks is available in TCGAWorkflowData package
The Chipseeker package (Yu, Wang, and He 2015) implements functions that uses Chip-seq data to retrieve the nearest genes around the peak, to annotate genomic region of the peak, among others. Also, it provides several visualization functions to summarize the coverage of the peak, average profile and heatmap of peaks binding to TSS regions, genomic annotation, distance to TSS and overlap of peaks or genes.
After downloading the histone marks, it is useful to verify the average profile of peaks binding to hypomethylated and hypermethylated regions, which will help the user understand better the regions found. Listing below shows an example of code to plot the average profile.
To help the user better understand the regions found in the DMR analysis, we downloaded histone marks specific for brain tissue using the AnnotationHub package that can access the Roadmap database. Next, the Chipseeker was used to visualize how histone modifications are enriched onto hypomethylated and hypermethylated regions, (listing below). The enrichment heatmap and the average profile of peaks binding to those regions.
data(histoneMarks)
# Create a GR object based on the hypo/hypermethylated probes.
probes <- keepStandardChromosomes(rowRanges(met)[values(met)[,status.col] %in% c("Hypermethylated","Hypomethylated"),])
# Defining a window of 3kbp - 3kbp_probe_3kbp
ranges(probes) <- IRanges(start = c(start(ranges(probes)) - 3000), end = c(start(ranges(probes)) + 3000))
### Profile of ChIP peaks binding to TSS regions
# First of all, for calculate the profile of ChIP peaks binding to TSS regions, we should
# prepare the TSS regions, which are defined as the flanking sequence of the TSS sites.
# Then align the peaks that are mapping to these regions, and generate the tagMatrix.
tagMatrixList <- pbapply::pblapply(histone.marks, function(x) {
getTagMatrix(keepStandardChromosomes(x), windows = probes, weightCol = "score")
})
# change nanmes retrieved with the following command: basename(bpChipEpi_brain$title)
names(tagMatrixList) <- c("H3K4me1","H3K4me3", "H3K9ac", "H3K9me3", "H3K27ac", "H3K27me3", "H3K36me3")
To plot the enrichment heatmap use the function tagHeatmap
tagHeatmap(tagMatrixList, xlim=c(-3000, 3000),color = NULL)
To plot the average profile of peaks binding to those region use plotAvgProf
:
p <- plotAvgProf(tagMatrixList, xlim = c(-3000,3000), xlab = "Genomic Region (5'->3', centered on CpG)")
# We are centreing in the CpG instead of the TSS. So we'll change the labels manually
p <- p + scale_x_continuous(breaks=c(-3000,-1500,0,1500,3000),labels=c(-3000,-1500,"CpG",1500,3000))
library(ggthemes)
p + theme_few() + scale_colour_few(name="Histone marks") + guides(colour = guide_legend(override.aes = list(size=4)))
The hypomethylated and hypermethylated regions are enriched for H3K4me3, H3K9ac, H3K27ac and H3K4me1 which indicates regions of enhancers, promoters and increased activation of genomic elements. However, these regions are associated neither with transcribed regions nor Polycomb repression as the H3K36me3 and H3K27me3 heatmaps does not show an enrichment nearby the position 0, and the average profile also does not show a peak at position 0.
Recently, many studies suggest that enhancers play a major role as regulators of cell-specific phenotypes leading to alteration in transcriptomes realated to diseases (Giorgio et al. 2015; Gröschel et al. 2014; Sur et al. 2012; Yao, Berman, and Farnham 2015). In order to investigate regulatory enhancers that can be located at long distances upstream or downstream of target genes Bioconductor offer the Enhancer Linking by Methylation/Expression Relationship (ELMER) package. This package is designed to combine DNA methylation and gene expression data from human tissues to infer multi-level cis-regulatory networks. It uses DNA methylation to identify enhancers and correlates their state with expression of nearby genes to identify one or more transcriptional targets. Transcription factor (TF) binding site analysis of enhancers is coupled with expression analysis of all TFs to infer upstream regulators. This package can be easily applied to TCGA public available cancer data sets and custom DNA methylation and gene expression data sets (Yao et al. 2015).
ELMER analysis have 5 main steps:
Identify distal enhancer probes on HM450K.
Identify distal enhancer probes with significantly different DNA methyaltion level in control group and experiment group.
Identify putative target genes for differentially methylated distal enhancer probes.
Identify enriched motifs for the distal enhancer probes which are significantly differentially methylated and linked to putative target gene.
Identify regulatory TFs whose expression associate with DNA methylation at motifs.
This section shows how to use ELMER to analyze TCGA data using as example LGG and GBM samples.
Preparing the data for ELMER package
After downloading the data with TCGAbiolinks package, some steps are still required to use TCGA data with ELMER. These steps can be done with the function TCGAprepare_elmer. This function for the DNA methylation data will remove probes with NA values in more than 20% samples and remove the annotation data, for RNA expression data it will take the log2(expression + 1) of the expression matrix in order to linearize the relation between DNA methylation and expression. Also, it will prepare the row names of the matrix as required by the package.
The listing below shows how to use TCGAbiolinks (Colaprico et al. 2016) to search, download and prepare the data for the ELMER package. Due to time and memory constraints, we will use in this example only data from 10 LGG patients and 10 GBM patients that have both DNA methylation and gene expression data. This samples are the same used in the previous steps.
#----------- 8.3 Identification of Regulatory Enhancers -------
library(TCGAbiolinks)
# Samples: primary solid tumor w/ DNA methylation and gene expression
lgg.samples <- matchedMetExp("TCGA-LGG", n = 10)
gbm.samples <- matchedMetExp("TCGA-GBM", n = 10)
samples <- c(lgg.samples,gbm.samples)
#-----------------------------------
# 1 - Methylation
# ----------------------------------
query.met <- GDCquery(project = c("TCGA-LGG","TCGA-GBM"),
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450",
legacy = TRUE,
barcode = samples)
GDCdownload(query.met)
met <- GDCprepare(query.met, save = FALSE)
met <- subset(met,subset = as.character(GenomicRanges::seqnames(met)) %in% c("chr9"))
met.elmer <- TCGAprepare_elmer(met, platform = "HumanMethylation450")
#-----------------------------------
# 2 - Expression
# ----------------------------------
query.exp <- GDCquery(project = c("TCGA-LGG","TCGA-GBM"),
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
legacy = TRUE,
barcode = samples)
GDCdownload(query.exp)
exp.lgg <- GDCprepare(query.exp, save = FALSE)
exp.elmer <- TCGAprepare_elmer(exp.lgg, platform = "IlluminaHiSeq_RNASeqV2")
save(exp.elmer, met.elmer, gbm.samples, lgg.samples, file = "elmer.example.rda", compress = "xz")
Finally, the ELMER input is a mee object that contains a DNA methylation matrix, an gene expression matrix, a probe information GRanges, the gene information GRanges and a data frame summarizing the data. It should be highlighted that samples without both the gene expression and DNA methylation data will be removed from the mee object.
By default the function fetch.mee that is used to create the mee object will separate the samples into two groups, the control group (normal samples) and the experiment group (tumor samples), but the user can relabel the samples to compare different groups. For the next sections, we will work with both the experimental group (LGG) and control group (GBM).
library(ELMER)
geneAnnot <- txs()
geneAnnot$GENEID <- paste0("ID",geneAnnot$GENEID)
geneInfo <- promoters(geneAnnot,upstream = 0, downstream = 0)
probe <- get.feature.probe()
# Recover the data created in the last step
data(elmerExample)
# create mee object, use @ to access the matrices inside the object
mee <- fetch.mee(meth = met.elmer[1:800,],
exp = exp.elmer,
TCGA = TRUE,
probeInfo = probe,
geneInfo = geneInfo)
# Relabel GBM samples in the mee object: GBM is control
mee@sample$TN[mee@sample$ID %in% gbm.samples] <- "Control"
4.6 ELMER analysis
After preparing the data into a mee object, we executed the five ELMER steps for both the hypo (distal enhancer probes hypomethylated in the LGG group) and hyper (distal enhancer probes hypermethylated in the LGG group) direction. The code is shown below. A description of how these distal enhacer probes are identified is found in the ELMER.data vignette.
cores <- 1 # you can use more cores if you want
# Available directions are hypo and hyper, we will use only hypo
# due to speed constraint
direction <- c("hyper")
dir.out <- paste0("elmer/",direction)
dir.create(dir.out, showWarnings = FALSE, recursive = TRUE)
#--------------------------------------
# STEP 3: Analysis |
#--------------------------------------
# Step 3.1: Get diff methylated probes |
#--------------------------------------
message("Get diff methylated probes")
Sig.probes <- get.diff.meth(mee,
cores = cores,
dir.out = dir.out,
diff.dir = direction,
save = FALSE,
pvalue = 0.1) # Please a more strigent pvalue
#-------------------------------------------------------------
# Step 3.2: Identify significant probe-gene pairs |
#-------------------------------------------------------------
# Collect nearby 20 genes for Sig.probes
message("Get nearby genes")
nearGenes <- GetNearGenes(TRange=getProbeInfo(mee, probe=Sig.probes$probe),
cores=cores,
geneNum = 4, # defaul is 20
geneAnnot=getGeneInfo(mee))
message("Get anti correlated probes-genes")
pair <- get.pair(mee = mee,
probes = na.omit(Sig.probes$probe),
nearGenes = nearGenes,
permu.dir = paste0(dir.out,"/permu"),
dir.out = dir.out,
cores = cores,
label = direction,
permu.size = 5, # For significant results use 10000
Pe = 0.5) # For significant results use 0.001
Sig.probes.paired <- pair$Probe
#-------------------------------------------------------------
# Step 3.3: Motif enrichment analysis on the selected probes |
#-------------------------------------------------------------
if(length(Sig.probes.paired) > 0 ){
#-------------------------------------------------------------
# Step 3.3: Motif enrichment analysis on the selected probes |
#-------------------------------------------------------------
enriched.motif <- get.enriched.motif(probes=Sig.probes.paired,
dir.out=dir.out,
# Default value for min.incidence is 10
# we set 6 because we have only a subset
# of probes (only chr 9)
min.incidence = 5,
label=direction,
background.probes = probe$name)
# One of the output from the previous function is a file with the motif, OR and Number of probes
# It will be used for plotting purposes
motif.enrichment <- read.csv(paste0(dir.out,"/getMotif.",direction, ".motif.enrichment.csv"))
if(length(enriched.motif) > 0){
#-------------------------------------------------------------
# Step 3.4: Identifying regulatory TFs |
#-------------------------------------------------------------
message("get.TFs")
TF <- get.TFs(mee = mee,
enriched.motif = enriched.motif,
dir.out = dir.out,
cores = cores,
label = direction)
# One of the output from the previous function is a file with the raking of TF,
# for each motif. It will be used for plotting purposes
TF.meth.cor <- get(load(
paste0(dir.out,"/getTF.",direction,".TFs.with.motif.pvalue.rda")
))
}
}
When ELMER Identifies the enriched motifs for the distal enhancer probes which are significantly differentially methylated and linked to putative target gene, it will plot the Odds Ratio (x axis) for the each motifs found.
The list of motifs for the hyper direction (probes hypermethylated in LGG group compared to the GBM group) is found in the Figure below.
motif.enrichment.plot(motif.enrichment = motif.enrichment, save = FALSE)
We selected motifs that had a minimum incidence of 10 in the given probes set and the smallest lower boundary of 95% confidence interval for Odds Ratio of 1.1. These both values are the default from the ELMER package.
motif top.potential.TF potential.TFs
EGR1 EGR1
ETS_family ETS_family ETV2 ETV2 RXR_RAR_DR5 RXR_RAR_DR5
ZNF281 ZNF281
top_5percent EGR1 HOXB8;HOXA2;HOXC9;HOXB5;FOXD3;HOXC8;NKX2-5;HIST1H2BH;HIST1H1B;HOXC6;HOXC10;HOXB3;HMGA2;HMGN2;POLR2G;PDLIM1;MYBL2;PTTG1;ID3;HOXB2;TRIP13;EZH2;SERTAD3;FOXM1;MYCBP;TGIF1;E2F2;MED31;E2F8;BATF3;ZNF90;GATA3;RAD51;NFATC1;ANKRD53;HDAC1;TRIM5;ETV2;C21orf7;HIST1H2BJ;HOXC11;HOXA7;GFI1;HOXC5;HIST1H1D;DMRT1;HOXA9;EOMES;EN1;SRSF9;POLR2I;CENPK;TCF19;ESPL1;SERTAD1;ISL2;NME2;MXD3;RAD54B;DDB2;RAD51B;MCM2;ELF4;PRICKLE3;NR5A2;ZNF280A;HOXC13;FOXB1;HOXA3;HAND2;HOXA4;HOXD10;HOXD11;PRRX2;HMGB2;ALYREF;BUD31;TAF12;PITX1;SNAPC2;TAF13;MED30;HOXB7;BRIP1;ZNF311;TBX19;RUVBL1;ZNF593;PHTF1;HOPX;NMI;GTF2B;HMG20B;TP63;MAFA;LBX1;HOXA5 ETS_family HOXB5;HIST1H1D;HOXC10;HOXC13;HOXC11;HOXA2;HOXC9;YEATS4;HIST1H2BH;HOXC6;HIST1H1B;HOXC8;HOXD11;HOXB3;HMGA2;NKX2-5;MYBL2;PTTG1;HOXB2;TRIP13;EZH2;CENPK;FOXM1;TGIF1;E2F2;MED31;PIR;E2F8;ZNF90;RAD51;ZNF593;MXD3;MCM2;HIST1H2BJ;HOXB8;HOXC5;ETV2;DMRT1;ISL2;PRRX2;HMGN2;PDLIM1;PTRF;PITX1;TCF19;ESPL1;HOXD10;RAD54B;DDB2;HOPX;HDAC1;C21orf7;HOXA5;ZNF280A;BARHL2;HOXB9;FOXB1;HOXA3;HOXD13;LBX1;HIST1H2BG;HIST1H1C;HMGB2;POLR2I;BUD31;TAF12;BRIP1;EN1;NKX3-2;ANKRD45;ARNTL2;GATA3;RUVBL1;PHTF1;TCF7;TCEA3;AEBP1;VGLL2;HOXA7;HOXA9;MEOX2;NR5A1;PAX3;OLIG3;BARX1;HOXB13;HOXD9;ASB11;ZNF479;TAL2;EOMES;TAF13;HAND2;FOXD3;GATA4;ENO1;POLR2L RXR_RAR_DR5 HAND2;HOXC11;HOXB8;HOXC9;HOXC6;HIST1H1B;FOXD3;HOXC8;HOXC10;BARX1;NKX2-5;HOXA2;HOXB5;GFI1;TTF2;HIST1H2BH;GATA3;HOXA4;TCEA3;HMGA2;PRICKLE3;ENO1;HMGN2;POLR2G;PDLIM1;POLR2I;MYBL2;PTTG1;DR1;HOXB2;TRIP13;EZH2;TAF12;SNAPC2;SERTAD3;MYCBP;TAF13;MED30;DDX20;SERTAD1;TGIF1;E2F2;MED31;E2F8;ZNF311;BATF3;EN1;TBX19;IRX5;KLF5;ZNF90;ZNF547;XBP1;IRF3;MED8;RAD51;HOXB3;RAD54B;PHTF1;RAD51B;ANKRD53;EBF3;HDAC1;GTF2B;TRIM5;ZNF20;ETV2;C21orf7;HIST1H2BJ;HOXC5;HOXA7;SPIB;HIST1H1D;HOXA9;ISL2;HOXD10;ZNF593;TP63;SRSF9;HMGB2;BUD31;ZNF581;CENPK;DEDD2;TCF19;FOXM1;ESPL1;NFE2L3;VDR;NME2;ZNF613;ZFP82;HOXD8;PRDM6;ZDHHC1;MXD3;DDB2 ZNF281 HOXB8;HOXC13;HOXC9;FOXD3;HOXA2;NKX2-5;HIST1H2BH;HOXD11;HOXC6;HOXC10;HOXB3;HMGA2;PDLIM1;MYBL2;PTTG1;HOXB2;TRIP13;BUD31;SNAPC2;SERTAD3;MYCBP;ESPL1;SERTAD1;TGIF1;E2F2;E2F8;BATF3;TGFB1I1;ZNF90;RAD51;PDLIM4;HDAC1;TEAD3;TRIM5;NR5A2;ETV2;C21orf7;HOXC8;HOXA7;GFI1;EOMES;HIST1H1B;HAND2;HOXA3;HOXD10;PRRX2;PTRF;GTF2E2;ZNF581;TCF19;FOXM1;ISL2;NME2;TBX19;SNAI2;ANKRD53;ELF4;PRICKLE3;TP63;HOXA5;ZNF280A;FOXB1;HOXB9;DMBX1;HOXD13;HOXA9;LBX1;HOXA4;EN1;FOSL1;POLR2L;ID3;EZH2;CSDA;CENPK;HOXB7;XBP1;ZNF593;MXD3;RAD54B;DDB2;NFATC1;CRABP2;TCEA3;MCM2;NMI;BCL3;HMG20B;MAFA;HOXB5;HIST1H1D;HOXC11;POU2F3;HOXA10;KLF14;VGLL2;FOXE3
After finding the enriched motifs, ELMER identifies regulatory transcription factors (TFs) whose expression is associated with DNA methylation at motifs. ELMER automatically creates a TF ranking plot for each enriched motif. This plot shows the TF ranking plots based on the association score \((-log(P value))\) between TF expression and DNA methylation of the motif. We can see in Figure below that the top three TFs that are associated with a motif found.
grid:TF.rank.plot(motif.pvalue=TF.meth.cor, motif="AP1", save=FALSE)
The output of this step is a data frame with the following columns:
motif: the names of motif.
top.potential.TF: the highest ranking upstream TFs which are known recognized the motif.
potential.TFs: TFs which are within top 5% list and are known recognized the motif. top5percent: all TFs which are within top 5% list considered candidate upstream regulators
Also, for each motif we can take a look at the three most relevant TFs. For example, Figure below shows the average DNA methylation level of sites with the first motif plotted against the expression of some top TFs associated with it. We can see that the experiment group (LGG samples) has a lower average methylation level of sites with the motif plotted and a higher average expression of the TFs.
png("TF.png",width = 800, height = 400)
scatter.plot(mee, category="TN", save=FALSE, lm_line=TRUE,
byTF=list(TF=unlist(stringr::str_split(TF[1,"top_5percent"],";"))[1:4],
probe=enriched.motif[[TF$motif[1]]]))
dev.off()
png 2
5 Conclusion
This workflow outlines how one can use specific Bioconductor packages for the analysis of cancer genomics and epigenomics data derived from the TCGA. In addition, we highlight the importance of using ENCODE and Roadmap data to inform on the biology of the non-coding elements defined by functional roles in gene regulation. We introduced TCGAbiolinks and RTCGAToolbox Bioconductor packages in order to illustrate how one can acquire TCGA specific data, followed by key steps for genomics analysis using GAIA package, for transcriptomic analysis using TCGAbiolinks, pathview packages and for DNA methylation analysis using TCGAbiolinks package. An inference of gene regulatory networks was also introduced by MINET package. Finally, we introduced Bioconductor packages AnnotationHub, ChIPSeeker, ComplexHeatmap, and ELMER to illustrate how one can acquire ENCODE/Roadmap data and integrate these data with the results obtained from analyzing TCGA data to identify and characterize candidate regulatory enhancers associated with cancer.
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 methods stats graphics grDevices
## [8] utils datasets base
##
## other attached packages:
## [1] ELMER_1.6.0
## [2] ELMER.data_1.6.0
## [3] Homo.sapiens_1.3.1
## [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [5] GO.db_3.4.1
## [6] OrganismDbi_1.18.0
## [7] GenomicFeatures_1.28.4
## [8] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
## [9] minfi_1.22.1
## [10] bumphunter_1.16.0
## [11] locfit_1.5-9.1
## [12] iterators_1.0.8
## [13] foreach_1.4.3
## [14] ggplot2_2.2.1
## [15] pbapply_1.3-3
## [16] ChIPseeker_1.12.1
## [17] motifStack_1.20.1
## [18] ade4_1.7-6
## [19] MotIV_1.32.0
## [20] grImport_0.9-0
## [21] XML_3.98-1.9
## [22] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [23] rGADEM_2.24.0
## [24] seqLogo_1.42.0
## [25] BSgenome_1.44.0
## [26] rtracklayer_1.36.4
## [27] Biostrings_2.44.2
## [28] XVector_0.16.0
## [29] c3net_1.1.1
## [30] igraph_1.0.1
## [31] minet_3.34.0
## [32] org.Hs.eg.db_3.4.1
## [33] AnnotationDbi_1.38.1
## [34] clusterProfiler_3.4.4
## [35] DOSE_3.2.0
## [36] circlize_0.4.1
## [37] ComplexHeatmap_1.14.0
## [38] gaia_2.20.0
## [39] bindrcpp_0.2
## [40] SummarizedExperiment_1.6.3
## [41] DelayedArray_0.2.7
## [42] matrixStats_0.52.2
## [43] Biobase_2.36.2
## [44] GenomicRanges_1.28.4
## [45] GenomeInfoDb_1.12.2
## [46] IRanges_2.10.2
## [47] S4Vectors_0.14.3
## [48] BiocGenerics_0.22.0
## [49] TCGAbiolinks_2.4.6
## [50] DT_0.2
## [51] TCGAWorkflowData_1.0.0
## [52] TCGAWorkflow_0.99.73
## [53] shiny_1.0.3
## [54] rmarkdown_1.6
## [55] knitr_1.16
##
## loaded via a namespace (and not attached):
## [1] ggthemes_3.4.0 prabclus_2.2-6
## [3] R.methodsS3_1.7.1 pkgmaker_0.22
## [5] tidyr_0.6.3 bit64_0.9-7
## [7] aroma.light_3.6.0 R.utils_2.5.0
## [9] data.table_1.10.4 hwriter_1.3.2
## [11] KEGGREST_1.16.0 RCurl_1.95-4.8
## [13] GEOquery_2.42.0 doParallel_1.0.10
## [15] snow_0.4-2 preprocessCore_1.38.1
## [17] RSQLite_2.0 bit_1.1-12
## [19] xml2_1.1.1 httpuv_1.3.5
## [21] assertthat_0.2.0 viridis_0.4.0
## [23] hms_0.3 evaluate_0.10.1
## [25] BiocInstaller_1.26.0 DEoptimR_1.0-8
## [27] caTools_1.17.1 dendextend_1.5.2
## [29] Rgraphviz_2.20.0 km.ci_0.5-2
## [31] DBI_0.7 geneplotter_1.54.0
## [33] htmlwidgets_0.9 reshape_0.8.6
## [35] EDASeq_2.10.0 matlab_1.0.2
## [37] purrr_0.2.2.2 selectr_0.3-1
## [39] dplyr_0.7.1 ggpubr_0.1.4
## [41] backports_1.1.0 bookdown_0.4
## [43] trimcluster_0.1-2 annotate_1.54.0
## [45] gridBase_0.4-7 biomaRt_2.32.1
## [47] pathview_1.16.1 robustbase_0.92-7
## [49] GenomicAlignments_1.12.1 mclust_5.3
## [51] mnormt_1.5-5 cluster_2.0.6
## [53] lazyeval_0.2.0 genefilter_1.58.1
## [55] labeling_0.3 edgeR_3.18.1
## [57] pkgconfig_2.0.1 nlme_3.1-131
## [59] nnet_7.3-12 bindr_0.1
## [61] rlang_0.1.1 RJSONIO_1.3-0
## [63] diptest_0.75-7 questionr_0.6.1
## [65] miniUI_0.1.1 downloader_0.4
## [67] registry_0.3 AnnotationHub_2.8.2
## [69] rprojroot_1.2 graph_1.54.0
## [71] rngtools_1.2.4 base64_2.0
## [73] Matrix_1.2-10 KMsurv_0.1-5
## [75] zoo_1.8-0 boot_1.3-19
## [77] RTCGAToolbox_2.6.0 whisker_0.3-2
## [79] GlobalOptions_0.0.12 png_0.1-7
## [81] viridisLite_0.2.0 rjson_0.2.15
## [83] bitops_1.0-6 R.oo_1.21.0
## [85] ConsensusClusterPlus_1.40.0 KernSmooth_2.23-15
## [87] blob_1.1.0 doRNG_1.6.6
## [89] shape_1.4.2 stringr_1.2.0
## [91] qvalue_2.8.0 nor1mix_1.2-2
## [93] ShortRead_1.34.0 readr_1.1.1
## [95] scales_0.4.1 memoise_1.1.0
## [97] magrittr_1.5 plyr_1.8.4
## [99] gplots_3.0.1 gdata_2.18.0
## [101] zlibbioc_1.22.0 compiler_3.4.1
## [103] RColorBrewer_1.1-2 illuminaio_0.18.0
## [105] KEGGgraph_1.38.0 plotrix_3.6-5
## [107] Rsamtools_1.28.0 MASS_7.3-47
## [109] stringi_1.1.5 highr_0.6
## [111] yaml_2.1.14 GOSemSim_2.2.0
## [113] latticeExtra_0.6-28 ggrepel_0.6.5
## [115] survMisc_0.5.4 fastmatch_1.1-0
## [117] tools_3.4.1 rstudioapi_0.6
## [119] foreign_0.8-69 gridExtra_2.2.1
## [121] rmdformats_0.3.3 rvcheck_0.0.9
## [123] digest_0.6.12 cmprsk_2.2-7
## [125] quadprog_1.5-5 fpc_2.1-10
## [127] Rcpp_0.12.11 siggenes_1.50.0
## [129] broom_0.4.2 httr_1.2.1
## [131] survminer_0.4.0 RCircos_1.2.0
## [133] psych_1.7.5 kernlab_0.9-25
## [135] colorspace_1.3-2 rvest_0.3.2
## [137] splines_3.4.1 RBGL_1.52.0
## [139] multtest_2.32.0 flexmix_2.3-14
## [141] xtable_1.8-2 jsonlite_1.5
## [143] UpSetR_1.3.3 modeltools_0.2-21
## [145] R6_2.2.2 htmltools_0.3.6
## [147] mime_0.5 glue_1.1.1
## [149] BiocParallel_1.10.1 DESeq_1.28.0
## [151] interactiveDisplayBase_1.14.0 class_7.3-14
## [153] beanplot_1.2 codetools_0.2-15
## [155] fgsea_1.2.1 mvtnorm_1.0-6
## [157] lattice_0.20-35 tibble_1.3.3
## [159] curl_2.7 gtools_3.5.0
## [161] openssl_0.9.6 survival_2.41-3
## [163] limma_3.32.3 munsell_0.4.3
## [165] DO.db_2.9 GetoptLong_0.1.6
## [167] GenomeInfoDbData_0.99.0 reshape2_1.4.2
## [169] gtable_0.2.0
Competing interests
No competing interests were disclosed.
Grant information
The project was supported by the São Paulo Research Foundation (FAPESP) (2015/02844-7 and 2016/01389-7 to T.C.S. & H.N. and 2015/07925-5 to H.N.), the BridgeIRIS project, funded by INNOVIRIS, Region de Bruxelles Capitale, Brussels, Belgium, and by GENomic profiling of Gastrointestinal Inflammatory-Sensitive CANcers (GENGISCAN), Belgian FNRS PDR (T100914F to G.B.). Funding for open access charge: São Paulo Research Foundation (FAPESP) (2015/07925-5).
Acknowledgements
We are grateful to all the authors of the packages used in this article. Also, we would like to thank The GDC Support Team for the provided help, which was necessary in order to update the TCGAbiolinks package to use GDC API.
References
Aleksandra Pekowska, Simon Anders. 2015. ChIP-Seq Analysis Basics.
Altay, Gökmen, and Frank Emmert-Streib. 2010. “Inferring the Conservative Causal Core of Gene Regulatory Networks.” BMC Systems Biology 4 (1). BioMed Central Ltd: 132.
Bernstein, B. E., J. A. Stamatoyannopoulos, J. F. Costello, B. Ren, A. Milosavljevic, A. Meissner, M. Kellis, et al. 2010. “The NIH Roadmap Epigenomics Mapping Consortium.” Nat. Biotechnol. 28 (10): 1045–8.
Bernstein, Bradley E, Michael Kamal, Kerstin Lindblad-Toh, Stefan Bekiranov, Dione K Bailey, Dana J Huebert, Scott McMahon, et al. 2005. “Genomic Maps and Comparative Analysis of Histone Modifications in Human and Mouse.” Cell 120 (2). Elsevier: 169–81.
Bonasio, Roberto, Shengjiang Tu, and Danny Reinberg. 2010. “Molecular Signals of Epigenetic States.” Science 330 (6004). American Association for the Advancement of Science: 612–16.
Bullard, James H, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in MRNA-Seq Experiments.” BMC Bioinformatics 11 (1). BioMed Central Ltd: 94.
Ceccarelli, Michele, FlorisP. Barthel, TathianeM. Malta, ThaisS. Sabedot, SofieR. Salama, BradleyA. Murray, Olena Morozova, et al. 2016. “Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma.” Cell 164 (3): 550–63. doi:http://dx.doi.org/10.1016/j.cell.2015.12.028.
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. 2016. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of Tcga Data.” Nucleic Acids Research 44 (8): e71. doi:10.1093/nar/gkv1507.
Consortium, ENCODE Project, and others. 2011. “A User’s Guide to the Encyclopedia of Dna Elements (Encode).” PLoS Biol 9 (4): e1001046.
Creyghton, Menno P, Albert W Cheng, G Grant Welstead, Tristan Kooistra, Bryce W Carey, Eveline J Steine, Jacob Hanna, et al. 2010. “Histone H3k27ac Separates Active from Poised Enhancers and Predicts Developmental State.” Proceedings of the National Academy of Sciences 107 (50). National Acad Sciences: 21931–6.
Davis, Caleb F, Christopher J Ricketts, Min Wang, Lixing Yang, Andrew D Cherniack, Hui Shen, Christian Buhay, et al. 2014. “The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma.” Cancer Cell 26 (3). Elsevier: 319–30.
Deaton, Aimée M, and Adrian Bird. 2011. “CpG Islands and the Regulation of Transcription.” Genes & Development 25 (10). Cold Spring Harbor Lab: 1010–22.
Droit, A, R Gottardo, G Roberston, and L Li. 2015. “RGADEM: De Novo Motif Discovery.”
Faith, Jeremiah J, Boris Hayete, Joshua T Thaden, Ilaria Mogno, Jamey Wierzbowski, Guillaume Cottarel, Simon Kasif, James J Collins, and Timothy S Gardner. 2007. “Large-Scale Mapping and Validation of Escherichia Coli Transcriptional Regulation from a Compendium of Expression Profiles.” PLoS Biol 5 (1): e8.
Fingerman, I. M., L. McDaniel, X. Zhang, W. Ratzat, T. Hassan, Z. Jiang, R. F. Cohen, and G. D. Schuler. 2011. “NCBI Epigenomics: a new public resource for exploring epigenomic data sets.” Nucleic Acids Res. 39 (Database issue): D908–912.
Giorgio, Elisa, Daniel Robyr, Malte Spielmann, Enza Ferrero, Eleonora Di Gregorio, Daniele Imperiale, Giovanna Vaula, et al. 2015. “A Large Genomic Deletion Leads to Enhancer Adoption by the Lamin B1 Gene: A Second Path to Autosomal Dominant Leukodystrophy (Adld).” Human Molecular Genetics. Oxford Univ Press, ddv065.
Gröschel, Stefan, Mathijs A Sanders, Remco Hoogenboezem, Elzo de Wit, Britta AM Bouwman, Claudia Erpelinck, Vincent HJ van der Velden, et al. 2014. “A Single Oncogenic Enhancer Rearrangement Causes Concomitant Evi1 and Gata2 Deregulation in Leukemia.” Cell 157 (2). Elsevier: 369–81.
Hawkins, R. D., G. C. Hon, and B. Ren. 2010. “Next-generation genomics: an integrative approach.” Nat. Rev. Genet. 11 (7): 476–86.
Heintzman, Nathaniel D, Gary C Hon, R David Hawkins, Pouya Kheradpour, Alexander Stark, Lindsey F Harp, Zhen Ye, et al. 2009. “Histone Modifications at Human Enhancers Reflect Global Cell-Type-Specific Gene Expression.” Nature 459 (7243). Nature Publishing Group: 108–12.
Heintzman, Nathaniel D, Rhona K Stuart, Gary Hon, Yutao Fu, Christina W Ching, R David Hawkins, Leah O Barrera, et al. 2007. “Distinct and Predictive Chromatin Signatures of Transcriptional Promoters and Enhancers in the Human Genome.” Nature Genetics 39 (3). Nature Publishing Group: 311–18.
Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2). Nature Publishing Group: 115–21.
Kannan, Lavanya, Marcel Ramos, Angela Re, Nehme El-Hachem, Zhaleh Safikhani, Deena MA Gendoo, Sean Davis, et al. 2015. “Public Data and Open Source Tools for Multi-Assay Genomic Investigation of Disease.” Briefings in Bioinformatics. Oxford Univ Press, bbv080.
Kundaje, Anshul, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, et al. 2015. “Integrative Analysis of 111 Reference Human Epigenomes.” Nature 518 (7539). Nature Publishing Group: 317–30.
Li, Leping. 2009. “GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an Em Algorithm for Motif Discovery.” Journal of Computational Biology 16 (2). Mary Ann Liebert, Inc. 2 Madison Avenue Larchmont, NY 10538 USA: 317–29.
Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14). Oxford Univ Press: 1830–1.
Mahony, Shaun, and Panayiotis V Benos. 2007. “STAMP: A Web Tool for Exploring Dna-Binding Motif Similarities.” Nucleic Acids Research 35 (suppl 2). Oxford Univ Press: W253–W258.
Mahony, Shaun, Philip E Auron, and Panayiotis V Benos. 2007. “DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies.” PLoS Comput Biol 3 (3). Public Library of Science: e61.
Marabita, Francesco, Malin Almgren, Maléne E Lindholm, Sabrina Ruhrmann, Fredrik Fagerström-Billai, Maja Jagodic, Carl J Sundberg, et al. 2013. “An Evaluation of Analysis Pipelines for Dna Methylation Profiling Using the Illumina Humanmethylation450 Beadchip Platform.” Epigenetics 8 (3). Taylor & Francis: 333–46.
Margolin, Adam A, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo D Favera, and Andrea Califano. 2006. “ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context.” BMC Bioinformatics 7 (Suppl 1). BioMed Central Ltd: S7.
Mercier, Eloi, and Raphael Gottardo. 2014. “Motiv: Motif Identification and Validation.”
Mercier, Eloi, Arnaud Droit, Leping Li, Gordon Robertson, Xuekui Zhang, and Raphael Gottardo. 2011. “An Integrated Pipeline for the Genome-Wide Analysis of Transcription Factor Binding Sites from Chip-Seq.” PLoS One 6 (2). Public Library of Science: e16432.
Mermel, Craig H, Steven E Schumacher, Barbara Hill, Matthew L Meyerson, Rameen Beroukhim, Gad Getz, and others. 2011. “GISTIC2. 0 Facilitates Sensitive and Confident Localization of the Targets of Focal Somatic Copy-Number Alteration in Human Cancers.” Genome Biol 12 (4): R41.
Meyer, Patrick E, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi. 2007. “Information-Theoretic Inference of Large Transcriptional Regulatory Networks.” EURASIP Journal on Bioinformatics and Systems Biology 2007. Hindawi Publishing Corp.: 8–8.
Meyer, Patrick E., Frédéric Lafitte, and Gianluca Bontempi. 2008. “Minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information.” BMC Bioinformatics 9 (1): 1–10. doi:10.1186/1471-2105-9-461.
Montojo, J, K Zuberi, H Rodriguez, F Kazi, G Wright, S L Donaldson, Q Morris, and G D Bader. 2010. “GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop.” Bioinformatics 26 (22): 2927–8.
Morgan M, Hester J, Obenchain V, and Pagès H. n.d. “SummarizedExperiment: SummarizedExperiment Container. R Package Version 1.1.0.” http://bioconductor.org/packages/SummarizedExperiment/.
Morgan M, Tenenbaum D, Carlson M, and Arora S. n.d. “AnnotationHub: Client to Access Annotationhub Resources. R Package Version 2.2.2.” http://bioconductor.org/packages/Gviz/.
Morganella, Sandro, Stefano Maria Pagnotta, and Michele Ceccarelli. n.d. “GAIA: Genomic Analysis of Important Aberrations.”
Network, Cancer Genome Atlas Research, and others. 2012. “Comprehensive Genomic Characterization of Squamous Cell Lung Cancers.” Nature 489 (7417). Nature Publishing Group: 519–25.
———. 2013. “Comprehensive Molecular Characterization of Clear Cell Renal Cell Carcinoma.” Nature 499 (7456). Nature Publishing Group: 43–49.
———. 2014a. “Comprehensive Molecular Characterization of Gastric Adenocarcinoma.” Nature 513 (7517). Nature Publishing Group: 202–9.
———. 2014b. “Comprehensive Molecular Characterization of Gastric Adenocarcinoma.” Nature 513 (7517). Nature Publishing Group: 202–9.
———. 2014c. “Comprehensive Molecular Profiling of Lung Adenocarcinoma.” Nature 511 (7511). Nature Publishing Group: 543–50.
———. 2014d. “Integrated Genomic Characterization of Papillary Thyroid Carcinoma.” Cell 159 (3). Elsevier: 676–90.
———. 2015. “The Molecular Taxonomy of Primary Prostate Cancer.” Cell 163 (4). Elsevier: 1011–25.
———. 2016. “Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma.” N Engl J Med 2016 (374). Mass Medical Soc: 135–45.
Network, Cancer Genome Atlas, and others. 2012a. “Comprehensive Molecular Characterization of Human Colon and Rectal Cancer.” Nature 487 (7407). Nature Publishing Group: 330–37.
———. 2012b. “Comprehensive Molecular Portraits of Human Breast Tumours.” Nature 490 (7418). Nature Publishing Group: 61–70.
———. 2015a. “Comprehensive Genomic Characterization of Head and Neck Squamous Cell Carcinomas.” Nature 517 (7536). Nature Publishing Group: 576–82.
———. 2015b. “Genomic Classification of Cutaneous Melanoma.” Cell 161 (7). Elsevier: 1681–96.
Nishida, Hiromi, Takahiro Suzuki, Shinji Kondo, Hisashi Miura, Yu-ichi Fujimura, and Yoshihide Hayashizaki. 2006. “Histone H3 Acetylated at Lysine 9 in Promoter Is Associated with Low Nucleosome Density in the Vicinity of Transcription Start Site in Human Cell.” Chromosome Research 14 (2). Springer: 203–11.
Noushmehr, Houtan, Daniel J Weisenberger, Kristin Diefes, Heidi S Phillips, Kanan Pujara, Benjamin P Berman, Fei Pan, et al. 2010. “Identification of a Cpg Island Methylator Phenotype That Defines a Distinct Subgroup of Glioma.” Cancer Cell 17 (5). Elsevier: 510–22.
Ou, J, M Brodsky, S Wolfe, and LJ Zhu. 2013. “MotifStack: Plot Stacked Logos for Single or Multiple Dna, Rna and Amino Acid Sequence.”
Peters, Antoine HFM, Stefan Kubicek, Karl Mechtler, Roderick J O’Sullivan, Alwin AHA Derijck, Laura Perez-Burgos, Alexander Kohlmaier, et al. 2003. “Partitioning and Plasticity of Repressive Histone Methylation States in Mammalian Chromatin.” Molecular Cell 12 (6). Elsevier: 1577–89.
Phillips, Theresa. 2008. “The Role of Methylation in Gene Expression.” Nature Education 1 (1): 116.
Rada-Iglesias, Alvaro, Ruchi Bajpai, Tomek Swigut, Samantha A Brugmann, Ryan A Flynn, and Joanna Wysocka. 2011. “A Unique Chromatin Signature Uncovers Early Developmental Enhancers in Humans.” Nature 470 (7333). Nature Publishing Group: 279–83.
Rhodes, Daniel R, and Arul M Chinnaiyan. 2005. “Integrative Analysis of the Cancer Transcriptome.” Nature Genetics 37. Nature Publishing Group: S31–S37.
Risso, Davide, Katja Schwartz, Gavin Sherlock, and Sandrine Dudoit. 2011. “GC-Content Normalization for Rna-Seq Data.” BMC Bioinformatics 12 (1). BioMed Central Ltd: 480.
Robertson, Keith D. 2005. “DNA Methylation and Human Disease.” Nature Reviews Genetics 6 (8). Nature Publishing Group: 597–610.
Samur, Mehmet Kemal. 2014. “RTCGAToolbox: A New Tool for Exporting Tcga Firehose Data.”
Shi, Xingjie, Jin Liu, Jian Huang, Yong Zhou, BenChang Shia, and Shuangge Ma. 2014. “Integrative Analysis of High-Throughput Cancer Studies with Contrasted Penalization.” Genetic Epidemiology 38 (2). Wiley Online Library: 144–51.
Siegmund, Kimberly D. 2011. “Statistical Approaches for the Analysis of Dna Methylation Microarray Data.” Human Genetics 129 (6). Springer: 585–95.
Silva, TC, A Colaprico, C Olsen, F D’Angelo, G Bontempi, M Ceccarelli, and H Noushmehr. 2016. “TCGA Workflow: Analyze Cancer Genomics and Epigenomics Data Using Bioconductor Packages [Version 2; Referees: 1 Approved, 1 Approved with Reservations].” F1000Research 5 (1542). doi:10.12688/f1000research.8923.2.
Stark, Chris, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. 2006. “BioGRID: A General Repository for Interaction Datasets.” Nucleic Acids Research 34 (suppl 1): D535–D539. doi:10.1093/nar/gkj109.
Sur, Inderpreet Kaur, Outi Hallikas, Anna Vähärautio, Jian Yan, Mikko Turunen, Martin Enge, Minna Taipale, Auli Karhu, Lauri A Aaltonen, and Jussi Taipale. 2012. “Mice Lacking a Myc Enhancer That Includes Human Snp Rs6983267 Are Resistant to Intestinal Tumors.” Science 338 (6112). American Association for the Advancement of Science: 1360–3.
Weinstein, John N, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, et al. 2013. “The Cancer Genome Atlas Pan-Cancer Analysis Project.” Nature Genetics 45 (10). Nature Publishing Group: 1113–20.
Wilks, Christopher, Melissa S Cline, Erich Weiler, Mark Diehkans, Brian Craft, Christy Martin, Daniel Murphy, et al. 2014. “The Cancer Genomics Hub (Cghub): Overcoming Cancer Through the Power of Torrential Data.” Database 2014. Oxford University Press: bau093.
Yao, L, H Shen, PW Laird, PJ Farnham, and BP Berman. 2015. “Inferring Regulatory Element Landscapes and Transcription Factor Networks from Cancer Methylomes.” Genome Biology 16 (1): 105–5.
Yao, Lijing, Benjamin P Berman, and Peggy J Farnham. 2015. “Demystifying the Secret Mission of Enhancers: Linking Distal Regulatory Elements to Target Genes.” Critical Reviews in Biochemistry and Molecular Biology 50 (6). Taylor & Francis: 550–73.
Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. 2015. “ChIPseeker: An R/Bioconductor Package for Chip Peak Annotation, Comparison and Visualization.” Bioinformatics. Oxford Univ Press, btv145.
Z., Gu. n.d. “ComplexHeatmap: Making Complex Heatmaps. R Package Version 1.7.1.” https://github.com/jokergoo/ComplexHeatmap.
Zheng, Siyuan, Andrew D Cherniack, Ninad Dewal, Richard A Moffitt, Ludmila Danilova, Bradley A Murray, Antonio M Lerario, et al. 2016. “Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma.” Cancer Cell 29 (5). Elsevier: 723–36.