Gene Expression Normalization Workflow
Gene Expression Normalization Workflow
- 1 Introduction
- 2 Package Content
- 3 Workflow steps
- 3.1 A. Reading Input Data
- 3.2 B. Creating an ExpressionSet Object
- 3.3 C. Principal Variance Component Analysis of the raw data
- 3.4 D. Surrogate Variable Analysis
- 3.5 E. Computing the correlation between the surrogate variables and the covariates
- 3.6 F. Principal Variance Component Analysis of the raw data with the surrogate variables included as covariates
- 3.7 G. Supervised normalization of Microarrays
- 3.8 H. Principal Variance Component Analysis on the normalized data
- 4 Discussion
- References
1 Introduction
Every Gene Expression study has an underlying question which the experimenter tries to address.Before proceeding towards any downstream analysis, the researcher has to decide how to normalize the gene expression data to minimize the impact of technical and other confounding factors on estimation of the environmental or biological factors of interest. One approach is to simply remove all of the major principal components of variation or specified technical effects, but this can also remove the biological signal of interest.Supervised normalization strategies encourage the user to define the major parameters influencing the expression variation, and then systematically remove them while shielding the biological factors of interest.
Here we outline a dynamic workflow that recognizes that there is not any ‘one algorithm fits all data sets’ analysis approach that works for every experiment.The workflow employs three successive procedures available in Bioconductor: Principal Variance Component Analysis (PVCA) to explore how technical and biological factors correlate with the major components of variance in the data set; Surrogate Variable Analysis (SVA) to identify major unwanted sources of variation; and Supervised Normalization of Microarrays (SNM) to efficiently remove these sources of variation while retaining the biological factor(s) of interest. The data is based on a study contrasting peripheral blood gene expression in acute myocardial infarction and coronary artery disease patients, described in Kim et al. (2014). Only a subset of the data is used, and the labels have been changed for privacy protection, but the full data set is accessible from the GEO data repository at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49925.
2 Package Content
The workflow package contains five functions:
1. expSetobj – function to create an Expression Set Object which encapsulates the expression values, covariate information, and the experimental metadata
2. pvcAnaly - Principal Variance Component Analysis functionality
3. surVarAnaly – function which identifies hidden or new surrogate variables that are potential covariates hidden in the data
4. conTocat – function to convert continuous variables to categorical variables (discretize)
5. snmAnaly – function which implements the Supervised Normalization of Microarrays (SNM) normalization technique
2.1 Sample data
The workflow runs on these sample two data files:
1. CAD_Expression.csv - file containing gene expression values (8000 probes for 100 samples)
2. CAD_Exptdsgn.csv - file containing the phenotype information (100 samples with 10 covariates)
The covariates are Study (a large batch effect since samples were run 2 years apart), Rin (RNA integrity number, a measure of RNA quality), BMI, Age, Gender, CAD (disease status as Acute MI or Fine at the time of sampling of patients in a CAD cohort), Rin3, BMI3, Age3 (discrete categorizations of the three continuous measures, used in the PVCA), and Array (a sample number) The concept of the analysis is to explore how adjusting for Study, Rin, and other covariates if desired, influences the interpretation of the impact of CAD on gene expression.That is, CAD is the biological variable of interest.
3 Workflow steps
3.1 A. Reading Input Data
- File containing the gene expression values (log transformed)
Format - Features x Samples (where the features could be probes in the case of microarrays, or gene or exon names in the case of RNASeq data), preferably a comma separated file (.csv) - File containing the covariates
Covariates are the different phenotypes that describe sample attributes, and can be biological, technical, or environmental.In the workflow we will also identify hidden covariates
Format - Samples x Covariates, preferably a comma separated file (.csv)
You should start by examining the sample data to get an idea of how the files should be structured.
## Warning: replacing previous import 'Biobase::anyMissing' by
## 'matrixStats::anyMissing' when loading 'ExpressionNormalizationWorkflow'
## Warning: replacing previous import 'Biobase::rowMedians' by
## 'matrixStats::rowMedians' when loading 'ExpressionNormalizationWorkflow'
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colMeans,
## colnames, colSums, do.call, duplicated, eval, evalq, Filter,
## Find, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, Map, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Set your working directory (or use Control + Shift + H to choose it)
setwd("working directory")
## read in the file containing the gene expression values
expData_file_name <- system.file("folder containing the data", "CAD_Expression.csv", package="ExpressionNormalizationWorkflow")
exprs <- read.table(expData_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)
## read in the file containing the covariates
expDesign_file_name <- system.file("folder containing the data", "CAD_ExptDsgn.csv", package="ExpressionNormalizationWorkflow")
covrts <- read.table(expDesign_file_name, header=TRUE, sep=",", row.names=1, as.is=TRUE)
## A001 A002 A003 A004 A005
## ILMN_2389211 14.091 13.892 13.747 14.056 14.243
## ILMN_1667796 13.872 13.903 14.273 14.276 14.125
## ILMN_1683271 13.419 13.665 13.180 13.487 14.252
## ILMN_2100437 14.237 13.638 14.160 14.004 13.945
## ILMN_2242491 13.692 13.830 13.667 13.756 14.015
## Study Rin3 Gender Age3 BMI3 CAD Array Rin Age BMI
## A001 A HIGH FEM ADULT OVER ACUTE 1 9.2 56 31.0
## A002 A HIGH FEM ADULT OVER ACUTE 2 9.1 54 26.9
## A003 A HIGH MAL ADULT NORM ACUTE 3 9.1 51 20.7
## A004 A HIGH MAL ADULT NORM ACUTE 4 9.3 52 24.4
## A005 A HIGH MAL MATURE OVER ACUTE 5 9.3 62 26.1
## Study Rin3 Gender Age3 BMI3 CAD Array Rin Age BMI
## A001 A HIGH FEM ADULT OVER ACUTE 1 9.2 56 31.0
## A002 A HIGH FEM ADULT OVER ACUTE 2 9.1 54 26.9
## A003 A HIGH MAL ADULT NORM ACUTE 3 9.1 51 20.7
## A004 A HIGH MAL ADULT NORM ACUTE 4 9.3 52 24.4
## A005 A HIGH MAL MATURE OVER ACUTE 5 9.3 62 26.1
## A006 A HIGH MAL MATURE OVER ACUTE 6 9.4 65 29.9
## A007 A HIGH MAL MATURE OVER ACUTE 7 8.8 65 28.4
## A008 A HIGH MAL MATURE OVER ACUTE 8 8.8 64 33.1
## A009 A HIGH FEM MATURE OVER FINE 9 9.0 69 29.8
## A010 A MOD MAL ADULT OVER ACUTE 10 8.7 50 33.4
## A011 A HIGH MAL ADULT OBESE ACUTE 11 9.3 56 48.4
## A012 A HIGH FEM ELDERLY OVER FINE 12 9.1 77 26.7
## A013 A HIGH MAL ADULT OBESE ACUTE 13 8.9 46 36.0
## A014 A HIGH FEM ADULT OVER ACUTE 14 8.8 58 27.5
## A015 A MOD FEM MATURE OVER FINE 15 8.6 71 26.8
## A016 A HIGH MAL MATURE OVER FINE 16 8.8 65 33.0
## A017 A MOD FEM MATURE OVER FINE 17 8.7 68 25.0
## A018 A MOD FEM ADULT OVER ACUTE 18 8.2 57 34.0
## A019 A MOD MAL ADULT OBESE ACUTE 19 8.6 56 35.0
## A020 A HIGH FEM MATURE OBESE FINE 20 9.2 68 44.5
## A021 A MOD MAL ELDERLY NORM FINE 21 8.5 77 23.8
## A022 A MOD FEM MATURE NORM FINE 22 8.6 69 21.3
## A023 A MOD MAL MATURE NORM FINE 23 8.4 70 21.8
## A024 A MOD FEM ELDERLY OVER FINE 24 8.7 80 26.5
## A025 A MOD MAL ELDERLY NORM FINE 25 8.7 83 23.9
## A026 A MOD FEM ELDERLY OVER FINE 26 8.3 81 29.7
## A027 A MOD MAL MATURE NORM FINE 27 8.5 74 22.1
## A028 A LOW MAL MATURE OVER ACUTE 28 7.3 64 33.9
## A029 A MOD MAL MATURE OVER FINE 29 8.6 68 31.8
## A030 A MOD FEM MATURE NORM FINE 30 8.7 74 17.6
## A031 A LOW MAL MATURE OVER FINE 31 5.0 74 33.0
## A032 A LOW FEM MATURE OVER FINE 32 6.8 71 33.4
## A033 A HIGH FEM MATURE OVER FINE 33 8.8 73 32.3
## A034 A LOW MAL ADULT OVER ACUTE 34 6.2 57 30.2
## A035 A MOD MAL ELDERLY OVER FINE 35 8.7 76 28.8
## A036 A HIGH MAL ADULT OVER FINE 36 8.8 42 31.8
## A037 A HIGH MAL MATURE OBESE ACUTE 37 9.0 65 36.3
## A038 A LOW MAL ELDERLY NORM ACUTE 38 7.8 75 22.8
## A039 A MOD MAL ADULT OVER ACUTE 39 8.7 56 34.6
## A040 A HIGH MAL ADULT OBESE ACUTE 40 9.4 54 38.2
## A041 A HIGH MAL ELDERLY OVER ACUTE 41 9.0 83 28.0
## A042 A MOD MAL ADULT OVER ACUTE 42 8.6 49 27.9
## A043 A HIGH FEM MATURE OBESE ACUTE 43 9.5 67 41.9
## A044 A HIGH MAL MATURE OBESE ACUTE 44 9.6 65 39.8
## A045 A MOD FEM ELDERLY NORM FINE 45 8.7 76 24.9
## A046 A HIGH MAL MATURE OVER FINE 46 9.0 66 25.1
## A047 A HIGH MAL MATURE OVER ACUTE 47 9.1 66 25.7
## A048 A MOD MAL MATURE OVER ACUTE 48 8.7 61 28.9
## A049 A HIGH FEM MATURE NORM FINE 49 9.0 68 21.6
## A050 A MOD FEM ELDERLY OVER FINE 50 8.4 75 32.0
## A051 A MOD MAL MATURE NORM FINE 51 8.6 69 20.1
## B052 B MOD FEM MATURE OBESE FINE 52 8.5 60 36.2
## B053 B HIGH MAL MATURE OVER FINE 53 9.7 63 27.0
## B054 B MOD FEM ELDERLY NORM ACUTE 54 8.2 75 24.7
## B055 B LOW FEM ELDERLY OVER ACUTE 55 7.5 75 26.3
## B056 B LOW MAL MATURE OVER ACUTE 56 6.1 67 26.0
## B057 B LOW MAL ELDERLY NORM ACUTE 57 7.9 79 22.4
## B058 B LOW FEM ELDERLY OVER ACUTE 58 7.8 75 25.2
## B059 B LOW MAL MATURE OVER ACUTE 59 7.8 72 32.1
## B060 B MOD MAL MATURE NORM ACUTE 60 8.0 66 21.2
## B061 B LOW MAL MATURE OVER ACUTE 61 7.4 68 32.6
## B062 B MOD MAL MATURE OVER FINE 62 8.0 60 31.5
## B063 B LOW MAL ADULT NORM FINE 63 7.9 50 23.8
## B064 B MOD MAL ADULT OVER FINE 64 8.2 53 25.9
## B065 B LOW MAL ADULT OBESE FINE 65 7.5 58 41.6
## B066 B LOW MAL MATURE NORM ACUTE 66 7.4 69 21.1
## B067 B LOW FEM ELDERLY OVER ACUTE 67 7.1 75 28.3
## B068 B LOW MAL MATURE OBESE ACUTE 68 6.7 70 38.6
## B069 B MOD MAL ELDERLY OVER ACUTE 69 8.0 76 32.1
## B070 B MOD FEM MATURE NORM ACUTE 70 8.3 74 15.1
## B071 B LOW MAL MATURE OBESE ACUTE 71 6.8 60 42.6
## B072 B LOW MAL ADULT OVER ACUTE 72 7.6 56 34.9
## B073 B LOW MAL MATURE OVER FINE 73 7.5 61 30.3
## B074 B LOW MAL ADULT OVER FINE 74 6.2 59 33.5
## B075 B LOW MAL ADULT OVER FINE 75 6.2 36 26.7
## B076 B MOD MAL ADULT OVER FINE 76 8.0 57 31.5
## B077 B MOD FEM ADULT OVER FINE 77 8.6 47 29.8
## B078 B HIGH MAL ADULT OBESE FINE 78 8.8 55 36.4
## B079 B HIGH FEM ADULT OVER FINE 79 9.5 59 33.0
## B080 B LOW FEM ADULT OBESE FINE 80 7.2 43 43.6
## B081 B HIGH FEM MATURE OBESE FINE 81 8.8 60 38.8
## B082 B LOW FEM ADULT NORM FINE 82 7.3 50 21.5
## B083 B MOD MAL ADULT OVER FINE 83 8.6 55 26.5
## B084 B MOD MAL ELDERLY NORM ACUTE 84 8.3 82 20.7
## B085 B LOW MAL ELDERLY NORM ACUTE 85 7.8 78 20.9
## B086 B LOW FEM ADULT NORM FINE 86 6.7 43 24.0
## B087 B MOD MAL ELDERLY OVER ACUTE 87 8.4 77 32.3
## B088 B MOD FEM MATURE OBESE FINE 88 8.6 60 35.2
## B089 B MOD MAL ELDERLY OVER ACUTE 89 8.3 82 26.3
## B090 B MOD MAL MATURE OVER FINE 90 8.7 60 30.0
## B091 B MOD MAL ELDERLY NORM ACUTE 91 8.2 81 21.1
## B092 B LOW FEM ELDERLY NORM ACUTE 92 7.9 83 22.9
## B093 B MOD MAL ELDERLY NORM ACUTE 93 8.5 80 18.4
## B094 B MOD MAL MATURE NORM FINE 94 8.5 63 23.6
## B095 B HIGH MAL ADULT OBESE ACUTE 95 9.2 26 39.4
## B096 B LOW FEM ADULT OVER FINE 96 7.8 44 25.1
## B097 B HIGH MAL ADULT NORM FINE 97 9.3 51 24.4
## B098 B HIGH MAL ADULT NORM FINE 98 9.2 51 21.7
## B099 B MOD MAL ADULT OVER FINE 99 8.5 56 27.5
## B100 B LOW MAL ADULT OBESE ACUTE 100 7.5 54 38.2
3.2 B. Creating an ExpressionSet Object
An ExpressionSet Class (Falcon, Morgan and Gentleman, 2006) is a Biobase data structure that is used to conveniently store experimental information and associated meta data, all in one place. This command creates an object of the the ExpressionSet Class, which stores the gene expression values and the covariate data.The function GEOquery (Davis, 2016) can be used to download ExpressionSet objects from directly from GEO (https://www.ncbi.nlm.nih.gov/gds)
inpData <- expSetobj(exprs, covrts)
3.3 C. Principal Variance Component Analysis of the raw data
PVCA (Bushel, 2013) estimates the proportion of the variance of the principal components of the gene expression data that can be attributed to each of the given covariates.The remaining fraction is assigned as “residual” variance.It efficiently combines principal component analysis (PCA) to reduce the feature space and variance component analysis (VCA) which fits a mixed linear model using factors of interest as random effects to estimate and partition the total variability. The variance explained is computed as a weighted average of the contributions of each factor to each PC, and you have the option of specifying how many principal components to include, by setting a threshold amount of the variance that needs to be explained by the identified PCs. Here PVCA is used to estimate the proportion of variance due to CAD as well as due to the covariates like Gender, BMI, Rin, and Study.Since BMI and Rin are continuous variables, we use a categorization of each into Obese, Overweight, Normal weight and High, Moderate, and Low quality RNA respectively, through the BMI3 and Rin3 columns.
## Set the covariates whose effect size on the data needs to be calculated
cvrts_eff_var <- c("CAD", "BMI3", "Rin3", "Gender", "Study")
## Set a PVCA Threshold Value between 0 & 1
## PVCA Threshold Value is the percentile value of the minimum amount of the variabilities that the selected principal components need to explain, here requiring 75% of the expression variance to be captured by the PCs
pct_thrsh <- 0.75
## Perform the PVCA analysis
pvcAnaly(inpData, pct_thrsh, cvrts_eff_var)
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.02673772 0.03953953 0.2686028 0.02059324 0.0367772 0.6077496
##
## $label
## [1] "Rin3" "BMI3" "Study" "Gender" "CAD" "resid"
The analysis (also generated as a histogram) shows that the Study batch effect is a major technical artifact that explains ~27% of the expression variance.In comparison, CAD and BMI each only explain around 4% of the variance, RNA quality less than 3%, and Gender is a minor component.60% of the variance is residual.
3.4 D. Surrogate Variable Analysis
Surrogate variables (Leek and Storey, 2007) are covariates constructed directly from high-dimensional data (gene expression or RNA-Seq data) that can be used in subsequent analyses to adjust for unknown/unmodeled covariates or latent sources of noise. The user provides one or more biological variable(s) that they are interested in estimating, and then estimates the surrogate variables (sv) given the presence of the biological signal. These are then appended as new covariates to the existing list of covariates, and we also add a step to convert them to categorical variables so as to estimate their contributions to the total variance.For a thorough introduction to SVA, refer to the paper by Leek and Storey (2007) .SVA is available from Bioconductor at https://bioconductor.org/packages/release/bioc/html/sva.html, and a version specifically for RNASeq data (svaseq) is also available from Github at https://github.com/jtleek/svaseq.
## Choose a biological variable that is to be used to calculate the surrogate variables
biol_var_sva <- "CAD"
## Perform SVA
sur_var_obj <- surVarAnaly(inpData, biol_var_sva)
## Number of significant surrogate variables is: 4
## Iteration (out of 5 ):1 2 3 4 5
## The newly generated surrogate variables sv1 through sv4 are appended to the ExpressionSet object
inpData_sv <- sur_var_obj$expSetobject
3.5 E. Computing the correlation between the surrogate variables and the covariates
This step helps the researcher explore whether each newly identified surrogate variable is entirely independent of all the existing covariates or if they are correlated with either a technical factor or one of the biological covariates. If they are associated with a biological factor (especially the focus of the analysis, in this case CAD), then it would not be advisable to remove them during the normalization step, though they might be adjusted for.Otherwise, it will usually be advisable to remove them, or to remove the technical factor they capture.To see the correlations, we run a series of generalized linear models where the identified surrogate variables are modeled as a function of the existing covariates (Study, CAD, Rin, BMI; you could add Gender and Age if you wish – but they are not significant).
## Fit a generalized linear model for sv1
glm.sv1 <- glm(pData(inpData_sv)[,"sv1"]~pData(inpData_sv)[,"BMI"]+pData(inpData_sv)[,"Rin"]+pData(inpData_sv)[,"CAD"]
+pData(inpData_sv)[,"Study"])
summary(glm.sv1)
##
## Call:
## glm(formula = pData(inpData_sv)[, "sv1"] ~ pData(inpData_sv)[,
## "BMI"] + pData(inpData_sv)[, "Rin"] + pData(inpData_sv)[,
## "CAD"] + pData(inpData_sv)[, "Study"])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.127062 -0.040490 -0.009979 0.030200 0.139604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1832540 0.0711053 2.577 0.0115 *
## pData(inpData_sv)[, "BMI"] 0.0009422 0.0009208 1.023 0.3088
## pData(inpData_sv)[, "Rin"] -0.0159186 0.0074152 -2.147 0.0344 *
## pData(inpData_sv)[, "CAD"]FINE 0.0095449 0.0119928 0.796 0.4281
## pData(inpData_sv)[, "Study"]B -0.1699332 0.0129778 -13.094 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.00357263)
##
## Null deviance: 1.0000 on 99 degrees of freedom
## Residual deviance: 0.3394 on 95 degrees of freedom
## AIC: -272.79
##
## Number of Fisher Scoring iterations: 2
The analysis shows that the Study batch effect is very strongly associated with sv1.You can also see this by plotting the two measures from the pData(inpData_sv) file. Subsequent analyses of sv2 through sv4 with similar code shows that BMI is weakly associated with sv2, but strongly with sv3 (as is Study), while Rin is strongly associate with sv4.CAD is not associated with any of the surrogate variables, but Age contributes slightly to sv2 and sv3.
## Fit a generalized linear model for sv2
glm.sv2 <- glm(pData(inpData_sv)[,"sv2"]~pData(inpData_sv)[,"BMI"]+pData(inpData_sv)[,"Rin"]+pData(inpData_sv)[,"CAD"]
+pData(inpData_sv)[,"Study"])
summary(glm.sv2)
##
## Call:
## glm(formula = pData(inpData_sv)[, "sv2"] ~ pData(inpData_sv)[,
## "BMI"] + pData(inpData_sv)[, "Rin"] + pData(inpData_sv)[,
## "CAD"] + pData(inpData_sv)[, "Study"])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.21940 -0.04787 -0.00900 0.04228 0.33138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.261993 0.115073 -2.277 0.0250 *
## pData(inpData_sv)[, "BMI"] 0.003209 0.001490 2.154 0.0338 *
## pData(inpData_sv)[, "Rin"] 0.019394 0.012000 1.616 0.1094
## pData(inpData_sv)[, "CAD"]FINE 0.030581 0.019409 1.576 0.1184
## pData(inpData_sv)[, "Study"]B -0.016918 0.021003 -0.806 0.4225
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.009356872)
##
## Null deviance: 1.0000 on 99 degrees of freedom
## Residual deviance: 0.8889 on 95 degrees of freedom
## AIC: -176.51
##
## Number of Fisher Scoring iterations: 2
## Output should be similar to that shown above for sv1
3.6 F. Principal Variance Component Analysis of the raw data with the surrogate variables included as covariates
The following PVCA step is performed to estimate the contributions of the newly identified surrogate variables to the overall expression variance, with or without the Study covariate in the model.
## First discretize the continuous surrogate variables
var_names <- c("sv1", "sv2", "sv3", "sv4")
pData(inpData_sv)<-conTocat(pData(inpData_sv), var_names)
## View them appended to the covariate matrix as additional covariate columns
# View(pData(inpData_sv))
pData(inpData_sv)
## Study Rin3 Gender Age3 BMI3 CAD Array Rin Age BMI
## A001 A HIGH FEM ADULT OVER ACUTE 1 9.2 56 31.0
## A002 A HIGH FEM ADULT OVER ACUTE 2 9.1 54 26.9
## A003 A HIGH MAL ADULT NORM ACUTE 3 9.1 51 20.7
## A004 A HIGH MAL ADULT NORM ACUTE 4 9.3 52 24.4
## A005 A HIGH MAL MATURE OVER ACUTE 5 9.3 62 26.1
## A006 A HIGH MAL MATURE OVER ACUTE 6 9.4 65 29.9
## A007 A HIGH MAL MATURE OVER ACUTE 7 8.8 65 28.4
## A008 A HIGH MAL MATURE OVER ACUTE 8 8.8 64 33.1
## A009 A HIGH FEM MATURE OVER FINE 9 9.0 69 29.8
## A010 A MOD MAL ADULT OVER ACUTE 10 8.7 50 33.4
## A011 A HIGH MAL ADULT OBESE ACUTE 11 9.3 56 48.4
## A012 A HIGH FEM ELDERLY OVER FINE 12 9.1 77 26.7
## A013 A HIGH MAL ADULT OBESE ACUTE 13 8.9 46 36.0
## A014 A HIGH FEM ADULT OVER ACUTE 14 8.8 58 27.5
## A015 A MOD FEM MATURE OVER FINE 15 8.6 71 26.8
## A016 A HIGH MAL MATURE OVER FINE 16 8.8 65 33.0
## A017 A MOD FEM MATURE OVER FINE 17 8.7 68 25.0
## A018 A MOD FEM ADULT OVER ACUTE 18 8.2 57 34.0
## A019 A MOD MAL ADULT OBESE ACUTE 19 8.6 56 35.0
## A020 A HIGH FEM MATURE OBESE FINE 20 9.2 68 44.5
## A021 A MOD MAL ELDERLY NORM FINE 21 8.5 77 23.8
## A022 A MOD FEM MATURE NORM FINE 22 8.6 69 21.3
## A023 A MOD MAL MATURE NORM FINE 23 8.4 70 21.8
## A024 A MOD FEM ELDERLY OVER FINE 24 8.7 80 26.5
## A025 A MOD MAL ELDERLY NORM FINE 25 8.7 83 23.9
## A026 A MOD FEM ELDERLY OVER FINE 26 8.3 81 29.7
## A027 A MOD MAL MATURE NORM FINE 27 8.5 74 22.1
## A028 A LOW MAL MATURE OVER ACUTE 28 7.3 64 33.9
## A029 A MOD MAL MATURE OVER FINE 29 8.6 68 31.8
## A030 A MOD FEM MATURE NORM FINE 30 8.7 74 17.6
## A031 A LOW MAL MATURE OVER FINE 31 5.0 74 33.0
## A032 A LOW FEM MATURE OVER FINE 32 6.8 71 33.4
## A033 A HIGH FEM MATURE OVER FINE 33 8.8 73 32.3
## A034 A LOW MAL ADULT OVER ACUTE 34 6.2 57 30.2
## A035 A MOD MAL ELDERLY OVER FINE 35 8.7 76 28.8
## A036 A HIGH MAL ADULT OVER FINE 36 8.8 42 31.8
## A037 A HIGH MAL MATURE OBESE ACUTE 37 9.0 65 36.3
## A038 A LOW MAL ELDERLY NORM ACUTE 38 7.8 75 22.8
## A039 A MOD MAL ADULT OVER ACUTE 39 8.7 56 34.6
## A040 A HIGH MAL ADULT OBESE ACUTE 40 9.4 54 38.2
## A041 A HIGH MAL ELDERLY OVER ACUTE 41 9.0 83 28.0
## A042 A MOD MAL ADULT OVER ACUTE 42 8.6 49 27.9
## A043 A HIGH FEM MATURE OBESE ACUTE 43 9.5 67 41.9
## A044 A HIGH MAL MATURE OBESE ACUTE 44 9.6 65 39.8
## A045 A MOD FEM ELDERLY NORM FINE 45 8.7 76 24.9
## A046 A HIGH MAL MATURE OVER FINE 46 9.0 66 25.1
## A047 A HIGH MAL MATURE OVER ACUTE 47 9.1 66 25.7
## A048 A MOD MAL MATURE OVER ACUTE 48 8.7 61 28.9
## A049 A HIGH FEM MATURE NORM FINE 49 9.0 68 21.6
## A050 A MOD FEM ELDERLY OVER FINE 50 8.4 75 32.0
## A051 A MOD MAL MATURE NORM FINE 51 8.6 69 20.1
## B052 B MOD FEM MATURE OBESE FINE 52 8.5 60 36.2
## B053 B HIGH MAL MATURE OVER FINE 53 9.7 63 27.0
## B054 B MOD FEM ELDERLY NORM ACUTE 54 8.2 75 24.7
## B055 B LOW FEM ELDERLY OVER ACUTE 55 7.5 75 26.3
## B056 B LOW MAL MATURE OVER ACUTE 56 6.1 67 26.0
## B057 B LOW MAL ELDERLY NORM ACUTE 57 7.9 79 22.4
## B058 B LOW FEM ELDERLY OVER ACUTE 58 7.8 75 25.2
## B059 B LOW MAL MATURE OVER ACUTE 59 7.8 72 32.1
## B060 B MOD MAL MATURE NORM ACUTE 60 8.0 66 21.2
## B061 B LOW MAL MATURE OVER ACUTE 61 7.4 68 32.6
## B062 B MOD MAL MATURE OVER FINE 62 8.0 60 31.5
## B063 B LOW MAL ADULT NORM FINE 63 7.9 50 23.8
## B064 B MOD MAL ADULT OVER FINE 64 8.2 53 25.9
## B065 B LOW MAL ADULT OBESE FINE 65 7.5 58 41.6
## B066 B LOW MAL MATURE NORM ACUTE 66 7.4 69 21.1
## B067 B LOW FEM ELDERLY OVER ACUTE 67 7.1 75 28.3
## B068 B LOW MAL MATURE OBESE ACUTE 68 6.7 70 38.6
## B069 B MOD MAL ELDERLY OVER ACUTE 69 8.0 76 32.1
## B070 B MOD FEM MATURE NORM ACUTE 70 8.3 74 15.1
## B071 B LOW MAL MATURE OBESE ACUTE 71 6.8 60 42.6
## B072 B LOW MAL ADULT OVER ACUTE 72 7.6 56 34.9
## B073 B LOW MAL MATURE OVER FINE 73 7.5 61 30.3
## B074 B LOW MAL ADULT OVER FINE 74 6.2 59 33.5
## B075 B LOW MAL ADULT OVER FINE 75 6.2 36 26.7
## B076 B MOD MAL ADULT OVER FINE 76 8.0 57 31.5
## B077 B MOD FEM ADULT OVER FINE 77 8.6 47 29.8
## B078 B HIGH MAL ADULT OBESE FINE 78 8.8 55 36.4
## B079 B HIGH FEM ADULT OVER FINE 79 9.5 59 33.0
## B080 B LOW FEM ADULT OBESE FINE 80 7.2 43 43.6
## B081 B HIGH FEM MATURE OBESE FINE 81 8.8 60 38.8
## B082 B LOW FEM ADULT NORM FINE 82 7.3 50 21.5
## B083 B MOD MAL ADULT OVER FINE 83 8.6 55 26.5
## B084 B MOD MAL ELDERLY NORM ACUTE 84 8.3 82 20.7
## B085 B LOW MAL ELDERLY NORM ACUTE 85 7.8 78 20.9
## B086 B LOW FEM ADULT NORM FINE 86 6.7 43 24.0
## B087 B MOD MAL ELDERLY OVER ACUTE 87 8.4 77 32.3
## B088 B MOD FEM MATURE OBESE FINE 88 8.6 60 35.2
## B089 B MOD MAL ELDERLY OVER ACUTE 89 8.3 82 26.3
## B090 B MOD MAL MATURE OVER FINE 90 8.7 60 30.0
## B091 B MOD MAL ELDERLY NORM ACUTE 91 8.2 81 21.1
## B092 B LOW FEM ELDERLY NORM ACUTE 92 7.9 83 22.9
## B093 B MOD MAL ELDERLY NORM ACUTE 93 8.5 80 18.4
## B094 B MOD MAL MATURE NORM FINE 94 8.5 63 23.6
## B095 B HIGH MAL ADULT OBESE ACUTE 95 9.2 26 39.4
## B096 B LOW FEM ADULT OVER FINE 96 7.8 44 25.1
## B097 B HIGH MAL ADULT NORM FINE 97 9.3 51 24.4
## B098 B HIGH MAL ADULT NORM FINE 98 9.2 51 21.7
## B099 B MOD MAL ADULT OVER FINE 99 8.5 56 27.5
## B100 B LOW MAL ADULT OBESE ACUTE 100 7.5 54 38.2
## sv1 sv2 sv3 sv4 sv1_cat
## A001 0.1685317044 0.0686634899 -0.1189287326 0.010810595 4
## A002 0.1107831911 0.1709872222 -0.0635056937 0.006366820 4
## A003 0.1822475094 0.0134671702 -0.0392570943 0.093194229 4
## A004 0.1517820659 0.1037053672 -0.0468203932 0.079492456 4
## A005 0.0022265756 -0.0491325284 -0.0972098591 -0.003685162 2
## A006 0.1694603306 0.0634709174 0.0564065326 0.086614242 4
## A007 -0.0090895000 -0.0071531244 -0.2390643566 -0.012642614 2
## A008 0.0596687401 -0.0894681741 -0.0226028525 0.082379813 3
## A009 0.0206895984 0.0975934586 -0.0924006972 -0.030491226 3
## A010 -0.0110139832 0.2629930013 0.0126922039 0.073836566 2
## A011 0.0079509365 0.0405198731 0.0026173986 0.003233636 2
## A012 0.0482693800 0.1176427902 -0.1329328074 -0.018593003 3
## A013 0.0525372352 -0.0110413171 -0.0230227269 -0.091383809 3
## A014 0.0427465491 0.1531396102 -0.0307274189 -0.012860098 3
## A015 0.0313393255 0.0328061322 -0.1060425222 -0.050008495 3
## A016 0.0052473559 0.0304748836 -0.0379584708 -0.016615664 2
## A017 0.0747990563 -0.0099397958 0.0148835632 0.001389135 4
## A018 0.0406614760 -0.0394981364 -0.1234647243 -0.055816688 3
## A019 0.0665466380 0.0266362476 0.0981372431 0.016162431 4
## A020 0.1102433297 -0.0293609059 0.1115707538 0.057239356 4
## A021 0.0593975754 -0.0019047815 -0.1465477569 -0.096046741 3
## A022 0.0730581697 -0.0325688704 0.0150148722 0.006801722 4
## A023 0.0257882107 0.0072734290 -0.0873242746 -0.106874550 3
## A024 0.0654390485 0.0095865515 -0.0380078406 -0.047140115 3
## A025 0.0388981956 -0.0371743567 -0.0786803385 0.033890923 3
## A026 0.0466914873 0.0696108870 -0.0445531399 -0.065690451 3
## A027 0.0150110201 -0.0889003915 -0.1396463319 -0.109840788 3
## A028 0.1631416137 -0.1173662985 0.1373478533 0.028461331 4
## A029 0.0390913625 0.0871780662 0.0330640056 0.034121765 3
## A030 0.0313286087 0.0192949645 -0.0600881833 -0.081180830 3
## A031 0.1428162974 -0.0711895752 0.0647840762 -0.142019686 4
## A032 0.0631617944 0.0109841422 0.0249006718 -0.174360971 3
## A033 0.0890704395 0.0105975723 -0.0426143496 -0.081738124 4
## A034 0.1086018601 0.0554191522 -0.0378009006 0.052828010 4
## A035 0.2210484496 0.0679176507 -0.0130733284 0.067116309 4
## A036 0.1124573916 -0.1366450309 -0.0069266799 0.130730253 4
## A037 0.0333628244 0.0471648803 -0.0088357999 -0.020142461 3
## A038 0.1364906331 -0.0713962174 -0.0680235586 -0.084161702 4
## A039 0.0729770948 0.0483550308 -0.2736299662 -0.055777637 4
## A040 0.0927461159 -0.0973518441 0.0485973161 0.062769134 4
## A041 0.1371060962 -0.0422871258 -0.0229608926 0.084317598 4
## A042 0.0391820895 -0.0077956496 -0.0144825536 -0.021051569 3
## A043 0.0761437042 -0.0268612518 -0.0311480280 -0.057557378 4
## A044 0.1633550829 0.0953900166 0.0993704712 0.062232243 4
## A045 0.1303685921 -0.0089756037 -0.0488626488 0.026554812 4
## A046 0.0528462627 0.0126712760 -0.0323382228 0.070333325 3
## A047 0.0274655450 -0.1155785326 -0.0262496221 0.085245232 3
## A048 0.0456336116 0.0005924566 -0.0095429633 0.048586605 3
## A049 0.0922956334 -0.0343000221 -0.0290082366 0.082434660 4
## A050 0.1742131031 0.0773396955 -0.0472897491 -0.050038978 4
## A051 0.0991182949 0.1196758116 -0.0404123041 0.034831733 4
## B052 -0.1066635236 0.0617389342 -0.0273116159 0.227814806 1
## B053 -0.1988062805 -0.0387309142 -0.0511715607 0.039807357 1
## B054 -0.1144537606 -0.0843365977 0.0190027344 0.069284456 1
## B055 -0.1201225888 0.0498483411 0.0287360144 0.058717927 1
## B056 -0.1615159903 0.0900106482 -0.0646252824 -0.396244704 1
## B057 -0.0597477184 0.0164659556 0.1084460865 0.108227698 2
## B058 -0.1033003815 -0.2661575643 -0.1365534100 0.114345752 1
## B059 -0.0738705332 -0.0875677713 0.0015632995 -0.023155240 2
## B060 -0.1458664148 0.0897671086 -0.0068950235 0.003523696 1
## B061 -0.0833729960 -0.0705836028 -0.0261580765 0.012959187 1
## B062 -0.0850810478 -0.0736933563 -0.0673569453 0.100055841 1
## B063 -0.0930081424 0.0804705228 0.0963438459 -0.030759499 1
## B064 -0.0564290240 -0.0385791888 0.1256857947 0.061753553 2
## B065 -0.0362422187 0.0611651701 0.1670571211 0.061317260 2
## B066 -0.1108573825 -0.1492519505 -0.1039667443 0.057211212 1
## B067 -0.1153042690 -0.0503911145 -0.1040310608 -0.071633745 1
## B068 -0.0413689365 -0.0297600683 0.1979381819 -0.178083871 2
## B069 -0.0566064393 -0.0927816967 0.2161590555 0.146165557 2
## B070 -0.1675465382 -0.0197732985 -0.1090963257 0.091832466 1
## B071 -0.0387305497 0.0316875039 0.2314181186 -0.057822399 2
## B072 -0.1151529942 -0.0886522559 0.0566396245 -0.075428732 1
## B073 -0.1950350333 0.2746486624 -0.0097003777 0.124548958 1
## B074 -0.0326925681 -0.0363703073 0.0803876998 -0.181363186 2
## B075 0.0126673560 -0.0815040861 0.1499787251 -0.324263295 2
## B076 -0.0652368914 -0.0117165855 -0.0324549584 -0.222864906 2
## B077 -0.0337246565 0.0071538119 0.1976553527 -0.060523836 2
## B078 0.0177883223 -0.1198815969 0.2102781199 -0.148044690 3
## B079 -0.1930702634 0.3100407626 -0.0231350762 0.106466779 1
## B080 -0.0685252505 0.0912994513 0.1442977556 -0.186604548 2
## B081 0.0416262405 0.0156670947 0.1666826367 0.025488234 3
## B082 -0.0031273397 -0.0110225039 0.0004160514 -0.107522903 2
## B083 -0.1004976754 0.0349267578 -0.0481767681 0.046520311 1
## B084 -0.1665990646 -0.1496266843 -0.0513124334 -0.155949874 1
## B085 0.0190485434 -0.2039825069 0.0672775138 0.101526980 3
## B086 0.0583052313 -0.0877680351 0.2287190853 0.102257374 3
## B087 -0.1622664220 -0.1112573533 0.0405520157 0.011213886 1
## B088 -0.0039320167 -0.0414974813 0.1685804721 0.011951606 2
## B089 -0.0763866896 -0.1906826854 -0.0625712486 0.130579535 1
## B090 -0.0970365459 0.0335226152 0.0129816913 0.050407851 1
## B091 -0.0358030445 0.0661873547 0.0156839694 0.027802622 2
## B092 -0.0003346916 -0.0914858683 -0.0323520492 0.061968627 2
## B093 -0.1237422390 -0.1503499446 -0.1857040694 0.127684330 1
## B094 -0.1810888341 0.0066802193 -0.0873042289 0.003408599 1
## B095 -0.1276455798 0.3573474604 0.1367982982 0.058287157 1
## B096 -0.0461850758 -0.0017539088 0.1494867614 0.104419932 2
## B097 -0.1775641983 0.0396395414 -0.0555220902 0.065063361 1
## B098 -0.0288646239 -0.1214670988 0.1226672616 0.112419614 2
## B099 -0.0673265890 0.0118979172 -0.0056179901 -0.039489061 2
## B100 -0.0726363934 -0.0248020197 -0.0478168935 -0.101534229 2
## sv2_cat sv3_cat sv4_cat
## A001 4 1 2
## A002 4 1 2
## A003 3 2 4
## A004 4 2 4
## A005 2 1 2
## A006 4 3 4
## A007 2 1 2
## A008 1 3 4
## A009 4 1 2
## A010 4 3 4
## A011 3 3 2
## A012 4 1 2
## A013 2 2 1
## A014 4 2 2
## A015 3 1 2
## A016 3 2 2
## A017 2 3 2
## A018 2 1 2
## A019 3 4 3
## A020 2 4 3
## A021 2 1 1
## A022 2 3 2
## A023 3 1 1
## A024 3 2 2
## A025 2 1 3
## A026 4 2 1
## A027 1 1 1
## A028 1 4 3
## A029 4 3 3
## A030 3 1 1
## A031 1 4 1
## A032 3 3 1
## A033 3 2 1
## A034 4 2 3
## A035 4 3 3
## A036 1 3 4
## A037 3 3 2
## A038 1 1 1
## A039 3 1 2
## A040 1 3 3
## A041 2 3 4
## A042 2 3 2
## A043 2 2 1
## A044 4 4 3
## A045 2 2 3
## A046 3 2 4
## A047 1 2 4
## A048 3 3 3
## A049 2 2 4
## A050 4 2 2
## A051 4 2 3
## B052 4 2 4
## B053 2 2 3
## B054 1 3 4
## B055 3 3 3
## B056 4 1 1
## B057 3 4 4
## B058 1 1 4
## B059 1 3 2
## B060 4 3 2
## B061 2 2 3
## B062 1 1 4
## B063 4 4 2
## B064 2 4 3
## B065 4 4 3
## B066 1 1 3
## B067 2 1 1
## B068 2 4 1
## B069 1 4 4
## B070 2 1 4
## B071 3 4 1
## B072 1 4 1
## B073 4 3 4
## B074 2 4 1
## B075 1 4 1
## B076 2 2 1
## B077 3 4 1
## B078 1 4 1
## B079 4 2 4
## B080 4 4 1
## B081 3 4 3
## B082 2 3 1
## B083 3 2 3
## B084 1 2 1
## B085 1 4 4
## B086 1 4 4
## B087 1 3 2
## B088 2 4 2
## B089 1 1 4
## B090 3 3 3
## B091 4 3 3
## B092 1 2 3
## B093 1 1 4
## B094 3 1 2
## B095 4 4 3
## B096 3 4 4
## B097 3 1 3
## B098 1 4 4
## B099 3 3 2
## B100 2 2 1
## Include the surrogate variables as covariates in addition to BMI3, Rin3, CAD and Study (be sure to use categorical measures of BMI and Rin rather than the continuous measure)
cvrts_eff_var <- c("BMI3", "Rin3", "CAD", "Study", "sv1_cat", "sv2_cat", "sv3_cat", "sv4_cat")
## Again set the PVCA Threshold to explain 75% of the expression variation
pct_thrsh <- 0.75
## Perform PVCA
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var)
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.08130148 0.05767765 0.1441592 0.1313694 0.008584126 0.006405938
## [,7] [,8] [,9]
## [1,] 0.195948 0.01317206 0.3613822
##
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3" "BMI3" "Study"
## [8] "CAD" "resid"
The analysis shows that each of the surrogate variables capture a large amount of the variance, while the Study batch effect remains important.The residual variance has reduced by almost half to 36%.For comparison, remove Study to see how much of it the surrogate variables capture
cvrts_eff_var <- c("BMI3", "Rin3", "CAD","sv1_cat", "sv2_cat", "sv3_cat", "sv4_cat")
## Again set the PVCA Threshold to explain 75% of the expression variation
pct_thrsh <- 0.75
## Perform PVCA
pvcAnaly(inpData_sv, pct_thrsh, cvrts_eff_var)
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.08065094 0.07549449 0.1537173 0.2760134 0.007528328 0.007064944
## [,7] [,8]
## [1,] 0.01257326 0.3869573
##
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3" "BMI3" "CAD"
## [8] "resid"
This implies that together the 4 SVs capture as much variance as Study alone.Both analyses have slightly reduced the CAD contribution relative to the analysis without surrogate variables.
3.7 G. Supervised normalization of Microarrays
SNM (Mecham, Nelson and Storey, 2010) is a study specific, customizable normalization approach that adjusts for specified biological and technical variables as far as possible without biasing the estimate of the biological source of interest. Much like SVA, the user chooses the biological variable(s), and then fits desired covariates, while also allowing for “intensity dependent” effects (here, just array/sample differences) to be adjusted for. The SNM software at http://www.bioconductor.org/packages/release/bioc/html/snm.html actually allows you to either remove the adjustment variables using the rm=TRUE option, or simply to fit them.You can also run both sequentially with different covariates (for example, removing a batch effect but adjusting for gender). SNM fits both categorical and continuous variables.In the workflow below we remove the Study or the surrogate variable effects, running 5 iterations.
## Choose the biological variableof interest
bv <- c("CAD")
## Choose your adjustment variable of interest, starting with just 'Study'
av <- c("Study")
## The intensity-dependent adjustment variables adjust for array effects
iv <- c("Array")
## Run SNM
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)
##
Iteration: 1
##
Iteration: 2
##
Iteration: 3
##
Iteration: 4
##
Iteration: 5
After 5 iterations, the π0 estimate suggests that almost 63% of the probes are true negatives for the CAD effect: in other words, that there is evidence that just over one third of the probes are differentially expressed with acute MI. This is seen as an enrichment of transcripts with small p-values. At each iteration, the π0 estimates should have been 0.92, 0.62, 0.63, 0.63 , 0.63 implying convergence. The output .csv file should have been saved to your working directory as “snm_normalized_data.csv”.
## Now choose the adjustment variables to be all four SVs plus Rin
av <- c("Rin", "sv1", "sv2", "sv3", "sv4")
## Run SNM
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)
This time the π0 estimates are 0.76, 0.75, 0.76, 0.76 and 0.75.However, the intensity-dependent effects are reduced considerably.A third possibility is to remove Study, Rin and sv2 (which is independent).This time the π0 estimates are 0.59, 0.54, 0.60, 0.63 and 0.64, and the intensity-dependent effects are more similar to the first model.The SNM ietrations are shown below-
av <- c("Rin", "sv2", "Study")
sv_snmObj <- snmAnaly(exprs, pData(inpData_sv), bv, av, iv)
##
Iteration: 1
##
Iteration: 2
##
Iteration: 3
##
Iteration: 4
##
Iteration: 5
## Create an expressionSet object of the normalized dataset(s)
sv_snmNorm_data <- sv_snmObj$norm.dat
colnames(sv_snmNorm_data) <- colnames(exprs)
sv_snm_data <- expSetobj(sv_snmNorm_data, pData(inpData_sv))
## Write this dateset to a table with rows as genes and columns as samples (with names the same as that from the input file)
write.table(sv_snmNorm_data, file="CAD_SNM.csv", sep=",")
3.8 H. Principal Variance Component Analysis on the normalized data
This post SNM PVCA asks how the process of SNM has influenced the estimated contributions of the various covariates. SNM brings down their effect to a minimal possible value while trying to preserve the variation due to the biological signals.
## Specify the covariates whose effect size is to be calculated
cvrts_eff_var <- c("BMI3", "Rin3", "CAD", "sv1_cat", "sv2_cat","sv3_cat", "sv4_cat")
## If needed, keep the same PC Threshold Value
pct_thrsh <- 0.75
## Perform PVCA
pvcAnaly(sv_snm_data, pct_thrsh, cvrts_eff_var)
## $dat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.07969802 0.1087383 0.0114825 0.0666872 0.009445852 0.01162648
## [,7] [,8]
## [1,] 0.02186982 0.6904519
##
## $label
## [1] "sv4_cat" "sv3_cat" "sv2_cat" "sv1_cat" "Rin3" "BMI3" "CAD"
## [8] "resid"
Fitting Study, sv2, and RIN has not completely removed those effects, but they are much reduced, and the residual unexplained variance is back up to 69%.Most of it is probably among individual differences.The CAD effect remains low, at 2% of the variance, but is slightly increased on the raw data, whereas BMI has dropped to 1%.The sv3 and sv4 effects have not been removed.
Further exploration of different normalization approaches should help you decide whether the data is now ap for downstream analysis.
4 Discussion
The workflow presents a general strategy for exploring how technical and biological factors influence the inference of gene expression differences between categories of interest.The PVCA routine identifies how much of the variance is explained by given covariates; the SVA routine finds latent variables not influenced by the biological factor; and the SNM normalizes the data set by removing known or unknown sources of variance – in this case including a large batch effect. Readers are encouraged to compare the distributions of overall transcript abundance before and after normalization to gain a sense of the impact of the procedures on the data set. Although the impact on detection of genes influences by acute MI was minimal in the sub-set of the Kim et al. (2014) data, application to the full data set has a larger influence.Similar strategies can be applied to RNASeq data, with the complication that low expression values have higher variance and are generally dealt with differently.
References
Bushel, P.R. (2013) Principal Variance Component Analysis (Pvca). Bioconductor.
Davis, S. (2016) GEOquery. Bioconductor.
Falcon, S., Morgan, M. and Gentleman, R. (2006) An Introduction to Bioconductor’s Expressionset Class. Bioconductor.
Kim, J., Ghasemzadeh, N., Eapen, D.J., Chung, N.C., Storey, J.D., Quyyumi, A.A., et al. (2014) Gene expression profiles associated with acute myocardial infarction and risk of cardiovascular death. Genome Medicine, 6.
Leek, J.T. and Storey, J.D. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLOS Genetics, 3.
Mecham, B.H., Nelson, P.S. and Storey, J.D. (2010) Supervised normalization of microarrays. Bioinformatics, 26, 1308–1315.