This vignette provides a walk through tutorial on how to use BSCET
to characterize cell-type-specific allelic expression imbalance (AEI), and the association of cell-type-specific AEI with clinical factors by integrative analysis of bulk and single cell RNA sequencing (RNA-seq) data. BSCET
involves two steps. First, we infer cell type proportions in the bulk RNA-seq samples by incorporating cell-type-specific gene expression information provided by a scRNA-seq reference. Second, for each heterozugous transcribed SNP, BSCET
aggregates allele-specific read counts from the bulk RNA-seq data across individuals to model cell-type-specific expression difference between two alleles by linear regression, and tests for the evidence of cell-type-specific AEI.
Below, using the human pancreas datasets, we demonstrate BSCET
step by step.
In this step, we aim to infer cell type proportions in the bulk RNA-seq samples by incorporating cell-type-specific gene expression information provided by a scRNA-seq reference. This can be achieved by cell type deconvolution algorithms such as MuSiC (Wang et al. 2019). Following MuSiC, we deconvolve bulk RNA-seq data of 89 subjects from Fadista et al. (2014) using single cell reference obtained from Segerstolpe et al. (2016). The detailed tutorial of MuSiC (Wang et al. 2019) can be found on this page.
MuSiC (Wang et al. 2019) takes a bulk RNA-seq data that contain a mixture expression of various cell types, and a scRNA-seq data that include the single cell expression from multiple samples as input. Both datasets should be in the form of ExpressionSet
, a convenient data structure to hold expression data along with sample/feature annotation. The details of constructing ExpressionSet
can be found on this page.
library(MuSiC)
## Bulk RNA-seq data
fadista.eset = readRDS("./GSE50244bulkeset.rds")
fadista.eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 32581 features, 89 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: Sub1 Sub2 ... Sub89 (89 total)
## varLabels: sampleID SubjectName ... tissue (7 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
## scRNA-seq data
seger.eset = readRDS("./EMTABesethealthy.rds")
seger.eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 25453 features, 1097 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: AZ_A10 AZ_A11 ... HP1509101_P9 (1097 total)
## varLabels: sampleID SubjectName cellTypeID cellType
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
We constrained our analysis on 6 well-studied cell types: acinar, alpha, beta, delta, ductal and gamma.
music_result = music_prop(bulk.eset = fadista.eset, sc.eset = seger.eset, clusters = 'cellType', samples = 'sampleID', select.ct = c('acinar','alpha', 'beta', 'delta','ductal','gamma'), verbose = F)
est.prop = music_result$Est.prop.weighted
Below shows the estimated cell type proportions for the 89 bulk samples obtained from MuSiC (Wang et al. 2019).
Although we illustrated Step 1
using estimated cell type proportions obtained from MuSiC (Wang et al. 2019), however, BSCET
is not limited to MuSiC and one can infer cell type compositions using other methods, such as CIBERSORT (Newman et al. 2019), in this step.
Given the estimated cell type proportions obtained in Step 1
, for each heterozygous transcribed SNP, BSCET
aggregates allele-specific read counts of the bulk RNA-seq data across heterozygous individuals and detects cell-type-specific AEI of the SNP.
For individual \(j\), let \(X_j^T\) and \(X_j^t\) be the read counts for the reference and alternative alleles of the transcribed SNP, respectively. The difference of two allele-specific reads can be expressed as:
\[X_j^T-X_j^t = \alpha+m_j\sum_{k=1}^Kp_j^kS_j^k\theta^k+\epsilon_j \;\;\;\;\;\;\;\;\;\;\;\;\;(1)\]
where the intercept \(\alpha\) captures the allelic expression difference not explained by the selected \(K\) cell types; \(m_j\) is the total number of cells in the bulk tissue for individual \(j\); \(S_j^k\) is the subject-specific average cell size of cell type \(k\) and can be approximated by the sample mean, \(S^k\); \(p_j^k\) is the individual-specific proportion of cells from cell type \(k\); \(θ^k\) represents allelic expression difference between two alleles of the transcribed SNP for cell type \(k\); \(\epsilon_j\) is random error and assumed to follow \(N(0,\sigma^2)\).
To detect cell-type-specific AEI in the population, for cell type \(k\), we test the following hypothesis:
\[H_0: \theta^k=0 \;\; vs \;\; H_1: \theta^k \neq 0\]
using a t statistic. Details of BSCET
can be found in the Method section of our manuscript.
BSCET
analyzes one transcribed SNP at a time. In this section, we will walk you through on how to generate input data for BSCET
.
First, we obtained the allele-specific read count of the SNP, i.e, \(X_j^T\) and \(X_j^t\). The input allelic bulk RNA-seq data are in form of text file and should contain columns:
SNP
: character, the name or chromosome location of each heterozygous transcriobed SNP;id
: character, individual identifier;ref
: numeric, the SNP-level allele-specific read counts of the reference allele;alt
: numeric, the SNP-level allele-specific read counts of the alternative allele;The tables below show an example allelic RNA-seq data of SNP chr1:22336305.
Second, we calculate the subject-specific total number of cells in the bulk tissue, i.e., \(m_j\). As \(m_j\) is not observed for the bulk data, we esimate it using the individual-specific library size factor, which can be obtained by the function estimateSizeFactor
of DESeq2 (Love, Huber, and Anders 2014). A detailed tutorial of DESeq2 (Love, Huber, and Anders 2014) can be found on this page.
library(DESeq2)
expmat <- exprs(fadista.eset)
coldata = matrix(colnames(expmat),ncol=1)
colnames(coldata) = 'id'
rownames(coldata) = colnames(expmat)
dds <- DESeqDataSetFromMatrix(countData = expmat,
colData = coldata,
design = ~ id)
dds <- estimateSizeFactors(dds)
mj = sizeFactors(dds)
head(mj)
## Sub1 Sub2 Sub3 Sub4 Sub5 Sub6
## 0.9595915 0.9580428 1.0016846 0.8483268 1.0137396 0.8437796
Third, we obtained the subject-specific cell type compositions of the bulk tissue, i.e., \(p_j^k\), which can be estimated using cell type deconvolution showed in Step 1
. Given both \(m_j\) and \(p_j^k\), for each individual \(j\), we can calculate their \(m_jp_j^k\), showed in the table below.
Finally, we combine \(m_jp_j^k\) with the allele-specific reads, \(X_j^T\) and \(X_j^t\), and obtain the input data of BSCET
as the following:
To detect cell-type-specific AEI, BSCET regresses the allelic count difference, i.e., \(X_j^T-X_j^t\), over \(m_jp_j^k\) through linear regression across samples, and the p-values of coefficients are used to indicate the significance of cell-type-specific effect. As we can see below, the SNP was detected as having cell-type-specific AEI for acinar cells.
mod=lm(ref-alt ~ `acinar`+`alpha`+`beta`+`delta`+`ductal`+`gamma`,data=dat)
p_value = summary(mod)$coefficient[,4][-1]
print(round(p_value,5))
## acinar alpha beta delta ductal gamma
## 0.00032 0.44113 0.09326 0.22204 0.93712 0.75730
We can readily extend the model in Step 2
to assess covariate effect on cell-type-specific AEI. Let \(V_j\) be the covariate of interest for individual \(j\). We can modify the model by adding an interaction term between the covariate and the estimated cell type proportions:
\[X_j^T-X_j^t=\alpha+m_j\sum_{k=1}^Kp_j^k(\theta^k +V_j \theta_{\Delta}^k)+ \epsilon_j \;\;\;\;\;\;\;\;\;\;\;\;\;(2)\]
where \(\theta_{\Delta}^k\) is the covariate effect on the cell-type-specific AEI. In practice we will likely test for the covariate effect on cell-type-specific AEI only if a cell-type-specific AEI has been detected based on model (1). Therefore, for model (2), we are interested in testing whether the cell-type-specific AEI changes with the covariate, i.e., for each cell type \(k\), we test the following hypothesis:
\[H_0: \theta_{\Delta}^k=0 \;\; vs \;\; H_1: \theta_{\Delta}^k\neq 0\]
As the bulk RNA-seq data come from pancreatic islets, here, we test for the effect of HbA1c, a well-known biomarker for T2D diagnosis, on cell-type-specific AEI. By including HbA1c level for each individual, we can have the data as the following:
To detect covariate effect on cell-type-specific AEI, BSCET
models the allelic count difference over the cell type proportions and interactions between proportion estimates and HbA1c level through linear regression, and p-values for interaction terms are used to indicate the significance of covariate effect on cell-type-specific AEI. As we can see below, cell-type-specific AEI was not associated with HbA1c level for all cell types.
mod=lm(ref-alt ~ `acinar`+`alpha`+`beta`+`delta`+`ductal`+`gamma`+
HbA1c:(`acinar`+`alpha`+`beta`+`delta`+`ductal`+`gamma`),data=dat)
p_value = summary(mod)$coefficient[,4][8:13]
print(round(p_value,5))
## acinar:HbA1c alpha:HbA1c beta:HbA1c delta:HbA1c ductal:HbA1c gamma:HbA1c
## 0.49077 0.05900 0.16795 0.05154 0.39757 0.13501
Developed by Jiaxin Fan (jiaxinf@pennmedicine.upenn.edu).