basics.Rmd
# Libraries
library(rvat)
library(rvatData)
library(SummarizedExperiment)
In these tutorials, we’ll use simulated example data that we included in the rvatData package.
Included in-memory datasets:
data(package = "rvatData")$results[,"Item"]
## [1] "GT" "GTsmall" "rvbresults"
Included on-disk datasets, note that the paths to the on-disk datasets can be retrieved as below:
## [1] "rvatData_varsetfile.txt.gz" "rvatData.gdb"
## [3] "rvatData.pheno" "rvatData.varinfo"
## [5] "rvatData.vcf.gz"
gdbpath <- rvat_example("rvatData.gdb")
Some output will be written during the tutorials, use a temporary directory for that:
outdir <- tempdir()
The first step consists of setting up a so-called gdb()
,
an SQLite-based data structure which forms the heart of all RVAT
operation. Gdb files allow for much more rapid and memory efficient
loading of sample genotype data into numeric matrices within R. They
also allow for the upload and integration of additional variant and
sample annotation data to allow for complex data
Here, we show how to create a gdb()
object from a
VCF-file. Note that RVAT will assign a unique VAR_id
to
each variant, which allow rapid querying of variants, these IDs have no
inherent meaning (i.e. they do not refer to chromosomal position or the
like). Note that providing a genome build is optional, this will be
stored in the gdb metadata, and will be used in downstream analyses to
assign ploidy on sex chromosomes (diploid, XnonPAR, YnonPAR).
vcfpath <- rvat_example("rvatData.vcf.gz")
buildGdb(vcf = vcfpath,
output = paste0(outdir,"/rvat_tutorials.gdb"),
overWrite = TRUE,
genomeBuild = "GRCh38")
Connect to the gdb you just created using the gdb()
function:
add variant info to the gdb using the uploadAnno()
method, variants are mapped to the gdb VAR_id
based on the
CHROM,POS,REF,ALT fields.
# example file including info such as chromosomal location, annotated gene and variant impact.
varinfopath <- rvat_example("rvatData.varinfo")
varinfo <- read.table(varinfopath, header = TRUE)
# the `name` parameter specifies the name of the table in the gdb,
# the variant info that should be imported is specified using the `value` parameter
uploadAnno(object = gdb, name = "varInfo",value = varinfo)
The listAnno()
method lists all variant annotation
tables in the gdb:
listAnno(gdb)
## name value date
## 1 varInfo interactive_session Wed Feb 12 12:38:49 2025
The getAnno()
method retrieves a given variant
annotation table from the gdb:
## VAR_id CHROM POS ID REF ALT QUAL FILTER AC AN
## 1 1 chr1 11013912 <NA> A G 722.13 NA 1 45484
## 2 2 chr1 11013928 rs755357622 C T 2598.23 NA 1 49194
## 3 3 chr1 11013936 <NA> A C 4135.36 NA 1 50000
## 4 4 chr1 11013936 <NA> A G 4135.36 NA 1 42636
## 5 5 chr1 11013952 <NA> C G 517.13 NA 1 49574
## 6 6 chr1 11016874 rs80356715 C T 64910.10 NA 32 49766
## AF gene_name HighImpact ModerateImpact Synonymous CADDphred
## 1 0.0000219858 TARDBP 0 1 0 24.8
## 2 0.0000203277 TARDBP 0 0 1 .
## 3 0.0000200000 TARDBP 0 1 0 24.1
## 4 0.0000234544 TARDBP 0 1 0 22.6
## 5 0.0000201719 TARDBP 0 0 1 .
## 6 0.0006430093 TARDBP 0 1 0 22.2
## PolyPhen SIFT
## 1 P D
## 2 . .
## 3 B T
## 4 B T
## 5 . .
## 6 B T
a specific region can be selected as follows:
varinfo <- getAnno(gdb, table = "varinfo", ranges = data.frame(CHROM="chr1", start = 11013847, end = 11016874))
head(varinfo)
## VAR_id CHROM POS ID REF ALT QUAL FILTER AC AN
## 1 1 chr1 11013912 <NA> A G 722.13 NA 1 45484
## 2 2 chr1 11013928 rs755357622 C T 2598.23 NA 1 49194
## 3 3 chr1 11013936 <NA> A C 4135.36 NA 1 50000
## 4 4 chr1 11013936 <NA> A G 4135.36 NA 1 42636
## 5 5 chr1 11013952 <NA> C G 517.13 NA 1 49574
## 6 6 chr1 11016874 rs80356715 C T 64910.10 NA 32 49766
## AF gene_name HighImpact ModerateImpact Synonymous CADDphred
## 1 0.0000219858 TARDBP 0 1 0 24.8
## 2 0.0000203277 TARDBP 0 0 1 .
## 3 0.0000200000 TARDBP 0 1 0 24.1
## 4 0.0000234544 TARDBP 0 1 0 22.6
## 5 0.0000201719 TARDBP 0 0 1 .
## 6 0.0006430093 TARDBP 0 1 0 22.2
## PolyPhen SIFT
## 1 P D
## 2 . .
## 3 B T
## 4 B T
## 5 . .
## 6 B T
add phenotype info to the gdb using the uploadCohort()
method:
# First we'll read in the phenotype data
phenopath <- rvat_example("rvatData.pheno")
pheno <- read.table( phenopath, header = TRUE)
# Import into the gdb (the `IID` column is used)
# the `name` parameter specifies the name of the table in the gdb,
# the sample info that should be important is specified using the `value` parameter
uploadCohort(object = gdb, name = "pheno", value = pheno)
The listCohort()
method lists all uploaded sample info
tables in the gdb:
listCohort(gdb)
## name value date
## 1 pheno interactive_session Wed Feb 12 12:38:50 2025
The getCohort()
method retrieves a given sample info
table from the gdb:
## IID sex pheno pop superPop PC1 PC2 PC3 PC4
## 1 ALS1 2 1 PJL SAS 22.58749 4.738025 38.9157564 -10.1667342
## 2 ALS2 2 1 BEB SAS 26.37370 -1.445903 31.9712989 -6.0222555
## 3 ALS3 1 1 TSI EUR 30.75496 48.866843 -5.4092404 14.9474404
## 4 ALS4 2 1 GWD AFR -87.48035 -6.193628 -0.9504131 -1.2029932
## 5 ALS5 1 1 MSL AFR -87.79310 -5.323925 -1.0353300 -0.7403234
## 6 ALS6 1 1 YRI AFR -86.64071 -3.418417 -1.3964977 1.0561607
## age
## 1 66.01743
## 2 66.84306
## 3 54.35749
## 4 65.09777
## 5 58.94664
## 6 63.13635
Central to RVAT
is the genoMatrix()
object.
The genoMatrix()
object is an extension of the BioConductor
summarizedExperiment
class. A genoMatrix()
contains a matrix of genotypes,
variant info and sample info queried from a gdb()
object,
various methods are implemented for genoMatrix objects, which we’ll
explore in this section.
First, we’ll build a genoMatrix object using the getGT()
method. Variants to include can be specified in two main ways: 1) by
directly passing the VAR_ids to the getGT()
method, 2) by
providing a set of genomic ranges to retrieve and 23 by supplying a
varSet
(discussed later).
Here we’ll start with methods 1-2, varSets are discussed in
vignette("association_testing")
tutorial.
# get varinfo
varinfo <- getAnno(gdb, table = "varinfo", where = "gene_name = 'SOD1' and ModerateImpact = 1")
GT <- getGT(gdb,
VAR_id = varinfo$VAR_id,
cohort = "pheno")
## Retrieved genotypes for 38 variants
GT
## rvat genoMatrix Object
## class: genoMatrix
## dim: 38 25000
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(25000): ALS1 ALS2 ... CTRL19999 CTRL20000
## colData names(10): IID sex ... PC4 age
Similarly we can retrieve variants within a given range, like below.
Annotations can be included by specifying the anno
parameter, and will be available in the rowData
of the
genoMatrix
:
GT_fromranges <- getGT(
gdb,
ranges = data.frame(CHROM = "chr21", start = 31659666, end = 31668931),
cohort = "pheno",
anno = "varInfo",
annoFields = c("VAR_id", "CHROM", "POS", "REF", "ALT", "HighImpact", "ModerateImpact", "Synonymous")
)
## Retrieved genotypes for 49 variants
GT_fromranges <- GT_fromranges[rowData(GT_fromranges)$ModerateImpact == 1,]
Also see this
tutorial on SummarizedExperiments objects, the genoMatrix()
class inherits from summarizedExperiment
, meaning that all
summarizedExperiment methods also apply to the genoMatrix
object. Here we first show some basic operations:
retrieve colData (i.e. sample info):
## DataFrame with 6 rows and 10 columns
## IID sex pheno pop superPop PC1
## <character> <numeric> <integer> <character> <character> <numeric>
## ALS1 ALS1 2 1 PJL SAS 22.5875
## ALS2 ALS2 2 1 BEB SAS 26.3737
## ALS3 ALS3 1 1 TSI EUR 30.7550
## ALS4 ALS4 2 1 GWD AFR -87.4804
## ALS5 ALS5 1 1 MSL AFR -87.7931
## ALS6 ALS6 1 1 YRI AFR -86.6407
## PC2 PC3 PC4 age
## <numeric> <numeric> <numeric> <numeric>
## ALS1 4.73803 38.915756 -10.166734 66.0174
## ALS2 -1.44590 31.971299 -6.022256 66.8431
## ALS3 48.86684 -5.409240 14.947440 54.3575
## ALS4 -6.19363 -0.950413 -1.202993 65.0978
## ALS5 -5.32393 -1.035330 -0.740323 58.9466
## ALS6 -3.41842 -1.396498 1.056161 63.1363
retrieve rowData (i.e. variant info):
## DataFrame with 6 rows and 2 columns
## ploidy w
## <character> <numeric>
## 1268 diploid 1
## 1270 diploid 1
## 1271 diploid 1
## 1273 diploid 1
## 1275 diploid 1
## 1276 diploid 1
show metadata:
metadata(GT)
## $gdb
## [1] "/private/var/folders/cl/wvc0rvjx4vd5rzt2_fhpmfth0000gp/T/RtmpCTy08L/rvat_tutorials.gdb"
##
## $gdbId
## [1] "u7fd78yvg5fivj57qv4af7ayh65s"
##
## $genomeBuild
## [1] "GRCh38"
##
## $ploidyLevels
## [1] "diploid"
##
## $m
## [1] 25000
##
## $nvar
## [1] 38
##
## $varSetName
## [1] "unnamed"
##
## $unit
## [1] "unnamed"
##
## $cohort
## [1] "pheno"
##
## $geneticModel
## [1] "allelic"
##
## $imputeMethod
## [1] "none"
A genoMatrix
object can be subsetted similarly as a
data.frame:
GT[1:5, 1:5]
## rvat genoMatrix Object
## class: genoMatrix
## dim: 5 5
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(5): 1268 1270 1271 1273 1275
## rowData names(2): ploidy w
## colnames(5): ALS1 ALS2 ALS3 ALS4 ALS5
## colData names(10): IID sex ... PC4 age
subset subjects with pheno == 1
GT[,GT$pheno == 1]
## rvat genoMatrix Object
## class: genoMatrix
## dim: 38 5000
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(5000): ALS1 ALS2 ... ALS4999 ALS5000
## colData names(10): IID sex ... PC4 age
subset first 2 variants for subjects older than 70:
GT[1:2,GT$age > 70]
## rvat genoMatrix Object
## class: genoMatrix
## dim: 2 3910
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(2): 1268 1270
## rowData names(2): ploidy w
## colnames(3910): ALS17 ALS31 ... CTRL19990 CTRL19995
## colData names(10): IID sex ... PC4 age
Although you’ll usually not work directly on the genotypes, they can be retrieved as followingL
assays(GT)$GT[1:5,1:5]
## ALS1 ALS2 ALS3 ALS4 ALS5
## 1268 0 0 0 0 0
## 1270 0 0 0 0 0
## 1271 0 0 0 0 0
## 1273 0 0 0 0 0
## 1275 0 0 0 0 0
return the variant allele frequencies using the getAF()
method:
getAF(GT)
## 1268 1270 1271 1273 1275 1276
## 1.260187e-04 2.021591e-05 2.252049e-05 1.007293e-04 2.000000e-05 2.056428e-05
## 1277 1278 1279 1281 1282 1283
## 2.000000e-05 2.157404e-05 2.000000e-05 2.183025e-05 2.162256e-05 2.237237e-05
## 1284 1285 1287 1288 1289 1290
## 2.000000e-05 2.000000e-05 2.128656e-05 2.052798e-05 2.000000e-05 1.420000e-03
## 1291 1292 1293 1294 1296 1297
## 2.003928e-05 2.045157e-05 2.193463e-05 2.029386e-05 2.133834e-05 8.000000e-05
## 1298 1299 1300 1304 1305 1306
## 2.139129e-05 2.850859e-04 2.008355e-05 2.015560e-05 2.115059e-05 2.050945e-05
## 1307 1308 1309 1312 1313 1314
## 2.000000e-05 4.648568e-05 2.072024e-05 4.112688e-05 2.297689e-05 2.025686e-05
## 1315 1316
## 4.123031e-05 2.115507e-05
reurn the alternate allele counts:
getAC(GT)
## 1268 1270 1271 1273 1275 1276 1277 1278 1279 1281 1282 1283 1284 1285 1287 1288
## 6 1 1 5 1 1 1 1 1 1 1 1 1 1 1 1
## 1289 1290 1291 1292 1293 1294 1296 1297 1298 1299 1300 1304 1305 1306 1307 1308
## 1 71 1 1 1 1 1 4 1 14 1 1 1 1 1 2
## 1309 1312 1313 1314 1315 1316
## 1 2 1 1 2 1
Flip genotype dosages such that all GT values represent minor allele
counts (note that in this case that was already the case) using the
flipToMinor()
method:
GT <- flipToMinor(GT)
Return a variant summary using the summariseGeno()
method:
sumgeno <- summariseGeno(GT)
head(sumgeno)
## VAR_id AF callRate geno0 geno1 geno2 hweP
## 1268 1268 1.260187e-04 0.95224 23800 6 0 1
## 1270 1270 2.021591e-05 0.98932 24732 1 0 1
## 1271 1271 2.252049e-05 0.88808 22201 1 0 1
## 1273 1273 1.007293e-04 0.99276 24814 5 0 1
## 1275 1275 2.000000e-05 1.00000 24999 1 0 1
## 1276 1276 2.056428e-05 0.97256 24313 1 0 1
Meanimpute genotype dosages using the recode()
method:
recode(GT, imputeMethod = "meanImpute")
## rvat genoMatrix Object
## class: genoMatrix
## dim: 38 25000
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(25000): ALS1 ALS2 ... CTRL19999 CTRL20000
## colData names(10): IID sex ... PC4 age
Apply dominant model using the recode()
method:
recode(GT, geneticModel = "dominant")
## rvat genoMatrix Object
## class: genoMatrix
## dim: 38 25000
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(25000): ALS1 ALS2 ... CTRL19999 CTRL20000
## colData names(10): IID sex ... PC4 age
Aggregate genotype counts across variants, creating a single aggregate score per individual (i.e. a burden score):
GT_agg <- aggregate(recode(GT, imputeMethod = "meanImpute"), returnGT = TRUE)
summary(GT_agg$aggregate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000410 0.0000460 0.0055313 0.0001275 2.0002520
Retrieve IDs of carriers per variants:
getCarriers(GT, colDataFields = c("pop","age"))[1:10,]
## VAR_id IID genotype pop age
## 1 1268 ALS1811 1 YRI 56.59491
## 2 1268 ALS2109 1 LWK 53.64514
## 3 1268 ALS2155 1 GIH 45.68899
## 4 1268 ALS3076 1 YRI 60.34504
## 5 1268 ALS4396 1 GWD 74.22662
## 6 1268 ALS4476 1 CHB 62.46120
## 7 1270 CTRL14060 1 KHV 67.61440
## 8 1271 CTRL18522 1 PUR 47.41249
## 9 1273 ALS1117 1 IBS 49.17095
## 10 1273 ALS3310 1 FIN 50.09404
RVAT provides an unified interface to various rare variant collapsing
methods, while allowing for MAF weighting, filtering etc. There is a
dedicated tutorial for association testing
vignette("association_testing")
, but we include a brief
primer here:
burden_results <- assocTest(GT,
pheno = "pheno",
covar = c("PC1", "PC2", "PC3", "PC4"),
test = c("firth", "skat", "acatv"),
name = "example") # optional name of the analysis
burden_results
## rvbResult with 3 rows and 24 columns
## unit cohort varSetName name pheno covar geneticModel
## <character> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
## 1 unnamed pheno unnamed example pheno PC1,PC2,PC3,PC4 allelic
## 2 unnamed pheno unnamed example pheno PC1,PC2,PC3,PC4 allelic
## 3 unnamed pheno unnamed example pheno PC1,PC2,PC3,PC4 allelic
## MAFweight test nvar caseCarriers ctrlCarriers meanCaseScore
## <Rle> <Rle> <numeric> <numeric> <numeric> <numeric>
## 1 1 firth 38 77 51 0.0170918
## 2 1 skat 38 77 51 0.0170918
## 3 1 acatv 38 77 51 0.0170918
## meanCtrlScore caseN ctrlN caseCallRate ctrlCallRate effect
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 0.00264118 5000 20000 0.962705 0.962584 1.75552
## 2 0.00264118 5000 20000 0.962705 0.962584 NA
## 3 0.00264118 5000 20000 0.962705 0.962584 NA
## effectSE effectCIlower effectCIupper OR P
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 0.176317 1.41495 2.10837 5.78647 0.0000e+00
## 2 NA NA NA NA 2.1049e-10
## 3 NA NA NA NA 0.0000e+00
A gdb can be subsetted, retaining queried variants. This can be useful to downsize a (big) gdb for targeted local analyses.
subsetGdb(gdb,
output = paste0(outdir, "/rvat_tutorials_subset.gdb"),
where = "gene_name = 'ABCA4'",
intersect = "varinfo",
overWrite = TRUE)
gdb_subset <- gdb(paste0(outdir, "/rvat_tutorials_subset.gdb"))
anno <- getAnno(gdb_subset,table = "varinfo")
table(anno$gene_name)
##
## ABCA4
## 589
Although several convenience methods exist to query and update a
gdb
(such as getGT()
, getAnno()
,
getCohort()
, uploadAnno()
and
uploadCohort()
), users can also query the gdb as any other
SQLite database. Either from the command-line (as demonstrated in
vignette("cmdline")
), or by connecting to the database
within R using RSQLite
for example.
An example:
# connect to the gdb
gdb <- gdb(paste0(outdir, "/rvat_tutorials.gdb"))
# select high impact SOD1 variants
SOD1_highimpact <- RSQLite::dbGetQuery(gdb,
"select distinct VAR_id from varinfo where gene_name = 'SOD1' and HighImpact = 1;")
More sophisticated examples will be included in our workflow examples.