Getting started • rvat

# Libraries
library(rvat)
library(rvatData)
library(SummarizedExperiment)

Example data

In these tutorials, we’ll use simulated example data that we included in the rvatData package.

Included in-memory datasets:

data(package = "rvatData")$results[,"Item"]

## [1] "GT"         "GTsmall"    "rvbresults"

Included on-disk datasets, note that the paths to the on-disk datasets can be retrieved as below:

rvat_example()

## [1] "rvatData_varsetfile.txt.gz" "rvatData.gdb"              
## [3] "rvatData.pheno"             "rvatData.varinfo"          
## [5] "rvatData.vcf.gz"

gdbpath <- rvat_example("rvatData.gdb")

Some output will be written during the tutorials, use a temporary directory for that:

outdir <- tempdir()

Setting up and populating a gdb

Creating a gdb based on a VCF-file

The first step consists of setting up a so-called gdb(), an SQLite-based data structure which forms the heart of all RVAT operation. Gdb files allow for much more rapid and memory efficient loading of sample genotype data into numeric matrices within R. They also allow for the upload and integration of additional variant and sample annotation data to allow for complex data

Here, we show how to create a gdb() object from a VCF-file. Note that RVAT will assign a unique VAR_id to each variant, which allow rapid querying of variants, these IDs have no inherent meaning (i.e. they do not refer to chromosomal position or the like). Note that providing a genome build is optional, this will be stored in the gdb metadata, and will be used in downstream analyses to assign ploidy on sex chromosomes (diploid, XnonPAR, YnonPAR).

vcfpath <- rvat_example("rvatData.vcf.gz")
buildGdb(vcf = vcfpath,
         output = paste0(outdir,"/rvat_tutorials.gdb"), 
         overWrite = TRUE,
         genomeBuild = "GRCh38")

Connect to the gdb you just created using the gdb() function:

gdb <- gdb(paste0(outdir,"/rvat_tutorials.gdb"))

Import variant info

add variant info to the gdb using the uploadAnno() method, variants are mapped to the gdb VAR_id based on the CHROM,POS,REF,ALT fields.

# example file including info such as chromosomal location, annotated gene and variant impact.
varinfopath <- rvat_example("rvatData.varinfo")
varinfo <- read.table(varinfopath, header = TRUE)

# the `name` parameter specifies the name of the table in the gdb, 
# the variant info that should be imported is specified using the `value` parameter
uploadAnno(object = gdb, name = "varInfo",value = varinfo)

The listAnno() method lists all variant annotation tables in the gdb:

listAnno(gdb)

##      name               value                     date
## 1 varInfo interactive_session Wed Feb 12 12:38:49 2025

The getAnno() method retrieves a given variant annotation table from the gdb:

varinfo <- getAnno(gdb, table = "varinfo")
head(varinfo)

##   VAR_id CHROM      POS          ID REF ALT     QUAL FILTER AC    AN
## 1      1  chr1 11013912        <NA>   A   G   722.13     NA  1 45484
## 2      2  chr1 11013928 rs755357622   C   T  2598.23     NA  1 49194
## 3      3  chr1 11013936        <NA>   A   C  4135.36     NA  1 50000
## 4      4  chr1 11013936        <NA>   A   G  4135.36     NA  1 42636
## 5      5  chr1 11013952        <NA>   C   G   517.13     NA  1 49574
## 6      6  chr1 11016874  rs80356715   C   T 64910.10     NA 32 49766
##             AF gene_name HighImpact ModerateImpact Synonymous CADDphred
## 1 0.0000219858    TARDBP          0              1          0      24.8
## 2 0.0000203277    TARDBP          0              0          1         .
## 3 0.0000200000    TARDBP          0              1          0      24.1
## 4 0.0000234544    TARDBP          0              1          0      22.6
## 5 0.0000201719    TARDBP          0              0          1         .
## 6 0.0006430093    TARDBP          0              1          0      22.2
##   PolyPhen SIFT
## 1        P    D
## 2        .    .
## 3        B    T
## 4        B    T
## 5        .    .
## 6        B    T

a specific region can be selected as follows:

varinfo <- getAnno(gdb, table = "varinfo", ranges = data.frame(CHROM="chr1", start = 11013847, end = 11016874))
head(varinfo)

##   VAR_id CHROM      POS          ID REF ALT     QUAL FILTER AC    AN
## 1      1  chr1 11013912        <NA>   A   G   722.13     NA  1 45484
## 2      2  chr1 11013928 rs755357622   C   T  2598.23     NA  1 49194
## 3      3  chr1 11013936        <NA>   A   C  4135.36     NA  1 50000
## 4      4  chr1 11013936        <NA>   A   G  4135.36     NA  1 42636
## 5      5  chr1 11013952        <NA>   C   G   517.13     NA  1 49574
## 6      6  chr1 11016874  rs80356715   C   T 64910.10     NA 32 49766
##             AF gene_name HighImpact ModerateImpact Synonymous CADDphred
## 1 0.0000219858    TARDBP          0              1          0      24.8
## 2 0.0000203277    TARDBP          0              0          1         .
## 3 0.0000200000    TARDBP          0              1          0      24.1
## 4 0.0000234544    TARDBP          0              1          0      22.6
## 5 0.0000201719    TARDBP          0              0          1         .
## 6 0.0006430093    TARDBP          0              1          0      22.2
##   PolyPhen SIFT
## 1        P    D
## 2        .    .
## 3        B    T
## 4        B    T
## 5        .    .
## 6        B    T

Import sample info

add phenotype info to the gdb using the uploadCohort() method:

# First we'll read in the phenotype data
phenopath <- rvat_example("rvatData.pheno")
pheno <- read.table( phenopath, header = TRUE)

# Import into the gdb (the `IID` column is used)
# the `name` parameter specifies the name of the table in the gdb, 
# the sample info that should be important is specified using the `value` parameter
uploadCohort(object = gdb, name = "pheno", value = pheno)

The listCohort() method lists all uploaded sample info tables in the gdb:

listCohort(gdb)

##    name               value                     date
## 1 pheno interactive_session Wed Feb 12 12:38:50 2025

The getCohort() method retrieves a given sample info table from the gdb:

pheno <- getCohort(gdb, cohort = "pheno")
head(pheno)

##    IID sex pheno pop superPop       PC1       PC2        PC3         PC4
## 1 ALS1   2     1 PJL      SAS  22.58749  4.738025 38.9157564 -10.1667342
## 2 ALS2   2     1 BEB      SAS  26.37370 -1.445903 31.9712989  -6.0222555
## 3 ALS3   1     1 TSI      EUR  30.75496 48.866843 -5.4092404  14.9474404
## 4 ALS4   2     1 GWD      AFR -87.48035 -6.193628 -0.9504131  -1.2029932
## 5 ALS5   1     1 MSL      AFR -87.79310 -5.323925 -1.0353300  -0.7403234
## 6 ALS6   1     1 YRI      AFR -86.64071 -3.418417 -1.3964977   1.0561607
##        age
## 1 66.01743
## 2 66.84306
## 3 54.35749
## 4 65.09777
## 5 58.94664
## 6 63.13635

GenoMatrix objects

Central to RVAT is the genoMatrix() object. The genoMatrix() object is an extension of the BioConductor summarizedExperiment class. A genoMatrix() contains a matrix of genotypes, variant info and sample info queried from a gdb() object, various methods are implemented for genoMatrix objects, which we’ll explore in this section.

Construct a genoMatrix

First, we’ll build a genoMatrix object using the getGT() method. Variants to include can be specified in two main ways: 1) by directly passing the VAR_ids to the getGT() method, 2) by providing a set of genomic ranges to retrieve and 23 by supplying a varSet (discussed later).

Here we’ll start with methods 1-2, varSets are discussed in vignette("association_testing") tutorial.

# get varinfo
varinfo <- getAnno(gdb, table = "varinfo", where = "gene_name = 'SOD1' and ModerateImpact = 1")

GT <- getGT(gdb, 
            VAR_id = varinfo$VAR_id,
            cohort = "pheno")

## Retrieved genotypes for 38 variants

GT

## rvat genoMatrix Object
## class: genoMatrix 
## dim: 38 25000 
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(25000): ALS1 ALS2 ... CTRL19999 CTRL20000
## colData names(10): IID sex ... PC4 age

Similarly we can retrieve variants within a given range, like below. Annotations can be included by specifying the anno parameter, and will be available in the rowData of the genoMatrix:

GT_fromranges <- getGT(
  gdb,
  ranges = data.frame(CHROM = "chr21", start = 31659666, end = 31668931),
  cohort = "pheno",
  anno = "varInfo",
  annoFields = c("VAR_id", "CHROM", "POS", "REF", "ALT", "HighImpact", "ModerateImpact", "Synonymous")
)

## Retrieved genotypes for 49 variants

GT_fromranges <- GT_fromranges[rowData(GT_fromranges)$ModerateImpact == 1,]

Basic operations

Also see this tutorial on SummarizedExperiments objects, the genoMatrix() class inherits from summarizedExperiment, meaning that all summarizedExperiment methods also apply to the genoMatrix object. Here we first show some basic operations:

retrieve colData (i.e. sample info):

head(colData(GT))

## DataFrame with 6 rows and 10 columns
##              IID       sex     pheno         pop    superPop       PC1
##      <character> <numeric> <integer> <character> <character> <numeric>
## ALS1        ALS1         2         1         PJL         SAS   22.5875
## ALS2        ALS2         2         1         BEB         SAS   26.3737
## ALS3        ALS3         1         1         TSI         EUR   30.7550
## ALS4        ALS4         2         1         GWD         AFR  -87.4804
## ALS5        ALS5         1         1         MSL         AFR  -87.7931
## ALS6        ALS6         1         1         YRI         AFR  -86.6407
##            PC2       PC3        PC4       age
##      <numeric> <numeric>  <numeric> <numeric>
## ALS1   4.73803 38.915756 -10.166734   66.0174
## ALS2  -1.44590 31.971299  -6.022256   66.8431
## ALS3  48.86684 -5.409240  14.947440   54.3575
## ALS4  -6.19363 -0.950413  -1.202993   65.0978
## ALS5  -5.32393 -1.035330  -0.740323   58.9466
## ALS6  -3.41842 -1.396498   1.056161   63.1363

retrieve rowData (i.e. variant info):

head(rowData(GT))

## DataFrame with 6 rows and 2 columns
##           ploidy         w
##      <character> <numeric>
## 1268     diploid         1
## 1270     diploid         1
## 1271     diploid         1
## 1273     diploid         1
## 1275     diploid         1
## 1276     diploid         1

show metadata:

metadata(GT)

## $gdb
## [1] "/private/var/folders/cl/wvc0rvjx4vd5rzt2_fhpmfth0000gp/T/RtmpCTy08L/rvat_tutorials.gdb"
## 
## $gdbId
## [1] "u7fd78yvg5fivj57qv4af7ayh65s"
## 
## $genomeBuild
## [1] "GRCh38"
## 
## $ploidyLevels
## [1] "diploid"
## 
## $m
## [1] 25000
## 
## $nvar
## [1] 38
## 
## $varSetName
## [1] "unnamed"
## 
## $unit
## [1] "unnamed"
## 
## $cohort
## [1] "pheno"
## 
## $geneticModel
## [1] "allelic"
## 
## $imputeMethod
## [1] "none"

A genoMatrix object can be subsetted similarly as a data.frame:

GT[1:5, 1:5]

## rvat genoMatrix Object
## class: genoMatrix 
## dim: 5 5 
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(5): 1268 1270 1271 1273 1275
## rowData names(2): ploidy w
## colnames(5): ALS1 ALS2 ALS3 ALS4 ALS5
## colData names(10): IID sex ... PC4 age

subset subjects with pheno == 1

GT[,GT$pheno == 1]

## rvat genoMatrix Object
## class: genoMatrix 
## dim: 38 5000 
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(5000): ALS1 ALS2 ... ALS4999 ALS5000
## colData names(10): IID sex ... PC4 age

subset first 2 variants for subjects older than 70:

GT[1:2,GT$age > 70]

## rvat genoMatrix Object
## class: genoMatrix 
## dim: 2 3910 
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(2): 1268 1270
## rowData names(2): ploidy w
## colnames(3910): ALS17 ALS31 ... CTRL19990 CTRL19995
## colData names(10): IID sex ... PC4 age

Although you’ll usually not work directly on the genotypes, they can be retrieved as followingL

assays(GT)$GT[1:5,1:5]

##      ALS1 ALS2 ALS3 ALS4 ALS5
## 1268    0    0    0    0    0
## 1270    0    0    0    0    0
## 1271    0    0    0    0    0
## 1273    0    0    0    0    0
## 1275    0    0    0    0    0

RVAT methods

return the variant allele frequencies using the getAF() method:

getAF(GT)

##         1268         1270         1271         1273         1275         1276 
## 1.260187e-04 2.021591e-05 2.252049e-05 1.007293e-04 2.000000e-05 2.056428e-05 
##         1277         1278         1279         1281         1282         1283 
## 2.000000e-05 2.157404e-05 2.000000e-05 2.183025e-05 2.162256e-05 2.237237e-05 
##         1284         1285         1287         1288         1289         1290 
## 2.000000e-05 2.000000e-05 2.128656e-05 2.052798e-05 2.000000e-05 1.420000e-03 
##         1291         1292         1293         1294         1296         1297 
## 2.003928e-05 2.045157e-05 2.193463e-05 2.029386e-05 2.133834e-05 8.000000e-05 
##         1298         1299         1300         1304         1305         1306 
## 2.139129e-05 2.850859e-04 2.008355e-05 2.015560e-05 2.115059e-05 2.050945e-05 
##         1307         1308         1309         1312         1313         1314 
## 2.000000e-05 4.648568e-05 2.072024e-05 4.112688e-05 2.297689e-05 2.025686e-05 
##         1315         1316 
## 4.123031e-05 2.115507e-05

reurn the alternate allele counts:

getAC(GT)

## 1268 1270 1271 1273 1275 1276 1277 1278 1279 1281 1282 1283 1284 1285 1287 1288 
##    6    1    1    5    1    1    1    1    1    1    1    1    1    1    1    1 
## 1289 1290 1291 1292 1293 1294 1296 1297 1298 1299 1300 1304 1305 1306 1307 1308 
##    1   71    1    1    1    1    1    4    1   14    1    1    1    1    1    2 
## 1309 1312 1313 1314 1315 1316 
##    1    2    1    1    2    1

Flip genotype dosages such that all GT values represent minor allele counts (note that in this case that was already the case) using the flipToMinor() method:

GT <- flipToMinor(GT)

Return a variant summary using the summariseGeno() method:

sumgeno <- summariseGeno(GT)
head(sumgeno)

##      VAR_id           AF callRate geno0 geno1 geno2 hweP
## 1268   1268 1.260187e-04  0.95224 23800     6     0    1
## 1270   1270 2.021591e-05  0.98932 24732     1     0    1
## 1271   1271 2.252049e-05  0.88808 22201     1     0    1
## 1273   1273 1.007293e-04  0.99276 24814     5     0    1
## 1275   1275 2.000000e-05  1.00000 24999     1     0    1
## 1276   1276 2.056428e-05  0.97256 24313     1     0    1

Meanimpute genotype dosages using the recode() method:

recode(GT, imputeMethod = "meanImpute")

## rvat genoMatrix Object
## class: genoMatrix 
## dim: 38 25000 
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(25000): ALS1 ALS2 ... CTRL19999 CTRL20000
## colData names(10): IID sex ... PC4 age

Apply dominant model using the recode() method:

recode(GT, geneticModel = "dominant")

## rvat genoMatrix Object
## class: genoMatrix 
## dim: 38 25000 
## metadata(11): gdb gdbId ... geneticModel imputeMethod
## assays(1): GT
## rownames(38): 1268 1270 ... 1315 1316
## rowData names(2): ploidy w
## colnames(25000): ALS1 ALS2 ... CTRL19999 CTRL20000
## colData names(10): IID sex ... PC4 age

Aggregate genotype counts across variants, creating a single aggregate score per individual (i.e. a burden score):

GT_agg <- aggregate(recode(GT, imputeMethod = "meanImpute"), returnGT = TRUE)
summary(GT_agg$aggregate)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000410 0.0000460 0.0055313 0.0001275 2.0002520

Retrieve IDs of carriers per variants:

getCarriers(GT, colDataFields = c("pop","age"))[1:10,]

##    VAR_id       IID genotype pop      age
## 1    1268   ALS1811        1 YRI 56.59491
## 2    1268   ALS2109        1 LWK 53.64514
## 3    1268   ALS2155        1 GIH 45.68899
## 4    1268   ALS3076        1 YRI 60.34504
## 5    1268   ALS4396        1 GWD 74.22662
## 6    1268   ALS4476        1 CHB 62.46120
## 7    1270 CTRL14060        1 KHV 67.61440
## 8    1271 CTRL18522        1 PUR 47.41249
## 9    1273   ALS1117        1 IBS 49.17095
## 10   1273   ALS3310        1 FIN 50.09404

Association testing

RVAT provides an unified interface to various rare variant collapsing methods, while allowing for MAF weighting, filtering etc. There is a dedicated tutorial for association testing vignette("association_testing") , but we include a brief primer here:

burden_results <- assocTest(GT,
                            pheno = "pheno",
                            covar = c("PC1", "PC2", "PC3", "PC4"),
                            test = c("firth", "skat", "acatv"),
                            name = "example") # optional name of the analysis
burden_results

## rvbResult with 3 rows and 24 columns
##          unit cohort varSetName    name pheno           covar geneticModel
##   <character>  <Rle>      <Rle>   <Rle> <Rle>           <Rle>        <Rle>
## 1     unnamed  pheno    unnamed example pheno PC1,PC2,PC3,PC4      allelic
## 2     unnamed  pheno    unnamed example pheno PC1,PC2,PC3,PC4      allelic
## 3     unnamed  pheno    unnamed example pheno PC1,PC2,PC3,PC4      allelic
##   MAFweight  test      nvar caseCarriers ctrlCarriers meanCaseScore
##       <Rle> <Rle> <numeric>    <numeric>    <numeric>     <numeric>
## 1         1 firth        38           77           51     0.0170918
## 2         1  skat        38           77           51     0.0170918
## 3         1 acatv        38           77           51     0.0170918
##   meanCtrlScore     caseN     ctrlN caseCallRate ctrlCallRate    effect
##       <numeric> <numeric> <numeric>    <numeric>    <numeric> <numeric>
## 1    0.00264118      5000     20000     0.962705     0.962584   1.75552
## 2    0.00264118      5000     20000     0.962705     0.962584        NA
## 3    0.00264118      5000     20000     0.962705     0.962584        NA
##    effectSE effectCIlower effectCIupper        OR          P
##   <numeric>     <numeric>     <numeric> <numeric>  <numeric>
## 1  0.176317       1.41495       2.10837   5.78647 0.0000e+00
## 2        NA            NA            NA        NA 2.1049e-10
## 3        NA            NA            NA        NA 0.0000e+00

More on the gdb

Subsetting a gdb

A gdb can be subsetted, retaining queried variants. This can be useful to downsize a (big) gdb for targeted local analyses.

subsetGdb(gdb, 
          output = paste0(outdir, "/rvat_tutorials_subset.gdb"),
          where = "gene_name = 'ABCA4'",
          intersect = "varinfo", 
          overWrite = TRUE)

gdb_subset <- gdb(paste0(outdir, "/rvat_tutorials_subset.gdb"))
anno <- getAnno(gdb_subset,table = "varinfo")
table(anno$gene_name)

## 
## ABCA4 
##   589

Using RSQLite

Although several convenience methods exist to query and update a gdb (such as getGT(), getAnno(), getCohort(), uploadAnno() and uploadCohort()), users can also query the gdb as any other SQLite database. Either from the command-line (as demonstrated in vignette("cmdline")), or by connecting to the database within R using RSQLite for example.

An example:

# connect to the gdb
gdb <- gdb(paste0(outdir, "/rvat_tutorials.gdb"))

# select high impact SOD1 variants
SOD1_highimpact <- RSQLite::dbGetQuery(gdb,
                              "select distinct VAR_id from varinfo where gene_name = 'SOD1' and HighImpact = 1;")

More sophisticated examples will be included in our workflow examples.