Return variant summaries — summariseGeno • rvat

Returns a per-variant summary of genotype counts, allele frequencies, call rates and Hardy-Weinberg equilibrium (HWE) test results. Note that the gdb implementation is described here; summariseGeno can also be run directly on a genoMatrix object, as described in the genoMatrix documentation.

# S4 method for gdb
summariseGeno(
  object,
  cohort = "SM",
  varSet = NULL,
  VAR_id = NULL,
  pheno = NULL,
  memlimit = 1000,
  geneticModel = "allelic",
  checkPloidy = NULL,
  keep = NULL,
  output = NULL,
  splitBy = NULL,
  minCallrateVar = 0,
  maxCallrateVar = Inf,
  minCallrateSM = 0,
  maxCallrateSM = Inf,
  minMAF = 0,
  maxMAF = 1,
  minMAC = 0,
  maxMAC = Inf,
  minCarriers = 0,
  maxCarriers = Inf,
  minCarrierFreq = 0,
  maxCarrierFreq = Inf,
  strict = TRUE,
  verbose = TRUE
)

Arguments

object: a gdb object
cohort: If a valid cohort name is provided, the uploaded data for this cohort is used to filter and annotate the genotypes. If not specified, all samples in the gdb will be loaded.
varSet: a varSetList or varSetFile object. Alternatively the VAR_id parameter can be specified.
VAR_id: A list of VAR_ids, alternatively the varSet parameter can be specified. The memlimit argument controls how many variants to analyze at a time.
pheno: colData field to test as response variable. Although it is not used within this method, it can be useful for filtering out samples that have missing data for the response variable.
memlimit: Maximum number of variants to load at once (if VAR_id is specified).
geneticModel: Which genetic model to apply? ('allelic', 'recessive' or 'dominant'). Defaults to allelic.
checkPloidy: Version of the human genome to use when assigning variant ploidy (diploid, XnonPAR, YnonPAR). Accepted inputs are GRCh37, hg19, GRCh38, hg38. If not specified, the genome build stored in the gdb is used, if available (it is included if the genomeBuild parameter was set in buildGdb). If the genome build is not included in the gdb metadata and no value is provided, all variants are assigned the default ploidy of "diploid".
keep: vector of sample IDs to keep, defaults to NULL, in which case all samples are kept.
output: Output file path for results. Defaults to NULL, in which case results are not written.
splitBy: Split variant summaries by labels indicated in the specified field.
minCallrateVar: Minimum genotype rate for variant retention.
maxCallrateVar: Maximum genotype rate for variant retention.
minCallrateSM: Minimum genotype rate for sample retention.
maxCallrateSM: Maximum genotype rate for sample retention.
minMAF: Minimum minor allele frequency for variant retention.
maxMAF: Maximum minor allele frequency for variant retention.
minMAC: Minimum minor allele count for variant retention.
maxMAC: Maximum minor allele count for variant retention.
minCarriers: Minimum carrier count for variant retention.
maxCarriers: Maximum carrier count for variant retention.
minCarrierFreq: Minimum carrier frequency for variant retention.
maxCarrierFreq: Maximum carrier frequency for variant retention.
strict: Should strict checks be performed? Defaults to TRUE. Strict checks currently include checking whether the supplied varSetFile/varSetList was generated from the same gdb as specified in object.
verbose: Should the function be verbose? (TRUE/FALSE), defaults to TRUE.
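
The min*/max* arguments bound per-variant (and per-sample) statistics. As a minimal sketch of that idea in base R, and not the package's internal code, the snippet below applies call-rate and MAF bounds to a small made-up summary table. The data frame `toy` and the helper `keep_variant` are hypothetical, and the sketch assumes that bounds are inclusive and that MAF is computed as min(AF, 1 - AF).

```r
# Hypothetical per-variant summary; the values are made up for illustration.
toy <- data.frame(
  VAR_id   = 1:4,
  AF       = c(0.0000, 0.0004, 0.0200, 0.9990),
  callRate = c(0.9180, 0.9946, 0.9900, 0.8000)
)

# Hypothetical helper (not part of rvat): TRUE for variants inside all bounds.
# Assumes inclusive bounds and MAF = min(AF, 1 - AF).
keep_variant <- function(AF, callRate,
                         minCallrateVar = 0, maxCallrateVar = Inf,
                         minMAF = 0, maxMAF = 1) {
  maf <- pmin(AF, 1 - AF)
  callRate >= minCallrateVar & callRate <= maxCallrateVar &
    maf >= minMAF & maf <= maxMAF
}

# Keep variants with call rate >= 0.9 and MAF <= 0.01.
kept <- toy[keep_variant(toy$AF, toy$callRate,
                         minCallrateVar = 0.9, maxMAF = 0.01), ]
kept$VAR_id
```

With the default bounds (0 and Inf, or 0 and 1 for MAF) every variant passes, which is consistent with the defaults listed in the usage above.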

Value

Returns a data.frame with the following columns:

VAR_id: VAR_id of the respective variant.
AF: Allele frequency
callRate: Call rate of the variant, i.e. the fraction of samples with a non-missing genotype.
geno0: Number of samples with genotype='0'. When geneticModel='allelic' or 'dominant' this is the number of individuals that are homozygous for the reference allele.
geno1: Number of samples with genotype='1'. When geneticModel = 'allelic' this is the number of individuals who are heterozygous. When geneticModel = 'dominant' this is the number of individuals who carry at least one alternate allele. When geneticModel = 'recessive' this is the number of individuals who are homozygous for the alternate allele.
geno2: When geneticModel = 'allelic', the number of individuals who are homozygous for the alternate allele.
hweP: P-value from the Hardy-Weinberg equilibrium test.
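
To make the relationship between the genetic models and the count columns concrete, here is a small base R sketch. It is an illustration under stated assumptions rather than the package's implementation: the dosage vector and the helpers `recode_geno` and `count_geno` are hypothetical, the variant is assumed to be diploid, and AF is assumed to be the alternate allele frequency among called genotypes.

```r
# Alternate-allele dosages for a hypothetical diploid variant; NA = missing call.
dosage <- c(0, 0, 1, 2, NA, 0, 1, 0, 0, 0)

# Hypothetical helper: recode dosages according to the genetic model.
recode_geno <- function(x, geneticModel = c("allelic", "dominant", "recessive")) {
  geneticModel <- match.arg(geneticModel)
  switch(geneticModel,
    allelic   = x,                   # keep 0/1/2 dosages
    dominant  = as.numeric(x >= 1),  # 1 = carries at least one alternate allele
    recessive = as.numeric(x == 2)   # 1 = homozygous for the alternate allele
  )
}

# Hypothetical helper: count genotypes, ignoring missing calls.
count_geno <- function(x) {
  c(geno0 = sum(x == 0, na.rm = TRUE),
    geno1 = sum(x == 1, na.rm = TRUE),
    geno2 = sum(x == 2, na.rm = TRUE))
}

allelic   <- count_geno(recode_geno(dosage, "allelic"))
dominant  <- count_geno(recode_geno(dosage, "dominant"))
recessive <- count_geno(recode_geno(dosage, "recessive"))

# Call rate and alternate allele frequency from the allelic counts.
called   <- sum(allelic)
callRate <- called / length(dosage)
AF       <- (allelic[["geno1"]] + 2 * allelic[["geno2"]]) / (2 * called)
```

The same arithmetic agrees with the example output further down this page: for the row with geno0 = 4969, geno1 = 4 and geno2 = 0, the formula gives 4 / (2 * 4973) = 4 / 9946, which is approximately 0.000402, the reported AF.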

Examples

library(rvatData)
gdb <- create_example_gdb()

# generate variant summaries for a list of variants
sumgeno <- tempfile()
summariseGeno(gdb,
              cohort = "pheno",
              VAR_id = 1:100,
              output = sumgeno)
#> Analysing chunk1
#> Retrieved genotypes for 100 variants
#> Analysing unit chunk1; varSet none

# generate variant summaries for varSets retrieved from a varSetFile
varsetfile <- varSetFile(rvat_example("rvatData_varsetfile.txt.gz"))
varsets <- getVarSet(varsetfile, unit = c("SOD1", "FUS"), varSetName = "High")
summariseGeno(gdb,
              cohort = "pheno",
              varSet = varsets,
              output = sumgeno)
#> Analysing FUS
#> Retrieved genotypes for 1 variants
#> Analysing unit FUS; varSet High
#> Analysing SOD1
#> Retrieved genotypes for 4 variants
#> Analysing unit SOD1; varSet High

# variant summaries can be generated for subgroups using the `splitBy` parameter.
# this will result in an additional column in the output for the subgroups
summariseGeno(gdb,
              cohort = "pheno",
              VAR_id = 1:100,
              splitBy = "pheno",
              output = sumgeno)
#> Analysing chunk1
#> Retrieved genotypes for 100 variants
#> Analysing unit chunk1; varSet none
data <- read.table(sumgeno, header = TRUE)
# contains 'pheno' column
head(data)
#>   VAR_id pheno           AF callRate geno0 geno1 geno2 hweP
#> 1      1     1 0.0000000000   0.9180  4590     0     0    1
#> 2      2     1 0.0000000000   0.9826  4913     0     0    1
#> 3      3     1 0.0000000000   1.0000  5000     0     0    1
#> 4      4     1 0.0000000000   0.8512  4256     0     0    1
#> 5      5     1 0.0001010305   0.9898  4948     1     0    1
#> 6      6     1 0.0004021717   0.9946  4969     4     0    1

# summariseGeno can also be run directly on a genoMatrix
data(GT)
sumgeno <- summariseGeno(GT)