Aggregate genotypes into a single (burden) score for each individual

Returns an aggregate of genotypes for each individual. This page describes the gdb implementation; aggregate can also be run directly on a genoMatrix object, as described in the genoMatrix documentation. The specified genetic model, weights, and MAF weighting are taken into account when aggregating. Aggregates are written to disk in the aggregateFile format, which can be used as input for assocTest-aggregateFile to perform gene set burden analyses.

# S4 method for gdb
aggregate(
  x,
  cohort = "SM",
  varSet = NULL,
  VAR_id = NULL,
  pheno = NULL,
  memlimit = 1000,
  geneticModel = "allelic",
  imputeMethod = "meanImpute",
  MAFweights = "none",
  checkPloidy = NULL,
  keep = NULL,
  output = NULL,
  signif = 6,
  minCallrateVar = 0,
  maxCallrateVar = Inf,
  minCallrateSM = 0,
  maxCallrateSM = Inf,
  minMAF = 0,
  maxMAF = 1,
  minMAC = 0,
  maxMAC = Inf,
  minCarriers = 0,
  maxCarriers = Inf,
  minCarrierFreq = 0,
  maxCarrierFreq = Inf,
  verbose = TRUE,
  strict = TRUE
)

Arguments

x: a gdb object
cohort: If a valid cohort name is provided, the uploaded data for this cohort is used to filter and annotate the genotypes. If not specified, all samples in the gdb will be loaded.
varSet: a varSetList or varSetFile object.
VAR_id: A vector of VAR_ids; alternatively, the varSet parameter can be specified. The memlimit argument controls how many variants to aggregate at a time.
pheno: colData field to test as response variable. Although it is not used within this method, specifying it can be useful for filtering out samples with missing data for the response variable.
memlimit: Maximum number of variants to load at once (if VAR_id is specified).
geneticModel: Which genetic model to apply? ('allelic', 'recessive' or 'dominant'). Defaults to allelic.
imputeMethod: Which imputation method to apply? ('meanImpute' or 'missingToRef'). Defaults to meanImpute.
MAFweights: MAF weighting method. Currently Madsen-Browning ('mb') is implemented. By default ('none'), no MAF weighting is applied.
checkPloidy: Version of the human genome to use when assigning variant ploidy (diploid, XnonPAR, YnonPAR). Accepted inputs are GRCh37, hg19, GRCh38, hg38. If not specified, the genome build stored in the gdb is used, if available (it is included if the genomeBuild parameter was set in buildGdb). If the genome build is not included in the gdb metadata and no value is provided, all variants are assigned the default ploidy of "diploid".
keep: Vector of sample IDs to keep. Defaults to NULL, in which case all samples are kept.
output: Output file path for results. Defaults to NULL, in which case results are not written.
signif: Number of significant digits to store. Defaults to 6.
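minCallrateVar: Minimum call rate (fraction of non-missing genotypes) for variant retention.
maxCallrateVar: Maximum call rate (fraction of non-missing genotypes) for variant retention.
minCallrateSM: Minimum call rate (fraction of non-missing genotypes) for sample retention.
maxCallrateSM: Maximum call rate (fraction of non-missing genotypes) for sample retention.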
minMAF: Minimum minor allele frequency for variant retention.
maxMAF: Maximum minor allele frequency for variant retention.
minMAC: Minimum minor allele count for variant retention.
maxMAC: Maximum minor allele count for variant retention.
minCarriers: Minimum carrier count for variant retention.
maxCarriers: Maximum carrier count for variant retention.
minCarrierFreq: Minimum carrier frequency for variant retention.
maxCarrierFreq: Maximum carrier frequency for variant retention.
verbose: Should the function be verbose? (TRUE/FALSE), defaults to TRUE.
strict: Should strict checks be performed? Defaults to TRUE. Strict checks currently include checking whether the supplied varSetFile/varSetList was generated from the same gdb as specified in x.
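
To make the interplay of geneticModel, imputeMethod and MAFweights more concrete, the base R sketch below computes a burden score from a small toy genotype matrix. It is an illustration under stated assumptions, not the package's implementation: the helper name toy_burden, the toy data, the order of the steps, the recessive/dominant recoding, and the Madsen-Browning-style weight 1 / sqrt(MAF * (1 - MAF)) are all assumptions made for this example.

```r
# Toy genotype matrix (hypothetical data): rows = variants, columns = samples,
# entries = alternate-allele counts (0/1/2), NA = missing genotype call.
GT <- matrix(
  c(0, 1, 0, NA,
    0, 0, 2, 1,
    1, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("var", 1:3), paste0("sample", 1:4))
)

# toy_burden() is a hypothetical helper, not part of the package.
toy_burden <- function(GT,
                       geneticModel = c("allelic", "recessive", "dominant"),
                       imputeMethod = c("meanImpute", "missingToRef"),
                       MAFweights = c("none", "mb")) {
  geneticModel <- match.arg(geneticModel)
  imputeMethod <- match.arg(imputeMethod)
  MAFweights <- match.arg(MAFweights)

  # Minor allele frequency per variant from non-missing calls
  # (all variants are assumed to be diploid here).
  af <- rowMeans(GT, na.rm = TRUE) / 2
  maf <- pmin(af, 1 - af)

  # Recode genotypes (assumed coding): recessive = two alternate alleles,
  # dominant = at least one alternate allele. NA stays NA.
  G <- switch(geneticModel,
    allelic = GT,
    recessive = (GT == 2) * 1,
    dominant = (GT >= 1) * 1
  )

  # Fill missing calls with the variant mean or with the reference (0).
  fill <- if (imputeMethod == "meanImpute") {
    rowMeans(G, na.rm = TRUE)
  } else {
    rep(0, nrow(G))
  }
  for (i in seq_len(nrow(G))) {
    G[i, is.na(G[i, ])] <- fill[i]
  }

  # Per-variant weights: rarer variants get larger weights under "mb".
  # Monomorphic variants (MAF = 0) would need to be filtered out first.
  w <- if (MAFweights == "mb") 1 / sqrt(maf * (1 - maf)) else rep(1, nrow(G))

  # A vector of length nrow(G) recycles down the columns, so row i is
  # multiplied by w[i]; summing columns gives one score per sample.
  colSums(G * w)
}

# One score per sample under different settings.
toy_burden(GT)
toy_burden(GT, geneticModel = "recessive", imputeMethod = "missingToRef")
toy_burden(GT, MAFweights = "mb")
```

In this sketch, each sample's score is simply the weighted sum of its (recoded, imputed) genotypes across the variants in the set, which is the general idea behind a burden score.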

Examples

library(rvatData)
aggregatefile <- tempfile()
gdb <- gdb(rvat_example("rvatData.gdb"))

# generate aggregates for varSets
varsetfile <- varSetFile(rvat_example("rvatData_varsetfile.txt.gz"))
varsets <- getVarSet(varsetfile, unit = c("SOD1", "FUS"), varSetName = "High")
aggregate(x = gdb,
          varSet = varsets,
          maxMAF = 0.001,
          output = aggregatefile,
          verbose = FALSE)

# generate aggregates for a list of variants
aggregate(x = gdb,
          VAR_id = 1:100,
          maxMAF = 0.001,
          output = aggregatefile,
          verbose = FALSE)

# use recessive model
aggregate(x = gdb,
          varSet = varsets,
          maxMAF = 0.001,
          geneticModel = "recessive",
          output = aggregatefile,
          verbose = FALSE)

# apply MAF weighting
aggregate(x = gdb,
          varSet = varsets,
          maxMAF = 0.001,
          MAFweights = "mb",
          output = aggregatefile,
          verbose = FALSE)
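
The examples above leave the imputation and call-rate arguments at their defaults. The following sketch combines several of the documented arguments (geneticModel, imputeMethod, minCallrateVar) in one call. It reuses the same example data as above; the threshold of 0.9 is an arbitrary illustrative value, and the requireNamespace guard is only there so the snippet is skipped when rvat and rvatData are not installed.

```r
aggregatefile <- tempfile()

# Guarded so the snippet is skipped when the packages are unavailable.
if (requireNamespace("rvat", quietly = TRUE) &&
    requireNamespace("rvatData", quietly = TRUE)) {
  library(rvat)
  library(rvatData)

  gdb <- gdb(rvat_example("rvatData.gdb"))
  varsetfile <- varSetFile(rvat_example("rvatData_varsetfile.txt.gz"))
  varsets <- getVarSet(varsetfile, unit = c("SOD1", "FUS"), varSetName = "High")

  # dominant model, missing genotypes set to reference,
  # variants with call rate below 0.9 (illustrative threshold) dropped
  aggregate(x = gdb,
            varSet = varsets,
            maxMAF = 0.001,
            geneticModel = "dominant",
            imputeMethod = "missingToRef",
            minCallrateVar = 0.9,
            output = aggregatefile,
            verbose = FALSE)
}
```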