R/bootstrap_enrichment_test.R
bootstrap_enrichment_test.Rd
bootstrap_enrichment_test
takes a genelist and a single cell type
transcriptome dataset and determines the probability of enrichment and fold
changes for each cell type.
bootstrap_enrichment_test(
sct_data = NULL,
hits = NULL,
bg = NULL,
genelistSpecies = NULL,
sctSpecies = NULL,
sctSpecies_origin = sctSpecies,
output_species = "human",
method = "homologene",
reps = 100,
no_cores = 1,
annotLevel = 1,
geneSizeControl = FALSE,
controlledCT = NULL,
mtc_method = "BH",
sort_results = TRUE,
standardise_sct_data = TRUE,
standardise_hits = FALSE,
verbose = TRUE,
localHub = FALSE,
store_gene_data = TRUE
)
List generated using generate_celltype_data.
List of gene symbols containing the target gene list.
Will automatically be converted to human gene symbols
if geneSizeControl=TRUE
.
List of gene symbols containing the background gene list
(including hit genes). If bg=NULL
,
an appropriate gene background will be created automatically.
Species that hits
genes came from
(no longer limited to just "mouse" and "human").
See list_species for all available species.
Species that sct_data
is currently formatted as
(no longer limited to just "mouse" and "human").
See list_species for all available species.
Species that the sct_data
originally came from, regardless of its current gene format
(e.g. it was previously converted from mouse to human gene orthologs).
This is used for computing an appropriate backgrund.
Species to convert sct_data
and hits
to
(Default: "human").
See list_species for all available species.
R package to use for gene mapping:
"gprofiler"
: Slower but more species and genes.
"homologene"
: Faster but fewer species and genes.
"babelgene"
: Faster but fewer species and genes.
Also gives consensus scores for each gene mapping based on a
several different data sources.
Number of random gene lists to generate (Default: 100, but should be >=10,000 for publication-quality results).
Number of cores to parallelise
bootstrapping reps
over.
An integer indicating which level of sct_data
to
analyse (Default: 1).
Whether you want to control for
GC content and transcript length. Recommended if the gene list originates
from genetic studies (Default: FALSE).
If set to TRUE
, then hits
must be from humans.
[Optional] If not NULL, and instead is the name of a cell type, then the bootstrapping controls for expression within that cell type.
Multiple-testing correction method (passed to p.adjust).
Sort enrichment results from smallest to largest p-values.
Should sct_data
be standardised?
if TRUE
:
When sctSpecies!=output_species
the sct_data
will be checked for object formatting and
the genes will be converted to the orthologs of the output_species
with standardise_ctd
(which calls map_genes internally).
When sctSpecies==output_species
,
the sct_data
will be checked for object formatting
with standardise_ctd, but the gene names
will remain untouched.
Should hits
be standardised?
If TRUE
:
When genelistSpecies!=output_species
,
the genes will be converted to the orthologs of the output_species
with convert_orthologs.
When genelistSpecies==output_species
,
the genes will be standardised with map_genes.
If FALSE
, hits
will be passed on to subsequent steps as-is.
Print messages.
If working offline, add argument localHub=TRUE to work with a local, non-updated hub; It will only have resources available that have previously been downloaded. If offline, Please also see BiocManager vignette section on offline use to ensure proper functionality.
Store sampled gene data for every bootstrap iteration.
When the number of bootstrap reps
is very high (>=100k) and/or
the number of genes in hits
is very high, you may want
to set store_gene_data=FALSE
to avoid using excessive amounts of
CPU memory.
A list containing three elements:
hit.cells
: vector containing the summed proportion of
expression in each cell type for the target list.
gene_data:
data.table showing the number of time each gene
appeared in the bootstrap sample.
bootstrap_data
: matrix in which each row represents the
summed proportion of expression in each cell type for one of the
random lists
controlledCT
: the controlled cell type (if applicable)
# Load the single cell data
sct_data <- ewceData::ctd()
#> see ?ewceData and browseVignettes('ewceData') for documentation
#> loading from cache
# Set the parameters for the analysis
# Use 3 bootstrap lists for speed, for publishable analysis use >=10,000
reps <- 3
# Load gene list from Alzheimer's disease GWAS
hits <- ewceData::example_genelist()
#> see ?ewceData and browseVignettes('ewceData') for documentation
#> loading from cache
# Bootstrap significance test, no control for transcript length or GC content
full_results <- EWCE::bootstrap_enrichment_test(
sct_data = sct_data,
hits = hits,
reps = reps,
annotLevel = 1,
sctSpecies = "mouse",
genelistSpecies = "human")
#> 1 core(s) assigned as workers (3 reserved).
#> Generating gene background for mouse x human ==> human
#> Gathering ortholog reports.
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in homologene.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: 9606
#> Gene table with 19,129 rows retrieved.
#> Returning all 19,129 genes from human.
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in homologene.
#> Mapping species name: mouse
#> Common name mapping found for mouse
#> 1 organism identified from search: 10090
#> Gene table with 21,207 rows retrieved.
#> Returning all 21,207 genes from mouse.
#> --
#> --
#> Preparing gene_df.
#> data.frame format detected.
#> Extracting genes from Gene.Symbol.
#> 21,207 genes extracted.
#> Converting mouse ==> human orthologs using: homologene
#> Retrieving all organisms available in homologene.
#> Mapping species name: mouse
#> Common name mapping found for mouse
#> 1 organism identified from search: 10090
#> Retrieving all organisms available in homologene.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: 9606
#> Checking for genes without orthologs in human.
#> Extracting genes from input_gene.
#> 17,355 genes extracted.
#> Extracting genes from ortholog_gene.
#> 17,355 genes extracted.
#> Checking for genes without 1:1 orthologs.
#> Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
#> Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
#> Filtering gene_df with gene_map
#> Adding input_gene col to gene_df.
#> Adding ortholog_gene col to gene_df.
#>
#> =========== REPORT SUMMARY ===========
#> Total genes dropped after convert_orthologs :
#> 4,725 / 21,207 (22%)
#> Total genes remaining after convert_orthologs :
#> 16,482 / 21,207 (78%)
#> --
#>
#> =========== REPORT SUMMARY ===========
#> 16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
#> 16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
#> Gathering ortholog reports.
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in homologene.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: 9606
#> Gene table with 19,129 rows retrieved.
#> Returning all 19,129 genes from human.
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in homologene.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: 9606
#> Gene table with 19,129 rows retrieved.
#> Returning all 19,129 genes from human.
#> --
#>
#> =========== REPORT SUMMARY ===========
#> 19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
#> 19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
#> 16,482 intersect background genes used.
#> Standardising CellTypeDataset
#> Checking gene list inputs.
#> Running without gene size control.
#> 17 hit gene(s) remain after filtering.
#> Computing gene scores.
#> Using previously sampled genes.
#> Computing gene counts.
#> Testing for enrichment in 7 cell types...
#> Sorting results by p-value.
#> Computing BH-corrected q-values.
#> 2 significant cell type enrichment results @ q<0.05 :
#> CellType annotLevel p fold_change sd_from_mean q
#> 1 microglia 1 0 1.920429 8.596007 0
#> 2 astrocytes_ependymal 1 0 1.480398 2.178928 0