Convert a CellTypeDataset into a standardized format

This function takes a CellTypeDataset (CTD), drops all genes without 1:1 orthologs in the output_species ("human" by default), converts the remaining genes to gene symbols, assigns names to each level, and converts all matrices to sparse matrices and/or DelayedArray objects.

standardise_ctd(
  ctd,
  dataset,
  input_species = NULL,
  output_species = "human",
  sctSpecies_origin = input_species,
  non121_strategy = "drop_both_species",
  method = "homologene",
  force_new_quantiles = TRUE,
  force_standardise = FALSE,
  remove_unlabeled_clusters = FALSE,
  numberOfBins = 40,
  keep_annot = TRUE,
  keep_plots = TRUE,
  as_sparse = TRUE,
  as_DelayedArray = FALSE,
  rename_columns = TRUE,
  make_columns_unique = FALSE,
  verbose = TRUE,
  ...
)

Arguments

ctd

Input CellTypeData.

dataset

CellTypeData name.

input_species

Which species the gene names in exp come from. See list_species for all available species.

output_species

Which species' gene names to convert exp to. See list_species for all available species.

sctSpecies_origin

Species that the sct_data originally came from, regardless of its current gene format (e.g. it was previously converted from mouse to human gene orthologs). This is used for computing an appropriate background.

non121_strategy

How to handle genes that don't have 1:1 mappings between input_species:output_species. Options include:

"drop_both_species" or "dbs" or 1 :
Drop genes that have duplicate mappings in either the input_species or output_species
(DEFAULT).
"drop_input_species" or "dis" or 2 :
Only drop genes that have duplicate mappings in the input_species.
"drop_output_species" or "dos" or 3 :
Only drop genes that have duplicate mappings in the output_species.
"keep_both_species" or "kbs" or 4 :
Keep all genes regardless of whether they have duplicate mappings in either species.
"keep_popular" or "kp" or 5 :
Return only the most "popular" interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.
"sum","mean","median","min" or "max" :
When gene_df is a matrix and gene_output="rownames", these options will aggregate many-to-one gene mappings (input_species-to-output_species) after dropping any duplicate genes in the output_species.
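The first four strategies can be illustrated on a small, made-up ortholog table. This is a minimal base-R sketch, not the orthogene implementation: the table, the gene names and the `apply_strategy` helper are hypothetical, and it assumes that "duplicate mappings" means a gene appearing in more than one row of the mapping. The "keep_popular" and aggregation options are not covered.

```r
# Hypothetical ortholog map: one row per input-gene/output-gene pair.
gene_map <- data.frame(
  input_gene    = c("A1", "A2", "B1", "C1", "C1", "D1"),
  ortholog_gene = c("A",  "A",  "B",  "C",  "Cx", "D"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of EWCE or orthogene).
apply_strategy <- function(map, strategy) {
  # Input genes that map to more than one output gene.
  dup_in <- map$input_gene %in%
    map$input_gene[duplicated(map$input_gene)]
  # Output genes that are mapped to by more than one input gene.
  dup_out <- map$ortholog_gene %in%
    map$ortholog_gene[duplicated(map$ortholog_gene)]
  keep <- switch(strategy,
    drop_both_species   = !dup_in & !dup_out,
    drop_input_species  = !dup_in,
    drop_output_species = !dup_out,
    keep_both_species   = rep(TRUE, nrow(map)),
    stop("strategy not covered by this sketch")
  )
  map[keep, , drop = FALSE]
}

# Only B1 and D1 have unambiguous mappings in both directions.
apply_strategy(gene_map, "drop_both_species")
```

In this toy table, "drop_input_species" removes only the two C1 rows, while "drop_output_species" removes only the two rows that map to A.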

method

R package to use for gene mapping:

"gprofiler" : Slower but more species and genes.
"homologene" : Faster but fewer species and genes.
"babelgene" : Faster but fewer species and genes. Also gives consensus scores for each gene mapping based on several different data sources.

force_new_quantiles

If FALSE, quantile computation is skipped when quantiles have already been computed. Set to TRUE (the default shown in the usage above) to override this and generate new quantiles.

force_standardise

If ctd has already been standardised, whether to rerun standardisation anyway (Default: FALSE).

remove_unlabeled_clusters

Remove any samples that have numeric column names.
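As a rough illustration of what "numeric column names" means, the following base-R sketch drops columns whose names parse as numbers. The matrix and the parsing rule are assumptions made for this example; the check EWCE performs internally may differ.

```r
# Hypothetical expression matrix with one unlabeled (numeric) cluster, "12".
exp_mat <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("astrocytes", "12", "microglia"))
)

# Assumption: a column name counts as numeric if it parses cleanly as a number.
is_numeric_name <- !is.na(suppressWarnings(as.numeric(colnames(exp_mat))))

# Keep only the labeled clusters.
labelled <- exp_mat[, !is_numeric_name, drop = FALSE]
colnames(labelled)
```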

numberOfBins

Number of non-zero quantile bins.
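One plausible reading of "non-zero quantile bins" is that zeros stay in bin 0 while the non-zero values are split into numberOfBins quantile bins. The sketch below follows that reading; `bin_nonzero` is a hypothetical helper written for illustration and is not the function EWCE uses.

```r
# Illustrative helper (not the EWCE implementation).
bin_nonzero <- function(x, numberOfBins = 40) {
  bins <- rep(0, length(x))          # zeros remain in bin 0
  nz <- x > 0
  if (any(nz)) {
    # Quantile break points computed from the non-zero values only.
    breaks <- unique(stats::quantile(
      x[nz], probs = seq(0, 1, length.out = numberOfBins + 1)
    ))
    if (length(breaks) > 1) {
      bins[nz] <- as.numeric(cut(x[nz], breaks = breaks, include.lowest = TRUE))
    } else {
      bins[nz] <- 1                  # all non-zero values are identical
    }
  }
  bins
}

x <- c(0, 0, 1, 2, 3, 4)
bin_nonzero(x, numberOfBins = 2)
```

With two bins, the two zeros stay at 0, the values 1 and 2 fall in bin 1, and the values 3 and 4 fall in bin 2.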

keep_annot

Keep the column annotation data if provided.

keep_plots

Keep the dendrograms if provided.

as_sparse

Convert to sparse matrix.

as_DelayedArray

Convert to DelayedArray.
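For orientation, the two storage formats controlled by as_sparse and as_DelayedArray can be produced by hand as follows. This sketch uses a made-up matrix and only shows the general kind of conversion; it is not a description of how standardise_ctd performs it.

```r
library(Matrix)

# Hypothetical dense expression matrix that is mostly zeros.
dense <- matrix(
  c(0, 0, 3, 0, 5, 0), nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("ct1", "ct2", "ct3"))
)

# Sparse representation: only the non-zero entries are stored.
sparse <- Matrix::Matrix(dense, sparse = TRUE)

# DelayedArray is a Bioconductor package; wrap only if it is installed.
if (requireNamespace("DelayedArray", quietly = TRUE)) {
  delayed <- DelayedArray::DelayedArray(sparse)
}
```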

rename_columns

Remove replace_chars from column names.

make_columns_unique

Rename each column with the prefix dataset.species.celltype.

verbose

Print messages. Set verbose=2 if you want to print all messages from internal functions as well.

...

Arguments passed on to orthogene::convert_orthologs

gene_df

Data object containing the genes (see gene_input for options on how the genes can be stored within the object).
Can be one of the following formats:

matrix :
A sparse or dense matrix.
data.frame :
A data.frame, data.table, or tibble.
list :
A list or character vector.

Genes, transcripts, proteins, SNPs, or genomic ranges can be provided in any format (HGNC, Ensembl, RefSeq, UniProt, etc.) and will be automatically converted to gene symbols unless specified otherwise with the ... arguments.
Note: If you set method="homologene", you must either supply genes in gene symbol format (e.g. "Sox2") OR set standardise_genes=TRUE.

gene_input

Which aspect of gene_df to get gene names from:

"rownames" :
From row names of data.frame/matrix.
"colnames" :
From column names of data.frame/matrix.
<column name> :
From a column in gene_df, e.g. "gene_names".

gene_output

How to return genes. Options include:

"rownames" :
As row names of gene_df.
"colnames" :
As column names of gene_df.
"columns" :
As new columns "input_gene", "ortholog_gene" (and "input_gene_standard" if standardise_genes=TRUE) in gene_df.
"dict" :
As a dictionary (named list) where the names are input_gene and the values are ortholog_gene.
"dict_rev" :
As a reversed dictionary (named list) where the names are ortholog_gene and the values are input_gene.
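The "dict" and "dict_rev" shapes can be pictured with a short base-R sketch. The mapping table below is hypothetical and simply reuses the "input_gene" and "ortholog_gene" column names described under "columns".

```r
# Hypothetical mapping table.
mapped <- data.frame(
  input_gene    = c("Sox2", "Gfap"),
  ortholog_gene = c("SOX2", "GFAP"),
  stringsAsFactors = FALSE
)

# "dict": names are input genes, values are ortholog genes.
dict <- stats::setNames(as.list(mapped$ortholog_gene), mapped$input_gene)

# "dict_rev": names are ortholog genes, values are input genes.
dict_rev <- stats::setNames(as.list(mapped$input_gene), mapped$ortholog_gene)

dict[["Sox2"]]      # look up the ortholog of an input gene
dict_rev[["GFAP"]]  # look up the input gene behind an ortholog
```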

standardise_genes

If TRUE AND gene_output="columns", a new column "input_gene_standard" will be added to gene_df containing standardised HGNC symbols identified by gorth.

drop_nonorths

Drop genes that don't have an ortholog in the output_species.

agg_fun

Aggregation function passed to aggregate_mapped_genes. Set to NULL to skip aggregation step (default).

mthreshold

Maximum number of ortholog names per gene to show. Passed to gorth. Only used when method="gprofiler" (DEFAULT : Inf).

sort_rows

Sort gene_df rows alphanumerically.

gene_map

A data.frame that maps the current gene names to new gene names. This function's behaviour will adapt to different situations as follows:

gene_map=<data.frame> :
When a data.frame containing the gene key:value columns (specified by input_col and output_col, respectively) is provided, this will be used to perform aggregation/expansion.
gene_map=NULL and input_species!=output_species :
A gene_map is automatically generated by map_orthologs to perform inter-species gene aggregation/expansion.
gene_map=NULL and input_species==output_species :
A gene_map is automatically generated by map_genes to perform within-species gene symbol standardization and aggregation/expansion.

input_col

Column name within gene_map with gene names matching the row names of X.

output_col

Column name within gene_map with gene names that you wish to map the row names of X onto.
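The interplay of gene_map, input_col and output_col can be sketched with base R. The matrix `X`, the mapping table and the use of `rowsum` for a sum-style aggregation are assumptions made for illustration; they do not reproduce aggregate_mapped_genes itself.

```r
# Hypothetical expression matrix with three input genes.
X <- matrix(
  c(1, 2, 3, 4, 5, 6), nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("ct1", "ct2"))
)

# Hypothetical gene_map: g1 and g2 both map onto G1 (many-to-one).
gene_map <- data.frame(
  input_gene    = c("g1", "g2", "g3"),
  ortholog_gene = c("G1", "G1", "G3"),
  stringsAsFactors = FALSE
)
input_col  <- "input_gene"
output_col <- "ortholog_gene"

# Look up the new name for every row of X, then sum rows sharing a name.
idx <- match(rownames(X), gene_map[[input_col]])
X_agg <- rowsum(X, group = gene_map[[output_col]][idx])
X_agg
```

Rows g1 and g2 are collapsed into a single G1 row holding their column-wise sums, while g3 is simply renamed to G3.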

Value

Standardised CellTypeDataset.

Examples

ctd <- ewceData::ctd()
#> see ?ewceData and browseVignettes('ewceData') for documentation
#> loading from cache
ctd_std <- EWCE::standardise_ctd(
    ctd = ctd,
    input_species = "mouse",
    dataset = "Zeisel2016"
)
#> Standardising CellTypeDataset
#> Found 5 matrix types across 2 CTD levels.
#> Processing level: 1
#> Processing level: 2