This function will take a CTD,
drop all genes without 1:1 orthologs with the output_species ("human" by default),
convert the remaining genes to gene symbols,
assign names to each level,
and convert all matrices to sparse matrices and/or DelayedArray.
standardise_ctd(
ctd,
dataset,
input_species = NULL,
output_species = "human",
sctSpecies_origin = input_species,
non121_strategy = "drop_both_species",
method = "homologene",
force_new_quantiles = TRUE,
force_standardise = FALSE,
remove_unlabeled_clusters = FALSE,
numberOfBins = 40,
keep_annot = TRUE,
keep_plots = TRUE,
as_sparse = TRUE,
as_DelayedArray = FALSE,
rename_columns = TRUE,
make_columns_unique = FALSE,
verbose = TRUE,
...
)
Input CellTypeData.
CellTypeData name.
Which species the gene names in exp come from.
See list_species for all available species.
Which species' gene names to convert exp to.
See list_species for all available species.
Species that the sct_data originally came from, regardless of its current gene format
(e.g. it was previously converted from mouse to human gene orthologs).
This is used for computing an appropriate background.
How to handle genes that don't have 1:1 mappings between input_species:output_species.
Options include:
"drop_both_species" or "dbs" or 1: Drop genes that have duplicate mappings in either the input_species or output_species (DEFAULT).
"drop_input_species" or "dis" or 2: Only drop genes that have duplicate mappings in the input_species.
"drop_output_species" or "dos" or 3: Only drop genes that have duplicate mappings in the output_species.
"keep_both_species" or "kbs" or 4: Keep all genes regardless of whether they have duplicate mappings in either species.
"keep_popular" or "kp" or 5: Return only the most "popular" interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.
"sum","mean","median","min" or "max": When gene_df is a matrix and gene_output="rownames", these options will aggregate many-to-one gene mappings (input_species-to-output_species) after dropping any duplicate genes in the output_species.
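The first four strategies above can be illustrated on a toy mapping table. This is a minimal base-R sketch, not the package's internal implementation: the gene names, the gene_map data.frame, and the is_dup helper are all hypothetical and exist only for illustration.

```r
# Toy ortholog map (hypothetical gene names, for illustration only).
gene_map <- data.frame(
  input_gene    = c("A1", "A2", "A3", "A3", "A4", "A5"),
  ortholog_gene = c("B1", "B2", "B3", "B4", "B5", "B5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for every value that occurs more than once in x.
is_dup <- function(x) x %in% x[duplicated(x)]

dup_in  <- is_dup(gene_map$input_gene)     # one-to-many: A3 -> B3, B4
dup_out <- is_dup(gene_map$ortholog_gene)  # many-to-one: A4, A5 -> B5

drop_both   <- gene_map[!dup_in & !dup_out, ]  # analogous to "drop_both_species"
drop_input  <- gene_map[!dup_in, ]             # analogous to "drop_input_species"
drop_output <- gene_map[!dup_out, ]            # analogous to "drop_output_species"
keep_both   <- gene_map                        # analogous to "keep_both_species"

# Only the strict 1:1 pairs survive the default strategy.
print(drop_both)
```

In this toy table only A1 and A2 are strict 1:1 pairs, so the default strategy is the most conservative of the four.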
R package to use for gene mapping:
"gprofiler": Slower but more species and genes.
"homologene": Faster but fewer species and genes.
"babelgene": Faster but fewer species and genes. Also gives consensus scores for each gene mapping based on several different data sources.
If FALSE, quantile computation is skipped when quantiles have already been computed.
Set force_new_quantiles=TRUE (the default shown in the usage above) to override this and generate new quantiles.
If ctd has already been standardised, whether to rerun standardisation anyway (Default: FALSE).
Remove any samples that have numeric column names.
Number of non-zero quantile bins.
Keep the column annotation data if provided.
Keep the dendrograms if provided.
Convert to sparse matrix.
Convert to DelayedArray.
Remove replace_chars from column names.
Rename each column with the prefix dataset.species.celltype.
Print messages.
Set verbose=2 if you want to print all messages from internal functions as well.
Arguments passed on to orthogene::convert_orthologs
gene_df
Data object containing the genes (see gene_input for options on how the genes can be stored within the object).
Can be one of the following formats:
matrix: A sparse or dense matrix.
data.frame: A data.frame, data.table, or tibble.
list: A list or character vector.
Genes, transcripts, proteins, SNPs, or genomic ranges can be provided in any format
(HGNC, Ensembl, RefSeq, UniProt, etc.) and will be automatically converted to gene symbols
unless specified otherwise with the ... arguments.
Note: If you set method="homologene", you must either supply genes in gene symbol format (e.g. "Sox2")
OR set standardise_genes=TRUE.
gene_input
Which aspect of gene_df to get gene names from:
"rownames": From row names of data.frame/matrix.
"colnames": From column names of data.frame/matrix.
<column name>: From a column in gene_df, e.g. "gene_names".
gene_output
How to return genes.
Options include:
"rownames": As row names of gene_df.
"colnames": As column names of gene_df.
"columns": As new columns "input_gene", "ortholog_gene" (and "input_gene_standard" if standardise_genes=TRUE) in gene_df.
"dict": As a dictionary (named list) where the names are input_gene and the values are ortholog_gene.
"dict_rev": As a reversed dictionary (named list) where the names are ortholog_gene and the values are input_gene.
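The "dict" and "dict_rev" shapes described above can be pictured with plain named lists. The sketch below follows that description literally in base R; it does not call orthogene, the two gene pairs are illustrative values only, and the exact class of the object the package returns is an assumption.

```r
# Illustrative mouse -> human symbol pairs (not produced by orthogene here).
input_gene    <- c("Sox2", "Gfap")
ortholog_gene <- c("SOX2", "GFAP")

# "dict": names are input genes, values are ortholog genes.
dict <- as.list(setNames(ortholog_gene, input_gene))

# "dict_rev": names are ortholog genes, values are input genes.
dict_rev <- as.list(setNames(input_gene, ortholog_gene))

# Look up an ortholog by its input gene name, and the reverse.
dict[["Sox2"]]
dict_rev[["GFAP"]]
```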
standardise_genes
If TRUE AND gene_output="columns", a new column "input_gene_standard" will be added to gene_df
containing standardised HGNC symbols identified by gorth.
drop_nonorths
Drop genes that don't have an ortholog in the output_species.
agg_fun
Aggregation function passed to aggregate_mapped_genes.
Set to NULL to skip aggregation step (default).
mthreshold
Maximum number of ortholog names per gene to show.
Passed to gorth.
Only used when method="gprofiler" (DEFAULT: Inf).
sort_rows
Sort gene_df rows alphanumerically.
gene_map
A data.frame that maps the current gene names to new gene names. This function's behaviour will adapt to different situations as follows:
gene_map=<data.frame>: When a data.frame containing the gene key:value columns (specified by input_col and output_col, respectively) is provided, this will be used to perform aggregation/expansion.
gene_map=NULL and input_species!=output_species: A gene_map is automatically generated by map_orthologs to perform inter-species gene aggregation/expansion.
gene_map=NULL and input_species==output_species: A gene_map is automatically generated by map_genes to perform within-species gene symbol standardization and aggregation/expansion.
input_col
Column name within gene_map with gene names matching the row names of X.
output_col
Column name within gene_map with gene names that you wish to map the row names of X onto.
Standardised CellTypeDataset.
ctd <- ewceData::ctd()
#> see ?ewceData and browseVignettes('ewceData') for documentation
#> loading from cache
ctd_std <- EWCE::standardise_ctd(
  ctd = ctd,
  input_species = "mouse",
  dataset = "Zeisel2016"
)
#> Standardising CellTypeDataset
#> Found 5 matrix types across 2 CTD levels.
#> Processing level: 1
#> Processing level: 2
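A variant of the example above that spells out several of the defaults from the usage section and silences messages may make the options easier to compare. This is a sketch only: it assumes EWCE and ewceData are installed and that the example data can be downloaded or read from cache, and the ctd_std2 name is introduced here for illustration. No output is shown because it was not run for this page.

```r
# Guarded so the block is a no-op when the packages are unavailable.
if (requireNamespace("EWCE", quietly = TRUE) &&
    requireNamespace("ewceData", quietly = TRUE)) {
  ctd <- ewceData::ctd()
  ctd_std2 <- EWCE::standardise_ctd(
    ctd = ctd,
    dataset = "Zeisel2016",
    input_species = "mouse",
    output_species = "human",               # default
    non121_strategy = "drop_both_species",  # default: keep strict 1:1 orthologs
    method = "homologene",                  # default mapping backend
    as_sparse = TRUE,                       # default: sparse matrices
    verbose = FALSE                         # suppress progress messages
  )
  # Number of CTD levels in the standardised object.
  length(ctd_std2)
}
```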