Module: CytofDR.dr

class CytofDR.dr.LinearMethods[source]

Bases: object

Linear DR Methods

This class contains static methods for a group of linear DR methods. The sklearn implementation is used when available. All keyword arguments are passed directly to the underlying method to allow for flexibility.

static FA(data, out_dims=2, **kwargs)[source]

Scikit-Learn Factor Analysis (FA)

This method uses scikit-learn’s FA implementation.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static ICA(data, out_dims=2, **kwargs)[source]

Scikit-Learn Independent Component Analysis (ICA)

This method uses scikit-learn’s FastICA implementation of ICA.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static NMF(data, out_dims=2, **kwargs)[source]

Scikit-Learn Nonnegative Matrix Factorization (NMF)

This method uses scikit-learn’s NMF implementation.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static PCA(data, out_dims=2, **kwargs)[source]

Scikit-Learn Principal Component Analysis (PCA)

This method uses scikit-learn’s standard PCA.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.

Return type:

ndarray

Returns:

The low-dimensional embedding.
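
As a rough illustration of what the returned embedding looks like, the sketch below projects centered data onto its top principal axes using plain NumPy. It is not CytofDR's code (the wrapper above delegates to scikit-learn); `pca_sketch` is a hypothetical helper, the data are synthetic, and component signs may differ from scikit-learn's output.

```python
import numpy as np

def pca_sketch(data: np.ndarray, out_dims: int = 2) -> np.ndarray:
    """Hypothetical helper: project centered data onto its top principal axes."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dims].T

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))  # 100 cells, 10 markers (synthetic)
embedding = pca_sketch(data, out_dims=2)
print(embedding.shape)  # (100, 2)
```

The output has one row per input row and `out_dims` columns, which is the shape every method in this module is documented to return.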

static ZIFA(data, out_dims=2, **kwargs)[source]

Zero-Inflated Factor Analysis (ZIFA)

This method implements ZIFA as developed by Pierson & Yau (2015).

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.

Return type:

ndarray

Returns:

The low-dimensional embedding.

class CytofDR.dr.NonLinearMethods[source]

Bases: object

NonLinear DR Methods.

This class contains static methods for a group of nonlinear DR methods, including the t-SNE implementations (open_tsne and sklearn_tsne).

static LLE(data, out_dims=2, transform=None, n_jobs=-1, **kwargs)[source]

Scikit-Learn Locally Linear Embedding (LLE)

This method is a wrapper for sklearn’s implementation of LLE.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
transform (Optional[ndarray]) – The array to transform with the trained model.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static MDS(data, out_dims=2, n_jobs=-1, **kwargs)[source]

Scikit-Learn Multi-Dimensional Scaling (MDS)

This method uses scikit-learn’s MDS implementation.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static SAUCIE(data, steps=1000, batch_size=256, **kwargs)[source]

This method is a wrapper for the SAUCIE package’s SAUCIE model, used here for dimension reduction only. All keyword arguments are passed into the SAUCIE.SAUCIE method. The training parameters steps and batch_size are exposed directly in this wrapper.

Parameters:

data (ndarray) – The input high-dimensional array.
steps (int) – The number of training steps to use, defaults to 1000.
batch_size (int) – The batch size for training, defaults to 256.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static UMAP(data, out_dims=2, n_jobs=-1, **kwargs)[source]

This method uses the UMAP package’s UMAP implementation.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static grandprix(data, out_dims=2, **kwargs)[source]

GrandPrix

This method is a wrapper for GrandPrix.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.

Returns:

The low-dimensional embedding.

static isomap(data, out_dims=2, transform=None, n_jobs=-1, **kwargs)[source]

Scikit-Learn Isomap

This method is a wrapper for sklearn’s implementation of Isomap.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
transform (Optional[ndarray]) – The array to transform with the trained model.
n_jobs (int) – The number of threads to use, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static kernelPCA(data, out_dims=2, kernel='poly', n_jobs=-1, **kwargs)[source]

Scikit-Learn Kernel PCA

This method is a wrapper for sklearn’s implementation of kernel PCA.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
kernel (str) – The kernel to use: “poly”, “linear”, “rbf”, “sigmoid”, or “cosine”, defaults to “poly”.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static open_tsne(data, out_dims=2, perp=30, learning_rate='auto', early_exaggeration_iter=250, early_exaggeration=12, max_iter=500, metric='euclidean', dof=1, theta=0.5, init='pca', negative_gradient_method='fft', n_jobs=-1)[source]

openTSNE implementation of FIt-SNE

This is the Python implementation of FIt-SNE through the openTSNE package. Its implementation is based on research from Linderman et al. (2019). This is the default recommended implementation. To allow for flexibility and avoid confusion, common parameters are directly exposed without allowing additional keyword arguments.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
perp (Union[List[int], int]) – Perplexity, traditionally between 30 and 50. Multiple perplexities can be supplied as a list. Defaults to 30.
learning_rate (Union[str, float]) – The learning rate used during gradient descent, defaults to “auto”.
early_exaggeration_iter (int) – Number of early exaggeration iterations, defaults to 250.
early_exaggeration (float) – Early exaggeration factor, defaults to 12.
max_iter (int) – Maximum number of iterations to optimize, defaults to 500.
metric (str) – The distance metric to use, defaults to “euclidean”.
dof (int) – T-distribution degrees of freedom, defaults to 1.
theta (float) – The speed/accuracy trade-off, defaults to 0.5.
init (Union[ndarray, str]) – Method of initialization: “random”, “pca”, “spectral”, or an array, defaults to “pca”.
negative_gradient_method (str) – Whether to use “bh” or “fft” tSNE, defaults to “fft”.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static phate(data, out_dims=2, n_jobs=-1, **kwargs)[source]

PHATE

This method is a wrapper for PHATE.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Returns:

The low-dimensional embedding.

static sklearn_tsne(data, out_dims=2, n_jobs=-1, **kwargs)[source]

Scikit-Learn t-SNE

This method uses the Scikit-learn implementation of t-SNE. It supports both traditional and BH t-SNE with more control of variables.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Return type:

ndarray

Returns:

The low-dimensional embedding.

static spectral(data, out_dims=2, n_jobs=-1, **kwargs)[source]

Scikit-Learn Spectral Embedding

This method is a wrapper for sklearn’s implementation of spectral embedding.

Parameters:

data (ndarray) – The input high-dimensional array.
out_dims (int) – The number of dimensions of the output, defaults to 2.
n_jobs (int) – The number of jobs to run concurrently, defaults to -1.

Returns:

The low-dimensional embedding.

class CytofDR.dr.Reductions(reductions=None)[source]

Bases: object

A class for reductions and their evaluation.

This class is a convenient data class for storing and evaluating reductions.

Parameters:

reductions (Optional[Dict[str, ndarray]]) – A dictionary of reductions as indexed by their names.

Attributes:

reductions: A dictionary of reductions as indexed by names.
names: The names of the reductions.
original_data: The original space data before DR.
original_labels: Clusterings based on original space data.
original_cell_types: Cell types based on original space data.
embedding_data: The embedding space reduction.
embedding_labels: Clusterings based on embedding space reduction.
embedding_cell_types: Cell types based on embedding space reduction.
comparison_data: The comparison data (matched with original data in some way) for concordance analysis.
comparison_cell_types: Cell types based on comparison data.
comparison_classes: Common cell types between embedding and comparison data.
evaluations: The DR evaluation results, which are generated by the default evaluation method.
custom_evaluations: The custom DR evaluation results, which are added by the add_custom_evaluation_result() method.
custom_evaluation_weights: The weights for each metric in the custom_evaluations. They do not distinguish between same metrics on different reductions.
custom_evaluation_reverse_ranking: Whether each metric in the custom_evaluations should be reverse ranked (i.e. smaller is better).

add_custom_evaluation_result(metric_name, reduction_name, value, weight=None, reverse_ranking=False)[source]

Add custom evaluation result.

This method allows you to work with custom evaluation metrics. Instead of relying on the built-in schemes, you can add any metrics you would like, including metrics not included in the CytofDR package. The downside is that you have to run these metrics manually in your workflow. This method simply stores the results in the object so that the DR methods can be ranked automatically.

Note

This method does not distinguish between a metric’s weight and reverse_ranking properties across different embeddings. For example, if you add the same metric for umap and tsne, both entries should have the same weight and reverse_ranking properties, with the only difference being value. Otherwise, later properties will overwrite earlier ones.

Parameters:

metric_name (str) – The name of the custom metric.
reduction_name (str) – The name of the reduction on which the metric is run.
value (float) – The value of the evaluation metric.
weight (Optional[float], optional) – The weight of the metric in the overall DR ranking scheme, defaults to None
reverse_ranking (bool, optional) – Whether it should be reverse ranked, defaults to False. If True, this means that smaller values are better.
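
The overwrite behavior described in the note can be sketched as follows. The three dictionaries are hypothetical stand-ins named after the documented attributes, and `add_result` is an illustrative helper; the actual internal layout of Reductions may differ.

```python
from typing import Dict, Optional

# Hypothetical stand-ins for the three custom_* attributes; the real
# internal layout of Reductions may differ.
custom_evaluations: Dict[str, Dict[str, float]] = {}
custom_evaluation_weights: Dict[str, Optional[float]] = {}
custom_evaluation_reverse_ranking: Dict[str, bool] = {}

def add_result(metric_name, reduction_name, value, weight=None, reverse_ranking=False):
    # Values are kept per metric and per reduction ...
    custom_evaluations.setdefault(metric_name, {})[reduction_name] = value
    # ... but weight and reverse_ranking are kept per metric only,
    # so the most recent call wins.
    custom_evaluation_weights[metric_name] = weight
    custom_evaluation_reverse_ranking[metric_name] = reverse_ranking

add_result("stress", "umap", 0.21, weight=0.5, reverse_ranking=True)
add_result("stress", "tsne", 0.34, weight=0.8, reverse_ranking=True)

print(custom_evaluations["stress"])         # both values are kept
print(custom_evaluation_weights["stress"])  # 0.8: the later weight replaced 0.5
```

Passing identical weight and reverse_ranking values for every reduction of a given metric avoids this kind of silent overwrite.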

add_evaluation_metadata(original_data=None, original_labels=None, original_cell_types=None, embedding_labels=None, embedding_cell_types=None, comparison_data=None, comparison_cell_types=None, comparison_classes=None)[source]

Add supporting metadata for DR evaluation.

This method allows you to add metadata for use in DR evaluation. Existing metadata is overwritten only for the arguments that are explicitly provided.

Parameters:

original_data (Optional[ndarray]) – The original space data before DR.
original_labels (Optional[ndarray]) – Clusterings based on original space data.
original_cell_types (Optional[ndarray]) – Cell types based on original space data.
embedding_data – The embedding space reduction.
embedding_labels (Optional[Dict[str, ndarray]]) – Clusterings based on embedding space reduction.
embedding_cell_types (Optional[Dict[str, ndarray]]) – Cell types based on embedding space reduction.
comparison_data (Optional[ndarray]) – The comparison data (matched with original data in some way) for concordance analysis.
comparison_cell_types (Optional[ndarray]) – Cell types based on comparison data.
comparison_classes (Union[str, List[str], None]) – Common cell types between embedding and comparison data.

add_reduction(reduction, name, replace=False)[source]

Add a reduction embedding.

This method allows users to add additional embeddings.

Parameters:

reduction (ndarray) – The reduction array.
name (str) – The name of the reduction.
replace (bool) – Whether to replace an existing reduction with the same name, defaults to False.

Raises:

ValueError – Reduction already exists but users choose not to replace the original.
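
A minimal sketch of the replace guard described above, using a plain dictionary as a hypothetical stand-in for the reductions attribute. `add_reduction_sketch` and the error message wording are illustrative, not the package's implementation.

```python
import numpy as np

# Hypothetical stand-in for Reductions.reductions.
reductions = {"pca": np.zeros((5, 2))}

def add_reduction_sketch(reduction, name, replace=False):
    # Refuse to silently overwrite an existing embedding.
    if name in reductions and not replace:
        raise ValueError(f"Reduction '{name}' already exists; pass replace=True to overwrite.")
    reductions[name] = reduction

add_reduction_sketch(np.ones((5, 2)), "umap")   # new name: stored
try:
    add_reduction_sketch(np.ones((5, 2)), "pca")  # existing name, replace=False
except ValueError as err:
    print(err)
add_reduction_sketch(np.ones((5, 2)), "pca", replace=True)  # explicit overwrite
```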

cluster(n_clusters, cluster_data=True, cluster_embedding=True, **kwargs)[source]

Cluster original_data and reductions.

This provides a convenient method to cluster original_data and embeddings using the KMeans method. The number of clusters must be manually specified.

Parameters:

n_clusters (int) – The number of clusters
cluster_data (bool) – Whether to cluster original data, defaults to True
cluster_embedding (bool) – Whether to cluster embeddings, defaults to True
kwargs – Keyword only arguments passed into sklearn.cluster.KMeans.

evaluate(category, pwd_metric='PCD', k_neighbors=5, annoy_original_data_path=None, auto_cluster=True, n_clusters=20, verbose=True, pairwise_downsample_size=10000, normalize_pwd=None, NPE_method='L1')[source]

Evaluate DR Methods Using Default DR Evaluation Scheme.

This method evaluates the DR methods on any of the four default categories: global, local, downstream, or concordance. The results can then be ranked with the rank_dr_methods method.

Parameters:

category (Union[str, List[str]]) – The major evaluation category: global, local, downstream, or concordance.
pwd_metric (str) – The pairwise distance metric. The three options are “PCD”, “pairwise_downsample”, and “pairwise”. PCD refers to Point Cluster Distance as implemented in this package; pairwise is the traditional pairwise distance; pairwise_downsample is pairwise distance with downsampling for both data and embedding. For large datasets, PCD and pairwise_downsample are recommended, and in practice they perform equivalently. Defaults to “PCD”.
k_neighbors (int) – The number of neighbors to use for local metrics. Defaults to 5.
annoy_original_data_path (Optional[str]) – The file path to an ANNOY object for original data. Optional.
auto_cluster (bool) – Whether to automatically perform clustering for evaluation purposes. This option has no effect when the original_labels and embedding_labels are previously added with the add_evaluation_metadata method. Defaults to True.
n_clusters (int) – The number of clusters for the auto_cluster option. Defaults to 20.
verbose (bool) – Whether to print out progress. Defaults to True.
pairwise_downsample_size (int) – The downsample size if pairwise_downsample is chosen for pwd_metric. If this is larger than the sample size of the original dataset, this method falls back to the pairwise option for pwd_metric. On a typical machine, going beyond the default is not recommended, and for datasets smaller than 10,000, downsampling is not strictly necessary. Defaults to 10,000.
normalize_pwd (Optional[str]) – Whether to apply min-max normalization to the pairwise distances for EMD. If needed, enter “minmax”. If None, the raw distances are used. For more details, see EvaluationMetrics.EMD. This only matters when Global is chosen in the category parameter. Defaults to None.
NPE_method (str) – The distance measure used for the NPE metric as part of the Local category. “L1” for L1-norm or “tvd” for Total Variation Distance. Defaults to “L1”.
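
The pairwise_downsample option and its fallback can be pictured with the sketch below. It assumes Euclidean distances and a shared random subsample of rows for data and embedding; both helper functions and the `seed` argument are hypothetical, and the package's own sampling strategy may differ.

```python
import numpy as np

def pairwise_distances_sketch(x):
    # Euclidean distances between all row pairs, upper triangle only.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(x), k=1)]

def downsampled_pairwise(data, embedding, size, seed=0):
    n = data.shape[0]
    if size >= n:
        idx = np.arange(n)  # fall back to plain pairwise on every point
    else:
        # Use the same rows for data and embedding so distances stay comparable.
        idx = np.random.default_rng(seed).choice(n, size=size, replace=False)
    return pairwise_distances_sketch(data[idx]), pairwise_distances_sketch(embedding[idx])

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 5))
embedding = data[:, :2]  # stand-in embedding for illustration only
d_small, e_small = downsampled_pairwise(data, embedding, size=10)
d_full, e_full = downsampled_pairwise(data, embedding, size=100)
print(d_small.shape, d_full.shape)  # (45,) (1225,)
```

The number of pairs grows quadratically with the sample size, which is why downsampling matters for large datasets.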

Raises:

ValueError – No reductions to evaluate.
ValueError – Unsupported ‘pwd_metric’: ‘PCD’, ‘Pairwise’, or ‘pairwise_downsample’ only.
ValueError – Evaluation needs ‘original_data’, ‘original_labels’, and ‘embedding_labels’ attributes.

Note

This method requires add_evaluation_metadata to run first. original_cell_types and embedding_cell_types are optional for the downstream category. For concordance, if you wish to use clustering results for embedding and comparison files, set the appropriate clusterings to embedding_cell_types and comparison_cell_types.

New in version 0.2.0: The pairwise_downsample option for pwd_metric; the pairwise_downsample_size parameter for when pairwise_downsample is chosen.

New in version 0.3.0: The normalize_pwd parameter. It was added to allow for normalization in the EMD metric for Global. This should be used when there are scale differences between DR embeddings. For this specific update, only “minmax” is supported.

New in version 0.3.0: The NPE_method parameter. It was added to allow for alternative implementations of NPE’s distance measure. Originally, only “L1” was implemented, and it is still the default. In this version, “tvd” is also an option.
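
To make the two 0.3.0 options concrete, the sketch below shows min-max normalization and the two distance measures named by NPE_method on toy inputs. The helper names and the example vectors are hypothetical; how CytofDR builds the distributions that NPE compares is not shown here.

```python
import numpy as np

def l1_distance(p, q):
    return float(np.abs(p - q).sum())

def total_variation_distance(p, q):
    # For two discrete probability distributions, TVD is half the L1 distance.
    return 0.5 * l1_distance(p, q)

def minmax(x):
    # Rescale to [0, 1]; assumes x is not constant.
    return (x - x.min()) / (x.max() - x.min())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(l1_distance(p, q))               # approximately 0.6
print(total_variation_distance(p, q))  # approximately 0.3

# Two embeddings whose distances differ only by scale become comparable.
dist_a = np.array([1.0, 2.0, 3.0])
dist_b = dist_a * 100
print(np.allclose(minmax(dist_a), minmax(dist_b)))  # True
```

Because TVD is a fixed rescaling of L1 for probability vectors, the choice changes the magnitude of the metric but not the ordering of methods on a single dataset.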

get_reduction(name)[source]

Retrieve a reduction by name.

This method allows users to retrieve a reduction by name. This is equivalent to running self.reductions[name].

Parameters:

name (str) – The name of the reduction.

plot_reduction(name, save_path, style='darkgrid', hue=None, **kwargs)[source]

Draw embedding using a scatter plot.

This method generates a scatter plot for reductions in the class.

Parameters:

name (str) – The name of the reduction.
save_path (str) – The path to save the plot.
style (str) – The plot style, defaults to “darkgrid”.
hue (Optional[ndarray]) – Labels used to color the points.
kwargs – Keyword arguments passed into the sns.scatterplot method.

Note

Live viewing is not supported by this method.

rank_dr_methods(tie_method='max')[source]

Rank DR Methods Using Default DR Evaluation.

Based on the results from the evaluate method, this method ranks the DR methods based on the categories chosen. All weighting schemes are consistent with the paper. Custom evaluation and weighting schemes are not supported in this case.

Parameters:

tie_method (str, optional) – The method to deal with ties when ranking, defaults to “max”.

Returns:

A dictionary of DR methods and their final weighted ranks.

rank_dr_methods_custom(tie_method='max')[source]

Rank DR methods according to custom evaluation metrics.

This is the custom version of the rank_dr_methods method. It ranks all the DR methods using the metrics added through the add_custom_evaluation_result method and their corresponding weights.

Parameters:

tie_method (str, optional) – The method to deal with ties when ranking, defaults to “max”.

Raises:

RuntimeError – Missing metrics for certain DR methods.

Return type:

Dict[str, float]

Returns:

A dictionary of methods with their weighted rankings.
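
One way to picture weighted ranking with “max” tie handling is the sketch below. It rests on several assumptions that the documentation does not confirm: better values receive larger ranks, ranks are combined as a weighted average, and a missing metric raises RuntimeError. `rank_max` and `weighted_rank` are hypothetical helpers, not CytofDR functions.

```python
from typing import Dict

def rank_max(values: Dict[str, float], reverse: bool = False) -> Dict[str, int]:
    """Rank so that better values get larger ranks; ties share the largest rank."""
    # With reverse=True, smaller values are better, so flip the sign first.
    keyed = {k: (-v if reverse else v) for k, v in values.items()}
    # "max" tie handling: a value's rank is the count of values <= it.
    return {k: sum(1 for other in keyed.values() if other <= v) for k, v in keyed.items()}

def weighted_rank(metrics, weights, reverse):
    methods = list(next(iter(metrics.values())))
    total_weight = sum(weights.values())
    final = {m: 0.0 for m in methods}
    for name, values in metrics.items():
        if set(values) != set(methods):
            raise RuntimeError(f"Metric '{name}' is missing for some DR methods.")
        ranks = rank_max(values, reverse[name])
        for m in methods:
            final[m] += weights[name] * ranks[m] / total_weight
    return final

metrics = {
    "accuracy": {"umap": 0.9, "tsne": 0.9, "pca": 0.7},  # larger is better
    "stress": {"umap": 0.2, "tsne": 0.3, "pca": 0.5},    # smaller is better
}
weights = {"accuracy": 0.5, "stress": 0.5}
reverse = {"accuracy": False, "stress": True}
final = weighted_rank(metrics, weights, reverse)
print(final)  # {'umap': 3.0, 'tsne': 2.5, 'pca': 1.0}
```

Here umap and tsne tie on accuracy, so both receive rank 3 under “max”, and the reverse-ranked stress metric breaks the tie.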

property reductions: Dict[str, ndarray]

Getter for the reductions dictionary.

This method returns the dictionary storing all the reductions in this class. The dictionary keys are names of the embeddings, and its values are the arrays of reductions.

Returns:

A dictionary of reductions.

save_all_reductions(save_dir, overwrite=False, delimiter='\t', **kwargs)[source]

Save all DR embeddings to a specified directory.

This method saves all reductions to a specified directory. All files are plain text files named after their embeddings. All additional arguments are passed to np.savetxt.

Parameters:

save_dir (str) – The directory path to save all the embeddings.
overwrite (bool, optional) – Whether to overwrite existing files, defaults to False
delimiter (str, optional) – The delimiter used, defaults to “\t” (tab).
kwargs – Keyword-only arguments passed to the np.savetxt method.
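
The saving behavior can be approximated with np.savetxt directly, as in the sketch below. The `.txt` extension, the FileExistsError on existing files, and the `save_all_sketch` helper are assumptions made for illustration; only the use of np.savetxt and the tab delimiter come from the description above.

```python
import os
import tempfile
import numpy as np

def save_all_sketch(reductions, save_dir, overwrite=False, delimiter="\t", **kwargs):
    for name, embedding in reductions.items():
        path = os.path.join(save_dir, f"{name}.txt")  # file extension is an assumption
        if os.path.exists(path) and not overwrite:
            raise FileExistsError(path)
        # Extra keyword arguments (e.g. fmt) go straight to np.savetxt.
        np.savetxt(path, embedding, delimiter=delimiter, **kwargs)

reductions = {"pca": np.arange(6.0).reshape(3, 2), "umap": np.ones((3, 2))}
with tempfile.TemporaryDirectory() as tmp:
    save_all_sketch(reductions, tmp, fmt="%.3f")
    saved = sorted(os.listdir(tmp))
    reloaded = np.loadtxt(os.path.join(tmp, "pca.txt"), delimiter="\t")
print(saved)  # ['pca.txt', 'umap.txt']
```

Files written this way can be read back with np.loadtxt using the same delimiter.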

save_evaluations(path, overwrite=False)[source]

Save DR evaluation results.

This method saves all the results of DR evaluation in a csv.

Parameters:

path (str) – The path to the file.
overwrite (bool, optional) – Whether to overwrite existing file, defaults to False

Raises:

AttributeError – There is no evaluations attribute. Need to run evaluate first.
FileExistsError – The file already exists and overwrite is set to false.

save_reduction(name, path, overwrite=False, delimiter='\t', **kwargs)[source]

Save a DR embedding to disk.

This method saves a specific reduction to disk. The file is presumed to be a plain text file. All additional arguments are passed to np.savetxt.

Parameters:

name (str) – The name of the embedding to be saved as the key stored in the reductions dictionary.
path (str) – The path to save the embedding to.
overwrite (bool, optional) – Whether to overwrite existing files, defaults to False
delimiter (str, optional) – The delimiter used, defaults to “\t” (tab).

CytofDR.dr.run_dr_methods(data, methods='all', out_dims=2, transform=None, n_jobs=-1, verbose=True, suppress_error_msg=False)[source]

Run dimension reduction methods.

This is a one-size-fits-all dispatcher that runs all supported methods in the module. It supports running multiple methods in one call, at the cost of more granular control over parameters. If you would like more customization, run each method individually instead.

Parameters:

data (ndarray) – The input high-dimensional array.
methods (Union[str, List[str]]) – DR methods to run (not case sensitive), defaults to “all”.
out_dims (int) – Output dimension of DR, defaults to 2.
transform (Optional[ndarray]) – An array to transform after training on the training set.
n_jobs (int) – The number of jobs to run when applicable, defaults to -1.
verbose (bool) – Whether to print out progress, defaults to True.
suppress_error_msg (bool) – Whether to suppress error message printouts, defaults to False.

Return type:

Reductions

Returns:

A Reductions object with all dimension reductions.
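
The dispatcher pattern can be sketched roughly as below. Everything here is hypothetical: the `REGISTRY` with two toy methods, the plain dictionary returned in place of a Reductions object, and the assumption that a failing method is skipped rather than aborting the run.

```python
import numpy as np

def _pca(data, out_dims):
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dims].T

def _broken(data, out_dims):
    raise RuntimeError("optional dependency not installed")

# Hypothetical registry; the real dispatcher covers the methods documented above.
REGISTRY = {"pca": _pca, "broken": _broken}

def run_methods_sketch(data, methods="all", out_dims=2, verbose=True, suppress_error_msg=False):
    if methods == "all":
        names = list(REGISTRY)
    else:
        names = [methods] if isinstance(methods, str) else list(methods)
    results = {}
    for name in (n.lower() for n in names):  # method names are not case sensitive
        try:
            if verbose:
                print(f"Running {name}")
            results[name] = REGISTRY[name](data, out_dims)
        except Exception as err:
            # Assumption: one failing method should not abort the whole batch.
            if not suppress_error_msg:
                print(f"{name} failed: {err}")
    return results

data = np.random.default_rng(0).normal(size=(50, 8))
results = run_methods_sketch(data, methods=["PCA", "Broken"], verbose=False, suppress_error_msg=True)
print(list(results))  # ['pca']
```

The trade-off matches the description above: a single call covers many methods, but per-method parameters are not exposed.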