Module: CytofDR.evaluation

class CytofDR.evaluation.Annoy[source]

Bases: object

static build_annoy(data, metric='angular', n_trees=10)[source]

Build AnnoyIndex object from data.

Parameters:

data (ndarray) – The data array
metric (str) – The distance metric to use, defaults to “angular”
n_trees (int) – The number of trees, defaults to 10

Return type:

Annoy

Returns:

An AnnoyIndex object.

static load_annoy(path, ncol, metric='angular')[source]

Load AnnoyIndex object from disk.

This loads an AnnoyIndex object saved using this class or Annoy’s built-in IO function.

Parameters:

path (str) – The path to the object.
ncol (int) – The number of columns.
metric (str) – The distance metric to use, defaults to “angular”

Return type:

Annoy

Returns:

The loaded AnnoyIndex object.

static save_annoy(model, path)[source]

Save AnnoyIndex object to disk.

This saves an AnnoyIndex object to a specified path.

Parameters:

model (Annoy) – An AnnoyIndex object to be saved.
path (str) – The path to the object.

Returns:

The loaded AnnoyIndex object.

class CytofDR.evaluation.EvaluationMetrics[source]

Bases: object

Evaluation metrics for dimension reduction

This class contains methods to run evaluation metrics.

static ARI(x_labels, y_labels)[source]

Adjusted Rand Index (ARI)

The ARI uses the labels from the original space and the embedding space to measure the similarity between them using pairs. It is used in Xiang et al. (2021).

Parameters:

x_labels (ndarray) – The first set of labels.
y_labels (ndarray) – The second set of labels on the same data.

Return type:

float

Returns:

ARI.
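As a rough illustration of the pair-counting idea, the sketch below computes ARI from a contingency table with NumPy. The helper name `ari_sketch` is hypothetical and this is not the library's code; holding the pair counts as floats is one possible way to sidestep integer overflow on large inputs, and whether it matches the fix mentioned in the references is an assumption.

```python
import numpy as np

def ari_sketch(x_labels, y_labels):
    """Pair-counting ARI (hypothetical helper, not CytofDR's code)."""
    _, x_idx = np.unique(x_labels, return_inverse=True)
    _, y_idx = np.unique(y_labels, return_inverse=True)
    # Contingency table: number of points in each (x class, y class) cell.
    table = np.zeros((x_idx.max() + 1, y_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (x_idx, y_idx), 1)

    def pairs(counts):
        # "n choose 2" computed in floating point (assumption: avoids overflow).
        counts = counts.astype(np.float64)
        return counts * (counts - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    total = pairs(np.array([table.sum()]))[0]
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 2])  # same partition, different label names
c = np.array([0, 1, 0, 1, 0, 1])  # partition that cuts across every class of a
print(ari_sketch(a, b), ari_sketch(a, c))
```

Identical partitions score 1 regardless of how the classes are named, while partitions that agree less than chance would predict score below 0.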

References:

This implementation is adapted from sklearn’s implementation of ARI, with a fix for an overflow issue.

@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
        and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
        and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
        Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}

License:

BSD 3-Clause License

Copyright (c) 2007-2021 The scikit-learn developers.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

static EMD(x, y, normalization=None)[source]

Earth Mover’s Distance (EMD)

This metric computes the EMD between the pairwise distances of points in the high- and low-dimensional spaces. This implementation uses scipy.stats.wasserstein_distance. The usage of EMD is proposed in Heiser & Lau (2020).

Parameters:

x (ndarray) – The first distribution x as a 1D array.
y (ndarray) – The second distribution y as a 1D array.
normalization (Optional[str]) – The normalization to apply to x and y. The only accepted value is minmax, which performs min-max normalization. If None, no normalization is performed. Defaults to None.

Raises:

ValueError – Unsupported normalization method.

Return type:

float

Returns:

Earth mover’s distance.

New in version 0.3.0: The normalization parameter. It was added to allow min-max normalization of x and y using minmax. This is useful when x and y are empirical observations that are on different scales, and the scale itself is not of interest.
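To show why normalization matters, here is a minimal NumPy sketch, not the library's implementation (which relies on scipy.stats.wasserstein_distance). It assumes x and y have the same length, in which case the 1D EMD reduces to the mean absolute difference of the sorted values. The helper names `minmax` and `emd_equal_size` are hypothetical.

```python
import numpy as np

def minmax(v):
    """Rescale a 1D array to the [0, 1] range."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

def emd_equal_size(x, y, normalization=None):
    """1D EMD for equally sized samples (sketch under that assumption)."""
    if normalization == "minmax":
        x, y = minmax(x), minmax(y)
    elif normalization is not None:
        raise ValueError("Unsupported normalization method.")
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # With equal sample sizes, optimal transport pairs the sorted values.
    return float(np.mean(np.abs(x - y)))

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 10.0, 20.0])  # same shape as x, ten times the scale
raw = emd_equal_size(x, y)
scaled = emd_equal_size(x, y, normalization="minmax")
print(raw, scaled)  # 9.0 0.0
```

Without normalization the distance only reflects the difference in scale; after min-max normalization the two distributions are identical.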

static KNN(data_neighbors, embedding_neighbors)[source]

K-Nearest Neighbors Preservation (KNN)

The KNN metric computes the percentage of each point’s k-nearest neighbors that are preserved in the embedding space, averaged across the entire dataset.

Note

This method is not used to calculate KNN itself.

Parameters:

data_neighbors (ndarray) – A nearest-neighbor array of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor array of the embedding.

Return type:

float

Returns:

K-nearest neighbors preservation.
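A minimal sketch of the preservation calculation, assuming each row of the two arrays holds the neighbor indices of one point and both arrays use the same k. The function name `knn_preservation` is hypothetical and the library may compute this differently.

```python
import numpy as np

def knn_preservation(data_neighbors, embedding_neighbors):
    """Average fraction of original-space neighbors kept in the embedding."""
    fractions = [
        np.intersect1d(d, e).size / d.size
        for d, e in zip(data_neighbors, embedding_neighbors)
    ]
    return float(np.mean(fractions))

# Hypothetical neighbor-index arrays for four points with k = 2.
data_nn = np.array([[1, 2], [0, 2], [0, 1], [2, 1]])
emb_nn = np.array([[1, 3], [0, 2], [3, 1], [2, 0]])
score = knn_preservation(data_nn, emb_nn)
print(score)  # 0.625
```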

static NMI(x_labels, y_labels)[source]

Normalized Mutual Information (NMI)

The NMI metric computes the mutual information between labels of the original space and the embedding space and then normalizes it by the larger entropy of the two label vectors. This metric is a measure of clustering performance before and after dimension reduction, and it is used in Xiang et al. (2021).

Parameters:

x_labels (ndarray) – The first set of labels.
y_labels (ndarray) – The second set of labels on the same data.

Return type:

float

Returns:

Normalized mutual information.
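The normalization by the larger entropy can be sketched as follows. This is an illustration only: the helper names are hypothetical, natural logarithms are an assumption, and the mutual information is obtained from the identity I(X; Y) = H(X) + H(Y) - H(X, Y).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def nmi_max(x_labels, y_labels):
    """Mutual information divided by the larger of the two entropies."""
    _, x_idx = np.unique(x_labels, return_inverse=True)
    _, y_idx = np.unique(y_labels, return_inverse=True)
    # Encode each (x, y) label pair as one integer to get the joint entropy.
    joint = x_idx * (y_idx.max() + 1) + y_idx
    h_x, h_y, h_xy = entropy(x_idx), entropy(y_idx), entropy(joint)
    denom = max(h_x, h_y)
    return 1.0 if denom == 0 else (h_x + h_y - h_xy) / denom

a = np.array([0, 0, 1, 1])
b = np.array([1, 1, 0, 0])  # same clustering, renamed classes
c = np.array([0, 1, 0, 1])  # clustering independent of a
print(nmi_max(a, b), nmi_max(a, c))
```

The first pair gives a value of 1 and the second a value of 0, up to floating-point error.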

static NPE(data_neighbors, embedding_neighbors, labels, method='L1')[source]

Neighborhood Proportion Error (NPE)

The NPE metric is proposed by Konstorum et al. (2019). It measures the total variation distance between the proportion of nearest points belonging to the same class of each point in the HD and LD space. The lower the NPE, the more similar the embedding and the original data are.

To further elaborate on the difference between the L1 norm and TVD, the NPE metric involves the calculation of \(\delta (P, Q)\), where \(\delta\) is a distance measure on \(P\) and \(Q\). The “L1” option computes

\[\sum_i |P_i-Q_i|\]

whereas the TVD computes

\[\sup_{a\in [0,1]} |P(a) - Q(a)| \, .\]

There is some debate over which implementation should be used to calculate NPE, but TVD should align more closely with the original authors’ implementation.

Parameters:

data_neighbors (ndarray) – A nearest-neighbor array of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor array of the embedding.
labels (ndarray) – The class labels of each observation.
method (str) – The distance measure used for computing the distance between the neighborhood-proportion vectors. “L1” for L1-norm or “tvd” for Total Variation Distance. The latter is likely the intended implementation by Konstorum et al. Defaults to “L1”.

Raises:

ValueError – Unsupported method provided. We only support “L1” or “tvd”.

Return type:

float

Returns:

Neighborhood proportion error.

New in version 0.3.0: The method parameter. It was added to allow for alternative implementations. Originally, only “L1” was implemented, and it is still the default. In this version, “tvd” is also an option.
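The ingredients of the “L1” option can be sketched as below. This is only an illustration under stated assumptions: P and Q are taken to be the per-point same-class neighbor proportions in the original and embedding spaces, and no per-class aggregation or rescaling is applied, which the library may well do. The helper name `same_class_proportion` is hypothetical.

```python
import numpy as np

def same_class_proportion(neighbors, labels):
    """Fraction of each point's neighbors that share the point's label."""
    labels = np.asarray(labels)
    return np.mean(labels[neighbors] == labels[:, None], axis=1)

labels = np.array([0, 0, 1, 1])
# Hypothetical neighbor-index arrays (k = 2) for the two spaces.
data_nn = np.array([[1, 2], [0, 3], [3, 0], [2, 1]])
emb_nn = np.array([[1, 2], [2, 3], [3, 0], [0, 1]])

P = same_class_proportion(data_nn, labels)
Q = same_class_proportion(emb_nn, labels)
l1 = float(np.sum(np.abs(P - Q)))  # the sum of |P_i - Q_i|
print(P, Q, l1)
```

Here two of the four points lose their only same-class neighbor in the embedding, so the L1 distance is 1.0.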

static build_annoy(data, saved_annoy_path=None, k=5)[source]

Build ANNOY and return nearest neighbors.

This is a utility function for building ANNOY models and returning the nearest-neighbor matrices for original space data and low-dimensional embedding.

Parameters:

data – The input high-dimensional array.
saved_annoy_path (Optional[str]) – The path to pre-built ANNOY model for original data.
k (int) – The number of neighbors.

Return type:

ndarray

Returns:

Nearest-neighbor matrices of original space data.
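To make the shape of a nearest-neighbor matrix concrete, the sketch below builds one by exact brute-force search with NumPy. It is a stand-in, not the ANNOY-based method: ANNOY is approximate, and the Euclidean metric and the exclusion of each point from its own neighbor list are assumptions made here. The function name `exact_neighbors` is hypothetical.

```python
import numpy as np

def exact_neighbors(data, k=5):
    """Indices of the k nearest neighbors of every row (brute force)."""
    data = np.asarray(data, dtype=float)
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # assumption: a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
neighbors = exact_neighbors(points, k=2)
print(neighbors.shape)  # (4, 2): one row per observation, one column per neighbor
```

Arrays of this form are what the neighbor-based metrics in this class take as data_neighbors and embedding_neighbors. Brute force needs memory quadratic in the number of observations, which is the cost an approximate index avoids.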

static calinski_harabasz(embedding, labels)[source]

Calinski-Harabasz Index

This metric computes the Calinski-Harabasz index of clusters in the embedding space. Ideally, clusters should be coherent, and using labels obtained from the original space can evaluate the effectiveness of the embedding technique.

Parameters:

embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.

Return type:

float

Returns:

Calinski-Harabasz Index.
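For reference, the standard definition of the index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The NumPy sketch below follows that definition; the function name is hypothetical and the code is not taken from the library.

```python
import numpy as np

def calinski_harabasz_sketch(embedding, labels):
    """Between-cluster over within-cluster dispersion, scaled by dof."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = embedding.shape[0], classes.size
    overall = embedding.mean(axis=0)
    between = within = 0.0
    for c in classes:
        members = embedding[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall) ** 2)
        within += np.sum((members - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))

emb = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
lab = np.array([0, 0, 1, 1])
ch = calinski_harabasz_sketch(emb, lab)
print(ch)  # 50.0
```

Higher values indicate clusters that are compact relative to how far apart they sit.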

static correlation(x, y, metric='Pearson')[source]

Calculate Correlation Coefficient

This method computes the Pearson or Spearman correlation between the inputs.

Parameters:

x (ndarray) – The first 1D array.
y (ndarray) – The second 1D array.
metric (str) – The metric to use. ‘Pearson’ or ‘Spearman’, defaults to “Pearson”.

Return type:

float

Returns:

Correlation coefficient.
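The difference between the two options can be seen in a small NumPy sketch. The helper names are hypothetical, and the rank computation assumes there are no tied values, which a full Spearman implementation would handle.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two 1D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(v):
    """Ranks via double argsort (valid only without ties)."""
    return np.argsort(np.argsort(v))

def spearman_no_ties(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x ** 2  # monotone in x, but not linear
print(pearson(x, y), spearman_no_ties(x, y))
```

Because y increases monotonically with x, the Spearman coefficient is 1, while the Pearson coefficient falls slightly below 1 since the relationship is not linear.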

static davies_bouldin(embedding, labels)[source]

Davies-Bouldin Index

This metric computes the Davies-Bouldin index of clusters in the embedding space. Ideally, clusters should be coherent, and using labels obtained from the original space can evaluate the effectiveness of the embedding technique.

Parameters:

embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.

Return type:

float

Returns:

Davies-Bouldin Index.

static embedding_concordance(embedding, labels_embedding, comparison_file, comparison_labels, comparison_classes=None, method='emd')[source]

Concordance between two embeddings.

This is a wrapper function to implement two embedding concordance metrics based on named clusters: EMD and Cluster Distance. When two embeddings can be reasonably aligned based on clusters or manual labels, these two metrics calculate the relationships between clusters and their distances between two embeddings.

For EMD, the metric considers matched pairs of clusters in both embeddings: for each pair in each embedding, the distances between each centroid and all points in the other cluster are calculated. The EMD between these two vectors from the two embeddings is calculated and then averaged across all pairs.

For Cluster Distance, the pairwise rank distances between all cluster centroids are calculated in each embedding. Then, the Euclidean distance between these two vectors is taken.

Parameters:

embedding (ndarray) – The first (main) embedding.
labels_embedding (ndarray) – Labels for all observations in the embedding.
comparison_file (Union[ndarray, List[ndarray]]) – The second embedding.
comparison_labels (Union[ndarray, List[ndarray]]) – The labels for all observations in the comparison embedding.
comparison_classes (Optional[List[str]]) – Which classes in labels to compare. At least two classes need to be provided for this to work; otherwise, “NA” will be returned. If None, all overlapping labels are used. Optional.
method (str) – “emd” or “cluster_distance”, defaults to “emd”

Return type:

Union[float, str]

Returns:

The score or “NA”

Note

When there are no overlapping labels, “NA” is automatically returned as str.

Deprecated since version 0.2.0: Passing in str for the comparison_classes parameter is deprecated and will be removed in future versions.
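One possible reading of the Cluster Distance option is sketched below: rank the pairwise centroid distances within each embedding, then take the Euclidean distance between the two rank vectors. The exact ranking scheme used by the library is not documented here, so the details (ranks starting at 0, no tie handling, comparison over all shared labels) are assumptions, and both function names are hypothetical.

```python
import numpy as np

def centroid_rank_vector(embedding, labels, classes):
    """Ranks of all pairwise centroid distances (0 = closest pair)."""
    centroids = np.array([embedding[labels == c].mean(axis=0) for c in classes])
    i, j = np.triu_indices(len(classes), k=1)
    dists = np.linalg.norm(centroids[i] - centroids[j], axis=1)
    return np.argsort(np.argsort(dists)).astype(float)

def cluster_distance_sketch(emb_a, labels_a, emb_b, labels_b):
    """Euclidean distance between the two rank vectors, or "NA"."""
    classes = np.intersect1d(labels_a, labels_b)
    if classes.size < 2:
        return "NA"
    ra = centroid_rank_vector(emb_a, labels_a, classes)
    rb = centroid_rank_vector(emb_b, labels_b, classes)
    return float(np.linalg.norm(ra - rb))

labels = np.array(["a", "a", "b", "b", "c", "c"])
emb_a = np.array([[0.0, 0], [0, 0], [1, 0], [1, 0], [5, 0], [5, 0]])
emb_b = np.array([[0.0, 0], [0, 0], [5, 0], [5, 0], [1, 0], [1, 0]])
print(cluster_distance_sketch(emb_a, labels, emb_a, labels))
print(cluster_distance_sketch(emb_a, labels, emb_b, labels))
```

An embedding compared with itself scores 0. Swapping the positions of clusters "b" and "c" changes which cluster pairs are closest, so the score becomes positive.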

static neighborhood_agreement(data_neighbors, embedding_neighbors)[source]

Neighborhood Agreement

The Neighborhood Agreement metric is proposed by Lee et al. (2015). It measures the intersection of k-nearest neighbors (KNN) of each point in HD and LD space. The result is subsequently rescaled to measure the improvement over a random embedding. This measure is conceptually similar to EvaluationMetrics.KNN in that both measure the agreement of KNN, but EvaluationMetrics.KNN simply takes the average of the KNN graph agreement without any scaling.

Parameters:

data_neighbors (ndarray) – A nearest-neighbor array of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor array of the embedding.

Return type:

float

Returns:

Neighborhood agreement.
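The rescaling step can be illustrated with one commonly cited form, ((N - 1) * Q - K) / (N - 1 - K), where Q is the average neighbor overlap. Treat the formula as an assumption about how the rescaling is done; the library's exact expression is not stated in this reference, and the function name is hypothetical.

```python
import numpy as np

def neighborhood_agreement_sketch(data_neighbors, embedding_neighbors):
    """Average KNN overlap, rescaled against a random embedding."""
    n, k = data_neighbors.shape
    overlap = np.mean([
        np.intersect1d(d, e).size / k
        for d, e in zip(data_neighbors, embedding_neighbors)
    ])
    # A random embedding has expected overlap k / (n - 1), which maps to 0;
    # a perfect embedding has overlap 1, which maps to 1.
    return float(((n - 1) * overlap - k) / (n - 1 - k))

data_nn = np.array([[1, 2], [0, 2], [0, 1], [2, 1]])
emb_nn = np.array([[1, 3], [0, 2], [3, 1], [2, 0]])
print(neighborhood_agreement_sketch(data_nn, data_nn))  # 1.0
print(neighborhood_agreement_sketch(data_nn, emb_nn))
```

In this tiny example the unscaled overlap is 0.625, which is below the chance level of 2/3 for N = 4 and K = 2, so the rescaled score is negative.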

static neighborhood_trustworthiness(data_neighbors, embedding_neighbors, dist_data)[source]

Neighborhood Trustworthiness

The Neighborhood Trustworthiness is proposed by Venna and Kaski (2001). It measures trustworthiness using the ranked distance of new points entering the defined neighborhood size in the embedding. The higher the new points are ranked based on the original HD space distance matrix, the less trustworthy the new embedding is. The measure is scaled between 0 and 1 with a higher score reflecting a more trustworthy embedding.

Parameters:

data_neighbors (ndarray) – A nearest-neighbor matrix of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor matrix of the embedding.
dist_data (ndarray) – A pairwise distance matrix for the original data.

Return type:

float

Returns:

Neighborhood trustworthiness.
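A sketch of the usual trustworthiness formula follows, assuming k is less than half the number of observations and that rank 1 denotes a point's nearest neighbor in the original space. It is an illustration of the idea, not the library's code, and the function name is hypothetical.

```python
import numpy as np

def trustworthiness_sketch(data_neighbors, embedding_neighbors, dist_data):
    """1 minus a scaled penalty for points that intrude into neighborhoods."""
    n, k = data_neighbors.shape
    # ranks[i, j]: rank of j by original-space distance from i (i itself is 0).
    ranks = np.argsort(np.argsort(dist_data, axis=1), axis=1)
    penalty = 0.0
    for i in range(n):
        # Points that are neighbors in the embedding but not in the original data.
        intruders = np.setdiff1d(embedding_neighbors[i], data_neighbors[i])
        penalty += np.sum(ranks[i, intruders] - k)
    return float(1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1)))

points = np.array([0.0, 1.0, 3.0, 7.0, 15.0, 31.0])
dist = np.abs(points[:, None] - points[None, :])
data_nn = np.argsort(dist, axis=1)[:, 1:3]  # column 0 is the point itself
emb_nn = data_nn.copy()
emb_nn[0] = [4, 5]  # pretend the embedding pulls two far points next to point 0
print(trustworthiness_sketch(data_nn, data_nn, dist))  # 1.0
print(trustworthiness_sketch(data_nn, emb_nn, dist))
```

The two intruders rank 4th and 5th from point 0 in the original space, so the penalty is (4 - 2) + (5 - 2) = 5 and the score drops to 1 - 10/60.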

static random_forest(embedding, labels)[source]

Random Forest Classification Accuracy

This method trains a random forest classifier using the embedding data and the labels generated or manually classified from the original space. It then tests the accuracy of the classifier using 33% of the embedding data. This metric was first proposed in Becht et al. (2019).

Parameters:

embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.

Return type:

float

Returns:

Random forest prediction accuracy.

static residual_variance(x=None, y=None, r=None)[source]

Residual Variance

The residual variance is computed with the following formulation, with r as the Pearson correlation: 1-r**2. If r is provided, x and y can be omitted for efficiency.

Parameters:

x (Optional[ndarray]) – The first 1D array, optional.
y (Optional[ndarray]) – The second 1D array, optional.
r (Optional[float]) – Pearson correlation between x and y, optional.

Return type:

float

Returns:

Residual variance.
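The formulation above amounts to a few lines. In this sketch the function name is hypothetical, and raising ValueError when neither r nor both arrays are given is an assumption, not documented behavior.

```python
import numpy as np

def residual_variance_sketch(x=None, y=None, r=None):
    """1 - r**2, computing the Pearson r from x and y only when needed."""
    if r is None:
        if x is None or y is None:
            # Assumption: the library's handling of this case is not documented.
            raise ValueError("Provide either r or both x and y.")
        r = np.corrcoef(x, y)[0, 1]
    return float(1.0 - r ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(residual_variance_sketch(r=0.5))  # 0.75
print(residual_variance_sketch(x, 2 * x + 1))
```

Passing a precomputed r skips the correlation step entirely. A perfectly linear relationship gives a residual variance of 0, up to floating-point error.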

static silhouette(embedding, labels)[source]

Silhouette Score

This metric computes the silhouette score of clusters in the embedding space. Ideally, clusters should be coherent, and using labels obtained from the original space can evaluate the effectiveness of the embedding technique. This metric is used in Xiang et al. (2021).

Parameters:

embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.

Return type:

float

Returns:

Silhouette score.

class CytofDR.evaluation.PointClusterDistance(X, labels, dist_metric='euclidean')[source]

Bases: object

Point Cluster Distance

This class is used to compute the Point Cluster Distance. Instead of the full pairwise distance, this distance metric computes the distance between each cluster centroid and all other points. The memory complexity is N_cluster*N instead of (N^2)/2.

Parameters:

X (ndarray) – The input data array.
labels (ndarray) – Labels for the data array.
dist_metric (str) – The distance metric to use. This supports “euclidean”, “manhattan”, or “cosine”, defaults to “euclidean”

Attributes:

X: The input data array.
labels: Labels for the data array.
dist_metric: The distance metric to use. This supports “euclidean”, “manhattan”, or “cosine”, defaults to “euclidean”
dist: The calculated distance array. The first axis corresponds to each observation in the original array and the second axis is all the cluster centroids, optional.
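The layout of the dist array can be sketched with plain NumPy. This is an illustration with Euclidean distance only; the function name is hypothetical and the class itself may compute centroids or distances differently.

```python
import numpy as np

def point_cluster_distance_sketch(X, labels):
    """Distance from every observation to every cluster centroid."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    # Shape (N, N_cluster): rows are observations, columns are centroids.
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
dist = point_cluster_distance_sketch(X, labels)
print(dist.shape)            # (4, 2)
print(dist.flatten().shape)  # (8,)
```

With N observations and N_cluster centroids the result holds N_cluster*N values, which is where the memory saving over a full pairwise distance matrix comes from.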

fit(flatten=False)[source]

Fit the distance metric.

This method calculates the distance metric based on the class attributes.

Parameters:

flatten (bool) – Whether to flatten the return into a 1-d vector.

Return type:

ndarray

Returns:

The calculated distance array.

static flatten(dist)[source]

Flatten an array

This method is a wrapper for the flatten method in numpy.

Parameters:

dist (ndarray) – The distance array.

Return type:

ndarray

Returns:

The flattened array.