Module: CytofDR.evaluation
- class CytofDR.evaluation.Annoy[source]
Bases:
object
- static build_annoy(data, metric='angular', n_trees=10)[source]
Build AnnoyIndex object from data.
- Parameters:
data (ndarray) – The data array.
metric (str) – The distance metric to use, defaults to “angular”.
n_trees (int) – The number of trees, defaults to 10.
- Return type:
Annoy
- Returns:
An AnnoyIndex object.
- static load_annoy(path, ncol, metric='angular')[source]
Load AnnoyIndex object from disk.
This loads an AnnoyIndex object saved using this class or Annoy’s built-in I/O function.
- Parameters:
path (str) – The path to the object.
ncol (int) – The number of columns.
metric (str) – The distance metric of the saved index, defaults to “angular”.
- Return type:
Annoy
- Returns:
The loaded AnnoyIndex object.
- class CytofDR.evaluation.EvaluationMetrics[source]
Bases:
object
Evaluation metrics for dimension reduction
This class contains methods to run evaluation metrics.
- static ARI(x_labels, y_labels)[source]
Adjusted Rand Index (ARI)
The ARI measures the similarity between the labels from the original space and the labels from the embedding space by counting pairs of observations that the two labelings treat consistently, corrected for chance. It is used in Xiang et al. (2021).
- Parameters:
x_labels (ndarray) – The first set of labels.
y_labels (ndarray) – The second set of labels on the same data.
- Return type:
float
- Returns:
ARI.
- References:
This implementation is adapted from sklearn’s implementation of ARI, with a fix for an overflow issue.
@article{scikit-learn, title={Scikit-learn: Machine Learning in {P}ython}, author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, journal={Journal of Machine Learning Research}, volume={12}, pages={2825--2830}, year={2011} }
- License:
BSD 3-Clause License Copyright (c) 2007-2021 The scikit-learn developers. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
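To illustrate the pair-counting idea behind the ARI, here is a minimal NumPy sketch. It is not the CytofDR or scikit-learn code, and the name `ari_sketch` is hypothetical. Converting the pair counts to Python integers before multiplying them is one assumed way to sidestep fixed-width integer overflow.

```python
import numpy as np


def ari_sketch(x_labels, y_labels):
    """Pair-counting ARI (hypothetical helper, for illustration only)."""
    x_labels = np.asarray(x_labels)
    y_labels = np.asarray(y_labels)
    # Contingency table: cell (i, j) counts points in class i of x and class j of y.
    _, x_idx = np.unique(x_labels, return_inverse=True)
    _, y_idx = np.unique(y_labels, return_inverse=True)
    table = np.zeros((x_idx.max() + 1, y_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (x_idx, y_idx), 1)

    def pairs(counts):
        # Number of unordered pairs n * (n - 1) / 2, summed and returned as a Python int.
        counts = np.asarray(counts, dtype=np.int64)
        return int((counts * (counts - 1) // 2).sum())

    same_both = pairs(table)
    same_x = pairs(table.sum(axis=1))
    same_y = pairs(table.sum(axis=0))
    all_pairs = pairs(x_labels.size)
    if all_pairs == 0:  # fewer than two observations
        return 1.0
    # Python ints have arbitrary precision, so this product cannot overflow.
    expected = same_x * same_y / all_pairs
    maximum = (same_x + same_y) / 2
    if maximum == expected:  # e.g. both labelings put everything in one class
        return 1.0
    return (same_both - expected) / (maximum - expected)
```

Two labelings that agree up to a renaming of the classes score 1, while labelings that agree less than expected by chance score below 0.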
- static EMD(x, y, normalization=None)[source]
Earth Mover’s Distance (EMD)
This metric computes the EMD between the pairwise distances of points in the high-dimensional space and those in the low-dimensional space. This implementation uses scipy.stats.wasserstein_distance. The use of EMD was proposed in Heiser & Lau (2020).
- Parameters:
x (ndarray) – The first distribution x as a 1D array.
y (ndarray) – The second distribution y as a 1D array.
normalization (Optional[str]) – The normalization to apply to x and y. The only accepted value is minmax, which performs min-max normalization. If None, no normalization is performed. Defaults to None.
- Raises:
ValueError – Unsupported normalization method.
- Return type:
float
- Returns:
Earth mover’s distance.
New in version 0.3.0: The normalization parameter. It was added to allow normalization of x and y with min-max normalization using minmax. This is useful when x and y are empirical observations on different scales and the scale itself is not of interest.
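The effect of min-max normalization on the EMD can be seen in a small NumPy sketch. It is a simplification that assumes two 1D samples of equal length, rather than a call to scipy.stats.wasserstein_distance. The names `minmax` and `emd_sketch` are hypothetical.

```python
import numpy as np


def minmax(v):
    """Rescale a 1D array to [0, 1]; a constant array maps to zeros."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def emd_sketch(x, y, normalization=None):
    """1D EMD for equal-length samples (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("This sketch assumes two 1D arrays of equal length.")
    if normalization == "minmax":
        x, y = minmax(x), minmax(y)
    elif normalization is not None:
        raise ValueError("Unsupported normalization method.")
    # For equal-size, equally weighted samples, the 1D Wasserstein-1 distance
    # is the mean absolute difference between the sorted values.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))


raw = emd_sketch([0, 1, 2], [0, 10, 20])                # scales differ
scaled = emd_sketch([0, 1, 2], [0, 10, 20], "minmax")   # same shape after rescaling
```

Here the two inputs differ only in scale, so the normalized distance is zero while the raw distance is not.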
- static KNN(data_neighbors, embedding_neighbors)[source]
K-Nearest Neighbors Preservation (KNN)
The KNN metric computes the percentage of each point’s k nearest neighbors that is preserved in the embedding space, averaged across the entire dataset.
Note
This method does not compute the nearest neighbors themselves; it compares precomputed neighbor arrays.
- Parameters:
data_neighbors (ndarray) – A nearest-neighbor array of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor array of the embedding.
- Return type:
float
- Returns:
K-nearest neighbors preservation.
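As a rough illustration of KNN preservation, the sketch below averages the share of common neighbors per point. It assumes both inputs are integer index arrays of shape (n, k). The name `knn_preservation_sketch` is hypothetical and the code is not taken from CytofDR.

```python
import numpy as np


def knn_preservation_sketch(data_neighbors, embedding_neighbors):
    """Average fraction of each point's neighbors kept in the embedding."""
    data_neighbors = np.asarray(data_neighbors)
    embedding_neighbors = np.asarray(embedding_neighbors)
    k = data_neighbors.shape[1]
    # Count, row by row, how many neighbor indices appear in both arrays.
    shared = [
        len(set(hd) & set(ld))
        for hd, ld in zip(data_neighbors.tolist(), embedding_neighbors.tolist())
    ]
    return float(np.mean(shared) / k)


score = knn_preservation_sketch(
    [[1, 2], [0, 2], [0, 1]],   # neighbors in the original space
    [[2, 1], [0, 3], [3, 4]],   # neighbors in the embedding
)
```

The three rows keep 2, 1, and 0 of their 2 neighbors, so the average preservation is one half.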
- static NMI(x_labels, y_labels)[source]
Normalized Mutual Information (NMI)
The NMI metric computes the mutual information between the labels of the original space and the embedding space and then normalizes it by the larger entropy of the two label vectors. This metric is a measure of clustering performance before and after dimension reduction, and it is used in Xiang et al. (2021).
- Parameters:
x_labels (ndarray) – The first set of labels.
y_labels (ndarray) – The second set of labels on the same data.
- Return type:
float
- Returns:
Normalized mutual information.
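The normalization used by NMI (mutual information divided by the larger of the two label entropies) can be sketched with NumPy as follows. The names `entropy` and `nmi_sketch` are hypothetical, and the sketch is an illustration under that stated normalization rather than the library's code.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def nmi_sketch(x_labels, y_labels):
    """Mutual information normalized by the larger label entropy."""
    _, x_idx = np.unique(np.asarray(x_labels), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y_labels), return_inverse=True)
    # Joint distribution of the two labelings.
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    h_x = entropy(joint.sum(axis=1))
    h_y = entropy(joint.sum(axis=0))
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    mutual_info = h_x + h_y - entropy(joint.ravel())
    denom = max(h_x, h_y)
    return 1.0 if denom == 0 else mutual_info / denom
```

Identical labelings (up to renaming) give a value of 1, and statistically independent labelings give a value of 0.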
- static NPE(data_neighbors, embedding_neighbors, labels, method='L1')[source]
Neighborhood Proportion Error (NPE)
The NPE metric was proposed by Konstorum et al. (2019). It measures the total variation distance between the proportion of nearest points belonging to the same class of each point in the HD and LD space. The lower the NPE, the more similar the embedding and the original data are.
To further elaborate on the difference between the L1 norm and the TVD, the NPE metric involves the calculation of \(\delta (P, Q)\), where \(\delta\) is a distance measure on \(P\) and \(Q\). The “L1” option computes
\[\sum_i |P_i-Q_i|\]
whereas the TVD computes
\[\sup_{a\in [0,1]} |P(a) - Q(a)| \, .\]
There is some debate about which implementation should be used to calculate NPE, but TVD should align more closely with the original authors’ implementation.
- Parameters:
data_neighbors (ndarray) – A nearest-neighbor array of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor array of the embedding.
labels (ndarray) – The class labels of each observation.
method (str) – The distance measure used for computing the distance between the neighborhood-proportion vectors. “L1” for L1-norm or “tvd” for Total Variation Distance. The latter is likely the intended implementation by Konstorum et al. Defaults to “L1”.
- Raises:
ValueError – Unsupported method provided. We only support “L1” or “tvd”.
- Return type:
float
- Returns:
Neighborhood proportion error.
New in version 0.3.0: The method parameter. It was added to allow for alternative implementations. Originally, only “L1” was implemented, and it is still the default. As of this version, “tvd” is also an option.
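The difference between the “L1” and “tvd” options can be made concrete with a sketch that follows one literal reading of the two formulas for NPE. This reading is an assumption and may differ from both CytofDR and Konstorum et al. For each class, P and Q are taken to be the distributions of the same-class neighbor proportion a in HD and LD space. The name `npe_sketch` is hypothetical, and labels are assumed to be aligned with the rows of the neighbor arrays.

```python
import numpy as np


def npe_sketch(data_neighbors, embedding_neighbors, labels, method="L1"):
    """Sum over classes of a distance between HD and LD proportion distributions."""
    if method not in ("L1", "tvd"):
        raise ValueError("Unsupported method: use 'L1' or 'tvd'.")
    labels = np.asarray(labels)
    data_neighbors = np.asarray(data_neighbors)
    embedding_neighbors = np.asarray(embedding_neighbors)
    k = data_neighbors.shape[1]

    def same_class_counts(neighbors):
        # For each point, the number of neighbors that share its label.
        return (labels[neighbors] == labels[:, None]).sum(axis=1)

    hd = same_class_counts(data_neighbors)
    ld = same_class_counts(embedding_neighbors)
    total = 0.0
    for c in np.unique(labels):
        mask = labels == c
        # Distribution over the possible proportions a = 0/k, 1/k, ..., k/k.
        p = np.bincount(hd[mask], minlength=k + 1) / mask.sum()
        q = np.bincount(ld[mask], minlength=k + 1) / mask.sum()
        diff = np.abs(p - q)
        # "L1" sums the differences; "tvd" takes the largest one (the sup).
        total += diff.sum() if method == "L1" else diff.max()
    return float(total)


labels = [0, 0, 0, 1, 1, 1]
hd_nn = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
ld_nn = [[1, 2], [0, 3], [3, 4], [4, 5], [3, 5], [3, 4]]
l1_value = npe_sketch(hd_nn, ld_nn, labels, method="L1")
tvd_value = npe_sketch(hd_nn, ld_nn, labels, method="tvd")
```

In this toy example only class 0 loses same-class neighbors in the embedding, and the two options yield different totals for the same inputs.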
- static build_annoy(data, saved_annoy_path=None, k=5)[source]
Build an ANNOY model and return nearest neighbors.
This is a utility function for building ANNOY models and returning the nearest-neighbor matrices for original space data and low-dimensional embedding.
- Parameters:
data – The input high-dimensional array.
saved_annoy_path (Optional[str]) – The path to a pre-built ANNOY model for the original data.
k (int) – The number of neighbors.
- Return type:
ndarray
- Returns:
Nearest-neighbor matrices of original space data.
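To show the shape of a nearest-neighbor matrix, the sketch below uses an exact brute-force search as a stand-in for ANNOY, which is approximate. It does not call the Annoy library. Excluding each point from its own neighbor list is an assumption of this sketch, and the name `exact_neighbors_sketch` is hypothetical.

```python
import numpy as np


def exact_neighbors_sketch(data, k=5):
    """Return an (n, k) array of neighbor indices using Euclidean distance."""
    data = np.asarray(data, dtype=float)
    # Full pairwise distance matrix; fine for small n, too costly for large n.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor here
    return np.argsort(dist, axis=1)[:, :k]


neighbors = exact_neighbors_sketch([[0.0], [1.0], [3.0], [10.0]], k=1)
```

Row i of the result lists the indices of the k points closest to point i, which is the input format expected by the neighbor-based metrics in this class.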
- static calinski_harabasz(embedding, labels)[source]
Calinski-Harabasz Index
This metric computes the Calinski-Harabasz index of clusters in the embedding space. Ideally, clusters should be coherent, and using labels obtained from the original space can evaluate the effectiveness of the embedding technique.
- Parameters:
embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.
- Return type:
float
- Returns:
Calinski-Harabasz Index.
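For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The NumPy sketch below follows that standard definition. It is an illustration with a hypothetical name, `calinski_harabasz_sketch`, and makes no claim about which library call CytofDR uses.

```python
import numpy as np


def calinski_harabasz_sketch(embedding, labels):
    """Between/within dispersion ratio; higher means better-separated clusters."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, n_classes = embedding.shape[0], classes.size
    overall = embedding.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in classes:
        points = embedding[labels == c]
        centroid = points.mean(axis=0)
        # Size-weighted squared distance of the centroid from the global mean.
        between += points.shape[0] * ((centroid - overall) ** 2).sum()
        # Squared distances of the points from their own centroid.
        within += ((points - centroid) ** 2).sum()
    return float((between / (n_classes - 1)) / (within / (n - n_classes)))


ch = calinski_harabasz_sketch([[0.0], [2.0], [10.0], [12.0]], [0, 0, 1, 1])
```

The sketch requires at least two classes and fewer classes than observations.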
- static correlation(x, y, metric='Pearson')[source]
Calculate Correlation Coefficient
This method computes the Pearson or Spearman correlation between the inputs.
- Parameters:
x (ndarray) – The first 1D array.
y (ndarray) – The second 1D array.
metric (str) – The metric to use. ‘Pearson’ or ‘Spearman’, defaults to “Pearson”.
- Return type:
float
- Returns:
Correlation coefficient.
- static davies_bouldin(embedding, labels)[source]
Davies-Bouldin Index
This metric computes the Davies-Bouldin index of clusters in the embedding space. Ideally, clusters should be coherent, and using labels obtained from the original space can evaluate the effectiveness of the embedding technique.
- Parameters:
embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.
- Return type:
float
- Returns:
Davies-Bouldin Index.
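The Davies-Bouldin index averages, over clusters, the worst ratio of combined within-cluster scatter to centroid separation, so lower values indicate better separation. A minimal NumPy sketch of that standard definition follows. The name `davies_bouldin_sketch` is hypothetical and the code is illustrative only.

```python
import numpy as np


def davies_bouldin_sketch(embedding, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation_ij."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([embedding[labels == c].mean(axis=0) for c in classes])
    # Scatter: mean distance of each cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(embedding[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    separation = np.linalg.norm(
        centroids[:, None, :] - centroids[None, :, :], axis=-1
    )
    np.fill_diagonal(separation, np.inf)  # makes the i == j ratio zero
    ratios = (scatter[:, None] + scatter[None, :]) / separation
    return float(ratios.max(axis=1).mean())


db = davies_bouldin_sketch([[0.0], [2.0], [10.0], [12.0]], [0, 0, 1, 1])
```

The sketch assumes at least two clusters with distinct centroids.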
- static embedding_concordance(embedding, labels_embedding, comparison_file, comparison_labels, comparison_classes=None, method='emd')[source]
Concordance between two embeddings.
This is a wrapper function to implement two embedding concordance metrics based on named clusters: EMD and Cluster Distance. When two embeddings can be reasonably aligned based on clusters or manual labels, these two metrics calculate the relationships between clusters and their distances between two embeddings.
For EMD, the metric considers matched pairs of clusters in both embeddings: for each pair in each embedding, the distances between each centroid and all points in the other cluster are calculated. The EMD between these two vectors from the two embeddings is calculated and then averaged across all pairs.
For Cluster Distance, the pairwise rank distances between all cluster centroids are calculated in each embedding. Then, the Euclidean distance between these two vectors is taken.
- Parameters:
embedding (ndarray) – The first (main) embedding.
labels_embedding (ndarray) – Labels for all observations in the embedding.
comparison_file (Union[ndarray, List[ndarray]]) – The second embedding.
comparison_labels (Union[ndarray, List[ndarray]]) – The labels for all observations in the comparison embedding.
comparison_classes (Optional[List[str]]) – Which classes in labels to compare. At least two classes need to be provided for this to work; otherwise, “NA” will be returned. If None, all overlapping labels are used. Optional.
method (str) – “emd” or “cluster_distance”, defaults to “emd”.
- Return type:
Union[float, str]
- Returns:
The score or “NA”.
Note
When there are no overlapping labels, “NA” is automatically returned as a str.
Deprecated since version 0.2.0: Passing in str for the comparison_classes parameter is deprecated and will be removed in future versions.
- static neighborhood_agreement(data_neighbors, embedding_neighbors)[source]
Neighborhood Agreement
The Neighborhood Agreement metric was proposed by Lee et al. (2015). It measures the intersection of the k-nearest neighbors (KNN) of each point in HD and LD space. The result is subsequently rescaled to measure the improvement over a random embedding. This measure is conceptually similar to EvaluationMetrics.KNN in that both measure the agreement of KNN, but EvaluationMetrics.KNN simply takes the average of the KNN graph agreement without any rescaling.
- Parameters:
data_neighbors (ndarray) – A nearest-neighbor array of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor array of the embedding.
- Return type:
float
- Returns:
Neighborhood agreement.
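The rescaling step in Neighborhood Agreement can be sketched as follows. The sketch assumes the formula from Lee et al., R = ((N - 1) Q - K) / (N - 1 - K), where Q is the average neighbor overlap, N the number of points, and K the neighborhood size. Whether CytofDR uses exactly this expression is an assumption, and the name `neighborhood_agreement_sketch` is hypothetical.

```python
import numpy as np


def neighborhood_agreement_sketch(data_neighbors, embedding_neighbors):
    """Average KNN overlap, rescaled so a random embedding scores about 0."""
    data_neighbors = np.asarray(data_neighbors)
    embedding_neighbors = np.asarray(embedding_neighbors)
    n, k = data_neighbors.shape
    # Q: mean fraction of neighbors shared between HD and LD space.
    overlap = np.mean([
        len(set(hd) & set(ld)) / k
        for hd, ld in zip(data_neighbors.tolist(), embedding_neighbors.tolist())
    ])
    # A random embedding is expected to give Q = k / (n - 1), which maps to 0.
    return float(((n - 1) * overlap - k) / (n - 1 - k))


hd_nn = [[1, 2], [0, 2], [0, 1], [0, 1], [0, 1]]
ld_nn = [[1, 3], [0, 3], [0, 3], [0, 2], [0, 2]]
perfect = neighborhood_agreement_sketch(hd_nn, hd_nn)
chance_level = neighborhood_agreement_sketch(hd_nn, ld_nn)
```

With 5 points and k = 2, an overlap of one half is exactly what a random embedding would achieve on average, so the rescaled score is 0, whereas the unscaled KNN preservation would be 0.5.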
- static neighborhood_trustworthiness(data_neighbors, embedding_neighbors, dist_data)[source]
Neighborhood Trustworthiness
The Neighborhood Trustworthiness metric was proposed by Venna and Kaski (2001). It measures trustworthiness using the ranked distance of new points entering the neighborhood of the defined size in the embedding. The higher the rank of these new points in the original HD space distance matrix (that is, the farther away they originally were), the less trustworthy the embedding is. The measure is scaled between 0 and 1, with a higher score reflecting a more trustworthy embedding.
- Parameters:
data_neighbors (ndarray) – A nearest-neighbor matrix of the original data.
embedding_neighbors (ndarray) – A nearest-neighbor matrix of the embedding.
dist_data (ndarray) – A pairwise distance matrix for the original data.
- Return type:
float
- Returns:
Neighborhood trustworthiness.
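The rank-based penalty in Neighborhood Trustworthiness can be sketched with the commonly cited formula T(k) = 1 - 2 / (n k (2n - 3k - 1)) Σ_i Σ_{j ∈ U_i} (r(i, j) - k), where U_i holds the points that are neighbors of i only in the embedding and r(i, j) is the rank of j by distance from i in the original space. The sketch assumes a zero diagonal in `dist_data`, no duplicate points, and k < n / 2. The name `trustworthiness_sketch` is hypothetical.

```python
import numpy as np


def trustworthiness_sketch(data_neighbors, embedding_neighbors, dist_data):
    """Penalize embedding neighbors that were far away in the original space."""
    data_neighbors = np.asarray(data_neighbors)
    embedding_neighbors = np.asarray(embedding_neighbors)
    dist_data = np.asarray(dist_data, dtype=float)
    n, k = embedding_neighbors.shape
    # ranks[i, j]: position of j when sorting by HD distance from i (0 is i itself).
    order = np.argsort(dist_data, axis=1)
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    penalty = 0
    for i in range(n):
        # Points that entered the neighborhood of i only in the embedding.
        intruders = set(embedding_neighbors[i].tolist()) - set(data_neighbors[i].tolist())
        penalty += sum(int(ranks[i, j]) - k for j in intruders)
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


points = np.array([0.0, 1.0, 3.0, 7.0, 15.0])
dist = np.abs(points[:, None] - points[None, :])
hd_nn = np.array([[1], [0], [1], [2], [3]])
ld_nn = np.array([[4], [0], [1], [2], [3]])  # point 0 gains a far-away neighbor
trust = trustworthiness_sketch(hd_nn, ld_nn, dist)
```

Only point 0 has an intruder, and that intruder was the farthest point in the original space, so the score drops below 1.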
- static random_forest(embedding, labels)[source]
Random Forest Classification Accuracy
This method trains a random forest classifier using the embedding data and the labels generated or manually classified from the original space. It then tests the accuracy of the classifier using 33% of the embedding data. This metric was first proposed in Becht et al. (2019).
- Parameters:
embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.
- Return type:
float
- Returns:
Random forest prediction accuracy.
- static residual_variance(x=None, y=None, r=None)[source]
Residual Variance
The residual variance is computed as 1-r**2, where r is the Pearson correlation. If r is provided, x and y are not needed, which avoids recomputing the correlation.
- Parameters:
x (Optional[ndarray]) – The first 1D array, optional.
y (Optional[ndarray]) – The second 1D array, optional.
r (Optional[float]) – Pearson correlation between x and y, optional.
- Return type:
float
- Returns:
Residual variance.
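The relationship between the three arguments of the residual variance can be sketched as below. The function name `residual_variance_sketch` and its error message are hypothetical. The Pearson correlation is computed with `np.corrcoef` only when r is not supplied.

```python
import numpy as np


def residual_variance_sketch(x=None, y=None, r=None):
    """Return 1 - r**2, computing the Pearson r from x and y if needed."""
    if r is None:
        if x is None or y is None:
            raise ValueError("Provide either r or both x and y.")
        r = np.corrcoef(x, y)[0, 1]
    return float(1 - r ** 2)


from_r = residual_variance_sketch(r=0.5)
from_data = residual_variance_sketch(x=[1, 2, 3], y=[2, 4, 6])
```

A perfectly linear relationship leaves no residual variance, while r = 0.5 leaves three quarters of the variance unexplained.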
- static silhouette(embedding, labels)[source]
Silhouette Score
This metric computes the silhouette score of clusters in the embedding space. Ideally, clusters should be coherent, and using labels obtained from the original space can evaluate the effectiveness of the embedding technique. This metric is used in Xiang et al. (2021).
- Parameters:
embedding (ndarray) – The low-dimensional embedding.
labels (ndarray) – The class labels of each observation.
- Return type:
float
- Returns:
Silhouette score.
- class CytofDR.evaluation.PointClusterDistance(X, labels, dist_metric='euclidean')[source]
Bases:
object
Point Cluster Distance
This class is used to compute the Point Cluster Distance. Instead of the full pairwise distance, this distance metric computes the distance between each cluster centroid and all other points. The memory complexity is N_cluster*N instead of (N^2)/2.
- Parameters:
X (ndarray) – The input data array.
labels (ndarray) – Labels for the data array.
dist_metric (str) – The distance metric to use. This supports “euclidean”, “manhattan”, or “cosine”, defaults to “euclidean”.
- Attributes:
X: The input data array.
labels: Labels for the data array.
dist_metric: The distance metric to use. This supports “euclidean”, “manhattan”, or “cosine”, defaults to “euclidean”
dist: The calculated distance array, optional. The first axis corresponds to the observations in the original array and the second axis to the cluster centroids.