Optimizations: PCD and ANNOY
With CyTOF’s uniquely large sample sizes, we often struggle with memory and speed. These
issues are not trivial to address: people dedicate their careers to them. While we try
to make things fast, a lot of things are beyond us. What we can do is to choose algorithms
wisely and think hard when we have to implement our own. In this package, we use two
non-standard algorithms to speed things up: ANNOY and PointClusterDistance. The former is
a well-known, off-the-shelf nearest-neighbor algorithm, whereas we introduced and
implemented the latter ourselves.
This page details how you can use these two optimizations, which come as defaults in our package.
ANNOY
Do you find slow KNN algorithms annoying? (Get it? Okay, I will stop.) But seriously,
who has time to wait for neighbors? Thus, we were exhilarated when we found out about
Approximate Nearest Neighbors Oh Yeah (ANNOY)!
This is indeed a popular algorithm, also used by packages such as openTSNE,
and in our experience, it is very (I mean very) fast! Constructing a neighbor graph
often takes less than a minute even with large sample sizes.
While the exact inner workings of this algorithm are quite beyond us, we have included
an interface to Python's annoy package in our evaluation module so that
you can easily use it. In fact, you likely won't need to use this API unless you
are planning on running evaluations yourself without our Reductions class. But
in case you do, you can use our Annoy class.
Building Annoy
We make it a breeze to do so! Here is an example with a random numpy array:
>>> from CytofDR.evaluation import Annoy
>>> import numpy as np
>>> expression = np.random.rand(100, 2)
>>> Annoy.build_annoy(data = expression)
<annoy.Annoy object at 0x7fa14159ceb0>
This is it! It’s pretty easy, and you have an AnnoyIndex object to work with.
To find out how exactly it works, we recommend you look at their documentation!
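If you want to query neighbors straight from the returned index, the annoy library's own methods, such as get_nns_by_item, are available on it. Here is a quick sketch of ours (assuming the returned object behaves like a standard AnnoyIndex; the variable names are placeholders):
>>> annoy_index = Annoy.build_annoy(data = expression)
>>> neighbors = annoy_index.get_nns_by_item(0, 10)  # indices of the 10 approximate nearest neighbors of cell 0
>>> neighbors_with_distances = annoy_index.get_nns_by_item(0, 10, include_distances = True)  # also return the distances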
However, if you want a neighbor array, we have another interface in
the EvaluationMetrics class. Here, you also have the ability to specify
the number of neighbors:
>>> from CytofDR.evaluation import EvaluationMetrics
>>> import numpy as np
>>> expression = np.random.rand(100, 2)
>>> EvaluationMetrics.build_annoy(data = expression, k = 10)
array([[ 9, 21, 5, 78, 15, 46, 63, 3, 81, 26],
[58, 87, 95, 97, 71, 53, 22, 68, 2, 8],
[ 8, 68, 86, 11, 32, 22, 53, 71, 66, 52],
[81, 26, 6, 46, 70, 15, 78, 30, 21, 0],
[30, 70, 6, 26, 19, 81, 3, 40, 46, 15],
... ,
[84, 80, 75, 13, 56, 23, 93, 76, 82, 59],
[71, 53, 22, 68, 2, 8, 87, 1, 86, 11],
[16, 67, 24, 44, 14, 51, 36, 74, 27, 62],
[91, 12, 18, 17, 7, 34, 20, 10, 94, 85]])
EvaluationMetrics.build_annoy() is a convenient wrapper around the Annoy class, and we
recommend using it unless you need customizations.
Load and Save Annoy Objects
Sometimes, you want to save your Annoy object if you are using
the same dataset repeatedly and don’t want to reconstruct it
every time. The Annoy library makes it easy to do so, and
in that spirit, we provide a convenient interface as well. To save it,
you can simply do this:
>>> from CytofDR.evaluation import Annoy
>>> annoy_object = Annoy.build_annoy(data = expression)
>>> Annoy.save_annoy(annoy_object, path = "YOUR_PATH")
and it can be loaded easily:
>>> annoy_object = Annoy.load_annoy(path = "YOUR_PATH")
If you want to use the EvaluationMetrics interface, you can
pass in the path and it will automatically load it for you:
>>> EvaluationMetrics.build_annoy(data = expression, saved_annoy_path = "YOUR_PATH", k = 10)
array([[ 9, 21, 5, 78, 15, 46, 63, 3, 81, 26],
[58, 87, 95, 97, 71, 53, 22, 68, 2, 8],
[ 8, 68, 86, 11, 32, 22, 53, 71, 66, 52],
[81, 26, 6, 46, 70, 15, 78, 30, 21, 0],
[30, 70, 6, 26, 19, 81, 3, 40, 46, 15],
... ,
[84, 80, 75, 13, 56, 23, 93, 76, 82, 59],
[71, 53, 22, 68, 2, 8, 87, 1, 86, 11],
[16, 67, 24, 44, 14, 51, 36, 74, 27, 62],
[91, 12, 18, 17, 7, 34, 20, 10, 94, 85]])
It’s as easy as that!
Point Cluster Distance (PCD)
Some of the evaluation metrics need a pairwise distance matrix.
This sounds very simple: you load scipy.spatial.distance and you’re on your
way. But wait! There is bad news: if your sample is large, I mean
larger than hundreds of thousands of cells, you will quickly run into
trouble because the memory complexity is O(N^2). This is scary because
it quickly outpaces what is reasonable for any user.
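A quick back-of-the-envelope calculation shows why (500,000 cells is just an illustrative number):
>>> n_cells = 500_000  # a hypothetical, but not unusual, CyTOF sample size
>>> n_cells * n_cells * 8 / 1e12  # a full float64 pairwise distance matrix, in terabytes
2.0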
There are a few solutions:
We find an alternative approach that avoids computing pairwise distances altogether.
We compromise and compute only a partial pairwise distance matrix.
Here, PCD belongs to the second solution. Instead of finding the pairwise distance between every pair of cells, we utilize clusters! The idea is that with clusters, we can easily compute cluster centroids, and by definition, we expect the number of clusters k to be much, much smaller than the number of cells. Thus, we compute the pairwise distance between cluster centroids and all cells, effectively reducing the memory complexity to O(k*N). This makes it feasible to work with datasets even in the realm of millions of cells, as long as the number of clusters and the number of features in the original expression matrix are reasonable.
While this sounds very easy and you could probably implement it in 10 minutes, the good news is that we have already implemented it (save the 10 minutes for a walk outside)! Now, let’s look at the interface!
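To make the idea concrete, here is a minimal numpy sketch of the computation; this is our own toy illustration, not the package's actual implementation:
>>> import numpy as np
>>> def point_cluster_distance(X, labels):
...     """A toy version of PCD: distances from every cell to every cluster centroid."""
...     clusters = np.unique(labels)
...     # one centroid per cluster
...     centroids = np.stack([X[labels == c].mean(axis = 0) for c in clusters])
...     # resulting matrix is N cells by k clusters instead of N by N
...     return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = -1), clusters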
Using the PointClusterDistance class
The PointClusterDistance interface is akin to that of sklearn’s estimators.
Here, we need both the expression matrix and the labels. We will simulate them
using random arrays:
>>> import numpy as np
>>> expression = np.random.rand(100, 2)
>>> expression
array([[3.89241108e-01, 3.74894390e-01],
[2.53484808e-02, 8.38034055e-01],
[1.81716130e-01, 6.77570720e-01],
[2.81618520e-01, 2.85328034e-01],
[8.72380359e-01, 9.43282666e-01],
...,
[5.64692819e-01, 3.01563659e-01],
[1.52626514e-01, 9.19922564e-01],
[9.13887763e-01, 1.12411075e-01],
[8.86792310e-01, 2.78847601e-01]])
>>> labels = np.repeat(["a", "b"], 50)
>>> labels
array(['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'], dtype='<U1')
Now, we can instantiate our PointClusterDistance:
>>> from CytofDR.evaluation import PointClusterDistance
>>> pcd = PointClusterDistance(X = expression, labels = labels)
>>> pcd.fit()
array([[0.13338769, 0.11425074],
[0.57807929, 0.57630871],
[0.35672805, 0.35369442],
[0.27244763, 0.25324668],
...
[0.19573966, 0.19065184],
[0.54919051, 0.55249102],
[0.56914471, 0.57123187],
[0.45559932, 0.4623146 ]])
This should look familiar to you! The fit() method returns
a distance array by default: the rows are cells and the columns are
clusters. To get the indices for the cluster labels:
>>> pcd.index
array(['a', 'b'], dtype='<U1')
This mapping tells you which cluster each column of the distance array represents.
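As an example, assuming you capture the output of fit() in a variable (called dists here), you can pair each column with its cluster label:
>>> dists = pcd.fit()
>>> per_cluster = {label: dists[:, i] for i, label in enumerate(pcd.index)}  # one distance vector per cluster label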
For convenience, you can also flatten the result into a 1-d array (i.e. a vector) so that you can run other metrics:
>>> pcd.fit(flatten = True)
array([0.13338769, 0.11425074, 0.57807929, 0.57630871, 0.35672805,
0.35369442, 0.27244763, 0.25324668, 0.6125959 , 0.63173534,
..., 0.55249102, 0.56914471, 0.57123187, 0.45559932, 0.4623146 ])
Using PCD for DR Evaluation
PCD is the default for evaluation when labels are available or when auto_cluster is set to True in the evaluate method. Suppose we have a Reductions object named results; the simplest case is:
>>> results.evaluate(category = ["global"], auto_cluster = True, n_clusters = 20)
Evaluating global...
Or, if you prefer your own clustering results or labels, you can add them as well:
>>> results.add_evaluation_metadata(original_labels = original_labels, embedding_labels = embedding_labels)
>>> results.evaluate(category = ["global"], n_clusters = 20)
Evaluating global...
where we assume you already have original_labels and embedding_labels as arrays. In this latter case, you no longer need auto_cluster.
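If you do not have labels yet, any clustering method will do. As a purely hypothetical example using scikit-learn's KMeans (original_expression and embedding stand in for your own arrays):
>>> from sklearn.cluster import KMeans
>>> original_labels = KMeans(n_clusters = 20, n_init = 10).fit_predict(original_expression)
>>> embedding_labels = KMeans(n_clusters = 20, n_init = 10).fit_predict(embedding)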
PCD Alternative: Pairwise Downsample
If you don’t have clustering information to perform PCD or prefer a downsampling-based approach, you can use the pairwise_downsample option to tackle the sample size issue. Namely, we randomly downsample both the original dataset and the embedding and then find the pairwise distances. In practice, we have found that PCD and pairwise downsampling are very similar in terms of how they evaluate DR methods.
This metric can only be used for evaluation. To use it, you can just specify it while calling the evaluate method on your Reductions object:
>>> results.evaluate(category = ["global"], pwd_metric = "pairwise_downsample", pairwise_downsample_size = 10000)
Evaluating global...
This will downsample to 10,000 cells, which is also the default. Of course, you can choose a different value, but to keep memory usage reasonable, we don’t recommend anything much larger.
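For intuition, here is a rough sketch of the idea behind pairwise downsampling; this is our own illustration rather than the package's exact implementation, and original_expression and embedding are placeholder names:
>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> rng = np.random.default_rng()
>>> idx = rng.choice(original_expression.shape[0], size = 10000, replace = False)  # sample the same cells in both spaces
>>> original_distances = pdist(original_expression[idx])  # condensed pairwise distances on the subsample
>>> embedding_distances = pdist(embedding[idx])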