Optimizations: PCD and ANNOY
With CyTOF’s uniquely large sample sizes, we often struggle with memory and speed. These
issues are not trivial to address: people dedicate their careers to them. While we try
to make things fast, a lot of things are beyond us. What we can do is to choose algorithms
wisely and think hard when we have to implement our own. In this package, we use two
non-standard algorithms to speed things up: ANNOY and PointClusterDistance. The former is
a well-known, off-the-shelf nearest-neighbor algorithm, whereas we introduced and
implemented the latter ourselves.
This page details how you can use these two optimizations, which come as defaults in our package.
ANNOY
Do you find slow KNN algorithms annoying? (Get it? Okay, I will stop.) But seriously,
who has time to wait for neighbors? Thus, we were exhilarated when we found out about
Approximate Nearest Neighbors Oh Yeah (ANNOY)!
This is indeed a popular algorithm, also used by packages such as openTSNE,
and in our experience, it is very (I mean very) fast! Constructing a neighbor graph
often takes less than a minute even with large sample sizes.
While the exact inner workings of this algorithm are quite beyond us, we have included
an interface to Python's annoy package in our evaluation module so that
you can easily use it. In fact, you likely won't need to use this API unless you
are planning on running evaluations yourself without our Reductions class. But
in case you do, you can use our Annoy class.
Building Annoy
We make it a breeze to do so! Here is an example with a random numpy array:
>>> from CytofDR.evaluation import Annoy
>>> import numpy as np
>>> expression = np.random.rand(100, 2)
>>> Annoy.build_annoy(data = expression)
<annoy.Annoy object at 0x7fa14159ceb0>
This is it! It’s pretty easy, and you have an AnnoyIndex object to work with.
To find out how exactly it works, we recommend you look at their documentation!
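If you want to query neighbors straight from the returned index, the annoy library's own methods, such as get_nns_by_item, are available on it. Here is a quick sketch of ours (assuming the returned object behaves like a standard AnnoyIndex; the variable names are placeholders):
>>> annoy_index = Annoy.build_annoy(data = expression)
>>> neighbors = annoy_index.get_nns_by_item(0, 10)  # indices of the 10 approximate nearest neighbors of cell 0
>>> neighbors_with_distances = annoy_index.get_nns_by_item(0, 10, include_distances = True)  # also return the distances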
However, if you want a neighbor array, we have another interface in
the EvaluationMetrics class. Here, you also have the ability to specify
the number of neighbors:
>>> from CytofDR.evaluation import EvaluationMetrics
>>> import numpy as np
>>> expression = np.random.rand(100, 2)
>>> EvaluationMetrics.build_annoy(data = expression, k = 10)
array([[ 9, 21, 5, 78, 15, 46, 63, 3, 81, 26],
[58, 87, 95, 97, 71, 53, 22, 68, 2, 8],
[ 8, 68, 86, 11, 32, 22, 53, 71, 66, 52],
[81, 26, 6, 46, 70, 15, 78, 30, 21, 0],
[30, 70, 6, 26, 19, 81, 3, 40, 46, 15],
... ,
[84, 80, 75, 13, 56, 23, 93, 76, 82, 59],
[71, 53, 22, 68, 2, 8, 87, 1, 86, 11],
[16, 67, 24, 44, 14, 51, 36, 74, 27, 62],
[91, 12, 18, 17, 7, 34, 20, 10, 94, 85]])
EvaluationMetrics.build_annoy() is a convenient wrapper around the Annoy class, and we
recommend using it unless you need customizations.
Load and Save Annoy Objects
Sometimes, you want to save your Annoy object if you are using
the same dataset repeatedly and don’t want to reconstruct it
every time. The Annoy library makes it easy to do so, and
in that spirit, we provide a convenient interface as well. To save it,
you can simply do this:
>>> from CytofDR.evaluation import Annoy
>>> annoy_object = Annoy.build_annoy(data = expression)
>>> Annoy.save_annoy(annoy_object, path = "YOUR_PATH")
and it can be loaded easily:
>>> annoy_object = Annoy.load_annoy(path = "YOUR_PATH")
If you want to use the EvaluationMetrics interface, you can
pass in the path and it will automatically load it for you:
>>> EvaluationMetrics.build_annoy(data = expression, saved_annoy_path = "YOUR_PATH", k = 10)
array([[ 9, 21, 5, 78, 15, 46, 63, 3, 81, 26],
[58, 87, 95, 97, 71, 53, 22, 68, 2, 8],
[ 8, 68, 86, 11, 32, 22, 53, 71, 66, 52],
[81, 26, 6, 46, 70, 15, 78, 30, 21, 0],
[30, 70, 6, 26, 19, 81, 3, 40, 46, 15],
... ,
[84, 80, 75, 13, 56, 23, 93, 76, 82, 59],
[71, 53, 22, 68, 2, 8, 87, 1, 86, 11],
[16, 67, 24, 44, 14, 51, 36, 74, 27, 62],
[91, 12, 18, 17, 7, 34, 20, 10, 94, 85]])
It’s as easy as that!
Point Cluster Distance (PCD)
Some of the evaluation metrics need a pairwise distance matrix.
This sounds very simple: you load scipy.spatial.distance and you’re on your
way. But wait! There is bad news: if your sample is large, I mean
larger than hundreds of thousands of cells, you will quickly run into
trouble because the memory complexity is O(N^2). This is scary because
it quickly outpaces what is reasonable for any user.
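A quick back-of-the-envelope calculation shows why (500,000 cells is just an illustrative number):
>>> n_cells = 500_000  # a hypothetical, but not unusual, CyTOF sample size
>>> n_cells * n_cells * 8 / 1e12  # a full float64 pairwise distance matrix, in terabytes
2.0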
There are a few solutions:
We find an alternative approach that avoids computing pairwise distances altogether.
We compromise and compute only a partial pairwise distance matrix.
Here, PCD belongs to the second solution. Instead of finding the pairwise distance between every pair of cells, we utilize clusters! The idea is that with clusters, we can easily compute cluster centroids, and by definition, we expect the number of clusters k to be much, much smaller than the number of cells. Thus, we compute the pairwise distance between cluster centroids and all cells, effectively reducing the memory complexity to O(k*N). This makes it feasible to work with datasets even in the realm of millions of cells, as long as the number of clusters and the number of features in the original expression matrix are reasonable.
While this sounds very easy and you could probably implement it in 10 minutes, the good news is that we have already implemented it (save the 10 minutes for a walk outside)! Now, let’s look at the interface!
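To make the idea concrete, here is a minimal numpy sketch of the computation; this is our own toy illustration, not the package's actual implementation:
>>> import numpy as np
>>> def point_cluster_distance(X, labels):
...     """A toy version of PCD: distances from every cell to every cluster centroid."""
...     clusters = np.unique(labels)
...     # one centroid per cluster
...     centroids = np.stack([X[labels == c].mean(axis = 0) for c in clusters])
...     # resulting matrix is N cells by k clusters instead of N by N
...     return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = -1), clusters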
Using the PointClusterDistance class
The PointClusterDistance interface is akin to that of sklearn’s estimators.
Here, we need both the expression matrix and the labels. We will simulate them
using random arrays:
>>> import numpy as np
>>> expression = np.random.rand(100, 2)
>>> expression
array([[3.89241108e-01, 3.74894390e-01],
[2.53484808e-02, 8.38034055e-01],
[1.81716130e-01, 6.77570720e-01],
[2.81618520e-01, 2.85328034e-01],
[8.72380359e-01, 9.43282666e-01],
...,
[5.64692819e-01, 3.01563659e-01],
[1.52626514e-01, 9.19922564e-01],
[9.13887763e-01, 1.12411075e-01],
[8.86792310e-01, 2.78847601e-01]])
>>> labels = np.repeat(["a", "b"], 50)
>>> labels
array(['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'], dtype='<U1')
Now, we can instantiate our PointClusterDistance:
>>> from CytofDR.evaluation import PointClusterDistance
>>> pcd = PointClusterDistance(X = expression, labels = labels)
>>> pcd.fit()
array([[0.13338769, 0.11425074],
[0.57807929, 0.57630871],
[0.35672805, 0.35369442],
[0.27244763, 0.25324668],
...
[0.19573966, 0.19065184],
[0.54919051, 0.55249102],
[0.56914471, 0.57123187],
[0.45559932, 0.4623146 ]])
This should look familiar to you! The fit() method returns
a distance array by default: the rows are cells and the columns are
clusters. To get the indices for the cluster labels:
>>> pcd.index
array(['a', 'b'], dtype='<U1')
This mapping tells you which cluster each column of the distance array represents.
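As an example, assuming you capture the output of fit() in a variable (called dists here), you can pair each column with its cluster label:
>>> dists = pcd.fit()
>>> per_cluster = {label: dists[:, i] for i, label in enumerate(pcd.index)}  # one distance vector per cluster label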
For convenience, you can also flatten the result into a 1-d array (i.e. a vector) so that you can run other metrics:
>>> pcd.fit(flatten = True)
array([0.13338769, 0.11425074, 0.57807929, 0.57630871, 0.35672805,
0.35369442, 0.27244763, 0.25324668, 0.6125959 , 0.63173534,
..., 0.55249102, 0.56914471, 0.57123187, 0.45559932, 0.4623146 ])
Using PCD for DR Evaluation
PCD is the default for evaluation when labels are available or when auto_cluster is set to True in the evaluate method. Suppose we have a Reductions object named results; the simplest case is:
>>> results.evaluate(category = ["global"], auto_cluster = True, n_clusters = 20)
Evaluating global...
Or, if you prefer your own clustering results or labels, you can add them as well:
>>> results.add_evaluation_metadata(original_labels = original_labels, embedding_labels = embedding_labels)
>>> results.evaluate(category = ["global"], n_clusters = 20)
Evaluating global...
where we assume you already have original_labels and embedding_labels as arrays. In this latter case, you no longer need auto_cluster.
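If you do not have labels yet, any clustering method will do. As a purely hypothetical example using scikit-learn's KMeans (original_expression and embedding stand in for your own arrays):
>>> from sklearn.cluster import KMeans
>>> original_labels = KMeans(n_clusters = 20, n_init = 10).fit_predict(original_expression)
>>> embedding_labels = KMeans(n_clusters = 20, n_init = 10).fit_predict(embedding)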
PCD Alternative: Pairwise Downsample
If you don’t have clustering information to perform PCD or prefer a downsampling-based approach, you can use the pairwise_downsample option to tackle the sample size issue. Namely, we randomly downsample both the original dataset and the embedding and then find the pairwise distances. In practice, we have found that PCD and pairwise downsampling are very similar in terms of how they evaluate DR methods.
This metric can only be used for evaluation. To use it, you can just specify it while calling the evaluate method on your Reductions object:
>>> results.evaluate(category = ["global"], pwd_metric = "pairwise_downsample", pairwise_downsample_size = 10000)
Evaluating global...
This will downsample to 10,000 cells, which is also the default. Of course, you can choose a different value, but to keep memory usage reasonable, we don’t recommend anything much larger.
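For intuition, here is a rough sketch of the idea behind pairwise downsampling; this is our own illustration rather than the package's exact implementation, and original_expression and embedding are placeholder names:
>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> rng = np.random.default_rng()
>>> idx = rng.choice(original_expression.shape[0], size = 10000, replace = False)  # sample the same cells in both spaces
>>> original_distances = pdist(original_expression[idx])  # condensed pairwise distances on the subsample
>>> embedding_distances = pdist(embedding[idx])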