IO and Preprocessing with PyCytoData
If you want to preprocess your CyTOF dataset before running DR, fear not! We have
you covered. We are also the developers of PyCytoData, which focuses on
IO and preprocessing for CyTOF experiments. It also lets us use a single
pipeline for everything. This tutorial shows how to use this pipeline
for a DR-focused project.
For more in-depth material, please see PyCytoData’s
official documentation.
Loading Benchmark Datasets
Previously, in the Quick Start Guide,
we showed how to load datasets with numpy, which is very easy. However, if you want
to work with a few well-known benchmark datasets, such as levine13 and levine32,
PyCytoData offers an easy way to do so:
>>> from PyCytoData import DataLoader
>>> exprs = DataLoader.load_dataset(dataset = "levine13")
Would you like to download levine13? [y/n]y
Download in progress...
This may take quite a while, go grab a coffee or cytomulate it!
You have now successfully downloaded the levine13 dataset. The dataset is automatically
cached, so you don’t have to download it again every time you use it. You can
access the expression matrix easily:
>>> exprs.expression_matrix
array([[ 5.75381927e+01, 1.21189880e+01, 2.75074673e+00, ...,
2.60543274e+02, 1.54974432e+01, 8.29685116e+00],
[ 8.16322708e+01, 2.34020500e+01, 1.57276118e+00, ...,
1.75833466e+02, 2.17522359e+00, 3.34277302e-01],
[ 2.10737019e+01, 4.41922474e+00, -5.81668496e-01, ...,
2.27592499e+02, 6.24691308e-01, -1.94343376e+01],
...,
[ 1.59633112e+01, 9.53633595e+00, 4.49561157e+01, ...,
3.46169220e+02, 2.27766180e+00, 4.33450623e+01],
[ 2.25081215e+01, 8.42314911e+00, 8.56426620e+01, ...,
6.43495300e+02, 5.97545290e+00, 8.84256649e+00],
[ 2.82463398e+01, 7.47339916e+00, 5.64270020e+01, ...,
6.65499023e+02, -7.26899445e-01, 7.11599884e+01]])
From here, you can use the expression matrix to do everything you need to
do in CytofDR.
We have the following datasets available:
Dataset Name | Literal
Levine-13dim | levine13
Levine-32dim | levine32
Samusik      | samusik
Currently, they have mostly been preprocessed, except for the Arcsinh transformation,
which we detail below.
Loading Your Own Dataset
Of course, you don’t have to use a benchmark dataset! You can use your own dataset:
>>> from PyCytoData import FileIO
>>> exprs = FileIO.load_expression(dataset = "PATH_TO_EXPRS", col_names=True, delim="\t")
>>> type(exprs)
<class 'PyCytoData.data.PyCytoData'>
This closely resembles the numpy approach, or the R approach if you’re familiar with it.
Here, we assume that the data is stored in a plain-text, delimited file. Rows are cells and columns
are features. If col_names=True, the first row is treated as channel names. Again,
this is a PyCytoData object, and you can access its expression_matrix for all your DR needs.
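To make the expected file layout concrete, here is a minimal sketch that uses only the Python standard library. It is not PyCytoData's implementation: `read_expression_table` is a hypothetical helper, and the channel names and values are made up for illustration. It only shows what "rows are cells, columns are features, first row is channel names" looks like for a tab-delimited file.

```python
import csv
import os
import tempfile


def read_expression_table(path, col_names=True, delim="\t"):
    """Parse a delimited text file into (channel names, rows of floats).

    Hypothetical helper for illustration only; it mirrors the layout
    described in the text, not PyCytoData's internals.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter=delim))
    # With col_names=True the first row holds channel names, not data.
    channels = rows[0] if col_names else None
    data = rows[1:] if col_names else rows
    return channels, [[float(value) for value in row] for row in data]


# Write a tiny example file: two cells (rows) and three channels (columns).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("CD3\tCD4\tCD8\n1.5\t0.0\t2.25\n3.0\t4.5\t0.5\n")
    demo_path = tmp.name

channels, matrix = read_expression_table(demo_path)
os.remove(demo_path)  # clean up the temporary file

print(channels)
print(matrix)
```

A file laid out like this one is the kind of input the loading call above assumes, with delim set to the separator your file actually uses.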
Preprocessing
Once you have a PyCytoData object such as the ones we’ve created above, preprocessing is
really just one line of code away. We offer the following preprocessing steps:
Arcsinh transformation
Gate to remove debris
Gate for intact cells
Gate for live cells
Gate using Center, Offset, and Residual channels
Bead normalization
And you can pick and choose which of these steps to apply to your particular dataset. For benchmark datasets, all you need to do is this:
>>> exprs.preprocess(arcsinh=True)
Runinng Arcsinh transformation...
Now, you can access your preprocessed expression matrix:
>>> exprs.expression_matrix
array([[ 4.05275087, 2.50151373, 1.12358426, ..., 5.5627837 ,
2.74481299, 2.13009628],
[ 4.40237469, 3.15464461, 0.72199792, ..., 5.16956967,
0.94198797, 0.16637009],
[ 3.05027008, 1.53363094, -0.28688286, ..., 5.42757605,
0.30747774, -2.96967868],
...,
[ 2.77419437, 2.26592833, 3.80618123, ..., 5.84693608,
0.97621692, 3.76972462],
[ 3.11584426, 2.14478932, 4.45031986, ..., 6.46691714,
1.81455806, 2.19212776],
[ 3.34221489, 2.02879191, 4.03326172, ..., 6.50053943,
-0.35588934, 4.26512812]])
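As a sanity check on what the transformation does, the values printed above are numerically consistent with applying asinh(x / 2) to the raw levine13 values shown earlier in this tutorial. The sketch below verifies this for the first three entries of the first row using only the standard library. The cofactor of 2 is an assumption inferred from the printed numbers, not taken from PyCytoData's documentation, and `arcsinh_transform` is a hypothetical helper.

```python
import math


def arcsinh_transform(value, cofactor):
    """Return asinh(value / cofactor) for a single raw intensity."""
    return math.asinh(value / cofactor)


# First three raw values of the first row printed for levine13, paired with
# the corresponding preprocessed values printed after preprocess(arcsinh=True).
raw = [57.5381927, 12.1189880, 2.75074673]
expected = [4.05275087, 2.50151373, 1.12358426]

# Assumption: cofactor 2, inferred from the numbers above.
transformed = [arcsinh_transform(x, cofactor=2) for x in raw]

# Largest absolute difference between our values and the printed ones.
max_error = max(abs(t - e) for t, e in zip(transformed, expected))
print(transformed)
print(max_error)
```

If you need a different cofactor for your own data, check PyCytoData's documentation for the relevant preprocess option rather than relying on this inference.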
For your own dataset, you can run the whole suite if you like:
>>> exprs.preprocess(arcsinh=True,
... gate_debris_removal=True,
... gate_intact_cells=True,
... gate_live_cells=True,
... gate_center_offset_residual=True,
... bead_normalization=True)
Runinng Arcsinh transformation...
Runinng debris remvoal...
Runinng gating intact cells...
Runinng gating live cells...
Runinng gating Center, Offset, and Residual...
Runinng bead normalization...
Using CytofDR in PyCytoData
In the tutorial above, we’ve showcased how to extract the expression matrix and
then work with CytofDR. This works perfectly, but you may wonder whether it’s
possible to stay within the PyCytoData object. The answer is of course yes!
We’ve provided the run_dr_methods interface to PyCytoData, but you can
also store a Reductions object within your PyCytoData object. This
section will show you how to do so.
Quick DR with run_dr_methods
Once you have a PyCytoData object, you can simply run the method (here, we
will keep using the object created in the tutorials above):
>>> exprs.run_dr_methods(methods = ["PCA", "UMAP", "ICA"])
Running PCA
Running ICA
Running UMAP
>>> type(exprs.reductions)
<class 'CytofDR.dr.Reductions'>
This will already look familiar if you are familiar with CytofDR. Now,
this function automatically adds the expression matrix and cell types to
the Reductions object (the latter only if they are not all None):
>>> exprs.expression_matrix
array([[ 5.75381927e+01, 1.21189880e+01, 2.75074673e+00, ...,
2.60543274e+02, 1.54974432e+01, 8.29685116e+00],
[ 8.16322708e+01, 2.34020500e+01, 1.57276118e+00, ...,
1.75833466e+02, 2.17522359e+00, 3.34277302e-01],
[ 2.10737019e+01, 4.41922474e+00, -5.81668496e-01, ...,
2.27592499e+02, 6.24691308e-01, -1.94343376e+01],
...,
[ 1.59633112e+01, 9.53633595e+00, 4.49561157e+01, ...,
3.46169220e+02, 2.27766180e+00, 4.33450623e+01],
[ 2.25081215e+01, 8.42314911e+00, 8.56426620e+01, ...,
6.43495300e+02, 5.97545290e+00, 8.84256649e+00],
[ 2.82463398e+01, 7.47339916e+00, 5.64270020e+01, ...,
6.65499023e+02, -7.26899445e-01, 7.11599884e+01]])
>>> exprs.reductions.cell_types
array(['CD11b- Monocyte', 'CD11b- Monocyte', 'CD11b- Monocyte', ...,
'Pre-B I', 'Pre-B I', 'Pre-B I'], dtype='<U17')
Now, you can proceed with whatever you need to do with the Reductions object:
>>> exprs.reductions.evaluate(category=["Global"])
Evaluating global...
>>> exprs.reductions.rank_dr_methods()
{'PCA': 1.5, 'ICA': 2.0, 'UMAP': 2.5}
As you can see, this is really just a wrapper around the CytofDR version that lets you
run DR directly. Further, the reductions attribute stores a Reductions
object, meaning that once you’ve run your DR, you can use any Reductions
object features and workflows as usual.
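Fractional ranks such as {'PCA': 1.5, 'ICA': 2.0, 'UMAP': 2.5} are what you would expect when each method is ranked separately on several metrics and the ranks are then averaged. The sketch below illustrates that idea with the standard library only. It is not CytofDR's implementation: `average_ranks`, the metric names, and the scores are all hypothetical, and this tutorial does not state which direction (higher or lower) counts as better.

```python
def average_ranks(scores):
    """Average per-metric ranks for each method.

    scores maps metric name -> {method name: score}. Within each metric,
    rank 1 goes to the smallest score. Ties are not handled in this sketch.
    """
    totals = {}
    for per_method in scores.values():
        # Order the methods by their score on this metric, smallest first.
        ordered = sorted(per_method, key=per_method.get)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n_metrics = len(scores)
    return {method: total / n_metrics for method, total in totals.items()}


# Made-up scores for two hypothetical metrics.
hypothetical_scores = {
    "metric_a": {"PCA": 0.1, "ICA": 0.5, "UMAP": 0.3},
    "metric_b": {"PCA": 0.4, "ICA": 0.2, "UMAP": 0.9},
}

averaged = average_ranks(hypothetical_scores)
print(averaged)
```

With these made-up scores, PCA is ranked 1 and 2, ICA 3 and 1, and UMAP 2 and 3, so the averages come out to 1.5, 2.0, and 2.5 respectively.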
Note
There is one significant caveat: the transform option is
not implemented here because of the ambiguity it may cause. It may be
added in a future update.
Using Your Own Reductions Object
You may wonder whether you can run DR separately in CytofDR, with more
features, while still using PyCytoData. You can: simply
store your own Reductions object in the PyCytoData object:
>>> from CytofDR import dr
>>> # embedding1 and embedding2 stand for embeddings you have already computed
>>> results = dr.Reductions()
>>> results.add_reduction(reduction = embedding1, name = "your_dr")
>>> results.add_reduction(reduction = embedding2, name = "your_dr2")
>>> exprs.reductions = results
This effectively combines two objects into one! Now, you can proceed as you wish!