IO and Preprocessing with PyCytoData
If you want to preprocess your CyTOF dataset before running DR, fear not! We have
you covered. We are also the developers of PyCytoData, which focuses on
IO and preprocessing for CyTOF experiments. It also lets us use a single
pipeline for everything. This tutorial showcases how to use this pipeline
for a DR-focused project.
For more in-depth information, feel free to read PyCytoData’s
Official Documentation.
Loading Benchmark Datasets
Previously in the Quick Start Guide, we showcased how to load datasets with numpy,
which is very easy. However, if you want to work with a few famous benchmark
datasets, such as levine13 and levine32, PyCytoData offers an easy solution
to help you achieve that goal:
>>> from PyCytoData import DataLoader
>>> exprs = DataLoader.load_dataset(dataset = "levine13")
Would you like to download levine13? [y/n]y
Download in progress...
This may take quite a while, go grab a coffee or cytomulate it!
And you have successfully downloaded the levine13 dataset. The dataset is automatically
cached, so you don’t have to download it again every time you use it. You can
access the expression matrix easily:
>>> exprs.expression_matrix
array([[ 5.75381927e+01, 1.21189880e+01, 2.75074673e+00, ...,
2.60543274e+02, 1.54974432e+01, 8.29685116e+00],
[ 8.16322708e+01, 2.34020500e+01, 1.57276118e+00, ...,
1.75833466e+02, 2.17522359e+00, 3.34277302e-01],
[ 2.10737019e+01, 4.41922474e+00, -5.81668496e-01, ...,
2.27592499e+02, 6.24691308e-01, -1.94343376e+01],
...,
[ 1.59633112e+01, 9.53633595e+00, 4.49561157e+01, ...,
3.46169220e+02, 2.27766180e+00, 4.33450623e+01],
[ 2.25081215e+01, 8.42314911e+00, 8.56426620e+01, ...,
6.43495300e+02, 5.97545290e+00, 8.84256649e+00],
[ 2.82463398e+01, 7.47339916e+00, 5.64270020e+01, ...,
6.65499023e+02, -7.26899445e-01, 7.11599884e+01]])
From here, you can use the expression matrix to do everything you need to
do in CytofDR.
We have the following datasets available:
Dataset Name | Literal
Levine-13dim | levine13
Levine-32dim | levine32
Samusik | samusik
Currently, they have mostly been preprocessed, except for the Arcsinh transformation,
which we will detail below.
Loading Your Own Dataset
Of course, you don’t have to use a benchmark dataset! You can use your own dataset:
>>> from PyCytoData import FileIO
>>> exprs = FileIO.load_expression(dataset = "PATH_TO_EXPRS", col_names=True, delim="\t")
>>> type(exprs)
<class 'PyCytoData.data.PyCytoData'>
This is very reminiscent of the numpy approach, or the R approach if you’re familiar with it.
Here, we assume that the data is stored in a plain-text, delimited file. Rows are cells and columns
are features. If col_names=True, then the first row is treated as channel names. And again,
this is a PyCytoData object, and you can access its expression_matrix
for all your DR needs.
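To make the expected file layout concrete, here is a minimal sketch of a tab-delimited expression file and how it could be parsed with the standard library alone. This is not how PyCytoData reads files internally: the function read_delimited_expression, the channel names, and the values are all hypothetical and exist only to illustrate "rows are cells, columns are features, first row is channel names".

```python
import csv
import io

# Hypothetical file contents: a header row of channel names,
# followed by one row per cell (values are made up).
FILE_TEXT = "CD3\tCD4\tCD8\n1.5\t0.2\t3.1\n0.0\t2.4\t0.7\n"


def read_delimited_expression(handle, col_names=True, delim="\t"):
    """Illustrative parser (not PyCytoData's): return (channels, matrix)."""
    rows = list(csv.reader(handle, delimiter=delim))
    # When col_names is True, the first row holds channel names.
    channels = rows[0] if col_names else None
    data_rows = rows[1:] if col_names else rows
    # Each remaining row is one cell; each column is one feature.
    matrix = [[float(value) for value in row] for row in data_rows]
    return channels, matrix


channels, matrix = read_delimited_expression(io.StringIO(FILE_TEXT))
print(channels)
print(len(matrix), "cells x", len(matrix[0]), "channels")
```

If your file lacks a header row, the same sketch with col_names=False treats every row as a cell and returns None for the channel names.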
Preprocessing
Once you have a PyCytoData object such as the ones we’ve created above, preprocessing is
really just one line of code away. We offer the following preprocessing steps:
Arcsinh transformation
Gate to remove debris
Gate for intact cells
Gate for live cells
Gate using Center, Offset, and Residual channels
Bead normalization
And you can pick and choose which of these steps to apply to your particular dataset. For benchmark datasets, all you need to do is this:
>>> exprs.preprocess(arcsinh=True)
Runinng Arcsinh transformation...
Now, you can access your preprocessed expression matrix:
>>> exprs.expression_matrix
array([[ 4.05275087, 2.50151373, 1.12358426, ..., 5.5627837 ,
2.74481299, 2.13009628],
[ 4.40237469, 3.15464461, 0.72199792, ..., 5.16956967,
0.94198797, 0.16637009],
[ 3.05027008, 1.53363094, -0.28688286, ..., 5.42757605,
0.30747774, -2.96967868],
...,
[ 2.77419437, 2.26592833, 3.80618123, ..., 5.84693608,
0.97621692, 3.76972462],
[ 3.11584426, 2.14478932, 4.45031986, ..., 6.46691714,
1.81455806, 2.19212776],
[ 3.34221489, 2.02879191, 4.03326172, ..., 6.50053943,
-0.35588934, 4.26512812]])
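For intuition about what the Arcsinh step does to each value, here is a small standard-library sketch that applies arcsinh(x / cofactor) elementwise. The function arcsinh_transform is hypothetical, not part of PyCytoData. The cofactor of 2 is an assumption, chosen only because it closely reproduces the first few transformed values printed above from the raw levine13 values shown earlier. The cofactor that PyCytoData uses by default is not stated in this tutorial.

```python
import math


def arcsinh_transform(matrix, cofactor):
    """Illustrative helper (not PyCytoData's): arcsinh(x / cofactor) per entry."""
    return [[math.asinh(value / cofactor) for value in row] for row in matrix]


# First three channels of the first two cells from the raw levine13 output.
raw = [
    [57.5381927, 12.1189880, 2.75074673],
    [81.6322708, 23.4020500, 1.57276118],
]

# Assumption: cofactor=2 is inferred from the printed output, not from the library.
transformed = arcsinh_transform(raw, cofactor=2)
for row in transformed:
    print([round(value, 4) for value in row])
```

The transformation compresses large intensities roughly logarithmically while staying nearly linear around zero, which is why negative raw values remain valid after the transformation.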
For your own dataset, you can run the whole suite if you like:
>>> exprs.preprocess(arcsinh=True,
... gate_debris_removal=True,
... gate_intact_cells=True,
... gate_live_cells=True,
... gate_center_offset_residual=True,
... bead_normalization=True)
Runinng Arcsinh transformation...
Runinng debris remvoal...
Runinng gating intact cells...
Runinng gating live cells...
Runinng gating Center, Offset, and Residual...
Runinng bead normalization...
Using CytofDR in PyCytoData
In the tutorial above, we’ve showcased how to extract the expression matrix and
then work with CytofDR. This works perfectly, but you may wonder whether it’s
possible to stay within the PyCytoData object. The answer is of course yes!
We’ve provided the run_dr_methods interface to PyCytoData, but you can
also store a Reductions object within your PyCytoData object. This
section will show you how to do so.
Quick DR with run_dr_methods
Once you have a PyCytoData object, you can simply run the method (here, we
will keep using the object created in the tutorials above):
>>> exprs.run_dr_methods(methods = ["PCA", "UMAP", "ICA"])
Running PCA
Running ICA
Running UMAP
>>> type(exprs.reductions)
<class 'CytofDR.dr.Reductions'>
This will look familiar if you have used CytofDR before. Now,
this function automatically adds the expression matrix and cell types to
the object (if the latter is not all None):
>>> exprs.expression_matrix
array([[ 5.75381927e+01, 1.21189880e+01, 2.75074673e+00, ...,
2.60543274e+02, 1.54974432e+01, 8.29685116e+00],
[ 8.16322708e+01, 2.34020500e+01, 1.57276118e+00, ...,
1.75833466e+02, 2.17522359e+00, 3.34277302e-01],
[ 2.10737019e+01, 4.41922474e+00, -5.81668496e-01, ...,
2.27592499e+02, 6.24691308e-01, -1.94343376e+01],
...,
[ 1.59633112e+01, 9.53633595e+00, 4.49561157e+01, ...,
3.46169220e+02, 2.27766180e+00, 4.33450623e+01],
[ 2.25081215e+01, 8.42314911e+00, 8.56426620e+01, ...,
6.43495300e+02, 5.97545290e+00, 8.84256649e+00],
[ 2.82463398e+01, 7.47339916e+00, 5.64270020e+01, ...,
6.65499023e+02, -7.26899445e-01, 7.11599884e+01]])
>>> exprs.reductions.cell_types
array(['CD11b- Monocyte', 'CD11b- Monocyte', 'CD11b- Monocyte', ...,
'Pre-B I', 'Pre-B I', 'Pre-B I'], dtype='<U17')
Now, you can proceed with whatever you need to do with the Reductions object:
>>> exprs.reductions.evaluate(category=["Global"])
Evaluating global...
>>> exprs.reductions.rank_dr_methods()
{'PCA': 1.5, 'ICA': 2.0, 'UMAP': 2.5}
As you can see, this is really just a wrapper around the CytofDR version that allows you to
run DR directly. Further, the reductions attribute stores a Reductions
object, meaning that once you’ve run your DR, you can use any Reductions
object features and workflows as usual.
Note
There is one significant caveat to note here: the transform option is
not implemented here because of the ambiguity that it may cause. This may be
included in a future feature update.
Using Your Own Reductions Object
You may wonder whether you can do DR separately in CytofDR with more
features while still using PyCytoData. The answer is yes: you can
store your own Reductions object in the PyCytoData object:
>>> from CytofDR import dr
>>> results = dr.Reductions()
>>> results.add_reduction(reduction = embedding1, name = "your_dr")
>>> results.add_reduction(reduction = embedding2, name = "your_dr2")
>>> exprs.reductions = results
This effectively combines two objects into one! Now, you can proceed as you wish!