Dear Yuan,

Thank you for your interest in our work. Please refer to the methods section of https://arxiv.org/abs/2003.06122, which has been expanded to the following during revision:

Raw expression values were normalized and log transformed. We retained the cell clustering from the original studies when available. For each dataset where per-cell annotation was not available, we re-processed the data from the raw or normalized quantification matrix (whichever was deposited alongside the original publication), following the standard scanpy (version 1.4.3) clustering procedure. When batch information was available, Harmony was used to correct batch effects in the PC space, and the corrected PCs were used to compute nearest-neighbour graphs. To re-annotate the cells, we generated multiple clusterings at different resolutions and picked the one that best matched the published clustering. Manual annotation was then undertaken using the marker genes described in the original publication. Full details can be found in the analysis notebooks available at github.com/Teichlab/covid19_MS1/.
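
The step of picking, among several clustering resolutions, the one that best matches the published clustering can be sketched in plain Python as below. This is a minimal illustration, not the code in the notebooks: the adjusted Rand index as the agreement measure, the function names adjusted_rand_index and pick_best_resolution, and the toy labels are all assumptions made for the example, since the methods text does not state which measure was used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two labelings of the same cells (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree on

    # Count pairs of cells that are co-clustered in both, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (both - expected) / (maximum - expected)


def pick_best_resolution(candidates, published):
    """Return (resolution, score) of the candidate clustering closest to `published`.

    `candidates` maps a resolution value to the per-cell cluster labels
    obtained at that resolution (hypothetical input layout).
    """
    scores = {
        resolution: adjusted_rand_index(labels, published)
        for resolution, labels in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data: eight cells with published annotations and three candidate clusterings.
published = ["T", "T", "T", "B", "B", "B", "Mono", "Mono"]
candidates = {
    0.2: [0, 0, 0, 0, 0, 0, 1, 1],  # under-clustered: T and B cells merged
    0.6: [0, 0, 0, 1, 1, 1, 2, 2],  # same partition as the published labels
    1.2: [0, 0, 1, 2, 2, 3, 4, 5],  # over-clustered
}
best_resolution, best_score = pick_best_resolution(candidates, published)
```

In a real analysis the candidate labels would come from running the clustering at each resolution on the batch-corrected neighbour graph; here they are hard-coded so that the selection logic can be read on its own.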

If you deposit data with us as a raw quantification matrix, we can provide the code used to process your dataset as an analysis notebook for your reference.

Best,

From: Cv19caingest <cv19caingest-bounces@sanger.ac.uk> on behalf of Yuan He <yuanhe777tt@hotmail.com>
Date: Friday, 10 April 2020 at 14:43
To: "cv19caingest@sanger.ac.uk" <cv19caingest@sanger.ac.uk>
Cc: "jpopp4@jhu.edu" <jpopp4@jhu.edu>
Subject: [Cv19caingest] Question about data processing. Thank you! [EXT]

Dear COVID-19 cell atlas,

This is Yuan He, a graduate student in Dr. Alexis Battle’s lab at Johns Hopkins University. Thank you for providing this great platform for researchers to study COVID-19!

We have a question regarding the processing of the single-cell datasets in the *.h5ad files. It states that “Expression values can be in raw counts (which we will re-process) and/or normalised values (which we will serve directly).” However, it’s not clear how the processing was done for each dataset. It would be great if the website could indicate, for each dataset, whether the processing was done by your pipeline or by the authors. It would also be very helpful if your processing pipeline could be made available!

Thank you very much!

Best,

Yuan