The Real Value of the Virtual Cell Challenge
Reflecting on its limitations should guide the future of virtual cell research
Image courtesy of Nano Banana Pro
Recently, the Arc Institute’s first Virtual Cell Challenge (VCC) was completed. The VCC felt like a watershed moment for the nascent field of “virtual cell” research, with big money and big names involved. The ambition was clear: to create a CASP-style competition that will eventually yield an AlphaFold-level breakthrough in modeling molecular and cell biology in silico. Doing so would usher in a new era of much more efficient, quantitative biomedicine and bioengineering.
Unfortunately, it didn’t quite work out as most people hoped. I don’t blame the Arc Institute — it is extremely difficult to run a competition of this sort — and in fact, I appreciate and laud their ambition and efforts to build momentum in the virtual cell field. After all, more VCC competitions are planned for the future, and nobody thought we would actually build a full “virtual cell” in the first iteration of this competition.
But the metrics in the competition ultimately had substantial problems — as highlighted by many other posts — to the extent that some of the leaders in the competition openly named their models “Leaderboard Hacker” and “Hackathon Champion”.
The final validation set leaderboard, featuring “Leaderboard Hacker”.
It also turned out that very simple statistical models operating on pseudobulk data — in which each perturbation is summarized by averaging the gene expression levels of all cells with that perturbation — often outperformed deep learning models. For example, I developed a simple pseudobulk kernel ridge regression method that placed 38th out of over 1200 entries, was 10th in the “generalist” rankings (which involved additional metrics), and was 1st overall in Pearson correlation. A very similar method got 15th overall and 3rd in the generalist category, and based on the descriptions of the winning methods, it seems that the second- and third-place submissions also applied relatively simple models to pseudobulk data, while the winning method combined pseudobulk-derived features with the existing scFoundation model. Ironically, linear models or simple shallow neural networks applied to pseudobulk data appear to have been a dominant strategy in a competition primarily intended to foster deep learning innovation in single-cell prediction. (I’ll note that I tried developing a lot of fancy deep learning methods, but none beat my simple kernel ridge.)
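To make “pseudobulk” concrete, here’s a minimal sketch with an invented toy matrix (six cells, four genes; the gene names, perturbation labels, and numbers are all made up, not from the VCC data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 cells x 4 genes, three cells per perturbation
expr = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
expr["perturbation"] = ["KO_1", "KO_1", "KO_1", "KO_2", "KO_2", "KO_2"]

# Pseudobulk: average expression over all cells sharing a perturbation,
# collapsing the single-cell matrix to one row per perturbation
pseudobulk = expr.groupby("perturbation").mean()
print(pseudobulk)
```

All three VCC metrics were computed on matrices of this one-row-per-perturbation shape, which is part of why single-cell-level modeling bought so little on the leaderboard.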
Overall, some of the most interesting takeaways from the competition might be that it highlighted what we don’t know. What metrics should we be using? What do we hope to get out of virtual cells anyways? What, if anything, are deep learning foundation models learning that can’t be captured by simpler and more interpretable classic regression approaches? I’ll explore each of these questions below.
The metrics, in brief
Several blog posts have highlighted issues with the competition metrics, and I encourage those interested to review those, but I’ll offer my own flavor here, which I think clearly explains the issues with the VCC scoring scheme.
The VCC employed three metrics:
Mean Absolute Error (MAE) between predicted and true perturbation effects, computed on the pseudobulk level (averaging over all cells with a given perturbation)
Perturbation Discrimination Score (PDS), also computed in pseudobulk. This is the least standard metric. It is a “retrieval” metric in that it doesn’t measure your raw prediction error, but rather measures whether your predicted effects of perturbing some gene G are closer to the true effects of perturbing G than to the true effects of perturbing some other gene G2. If your prediction is the closest to the true perturbation of G, your PDS is 1; a random approach will get you about 0.5.
Differential Expression overlap score. DE genes were computed using the Wilcoxon Rank-Sum test. The DE score was then equal to the number of genes overlapping between the true and predicted DE gene sets, divided by the number of true DE genes, with a modification if the method predicted too many DE genes.
The methods were scored relative to the “average” baseline, which predicts that each perturbation effect in the test set will be equal to the average of all perturbation effects in the training set. If your method did worse than AverageBaseline (as I’ll call it) on a metric, it got a 0 for that category; a perfect method got a 1; otherwise it got a score between 0 and 1 depending on how much it improved on the baseline. The final score is the average over the three metric scores. For more details see this page.
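In code, the baseline-relative scheme looks roughly like the sketch below. This is my simplified paraphrase of the scoring page, not the exact official implementation; the “perfect” values (MAE = 0, PDS = 1, DE = 1) are assumptions for illustration:

```python
def relative_score(method_value, baseline_value, perfect_value):
    """Rescale a raw metric so that AverageBaseline -> 0 and perfect -> 1.

    Scores below the baseline are clipped to 0 -- the clipping that,
    as argued below, let contestants ignore MAE entirely.
    """
    span = perfect_value - baseline_value
    if span == 0:
        return 0.0
    score = (method_value - baseline_value) / span
    return max(0.0, min(1.0, score))

def overall_score(mae, pds, de, baselines):
    """Average of the three per-metric relative scores.

    `baselines` holds AverageBaseline's raw values for each metric.
    """
    return (
        relative_score(mae, baselines["mae"], 0.0)   # lower MAE is better
        + relative_score(pds, baselines["pds"], 1.0)  # higher PDS is better
        + relative_score(de, baselines["de"], 1.0)    # higher DE is better
    ) / 3
```

Note that `max(0.0, ...)` means a method with catastrophic MAE loses nothing beyond the MAE category itself — the asymmetry the rest of this section turns on.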
So what was the problem?
(This section is a bit technical, although I’ve tried to simplify it a lot, so feel free to skim or skip.)
MAE is perfectly reasonable and is the most direct measure of the “accuracy” of predictions. The problem is, as has been documented before, it tends to favor conservative methods that don’t predict large effects except on genes that are heavily affected by many perturbations. Thus, AverageBaseline does quite well — it produces large estimates for genes affected by most perturbations (such as general cell stress pathways) and produces small estimates for genes that aren’t usually affected much or have inconsistent effects. It turns out to be hard to beat AverageBaseline at MAE. So most methods didn’t even try. If they did worse than AverageBaseline, they got a score of 0, so there was no reason to care about the MAE, from the perspective of the competition scoring.
But a method that does well on most metrics should be good in MAE too, right? Unfortunately, no. To see why, we turn to PDS.
Recall that PDS doesn’t depend on your error sizes, only whether your predicted perturbation was closer to the true perturbation of the same gene than to perturbations of other genes. For example, suppose the true perturbation P of a gene increases the expression of another gene G by 1 unit, while a different perturbation P2 increases G by 0.1. If your method predicts an increase in G by 1 for perturbation P, you’ve nailed it. Your error is 0, and your distance is 0.9 closer to the true value for P (1) than for P2 (0.1). But now suppose you predict an increase of 10 for G. Your absolute error is 9 (10-1). But your distance from P2 is 9.9 (10-0.1), so you are still 0.9 closer to P than to P2 for gene G. From a PDS perspective, they are the same, even though a prediction of 1 is perfect and 10 is wildly incorrect. If, on the other hand, you under-estimate the effect on G, say 0.55 instead of 1, you are getting both closer to P2 and further from P — you are 0.45 away from each.
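The arithmetic of this toy example can be checked directly (the two true effect values are the ones from the paragraph above):

```python
true_P, true_P2 = 1.0, 0.1  # true effects of perturbations P and P2 on gene G

def margin(pred):
    """How much closer a prediction for P sits to P than to the decoy P2.

    PDS only cares whether this margin is positive, not about the raw error.
    """
    return abs(pred - true_P2) - abs(pred - true_P)

for pred in (1.0, 10.0, 0.55):
    print(f"pred={pred}: error={abs(pred - true_P):.2f}, margin={margin(pred):.2f}")
```

The perfect prediction (1.0) and the wildly inflated one (10.0) earn the identical margin of 0.9, while the conservative underestimate (0.55) sees its margin shrink to zero.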
This is a simple example, and it gets more complicated when many genes and many perturbations are involved, but it highlights that one can make large predictions with big errors and still do well on PDS. Generally, PDS tends to punish methods more for underestimating large perturbation effects on genes that should help distinguish between perturbations than it punishes over-estimation of those effects. Methods are therefore incentivized to inflate medium-to-large effect size estimates, as that will generally improve PDS. So that is what most of the top methods did. The winning method, for example, had an MAE that was over 37 times higher than that of AverageBaseline — and its MAE was the lowest among the top 10 methods! It clearly did not represent biologically realistic expression values for the genes in the perturbed cells, despite doing quite well on the metrics.
(For a much more thorough and technical assessment of the PDS metric, see this recent preprint.)
Meanwhile, the DE score had several issues too. First, I and others noticed pretty quickly that the score could not go higher than about 40%. I verified this by creating a technical replicate: splitting the cells for each perturbation into two halves, then scoring one half against the other, which yielded a DE score of about 39%.
What caused this problem? It turns out that even the technical replicate had very low DE scores for perturbations that had few DE genes. Digging in further, I found several “DE genes” called with very high confidence that had extremely low average expression — e.g. 0 in control cells and 1e-5 in perturbed cells. Recall that Wilcoxon is a rank-based test, which depends on the “ranks” of genes by expression level. I think what happened is that, because so many genes have zero or extremely low expression, a small random fluctuation in estimated expression (which may just be technical noise) can cause a massive change in “rank”. One or two (possibly spurious!) reads mapped to a given lowly-expressed gene could cause its “expression rank” to change drastically by leapfrogging thousands of other very low expression genes. So, each perturbation was associated with a number of completely spurious DE genes, and for low-effect perturbations those were almost all of the DE genes. You couldn’t do well on those perturbations, because all of the DE genes were actually random noise. Thus the ~40% ceiling.
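A quick simulation illustrates the rank-leapfrogging. The gene counts here are invented, just roughly mimicking the sparsity of a single-cell profile:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_rank(values, i):
    """1-based rank of values[i], with ties sharing the average rank,
    as the Wilcoxon rank-sum test assigns ranks."""
    below = np.sum(values < values[i])
    ties = np.sum(values == values[i])
    return below + (ties + 1) / 2

# Hypothetical sparse profile: 12000 unexpressed genes, 6000 expressed ones
counts = np.concatenate([np.zeros(12000), rng.poisson(5.0, 6000).astype(float)])

rank_zero = avg_rank(counts, 0)  # gene 0 sits mid-way through the zero tie-block
counts[0] = 1.0                  # a single (possibly spurious) read maps to it...
rank_one = avg_rank(counts, 0)   # ...and it leapfrogs thousands of other genes

print(f"rank jumped by {rank_one - rank_zero:.0f}")
```

One stray read moves the gene past the entire block of tied zero-count genes — a rank change of roughly six thousand positions from what may be pure technical noise.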
There were other DE score issues that were more competition-specific that I’ll omit for brevity. Suffice to say, both the DE score and PDS could be — and were — extensively gamed, as documented in the aforementioned posts.
Overall, I think the biggest mistake was clipping negative scores to 0 in metrics where methods underperformed AverageBaseline. Without this, there would have been an interesting tradeoff between accuracy and conservatism (MAE) versus making bigger and more distinct predictions (PDS). But because of the zero clipping, contestants disregarded MAE, and consequently we ended up with unrealistic predictions. It is worth noting that, without clipping the MAE score at 0, all of the top 50 methods would have had a negative overall score.
What metrics should we use in the future? (Hint: it’s not what you think)
I am not the first to identify issues with these metrics, and many “virtual cell” or “perturbation prediction” metrics are currently being proposed to deal with these issues. For instance, papers have proposed metrics that measure performance relative to the AverageBaseline without some of the pitfalls mentioned above (see here, here, and here). Others believe that correlation metrics like Pearson are better than the ones chosen for the competition (I’ve said this myself, and I promise it was before I got first place in Pearson).
But I think there is a deeper, more existential question here. What do we mean by “virtual cells”? What do we want to do with them?
By comparison with CASP, determining the 3D structure of proteins from their 1D amino acid sequence was already a very well-defined problem by the time CASP was started. It clearly saves tons of time and carries immense value in understanding protein functions and binding targets. 3D structure prediction was a well-defined “intermediate goal”.
What is the equivalent “intermediate goal” for virtual cells? It’s not clear what concrete downstream value is unlocked by “discriminating between perturbations” or printing a list of “differentially expressed genes”, though there is undoubtedly some value in those things. It could be argued that if we push the MAE to nearly zero, then we can do CRISPRi/a screens entirely in silico, which of course would be quite useful.
But even that doesn’t seem realistic, because cells are not cold, dead machines sitting in a static lump, where you push a lever and then they immediately transform into a different static lump. They are dynamic agents navigating problem spaces, changing phenotypes, constantly cycling through stages of growth and replication, arresting those cycles in response to stress, and possibly killing themselves if damage is too catastrophic.
Ultimately we are missing a sense of time. If we want to really model and understand how cells work in silico, we’ll have to model their dynamics, both internally and in populations. We don’t currently have high-throughput technology to perturb cells and measure their dynamic responses (though there are many exciting developments in molecular recorders and microscopy/imaging). But we do have access to time-series datasets, we can compute RNA velocity, and we have datasets with known trajectories (e.g. differentiation datasets). It’s time to introduce dynamics into virtual cell models.
I would also like to see a move towards metrics that assess changes in cell phenotypes or states. It may not always be meaningful to ask why gene A responds to a perturbation in one experiment while gene B responds in another: cells can take multiple paths to the same outcome, as shown for example in reprogramming experiments. Moreover, the change in expression of a specific gene might not be phenotypically relevant for the cell, as it could be buffered out by post-transcriptional or post-translational regulation. Changes in cell phenotype or state are what we ultimately care about, whether that means cell type, morphology, or disease-like characteristics. I think this will become clearer as we integrate multi-omic data: for example, it’s even less clear whether it is meaningful, or even possible, to predict all open chromatin regions in a given cell in response to some perturbation. Cell phenotype- or state-level predictions are thus probably both more tractable and more meaningful and interpretable.
This brings me to my final point: the intended application is likely important in determining which metrics are worthwhile. Contra my previous statement, if you are interested in short-term mechanisms governing gene regulation, maybe you do care about predicting at least some of the open chromatin regions. But if you are developing a virtual cell method for more efficient cell type reprogramming, you might just want it to give you a cocktail that gives you more iPSCs for less money and with fewer mutations, with less concern for low-level mechanistic details. Or you might care about designing a protocol to recreate disease-relevant cell states in vitro that are normally only seen in vivo, and then designing protocols to push those cells out of those disease-relevant states. How you measure success may vary by application, and the scale of readouts that you want to predict (molecular, cell-level, spatial) may vary too.
Coda
Perhaps I have seemed critical in this post. But overall, I’m optimistic about the long-term value of virtual cell research. On a prosaic level, virtual cells should enable a variety of optimizations to in vitro perturbation and bioengineering protocols that improve cost-efficiency and expand the explorable hypothesis space. More expansively, I view virtual cells as a key bridge to making biology and biomedicine more quantitative and amenable to direct optimization. I think the “druggable target” and “causal gene/protein” disease paradigms are inherently limited and must give way to a more quantitative, systematic, multi-scale understanding of biological systems, if we want to learn how to intervene on those systems to cure diseases and achieve other desired outcomes. Yet the design space of multifactorial interventions at different levels of a system is infeasible to fully explore experimentally, so we must combine experimentation with active or reinforcement learning virtual cell systems to rationally, quantitatively optimize for desired outcomes.
To get closer to that envisaged halcyon age, we first need to develop virtual cell methods that work in the simplest in vitro settings. To do that, we need to take biology seriously. The deep learning revolution has given us incredibly powerful optimization tools, and the single cell revolution has provided the big data needed to feed those tools. But we risk optimizing the wrong problem. What questions do we want to answer? What aspects of cell and molecular biology do we want to model? What sorts of data do we need to build those models? (Likely not just transcriptomic data!) How do we encode relevant information about the experimental protocols employed? What is the actual downstream application? How do we prove that these models are actually useful, as opposed to “better” on some metric that may or may not correspond to something meaningful? We have to carefully think through these questions as a field. If we don’t, we risk building exquisite paths to nowhere.
P.S.: The dominance of simple methods in the VCC
Recently there has been a spate of dueling papers arguing over whether deep learning methods do or do not outperform simple baselines such as AverageBaseline or linear models. The answer seems to depend in part on the metric chosen, and it’s fair to say the question is unsettled thus far. But, as mentioned in the introduction, what was clear was that simple methods (e.g. linear models or simple neural networks) run on pseudobulked data were one of the dominant strategies in the VCC. Here, I’ll speculate as to why that is.
My personal finding was that my simple kernel ridge regression approach did better than any deep learning methods that I developed and all of the pre-existing ones that I tested, including the Arc Institute’s State model. However, many (but not all) of the perturbations that my method did well on were ones that State did well on, and likewise for the perturbations I did poorly on.
What links my method and State? Well, despite attempts to use many different data sources and types of information and machine learning approaches, the only features I ever found to substantially boost my method’s performance were the correlations between perturbation effects in external datasets. The unseen test set perturbations in the VCC data have been performed in publicly accessible genome-wide perturb-seq datasets like the one published by Replogle et al. I reasoned, essentially, that if the effects of two perturbations are highly correlated in external datasets like the Replogle one, they will also be highly correlated in the VCC. This is particularly useful if one such perturbation is in the VCC training set and the other is in the test set: the test perturbation should look similar to the training set one. So I built a similarity kernel based on the correlations of all perturbations in external datasets, and used kernel ridge regression with that kernel to predict the effects of the VCC test set perturbations from the known training perturbation effects. I call this idea “relatedness transfer”.
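Schematically, relatedness transfer can be sketched as below. The synthetic data, array shapes, and the plain correlation kernel are my illustrative assumptions — this is the shape of the idea, not my full VCC pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: effects of 50 perturbations on 500 genes in an
# external dataset (think of a genome-wide perturb-seq atlas like Replogle's),
# given low-rank structure so that some perturbations correlate strongly
n_train, n_test, n_genes = 40, 10, 500
latent = rng.normal(size=(n_train + n_test, 5))
external = latent @ rng.normal(size=(5, n_genes))

# VCC-like pseudobulk training effects: correlated with, but noisier than,
# the external effects for the same perturbations
vcc_train = external[:n_train] + 0.3 * rng.normal(size=(n_train, n_genes))

def corr_kernel(A, B):
    """Similarity kernel: Pearson correlation between perturbation effect vectors."""
    A = (A - A.mean(1, keepdims=True)) / A.std(1, keepdims=True)
    B = (B - B.mean(1, keepdims=True)) / B.std(1, keepdims=True)
    return A @ B.T / A.shape[1]

# Kernel ridge regression with the precomputed external-similarity kernel:
#   predictions = K_test_train @ (K_train_train + lam * I)^-1 @ Y_train
K_tr = corr_kernel(external[:n_train], external[:n_train])
K_te = corr_kernel(external[n_train:], external[:n_train])
lam = 0.1
weights = np.linalg.solve(K_tr + lam * np.eye(n_train), vcc_train)
vcc_test_pred = K_te @ weights  # predicted effects for unseen perturbations
```

The key design choice is that the kernel is built entirely from the external data: a test perturbation borrows its predicted effect from whichever training perturbations it resembled *elsewhere*.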
State and other deep learning models for perturbation prediction also rely on (pre-)training on external datasets. I suspect, but cannot prove, that they are ultimately often learning — with much more computation and complexity in-between — that perturbations that are similar in external data are also going to be similar in the VCC. (Undoubtedly they are also learning some things about distributions of transcriptomes of single cells, but this did not affect the VCC metrics much.) In any case, it is clear that perturbations whose correlation patterns with other perturbations looked similar in the external data and in the VCC were largely the “easy” ones, while perturbations that were idiosyncratic to the VCC data (dissimilar to anything in the external data) were the “hard” ones (with respect to perturbation discrimination score, at least). It’s unclear whether any method, complex or simple, has developed a good way to deal with those hard, idiosyncratic perturbations.
I think this case study demonstrates that it’s important to try to develop strong simple models, both to see whether more complicated deep generative models are actually improving upon them, and to use them as a lens to try to understand what types of information the complex models are picking up on and what is still yet to be resolved.
(Clarification: this postscript was meant to reflect specifically on performance in the VCC and its chosen metrics. It is not meant to suggest that State, which I think is a very interesting and well-designed method, and which was not designed specifically for the VCC, does nothing better than ridge regression. Nor is it meant to reflect general skepticism of deep learning. I myself spent most of the challenge developing deep learning methods based on GNNs and flow matching, fully expecting those to outperform simple methods. I tried a simple kernel ridge approach a few weeks before the end of the contest and it instantly outperformed my previous attempts and existing state of the art methods. This is my attempt to make sense of that.)


