Friday, May 01, 2020

The Importance of Data Clarity

“3.6 Roentgen.
    Not Great, Not Terrible”

     - Anatoly Dyatlov (Chernobyl, HBO, 2019)

So, I posted my last results for Los Angeles to my neighborhood on Nextdoor.  With traditional Nextdoor diplomacy, one of the neighbors said that the data was useless.  While I argued that it was worthwhile to examine the data regardless, as it was the best that we had, I thought this post should show the effect of limited testing on data quality, in particular with the state of California.

The quote I used at the start of this article, of course, is a reference to the amazing HBO miniseries, Chernobyl.  The authorities had been making decisions based on a meter that had, as its highest reading, a radiation dose of 3.6 Roentgen.  Later in the show we see that the true value was 15,000, not 3.6.  What does this mean for us?  I think we experienced a similar event.

As much as I would like to say otherwise, the numbers for Los Angeles since April 20 have been terrible.  On April 20 alone, the number of cases shot up by 1,475 (the previous highest ever single-day increase before then was “only” 711).  I tried to tell myself that this was because of a surge of reported cases due to a long-standing backlog.  And, to some extent, this was true.  The new case numbers began to fall, finally getting back to 461 new cases on April 26.  However, they have since begun what can only be described as an inexorable climb, resulting in 1,033 new cases reported today, May 1.

How can there be good news in this?  How can we say we’ve even peaked?  Much less that we are declining?  Is there any hope? 

The answer, it appears, is that there is.

I always knew that our testing limitations were affecting the data, but I had no way of knowing how.  I knew that, behind the numbers, was the bias that the amount of testing each day was not constant, and therefore the new cases would be skewed by this number.  It’s very interesting to see how much that was the case.

Let’s zoom out a bit, from Los Angeles to California as a whole.  Using data on testing, shown here, we observe a few things.  The first is that around April 22, there were over 165,000 new test results (the previous numbers were closer to 20,000 new tests/day).  This caused the massive spike, as there was, indeed, over a week’s worth of results arriving in a single day.

But, what’s really interesting is if we plot the number of new cases as a percent of the number of new tests.  Every time we see 100%, we have our “3.6 Roentgen” moment.  We report that many new cases because that’s how many new tests we had.  The *actual* number of new cases was much, much higher.
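As a minimal sketch of this metric (with made-up numbers, not the real Los Angeles data), we can divide each day’s new cases by that day’s new tests.  Any day at 100% positivity is a day the meter was pegged:

```python
# Hypothetical daily counts -- illustration only, not actual LA/CA data.
new_tests = [500, 500, 600, 800, 1200]
new_cases = [500, 500, 480, 520, 540]

# New cases as a percent of new tests performed that day.
positivity = [100.0 * c / t for c, t in zip(new_cases, new_tests)]

for day, pct in enumerate(positivity, start=1):
    flag = "  <-- pegged meter (3.6 Roentgen moment)" if pct >= 100.0 else ""
    print(f"day {day}: {pct:.0f}% positive{flag}")
```

In the first two (hypothetical) days, every single test came back positive, so the reported case count tells us only how many tests were run, nothing about the true number of infections.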

We’ve been reading from a pegged-out meter for most of March and April. 

This is why Dr. Fauci (and many others) have been saying that we need more tests.  Only by testing widely enough to see past the confirmed infections can we learn about the trends.

So, where’s the hope?  It comes on April 14.  Let’s zoom in to that last part of the curve.

This is the graph to look at.  If we are ramping up our testing faster than our actual cases are falling, we will continue to see an increase in reported cases (until we finally exhaust all of our infected, and then the chart will plunge).  Right now, that seems to be happening.  However, this graph effectively normalizes the curve to assume a constant number of tests per day.  And when we do *that*, the curve is finally starting to fall.
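That normalization can be sketched in a few lines.  Again with hypothetical numbers: raw case counts can rise day after day purely because testing is ramping up, while cases per 1,000 tests (testing held constant) are actually falling:

```python
# Hypothetical daily counts -- illustration only, not the real data.
# Testing doubles each day; raw positives rise, but more slowly.
new_tests = [1000, 2000, 4000, 8000, 16000]
new_cases = [400, 700, 1100, 1600, 2200]

# Normalize to a constant testing volume: cases per 1,000 tests.
per_1000 = [1000 * c / t for c, t in zip(new_cases, new_tests)]

raw_rising = all(b > a for a, b in zip(new_cases, new_cases[1:]))
norm_falling = all(b < a for a, b in zip(per_1000, per_1000[1:]))

print("raw cases rising every day:      ", raw_rising)
print("cases per 1,000 tests falling:   ", norm_falling)
print("per-1,000 series:", per_1000)
```

In this toy series, the raw counts climb every single day while the normalized series drops from 400 to 137.5 per 1,000 tests, which is exactly the divergence between the scary headline numbers and the hopeful underlying trend.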

This data is not dropping particularly quickly, and it’s hard to make predictions from it.  However, at this rate, it implies that we’re on track for another two weeks or so until we get to zero.  I’m really reluctant to make that kind of statement, given the misadventures with Easter, with Ventura and Orange County beaches, and with our population of tactically armored, Confederate flag-waving, “Patriot” cosplayers.  However, assuming that cooler heads prevail, this is a very important piece of the puzzle, and evidence that data quality is a thing.