# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %%
from IPython import get_ipython
# %% [markdown]
# # Estimating COVID-19's $R_t$ in Real-Time
# Kevin Systrom - April 17
#
# In any epidemic, $R_t$ is the measure known as the effective reproduction number. It's the number of people who become infected per infectious person at time $t$. The most well-known version of this number is the basic reproduction number, $R_0$, the value when $t=0$. However, $R_0$ is a single measure that does not adapt to changes in behavior and restrictions.
#
# As a pandemic evolves, increasing restrictions (or the potential lifting of restrictions) changes $R_t$. Knowing the current $R_t$ is essential. When $R_t\gg1$, the pandemic will spread through a large part of the population. If $R_t<1$, the pandemic will slow quickly before it has a chance to infect many people. The lower the $R_t$, the more manageable the situation. In general, any $R_t<1$ means things are under control.
#
# The value of $R_t$ helps us in two ways. (1) It helps us understand how effective our measures have been at controlling an outbreak, and (2) it gives us vital information about whether we should increase or reduce restrictions based on our competing goals of economic prosperity and human safety. [Well-respected epidemiologists argue](https://www.nytimes.com/2020/04/06/opinion/coronavirus-end-social-distancing.html) that tracking $R_t$ is the only way to manage through this crisis.
#
# Yet we don't currently use $R_t$ in this way. In fact, the only real-time measure I've seen has been for [Hong Kong](https://covid19.sph.hku.hk/dashboard). More importantly, it is not enough to understand $R_t$ at a national level. Instead, to manage this crisis effectively, we need to know $R_t$ at a local (state, county, and/or city) level.
#
# What follows is a solution to this problem at the US State level. It's a modified version of a solution created by [Bettencourt & Ribeiro 2008](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002185) to estimate real-time $R_t$ using a Bayesian approach. While this paper estimates a static $R$ value, here we introduce a process model with Gaussian noise to estimate a time-varying $R_t$.
#
# If you have questions, comments, or improvements feel free to get in touch: [[email protected]](mailto:[email protected]). And if it's not entirely clear, I'm not an epidemiologist. At the same time, data is data and statistics are statistics, and this is based on work by well-known epidemiologists, so you can calibrate your beliefs as you wish. In the meantime, I hope you can learn something new, as I did, by reading through this example. Feel free to take this work and apply it elsewhere – internationally or to counties in the United States.
#
# Additionally, a huge thanks to [Frank Dellaert](http://www.twitter.com/fdellaert/) who suggested the addition of the Gaussian process and to [Adam Lerer](http://www.twitter.com/adamlerer/) who implemented the changes. Not only did I learn something new, it made the model much more responsive.
# %%
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.dates import date2num, num2date
from matplotlib import dates as mdates
from matplotlib import ticker
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch
from scipy import stats as sps
from scipy.interpolate import interp1d
from IPython.display import clear_output
"""
FILTERED_REGIONS = [
    'Virgin Islands',
    'American Samoa',
    'Northern Mariana Islands',
    'Guam',
    'Puerto Rico']
FILTERED_REGION_CODES = ['AS', 'GU', 'PR', 'VI', 'MP']
"""
FILTERED_COUNTRIES = [
    'Andorra',
    'Austria',
    'Belgium',
    'Bosnia and Herzegovina',
    'Bulgaria',
    'Croatia',
    'Czechia',
    'Denmark',
    'Finland',
    'France',
    'Germany',
    'Hungary',
    'Iceland',
    'Ireland',
    'Italy',
    'Luxembourg',
    'Netherlands',
    'Montenegro',
    'Moldova',
    'Norway',
    'Portugal',
    'Russia',
    'Serbia',
    'Slovakia',
    'Slovenia',
    'Spain',
    'Sweden',
    'Switzerland',
    'United Kingdom'
]
url = 'jhu.csv'
countries = pd.read_csv(url,
                        usecols=[0, 1, 3],
                        squeeze=True)
countries = countries['state'].drop_duplicates()
FILTERED_COUNTRIES = list(countries)
# FILTERED_COUNTRIES.remove('Angola')
# FILTERED_COUNTRIES.remove('Antigua and Barbuda')
# FILTERED_COUNTRIES.remove('Bahamas')
# FILTERED_COUNTRIES.remove('Barbados')
# FILTERED_COUNTRIES.remove('Bhutan')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
# %% [markdown]
# ## Bettencourt & Ribeiro's Approach
#
# Every day, we learn how many more people have COVID-19. This new case count gives us a clue about the current value of $R_t$. We also figure that the value of $R_t$ today is related to the value of $R_{t-1}$ (yesterday's value) and, for that matter, to every previous value $R_{t-m}$.
#
# With these insights, the authors use [Bayes' rule](https://en.wikipedia.org/wiki/Bayes%27_theorem) to update their beliefs about the true value of $R_t$ based on how many new cases have been reported each day.
#
# This is Bayes' Theorem as we'll use it:
#
# $$ P(R_t|k)=\frac{P(k|R_t)\cdot P(R_t)}{P(k)} $$
#
# This says that, having seen $k$ new cases, we believe the distribution of $R_t$ is equal to:
#
# - The __likelihood__ of seeing $k$ new cases given $R_t$ times ...
# - The __prior__ beliefs about the value of $R_t$ without the data, $P(R_t)$, ...
# - divided by the probability of seeing this many cases in general.
#
# This is for a single day. To make it iterative: every day that passes, we use yesterday's prior $P(R_{t-1})$ to estimate today's prior $P(R_t)$. We will assume the distribution of $R_t$ to be a Gaussian centered around $R_{t-1}$, so $P(R_t|R_{t-1})=\mathcal{N}(R_{t-1}, \sigma)$, where $\sigma$ is a hyperparameter (see below on how we estimate $\sigma$). So on day one:
#
# $$ P(R_1|k_1) \propto P(R_1)\cdot \mathcal{L}(R_1|k_1)$$
#
# On day two:
#
# $$ P(R_2|k_1,k_2) \propto P(R_2)\cdot \mathcal{L}(R_2|k_2) = \sum_{R_1} {P(R_1|k_1)\cdot P(R_2|R_1)\cdot\mathcal{L}(R_2|k_2) }$$
#
# etc.
#
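# Before picking a concrete likelihood, here's a minimal numeric sketch of the day-two update above, on a made-up 3-point grid. Every number here (the grid, $\sigma$, the day-one posterior, and the day-two likelihood values) is invented purely to show the mechanics; the real model below uses a fine grid and a Poisson likelihood.
# %%
import numpy as np
from scipy import stats as sps
# Hypothetical 3-point grid and made-up distributions, for illustration only
r_grid = np.array([0.5, 1.0, 1.5])
posterior_day1 = np.array([0.2, 0.5, 0.3])     # P(R_1|k_1), assumed
likelihood_day2 = np.array([0.1, 0.6, 0.3])    # L(R_2|k_2), assumed
# P(R_2|R_1) = Normal(R_1, sigma) evaluated on the grid, one column per value of R_1
transition = sps.norm(loc=r_grid, scale=0.25).pdf(r_grid[:, None])
transition /= transition.sum(axis=0)           # each column sums to 1
prior_day2 = transition @ posterior_day1       # the sum over R_1 above
posterior_day2 = likelihood_day2 * prior_day2  # multiply by the likelihood
posterior_day2 /= posterior_day2.sum()         # normalize
print(posterior_day2)
# %% [markdown]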
# ### Choosing a Likelihood Function $P\left(k_t|R_t\right)$
#
# A likelihood function says how likely we are to see $k$ new cases, given a value of $R_t$.
#
# Any time you need to model 'arrivals' over some period of time, statisticians like to use the [Poisson Distribution](https://en.wikipedia.org/wiki/Poisson_distribution). Given an average arrival rate of $\lambda$ new cases per day, the probability of seeing $k$ new cases is distributed according to the Poisson distribution:
#
# $$P(k|\lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
# %%
# Column vector of k
k = np.arange(0, 70)[:, None]
# Different values of Lambda
lambdas = [10, 20, 30, 40]
# Evaluate the Probability Mass Function (remember: Poisson is discrete)
y = sps.poisson.pmf(k, lambdas)
# Show the resulting shape
print(y.shape)
# %% [markdown]
# > __Note__: this is a terse expression, which makes it tricky. All I did was make $k$ a column vector. By giving it a column for $k$ and a 'row' for $\lambda$, it will evaluate the PMF over both and produce an array that has $k$ rows and $\lambda$ columns. This is an efficient way of producing many distributions all at once, and __you will see it used again below__!
# %%
fig, ax = plt.subplots()
ax.set(title='Poisson Distribution of Cases\n $p(k|\lambda)$')
plt.plot(k, y,
         marker='o',
         markersize=3,
         lw=0)
plt.legend(title="$\lambda$", labels=lambdas);
# %% [markdown]
# The Poisson distribution says that if you think you're going to have $\lambda$ cases per day, you'll probably get that many, plus or minus some variation based on chance.
#
# But in our case, we know there have been $k$ cases and we need to know what value of $\lambda$ is most likely. In order to do this, we fix $k$ in place while varying $\lambda$. __This is called the likelihood function.__
#
# For example, imagine we observe $k=20$ new cases, and we want to know how likely each $\lambda$ is:
# %%
k = 20
lam = np.linspace(1, 45, 90)
likelihood = pd.Series(data=sps.poisson.pmf(k, lam),
                       index=pd.Index(lam, name='$\lambda$'),
                       name='lambda')
likelihood.plot(title=r'Likelihood $P\left(k_t=20|\lambda\right)$');
# %% [markdown]
# This says that if we see 20 cases, the most likely value of $\lambda$ is (not surprisingly) 20. But we're not certain: it's possible $\lambda$ was 21 or 17 and we saw 20 new cases by chance alone. It also says that it's unlikely $\lambda$ was 40 when we saw 20.
#
# Great. We have $P\left(\lambda_t|k_t\right)$, which is parameterized by $\lambda$, but we were looking for $P\left(k_t|R_t\right)$, which is parameterized by $R_t$. We need to know the relationship between $\lambda$ and $R_t$.
# %% [markdown]
# ### Connecting $\lambda$ and $R_t$
#
# __The key insight to making this work is to realize there's a connection between $R_t$ and $\lambda$__. [The derivation](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002185) is beyond the scope of this notebook, but here it is:
#
# $$ \lambda = k_{t-1}e^{\gamma(R_t-1)}$$
#
# where $\gamma$ is the reciprocal of the serial interval ([about 7 days for COVID-19](https://wwwnc.cdc.gov/eid/article/26/7/20-0282_article)). Since we know the new case count for each previous day, we can now reformulate the likelihood function as a Poisson distribution parameterized by fixing $k$ and varying $R_t$:
#
# $$ \lambda = k_{t-1}e^{\gamma(R_t-1)}$$
#
# $$P\left(k|R_t\right) = \frac{\lambda^k e^{-\lambda}}{k!}$$
#
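# As a quick sanity check with invented numbers: if $k_{t-1}=20$, $\gamma=1/7$ and $R_t=2$, then $\lambda = 20\,e^{1/7} \approx 23.1$, and we can ask how likely any particular new case count would be under that rate.
# %%
import numpy as np
from scipy import stats as sps
# Scalar sanity check of the lambda/R_t relation (illustrative numbers only)
k_prev, gamma, r_t = 20, 1/7, 2.0
lam_check = k_prev * np.exp(gamma * (r_t - 1))
print(f"lambda = {lam_check:.1f}")  # about 23.1
print(f"P(k=25 | R_t=2) = {sps.poisson.pmf(25, lam_check):.3f}")
# %% [markdown]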
# ### Evaluating the Likelihood Function
#
# To continue our example, let's imagine a sample of new case counts $k$. What is the likelihood of different values of $R_t$ on each of those days?
# %%
k = np.array([20, 40, 55, 90])
# We create an array for every possible value of Rt
R_T_MAX = 12
r_t_range = np.linspace(0, R_T_MAX, R_T_MAX*100+1)
# Gamma is 1/serial interval
# https://wwwnc.cdc.gov/eid/article/26/7/20-0282_article
# https://www.nejm.org/doi/full/10.1056/NEJMoa2001316
GAMMA = 1/7
# Map Rt into lambda so we can substitute it into the equation below
# Note that we have N-1 lambdas because on the first day of an outbreak
# you do not know what to expect.
lam = k[:-1] * np.exp(GAMMA * (r_t_range[:, None] - 1))
# Evaluate the likelihood on each day and normalize sum of each day to 1.0
likelihood_r_t = sps.poisson.pmf(k[1:], lam)
likelihood_r_t /= np.sum(likelihood_r_t, axis=0)
# Plot it
ax = pd.DataFrame(
    data = likelihood_r_t,
    index = r_t_range
).plot(
    title='Likelihood of $R_t$ given $k$',
    xlim=(0,10)
)
ax.legend(labels=k[1:], title='New Cases')
ax.set_xlabel('$R_t$');
# %% [markdown]
# You can see that each day we have an independent guess for $R_t$. The goal is to combine the information we have about previous days with the current day. To do this, we use Bayes' theorem.
#
# ### Performing the Bayesian Update
#
# To perform the Bayesian update, we need to multiply the likelihood by the prior (which here is just the previous day's posterior, since we're not yet applying the Gaussian step) to get the posteriors. Let's do that using the cumulative product of each successive day:
# %%
posteriors = likelihood_r_t.cumprod(axis=1)
posteriors = posteriors / np.sum(posteriors, axis=0)
columns = pd.Index(range(1, posteriors.shape[1]+1), name='Day')
posteriors = pd.DataFrame(
    data = posteriors,
    index = r_t_range,
    columns = columns)
ax = posteriors.plot(
    title='Posterior $P(R_t|k)$',
    xlim=(0,10)
)
ax.legend(title='Day')
ax.set_xlabel('$R_t$');
# %% [markdown]
# Notice how on Day 1, our posterior matches Day 1's likelihood from above? That's because we have no information other than that day. However, when we update the prior using Day 2's information, you can see the curve has moved left, but not nearly as far left as the likelihood for Day 2 from above. This is because Bayesian updating uses information from both days and effectively averages the two. Since Day 3's likelihood is in between the other two, you see a small shift to the right, but more importantly: a narrower distribution. We're becoming __more__ confident in our beliefs about the true value of $R_t$.
#
# From these posteriors, we can answer important questions such as "What is the most likely value of $R_t$ each day?"
# %%
most_likely_values = posteriors.idxmax(axis=0)
most_likely_values
# %% [markdown]
# We can also obtain the [highest density intervals](https://www.sciencedirect.com/topics/mathematics/highest-density-interval) for $R_t$:
# %%
def highest_density_interval(pmf, p=.9):
    # If we pass a DataFrame, just call this recursively on the columns
    if isinstance(pmf, pd.DataFrame):
        return pd.DataFrame([highest_density_interval(pmf[col], p=p) for col in pmf],
                            index=pmf.columns)
    cumsum = np.cumsum(pmf.values)
    best = None
    for i, value in enumerate(cumsum):
        for j, high_value in enumerate(cumsum[i+1:]):
            if (high_value - value > p) and (not best or j < best[1] - best[0]):
                best = (i, i+j+1)
                break
    if best is None:
        return pd.Series()
    low = pmf.index[best[0]]
    high = pmf.index[best[1]]
    return pd.Series([low, high], index=[f'Low_{p*100:.0f}', f'High_{p*100:.0f}'])
hdi = highest_density_interval(posteriors)
hdi.tail()
# %% [markdown]
# Finally, we can plot both the most likely values for $R_t$ and the HDIs over time. This is the most useful representation as it shows how our beliefs change with every day.
# %%
ax = most_likely_values.plot(marker='o',
                             label='Most Likely',
                             title=f'$R_t$ by day',
                             c='k',
                             markersize=4)
ax.fill_between(hdi.index,
                hdi['Low_90'],
                hdi['High_90'],
                color='k',
                alpha=.1,
                lw=0,
                label='HDI')
ax.legend();
# %% [markdown]
# We can see that the most likely value of $R_t$ changes with time and the highest-density interval narrows as we become more sure of the true value of $R_t$ over time. Note that since we only had four days of history, I did not apply the Gaussian process to this sample. Next, however, we'll turn to a real-world application where this process is necessary.
# %% [markdown]
# # Real-World Application to Country-Level Data
#
# ### Setup
#
# Load country case data from a local `jhu.csv` (the original notebook loaded US state case data from CovidTracking.com)
# %%
# url = 'https://covidtracking.com/api/v1/states/daily.csv'  # original US data source
url = 'jhu.csv'
"""
states = pd.read_csv(url,
                     usecols=['date', 'state', 'positive'],
                     parse_dates=['date'],
                     index_col=['state', 'date'],
                     squeeze=True).sort_index()
"""
states = pd.read_csv(url,
                     usecols=[0, 1, 3],
                     index_col=['state', 'date'],
                     parse_dates=['date'],
                     squeeze=True).sort_index()
# %% [markdown]
# Taking a look at the data, we need to start the analysis when there is a consistent number of cases each day. Find the last zero-new-case day and start on the day after that.
#
# Also, case reporting is very erratic based on testing backlogs, etc. To get the best view of the 'true' data we can, I've applied a Gaussian filter to the time series. This is obviously an arbitrary choice, but you'd imagine the real-world process is not nearly as stochastic as the actual reporting.
# %%
# state_name = 'NY'
state_name = 'Germany'
def prepare_cases(cases):
    new_cases = cases.diff()
    smoothed = new_cases.rolling(9,
                                 win_type='gaussian',
                                 min_periods=1,
                                 center=True).mean(std=3).round()
    smoothed_tmp = smoothed
    for i in range(10, 0, -1):
        idx_start = np.searchsorted(smoothed, i)
        smoothed_tmp = smoothed.iloc[idx_start:]
        original = new_cases.loc[smoothed_tmp.index]
        if len(smoothed_tmp) > 0:
            break
    return original, smoothed_tmp
cases = states.xs(state_name).rename(f"{state_name} cases")
original, smoothed = prepare_cases(cases)
original.plot(title=f"{state_name} New Cases per Day",
              c='k',
              linestyle=':',
              alpha=.5,
              label='Actual',
              legend=True,
              figsize=(500/72, 400/72))
ax = smoothed.plot(label='Smoothed',
                   legend=True)
ax.get_figure().set_facecolor('w')
# fig.savefig(state_name + "_new_cases.png")
# %% [markdown]
# ### Running the Algorithm
#
# #### Choosing the Gaussian $\sigma$ for $P(R_t|R_{t-1})$
#
# > Note: you can safely skip this section if you trust that we chose the right value of $\sigma$ for the Gaussian process below. Otherwise, read on.
#
# The original approach simply selects yesterday's posterior as today's prior. While intuitive, doing so doesn't allow for our belief that the value of $R_t$ has likely changed from yesterday. To allow for that change, we apply Gaussian noise to the prior distribution with some standard deviation $\sigma$. The higher the $\sigma$, the more noise, and the more we expect the value of $R_t$ to drift each day. Interestingly, applying noise on noise iteratively means that there will be a natural decay of distant posteriors. This approach has a similar effect to windowing, but is more robust and doesn't arbitrarily forget posteriors after a certain time like my previous approach did. Specifically, windowing computed a fixed $R_t$ at each time $t$ that explained the surrounding $w$ days of cases, while the new approach computes a series of $R_t$ values that explains all the cases, assuming that $R_t$ fluctuates by about $\sigma$ each day.
#
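# As a small illustration of that decay claim (all values assumed: a delta-shaped belief and $\sigma=0.15$ on a grid like `r_t_range` above), here's how repeatedly applying the Gaussian step spreads belief out over time:
# %%
import numpy as np
from scipy import stats as sps
# Illustration only: push a delta-shaped belief through the Gaussian step
grid = np.linspace(0, 12, 1201)           # mirrors r_t_range above
belief = np.zeros_like(grid)
belief[np.searchsorted(grid, 1.0)] = 1.0  # all mass at R_t = 1.0
step = sps.norm(loc=grid, scale=0.15).pdf(grid[:, None])
step /= step.sum(axis=0)                  # columns sum to 1
for day in range(1, 11):
    belief = step @ belief
    mean = (belief * grid).sum()
    std = np.sqrt((belief * (grid - mean)**2).sum())
    if day in (1, 5, 10):
        print(f"day {day}: std = {std:.2f}")  # grows roughly like 0.15*sqrt(day)
# %% [markdown]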
# However, there's still an arbitrary choice: what should $\sigma$ be? Adam Lerer pointed out that we can use the process of maximum likelihood to inform our choice. Here's how it works:
#
# Maximum likelihood says that we'd like to choose a $\sigma$ that maximizes the likelihood of seeing our data $k$: $P(k|\sigma)$. Since $\sigma$ is a fixed value, let's leave it out of the notation, so we're trying to maximize $P(k)$ over all choices of $\sigma$.
#
# Since $P(k)=P(k_0,k_1,\ldots,k_t)=P(k_0)P(k_1)\ldots P(k_t)$ we need to define $P(k_t)$. It turns out this is the denominator of Bayes' rule:
#
# $$P(R_t|k_t) = \frac{P(k_t|R_t)P(R_t)}{P(k_t)}$$
#
# To calculate it, we notice that the numerator is actually just the joint distribution of $k$ and $R$:
#
# $$ P(k_t,R_t) = P(k_t|R_t)P(R_t) $$
#
# We can marginalize the distribution over $R_t$ to get $P(k_t)$:
#
# $$ P(k_t) = \sum_{R_{t}}{P(k_t|R_t)P(R_t)} $$
#
# So, if we sum the distribution of the numerator over all values of $R_t$, we get $P(k_t)$. And since we're calculating that anyway as we're calculating the posterior, we'll just keep track of it separately.
#
# Since we're looking for the value of $\sigma$ that maximizes $P(k)$ overall, we actually want to maximize:
#
# $$\prod_{t,i}{p(k_{ti})}$$
#
# where $t$ is each time and $i$ is each state (here, each country).
#
# Since we're multiplying lots of tiny probabilities together, it can be easier (and less error-prone) to take the $\log$ of the values and add them together. Remember that $\log{ab}=\log{a}+\log{b}$. And since logarithms are monotonically increasing, maximizing the sum of the $\log$ of the probabilities is the same as maximizing the product of the non-logarithmic probabilities for any choice of $\sigma$.
#
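# As a tiny numeric check of that log trick (probabilities invented for illustration):
# %%
import numpy as np
# Product of small probabilities vs. sum of their logs (illustrative values)
probs = np.array([1e-5, 3e-4, 2e-6, 5e-4])
print(np.prod(probs))                 # shrinks toward underflow as the list grows
print(np.sum(np.log(probs)))          # stays well-scaled
print(np.exp(np.sum(np.log(probs))))  # recovers the product
# %% [markdown]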
# ### Function for Calculating the Posteriors
#
# To calculate the posteriors we follow these steps:
# 1. Calculate $\lambda$ - the expected arrival rate for every day's Poisson process
# 2. Calculate each day's likelihood distribution over all possible values of $R_t$
# 3. Calculate the Gaussian process matrix based on the value of $\sigma$ we discussed above
# 4. Calculate our initial prior because our first day does not have a previous day from which to take the posterior
# - Based on [info from the cdc](https://wwwnc.cdc.gov/eid/article/26/7/20-0282_article) we will choose a Gamma with mean 7.
# 5. Loop from day 1 to the end, doing the following:
# - Calculate the prior by applying the Gaussian process matrix to yesterday's posterior.
# - Apply Bayes' rule by multiplying this prior and the likelihood we calculated in step 2.
# - Divide by the probability of the data (also Bayes' rule)
# %%
def get_posteriors(sr, sigma=0.15):
    # (1) Calculate Lambda
    lam = sr[:-1].values * np.exp(GAMMA * (r_t_range[:, None] - 1))
    # (2) Calculate each day's likelihood
    likelihoods = pd.DataFrame(
        data = sps.poisson.pmf(sr[1:].values, lam),
        index = r_t_range,
        columns = sr.index[1:])
    # (3) Create the Gaussian Matrix
    process_matrix = sps.norm(loc=r_t_range,
                              scale=sigma
                              ).pdf(r_t_range[:, None])
    # (3a) Normalize all columns to sum to 1
    process_matrix /= process_matrix.sum(axis=0)
    # (4) Calculate the initial prior
    prior0 = sps.gamma(a=4).pdf(r_t_range)
    prior0 /= prior0.sum()
    # Create a DataFrame that will hold our posteriors for each day
    # Insert our prior as the first posterior.
    posteriors = pd.DataFrame(
        index=r_t_range,
        columns=sr.index,
        data={sr.index[0]: prior0}
    )
    # We said we'd keep track of the sum of the log of the probability
    # of the data for maximum likelihood calculation.
    log_likelihood = 0.0
    # (5) Iteratively apply Bayes' rule
    for previous_day, current_day in zip(sr.index[:-1], sr.index[1:]):
        # (5a) Calculate the new prior
        current_prior = process_matrix @ posteriors[previous_day]
        # (5b) Calculate the numerator of Bayes' Rule: P(k|R_t)P(R_t)
        numerator = likelihoods[current_day] * current_prior
        # (5c) Calculate the denominator of Bayes' Rule P(k)
        denominator = np.sum(numerator)
        # Execute full Bayes' Rule
        posteriors[current_day] = numerator/denominator
        # Add to the running sum of log likelihoods
        log_likelihood += np.log(denominator)
    return posteriors, log_likelihood
# Note that we're fixing sigma to a value just for the example
posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
# %% [markdown]
# ### The Result
#
# Below you can see every day (row) of the posterior distribution plotted simultaneously. The posteriors start without much confidence (wide) and become progressively more confident (narrower) about the true value of $R_t$.
# %%
ax = posteriors.plot(title=f'{state_name} - Daily Posterior for $R_t$',
                     legend=False,
                     lw=1,
                     c='k',
                     alpha=.3,
                     xlim=(0.4,6))
ax.set_xlabel('$R_t$');
# %% [markdown]
# ### Plotting in the Time Domain with Credible Intervals
#
# Since our results include uncertainty, we'd like to be able to view the most likely value of $R_t$ along with its highest-density interval.
# %%
# Note that this takes a while to execute - it's not the most efficient algorithm
hdis = highest_density_interval(posteriors, p=.9)
most_likely = posteriors.idxmax().rename('ML')
# Look into why you shift -1
result = pd.concat([most_likely, hdis], axis=1)
result.tail()
# %%
def plot_rt(result, ax, state_name):
    ax.set_title(f"{state_name}")
    # Colors
    ABOVE = [1,0,0]
    MIDDLE = [1,1,1]
    BELOW = [0,0,0]
    cmap = ListedColormap(np.r_[
        np.linspace(BELOW,MIDDLE,25),
        np.linspace(MIDDLE,ABOVE,25)
    ])
    color_mapped = lambda y: np.clip(y, .5, 1.5)-.5
    index = result['ML'].index.get_level_values('date')
    values = result['ML'].values
    # Plot dots and line
    ax.plot(index, values, c='k', zorder=1, alpha=.25)
    ax.scatter(index,
               values,
               s=40,
               lw=.5,
               c=cmap(color_mapped(values)),
               edgecolors='k', zorder=2)
    # Aesthetically, extrapolate credible interval by 1 day either side
    lowfn = interp1d(date2num(index),
                     result['Low_90'].values,
                     bounds_error=False,
                     fill_value='extrapolate')
    highfn = interp1d(date2num(index),
                      result['High_90'].values,
                      bounds_error=False,
                      fill_value='extrapolate')
    extended = pd.date_range(start=pd.Timestamp('2020-03-01'),
                             end=index[-1]+pd.Timedelta(days=1))
    ax.fill_between(extended,
                    lowfn(date2num(extended)),
                    highfn(date2num(extended)),
                    color='k',
                    alpha=.1,
                    lw=0,
                    zorder=3)
    ax.axhline(1.0, c='k', lw=1, label='$R_t=1.0$', alpha=.25);
    # Formatting
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
    ax.xaxis.set_minor_locator(mdates.DayLocator())
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.1f}"))
    ax.yaxis.tick_right()
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.margins(0)
    ax.grid(which='major', axis='y', c='k', alpha=.1, zorder=-2)
    ax.margins(0)
    ax.set_ylim(0.0, 5.0)
    ax.set_xlim(pd.Timestamp('2020-03-01'), result.index.get_level_values('date')[-1]+pd.Timedelta(days=1))
    fig.set_facecolor('w')
fig, ax = plt.subplots(figsize=(600/72,400/72))
plot_rt(result, ax, state_name)
ax.set_title(f'Real-time $R_t$ for {state_name}')
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
fig.savefig(state_name + ".png")
# %% [markdown]
# ### Choosing the optimal $\sigma$
#
# In the previous section we described how to choose an optimal $\sigma$, but there we just assumed a value. Now that we can evaluate each state at any $\sigma$, we have the tools for choosing the optimal one.
#
# Above we said we'd choose the value of $\sigma$ that maximizes the likelihood of the data $P(k)$. Since we don't want to overfit to any one state, we choose the $\sigma$ that maximizes $P(k)$ over every state. To do this, we add up all the log likelihoods per state for each value of $\sigma$, then choose the maximum.
#
# > Note: this takes a while!
# %%
sigmas = np.linspace(1/20, 1, 20)
targets = states.index.get_level_values('state').isin(FILTERED_COUNTRIES)
states_to_process = states.loc[targets]
results = {}
removed_countries = []
for state_name, cases in states_to_process.groupby(level='state'):
    print(state_name)
    new, smoothed = prepare_cases(cases)
    if len(smoothed) < 2:
        FILTERED_COUNTRIES.remove(state_name)
        removed_countries.append(state_name)
        continue
    result = {}
    # Holds all posteriors with every given value of sigma
    result['posteriors'] = []
    # Holds the log likelihood across all k for each value of sigma
    result['log_likelihoods'] = []
    for sigma in sigmas:
        posteriors, log_likelihood = get_posteriors(smoothed, sigma=sigma)
        result['posteriors'].append(posteriors)
        result['log_likelihoods'].append(log_likelihood)
    # Store all results keyed off of state name
    results[state_name] = result
    clear_output(wait=True)
print('Done.')
print('Removed countries:', removed_countries)
# %% [markdown]
# Now that we have all the log likelihoods, we can sum for each value of sigma across states, graph it, then choose the maximum.
# %%
# Each index of this array holds the total of the log likelihoods for
# the corresponding index of the sigmas array.
total_log_likelihoods = np.zeros_like(sigmas)
# Loop through each state's results and add the log likelihoods to the running total.
for state_name, result in results.items():
    total_log_likelihoods += result['log_likelihoods']
# Select the index with the largest log likelihood total
max_likelihood_index = total_log_likelihoods.argmax()
# Select the value that has the highest log likelihood
sigma = sigmas[max_likelihood_index]
# Plot it
fig, ax = plt.subplots()
ax.set_title(f"Maximum Likelihood value for $\sigma$ = {sigma:.2f}");
ax.plot(sigmas, total_log_likelihoods)
ax.axvline(sigma, color='k', linestyle=":")
# %% [markdown]
# ### Compile Final Results
#
# Given that we've selected the optimal $\sigma$, let's grab the precalculated posterior corresponding to that value of $\sigma$ for each state. Let's also calculate the 90% and 50% highest density intervals (this takes a little while) and also the most likely value.
# %%
final_results = None
for state_name, result in results.items():
    print(state_name)
    posteriors = result['posteriors'][max_likelihood_index]
    hdis_90 = highest_density_interval(posteriors, p=.9)
    hdis_50 = highest_density_interval(posteriors, p=.5)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis_90, hdis_50], axis=1)
    if final_results is None:
        final_results = result
    else:
        final_results = pd.concat([final_results, result])
    clear_output(wait=True)
print('Done.')
# %% [markdown]
# ### Plot All Countries
# %%
ncols = 16
nrows = int(np.ceil(len(results) / ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(60, nrows*3))
for i, (state_name, result) in enumerate(final_results.groupby('state')):
    # print(state_name, result)
    plot_rt(result, axes.flat[i], state_name)
clear_output(wait=True)
fig.tight_layout()
fig.set_facecolor('w')
fig.savefig("world.png")
# %% [markdown]
# ### Export Data to CSV
# %%
# The following line exports the data; comment it out if you don't want the CSV
final_results.to_csv('rt.csv')
# %% [markdown]
# ### Standings
# %%
# As of 4/12
"""
no_lockdown = [
    'North Dakota', 'ND',
    'South Dakota', 'SD',
    'Nebraska', 'NB',
    'Iowa', 'IA',
    'Arkansas', 'AR'
]
partial_lockdown = [
    'Utah', 'UT',
    'Wyoming', 'WY',
    'Oklahoma', 'OK',
    'Massachusetts', 'MA'
]
"""
FULL_COLOR = [.7,.7,.7]
NONE_COLOR = [179/255,35/255,14/255]
PARTIAL_COLOR = [.5,.5,.5]
ERROR_BAR_COLOR = [.3,.3,.3]
# %%
final_results
# %%
filtered = final_results.index.get_level_values(0).isin(FILTERED_COUNTRIES)
mr = final_results.loc[filtered].groupby(level=0)[['ML', 'High_90', 'Low_90']].last()
def plot_standings(mr, figsize=None, title='Most Recent $R_t$ by Country'):
    if not figsize:
        figsize = ((15.9/50)*len(mr)+.1, 2.5)
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    err = mr[['Low_90', 'High_90']].sub(mr['ML'], axis=0).abs()
    bars = ax.bar(mr.index,
                  mr['ML'],
                  width=.825,
                  color=FULL_COLOR,
                  ecolor=ERROR_BAR_COLOR,
                  capsize=2,
                  error_kw={'alpha':.5, 'lw':1},
                  yerr=err.values.T)
    for bar, state_name in zip(bars, mr.index):
        # if state_name in no_lockdown:
        #     bar.set_color(NONE_COLOR)
        # if state_name in partial_lockdown:
        bar.set_color(PARTIAL_COLOR)
    labels = mr.index.to_series().replace({'District of Columbia':'DC'})
    ax.set_xticklabels(labels, rotation=90, fontsize=11)
    ax.margins(0)
    ax.set_ylim(0, 2.)
    ax.axhline(1.0, linestyle=':', color='k', lw=1)
    """
    leg = ax.legend(handles=[
            Patch(label='Full', color=FULL_COLOR),
            Patch(label='Partial', color=PARTIAL_COLOR),
            Patch(label='None', color=NONE_COLOR)
        ],
        title='Lockdown',
        ncol=3,
        loc='upper left',
        columnspacing=.75,
        handletextpad=.5,
        handlelength=1)
    leg._legend_box.align = "left"
    """
    fig.set_facecolor('w')
    return fig, ax
mr.sort_values('ML', inplace=True)
plot_standings(mr);
# %%
mr.sort_values('High_90', inplace=True)
plot_standings(mr);
# %%
show = mr[mr.High_90.le(1)].sort_values('ML')
fig, ax = plot_standings(show, title='Likely Under Control');
# %%
show = mr[mr.Low_90.ge(1.0)].sort_values('Low_90')
fig, ax = plot_standings(show, title='Likely Not Under Control');
# ax.get_legend().remove()
# %%