Animations in the time of Coronavirus

The first four months of 2020 have been dominated by the Coronavirus pandemic (aka COVID-19), which has transformed global life in an unprecedented way. Societies and economies struggle to adapt to the new conditions and necessary constraints. A reassuringly large fraction of governments around the world continue to take evidence-based approaches to this crisis that are grounded in large-scale data collection efforts. Most of this data is being made publicly available and can be studied in real time. This post will describe how to extract and prepare the necessary data to animate the spread of the virus over time within my native country of Germany.

I’ve published a pre-processed version of the relevant data for this project as a Kaggle dataset, together with the geospatial shape files you need to plot the resulting map. This post outlines how to build that dataset from the original source data using a set of tidyverse tools. Then we will use the gganimate and sf packages to create animated map visuals.

These are the packages we need:

libs <- c('dplyr', 'tibble',      # wrangling
          'stringr', 'readr',     # strings, input
          'lubridate', 'tidyr',   # time, wrangling
          'knitr', 'kableExtra',  # table styling
          'ggplot2', 'viridis',   # visuals
          'gganimate', 'sf',      # animations, maps
          'ggthemes')             # visuals
invisible(lapply(libs, library, character.only = TRUE))

The COVID-19 data for Germany are being collected by the Robert Koch Institute and can be downloaded through the National Platform for Geographic Data (which also hosts an interactive dashboard). The earliest recorded cases are from 2020-01-24. Here we define the corresponding link and read the dataset:

infile <- "https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv"
covid_de <- read_csv(infile, col_types = cols())

This data contains a number of columns which are, unsurprisingly, named in German:

covid_de %>% head(5) %>% glimpse()
## Observations: 5
## Variables: 18
## $ FID 4281356, 4281357, 4281358, 4281359, 4281360
## $ IdBundesland 1, 1, 1, 1, 1
## $ Bundesland "Schleswig-Holstein", "Schleswig-Holstein",…
## $ Landkreis "SK Flensburg", "SK Flensburg", "SK Flensbu…
## $ Altersgruppe "A15-A34", "A15-A34", "A15-A34", "A15-A34",…
## $ Geschlecht "M", "M", "M", "M", "M"
## $ AnzahlFall 1, 1, 1, 1, 1
## $ AnzahlTodesfall 0, 0, 0, 0, 0
## $ Meldedatum "2020/03/14 00:00:00", "2020/03/19 00:00:00…
## $ IdLandkreis "01001", "01001", "01001", "01001", "01001"
## $ Datenstand "30.04.2020, 00:00 Uhr", "30.04.2020, 00:00…
## $ NeuerFall 0, 0, 0, 0, 0
## $ NeuerTodesfall -9, -9, -9, -9, -9
## $ Refdatum "2020/03/16 00:00:00", "2020/03/13 00:00:00…
## $ NeuGenesen 0, 0, 0, 0, 0
## $ AnzahlGenesen 1, 1, 1, 1, 1
## $ IstErkrankungsbeginn 1, 1, 1, 1, 1
## $ Altersgruppe2 "nicht übermittelt", "nicht übermittelt", "…

The following code block reshapes and translates the data to make it more accessible. This includes replacing our beloved German umlauts with simplified diphthongs, creating age groups, and aggregating COVID-19 numbers by county, age group, gender, and date:

covid_de <- covid_de %>%
  # keep and translate the relevant columns
  select(state = Bundesland,
         county = Landkreis,
         age_group = Altersgruppe,
         gender = Geschlecht,
         cases = AnzahlFall,
         deaths = AnzahlTodesfall,
         recovered = AnzahlGenesen,
         date = Meldedatum) %>%
  mutate(date = date(date)) %>%
  # age groups: "A15-A34" becomes "15-34"
  mutate(age_group = str_remove_all(age_group, "A")) %>%
  mutate(age_group = case_when(
    age_group == "unbekannt" ~ NA_character_,
    age_group == "80+" ~ "80-99",
    TRUE ~ age_group
  )) %>%
  mutate(gender = case_when(
    gender == "W" ~ "F",
    gender == "unbekannt" ~ NA_character_,
    TRUE ~ gender
  )) %>%
  # aggregate daily numbers
  group_by(state, county, age_group, gender, date) %>%
  summarise(cases = sum(cases),
            deaths = sum(deaths),
            recovered = sum(recovered)) %>%
  ungroup() %>%
  filter(cases >= 0 & deaths >= 0) %>%
  filter(date < today()) %>%
  # replace umlauts
  mutate(state = str_replace_all(state, "ü", "ue")) %>%
  mutate(state = str_replace_all(state, "ä", "ae")) %>%
  mutate(state = str_replace_all(state, "ö", "oe")) %>%
  mutate(state = str_replace_all(state, "ß", "ss")) %>%
  mutate(county = str_replace_all(county, "ü", "ue")) %>%
  mutate(county = str_replace_all(county, "ä", "ae")) %>%
  mutate(county = str_replace_all(county, "ö", "oe")) %>%
  mutate(county = str_replace_all(county, "ß", "ss")) %>%
  # drop parenthesised suffixes; the parentheses must be escaped
  mutate(county = str_remove(county, "\\(.+\\)")) %>%
  mutate(county = str_trim(county))
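The repeated str_replace_all() calls above can be condensed, because stringr accepts a named character vector of pattern/replacement pairs. The following is only a sketch: clean_name() and umlaut_map are hypothetical names that do not appear in the original pipeline, and the input strings are illustrative ("Köln (Stadt)" is made up to show the suffix removal).

```r
library(stringr)

# Names are the patterns, values are the replacements.
# (umlaut_map is a hypothetical name, not part of the original code.)
umlaut_map <- c("ü" = "ue", "ä" = "ae", "ö" = "oe", "ß" = "ss")

# Hypothetical helper bundling the three cleaning steps used in this post.
clean_name <- function(x) {
  x <- str_replace_all(x, umlaut_map)  # transliterate umlauts
  x <- str_remove(x, "\\(.+\\)")       # drop a parenthesised suffix
  str_trim(x)                          # remove leftover whitespace
}

# Illustrative inputs only.
cleaned <- clean_name(c("Thüringen", "Baden-Württemberg", "Gießen", "Köln (Stadt)"))
print(cleaned)
```

Like the original code, this map only covers lower-case umlauts; capital letters such as "Ü" would need their own entries.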

The result is a dataset that lists daily (not cumulative!) cases, deaths, and recovered cases for six age groups, gender, and the German counties and their corresponding federal states. Similar to the US, Germany has a federal system in which the 16 federal states have a large amount of legislative power. The German equivalent of the US county is the “Kreis”, which can either be associated with a city (“Stadtkreis” = “SK”) or the countryside (“Landkreis” = “LK”). Here only a subset of columns is shown for reasons of clarity:

covid_de %>%
  filter(state == "Sachsen") %>%
  select(-deaths, -recovered) %>%
  head(5) %>%
  kable() %>%
  column_spec(1:6, width = c("15%", "25%", "15%", "10%", "25%", "10%")) %>%
  kable_styling()
state    county      age_group  gender  date        cases
Sachsen  LK Bautzen  00-04      F       2020-03-20  1
Sachsen  LK Bautzen  00-04      F       2020-04-07  2
Sachsen  LK Bautzen  00-04      M       2020-03-21  1
Sachsen  LK Bautzen  05-14      F       2020-03-20  1
Sachsen  LK Bautzen  05-14      F       2020-03-21  1

This is the cleaned dataset, which is available on Kaggle as covid_de.csv. With this data, you can already slice and analyse Germany’s COVID-19 trends by various demographic and geographical features.

However, for the maps that we’re interested in, one more input is missing: shapefiles. A shapefile uses a standard vector format for specifying spatial geometries. It packages the map boundary data of the required entities (like countries, federal states) into a small set of related files. For this project, I found publicly available shapefiles at the state and county level provided by Germany’s Federal Agency for Cartography and Geodesy. Both levels are available in the Kaggle dataset. Here I put the county level files (de_county.*) into a local, static directory.

Shapefiles can be read into R using the sf package tool st_read. In order to join them to our COVID-19 data later, we need to do a bit of string translating and wrangling again. The tidyr tool unite is used to combine the county type (BEZ in c("LK", "SK")) and county name into the format we have in our COVID-19 data:

shape_county <- st_read(str_c("../../static/information/", "de_county.shp"), quiet = TRUE) %>%
  rename(county = GEN) %>%
  select(county, BEZ, geometry) %>%
  mutate(county = as.character(county)) %>%
  # same name cleaning as for the COVID-19 data
  mutate(county = str_replace_all(county, "ü", "ue")) %>%
  mutate(county = str_replace_all(county, "ä", "ae")) %>%
  mutate(county = str_replace_all(county, "ö", "oe")) %>%
  mutate(county = str_replace_all(county, "ß", "ss")) %>%
  mutate(county = str_remove(county, "\\(.+\\)")) %>%
  mutate(county = str_trim(county)) %>%
  # map the county types onto the LK / SK prefixes
  mutate(BEZ = case_when(
    BEZ == "Kreis" ~ "LK",
    BEZ == "Landkreis" ~ "LK",
    BEZ == "Stadtkreis" ~ "SK",
    BEZ == "Kreisfreie Stadt" ~ "SK"
  )) %>%
  unite(county, BEZ, county, sep = " ", remove = TRUE)

At this stage, there are still some county names that don’t match precisely. It would have been too easy, otherwise. These cases mostly come down to different styles of abbreviations being used for counties with longer names. A scalable way to deal with these wonders of the German language would be fuzzy matching by string distance similarities. Here, the number of mismatches is small and I decided to adjust them manually.
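As a rough illustration of both steps, the sketch below first lists the names that fail to match with dplyr's anti_join() and then proposes the closest candidate by Levenshtein distance using base R's adist(). This is not the approach used in this post (the mismatches were fixed by hand); the two small tibbles are stand-ins for the real name columns.

```r
library(dplyr)

# Stand-ins for the county names in the two sources (illustrative subset).
covid_names <- tibble(county = c("SK Berlin",
                                 "LK Landsberg a.Lech",
                                 "SK Freiburg i.Breisgau"))
shape_names <- tibble(county = c("SK Berlin",
                                 "LK Landsberg am Lech",
                                 "SK Freiburg im Breisgau"))

# Names present in the COVID-19 data but absent from the shapefile.
unmatched <- anti_join(covid_names, shape_names, by = "county")

# Edit distances: one row per unmatched name, one column per shapefile name.
dist_mat <- adist(unmatched$county, shape_names$county)

# For every unmatched name, pick the shapefile name with the smallest distance.
suggestions <- unmatched %>%
  mutate(closest = shape_names$county[apply(dist_mat, 1, which.min)])

print(suggestions)
```

Edit distance works well for abbreviation differences like these. Names that differ by a long suffix may still end up closest to the wrong candidate, so the suggestions are worth checking by eye before they are applied.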

Then, I group everything by county and date and sum over the remaining features. One major issue here is that not all counties will report numbers for all days. These are small areas, after all. In this dataset, these cases are implicitly missing; i.e. the corresponding rows are just not present. It is important to convert these cases into explicit entries: rows which have a count of zero. Otherwise, our eventual map will have “holes” in it for specific days and specific counties. The elegant solution in the code is made possible by the tidyr function complete: simply name all the columns for which we want to have all the combinations and specify how they should be filled. This approach applies to any situation where we have a set of features and need a complete grid of all possible combinations.

Finally, we sum up the cumulative cases and deaths. Here, I also applied a filter to extract data from March 1st to 31st only, to prevent the animation file from becoming too large. Feel free to expand this to a longer time frame:

foo <- covid_de %>%
  # manual fixes for county names that differ between the two sources
  mutate(county = case_when(
    county == "Region Hannover" ~ "LK Region Hannover",
    county == "SK Muelheim a.d.Ruhr" ~ "SK Muelheim an der Ruhr",
    county == "StadtRegion Aachen" ~ "LK Staedteregion Aachen",
    county == "SK Offenbach" ~ "SK Offenbach am Main",
    county == "LK Bitburg-Pruem" ~ "LK Eifelkreis Bitburg-Pruem",
    county == "SK Landau i.d.Pfalz" ~ "SK Landau in der Pfalz",
    county == "SK Ludwigshafen" ~ "SK Ludwigshafen am Rhein",
    county == "SK Neustadt a.d.Weinstrasse" ~ "SK Neustadt an der Weinstrasse",
    county == "SK Freiburg i.Breisgau" ~ "SK Freiburg im Breisgau",
    county == "LK Landsberg a.Lech" ~ "LK Landsberg am Lech",
    county == "LK Muehldorf a.Inn" ~ "LK Muehldorf a. Inn",
    county == "LK Pfaffenhofen a.d.Ilm" ~ "LK Pfaffenhofen a.d. Ilm",
    county == "SK Weiden i.d.OPf." ~ "SK Weiden i.d. OPf.",
    county == "LK Neumarkt i.d.OPf." ~ "LK Neumarkt i.d. OPf.",
    county == "LK Neustadt a.d.Waldnaab" ~ "LK Neustadt a.d. Waldnaab",
    county == "LK Wunsiedel i.Fichtelgebirge" ~ "LK Wunsiedel i. Fichtelgebirge",
    county == "LK Neustadt a.d.Aisch-Bad Windsheim" ~ "LK Neustadt a.d. Aisch-Bad Windsheim",
    county == "LK Dillingen a.d.Donau" ~ "LK Dillingen a.d. Donau",
    county == "LK Stadtverband Saarbruecken" ~ "LK Regionalverband Saarbruecken",
    county == "LK Saar-Pfalz-Kreis" ~ "LK Saarpfalz-Kreis",
    county == "LK Sankt Wendel" ~ "LK St. Wendel",
    county == "SK Brandenburg a.d.Havel" ~ "SK Brandenburg an der Havel",
    str_detect(county, "Berlin") ~ "SK Berlin",
    TRUE ~ county
  )) %>%
  # sum over age groups and gender
  group_by(county, date) %>%
  summarise(cases = sum(cases),
            deaths = sum(deaths)) %>%
  ungroup() %>%
  # turn implicitly missing county/date combinations into zero counts
  complete(county, date, fill = list(cases = 0, deaths = 0)) %>%
  group_by(county) %>%
  mutate(cumul_cases = cumsum(cases),
         cumul_deaths = cumsum(deaths)) %>%
  ungroup() %>%
  filter(between(date, date("2020-03-01"), date("2020-03-31")))
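To see what tidyr's complete() and the grouped cumsum() do in isolation, here is a toy version of the same two steps. The counts are made up for illustration: county "B" has no row for 2020-03-02, which complete() turns into an explicit zero.

```r
library(dplyr)
library(tidyr)

# Made-up daily counts; "B" is missing the row for 2020-03-02.
daily <- tibble(
  county = c("A", "A", "A", "B", "B"),
  date   = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                     "2020-03-01", "2020-03-03")),
  cases  = c(1, 2, 0, 3, 1)
)

filled <- daily %>%
  # build the full county x date grid, filling new rows with zero cases
  complete(county, date, fill = list(cases = 0)) %>%
  # explicit ordering so that the running sum follows the calendar
  arrange(county, date) %>%
  group_by(county) %>%
  mutate(cumul_cases = cumsum(cases)) %>%
  ungroup()

print(filled)
```

The arrange() call is a defensive addition in this sketch: a cumulative sum depends on row order, so sorting by date within each county before cumsum() removes any reliance on how the rows happen to be ordered.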

Now we have all the ingredients for animating a county-level map of cumulative cases. Here we first define the animation object by specifying geom_sf() and theme_map() for the map style, then providing the animation steps column date to the transition_time() method. The advantage of transition_time is that the length of the transitions between steps is proportional to the intrinsic time differences. Here, we have a very well behaved dataset and all our steps are of length 1 day. Thus, we could also use transition_states() directly. However, I consider it good practice to use transition_time whenever actual time steps are involved, to be prepared for unequal time intervals.
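Whether the time steps really are of equal length is easy to check before choosing a transition. The base R sketch below uses a hypothetical helper, has_even_steps(), and two made-up date vectors; in this post the input would be the date column of the prepared data.

```r
# Made-up date vectors: one with daily steps, one with a gap.
even_dates   <- as.Date("2020-03-01") + 0:4
uneven_dates <- as.Date(c("2020-03-01", "2020-03-02", "2020-03-05"))

# Hypothetical helper: TRUE if all consecutive unique dates are equally spaced.
has_even_steps <- function(dates) {
  steps <- diff(sort(unique(dates)))
  length(unique(as.numeric(steps))) <= 1
}

even_result   <- has_even_steps(even_dates)    # daily steps
uneven_result <- has_even_steps(uneven_dates)  # steps of 1 and 3 days

print(c(even = even_result, uneven = uneven_result))
```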

The animation parameters are provided in the animate function, such as the transition style from one day to the next (cubic-in-out), the animation speed (10 frames per second), or the size of the plot. For cumulative animations like this, it’s always a good idea to include an end_pause freeze-frame, so that the reader can have a closer look at the final state before the loop begins anew:

gg <- shape_county %>%
  right_join(foo, by = "county") %>%
  ggplot(aes(fill = cumul_cases)) +
  geom_sf() +
  scale_fill_viridis(trans = "log1p", breaks = c(0, 10, 100, 1000)) +
  theme_map() +
  theme(title = element_text(size = 15),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 15)) +
  labs(title = "Total COVID-19 cases in Germany: {frame_time}", fill = "Cases") +
  transition_time(date)

animate(gg + ease_aes('cubic-in-out'), fps = 10, end_pause = 25,
        height = 800, width = round(800/1.61803398875))

Our final map shows how the number of COVID-19 cases in Germany first started to rise in the South and West, and how they spread to other parts of the country. The geographical middle of Germany appears to be lagging behind in case counts even at later times. Note the logarithmic colour scale.
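The colour scale in the plot code uses trans = "log1p" rather than a plain logarithm. A short base R sketch, using the same break values, shows why that matters for counts that include zero:

```r
# Break values from the colour scale in the plot code.
counts <- c(0, 10, 100, 1000)

# log1p(x) computes log(1 + x), so a count of zero maps to zero.
scaled <- log1p(counts)

# A plain logarithm is not finite at zero.
plain <- log(counts)

# expm1() is the inverse of log1p() and recovers the original counts.
recovered <- expm1(scaled)

print(rbind(counts, scaled, plain, recovered))
```

Counties without any reported cases therefore still get a well-defined colour at the bottom of the scale.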

More information:

  • One caveat: This view doesn’t take into account population density, which makes large cities like Berlin (north-east) stand out more towards the end. My Kaggle dataset currently includes population counts for the state level only, but I’m planning to add county data in the near future.

  • If you’re looking for further inspiration on how to analyse this dataset, then check out the various Notebooks (aka “Kernels”) which are associated with it on Kaggle. Kaggle has the big advantage that you can run R or Python scripts and notebooks in a pretty powerful cloud environment, and present your work alongside datasets and competitions.

  • Another Kaggle dataset of mine with daily COVID-19 cases, deaths, and recoveries in the US can be found here. This data also has a county-level resolution. It’s based on Johns Hopkins University data and I’m updating it on a daily basis.


