



DATA FOR CHANGE


BUILDING A COVID-19 MAP USING ELK


CREATE YOUR OWN CUSTOM COVID-19 MAP USING ELASTICSEARCH

Carlos Cilleruelo

Sep 3, 2020 · 11 min read



Covid-19 ELK Map, available at
https://covid19map.uah.es/app/dashboards#/view/478e9b90-71e1-11ea-8dd8-e1599462e413
| Image by the author

Probably most of you are familiar with the Johns Hopkins University (JHU) map
representing the current situation of the COVID-19 pandemic.


Image of the Johns Hopkins University (JHU) map (Johns Hopkins University)

This map was developed using ArcGIS technology, which has become the de facto
standard for pandemic maps; it is used, for example, by the WHO and the Italian
Government.

After seeing it, I thought about creating my own map with ELK, and in a few
days everything was running with the help of a friend. Based on that experience
I decided to write about how you can easily do it too. This series of posts is
centred on how to create your own custom map using the ELK stack.


WHY ELASTICSEARCH?

The first question to answer is: why ELK and not ArcGIS? Elasticsearch is open
source and everyone can easily deploy a running cluster. Furthermore,
Elasticsearch offers beautiful visualizations through Kibana, including maps,
so it has everything we need to build an incredible Covid-19 map. I really love
the ELK stack, so I decided to give it a try.

> I have run the ELK stack on a single 10$/month Digital Ocean VPS, obviously
> without redundancy and without much space, but we will see that we do not need
> a lot of space for the data.

Based on this, our only cost will be the infrastructure for running ELK, and a
small VPS can run a small cluster. I do not have a lot of spare money, so I
always try to keep costs to a minimum. I have run the ELK stack on a single
10$/month Digital Ocean VPS, obviously without redundancy and without much
space, but we will see that we do not need a lot of space for the data. Another
option is to use Elastic Cloud.


My current deployment on Elastic Cloud | Image by the author

Elastic Cloud is the easiest way to run an ELK cluster and offers a lot of
useful capabilities. For example, one of the "problems" of ELK is the frequency
of updates with new functionalities. In Elastic Cloud, updating your cluster is
as easy as pressing a button in the control panel. Also, the cost of the
smallest deployment is under 20$/month. In my case, I ended up using Elastic
Cloud because of its usability and easier administration. I totally recommend
this option, and do not forget that there is a 14-day trial.


COVID-19 DATA SOURCES

After agreeing on the awesomeness of the ELK stack, we can start thinking about
how to insert Covid-19 data into ELK. First of all, we need to identify a
reliable and regularly updated source of data for our map. We need to retrieve
and insert new data every day to keep the map up to date.

Johns Hopkins University (JHU) publishes its data on GitHub. The files are all
in CSV format and are easy to work with. Something similar happens with the
Covid-19 Italy data. These formats can easily be parsed and then inserted, but
there can be several problems; during my work parsing those files I ran into a
few. The first one is the data updating process: for example, JHU is not always
the fastest at updating its data. And that is normal, they need to wait for new
data to be released and then incorporate it into their dataset. You will see
that the Italy repository is updated more frequently.

> If you are planning to run an up-to-date map, the best option is not to use
> one of those GitHub repositories.
>
> The best option is to use a Covid-19 API. There are several options, but I
> decided to use the Covid-19 Narrativa API.

Another problem is changes in the file structure. There have been times when
the CSV structure changed and you had to parse those files again; columns can
be renamed or their order can change. Because of that, if you are planning to
run an up-to-date map, the best option is not to use one of those GitHub
repositories.

To avoid those problems, the best option is to use a Covid-19 API. There are
several options, but I decided to use the Covid-19 Narrativa API.

To retrieve the Covid-19 data for the 3rd of September 2020, we just need to
make a request to https://api.covid19tracking.narrativa.com/api/2020-09-03.
After performing that request we will obtain a JSON response, avoiding all the
problems mentioned earlier. Narrativa already checks and downloads Covid-19
information from several official data sources:

 * Spain: Ministerio de Sanidad
 * Italy: Dipartimento della Protezione Civile
 * Germany: Robert Koch Institute
 * France: Santé publique France
 * Johns Hopkins University


JSON response of Covid-19 Narrativa API | Image by the author

Using this API we can retrieve all the data we need, from world data down to
country and region data. The full documentation of the API can be found here.
Reaching this step, we can start to consume and insert data into Elasticsearch.
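
As a quick check before writing the full script, we can request a single date
and inspect the structure of the response. The following is only a sketch: the
nested layout (dates → day → countries → country → today_confirmed) follows the
code used later in this article, and 'Spain' is just an illustrative key.

import requests

# Request the data for a single day and explore the JSON structure
# (a sketch; the field names follow the code used later in this article).
day = "2020-09-03"
r = requests.get("https://api.covid19tracking.narrativa.com/api/" + day)
data = r.json()

countries = data['dates'][day]['countries']
print(len(countries), "countries in the response")
print(countries['Spain']['today_confirmed'])  # 'Spain' is just an example key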


INSERTING COVID-19 INSIDE ELASTIC

To consume data from an API we could use Logstash. Logstash is the standard
tool for collecting, parsing and transforming information before inserting it
into Elasticsearch. Furthermore, Logstash has a lot of preconfigured pipelines
already published for common logs.

But there are other possibilities, like Python. I really love Python's syntax,
so when I am consuming from an API I usually end up building a script that
consumes the data and then inserts it into Elasticsearch. As I said, you can do
this with Logstash, but I feel more comfortable programming in Python.

So let's start with the code! First of all, we need all the data from the
beginning of the pandemic until today. We also want this data separated by
date, in order to filter or create visualizations for different periods of
time. Using Python's requests and datetime we can easily iterate through all
the dates and retrieve all the data. I personally prefer requests to urllib; I
find it much simpler and more elegant.

> Using Python's requests and datetime we can easily iterate through all the
> dates and retrieve all the data.

import requests
from datetime import datetime, date, timedelta

start_date = date(2020, 3, 1)
end_date = date(2020, 4, 9)
delta = timedelta(days=1)

while start_date <= end_date:
    day = start_date.strftime("%Y-%m-%d")
    print("Downloading " + day)
    url = "https://api.covid19tracking.narrativa.com/api/" + day
    r = requests.get(url)
    data = r.json()
    start_date += delta

After obtaining the data for each date we could just insert it into
Elasticsearch, but before doing that it is necessary to perform some formatting.
We are going to represent the data associated with countries on an Elastic map.
In order to associate data with each country, Elasticsearch needs to identify
the name of the country/region. The Narrativa API offers the names in English,
Italian and Spanish, but some country names can be problematic. That is exactly
why the ISO 3166-1 alpha-2 (iso2) and ISO 3166-1 alpha-3 (iso3) codes were
invented: using this nomenclature we can identify each country unambiguously.
Python has a package, countryinfo, that can help us translate country names to
the iso3 format.

> Python has a package, countryinfo, that can help us translate country names
> to the iso3 format.
>
> You need to compare infection rates among the population, or use statistical
> metrics, not absolute numbers. This topic has been covered in detail in an
> ArcGIS post that I totally recommend.

from countryinfo import CountryInfo

for day in data['dates']:
    for country in data['dates'][day]['countries']:
        try:
            country_info = CountryInfo(country)
            country_iso_3 = country_info.iso(3)
            population = country_info.population()
        except Exception as e:
            print("Error with " + country)
            country_iso_3 = country
            population = None
            infection_rate = 0
            print(e)

Also, if you checked the code, most of you will have noticed a population
value. Unfortunately, Elasticsearch does not include population values for each
country right now (they are working on that). In order to map the Covid-19
pandemic in a representative way, it is necessary to have the population of
each country; a heat map with only the number of cases is not representative.
You need to compare infection rates among the population, or use statistical
metrics, not absolute numbers. This topic has been covered in detail in an
ArcGIS post that I totally recommend.

An easy way of obtaining a representative metric of infected people in each
country is to use an infection rate. The infection rate represents the
probability or risk of infection in a population.


Rate of infection formula | Image by the author
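
Judging from the code below, the formula in the image is simply the number of
confirmed cases divided by the population, expressed as a percentage:
infection rate (%) = 100 × (confirmed cases / population). As an illustrative
example, 1,000,000 confirmed cases in a country of 50,000,000 inhabitants gives
an infection rate of 2%.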

Again, using Python we can easily calculate that number. Unfortunately,
Python's countryinfo does not include the population of every country, which is
why I am catching the exception. Most countries are supported, but I ran into
errors with the Bahamas, Cabo Verde and a few others.

def getInfectionRate(confirmed, population):
    infectionRate = 100 * (confirmed / population)
    return float(infectionRate)

if population is not None:
    try:
        infection_rate = getInfectionRate(
            data['dates'][day]['countries'][country]['today_confirmed'],
            population)
        print(infection_rate)
    except:
        infection_rate = 0

> I replaced the timestamp with a datetime-formatted date. This way
> Elasticsearch will automatically detect the date format and you can forget
> about configuring the Kibana index.

After performing all of these modifications we just need to insert the data
into Elasticsearch. Before inserting it, I preferred to build a custom
dictionary with all the data: I added the previous data, the population, the
country iso3 name and the infection rate, and I replaced the timestamp with a
datetime-formatted date. This way Elasticsearch will automatically detect the
date format and you can forget about configuring the Kibana index.

from elasticsearch import Elasticsearch

def save_elasticsearch_es(index, result_data):
    es = Elasticsearch(hosts="")  # your Elasticsearch host and auth info
    es.indices.create(
        index=index,
        ignore=400  # ignore the 400 "index already exists" error
    )
    # Build a deterministic document id from the date and country name
    id_case = str(result_data['timestamp'].strftime("%d-%m-%Y")) + \
        '-' + result_data['name']
    es.update(index=index, id=id_case,
              body={'doc': result_data, 'doc_as_upsert': True})

result_data = data['dates'][day]['countries'][country]
del result_data['regions']
result_data['timestamp'] = result_data.pop('date')
result_data.update(
    timestamp=datetime.strptime(day, "%Y-%m-%d"),
    country_iso_3=country_iso_3,
    population=population,
    infection_rate=infection_rate,
)
save_elasticsearch_es('covid-19-live-global', result_data)

The complete script can be found on GitHub; just remember to add your
Elasticsearch host before running it and to install all the dependencies.


CREATING COVID-19 VISUALIZATIONS USING KIBANA

> Kibana will automatically recognise the timestamp field as a time filter; you
> will just need to select it.

After running the script, an index will be created inside Elasticsearch and you
will be able to configure it from Kibana. Kibana will automatically recognise
the timestamp field as a time filter; you will just need to select it.


Kibana Index pattern | Image by the author
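
If you want to double-check that Elasticsearch really mapped the timestamp as a
date (and therefore that Kibana can offer it as the time filter), here is a
minimal sketch using the Python client, assuming the index name used by the
script above:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="")  # your Elasticsearch host and auth info

# Inspect the dynamic mapping of the index; 'timestamp' should be of type 'date'.
mapping = es.indices.get_mapping(index="covid-19-live-global")
properties = mapping["covid-19-live-global"]["mappings"]["properties"]
print(properties["timestamp"])  # expected: {'type': 'date'}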


DATA TABLE VISUALIZATION

One of the easiest visualizations we can create is a simple table showing the
countries with the highest number of Covid-19 cases.


Kibana table showing the countries with the most Covid-19 number of cases |
Image by the author

To create this we need a Data Table visualization. The first column can be the
total number of confirmed cases; using a simple max aggregation over
today_confirmed we can obtain that number.


Total confirmed cases configuration in Kibana | Image by the author

Another interesting metric is the number of cases confirmed in the last 48
hours. One might think a 24-hour figure would be more interesting, but many
countries take longer than that to report their cases, so with a 48-hour window
you will be able to represent more results. To build this in Kibana we need a
Sum Bucket aggregation: we use a Date Range of the last 48 hours, now-2d, and
then again a max aggregation for the number of confirmed cases.


Last 48h confirmed cases configuration in Kibana | Image by the author

Once we have our aggregations, the only thing left to do is to split the rows
by country name. Selecting split rows by name.keyword will do this. I also
recommend setting a size limit so that we only see the most relevant countries.
A descending list of 25 seems to be enough in my dashboard, but you can adjust
this number to your preferences.


Split rows by country name | Image by the author
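
For reference, the request Kibana builds for this table is roughly equivalent
to the aggregation below. This is only a sketch based on the index and field
names used by the script above (name.keyword, today_confirmed, timestamp); the
exact query Kibana generates may differ.

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="")  # your Elasticsearch host and auth info

# Split by country, take the max of today_confirmed, and sum the max over the
# last-48-hours bucket (the Sum Bucket / Date Range combination described above).
query = {
    "size": 0,
    "aggs": {
        "by_country": {
            "terms": {"field": "name.keyword", "size": 25,
                      "order": {"total_confirmed": "desc"}},
            "aggs": {
                "total_confirmed": {"max": {"field": "today_confirmed"}},
                "last_48h": {
                    "date_range": {"field": "timestamp",
                                   "ranges": [{"from": "now-2d"}]},
                    "aggs": {"max_confirmed": {"max": {"field": "today_confirmed"}}}
                },
                "confirmed_last_48h": {
                    "sum_bucket": {"buckets_path": "last_48h>max_confirmed"}
                }
            }
        }
    }
}

resp = es.search(index="covid-19-live-global", body=query)
for bucket in resp["aggregations"]["by_country"]["buckets"]:
    print(bucket["key"],
          bucket["total_confirmed"]["value"],
          bucket["confirmed_last_48h"]["value"])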


COVID-19 PANDEMIC MAP VISUALIZATION

The most incredible visualization we can create, or at least the most popular
one, is a map showing the evolution of the Covid-19 pandemic. Kibana offers
several options for creating maps; in this case we will select the choropleth
option. With this option we can select a World Countries layer and ISO 3166-1
alpha-3 as the format (remember that we included this field with our script).
The Statistics source will be our index name, and the field containing the ISO
3166-1 alpha-3 code will be our Join field, in our case country_iso_3.

> Also, be aware of this! If, after adding the layer, the map is still black,
> check your Kibana date filter and change the last 15 minutes selector to Last
> 1 year. I spent a lot of time thinking I had done something wrong and the
> problem was the Kibana time selector.


Kibana map choropleth configuration to create a layer in the map | Image by the
author

After adding this layer we are not finished; all the countries still show the
same data. We need to select the infection_rate variable as the metric in order
to draw colours based on its value, then choose Fill color by value and select
infection_rate again. Furthermore, we can choose between several colour
palettes under layer style; I prefer the one with red tones. At this point the
map should show some colours. Also, be aware of this! If, after adding the
layer, the map is still black, check your Kibana date filter and change the
last 15 minutes selector to Last 1 year. I spent a lot of time thinking I had
done something wrong and the problem was the Kibana time selector.


Kibana map choropleth layer configuration | Image by the author

After selecting these options you should be able to see something like this:


Kibana map creation | Image by the author
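
If the map stays black even after fixing the time filter, a quick way to debug
is to confirm that the index contains documents with country_iso_3 values
inside the selected time range. A minimal sketch, assuming dynamic mapping
created a country_iso_3.keyword sub-field:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="")  # your Elasticsearch host and auth info

# Count the documents of the last year and list the ISO3 codes they contain;
# these values are what the choropleth layer joins against.
resp = es.search(index="covid-19-live-global", body={
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1y"}}},
    "aggs": {"iso3": {"terms": {"field": "country_iso_3.keyword", "size": 300}}}
})
print(resp["hits"]["total"]["value"], "documents in the last year")
print([bucket["key"] for bucket in resp["aggregations"]["iso3"]["buckets"]][:10])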


CONCLUSIONS

Hopefully, with the code and examples presented in this article, you will be
able to create your own custom maps and visualizations. There are a lot of
possible Kibana visualizations; in this article I only covered a few ideas. My
recommendation is to try as many Kibana visualizations as you can. Kibana is an
incredible tool: with a few clicks you will be able to create graphs centred on
your country, region or continent.


CARLOS CILLERUELO

Bachelor of Computer Science and MSc in Cyber Security. Currently working as a
cybersecurity researcher at the University of Alcalá.
