1. Foreword
This post is going to be amazingly useful for any newcomer to data visualization with programming languages. It's literally step-by-step and completely fool-proof. I hope you'll enjoy reading this.
First, let's take a look at the plot below:
Figure: Confirmed COVID-19 Infections in 42 Countries/Regions as of April 19, 2020
This is an interactive plot: you can display hidden information by hovering over the data points. The y-axis represents the number of confirmed cases of Coronavirus infection on April 19, 2020 for 42 locations; x-axis shows the conventional social distance in centimeters published in 2017; the size of the data points tells the number of population; and the color reflects how quickly the social distancing was practiced since the first case appeared.
You can observe several interesting trends. First, the default social distance had been quite diverse across countries/regions before the pandemic took place, and the confirmed cases as of April 19 are relatively lower in locations with default social distance over 80 cm (30 cm = 1 ft). Second, although some the locations with default social distance lower than 70 cm show more confirmed cases than the long distance cohort, those with darker colors, i.e., quicker practice of social distancing, have much fewer cases. Put together, these observations suggest that the speed at which Coronavirus spreads depends on how close people habitually gets together and how quickly the distance is increased upon the recognition of the infection. This is consistent with medical experts' recommendation for social distancing. Cool.
Now, you may be asking: "Where did this plot come from?"
The answer: YOU.
In this post, we are going to learn how to make interactive plots like the one above by going over how exactly it was made. The content is pretty long, so let's get on it without further ado.
2. Friendly Suggestion
Before moving on, however, I would like to suggest changing themes in Chrome and Jupyter Lab (if you are using them). It's for your eyes. Using dark themes has relieved the stress on my eyes and I strongly recommend you to do it too.
First, open Chrome and go to Settings from the drop-down at the top right of the window. Once the Settings tab opens, click on Appearance on the list on the left of the page. The tab shows the Appearance section of the Settings on top of which you'll see Themes. Click on that and you'll be in Chrome Web Store. Find and select a dark theme and click on the Add to Chrome button. I'm using the one called "Just Black".
Ok, now launch Jupyter Lab. Hover on Settings, then on JupyterLab Theme, and click on JupyterLab Dark. Wow... yeah, that's much better, right?
Great. Now we are ready for a careful gaze at the screen.
3. Get the Data
First, we'll grab two data tables from Johns Hopkins University's repository. One is the panel data of global confirmed cases I mentioned in the previous post. Another one is a metadata table; we are going to use its Population column.
Second, we need data for social distance across countries/regions. After a quick research, I found a 2017 study by a large group of researchers entitled "PREFERRED INTERPERSONAL DISTANCES: A GLOBAL COMPARISON". You can download a PDF file freely from this page at ResearchGate. I downloaded a high resolution image of Figure 2 of the paper stored here and figured out the values of "Personal Distance (Acquaintance)" for each country listed. I chose "Acquaintance" because it's right in the middle of "Stranger" and "Close Person" so that it can be used as a proxy of the average. I created a table named Distance.csv using Excel with country names adjusted and reordered for compatibility with other tables, which is stored here in my GitHub repository.
We use another source of data of border control but it's a bit complicated so I'll discuss that later.
We'll download the data using R; I have realized that I prefer R over Python for data preparation, especially when involving SQL codes. As far as I know, R takes much fewer and simpler codes than Python for this kind of task. Let's take a look at them one by one:
# 1. Load necessary packages.
library(RSQLite)
library(readr)
# 2. Create a database file Visualization.db.
con <- dbConnect(RSQLite::SQLite(), "Visualization.db")
# 3. We will get the aforementioned JHU's metadata table first. Define the URL of the JHU's data table location as urlfile.
urlfile = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
# 4. Read and hold the table contents at urlfile as data.
data <- read_csv(url(urlfile))
# 5. Export data as a CSV file Metadata.csv to the desktop.
write.csv(data, "C:/Users/Norrie/Desktop/Metadata.csv")
# 6. Next, we will get JHU's panel data of the confirmed cases. Define the URL of the JHU's data table location as urlfile.
urlfile = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
# 7. Read and hold the table contents at urlfile as data.
data <- read_csv(url(urlfile))
# 8. Store data as a CSV file JHUdata in the database created in the second step.
dbWriteTable(con, "JHUdata", data)
# 9. Moving on. Redefine urlfile as the location of Distance.csv at my GitHub repository.
urlfile = "https://raw.githubusercontent.com/NorrieF/Corona-Visualization/master/Distance.csv"
#10. Read and hold the table contents at urlfile as data.
data <- read_csv(url(urlfile))
#11. Store data as a CSV file Distance in the database.
dbWriteTable(con, "Distance", data)
#12. From JHUdata, remove all rows of Country/Region that are not common with Distance using SQL; we don't use them in this visualization project.
dbExecute(con, "DELETE FROM JHUdata WHERE [Country/Region] NOT IN (SELECT Country from Distance)")
#13. We are going to export JHUdata to the desktop for the next step. Read and hold the table from the database as data.
data <- dbGetQuery(con, "SELECT * FROM JHUdata")
#14. Store data on the desktop.
write.csv(data, "C:/Users/Norrie/Desktop/JHUdata.csv")
#15. We will need Distance table as well. Export it to the desktop in the same manner:
data <- dbGetQuery(con, "SELECT * FROM Distance")
write.csv(data, "C:/Users/Norrie/Desktop/Distance.csv")
After running the codes above, we have three CSV files on our desktop: Metadata.csv, JHUdata.csv, and Distance.csv. Now, let's clean up and join the data.
Oh, by the way, if you have checked my work of data cleanup with SQL in the previous post, you may be wondering "Wait, Norrie. Why would we download the tables at this point? Why not keep using SQL for automation?"
That's a good question, and here is why. In the other work, we dealt with dozens of tables containing a large number of rows. In that situation, spending some time for automating tasks makes a lot of sense. SQL - SQLite to be precise - is useful to update rows on the same column in a table all together, and the repetition of such process can be automated for all tables using for loop functions. This time, however, we are handling only a few tables, and JHUdata has not many rows but columns. At least with SQLite, you'd have to write SQL codes for updating all columns in a table, which in this case isn't worth it. The auto and manual both have roles to play. We have to pick which one to use wisely, just like we'd have to choose between the fork and spoon depending on whether our dessert plate has a pudding or a cake.
So, the three tables. We are going to do the following with them.
1. Create a temporary CSV file "temp" with Excel and copy-and-paste the column names of JHUdata in it. We'll use it for the following data cleanup.
2. First, we need to aggregate the rows of United Kingdom, Canada and China in JHUdata for consistence with Distance. Let's take Canada first. Enter "Canada" in the column Country/Region of temp, and add the total of the Canadian rows of "1/22/2020" in JHUdata to the corresponding cell in temp using the SUM( ) function. Not all rows appear together in sequence, so make sure you don't miss any of them. Then, extend the cell input to the last column "4/19/2020" (if can be much later depending on when you do this). After that, do the same for United Kingdom.
3. Moving on to China. There is an issue here. Hong Kong is among the Chinese rows in JHUdata while the region is treated separately in Distance. We want a higher number of rows; separate them in JHUdata. I know, you may feel weird to do this if you are from China, but don't forget that the column name is not Country but Country/Region, so it's ok. If you are from Hong Kong, yeah, just do it. And please remember: no politics on The Metrician. Do the same thing you did for Canada and UK, except you don't include Hong Kong here. Again, some rows are located away from others, so make sure to include all.
4. In JHUdata, move "Hong Kong" from Province/State to Country/Region.
5. We are ready to bring the three aggregate rows of Canada, UK and China from temp back to JHUdata. However, if you delete the original rows in JHUdata now, the aggregated values will disappear since they will be updated with nothing (the cell values would become "#REF!"). Just copy the whole data of 3 rows in temp you have got, paste their values (see the figure on the left) at some other space in the table.
6. Delete the original rows of Canada, UK and China in JHUdata and copy-and-paste the aggregate values just saved in temp.
7. Insert a column in JHUdata and copy-and-paste the Distance_CM from the table Distance to it. Name the column as you like (Social_Distance in my case).
8. Insert another column in JHUdata and copy-and-paste Population from the table Metadata for each country (there are aggregate values for all countries including Canada, UK and China. Don't forget to subtract the number for Hong Kong from the number for China. Name the column as you like (Population in my case).
9. Insert two columns in JHUdata and copy-and-paste Lat and Long from the table Metadata. For Canada, China and UK, Google "country_name coordinates" and copy-and-paste the latitudes and longitudes displayed in the result. Make sure to convert the longitudes with "W" to negative values (for example, Canada's longitude will be displayed as "106.3468° W" and you should type in -106.3468). Name the column as you like (Just Lat and Long in my case). We are not making a map in this post, but these variables may be useful in the future.
10. Insert yet another column in JHUdata and enter the date where the first confirmed case appeared for each country. This is a step you can make mistakes easily, so be careful. Name the column as you like (First_Case in my case).
11. Insert one more column in JHUdata and enter the dates where each country tightened their border control. We treat them as the proxies for the timing where social distancing was pushed forward seriously by the authority in each country (studying the exact process of social distancing in the countries would be way out of our scope here; include that to this analysis and you will be writing a master's thesis). I have found the dates in the following web pages: most of the countries; Austria; Belgium; Croatia; Estonia; Portugal; Uganda; and the rest. For Iran, this is "NA" - I looked hard but couldn't find when they restricted their border control. Maybe they haven't yet... all I found was the information about other countries closing their borders with Iran. Probably Iran hasn't done that because their borders weren't widely open due to economic sanction to begin with. Name the column as you like (Border_Closed in my case).
12. Just one more step: insert a column in JHUdata and calculate the number of days between the date the first case was confirmed and the date the border control tightened. This is another proxy variable, which represents how quickly social distancing started in each country. Here is how to calculate (search "Difference in days" on the page). Name the column as you like (Days_Inbetween in my case).
Phewww, we are done.
4. Switch to Python and Visualize JHUdata
Alright, we've got the data. Let's visualize it with the Python module of Plotly.
Using Plotly on Jupyter Lab requires the installation of some extensions. You need to do two things:
1. Download the software Node (I went for the "Recommended for Most Users" option).
2. Visit this Plotly instruction and execute the commands below the line "Then run the following commands to install the required JupyterLab extensions". Using Windows 10, I ran the following commands one by one in Command Prompt:
set NODE_OPTIONS=--max-old-space-size=4096
jupyter labextension install @jupyter-widgets/jupyterlab-manager@1.1 --no-build
jupyter labextension install jupyterlab-plotly@4.6.0 --no-build
jupyter labextension install plotlywidget@4.6.0 --no-build
jupyter lab build
set NODE_OPTIONS=
Now you are ready to visualize the data. You can find an easy instruction on Plotly, from the installation to embedding the resultant plots in any website, in this article by Ms. Elizabeth Ter Sahakyan, a data scientist at Medium. I created the plot at the top of this page by modifying her codes: my acknowledgement. I believe you can follow her examples and apply the method to your own work, but the following is the exact codes I ran for my plot:
Let's go over the Python code. It's pretty short.
# 1. Load the module pandas (giving it a short name like pd reduces typing later and is very common) in order to JHUdata.csv from you GitHub repository (I'm using my own here).
import pandas as pd
# 2. Define the location of the table url.
url = "https://raw.githubusercontent.com/NorrieF/Corona-Visualization/master/JHUdata.csv"
# 3. Read and hold the contents of url as df.
df = pd.read_csv(url, error_bad_lines = False)
# 4. Import Plotly.
import plotly.express as px
# 5. We can program a plot of 4 variables in this single code:
fig = px.scatter(df, x = "Social_Distance", y = "4/19/2020", size = "Population", color = "Days_Inbetween", hover_name = "Country/Region", log_x = True, size_max = 60)
# 6. Nail it.
fig.show()
# 7. Did you get it? If you did, write the html file for the plot using plotly.io:
import plotly.io as pio
pio.write_html(fig, file = "index.html", auto_open = True)
The HTML file written can be found in the current folder, as shown in the image on the right:
If you just like in my case chose the path to host your plot in your GitHub repository rather than in Plotly's server, all you have to do is embedding the following HTML code after replacing the URL (bold and underscored) with the one for your plot.
<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://norrief.github.io/Corona-Visualization/" height="525" width="100%"></iframe>
My plot at the top of this page appears there as I had embedded this code in this page. If you want to use the plot for your own purposes, feel free to do so.
Done!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Now we all know how to create interactive plots using Plotly through Python!
5. Afterward
Since this post, we have learnt how to obtain, cleanup, reconstruct and visualize data using Johns Hopkins University's GitHub repository for examples. We have used tools including Jupyter Lab, R, Python, SQL, Excel, GitHub, etc. Now, we are getting pretty comfortable with handling qualitative and quantitative data.
The next step is learning the theory of statistics and its application such as econometrics. Machine learning is another interesting topic to study. We will cover them later in this blog.
As the examples, I will cover things other than COVID-19. I started working on the disease to utilize my jobless time for something useful. And the plot we made through this post is pretty satisfactory for myself, as it corroborated the conventional wisdom in the medical profession. I feel now that I should leave the rest of analysis to the experts and cooperate with the leaders. In that sense, I'm done.
There is a sign that my jobless time may end soon, also. That will provide us a great alternative source of data science practice, I believe.
You will discover what it is in the next post...... Until then, keep up with the metrics!
Comentarios