
6. SQL: Ultimate Learning by Doing


In the previous post, we learned how to snatch global COVID-19 daily data tables for the range between January 22 and March 31 from Johns Hopkins University's GitHub repository automatically. Now we are ready to analyze the data to get insights... or are we? It is pretty common that data scientists have to clean up the data they have collected. And our case is no exception. But what's the issue with our database? ...Well, it doesn't matter anymore: as I started to write the next sentence here, I found something that annihilates our concern:


JHU had uploaded cleaned-up time series data tables just hours ago!!


It turned out I had spent hours, actually days, just reinventing the wheel lol!

But I learned a lot in the process, and the code I used for the project can be reused for other purposes in the future, so it's not in vain. It's not.


Anyway, here is the gift I was going to offer to my audience. Quite like JHU, I uploaded the database file and the R and Python code I used to make it, along with some other material, to my GitHub repository CoronaDB. You can download them for free. Let me describe each item:


1. CoronaDB.db: the database file containing data tables for the range between 01-22-2020 and 04-15-2020.


2. Loop.r: an R script, written in a Jupyter Notebook, used to download the original data tables automatically.


3. Cleanup.py: a Python script, written in a Jupyter Notebook, used to clean up the data tables downloaded with Loop.r, in combination with Coorespondence.r.


4. Coorespondence.r: an R script used to support the process in Cleanup.py.


5. The CSV files: they contain the names and acronyms of the US and Canadian states as listed by NCBI (which JHU uses in their tables).


Also, I have summarized the cleanup workflow in two big chunks, so that you can see what I did.


1. Reformatting the tables. Tables from 01-22-2020 to 02-29-2020 have six variables, namely Province/State, Country/Region, Last Update, Confirmed, Deaths and Recovered. Tables from 03-01-2020 to 03-21-2020 have the coordinates, Latitude and Longitude, in addition. Tables from 03-22-2020 onward have even more variables: FIPS (Federal Information Processing Standards), Admin2 (municipality), Active, and Combined_Key. Moreover, the names of some existing variables were changed: Province_State (from Province/State), Country_Region (from Country/Region), Last_Update (from Last Update), Lat (from Latitude) and Long_ (from Longitude). I reformatted all tables into the latest format in the section called 1. Uniform the tables in the format of the table for March 31 in Cleanup.py. Basically, I renamed all tables before 03-22-2020, created empty tables with the original names in the target format, copied the data from the renamed tables into the empty tables, deleted the renamed tables, and added the columns the older tables lacked, like the coordinates, to them. (A minimal sketch of this step follows the list below.)


2. Aggregating US and Canadian states. It was only from March 9 for Canada and March 10 for the US that the data came bundled with state labels. Before that, cases were reported for entries like "Los Angeles, CA" and "Toronto, ON" (and from table 03-22-2020, data for the US were further compartmentalized by Admin2, the names of municipalities). I aggregated the data before March 9 and 10, respectively, by state label in 2. Aggregate rows for states in US and Canada in Cleanup.py. I did so by replacing the Province_State values containing state acronyms preceded by a comma and space, e.g., ", CA" and ", ON", with the full state names ("California", "Ontario"), then summing and copying the values of Confirmed, Deaths, and Recovered into a temporary table I had prepared, replacing the original rows with the sums from the temporary table, adding the coordinates from table 03-31-2020, and deleting the temporary table. Phew. (A sketch of this step follows as well.)
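
To make step 1 concrete, here is a minimal sketch in Python with sqlite3 of the rename-create-copy-drop routine for a single table. The table and column names follow the description above, but this is only a sketch under those assumptions; the actual code in Cleanup.py is longer and may differ in details.

```python
import sqlite3

# Sketch of step 1 for one pre-March table, assuming the daily tables are
# named after their dates and the old tables keep JHU's original column names.
conn = sqlite3.connect("CoronaDB.db")
cur = conn.cursor()

old_name = "01-22-2020"       # a table in the original 6-column format
tmp_name = old_name + "_old"  # temporary name for the original table

# 1. Move the old table out of the way.
cur.execute(f'ALTER TABLE "{old_name}" RENAME TO "{tmp_name}"')

# 2. Create an empty table with the original name in the 03-22-2020 format.
cur.execute(f'''
    CREATE TABLE "{old_name}" (
        FIPS TEXT, Admin2 TEXT, Province_State TEXT, Country_Region TEXT,
        Last_Update TEXT, Lat REAL, Long_ REAL,
        Confirmed INTEGER, Deaths INTEGER, Recovered INTEGER,
        Active INTEGER, Combined_Key TEXT
    )
''')

# 3. Copy the old data into the matching (renamed) columns.
cur.execute(f'''
    INSERT INTO "{old_name}"
        (Province_State, Country_Region, Last_Update,
         Confirmed, Deaths, Recovered)
    SELECT "Province/State", "Country/Region", "Last Update",
           Confirmed, Deaths, Recovered
    FROM "{tmp_name}"
''')

# 4. Drop the renamed original.
cur.execute(f'DROP TABLE "{tmp_name}"')

conn.commit()
conn.close()
```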
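
And here is a sketch of step 2 in the same spirit, for the US rows of one table. It assumes a lookup table called us_states (built from the CSV files in the repository) with columns Name and Abbreviation; the real names in CoronaDB.db may differ, and the copying of coordinates from table 03-31-2020 is omitted here.

```python
import sqlite3

# Sketch of step 2: collapse "City, XX" rows into one row per US state.
conn = sqlite3.connect("CoronaDB.db")
cur = conn.cursor()

table = "03-01-2020"  # a table whose US rows still look like "Los Angeles, CA"

# 1. Replace "City, XX" values with the full state name via the lookup table.
#    (Edge cases such as "Washington, D.C." are ignored in this sketch.)
cur.execute(f'''
    UPDATE "{table}"
    SET Province_State = (
        SELECT Name FROM us_states
        WHERE "{table}".Province_State LIKE '%, ' || Abbreviation)
    WHERE Country_Region = 'US' AND Province_State LIKE '%, %'
''')

# 2. Sum the per-city counts into one row per state in a temporary table.
cur.execute(f'''
    CREATE TEMPORARY TABLE state_sums AS
    SELECT Province_State, Country_Region,
           SUM(Confirmed) AS Confirmed,
           SUM(Deaths)    AS Deaths,
           SUM(Recovered) AS Recovered
    FROM "{table}"
    WHERE Country_Region = 'US'
    GROUP BY Province_State
''')

# 3. Replace the original city-level rows with the state-level sums.
cur.execute(f'''
    DELETE FROM "{table}" WHERE Country_Region = 'US'
''')
cur.execute(f'''
    INSERT INTO "{table}" (Province_State, Country_Region,
                           Confirmed, Deaths, Recovered)
    SELECT Province_State, Country_Region, Confirmed, Deaths, Recovered
    FROM state_sums
''')
cur.execute('DROP TABLE state_sums')

conn.commit()
conn.close()
```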


For the data cleanup, I used SQL code for SQLite through the R package RSQLite and the Python module sqlite3. I noticed the following while carrying out the task for this post:


1. Many programmers on the web have commented that SQLite is very limited in comparison with other systems such as MySQL, Microsoft SQL Server, Oracle Database, etc. Having used MySQL with phpMyAdmin for my doctoral research, I agree. But setting up and using SQLite is so simple, with no need for web server solution stacks like XAMPP. And you can find workarounds for the limitations in projects like this one (I sketch one such workaround after this list). Anyway, it's "lite," as the name proclaims.


2. Knowing both R and Python is surely an asset. I took on the cleanup with Python, just because I had been using R all the time. Along the way, I ran into a situation where a chunk of code didn't do what it was supposed to, without producing an error message, and conversations on the web suggested there is some complication in the function (I forgot to keep a log of it, but it was about importing CSV files on your desktop into a database file). After struggling for hours, I decided to do it with R, and it was done with ease. I believe there are opposite situations as well, so knowing how to use both is indeed essential for data scientists.


3. I found SQL to be like a bureaucrat: it has made me shout in my mind, "Hey, why can't you just do such a simple and straightforward thing? Just do it!" so many times. SQL, of course, works only the way its developers made it to. So, just as you sometimes have to pay repeated visits to city hall for not-so-meaningful paperwork, SQL may require manual labor for certain tasks. But it's still much faster than doing things literally "manually," and natural language processing experts will revolutionize data management in the future. So let's be patient.


4. Speaking of which, I just discovered the best patience-training material while working on this; it's called programming. "Empty yourself" is a phrase with such a spiritual connotation that you would associate it with Zen masters and clergymen. And the degree to which it applies to this iconic modern technology is almost ironic. A person who prays may receive, either by His blessing or by coincidence, what he or she wishes for without objectively deserving it. That never happens in the realm of coding. Be perfectly right, or ponder the error messages. And when you get an error, no matter how whimsical it looks, it never is; there is always a reason. That's why I now see that practicing programming is beneficial for anyone, even for technophobes. You will start engaging in human relations with a new spirit after the experience. With enough practice, I can tell that you can stop, at least habitually, thinking and saying phrases like "why don't/can't you just...?!", "oh my god, you are so...!!", "I wish you were...!!" and so on. After all, in programming, we are dealing with nothing but pure Logos.
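
As promised in point 1 above, here is one small illustration of the kind of limitation and workaround I mean. SQLite (at least the version I was using) has no ALTER TABLE ... DROP COLUMN, so removing a column means rebuilding the table and swapping it in. The sketch below runs on a made-up in-memory table with dummy rows, not on CoronaDB.db.

```python
import sqlite3

# Illustration of a typical SQLite workaround: with no
# ALTER TABLE ... DROP COLUMN available, a column is "dropped" by
# rebuilding the table without it. The table and rows are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript('''
    CREATE TABLE cases (Province_State TEXT, Confirmed INTEGER, Scratch TEXT);
    INSERT INTO cases VALUES ('California', 1, 'x'), ('Ontario', 2, 'x');

    -- "Drop" the Scratch column by rebuilding the table and renaming it.
    CREATE TABLE cases_new (Province_State TEXT, Confirmed INTEGER);
    INSERT INTO cases_new SELECT Province_State, Confirmed FROM cases;
    DROP TABLE cases;
    ALTER TABLE cases_new RENAME TO cases;
''')

print(cur.execute('SELECT * FROM cases').fetchall())
# [('California', 1), ('Ontario', 2)]
conn.close()
```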


In a future post, we will discuss how to visualize data using the tables newly uploaded by JHU. Until then, stay safe and keep up with the metrics!
