In the last post, we learned how to download CSV files from a GitHub repository, with a demonstration of fetching COVID-19 daily data tables from January 22 to March 31 shared by Johns Hopkins University. In doing so, we changed the file name at the end of the URL for each data table repeatedly, spending 5 minutes or a little longer.
That was fine for that particular project. But what if we want to download hundreds or thousands of files at once? Repeating the process manually suddenly becomes a daunting task and we would feel too tired before the get-go. Fortunately, we can automate such a repetitive task using loop functions, as shown below with the exact same project in the last post: create a database in a computer, download and store the COVID-19 daily data tables in the database using R.
Before that, however, you may want to delete the database CoronaDB.db if you have made it following the last post, just to observe that the same thing will have been created automatically after executing commands in this post. You could do so using R, but why don't we take a brief look at DB Browser for SQLite we downloaded in the first post?
When you activate your DB Browser for SQLite, it will show a window of the right image:
You can play with it a little bit if you want to. When you are done, click on "Open Database" on the toolbar.
You will see the window on the right image pop up:
CoronaDB is found at the bottom. You would open it to load the database, but for now, just select and delete the file here. You can close DB Browser for SQLite as well.
Alright, let's get back to the main task here. We start by loading necessary packages as usual.
library(RSQLite)
library(readr)
library(foreach)
Note that we are using a new package called foreach. We will see what each package is needed for later.
Let's build a connection between R and the database CoronaDB.db through SQLite. Several functions in the package RSQLite are used here:
conn <- dbConnect(RSQLite::SQLite(), "CoronaDB.db")
Now, we are going to do something new to prepare for the automation. Run the following command:
dates <- seq(as.Date("01-22-2020","%m-%d-%Y"), as.Date("03-31-2020", "%m-%d-%Y"), by=1)
Here, we are making a sequence of the dates at the end of the URLs for the CSV files. We define a variable called dates as a sequence by coding dates <- seq( , , ).
In it, the first and last entries of the sequence, as well as how many days each step moves forward, need to be defined. Let's analyze the first entry term, as.Date("01-22-2020","%m-%d-%Y"), to see what it stands for.
The as.Date function tells R that the data type in the immediate parentheses is date. The value is "01-22-2020" (don't forget the quotation marks), and the order is "%m-%d-%Y", month-date-year. You can see the same pattern in the ending date.
Finally, we tell R that each turn of the repetition should move just one day forward, because we don't want to skip any day, by setting the last term as by=1.
Great. Let's check how the sequential variable dates looks like. For that, just type "dates" and run it.
dates
You'll get this:
You can see all the days between Jaunary 22 and March 31, 2020, including the leap date February 29, stored. But... did you notice the problem here? Even when you create a date in the order as month-day-year, R stores it in the year-month-day order. So, we are going to place the following draconian order for R, and check what we've got.
dates <- format(dates, format = "%m-%d-%Y")
dates
Awesome!
Now, let's generate the names of data tables we will create in our database. In the last post, we manually wrote names like "January22" and "February16", but this time, we are naming them automatically. Fortunately, we can use the sequence of dates just made above. But the dbWriteTable function we will use cannot recognize values of date type, so we need to convert them into characters and store as a separate variable called names. To do so, run the following code.
names <- as.character(dates)
Ok, now we are ready for the automated file fetch!
Let's deconstruct and examine this last chunk of codes:
foreach(d = dates, i = names) %do% {
file = paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/", d, ".csv")
data <- read_csv(url(file))
dbWriteTable(conn, i, data, overwrite = T)
}
If you are an absolute newbie to coding, you may feel overwhelmed as I did. But don't worry.
Here, we use foreach function in the foreach package we loaded in the beginning. It indicates for which sequences of values R should do what is in the brackets, { }, after the %do% term. We want to repeat the downloading process automatically using dates as part of the URLs and names as the names for data tables in the database. So we write:
foreach(d = dates, i = names) %do% {...}
Now, what exactly is the process we want to repeat? It is the following:
Step 1. Specify a URL to download a CSV file from.
Step 2. Read the CSV file found in the URL specified into R.
Step 3. Store what was read as a data table into our database.
The code for Step 1 is this:
file = paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/", d, ".csv")
Here, we create the variable file, which stores the URL formed with the terms in the paste0( ) function. The first building block is the major part of the URL identical across the dates, namely:
"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/".
Don't forget the quotation marks.
As we saw in the last post, this identical part is followed by the date which is unique to each file and thus needs to be changed in each turn of the repetition. That is why we specified this as a variable d = dates in the previous code. So, we put d after the identical part. Then, it is followed by the file type indicator ".csv" which is also identical across the dates. Done.
Now, Step 2.
data <- read_csv(url(file))
We use the read_csv( ) and url( ) functions in the readr package to read the CSV file found in the URL stored in the variable file we just created. The contents are stored in the variable named data.
Finally, Step 3.
dbWriteTable(conn, i, data)
Like we did in the last post, we use the dbWriteTable function in the RSQLite package. The conn indicates the connection to our database CoronaDB.db. The table name has to differ for each data table, and it was for this that we set i = names earlier. The contents of each data table is the variable data created in Step 2.
That's it!! Let's put everything together:
library(RSQLite)
library(readr)
library(foreach)
conn <- dbConnect(RSQLite::SQLite(), "CoronaDB.db")
dates <- seq(as.Date("01-22-2020","%m-%d-%Y"), as.Date("03-31-2020", "%m-%d-%Y"), by=1)
dates <- format(dates, format = "%m-%d-%Y")
names <- as.character(dates)
foreach(d = dates, i = names) %do% {
file = paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/", d, ".csv")
data <- read_csv(url(file))
dbWriteTable(conn, i, data)
}
Run this simple set of codes and you will see 70 CSV files downloaded and stored in your database full-auto in seconds!
Reactivate DB Browser for SQLite and click on "Open Database". You will see CoronaDB.db again.
Select the file and open it. There are 70 data tables, as shown in the image below.
We did it!! It's exactly what we wanted.
In a future post, we'll get at clearing up the data we just got. Until then, keep up with the metrics!
コメント