
7. Speed Up Academic Study with a Data-Driven Approach

As I mentioned in my self-introduction, I finished my doctoral study in March 2020 and earned a degree. It took me 3 years (after another 2 years in the master's program), which is the minimum time a student must stay enrolled to complete a doctoral program. It could have taken much longer for several reasons, including procrastination on my part, but the most likely cause of delay is not getting a good result from data analysis. You decide on a research topic, form a set of hypotheses, collect the data, and analyze them to verify the hypotheses, only to find... nothing: no correlation, no interesting patterns! You may have to re-frame your research questions or recollect data, or your supervisor may even tell you to change the research topic altogether. No wonder it can take many years.


In hindsight, I managed to finish my research in 3 years thanks to a professor's savvy advice. My program held a poster presentation by every student at the beginning of each semester, followed by a welcoming party for new students. He was a reviewer of one horrible poster, done by a procrastinating doctoral student who, at the start of the second half of his doctoral program, still had no idea what he was doing: me. At the party, he walked up to me with that inquisitive look on his face, so iconic of him. Holding a glass of beer at shoulder height, he tilted his head, as he always does when he is about to give a deeply considered remark.

"You know, you could take a data-driven approach, rather."

That quip eventually transformed the procrastinator into a producer of something, allowing me to start and finish my research in the remaining one and a half years. In the following, I'm going to explain that magic.


First, let's take a look at this slide.


It shows the flow of regular, hypothesis-driven - or positivist, in the ivory tower vernacular - research. First, a researcher reads the existing literature to find research gaps, issues not yet addressed. Academic work must begin that way: presenting something already established as your own finding is not only unacceptable but even punishable; it's called plagiarism.


Once some gaps are identified, the researcher asks questions whose answers will fill them. Then, she builds hypotheses on how the answers to her questions can be derived from observations of facts, or data. For example, if the question is whether an economic stimulus policy has been effective, comparing the data of some economic indicators before and after the implementation of the policy is a good way to answer the question. The hypothesis in this case would be "If the policy has been effective, the values of so-and-so economic indicators would show an increase in comparison with the values before the implementation, unless there has been an opposite effect starting simultaneously (e.g., a recession)". Typically, if she is using statistical methods, she would form a null hypothesis, "The policy had no effect: the indicator shows no significant difference before and after the implementation", which she will try to reject by showing that the policy DID have an effect.
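To make that concrete, here is a minimal sketch of such a before/after test in Python. The indicator values and the choice of Welch's t-test are my own illustrative assumptions, not part of the original example; any indicator series and any appropriate test could be substituted.

```python
from scipy import stats

# Hypothetical monthly values of an economic indicator before and after the
# policy took effect (made-up numbers, purely for illustration).
before = [98.2, 99.1, 97.8, 98.5, 99.0, 98.7]
after = [101.4, 102.0, 101.1, 102.6, 103.0, 102.2]

# Welch's t-test: is the mean of the indicator significantly different between
# the two periods? The null hypothesis is "no difference", i.e. "no effect".
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A p-value below the chosen significance level (say, 0.05) would be the ground
# for rejecting the null hypothesis and claiming the policy had an effect.
```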


Then, the researcher will collect relevant data and clean it up for analysis. Now, here is the tricky part in academic research: she MUST reject the null hypothesis. She is not allowed to conclude her research by saying "The analysis found no significant difference in the data before and after the implementation of the policy, and therefore the null hypothesis is not rejected: there is no evidence that the economic policy has been effective". Sure, that is a finding, but for it to be a valid academic output, the finding must be one that rejects the null hypothesis (unless the failure to do so is used as a step in another essential argument, which in this example might be a proposal of alternative policies, shown to be effective by rejecting the null hypothesis for them). So, if she cannot reject the null hypothesis, the analysis must be redone with different methods until she succeeds in rejecting it. If she still can't, she needs to add more work, like a proposal of alternative policies, or change her research topic altogether. That means doing it all over again from the literature review; she may just give up if she doesn't find it worth the effort. This is one of the reasons graduate school can take so many years (the difficulty of finding a good research topic is another).


But that's old school. As you can see in this Wikipedia article, the positivist scientific method was established roughly a century ago. Back then, data analysis was a daunting task: everything had to be calculated by hand or with non-electric mechanical calculators. Data analysis cost so much time and money that researchers had to identify the analyses worth conducting by first forming hypotheses. Today, however, data analysis can be far less costly than building a good hypothesis, thanks to powerful and cheap computers and programming languages like R and Python. If you know how to code, you can automate the tasks and calculations; an entire analysis can be done instantly, without any human working out even "1 + 1". That is why, today, quite to the contrary, it can be even better to conduct the data analysis first, in order to identify hypotheses worth testing. Check this slide.
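As a tiny illustration of how cheap such exploration has become, here is a sketch in Python that scans a whole table of variables for strong correlations in a few lines. The file name "indicators.csv" and the 0.8 threshold are my own assumptions for the example.

```python
import pandas as pd

# A hypothetical CSV of numeric variables, one column per indicator.
df = pd.read_csv("indicators.csv")

# The full correlation matrix in one line; nobody works out even "1 + 1" by hand.
corr = df.corr(numeric_only=True)

# Print every pair of distinct variables whose correlation is stronger than 0.8.
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.8:
            print(f"{a} ~ {b}: {corr.loc[a, b]:.2f}")
```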


This alternative scheme also starts with a literature review (a process which can itself become semi-automatic with text mining), but it is followed directly by data analysis. The researcher collects and studies, with automation using programming languages, the data related to her research topic. When she finds some interesting patterns in the variables and in their relationships with each other, she can ask where they come from. In the example used above, let's say the researcher is interested in the stock market and, using text mining on Google, found that no academic work on the effect of the Federal Reserve's stimulus policy amid the COVID-19 pandemic has been published. Then she can collect historical market data from sources like Alpha Vantage (or just look it up at MarketWatch), then visualize and analyze them. Let's say she finds that the Dow started to recover from the slump slightly before the public announcement of the stimulus, as it actually did. She can reason that "implementation of such a gigantic stimulus must be known to insiders at investment banks and other established financial institutions, and therefore the beginning of their reaction to the policy should precede the public announcement..." and support it with information found in the literature of economics and political science. She would then write up her null hypothesis, "the stimulus had no effect", and reject it by showing the result of her interrupted time series analysis. That's it. In the whole process, there is no uncertainty or trial and error as to the result of her analysis. By the time she wrote her null hypothesis, she already knew the result and the conclusion.
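For the curious, here is a minimal sketch of what such an interrupted time series analysis might look like as a segmented regression in Python. The file name, column names, and intervention date are my own assumptions for illustration; the daily closes could come from a source like Alpha Vantage and be saved to a CSV beforehand.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file of daily Dow closes with columns "date" and "close".
df = pd.read_csv("djia_daily.csv", parse_dates=["date"]).sort_values("date")

INTERVENTION = "2020-03-23"  # assumed announcement date of the stimulus

df["t"] = range(len(df))                               # time index
df["post"] = (df["date"] >= INTERVENTION).astype(int)  # 1 on/after the intervention
df["t_post"] = df["t"] * df["post"]                    # interaction term for the slope change

# Segmented regression: baseline trend (t), level change at the intervention (post),
# and slope change after it (t_post).
model = smf.ols("close ~ t + post + t_post", data=df).fit()
print(model.summary())

# Significant coefficients on `post` or `t_post` would be the ground for rejecting
# the null hypothesis that the stimulus had no effect.
```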


"That's cheating", one may say. How can you claim that such ex post facto hypotheses can be validated? Now we are talking about epistemology. For the sake of argument, we need to review the two major classical schools of thought: rationalism and empiricism...


Rationalism holds, basically, that new knowledge can be derived from what is already known by reasoning, just like the reasoning we saw above: "Implementation of such a gigantic stimulus must be known to insiders at investment banks and other established financial institutions, and therefore the beginning of their reaction to the policy should precede the public announcement", for there is such a thing as insider trading. The first set of examples in the slide, a pair of analytic and synthetic statements reasoned in the form of a syllogism, demonstrates such deduction.


Empiricists don't agree, because ultimately, they believe, each observation represents nothing but itself, and we cannot deduce anything else from it. The existence of black swans shows that, at least technically, they are right. The last example is related to bounded rationality: in the midst of the pandemic, the governor of Tokyo could only strongly request that people stay home, because Japan's extremely pro-rights constitution is premised on an unequivocal optimism about human nature; a stark contrast with the options that were available to Ms. Jacinda Ardern. The authors of the constitution may have reasoned that everybody would behave so as to secure their own life and health. They didn't.


OK, now let's get back to positivism.

Roughly speaking, positivism can be seen as the combination of rationalism and empiricism. And its development must have been inevitable: the rationalist can speculate and hypothesize on as many subjects as he wants to, but he can never verify his claims; the empiricist can be sure about what he has observed, but he is absolutely ignorant about what is yet to be observed, nor would he run any experiment - he has no reason to believe it would help. In positivism, rationalism takes part in hypothesis building (including the construction of the methodology to test it), while testing the hypothesis with data falls within the domain of empiricism. Reason unveils how knowledge can be deduced from the data; but what is inside the data is a matter not of reason but of sense experience.


What we can deduce from this is that hypothesis building and hypothesis testing have completely distinct and independent sources of validity. What is important is that the hypothesis (or the methodology, in a broader sense) is built on good reasoning and that it is tested with good data; which comes first isn't.


Data science enables this data-driven approach and is producing knowledge at a faster pace than ever. If you are curious and want to try it yourself, here is how to set up your PC to get started.
