*my note: Dr. Peter St. Onge is an avid student of the business cycle and a member of that rarefied class whose knowledge translates into real results. He is a frequent contributor to mises.org (of LvMI), one of the finest economics and political-economy institutions on the planet, and editor at Profits of Chaos. He “is an Austrian-school behavioral economist. His research focuses on asset valuation, business strategy and business cycles.”
In this piece, he references Dr. Nassim Taleb, who emphasizes the distinction between “fat-” and “thin-tailed” statistical phenomena. I encourage everyone in finance or econ, not least the aspiring student, to become familiar with this work (link below).*

[Excerpted from the November Austrian Investment Monthly. Download a full copy at St Onge Research]

“Big Data,” the latest and greatest data fad, makes the Austrian approach even more important to investors. How can savvy investors take advantage?

One key advantage of “Austrian” investing is using theory to guide your choice of data. This advantage is growing massively as Wall Street falls in love with the “Big Data” fad.

In “Big Data,” you toss the kitchen sink into a huge correlation, run your supercomputer, and out pop your recommendations.

What could possibly go wrong? Lots.

Best-selling financial author and professional gadfly Nassim Nicholas Taleb has been on the warpath against Big Data. In an article last year in Wired magazine, Taleb laid out his complaint: the bigger the data, the more likely it will generate noise masquerading as causation.

To see why, let’s back up and ask how statistical studies are born. Like the proverbial sausage machine, the reality is pretty ugly.

The way it’s supposed to work is that a researcher asks an important question, seeks out the best data to answer the question, then runs an equation called a regression. The regression spits out an association between A and B: how often they occur together, how big the association is, and how likely it is that the association is just accidental noise. The standard in academia is to reject anything that has a greater than 5% chance of being noise, meaning that roughly 1 out of 20 tests run on pure noise will still pass as a finding (remember that “19-times-out-of-20” you always hear in opinion polls? That’s referring to this noise cut-off).

So what are some of the problems here?

First, regressions only tell you associations, not causation. So if band aids are associated with scraped knees, the data doesn’t tell you which caused which. Perhaps band aids cause the scrapes. This may not be a problem when the causation is obvious, but interesting questions are rarely obvious. For example, rising oil prices are associated with economic booms, but do they cause booms or vice versa?

The standard “fix” to figure out causation is lagged data: one data set is earlier than the other. Even here, you can be in a bind. Again with oil, high oil prices are associated with a lagged recession. But again, did the oil cause the recession, or did they both come from some third factor, such as a boom?
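To make the “third factor” worry concrete, here is a minimal sketch in Python. It is a toy model, not real oil or GDP data: the variable names, the 5,000 observations, and the assumption that a boom drives both series with equal strength are all made up for illustration. Oil has no direct effect on the recession in this setup, yet the two still correlate. The correlation only vanishes once you control for the boom using the standard partial-correlation formula.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_confounder(n_obs=5000, seed=0):
    """Toy model (an assumption): a boom drives both oil and recession."""
    rng = random.Random(seed)
    boom = [rng.gauss(0, 1) for _ in range(n_obs)]
    # Oil has NO direct effect on the recession here; both just follow the boom.
    oil = [b + rng.gauss(0, 1) for b in boom]
    recession = [b + rng.gauss(0, 1) for b in boom]
    r_oil_rec = pearson_r(oil, recession)
    r_oil_boom = pearson_r(oil, boom)
    r_rec_boom = pearson_r(recession, boom)
    return r_oil_rec, partial_r(r_oil_rec, r_oil_boom, r_rec_boom)


if __name__ == "__main__":
    raw, controlled = simulate_confounder()
    print(f"oil vs. recession, raw correlation:      {raw:.2f}")
    print(f"oil vs. recession, controlling for boom: {controlled:.2f}")
```

The catch, of course, is that you can only control for a third factor you thought to measure, and knowing which factor to measure is exactly what theory is for.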

Problem number two is the noise problem that Taleb complained about. Big Data indeed makes this problem worse, simply because Big Data represents the cheapening of regressions.

Cheap regressions are a problem because they mean theory drops out as a quality control. In the old days, when it took weeks of hard-core and tedious calculation to run a regression by hand, you wanted to be pretty sure you’d actually find an association. The very cost of regression forced researchers to use theory to “weed out” stupid ideas.

Today, however, it would take you literally 3 minutes on a home computer to go grab some data from the Census or UN, toss it into a correlation program like Stata, and spit it out.

Automate this process, which is what Big Data’s about, and you could literally spit out 1,000 regressions all day, every day. Going back to that 5% cut-off on noise, if you run 1,000 regressions on data with no real relationships in it, about 5% of them can be expected to pass the cut-off by chance alone. Meaning you’re generating roughly 50 false associations every single day.
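You can check this arithmetic yourself with a short Python sketch. It correlates 1,000 pairs of pure random noise and counts how many clear a 5% significance bar. Everything here is an illustrative assumption: the function names, the 100 observations per test, and the use of the Fisher z-transform as an approximate significance test in place of a full regression package.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def looks_significant(xs, ys, z_cutoff=1.96):
    """Approximate two-sided 5% test via the Fisher z-transform."""
    r = pearson_r(xs, ys)
    z = math.atanh(r) * math.sqrt(len(xs) - 3)
    return abs(z) > z_cutoff


def count_false_positives(n_tests=1000, n_obs=100, seed=0):
    """Correlate pairs of unrelated noise and count the 'findings'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        xs = [rng.gauss(0, 1) for _ in range(n_obs)]
        ys = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to xs
        if looks_significant(xs, ys):
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 pure-noise pairs passed the 5% cut-off")
    # Chance that at least one of 1,000 independent tests passes by luck.
    print(f"P(at least one false hit) = {1 - 0.95 ** 1000:.6f}")
```

The exact count moves around with the random seed, but it should land in the neighborhood of 50. None of those “findings” reflects anything real, because the data was noise by construction.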

Now, those false associations will be on anything under the sun. So what does a PR-minded researcher, or a lying institution, do? You just cherry-pick from those 50 anything that confirms what you want to say. Then you publish it. It’s statistically valid: you did use the 5% cut-off, and you used the correct data. Of course it’s noise, but the noise is coming from the fact that you originally ran 1,000 regressions. And you don’t have to tell anybody how many you ran.

This hustle is actually a cousin to the classic newsletter fraud, in which I’d launch 32 newsletters predicting rising stocks next month, and another 32 predicting falling stocks. Shut down the 32 letters that were wrong, split the survivors between “up” and “down” again, repeat each month, and after six months I’m left with a single newsletter that’s been right 6 times in a row. A perfect track record that I can promote the heck out of. Never mind the other 63 letters I quietly shut down.
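The newsletter arithmetic is easy to verify with a few lines of Python. This is only a sketch: the function name is made up, and the coin-flip market is an assumption (the scheme works no matter what the market does, which is the point).

```python
import random


def run_newsletter_scheme(n_letters=64, n_months=6, seed=0):
    """Halve the surviving newsletters each month; return the survivors."""
    rng = random.Random(seed)
    survivors = list(range(n_letters))
    for _ in range(n_months):
        half = len(survivors) // 2
        # First half of the survivors predicts "up", second half "down".
        market_went_up = rng.random() < 0.5
        # Keep only the letters that happened to call it right.
        survivors = survivors[:half] if market_went_up else survivors[half:]
    return survivors


if __name__ == "__main__":
    winners = run_newsletter_scheme()
    print(f"Letters with a perfect 6-month record: {len(winners)}")
    print(f"Letters quietly shut down: {64 - len(winners)}")
```

Since 64 is 2 to the 6th power, six rounds of halving always leave exactly one “genius” newsletter standing, with zero forecasting skill required.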

In the same way, anybody who runs enough correlations will get enough noise to spend a career selling false correlations that, nevertheless, were produced using sterling statistical techniques.

How should a savvy investor react to these data games? Either you can reject all research, knowing that perhaps 90% of it is noise or cherry-picking, or, as I recommend, you can apply the theory filter that researchers should have applied themselves. That means asking whether the causation is plausible. If it’s not, then you toss the correlation. If it is plausible, then you at least keep an open mind that it might, indeed, be true.

Either way, the key here is protecting yourself from the false confidence that comes with the rise of Big Data. Things are going to get much noisier from here, with all those cheap correlations, and the gap in performance between theory-informed evaluations and blind acceptance of Big Data is going to continue to widen.