Data mining in finance has long been a concern for academic researchers. Campbell Harvey, one of the authors of this paper, is leading the effort to ensure the integrity of empirical finance research. For example, see here for a post on his address to the AFA.
The concerns associated with data mining aren’t going away. A monster increase in affordable computing power is facilitating the use of machine learning to create predictive algorithms in finance. Machine learning algorithms have built-in defenses against data mining, but they aren’t foolproof. Moreover, the data required to do proper cross-validation do not exist in finance (at least in the investing realm…HFT may be a different story).(1)
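To see why data scarcity bites, consider a minimal sketch of walk-forward (expanding-window) validation on a synthetic monthly return series. The numbers below are simulated and the split rule is an illustrative assumption, not the paper's method; the point is simply how few observations each validation fold actually contains.

```python
# Sketch: walk-forward (expanding-window) validation on a short return
# series, illustrating how little data each fold actually provides.
# All numbers are synthetic; nothing here is a real strategy.
import numpy as np

rng = np.random.default_rng(0)
monthly_returns = rng.normal(0.005, 0.04, size=120)  # ~10 years of months

def walk_forward_splits(n_obs, n_folds):
    """Yield (train_idx, test_idx) pairs with an expanding train window."""
    fold_size = n_obs // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = np.arange(0, k * fold_size)
        test = np.arange(k * fold_size, (k + 1) * fold_size)
        yield train, test

sizes = [(len(tr), len(te))
         for tr, te in walk_forward_splits(len(monthly_returns), 4)]
print(sizes)  # each test fold holds only 24 monthly observations
```

Even a decade of monthly data leaves each out-of-sample fold with just two years of observations, far too few to distinguish skill from luck with any confidence.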
This paper addresses a basic question related to the use of quantitative methods (including machine learning) in the context of finance:
Can we develop a sensible research protocol to deal with data-mining concerns?
What are the Academic Insights?
The authors propose a great research protocol. Below we summarize the seven steps of the protocol with our simple key takeaway on each. Readers should dig into the paper for more detail on each component to fully appreciate what is being proposed.
Start any research project with an ex-ante hypothesis, driven by economic foundations
For a “winning strategy,” ask the following question: Who is on the other side of the trade?…and why?
Multiple testing and statistical methods
How many variables were tried?
How many combinations were used?
Do you have enough data to justify the value of additional complexity? Probably not.
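The multiple-testing concern above is easy to demonstrate. In the sketch below (all data simulated; the 200-signal count and the 2.0 t-stat bar are illustrative assumptions), we test many random "signals" against pure-noise returns; the best candidate clears the usual significance threshold by luck alone.

```python
# Sketch of the multiple-testing problem: correlate many random
# "signals" with pure-noise returns and watch the best t-statistic
# clear the usual 2.0 bar by chance. All data are simulated.
import numpy as np

rng = np.random.default_rng(42)
n_months, n_signals = 240, 200        # 20 years, 200 candidate variables
returns = rng.normal(0, 0.04, n_months)

t_stats = []
for _ in range(n_signals):
    signal = rng.normal(size=n_months)           # a meaningless predictor
    corr = np.corrcoef(signal, returns)[0, 1]
    t = corr * np.sqrt((n_months - 2) / (1 - corr**2))
    t_stats.append(abs(t))

print(max(t_stats))  # the "best" signal typically looks significant (>2)
```

With 200 tries, roughly nine spurious signals are expected to exceed |t| = 2 on noise alone, which is why the authors insist on counting every variable and combination that was tried.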
Data and sample choice
Live with the data you’ve been dealt — don’t cherry-pick, transform, “clean”, and winsorize at random…
…but also make sure the data are accurate (e.g., market cap doesn’t exceed 10 trillion for 20 percent of the data set)
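A minimal sketch of what rule-based (rather than ad hoc) data handling might look like. The market caps, the $10 trillion sanity ceiling, and the 1%/99% winsorization limits are all illustrative assumptions, not values from the paper.

```python
# Sketch: transparent, pre-specified data checks rather than ad hoc
# edits. Market caps, the sanity ceiling, and the winsorization
# percentiles below are illustrative assumptions.
import numpy as np

caps = np.array([2e9, 5e9, 1e10, 3e12, 9e13, 4e9])  # fake market caps (USD)

# 1) Flag impossible values instead of silently dropping or editing them.
MAX_PLAUSIBLE_CAP = 1e13  # $10 trillion: an assumed sanity ceiling
suspect = caps > MAX_PLAUSIBLE_CAP
print("suspect rows:", np.flatnonzero(suspect))

# 2) Winsorize at fixed, documented percentiles (not "at random").
lo, hi = np.percentile(caps, [1, 99])
winsorized = np.clip(caps, lo, hi)
```

The design point is that every adjustment is declared up front and applied uniformly, so a reader can reproduce exactly how the raw data became the analysis sample.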
There is no real “out of sample” at this point, save live trading data and fresh historical data.
Beware of structural change. Humans are tricky animals with evolving tastes.
Avoid “tweaking” a model based on live results.
Keep things as simple as possible, but no simpler.
We don’t have enough data to truly assess the value of complexity.
Reward good processes, not good results (h/t to Annie Duke for expressing a similar idea in Thinking in Bets).
Do you know where the bodies are buried? Probably not, so do your own research!
Why does it matter?
The authors make a simple, but important point:
When data are limited, economic foundations become more important.
Here is a “magical” backtest of a strategy that is long all stocks whose tickers have “s” as the third letter and short all stocks whose tickers have “u” as the third letter.
The strategy appears to pass in-sample and out-of-sample validation and shows no correlation with known factors — yet it has no economic foundation whatsoever.
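The ticker-letter rule can be sketched in a few lines. The tickers and returns below are made up for illustration; the point is that the selection rule has no economic content, so any backtest "success" it produces is pure data mining.

```python
# Sketch of the "magical" ticker-letter strategy: long tickers whose
# third letter is "s", short tickers whose third letter is "u".
# Tickers and returns are fabricated; this is data mining on display.
import numpy as np

rng = np.random.default_rng(7)
tickers = ["MSFT", "AAPL", "TSLA", "COST", "BUSY", "ABUX", "XYSQ", "QQUZ"]
returns = rng.normal(0.01, 0.05, size=len(tickers))  # one fake period

longs  = [t for t in tickers if len(t) > 2 and t[2].lower() == "s"]
shorts = [t for t in tickers if len(t) > 2 and t[2].lower() == "u"]

long_ret  = np.mean([returns[tickers.index(t)] for t in longs])
short_ret = np.mean([returns[tickers.index(t)] for t in shorts])
print("long:", longs, "short:", shorts, "spread:", long_ret - short_ret)
```

Ask the protocol's first question of this strategy — who is on the other side of the trade, and why? — and the absurdity is immediate, no matter how good the backtest looks.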
Machine learning offers a set of powerful tools that holds considerable promise for investment management. As with most quantitative applications in finance, misapplying these techniques can lead to disappointment. One crucial limitation involves data availability. Many of machine learning’s early successes originated in the physical and biological sciences, in which truly vast amounts of data are available. Machine learning applications often require far more data than are available in finance, which is of particular concern in longer-horizon investing. Hence, choosing the right applications before applying the tools is important. In addition, capital markets reflect the actions of people, which may be influenced by others’ actions and by the findings of past research. In many ways, the challenges that affect machine learning are merely a continuation of the long-standing issues researchers have always faced in quantitative finance. While investors need to be cautious — indeed, more cautious than in past applications of quantitative methods — these new tools offer many potential applications in finance. In this article, the authors develop a research protocol that pertains both to the application of machine learning techniques and to quantitative finance in general.
After serving as a Captain in the United States Marine Corps, Dr. Gray earned an MBA and a PhD in finance from the University of Chicago where he studied under Nobel Prize Winner Eugene Fama. Next, Wes took an academic job in his wife’s hometown of Philadelphia and worked as a finance professor at Drexel University. Dr. Gray’s interest in bridging the research gap between academia and industry led him to found Alpha Architect, an asset management firm dedicated to an impact mission of empowering investors through education. He is a contributor to multiple industry publications and regularly speaks to professional investor groups across the country. Wes has published multiple academic papers and four books, including Embedded (Naval Institute Press, 2009), Quantitative Value (Wiley, 2012), DIY Financial Advisor (Wiley, 2015), and Quantitative Momentum (Wiley, 2016).
Dr. Gray currently resides in Palmas Del Mar, Puerto Rico with his wife and three children. He recently finished the Leadville 100 ultramarathon race and promises to make better life decisions in the future.
Performance figures contained herein are hypothetical, unaudited and prepared by Alpha Architect, LLC; hypothetical results are intended for illustrative purposes only. Past performance is not indicative of future results, which may vary. There is a risk of substantial loss associated with trading stocks, commodities, futures, options and other financial instruments. Full disclosures here.