Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News

  • S.R. Das, S. Kim, B. Kothari
  • Journal of Financial Data Science, Spring 2019
  • A version of this paper can be found here

What are the Research Questions?

Last week I took you on a tour of using the data hidden in the language of the news. In this post, we turn the analysis of language to corporate emails. Unlike news, corporate emails are non-public information, so the data is used here to develop regulatory technology (RegTech) rather than to hunt for alpha. This paper applies natural language processing (NLP), a popular data science technique in finance, to develop an early-warning system for detecting corporate fraud and/or failure. Specifically, the authors attempt to answer the following research questions:

  • Does the sentiment conveyed by employee communications (i.e. emails) contain value-relevant information?
  • Is this information conveyed in a timely manner (i.e., does email sentiment lead subsequent stock returns)?
  • Do other structural characteristics of internal employee emails (e.g., email length, email volume, or email-network characteristics) also contain value-relevant information?
  • Which tends to contain more value-relevant information, the actual verbal content, or structural characteristics of employee emails?

What are the Academic Insights?

By analyzing a unique dataset made up of 113,000 emails from 144 Enron employees and 1,300 news articles that appeared on PR Newswire from January 2000 to December 2001, the authors find:

  1. YES, the authors observe trending patterns in the sentiment contained in emails. Specifically, they observe that positive sentiment in both emails and news articles declined into 2001 as Enron’s problems started to manifest.
  2. YES, the net sentiment of email content is a meaningful predictor of subsequent stock returns. Specifically, a one standard deviation decrease in the net sentiment gleaned from emails is associated with a 4.5% decline in stock returns (coefficient estimate = 2.347, t-statistic = 3.27).
  3. YES, the authors find that when the length of emails is added as an independent variable to the regression, it takes over in explaining the relation with future stock returns. In fact, every 20-character decline in email length is associated with a 1.17% decline in future stock returns. Additionally, the authors find that structural characteristics, such as the length of emails, contain the most value-relevant information.
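
To make the mechanics behind these findings concrete, here is a minimal Python sketch of the general idea: score each email for net sentiment, average the sentiment and email length per period, and regress next-period returns on the current-period feature. This is not the authors' implementation. The word lists, the function names, and the one-variable least-squares fit are all assumptions made for illustration; the paper's actual lexicon, controls, and regression specification are not given in this summary.

```python
import re
from statistics import mean

# Hypothetical toy word lists (assumption); not the lexicon used in the paper.
POSITIVE = {"good", "great", "growth", "strong", "profit"}
NEGATIVE = {"loss", "risk", "concern", "problem", "decline"}


def net_sentiment(text):
    """(positive - negative) word count divided by total words; 0.0 if no words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def period_features(emails):
    """Average net sentiment and average character length for one period's emails."""
    return {
        "net_sentiment": mean(net_sentiment(e) for e in emails),
        "avg_length": mean(len(e) for e in emails),
    }


def ols_slope(x, y):
    """Slope and intercept of a one-variable least-squares fit y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, my - b * mx


def predictive_slope(feature, returns):
    """Regress next-period returns on the current-period feature (one-period lead)."""
    return ols_slope(feature[:-1], returns[1:])
```

In this setup, a positive slope from `predictive_slope` on the net-sentiment series would correspond to the direction of the result in point 2: lower sentiment today, lower returns next period. Passing the average-length series instead gives the analogous check for point 3.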

Why does it matter?

The importance of RegTech has grown rapidly since the financial crisis; more than $160 billion has been paid in fines by various financial institutions. Also, about 10%-15% of the staff in financial institutions is dedicated to compliance (Arnold, 2016), so a RegTech solution could reduce costs. This paper develops a RegTech expert system solution to parse corporate email content to detect shifts in critical characteristics in a timely, efficient, and noninvasive manner. Admittedly, it’s hard to draw sweeping conclusions from one data set on a company that the researchers knew had failed. That, however, shouldn’t stop us from taking a deeper look at textual RegTech analysis of corporate management emails as a means to detect risk in a timelier fashion. It may also be used by regulators in their audit process because they can requisition such analyses from firms without intrusively reading emails. In the words of the authors:

“Early detection and prevention is better than a cure”

The Most Important Chart from the Paper:

The results are hypothetical results and are NOT an indicator of future results and do NOT represent returns that any investor actually attained. Indexes are unmanaged, do not reflect management or trading fees, and one cannot invest directly in an index.


In this paper, we demonstrate how an applied linguistics platform may be used to parse corporate email content and news to assess factors predicting escalating risk or the gradual shifting of other critical characteristics within the firm before they are eventually manifested in observable data and financial outcomes. We find that email content and news articles meaningfully predict increased risk and potential malaise. We also find that other structural characteristics, such as the average email length, are strong predictors of risk and subsequent performance. We present implementations of three spatial analyses of internal corporate communication, i.e., email networks, vocabulary trends, and topic analysis. Overall, we propose a RegTech solution by which to systematically and effectively detect escalating risk or potential malaise without the need to manually read individual employee emails.
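
Of the three analyses named in the abstract, the email network is the easiest to illustrate. The sketch below is an assumption-laden example, not the paper's method: it takes hypothetical (sender, recipient) pairs, counts messages per pair, and reports how many distinct contacts each person writes to and hears from. The function names and the choice of degree as the network characteristic are illustrative only.

```python
from collections import Counter, defaultdict


def build_network(messages):
    """Count emails per (sender, recipient) pair from an iterable of 2-tuples."""
    edges = Counter()
    for sender, recipient in messages:
        if sender != recipient:  # ignore self-addressed mail
            edges[(sender, recipient)] += 1
    return edges


def degree_summary(edges):
    """Distinct contacts each person emails ("out") and hears from ("in")."""
    out_contacts, in_contacts = defaultdict(set), defaultdict(set)
    for sender, recipient in edges:
        out_contacts[sender].add(recipient)
        in_contacts[recipient].add(sender)
    people = set(out_contacts) | set(in_contacts)
    return {p: {"out": len(out_contacts[p]), "in": len(in_contacts[p])} for p in people}
```

Tracking how these counts change from one period to the next is one simple way to look for shifts in communication structure without reading any message bodies.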


About the Author: Wesley Gray, PhD

After serving as a Captain in the United States Marine Corps, Dr. Gray earned an MBA and a PhD in finance from the University of Chicago, where he studied under Nobel Prize Winner Eugene Fama. Next, Wes took an academic job in his wife’s hometown of Philadelphia and worked as a finance professor at Drexel University. Dr. Gray’s interest in bridging the research gap between academia and industry led him to found Alpha Architect, an asset management firm dedicated to an impact mission of empowering investors through education. He is a contributor to multiple industry publications and regularly speaks to professional investor groups across the country. Wes has published multiple academic papers and four books, including Embedded (Naval Institute Press, 2009), Quantitative Value (Wiley, 2012), DIY Financial Advisor (Wiley, 2015), and Quantitative Momentum (Wiley, 2016). Dr. Gray currently resides in Palmas Del Mar, Puerto Rico, with his wife and three children. He recently finished the Leadville 100 ultramarathon race and promises to make better life decisions in the future.

Important Disclosures

For informational and educational purposes only and should not be construed as specific investment, accounting, legal, or tax advice. Certain information is deemed to be reliable, but its accuracy and completeness cannot be guaranteed. Third party information may become outdated or otherwise superseded without notice.  Neither the Securities and Exchange Commission (SEC) nor any other federal or state agency has approved, determined the accuracy, or confirmed the adequacy of this article.

The views and opinions expressed herein are those of the author and do not necessarily reflect the views of Alpha Architect, its affiliates or its employees. Our full disclosures are available here. Definitions of common statistics used in our analysis are available here (towards the bottom).

