Introduction

Statistical arbitrage is a class of trading strategies that profits by exploiting what are believed to be market inefficiencies, identified through statistical and econometric techniques. Note that the 'arbitrage' part should by no means suggest a risk-free strategy; rather, it is a strategy in which risk is statistically assessed.
One of the earliest and simplest types of statistical arbitrage is pairs trading, first introduced in the mid-1980s by a group of analysts at Morgan Stanley. In short, pairs trading is a strategy in which two securities that have been co-integrated over a certain time period are identified, and short-term deviations from their long-term equilibrium are traded in the expectation that the spread reverts, thereby generating profits.
 

Problem Statement

Finding co-integrated pairs within a defined stock universe is very compute-intensive. For instance, for a universe defined by the S&P 500 index with 500 stocks, the number of candidate pairs to evaluate for co-integration is 500 × 499 / 2 = 124,750, a massive number. A better approach is to group similar stocks together based on identifiable and quantifiable characteristics, and then search for co-integrated pairs within each group. One option is to use the GICS classification to group stocks by sector. However, a major drawback of this method is that the definitions are susceptible to abrupt changes (e.g., the March 2023 GICS revision moved both Visa and Mastercard, previously part of the GICS Information Technology sector, to the Transaction & Payment Processing Services sub-industry under the Financials sector). Moreover, many major conglomerates are engaged in businesses across multiple sectors and industries, which makes sector-based stock clustering inefficient.
In this work, we explore grouping stocks based on a combination of their historical financial ratios and company descriptions (textual data derived from wiki pages and company filings), and evaluate whether this leads to better stock clusters and thereby better co-integrated pairs. We do so by finding co-integrated pairs within the obtained stock clusters and then backtesting a mean-reversion trading strategy on a rule-based portfolio of stock pairs to analyse its performance.
 
There are two parts to this project implementation: stock clustering (Part A), and pairs formation and backtesting of the trading strategy (Part B).
 

Data Sources

  1. S&P 500 constituents were considered for the study - in total 503 stocks.
  1. Daily OHLCV data for all stocks were extracted from OpenBB and Alphavantage APIs. Time period used - Jan 2010 to Mar 2023.
  1. Annual measures for 56 different financial ratios were fetched from the Financial Modeling Prep API for all stocks over the stated time period.
  1. Company profile data and business descriptions were fetched from EDGAR and Alphavantage.

1. Stock Clustering

1.1. Methodology:

  1. Extract historical financial ratios for all stocks from fiscal year 2010 to fiscal year 2019 and store them in a database
    1. clean & preprocess data (data imputation)
    2. feature extraction from historical ratios data (dimensionality reduction) and data transformation
  1. Fetch business descriptions (textual data) for all stocks from AlphaVantage:
    1. clean & preprocess data (remove stop words, stemming/lemmatisation)
    2. text vectorisation (TFIDF -> LSA)
    3. feature selection and data transformation
  1. Merge all features extracted from the pre-processing methods of both the data sources and create a single uncorrelated feature space for model fitting
  1. Explore various clustering models and fit these models on preprocessed and transformed dataset
    1. model fitting
    2. hyper-parameter tuning
  1. Model evaluation to select the best fit - intrinsic and extrinsic evaluation methods
  1. Pick the best model to fit the data and retrieve cluster labels for all observations (stocks)
 

1.1.1. Feature extraction from financial ratios data:

There are 5 main categories of financial ratios, namely liquidity, leverage, efficiency, profitability and market value ratios. We considered the following ratios (2 from each category), which are regarded as among the most influential factors based on a survey of the academic literature.
  1. Liquidity - Quick ratio and Cash ratio
  1. Leverage - Interest coverage and Debt-to-Equity ratios
  1. Efficiency - Asset Turnover and Receivables Turnover
  1. Profitability - ROA and Operating Margin
  1. Market Value - EV multiple and Dividend Payout Ratio
 
There are multiple ways of imputing missing values, including dropping the corresponding features or observations, filling with the mean/median, and using ML techniques. KNN imputation is one such technique, based on the K-Nearest Neighbours algorithm, which predicts the missing values from similar observations. In this project, missing values are imputed using the mean value from the K nearest neighbours found in the training set. KNNImputer can work with continuous, discrete and categorical data types, but not with text data. However, choosing the number of neighbours 'K' can be tricky, since it introduces a trade-off between robustness and speed: a small 'K' makes the computation faster but the results less robust, whereas a large 'K' yields more robust results at the cost of slower computation. An optimal 'K' value was selected by testing clustering performance.
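Below is a minimal sketch of this imputation step using scikit-learn's KNNImputer. The DataFrame name `ratios_df` and the default of K=5 are illustrative assumptions, not the tuned values used in the project.

```python
# Minimal sketch: KNN imputation of missing financial ratios (assumed DataFrame `ratios_df`).
import pandas as pd
from sklearn.impute import KNNImputer

def impute_ratios(ratios_df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Fill each missing value with the mean of the K nearest stocks in feature space."""
    imputer = KNNImputer(n_neighbors=k, weights="uniform")
    imputed = imputer.fit_transform(ratios_df)
    return pd.DataFrame(imputed, index=ratios_df.index, columns=ratios_df.columns)

# In practice, several K values would be tried and the one giving the best
# downstream clustering performance retained.
# imputed_df = impute_ratios(ratios_df, k=5)
```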
Scree plot for determining the optimal number of components to be extracted from dimensionality reduction of the feature variables.
The feature space was reduced to 12 principal components using PCA for dimensionality reduction.
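A minimal sketch of this PCA step is shown below, assuming `features` holds the imputed and scaled ratios matrix (e.g. `imputed_df.values` from the previous step); the component count of 12 follows the scree plot above.

```python
# Minimal sketch: dimensionality reduction of the financial-ratio features with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(features)    # PCA is sensitive to feature scale

pca = PCA(n_components=12, random_state=42)
ratio_components = pca.fit_transform(scaled)          # (n_stocks, 12)

# The cumulative explained variance supports the cut-off chosen from the scree plot.
print(np.cumsum(pca.explained_variance_ratio_))
```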
 

1.1.2. Feature extraction from textual data:

The TF-IDF method was used to vectorise the textual company data. TF-IDF is the product of the 'term frequency' of a term in a particular document and the 'inverse document frequency' of that term across the corpus. The higher the value, the more relevant the term is to that particular document. To learn more about the TF-IDF algorithm, click here.
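A minimal sketch of the vectorisation step is given below, assuming `descriptions` is a list of cleaned business-description strings (one per company); the vectoriser settings are illustrative.

```python
# Minimal sketch: TF-IDF vectorisation of company business descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
tfidf_matrix = vectorizer.fit_transform(descriptions)   # sparse matrix, n_docs x n_terms

print(tfidf_matrix.shape)   # typically 10K+ terms, hence the sparsity illustrated below
```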
Sample feature matrix of 10 randomly chosen feature variables depicting the sparseness inherent in the feature space.
 
Word Cloud plot for stock clusters obtained from the OPTICS model fit on the data.
The above heatmap depicts the sparseness of the feature matrix for a random sample of 10 features (non-zero values in black). It is evident from the figure that the matrix is extremely sparsely populated and thus carries little information per feature, while still being computationally expensive to work with. Hence, there is a high chance that these features will be given very low importance when combined with dense features such as financial ratios or price returns for stock clustering.
In order to reduce the matrix dimensionality and computational complexity, the feature space was reduced using Latent Semantic Analysis (LSA), which builds on the TF-IDF method to discover latent dimensions that relate the document and word dimensions of a document-term matrix (DTM). These latent dimensions can be interpreted as a set of themes (or topics) that capture the semantic relations between words. LSA extracts them using Singular Value Decomposition (SVD), a matrix factorisation technique that decomposes the large sparse matrix into three matrices: an orthogonal column matrix (document-topic), an orthogonal row matrix (word-topic), and a diagonal matrix of singular values. Thus, from a sparse DTM we obtain a dense document-topic matrix that can be used for document clustering.
LSA was applied to reduce the dimensionality of the TF-IDF sparse matrix from 10K+ features to the top 80 topics, while retaining ~25% of the explained variance.
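A minimal sketch of the LSA step via truncated SVD, taking the `tfidf_matrix` from the sketch above as input; the topic count of 80 matches the setting described here.

```python
# Minimal sketch: LSA on the sparse TF-IDF matrix using truncated SVD.
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=80, random_state=42)
text_topics = lsa.fit_transform(tfidf_matrix)       # dense matrix, n_docs x 80

print(lsa.explained_variance_ratio_.sum())          # roughly 0.25 in this setting
```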
 

1.1.3. Combined Feature space:

All the features extracted from both pre-processing pipelines were combined to form a single uncorrelated feature space for model fitting.
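A minimal sketch of the merge, assuming `ratio_components` (n_stocks x 12) and `text_topics` (n_stocks x 80) from the earlier sketches are aligned on the same stock ordering.

```python
# Minimal sketch: concatenating the ratio-based and text-based features column-wise.
import numpy as np

combined_features = np.hstack([ratio_components, text_topics])
print(combined_features.shape)   # (n_stocks, 92)
```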
 

1.1.4. Model Fitting:

Three clustering approaches were examined: partitional clustering (K-Means), density-based clustering (OPTICS) and hierarchical (agglomerative) clustering. To learn in-depth about these clustering algorithms, follow article1 and article2.
Partitional Clustering (K-Means)
Density-based Clustering (OPTICS)
Hierarchical Clustering (Agglomerative)
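A minimal sketch of fitting the three model families on the combined feature space; the hyper-parameters shown are illustrative placeholders, not the tuned values from the project.

```python
# Minimal sketch: fitting the three clustering families on `combined_features`.
from sklearn.cluster import KMeans, OPTICS, AgglomerativeClustering

kmeans_labels = KMeans(n_clusters=20, n_init=10, random_state=42).fit_predict(combined_features)
optics_labels = OPTICS(min_samples=4, metric="euclidean").fit_predict(combined_features)
agglo_labels = AgglomerativeClustering(n_clusters=20, linkage="ward").fit_predict(combined_features)
```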
 

1.1.5. Model Evaluation:

A number of model evaluation techniques were used, as detailed in this article. Based on the research and the evaluation results, the OPTICS clustering model was chosen to fit the combined feature space and derive the cluster labels.
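As one example of an intrinsic evaluation metric, a silhouette-score check might look like the sketch below; OPTICS labels noise points as -1, and these are excluded before scoring. This illustrates the kind of evaluation used rather than the full evaluation suite.

```python
# Minimal sketch: intrinsic evaluation of the OPTICS labels with the silhouette score.
from sklearn.metrics import silhouette_score

mask = optics_labels != -1   # drop points OPTICS marked as noise
score = silhouette_score(combined_features[mask], optics_labels[mask])
print(f"Silhouette score (OPTICS, noise excluded): {score:.3f}")
```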
 

1.2. References:

  1. Understanding OPTICS and Implementation with Python - TDS
  1. Imputing missing data with simple and advanced techniques - TDS
  1. Machine Learning for Stock Trading: Natural Language Processing - AV

2. Pairs Trading

Pairs trading is a mean-reversion strategy that relies on the assumption that the relative prices of a pair will follow their historical relationship over the mid to long term, with deviations only in the short term. Reasons for this short-term divergence in relative prices include a large number of shares bought by a single investor, or differences in the attention investors pay to one security of the pair relative to the other. These strategies are largely market neutral (excluding transaction costs): irrespective of the overall direction of the market, one of the two legs, long or short, will be profitable, and gains come from the relative movement of the pair. However, the major limitation to this market neutrality is the scenario in which the pair does not converge, i.e. the two securities diverge even further or both consistently move in the same direction. In a way, the trader is writing an insurance contract against the long-term divergence of the pair of securities and collects the corresponding premium by betting on short-term variations.
There are various forms and techniques for developing a pairs trading strategy; follow this wonderful article by Hudson & Thames to learn more about them. In this project, a co-integration test was employed to find strong pairs, and a deterministic trading strategy was backtested. Let's dive in to see the details, shall we?
 

2.1. Methodology:

  1. Using the stock cluster labels obtained in the Stock Clustering section, find the best co-integrated stock pair within each cluster (over the 10-year formation period from 2010 to 2019) by conducting the Engle-Granger two-step co-integration test at a 5% significance level.
  1. Define the backtesting period (CY 2020) and set the pair eligibility criteria for inclusion in the portfolio.
  1. Define the trading strategy, using Bollinger Bands on the pair spread to generate the required trades, including stop-loss bounds.
  1. Compute returns after factoring in nominal transaction costs and evaluate the portfolio performance against the S&P 500 index.
 

2.1.1. Correlation vs Co-integration:

A stationary process is characterised by a constant mean, variance and autocorrelation structure. Non-stationary time series do not have the tendency to revert to their mean, whereas stationary series do. Correlation is a measure of the linear relationship between stationary variables. As a consequence, when dealing with non-stationary variables, which is the case with most financial price series, correlation is often spurious. Enter co-integration!
Co-integration allows us to construct a stationary time series from two non-stationary series, X and Y, each integrated of non-zero order d, provided we can find a co-integration coefficient 'beta' such that the linear combination Y - beta*X is integrated of an order less than d (ideally zero, so that it is stationary). According to Alexander (2002), price, rate and yield data can be assumed to be I(1) series, while returns (obtained by differencing prices) can be assumed to be I(0). The most important property of an I(0) series relevant to statistical arbitrage is that it is weak-sense stationary.
Correlation has no well-defined relationship with co-integration: co-integrated series might have low correlation, and highly correlated series might not be co-integrated at all. Correlation describes a short-term relationship between returns, whereas co-integration describes a long-term relationship between prices.
In this case, 'beta' was derived via the two-step Engle-Granger procedure in Python, which performs a linear regression between the two asset price series and checks whether the residual is stationary using the Augmented Dickey-Fuller (ADF) test. If the residual is stationary, the two asset prices are said to be co-integrated, and the co-integration coefficient is obtained as the coefficient of the regressor. To summarise, co-integration implies that the variables share a common stochastic trend, and therefore their residual spread follows a stationary process.
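A minimal sketch of the two-step procedure with statsmodels is shown below, assuming `price_x` and `price_y` are aligned pandas Series of daily prices for a candidate pair; statsmodels also offers `statsmodels.tsa.stattools.coint` as a one-call alternative.

```python
# Minimal sketch: two-step Engle-Granger co-integration test for one candidate pair.
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# Step 1: regress Y on X to estimate the co-integration coefficient (beta).
X = sm.add_constant(price_x)
ols_result = sm.OLS(price_y, X).fit()
beta = ols_result.params.iloc[1]          # slope on price_x

# Step 2: test the regression residual (the spread) for stationarity with ADF.
spread = price_y - beta * price_x
adf_stat, p_value, *_ = adfuller(spread)
is_cointegrated = p_value < 0.05          # 5% significance level, as in the formation step
```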
 

2.1.2. Portfolio Construction and Trading Strategy:

To construct a high-quality portfolio of co-integrated pairs, a robust pair selection criterion was established. The top long-term co-integrated stock pairs selected from each cluster were tested again for co-integration on the out-of-sample backtest period, i.e. CY 2020, and only the pairs that passed were retained for the backtest portfolio. An additional layer of filtering was added whereby only pairs whose spread crossed its mean at least a defined minimum number of times during the backtest period were retained. With all this refinement, the portfolio finally boiled down to 2 stock pairs.
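The mean-cross filter could be implemented along the lines of the sketch below, assuming `spread` is the pair's spread series over the backtest period; the threshold value is an illustrative placeholder.

```python
# Minimal sketch: counting how often the spread crosses its mean over the backtest period.
import numpy as np

def passes_mean_cross_filter(spread, min_crosses=12):
    """Return the number of mean crosses and whether it meets the minimum threshold."""
    demeaned = spread - spread.mean()
    crosses = int((np.sign(demeaned).diff().abs() > 0).sum())
    return crosses, crosses >= min_crosses
```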
 
Defining a trading strategy consists primarily of setting the trading period and the thresholds at which to enter and exit a position. Care was taken to establish a stop-loss to limit excessive losses resulting from any failure of the strategy or other market factors. The thresholds were set as a number of standard deviations relative to the normalised spread signal between the two securities in the pair.
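A minimal sketch of such Bollinger-Band style thresholds on the normalised spread is given below; the window and band widths are illustrative, not the values tuned in the project.

```python
# Minimal sketch: entry/exit/stop-loss signals from a rolling z-score of the spread.
import pandas as pd

def bollinger_signals(spread: pd.Series, window: int = 20,
                      entry_z: float = 2.0, exit_z: float = 0.5, stop_z: float = 3.0) -> pd.DataFrame:
    mean = spread.rolling(window).mean()
    std = spread.rolling(window).std()
    zscore = (spread - mean) / std

    signals = pd.DataFrame(index=spread.index)
    signals["long_entry"] = zscore < -entry_z     # spread unusually low: buy Y, sell X
    signals["short_entry"] = zscore > entry_z     # spread unusually high: sell Y, buy X
    signals["exit"] = zscore.abs() < exit_z       # spread has reverted towards its mean
    signals["stop_loss"] = zscore.abs() > stop_z  # cut losses if the divergence keeps widening
    return signals
```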
 

2.1.3. Backtesting and Performance Analysis:

The backtesting environment was developed as per the design detailed above and trading was simulated, with the following results.
  • Although pairs trading is an absolute returns strategy, SPY, tracking the S&P 500, was used as a benchmark in this case for study purposes. The portfolio of co-integrated stock pairs outperformed the benchmark by generating higher returns per unit of risk taken.
  • The portfolio also witnessed lower volatility and lower maximum drawdown relative to SPY, indicating its low risk characteristics and resilience.
  • Trade execution for both pairs was strong, with the percentage of profitable trades averaging more than 85%.
The portfolio backtest generated a cumulative return of ~35% over calendar year 2020, outperforming the benchmark on risk-adjusted return with a Sharpe ratio of 0.1456.
SPY, tracking the S&P 500, weathered a massive drawdown of ~30% during the Covid-19 market crash in early 2020, bouncing back strongly to return ~20% by year end.
 
The stock pair with Autodesk Inc. and Idexx Laboratories had an average holding period of ~15 days and 75% of the trades were profitable.
The pair with Honeywell International and Waters Corporation had an average holding period of 21 days and 100% of the trades were profitable.
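For reference, the headline portfolio statistics quoted above (cumulative return, Sharpe ratio, maximum drawdown) can be computed along the lines of the sketch below, assuming `daily_returns` is a pandas Series of net (post transaction cost) daily portfolio returns; the exact conventions used in the backtest may differ.

```python
# Minimal sketch: summary statistics from a daily portfolio return series.
import numpy as np
import pandas as pd

def performance_summary(daily_returns: pd.Series) -> dict:
    equity = (1 + daily_returns).cumprod()                               # growth of 1 unit of capital
    cumulative_return = equity.iloc[-1] - 1
    sharpe = np.sqrt(252) * daily_returns.mean() / daily_returns.std()   # annualised
    max_drawdown = (equity / equity.cummax() - 1).min()
    return {"cumulative_return": cumulative_return,
            "sharpe": sharpe,
            "max_drawdown": max_drawdown}
```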
 

2.2. Limitations:

  • The risk profile of pairs trading strategy may differ significantly from that of the benchmark, as it involves trading specific pairs of assets with both long and short-selling. Therefore, investors should consider their risk tolerance and investment objectives before adopting this strategy.
  • A longer time horizon and further analysis of the specific pairs traded would be necessary to better assess the sustainability and potential effectiveness of the pairs trading strategy (i.e. to ensure that the co-integration of each pair continues to hold).
 

References:

  1. Are simple Pairs Trading Strategies still profitable? - BSIC
  1. Pairs Selection Framework in Pairs Trading using Unsupervised ML - Market Neutral
  1. Co-integration Test - A key to find high probability trading strategy - here
  1. An Introduction to Co-integration for Pairs Trading - Hudson and Thames Reading Group
  1. Rad, H., Low, R. K. Y., & Faff, R. (2016). The profitability of pairs trading strategies: distance, cointegration and copula methods.
  1. Jacobs, H., & Weber, M. (2015). On the determinants of pairs trading profitability.
 
 
The Python code for the implementation is open source and can be easily accessed using this link.