Pairs trading is a strategy that takes advantage of the fact that certain assets tend to move in sync, which means that the spread between them tends to revert to a mean over time.

Imagine two companies that operate in the same sector. Even though each company's share price is different and specific to that company, both are affected by the ups and downs of the market they belong to. So if demand goes up or down in that sector, both companies are affected by it. A trader could recognize this pattern and open a pairs trade when the spread between the two stocks widens.

So, for example, if the price of stock A increases while the price of stock B remains the same, the trader would go long (buy) stock B and go short (sell) stock A. The assumption is that the spread will close over time, reverting to its mean and resulting in a profit.
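To make the arithmetic concrete, here is a minimal sketch of the profit and loss on one such trade. The prices and the helper name pair_trade_pnl are hypothetical, chosen only for illustration, and transaction costs and borrowing fees are ignored.

```python
def pair_trade_pnl(entry_a, entry_b, exit_a, exit_b, shares=1):
    """P&L of shorting stock A and buying stock B, same number of shares each."""
    short_a_pnl = (entry_a - exit_a) * shares  # the short gains when A falls
    long_b_pnl = (exit_b - entry_b) * shares   # the long gains when B rises
    return short_a_pnl + long_b_pnl


# Hypothetical prices: A has run up to 110 while B stayed at 100 (spread = 10)
entry_a, entry_b = 110, 100

# The spread narrows to 2: A falls to 106 and B rises to 104
print(pair_trade_pnl(entry_a, entry_b, exit_a=106, exit_b=104))  # 8

# The spread widens to 15 instead: the trade loses money
print(pair_trade_pnl(entry_a, entry_b, exit_a=115, exit_b=100))  # -5
```

With equal share counts, the result is simply the entry spread minus the exit spread, regardless of which leg did the moving.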

A condition for this strategy to work is that the spread between the two assets is stationary. This pattern can be found in cointegrated variables, which demonstrate long-term stable relationships.

This strategy is considered to be market-neutral, because it aims to profit from the relationship between two assets and not from the direction of the overall market. By holding both a long and a short position, your exposure to broad market risk is significantly reduced.
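A rough way to see this is to assume that each stock's return is its market beta times the market return, ignoring any stock-specific move. The function name, the betas, and the position sizes below are illustrative assumptions, not estimates for real stocks.

```python
def market_driven_pnl(long_value, short_value, beta_long, beta_short, market_return):
    """P&L caused by a market move alone, for one long and one short position."""
    long_pnl = long_value * beta_long * market_return
    short_pnl = -short_value * beta_short * market_return
    return long_pnl + short_pnl


market_drop = -0.05  # hypothetical 5% market sell-off

# Long only: fully exposed to the market
long_only = market_driven_pnl(10_000, 0, 1.0, 1.0, market_drop)

# Long and short of equal size and equal beta: the market effect cancels out
pair = market_driven_pnl(10_000, 10_000, 1.0, 1.0, market_drop)

print(round(long_only, 2), round(pair, 2))
```

In practice the two betas are rarely identical, so the market exposure is reduced rather than removed.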

Code

Importing Libraries

We are going to use pandas and the datetime library for DataFrame manipulation, matplotlib and seaborn for plotting graphs and the yfinance library to retrieve stock prices.

import pandas as pd
import yfinance as yf
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

Function to Retrieve Historical Prices

To get the historical closing prices, we first create a function that retrieves the data using the yfinance library. This function takes a list with the ticker of each stock and, optionally, the start and end dates for the time series.

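def get_historical_data(tickers, start=datetime(2020, 1, 1), end=datetime(2025, 1, 1)):
    """
    Iterates through the tickers list, getting the closing prices
    for each ticker from a start date until an end date.
    Tickers that fail to download or return no data are skipped.
    """

    series = {}

    # Iterates through all the tickers
    for ticker in tickers:
        try:
            # Request the closing prices
            close = yf.Ticker(ticker).history(start=start, end=end)['Close']
        except Exception as exc:
            # Skip this ticker instead of reusing the previous ticker's data
            print(f"Skipping {ticker}: {exc}")
            continue

        if close.empty:
            continue

        # Keep only the calendar date as the index
        close.index = close.index.date
        series[ticker] = close

    # One column per successfully downloaded ticker, aligned on date
    data = pd.DataFrame(series)

    return data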

Getting a List of Companies from the Bovespa Index

As an example, I’ll be searching for pairs within the Brazilian stock market’s Bovespa Index. To get the tickers for each company, I’ll scrape the table on the Ibovespa’s Wikipedia page. We’ll also add the Bovespa index itself.

# Reads the Wikipedia page to obtain the tickers
table = pd.read_html('https://pt.wikipedia.org/wiki/Lista_de_companhias_citadas_no_Ibovespa')[0]
ibov_tickers = table['Código'].tolist()
ibov_tickers = [tick + '.SA' for tick in ibov_tickers] + ['BOVA11.SA']

Retrieving Historical Closing Prices from the Ibovespa Companies

df = get_historical_data(ibov_tickers)
df.dropna(axis=1, inplace=True)

Selecting a Pair of Stocks

Cointegration

We’re looking for cointegrated pairs, which means that in the long run, the two stocks tend to move together. If the prices of the two stocks move together in the long term, the spread between them tends to revert to its mean over time.

To check the cointegration between two stocks, we’ll use the ts.coint() function from the statsmodels library. Its null hypothesis is that the two series are not cointegrated, so if the p-value is less than 0.05, we’ll reject the null hypothesis and consider the two variables to be cointegrated.

import statsmodels.tsa.stattools as ts
from itertools import combinations

cointegrated_pairs = []

# Iterate through every unordered pair of tickers exactly once
for i, j in combinations(df.columns, 2):

    # Get the p-value for the cointegration test
    p_value = ts.coint(df[i], df[j])[1]

    # If the p-value is smaller than 0.05, we can consider the pair to be cointegrated
    if p_value < 0.05:
        cointegrated_pairs.append((i, j))

One thing to be careful with is the multiple comparisons problem. Since the test uses a p-value threshold of 0.05, about 5% of the pairs that are not actually cointegrated will still be flagged as cointegrated by chance, and with hundreds or thousands of pairs tested this adds up to many false positives. One way to mitigate this issue is to double-check the two stocks and look for other reasons why the two assets should be cointegrated.
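To get a feel for the size of the problem, the sketch below counts how many pairs would be flagged purely by chance. It assumes, as a worst case, that none of the pairs is truly cointegrated and that the test's false positive rate equals the threshold. The helper name and the figure of 80 tickers are placeholders, not values taken from the data.

```python
from math import comb


def expected_false_positives(n_assets, alpha=0.05):
    """Number of unordered pairs and expected false positives at level alpha."""
    n_pairs = comb(n_assets, 2)  # each unordered pair is tested once
    return n_pairs, n_pairs * alpha


# 80 is a placeholder, not the actual number of tickers in df
n_pairs, n_false = expected_false_positives(80)
print(f"{n_pairs} pairs tested, about {n_false:.0f} flagged by chance alone")
# 3160 pairs tested, about 158 flagged by chance alone
```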

For example, if we find out that the stock prices of two companies are cointegrated, one that operates in the education sector and another that operates in the transportation sector, it should raise the suspicion that it’s just random noise, requiring a second test.

To understand more about cointegration, I recommend watching Ben Lambert's video presented at the end of this article.

Stationarity

Stationary variables are stochastic processes whose mean and variance do not change over time. In other words, they tend to revert to that mean, while their variance remains the same. Cointegrated pairs should result in stationary spreads, but we will still filter out of the cointegrated pairs any that don’t show a stationary spread.

If two variables are cointegrated, the difference between the two does not necessarily stay the same over time. That is because cointegration does not account for the scale at which each variable moves. One stock might double while the other triples, and they could still be cointegrated if the ups and downs stay consistent over time.

For that reason, instead of calculating a simple spread between the two stock prices, we’ll fit a linear regression, looking for a beta coefficient, which represents how much one stock moves, on average, for each unit the other one moves.
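As a toy illustration of why beta matters, the sketch below computes the least-squares slope in plain Python (covariance divided by variance, which is the closed form for a one-variable regression) on made-up prices. The function name ols_fit and the numbers are assumptions for illustration only.

```python
def ols_fit(x, y):
    """Intercept and slope of the least-squares line y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Made-up prices where stock 2 moves exactly 2 units per unit of stock 1
stock1_prices = [10, 11, 12, 13, 14]
stock2_prices = [21, 23, 25, 27, 29]

alpha, beta = ols_fit(stock1_prices, stock2_prices)

# The simple difference keeps drifting, while the beta-adjusted spread is flat
naive_spread = [p2 - p1 for p1, p2 in zip(stock1_prices, stock2_prices)]
spread = [p2 - beta * p1 for p1, p2 in zip(stock1_prices, stock2_prices)]

print(beta)          # 2.0
print(naive_spread)  # [11, 12, 13, 14, 15]
print(spread)        # [1.0, 1.0, 1.0, 1.0, 1.0]
```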

from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm

cointegrated_stationary_spread = []

for pair in cointegrated_pairs:
    stock1 = pair[0]
    stock2 = pair[1]

    # Fit a linear regression to get beta
    S1 = sm.add_constant(df[stock1])
    results = sm.OLS(df[stock2], S1).fit()
    beta = results.params[stock1]

    # Calculate the spread
    spread = df[stock2] - beta * df[stock1]

    # Check if the spread and the ratio are stationary
    if adfuller(spread)[1] < 0.05 and adfuller(df[stock1] / df[stock2])[1] < 0.05:
        cointegrated_stationary_spread.append(pair)

We can choose the stocks TAEE11 and ELET3, which belong to companies in the electric utilities business. Knowing the relationship the two companies share also helps to reduce the effect of the multiple comparisons bias.

stock1 = 'TAEE11.SA'
stock2 = 'ELET3.SA'
eletric_companies = ['TAEE11.SA', 'ELET3.SA', 'CPFE3.SA', 'CMIG4.SA', 'EGIE3.SA', 'ELET6.SA', 'ENGI11.SA', 'EQTL3.SA', 'BOVA11.SA']

def coint_test(stock, stock_list, df):
    """
    Returns the p-values for the cointegration tests of stock with each item from stock_list.
    """

    coint_p_values = []

    # Compare the stock chosen with all the other from stock_list
    for i in stock_list:

        # If they are the same stock return 0, else get the p-value for the cointegration test
        if stock != i:
            coint_p_values.append(round(ts.coint(df[stock], df[i])[1], 2))
        else:
            coint_p_values.append(0)

    return coint_p_values

p_values = [coint_test(stock1, eletric_companies, df), coint_test(stock2, eletric_companies, df)]

plt.figure(figsize=(10, 5))
sns.heatmap(p_values, xticklabels=eletric_companies, yticklabels=[stock1, stock2], cmap="RdYlGn_r", annot=True)

As we can see, the chosen pair is indeed cointegrated. Moreover, neither stock appears to be cointegrated with the Bovespa Index (the market) or with the other companies in the same sector, which suggests that the relationship is specific to these two companies rather than a by-product of a shared market or sector trend.

Plots

Historical Closing Prices

It is hard to identify with certainty cointegrated pairs just by looking at their time series plot. However, we can see that in the long run both stocks tend to rise and drop concurrently.

plt.figure(figsize=(16, 5))
plt.plot(df[stock1], label=stock1)
plt.plot(df[stock2], label=stock2)

plt.title("Historical Closing Prices")
plt.xlabel("Date")
plt.ylabel("Price (R$)")

plt.legend()
plt.show()

Z-Score

One way to see the simultaneous movements of the two stocks more clearly is to calculate the z-score of the prices.

The z-score or standard score is how far the raw value is from its mean, in number of standard deviations. So if one observation has the z-score of +2, it means that the value of that observation is two standard deviations above the mean of that sample.

When we calculate the z-score of each variable, the difference in magnitude between them becomes irrelevant, since we are analyzing how much the values of each variable vary around its own mean.
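The scale-invariance claim can be checked on toy data. The sketch below uses the standard library and two made-up price lists, one ten times the other. The name toy_zscore is hypothetical, and it uses the sample standard deviation, which matches the default of pandas' std().

```python
from statistics import mean, stdev


def toy_zscore(values):
    """Z-scores using the sample standard deviation (ddof=1)."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


cheap = [10.0, 11.0, 12.0, 13.0, 14.0]
pricey = [100.0, 110.0, 120.0, 130.0, 140.0]  # same shape, ten times the scale

print([round(z, 2) for z in toy_zscore(cheap)])   # [-1.26, -0.63, 0.0, 0.63, 1.26]
print([round(z, 2) for z in toy_zscore(pricey)])  # [-1.26, -0.63, 0.0, 0.63, 1.26]
```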

def zscore(data):
    """
    Calculates the z-scores of each series.
    """

    mean = data.mean()
    std = data.std()

    zscores = (data - mean) / std

    return zscores

plt.figure(figsize=(16, 5))
plt.plot(zscore(df[stock1]), label=f'{stock1} Z-Score')
plt.plot(zscore(df[stock2]), label=f'{stock2} Z-Score')

plt.title("Historical Closing Prices")
plt.xlabel("Date")
plt.ylabel("Price (Z-Score)")

plt.legend()
plt.show()

Now, it is more apparent that they could be cointegrated.

Spread

To calculate the spread, we’ll use the same process from before: computing it as the
