Use this to make better stock predictions

In this article, you will learn how to make better stock predictions

Daksh Bhatnagar
7 min read · Nov 23, 2022

Backtesting is the general method for seeing how well a strategy or model would have done ex-post. It assesses the viability of a trading strategy by discovering how it would have played out on historical data. If the backtest works, traders and analysts gain the confidence to employ the strategy going forward.

Understanding Backtesting

Backtesting allows a trader to simulate a trading strategy using historical data to generate results and analyze risk and profitability before risking any actual capital.

A well-conducted backtest that yields positive results assures traders that the strategy is fundamentally sound and is likely to yield profits when implemented in reality. In contrast, a well-conducted backtest that yields suboptimal results will prompt traders to alter or reject the strategy.

In this tutorial, we will try to predict whether the stock price of Google will increase or decrease based on historical data. We will use backtesting and hyperparameter tuning to increase the precision score of our XGBoost classifier.

The stock whose price movements we will try to predict is Google. Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of the stockholder voting power through super-voting stock.


We will use the yfinance Python library to load the data. yfinance uses pandas under the hood and automatically puts the entire history into a nice-looking DataFrame. We pass period="max" to fetch all of Google’s stock price data.

!pip install yfinance --quiet
import yfinance as yf
data = yf.Ticker("GOOGL")
data_hist = data.history(period="max")
data_hist

Now that the data has been loaded let’s look at how the figures have changed over the years.

import matplotlib.pyplot as plt

# Plot each price/volume column (the last two columns, Dividends and
# Stock Splits, are skipped) along with its all-time mean
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple']
for i in range(len(data_hist.columns[:-2])):
    plt.plot(data_hist[data_hist.columns[i]], color=colors[i])
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.axhline(data_hist[data_hist.columns[i]].mean(), linestyle='--', lw=2, zorder=1, color='black')
    plt.title("Google " + data_hist.columns[i] + " figures", fontsize=18)
    plt.xlabel('Years')
    plt.ylabel(data_hist.columns[i])
    plt.legend([data_hist.columns[i]])
    plt.show()

Since our data doesn’t have a target column yet, we will add one that holds a 0 or a 1: 0 means the price went down, and 1 means the price went up.

The pandas library has a rolling method that looks at a window of rows (in our case 2). For each window we return 1 if the second row’s Close is higher than the first row’s, else 0, and this gives us our target.

data_hist["Target"] = data_hist.rolling(2).apply(lambda x: x.iloc[1] > x.iloc[0])["Close"]

Now that we have our target, we will shift our data by 1 row.

What that means is the data from 19-08-2004 will shift to the next available date, 20-08-2004. In a real scenario, this means we use yesterday’s data to make today’s prediction. If we did not do this, we would be using today’s data to predict today’s target, which leaks information and is likely to render terrible results in real-life situations.

df = data_hist.copy()
df = df.shift(1)
df
DataFrame after adding target column

Let’s check the value counts in our Target column.
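The check itself is a single line of pandas (the article only shows the resulting figures):

df["Target"].value_counts()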

Our classes have a bit of an imbalance, which we will deal with next using the RandomOverSampler from the imblearn library. It resamples so that both classes occur equally often, which helps ensure the model is not biased towards one class.

from imblearn.over_sampling import RandomOverSampler

# Drop the NaN rows introduced by shift(1) and rolling(2); the
# oversampler cannot handle missing targets
df = df.dropna(subset=["Target"])

ros = RandomOverSampler(random_state=0)
X = df[["Open", "High", "Low", "Close", "Volume"]].values
y = df['Target'].values
X_resampled, y_resampled = ros.fit_resample(X, y)
Countplot of Target class after resampling
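The countplot code isn’t included in the article; a minimal sketch, assuming seaborn, would be:

import seaborn as sns

# Visualize the class balance after oversampling
sns.countplot(x=y_resampled)
plt.title("Target class after resampling")
plt.show()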

We start off with a base model with a max_depth of 3, a learning rate of 0.1, and the rest of the parameters left at their defaults.

In the fit method we use early stopping, an approach to training complex machine learning models that helps avoid overfitting. It works by monitoring the model’s performance on a separate test dataset and halting training once performance on that dataset has not improved for a fixed number of iterations. XGBoost supports this via the early_stopping_rounds argument.

Since we want to track the model’s performance on both the training and test sets, we pass them in as a list of tuples via eval_set, and we track the classification error and the log loss through the eval_metric argument.

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled,
                                                    test_size=0.25, random_state=42)

model = XGBClassifier(max_depth=3, learning_rate=0.1)
history = model.fit(X_train, y_train, early_stopping_rounds=2,
                    eval_set=[(X_train, y_train), (X_test, y_test)],
                    eval_metric=["error", "logloss"], verbose=0)

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

Let’s see the performance of the model with the help of graphs.
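The article’s graphs aren’t reproduced here, but the curves can be rebuilt from the metrics XGBoost recorded during training; a minimal sketch:

# evals_result() holds the eval_set metrics collected during fit
results = model.evals_result()
epochs = range(len(results["validation_0"]["logloss"]))

plt.plot(epochs, results["validation_0"]["logloss"], label="Train")
plt.plot(epochs, results["validation_1"]["logloss"], label="Test")
plt.title("XGBoost log loss per iteration")
plt.xlabel("Iteration")
plt.ylabel("Log loss")
plt.legend()
plt.show()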

We can see that there is some overfitting going on, which we will take care of later. We tuned the model by changing max_depth to 15 and the reg_lambda parameter (regularization) to 0.6, which took us to 71% precision.
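The article later refers to the model found by GridSearchCV, so the tuning step presumably looked something like the sketch below. The parameter grid here is an assumption; the article only states the winning values (max_depth=15, reg_lambda=0.6):

from sklearn.model_selection import GridSearchCV

# Hypothetical grid around the values the article settles on
param_grid = {"max_depth": [3, 7, 15], "reg_lambda": [0.2, 0.6, 1.0]}
grid = GridSearchCV(XGBClassifier(learning_rate=0.1), param_grid,
                    scoring="precision", cv=3)
grid.fit(X_train, y_train)

model = grid.best_estimator_  # the tuned model used for backtesting below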

Let’s do some backtesting and find out the model score.

import pandas as pd

def backtest(data, model, predictors, start=1000, step=50):
    predictions = []
    # Loop over the dataset in increments
    for i in range(start, data.shape[0], step):
        # Split into train and test sets
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i + step)].copy()

        # Fit the model
        model.fit(train[predictors], train["Target"])

        # Make predictions
        preds = model.predict_proba(test[predictors])[:, 1]
        preds = pd.Series(preds, index=test.index)
        preds[preds > .6] = 1
        preds[preds <= .6] = 0

        # Combine predictions and test values
        combined = pd.concat({"Target": test["Target"], "Predictions": preds}, axis=1)

        predictions.append(combined)

    return pd.concat(predictions)

In backtesting, we train the model on the first 1,000 rows and test it on the next 50 rows, then grow the training set to include those rows and repeat until we have stepped through the entire dataset. Because the training window expands at every iteration, the model always learns from all the data available up to that point and is able to make better predictions.

While making predictions we use predict_proba, which returns class probabilities instead of hard labels. The model usually applies a 0.5 threshold when classifying data points, but we take it a step further to 0.6, so the model only predicts an up day when it is more confident.
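Putting it together, the 67% figure mentioned below comes from a run on the raw price columns. The exact call isn’t shown in the article, but it would look something like this:

base_predictors = ["Open", "High", "Low", "Close", "Volume"]
base_preds = backtest(df, model, base_predictors)
print(classification_report(base_preds["Target"], base_preds["Predictions"]))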

By using the function above, we got a 67% precision score. One of the ways you can increase model performance is feature engineering, and that’s exactly what we are going to do. In the code cell below, we add the weekly average, quarterly average, and yearly average.

We also compute the weekly trend of our target, i.e. how the price trended over the past week, and add a handful of ratios (the open-close ratio, the annual-to-quarterly mean, etc.) that give the model more information about the data and the trend.

Read more here on feature selection

# Rolling averages over roughly a week, a quarter, and a year of rows
weekly_mean = df.rolling(7).mean()
quarterly_mean = df.rolling(90).mean()
annual_mean = df.rolling(365).mean()

# Fraction of up days over the previous week
weekly_trend = df.shift(1).rolling(7).mean()["Target"]

# Express each rolling average relative to the current closing price
df["weekly_mean"] = weekly_mean["Close"] / df["Close"]
df["quarterly_mean"] = quarterly_mean["Close"] / df["Close"]
df["annual_mean"] = annual_mean["Close"] / df["Close"]

# Ratios between the longer- and shorter-horizon averages
df["annual_weekly_mean"] = df["annual_mean"] / df["weekly_mean"]
df["annual_quarterly_mean"] = df["annual_mean"] / df["quarterly_mean"]
df["weekly_trend"] = weekly_trend

# Intraday price ratios
df["open_close_ratio"] = df["Open"] / df["Close"]
df["high_close_ratio"] = df["High"] / df["Close"]
df["low_close_ratio"] = df["Low"] / df["Close"]

Okay, so now that more columns have been added, let’s use them in our function as predictors (input columns) to determine whether the price will go up or down. Remember, model is the tuned classifier we found with GridSearchCV.

predictors = ["Open", "High", "Low", "Close", "Volume", "weekly_mean",
              "quarterly_mean", "annual_mean", "annual_weekly_mean",
              "annual_quarterly_mean", "weekly_trend", "open_close_ratio",
              "high_close_ratio", "low_close_ratio"]

backtest_predictions = backtest(df, model, predictors)
print(classification_report(backtest_predictions["Target"], backtest_predictions["Predictions"]))

And we can conclusively say the model has reached 85% precision, a whopping 18-point jump over the previous backtest, achieved just by adding predictors that make sense and give the model more information.

For the full code, click here: https://lnkd.in/dB6hhbCJ

CONCLUSION:

  1. Backtesting assesses the viability of a trading strategy by discovering how it would play out using historical data.
  2. Tuning hyperparameters first can be a great idea to increase the model performance.
  3. Adding more predictors to the dataset can give the model the information it needs to make better predictions.

If you liked the walkthrough and it proved helpful, I would appreciate it if you gave the article a clap and followed me for more.
