Use Backtesting to Make Better Stock Predictions
In this article, you will learn how to make better stock predictions using backtesting and hyperparameter tuning.
Backtesting is the general method for seeing how well a strategy or model would have performed ex-post. It assesses the viability of a trading strategy by discovering how it would have played out using historical data. If backtesting yields good results, traders and analysts may have the confidence to employ the strategy going forward.
Understanding Backtesting
Backtesting allows a trader to simulate a trading strategy using historical data to generate results and analyze risk and profitability before risking any actual capital.
A well-conducted backtest that yields positive results gives traders confidence that the strategy is fundamentally sound and has the potential to be profitable when implemented in reality. In contrast, a well-conducted backtest that yields suboptimal results will prompt traders to alter or reject the strategy.
In this tutorial, we will try to predict whether the stock price of Google will increase or decrease based on historical data. We will use backtesting and hyperparameter tuning to increase the precision score of our XGBoost classifier.
The stock whose price movements we will try to predict is Google. Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of the stockholder voting power through super-voting stock.
We will use the yfinance Python library to load the data. It returns the entire history as a clean pandas data frame. We are using the max period to fetch all of Google's stock price data.
!pip install yfinance --quiet
import yfinance as yf

# Fetch the full price history for Google
data = yf.Ticker("GOOGL")
data_hist = data.history(period="max")
data_hist
Now that the data has been loaded, let’s look at how the figures have changed over the years.
import matplotlib.pyplot as plt

# One color per price column (Open, High, Low, Close, Volume)
color = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple']

for i in range(len(data_hist.columns[:-2])):
    plt.plot(data_hist[data_hist.columns[i]], color=color[i])
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.axhline(data_hist[data_hist.columns[i]].mean(), linestyle='--', lw=2, zorder=1, color='black')
    plt.title("Google " + data_hist.columns[i] + " figures", fontsize=18)
    plt.xlabel('Years')
    plt.ylabel(data_hist.columns[i])
    plt.legend([data_hist.columns[i]])
    plt.show()
Since our data doesn’t have a Target column yet, we add one that holds 0 or 1: 0 means the price went down and 1 means the price went up.
The pandas library has a rolling method that looks at a window of rows in the data (in our case 2); for each window we return 1 if the second row’s closing price is higher than the first row’s, else 0, and this gives us our target.
data_hist["Target"] = data_hist.rolling(2).apply(lambda x: x.iloc[1] > x.iloc[0])["Close"]
Now that we have our target, we shift the feature columns by one row.
That means the prices from 19–08–2004 move to the next trading day, 20–08–2004, and so on. Each row then pairs yesterday’s prices with today’s target, so in the real scenario we would be using yesterday’s data to predict today’s movement. If we did not do that, we would be using today’s data to predict today’s target, which leaks information the model would never have live and is likely to render terrible results in real-life situations.
df = data_hist.copy()
# Shift only the feature columns; the Target stays put, so each row now
# pairs yesterday's prices with today's price movement
df[["Open", "High", "Low", "Close", "Volume"]] = df[["Open", "High", "Low", "Close", "Volume"]].shift(1)
df
Let’s check the value counts in our Target column.
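A quick way to do that (the exact counts will vary with when you download the data):
df["Target"].value_counts()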
Our classes have a bit of an imbalance, which we will deal with next using the RandomOverSampler from the imblearn library. Oversampling ensures both classes occur equally often, which helps because the model will not be biased towards a certain class.
from imblearn.over_sampling import RandomOverSampler

# Drop the first row, which has no previous day's prices and no target
df = df.dropna(subset=["Open", "Target"])

ros = RandomOverSampler(random_state=0)
X = df[["Open", "High", "Low", "Close", "Volume"]].values
y = df["Target"].values
X_resampled, y_resampled = ros.fit_resample(X, y)
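To confirm the resampling worked, we can count the classes (a quick sanity check, not part of the original snippet):
import collections
print(collections.Counter(y_resampled))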
We start off with a base model with a max_depth of 3 and a learning rate of 0.1; the rest of the parameters are left at their defaults.
In the fit method, we use early stopping, an approach to training complex machine learning models that helps avoid overfitting. It works by monitoring the performance of the model on a separate evaluation dataset and stopping the training procedure once that performance has not improved for a fixed number of iterations, and XGBoost supports this natively. Since we want to track the performance of the model on both the training and test sets, we pass both to the model as tuples, and we track the classification error and the log loss via the eval_metric argument.
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.25, random_state=42)

model = XGBClassifier(max_depth=3, learning_rate=0.1)
# Stop once neither metric improves for 2 rounds (in xgboost >= 2.0, pass
# early_stopping_rounds and eval_metric to the constructor instead)
history = model.fit(X_train, y_train, early_stopping_rounds=2,
                    eval_set=[(X_train, y_train), (X_test, y_test)],
                    eval_metric=["error", "logloss"], verbose=0)

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
Let’s see the performance of the model with the help of graphs.
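The plotting code is not shown in the original snippet; a minimal sketch using the evaluation history that XGBoost records during training (via evals_result) could look like this:

results = model.evals_result()
epochs = range(len(results['validation_0']['logloss']))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Log loss on the training (validation_0) and test (validation_1) sets
ax1.plot(epochs, results['validation_0']['logloss'], label='Train')
ax1.plot(epochs, results['validation_1']['logloss'], label='Test')
ax1.set_title('Log loss')
ax1.legend()
# Classification error on both sets
ax2.plot(epochs, results['validation_0']['error'], label='Train')
ax2.plot(epochs, results['validation_1']['error'], label='Test')
ax2.set_title('Classification error')
ax2.legend()
plt.show()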
We can see that there is some overfitting going on, which we will take care of next. We tuned our model by changing max_depth to 15 and setting the reg_lambda parameter (L2 regularization) to 0.6, which took us to 71% precision.
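The tuning code itself is not shown above; since the model is later described as the one found with GridSearchCV, a minimal sketch of such a search could look like this (the exact parameter grid here is an assumption):

from sklearn.model_selection import GridSearchCV

# Hypothetical grid around the values mentioned above
param_grid = {
    'max_depth': [3, 5, 10, 15],
    'reg_lambda': [0.2, 0.4, 0.6, 1.0],
}
search = GridSearchCV(XGBClassifier(learning_rate=0.1),
                      param_grid, scoring='precision', cv=3)
search.fit(X_train, y_train)
model = search.best_estimator_
print(search.best_params_)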
Let’s do some backtesting and find out the model’s score.
import pandas as pd

def backtest(data, model, predictors, start=1000, step=50):
    predictions = []
    # Loop over the dataset in increments
    for i in range(start, data.shape[0], step):
        # Split into train and test sets
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i + step)].copy()
        # Fit the model
        model.fit(train[predictors], train["Target"])
        # Predict the probability of the price going up
        preds = model.predict_proba(test[predictors])[:, 1]
        preds = pd.Series(preds, index=test.index)
        # Only call an upward move when the model is at least 60% confident
        preds[preds > .6] = 1
        preds[preds <= .6] = 0
        # Combine predictions and test values
        combined = pd.concat({"Target": test["Target"], "Predictions": preds}, axis=1)
        predictions.append(combined)
    return pd.concat(predictions)
In backtesting, we train the model on the first 1000 rows, test it on the next 50 rows, and repeat this over the entire dataset. At each step the model is retrained on all the data seen so far, so it learns from each iteration and is able to make better predictions.
While making predictions, we use predict_proba, which gives us class probabilities rather than hard labels. The model’s usual threshold for classifying a data point as 1 is 0.5, but we take it a step further to 0.6, so we only predict an upward move when the model is more confident, and then return those predictions.
Using the function above, we got a 67% precision score. One of the ways you can increase model performance is feature engineering, and that’s exactly what we are going to do. In the code cell below, we add the weekly average, quarterly average, and yearly average.
We also compute the weekly trend of our target, i.e. the trend of the price during the past week, and add a bunch of ratios (open-close ratio, annual-quarterly mean, etc.) that give the model more information about the data and the trend.
You can read more about feature selection here.
# Rolling averages over the past week, quarter, and year
weekly_mean = df.rolling(7).mean()
quarterly_mean = df.rolling(90).mean()
annual_mean = df.rolling(365).mean()
# Average of the target over the previous 7 days (shifted so today's target is excluded)
weekly_trend = df.shift(1).rolling(7).mean()["Target"]

# Ratios of the rolling means to the current (shifted) close, plus intraday ratios
df["weekly_mean"] = weekly_mean["Close"] / df["Close"]
df["quarterly_mean"] = quarterly_mean["Close"] / df["Close"]
df["annual_mean"] = annual_mean["Close"] / df["Close"]
df["annual_weekly_mean"] = df["annual_mean"] / df["weekly_mean"]
df["annual_quarterly_mean"] = df["annual_mean"] / df["quarterly_mean"]
df["weekly_trend"] = weekly_trend
df["open_close_ratio"] = df["Open"] / df["Close"]
df["high_close_ratio"] = df["High"] / df["Close"]
df["low_close_ratio"] = df["Low"] / df["Close"]
Okay, now that more columns have been added, let’s use them in our function as predictors (input columns) to determine if the price will go up or down. Remember, our model is the one we found with GridSearchCV.
predictors = ['Open', 'High','Low','Close','Volume','weekly_mean','quarterly_mean','annual_mean',
'annual_weekly_mean','annual_quarterly_mean','weekly_trend','open_close_ratio','high_close_ratio','low_close_ratio']
backtestpredictions = backtest(df, model, predictors)
print(classification_report(backtestpredictions['Target'], backtestpredictions['Predictions']))
The model reached 85% precision: a whopping 14-point jump, just by adding predictors that make sense and give more information to the model.
For full code, click here https://lnkd.in/dB6hhbCJ
Conclusion:
- Backtesting assesses the viability of a trading strategy by discovering how it would have played out using historical data.
- Tuning hyperparameters first can be a great way to increase model performance.
- Adding more predictors to the dataset can give the model the information it needs to make better predictions.
If you liked the walkthrough and found it helpful, I would appreciate it if you gave the article a clap and followed me for more.