Forecasting SP500 stocks with XGBoost and Python Part 1: Sourcing the data
A while back I dabbled with the question “how little effort can I put into building an XGBoost model for a financial dataset?”.
This question didn’t pop out of thin air. I was looking to get some experience with a Machine Learning model for timeseries forecasting and I was discussing the SP500 dataset with some friends at the time. XGBoost was a topic of conversation as well, so I thought “why not”.
What I have for you is a two-part series on how I used XGBoost to create a generic time series ML model that forecasts SP500 financial stocks. This is not a formal process that you should follow in future projects, nor a complex ML model. This is purely my first attempt at creating a time series ML model with XGBoost, without putting much thought into feature selection and engineering. My goal was to get a working model, regardless of its quality.
The code for the model itself is shown in part two of the series. This first part focuses on sourcing the data.
Sourcing the data
Sourcing the data is as important as, if not more important than, building the model itself. The old adage “garbage in, garbage out” rings true in any data-related work, even more so in ML.
I used the pandas-datareader library to download the historical stocks data. It already includes an interface to download historical stocks data from Yahoo, so it saved me the trouble of finding a Web API. The lines
```python
from pandas_datareader import data as web

msft_stocks = web.DataReader('MSFT', 'yahoo', '2017-01-01', '2022-03-31')
```
are all you need to get Microsoft stocks between January 1st 2017 and March 31st 2022.
The script I wrote to download the history for all tickers is slightly more involved. The keyword here is slightly, because the script doesn’t do much more than run that second line of code inside a loop of tickers.
The extra bits handle download failures, store the data locally, and combine all the stocks data into a single dataframe and CSV export. Oh, and I also included a Daily Return % column with the day-over-day percentage change of the Adj Close value. You could argue this last part is already feature engineering, since a feature describing how the data changes over time will be useful when building the timeseries model.
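That calculation is a one-liner in pandas. Here is a minimal sketch with a toy dataframe, assuming Yahoo's "Adj Close" column naming:

```python
import pandas as pd

# Toy frame standing in for one ticker's history; the real data comes
# from pandas-datareader and uses the same "Adj Close" column name.
df = pd.DataFrame({"Adj Close": [100.0, 150.0, 75.0]})

# Day-over-day percentage change; the first row has no previous day,
# so it comes out as NaN.
df["Daily Return %"] = df["Adj Close"].pct_change() * 100
print(df["Daily Return %"].tolist())  # [nan, 50.0, -50.0]
```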
Data-wise, each stock’s data includes the following:
- Date: the date of the stock values; there is one row per trading day
- High: highest value for the day
- Low: lowest value for the day
- Open: opening value for the day
- Close: closing value for the day
- Volume: number of shares traded on that date
- Adj Close: or adjusted close, “is the closing price after adjustments for all applicable splits and dividend distributions”
Now please take a look at the code gists below.
The first gist shows how to download the history for each stock. This is the loop I mentioned before. Note how I add an extra column with the ticker symbol to differentiate the data once everything is appended into a single CSV.
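The gist itself is embedded in the original post; as a stand-in, here is a minimal sketch of that loop. The ticker list, directory name, and the injectable `fetch` hook are my own placeholders, not the original script:

```python
import os


def download_all(tickers, out_dir, start, end, fetch=None):
    """Download each ticker's history to its own CSV, skipping failures.

    `fetch` defaults to the pandas-datareader Yahoo reader; it is
    injectable so the loop can be exercised without network access.
    """
    if fetch is None:
        # Imported lazily so the loop itself can run without the
        # library installed.
        from pandas_datareader import data as web
        fetch = lambda t: web.DataReader(t, "yahoo", start, end)
    os.makedirs(out_dir, exist_ok=True)
    for ticker in tickers:
        try:
            df = fetch(ticker)
        except Exception as exc:
            # Some symbols occasionally fail to download; log and move on.
            print(f"Skipping {ticker}: {exc}")
            continue
        # Tag each row with its ticker symbol so the rows stay
        # identifiable once all files are appended into a single CSV.
        df["Ticker"] = ticker
        df.to_csv(os.path.join(out_dir, f"{ticker}.csv"))
```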
The second gist shows how to consolidate the individual CSVs. They are read sequentially, and the Daily Return % is calculated for each file. The resulting dataframes are stored in a list and a final dataframe is created from this list, which is simple because the columns’ order and data types are common across all the dataframes.
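Again as a stand-in for the embedded gist, a minimal sketch of that consolidation step; the file layout and function name are my own assumptions based on the description above:

```python
import glob
import os

import pandas as pd


def consolidate(csv_dir, out_file):
    """Merge per-ticker CSVs into one file, adding a Daily Return % column.

    Column names follow Yahoo's output ("Adj Close"); the input directory
    is expected to hold one CSV per ticker.
    """
    frames = []
    for path in sorted(glob.glob(os.path.join(csv_dir, "*.csv"))):
        df = pd.read_csv(path)
        # Daily growth of the adjusted close, computed per file so a
        # return never spans two different tickers.
        df["Daily Return %"] = df["Adj Close"].pct_change() * 100
        frames.append(df)
    # All frames share the same columns and dtypes, so a plain concat works.
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_file, index=False)
    return combined
```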
And that’s all for sourcing the data. It was quick, but I wanted to separate this from the actual model building to start the second part with a pd.read_csv call right away :)