Forecasting SP500 stocks with XGBoost and Python Part 1: Sourcing the data

A while back I dabbled with the question “how little effort can I put into building an XGBoost model for a financial dataset?”.


This question didn’t pop out of thin air. I was looking to get some experience with machine learning models for time series forecasting, and I was discussing the SP500 dataset with some friends at the time. XGBoost was a topic of conversation as well, so I thought “why not”.

What I have for you is a two-part series on how I used XGBoost to create a generic time series ML model that forecasts SP500 financial stocks. This is neither a formal process you should follow in future projects nor a complex ML model. This is purely my first attempt at creating a time series ML model with XGBoost, without putting much thought into feature selection and engineering. My goal was to get a working model, regardless of quality.

The code for the model itself is shown in part two of the series. This first part focuses on sourcing the data.

Sourcing the data

Sourcing the data is as important as, if not more important than, building the model itself. The old adage “garbage in, garbage out” rings true in any data-related work, even more so in ML.

I used the pandas-datareader library to download the historical stock data. It already includes an interface to Yahoo’s historical stock data, so it saved me the trouble of finding a Web API. The lines

from pandas_datareader import data as web
msft_stocks = web.DataReader('MSFT', 'yahoo', '2017-01-01', '2022-03-31')

are all you need to get Microsoft’s stock data between January 1st, 2017 and March 31st, 2022.

The code

The script I wrote to download the history for all tickers is slightly more involved. The keyword here is slightly, because the script doesn’t do much more than run that second line of code inside a loop over tickers.

The extra bits handle issues downloading the data, store the data locally, and combine all the stock data into a single dataframe and CSV export. Oh, and I also included a Daily Return % column that calculates the day-over-day percentage change of the Adj Close value. You could argue this last part is already feature engineering, as it will be a useful feature for building the time series model: a feature that captures how the data changes over time.
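For reference, here’s a minimal sketch of that calculation with pandas (the add_daily_return helper name is mine, not from the original code):

import pandas as pd

def add_daily_return(df: pd.DataFrame) -> pd.DataFrame:
    # pct_change computes (today - yesterday) / yesterday; scale it to a percentage
    df["Daily Return %"] = df["Adj Close"].pct_change() * 100
    return df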

Data-wise, each stock’s data includes the standard Yahoo Finance columns (Open, High, Low, Close, Adj Close, and Volume), plus the Ticker and Daily Return % columns added by my code.

Now please take a look at the code gists below.

Functions to download and combine stock data into a single export

The first gist shows how to download the history for each stock. This is the loop I mentioned before. Note how I add an extra column with the ticker symbol to differentiate the data once everything is appended into a single CSV.
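A minimal sketch of what that loop could look like (the tickers list and the stocks output folder are illustrative assumptions, not the original gist):

import os
from pandas_datareader import data as web

tickers = ["MSFT", "AAPL", "AMZN"]  # illustrative subset of SP500 tickers
os.makedirs("stocks", exist_ok=True)  # hypothetical folder for the per-ticker CSVs

for ticker in tickers:
    try:
        stocks = web.DataReader(ticker, "yahoo", "2017-01-01", "2022-03-31")
    except Exception as exc:
        # Handle issues downloading the data (e.g., unknown ticker, API hiccup)
        print(f"Failed to download {ticker}: {exc}")
        continue
    # Extra column with the ticker symbol to differentiate the data later on
    stocks["Ticker"] = ticker
    stocks.to_csv(os.path.join("stocks", f"{ticker}.csv"))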

The second gist shows how to consolidate the individual CSVs. They are read sequentially, and the Daily Return % is calculated for each file. The resulting dataframes are stored in a list, and a final dataframe is created from this list, which is straightforward because the column order and data types are consistent across all the dataframes.
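And a sketch of the consolidation under the same assumptions (the stocks folder and the sp500_stocks.csv output name are mine), mirroring the Daily Return % calculation from before:

import glob
import os
import pandas as pd

frames = []
for path in sorted(glob.glob(os.path.join("stocks", "*.csv"))):
    stocks = pd.read_csv(path, index_col="Date", parse_dates=True)
    # Daily Return % is calculated per file, before everything is combined
    stocks["Daily Return %"] = stocks["Adj Close"].pct_change() * 100
    frames.append(stocks)

# A single concat works because columns and dtypes match across the dataframes
combined = pd.concat(frames)
combined.to_csv("sp500_stocks.csv")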

And that’s all for sourcing the data. It was quick, but I wanted to separate this from the actual model building to start the second part with a pd.read_csv call right away :)
