Introduction

The S&P 500 Index is an American stock market index which is an indicator of the health of the United States economy based on 500 of the largest United States stocks. Some of the largest publicy traded companies are included in the S&P 500, such as Microsoft, Apple, Amazon, Facebook, and more. In this project, we are going to do some exploratory data analysis on S&P 500 historical data obtained from Yahoo.com, and use historical pricing of crude oil, gold, and silver to predict the S&P 500.

Data

Packages

We will begin by loading up the neccesary libraries for this project.

  • Tidyverse: A collection of great R packages which provides tools to make graphs, use pipelines, tidy our data, and more.
  • rvest: A collection of functions for scraping web pages
library(tidyverse)
library(rvest)
tidyverse::tidyverse_packages() # List all packages included in tidyverse
##  [1] "broom"       "cli"         "crayon"      "dplyr"       "dbplyr"     
##  [6] "forcats"     "ggplot2"     "haven"       "hms"         "httr"       
## [11] "jsonlite"    "lubridate"   "magrittr"    "modelr"      "purrr"      
## [16] "readr"       "readxl\n(>=" "reprex"      "rlang"       "rstudioapi" 
## [21] "rvest"       "stringr"     "tibble"      "tidyr"       "xml2"       
## [26] "tidyverse"

Loading S&P 500 Data

Now that all libraries have been loaded, we can load the dataset acquired from Yahoo Finance into a dataframe. The dataset is stored in a csv file (Which can be found here), so we will use the read_csv function from the readr package (part of the tidyverse), to do so.

snp <- read_csv("snp500.csv") %>%
  as.tibble() # Load data into dataframe called snp
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
head(snp, 10) # Load first 10 rows of snp.
## # A tibble: 10 x 7
##    Date        Open  High   Low Close `Adj Close`  Volume
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>   <dbl>
##  1 1950-01-03  16.7  16.7  16.7  16.7        16.7 1260000
##  2 1950-01-04  16.8  16.8  16.8  16.8        16.8 1890000
##  3 1950-01-05  16.9  16.9  16.9  16.9        16.9 2550000
##  4 1950-01-06  17.0  17.0  17.0  17.0        17.0 2010000
##  5 1950-01-09  17.1  17.1  17.1  17.1        17.1 2520000
##  6 1950-01-10  17.0  17.0  17.0  17.0        17.0 2160000
##  7 1950-01-11  17.1  17.1  17.1  17.1        17.1 2630000
##  8 1950-01-12  16.8  16.8  16.8  16.8        16.8 2970000
##  9 1950-01-13  16.7  16.7  16.7  16.7        16.7 3330000
## 10 1950-01-16  16.7  16.7  16.7  16.7        16.7 1460000

The dataset has seven attributes:

  • Date - Date
  • Open - The value of the S&P at market opening
  • High - Highest value of the S&P on that date
  • Low - Lowest value of the S&P on that date
  • Close - The value of the S&P at market closing
  • Adj Close - The value of the S&P at market closing adjusted for dividends
  • Volume - Number of shares traded on that date

We can check the number of observations by using the function nrow. Lets find out how many years of data is in this dataset.

nrow(snp)/365
## [1] 47.82466

So we have about 48 years worth of S&P 500 data.

Lets try plotting The closing value of the S&P 500 by Year. To do so, we will use the ggplot function of the ggplot2 package.

snp %>%
  ggplot(aes(x=Date, y=Close)) +
  geom_line() + #Create a line plot
  labs(x="Year", y="Closing Value", title = "S&P 500 Closing Value by Year")

Looking at this plot, the S&P 500 seems to have grown exponentially over time.

Loading Crude Oil Data

Crude Oil is “a naturally occurring, unrefined petroleum product composed of hydrocarbon deposits and other organic materials. A type of fossil fuel, crude oil can be refined to produce usable products such as gasoline, diesel and various forms of petrochemicals.” Crude Oil is nonrenewable, and society’s dependence on the fuel source is slowly diminishing the world’s supply, and slowly raising the price each year. We are interested in looking at historical prices of crude oil.

We will be using methods from the rvest package, another component of the tidyverse, to scrape a table with historical crude oil prices provided by (Investing.com)[https://www.investing.com/commodities/crude-oil-historical-data]. (Note: The site lets you download a csv, but we would like to practice scraping. Additionally, we are going to scrape the html file I downloaded, rather than directly from the website, but the process is exactly the same.)

oil <- read_html("Crude Oil WTI Futures Historical Prices - Investing.com.html") %>%
  html_node("#curr_table") %>% # Find HTML node that contains the table
  html_table %>% # Convert HTML table to R dataframe
  magrittr::set_colnames(c("Date", "Price", "Open", "High", "Low", "Volume", "Change")) # Rename cols to better names

The dataset has seven attributes:

  • Date - Date
  • Price - The cost of the crude oil at market closing
  • Open - The cost of the crude oil at market opening
  • High - Highest cost of the crude oil on that date
  • Low - Lowest cost of the crude oil on that date
  • Volume - Number of shares of crude oil traded on that date
  • Change - percent change in closing cost from the day before to the current day.

We have succesfully scraped the website, however the types of the attributes may not be correct. Lets see what types we have.

str(oil) # Shows a "compact" summary of an R object.
## 'data.frame':    4838 obs. of  7 variables:
##  $ Date  : chr  "May 20, 2019" "May 19, 2019" "May 17, 2019" "May 16, 2019" ...
##  $ Price : num  63.4 63.6 62.8 62.9 62 ...
##  $ Open  : num  63.6 63.2 63.1 62.1 61.4 ...
##  $ High  : num  64 63.7 63.6 63.5 62.3 ...
##  $ Low   : num  62.6 63.2 62.5 62.1 60.9 ...
##  $ Volume: chr  "-" "-" "215.80K" "545.68K" ...
##  $ Change: chr  "-0.24%" "1.27%" "-0.17%" "1.37%" ...

Based on the results of calling str on our oil dataframe, we see that Date, Volume, and Change % are stored as characters. We need to fix the types of those attributes, and we can do so using type_convert from the readr package.

oil$Volume <- gsub('K', 'e3', oil$Volume) # replace all numbers ending in K with e3
oil$Volume <- gsub('M', 'e6', oil$Volume) # replace all numbers ending in M with e6
oil[oil == "-"] <- NA # Set missing values to NA
oil$Change <- as.numeric(sub("%", "",oil$Change,fixed=TRUE))/100 # Convert percentages to decimal

oil <- oil %>%
  type_convert(col_types = cols(Date = col_date(format = "%b %d, %Y"))) # Convert to correct types.

str(oil)
## 'data.frame':    4838 obs. of  7 variables:
##  $ Date  : Date, format: "2019-05-20" "2019-05-19" ...
##  $ Price : num  63.4 63.6 62.8 62.9 62 ...
##  $ Open  : num  63.6 63.2 63.1 62.1 61.4 ...
##  $ High  : num  64 63.7 63.6 63.5 62.3 ...
##  $ Low   : num  62.6 63.2 62.5 62.1 60.9 ...
##  $ Volume: num  NA NA 215800 545680 723000 ...
##  $ Change: num  -0.0024 0.0127 -0.0017 0.0137 0.0039 0.0121 -0.0101 -0.0006 -0.0068 0.0117 ...

Lets try plotting Crude Oil closing cost vs Year

oil %>%
  ggplot(aes(x=Date, y=Price)) +
  geom_line() + 
  labs(x = "Year", y="Closing Cost",
       title = "Closing Cost of Crude Oil vs Year")

Based on this plot, the cost of crude oil generally increases in a linear fashion, but there have been some events which drastically changed the price of crude oil which explains the spikes.

Loading Gold Data

Gold is a metallic yellow chemical element. Due to its color, brightness, and durability, gold has been percieved as valuable sinces the times of the earliest civilizations. As a result, gold still has meaning today and many people purchase gold today. We will be looking at the historical exchange rate between the US Dollar and an ounce of gold, provided by Investing.com

gold <- read_csv("XAU_USD Historical Data.csv") %>% 
  magrittr::set_colnames(c("Date", "Price", "Open", "High", "Low", "Change"))
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Price = col_number(),
##   Open = col_number(),
##   High = col_number(),
##   Low = col_number(),
##   `Change %` = col_character()
## )
gold$Change <- as.numeric(sub("%", "",gold$Change,fixed=TRUE))/100 # Convert percentages to decimal

gold <- gold %>%
  type_convert(col_types = cols(Date = col_date(format = "%b %d, %Y"))) # Convert to correct types.

str(gold)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of  6 variables:
##  $ Date  : Date, format: "2019-05-20" "2019-05-19" ...
##  $ Price : num  1278 1278 1278 1286 1297 ...
##  $ Open  : num  1278 1279 1286 1297 1297 ...
##  $ High  : num  1279 1279 1289 1299 1301 ...
##  $ Low   : num  1274 1277 1275 1284 1293 ...
##  $ Change: num  -0.0003 0 -0.0064 -0.0082 -0.0005 -0.0036 0.0113 0.0011 0.0017 0.0018 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_character(),
##   ..   Price = col_number(),
##   ..   Open = col_number(),
##   ..   High = col_number(),
##   ..   Low = col_number(),
##   ..   `Change %` = col_character()
##   .. )

The dataset has seven attributes:

  • Date - Date
  • Price - The exchange rate of gold at market closing
  • Open - The exchange rate of gold at market opening
  • High - Highest exchange rate of gold on that date
  • Low - Lowest exchange rate of gold on that date
  • Change - percent change in closing exchange rate of gold from the day before to the current day.

Merging the Data

Now that we have data on the price of crude oil and the S&P 500 properly formatted, we can merge the dataframes together. In order to reduce confusion, we will rename attributes of the oil dataset to match their equivalents in the s&p 500 dataset.

oil <- oil[c("Date", "Open", "High", "Low", "Price", "Volume", "Change")] %>%
  magrittr::set_colnames(c("Date", "Open", "High", "Low", "Close", "Volume", "Change"))

oil_snp <- inner_join((select(oil, "Date", "Close")), (select(snp, "Date", "Close")), by = "Date", suffix = c(".oil", ".s&p"))

We wil do the same for the Gold dataset.

gold <- gold[c("Date", "Open", "High", "Low", "Price", "Change")] %>%
  magrittr::set_colnames(c("Date", "Open", "High", "Low", "Close", "Change"))

gold_snp <- inner_join((select(gold, "Date", "Close")), (select(snp, "Date", "Close")), by = "Date", suffix = c(".gold", ".s&p"))

Lets take a look at how oil and the S&P 500 change over the years.

oil_snp %>%
  ggplot(aes(x=Date)) +
  geom_line(aes(y=`Close.s&p`)) +
  geom_line(aes(y=`Close.oil`)) +
  labs(x = "Year", y = "Value", title = "Crude Oil and S&P 500 vs Year")

Something’s not quite right. The value of the S&P 500 is so much higher than the price of crude oil, that this plot doesn’t really show us how their changes are correlated. If we want to compare their changes over time, we need to standardize them.

oil_scaled <- oil %>%
  mutate(CloseScaled = (Close - mean(Close))/sd(Close))
snp_scaled <- snp %>%
  mutate(CloseScaled = (Close - mean(Close))/sd(Close))

oil_snp_scaled <- inner_join((select(oil_scaled, "Date", "CloseScaled")), select(snp_scaled, "Date", "CloseScaled"), by = "Date", suffix = c(".oil", ".s&p"))

oil_snp_scaled %>%
  ggplot(aes(x=Date)) +
  geom_line(aes(y=`CloseScaled.s&p`), colour = "Blue") +
  geom_line(aes(y=`CloseScaled.oil`), colour = "Red") +
  labs(x = "Year", y = "Value", title = "Crude Oil (Red) and S&P 500 (Blue) vs Year")

Now we are getting somewhere. It appears that the price of crude oil rises at a similar pace to that of the value of the S&P 500, however, certain years, the price of crude oil changes drastically. Since our end goal is to predict the value of the S&P 500, it makes sense to focus on the most recent years, as the price of crude oil has not had a major change in the last 4 years.

oil_scaled <- oil %>%
  # We use the year function of the lubridate package (part of the tidyverse) to check for dates where year is at least 2016
  filter(lubridate::year(Date) > 2016) %>% 
  mutate(CloseScaled = (Close - mean(Close))/sd(Close))
snp_scaled <- snp %>%
  filter(lubridate::year(Date) > 2016) %>%
  mutate(CloseScaled = (Close - mean(Close))/sd(Close))

oil_snp_scaled <- inner_join(select(oil_scaled, "Date", "CloseScaled"), select(snp_scaled, "Date", "CloseScaled"), by = "Date", suffix = c(".oil", ".s&p"))

oil_snp_scaled %>%
  ggplot(aes(x=Date)) +
  geom_line(aes(y=`CloseScaled.s&p`), colour = "Blue") +
  geom_line(aes(y=`CloseScaled.oil`), colour = "Red") +
  labs(x = "Year", y = "Value", title = "Crude Oil (Red) and S&P 500 (Blue) vs Year")

This plot shows that there exists correlation between the price of Crude Oil and the Value of the S&P 500.

Now lets take a look at Gold and S&P 500.

gold_snp %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = `Close.s&p`), colour = "Blue") +
  geom_line(aes(y = `Close.gold`), colour = "Green") +
  labs(x = "Year", y = "Value", title = "Gold (Green) and S&P 500 (Blue) vs Year")

Now lets scale Gold and look at all three together.

gold_scaled <- gold %>%
  # We use the year function of the lubridate package (part of the tidyverse) to check for dates where year is at least 2016
  filter(lubridate::year(Date) > 2016) %>% 
  mutate(CloseScaled = (Close - mean(Close))/sd(Close))


gold_snp_oil_scaled <- inner_join(select(oil_scaled, "Date", "CloseScaled"), select(snp_scaled, "Date", "CloseScaled"), by = "Date", suffix = c(".oil", ".s&p")) %>%
  inner_join(., select(gold_scaled, "Date", "CloseScaled"), by = "Date") %>%
  magrittr::set_colnames(c("Date", "CloseScaled.oil", "CloseScaled.s&p", "CloseScaled.gold"))
  

gold_snp_oil_scaled %>%
  ggplot(aes(x=Date)) +
  geom_line(aes(y=`CloseScaled.s&p`), colour = "Blue") +
  geom_line(aes(y=`CloseScaled.gold`), colour = "Green") +
  geom_line(aes(y=`CloseScaled.oil`), colour = "Red") +
  
  labs(x = "Year", y = "Value", title = "Crude Oil (Red) Gold (Green) and S&P 500 (Blue) vs Year")

From this plot, we observe that in the last 4 years, The exchange rate of gold fluctuates far more rapidly than that of S&P 500 and crude oil. So we are just going to use crude oil as it is more consitent with changes in S&P 500.

Predicting S&P 500

Based on previous analysis, assuming no major changes occur in the economy, we can predict the S&P 500 using multiple linear regression. Lets take a look at a scatter plot between crude oil and S&P 500 from 2017 to present.

oil_snp <- oil_snp %>%
  filter(lubridate::year(Date) > 2016)

oil_snp %>%
  ggplot(aes(x = `Close.oil`, y = `Close.s&p`)) +
  geom_point()

The plot looks like it has moderate linear correlation. Lets perform an ANOVA test to check for interactions between crude oil closing price and Date. Our Null hypothesis is that there is no interaction between crude oil closing price and Date. Our alternative hypothesis is that there is an interaction. To show relevant information about the anova, we can use the tidy function of the Broom package.

anova <- aov(`Close.s&p` ~ `Close.oil` + Date + `Close.oil`*Date, data = oil_snp)
broom::tidy(anova)
## # A tibble: 4 x 6
##   term              df     sumsq    meansq statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Close.oil          1 12461178. 12461178.    2696.   8.97e-223
## 2 Date               1  5331471.  5331471.    1154.   3.23e-141
## 3 Close.oil:Date     1    65479.    65479.      14.2  1.84e-  4
## 4 Residuals        593  2740724.     4622.      NA   NA

Based on the results of the anova test, we see that the p-value for the interaction Close.oil:Date is less than .001, so we reject the null hypothesis for all acceptable significance levels. We know know to include the interaction term in our model.

Lets build the linear regression model with the lm function. To show relevant information about the linear model, we again use broom::tidy

model <- lm(`Close.s&p` ~ `Close.oil`*Date, data = oil_snp)
broom::tidy(model)
## # A tibble: 4 x 5
##   term              estimate  std.error statistic  p.value
##   <chr>                <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept)     1897.      2042.         0.929  0.353   
## 2 Close.oil       -131.        37.6       -3.49   0.000524
## 3 Date               0.00820    0.116      0.0708 0.944   
## 4 Close.oil:Date     0.00801    0.00213    3.76   0.000184

Now that we have our linear model, we can predict the S&P 500 in the near future using the price of crude oil.

snp %>%
  filter(lubridate::year(Date) > 2016) %>%
  ggplot(aes(x=Date, y=Close)) +
  geom_line() +
  geom_smooth(method = 'lm', colour = "red") + 
  labs(x="Year", y="Closing Value", title = "S&P 500 Closing Value by Year")