The S&P 500 Index is an American stock market index which is an indicator of the health of the United States economy based on 500 of the largest United States stocks. Some of the largest publicy traded companies are included in the S&P 500, such as Microsoft, Apple, Amazon, Facebook, and more. In this project, we are going to do some exploratory data analysis on S&P 500 historical data obtained from Yahoo.com, and use historical pricing of crude oil, gold, and silver to predict the S&P 500.
We will begin by loading up the neccesary libraries for this project.
library(tidyverse)
library(rvest)
tidyverse::tidyverse_packages() # List all packages included in tidyverse
## [1] "broom" "cli" "crayon" "dplyr" "dbplyr"
## [6] "forcats" "ggplot2" "haven" "hms" "httr"
## [11] "jsonlite" "lubridate" "magrittr" "modelr" "purrr"
## [16] "readr" "readxl\n(>=" "reprex" "rlang" "rstudioapi"
## [21] "rvest" "stringr" "tibble" "tidyr" "xml2"
## [26] "tidyverse"
Now that all libraries have been loaded, we can load the dataset acquired from Yahoo Finance into a dataframe. The dataset is stored in a csv file (Which can be found here), so we will use the read_csv function from the readr package (part of the tidyverse), to do so.
snp <- read_csv("snp500.csv") %>%
as.tibble() # Load data into dataframe called snp
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
head(snp, 10) # Load first 10 rows of snp.
## # A tibble: 10 x 7
## Date Open High Low Close `Adj Close` Volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1950-01-03 16.7 16.7 16.7 16.7 16.7 1260000
## 2 1950-01-04 16.8 16.8 16.8 16.8 16.8 1890000
## 3 1950-01-05 16.9 16.9 16.9 16.9 16.9 2550000
## 4 1950-01-06 17.0 17.0 17.0 17.0 17.0 2010000
## 5 1950-01-09 17.1 17.1 17.1 17.1 17.1 2520000
## 6 1950-01-10 17.0 17.0 17.0 17.0 17.0 2160000
## 7 1950-01-11 17.1 17.1 17.1 17.1 17.1 2630000
## 8 1950-01-12 16.8 16.8 16.8 16.8 16.8 2970000
## 9 1950-01-13 16.7 16.7 16.7 16.7 16.7 3330000
## 10 1950-01-16 16.7 16.7 16.7 16.7 16.7 1460000
The dataset has seven attributes:
We can check the number of observations by using the function nrow. Lets find out how many years of data is in this dataset.
nrow(snp)/365
## [1] 47.82466
So we have about 48 years worth of S&P 500 data.
Lets try plotting The closing value of the S&P 500 by Year. To do so, we will use the ggplot function of the ggplot2 package.
snp %>%
ggplot(aes(x=Date, y=Close)) +
geom_line() + #Create a line plot
labs(x="Year", y="Closing Value", title = "S&P 500 Closing Value by Year")
Looking at this plot, the S&P 500 seems to have grown exponentially over time.
Crude Oil is “a naturally occurring, unrefined petroleum product composed of hydrocarbon deposits and other organic materials. A type of fossil fuel, crude oil can be refined to produce usable products such as gasoline, diesel and various forms of petrochemicals.” Crude Oil is nonrenewable, and society’s dependence on the fuel source is slowly diminishing the world’s supply, and slowly raising the price each year. We are interested in looking at historical prices of crude oil.
We will be using methods from the rvest package, another component of the tidyverse, to scrape a table with historical crude oil prices provided by (Investing.com)[https://www.investing.com/commodities/crude-oil-historical-data]. (Note: The site lets you download a csv, but we would like to practice scraping. Additionally, we are going to scrape the html file I downloaded, rather than directly from the website, but the process is exactly the same.)
oil <- read_html("Crude Oil WTI Futures Historical Prices - Investing.com.html") %>%
html_node("#curr_table") %>% # Find HTML node that contains the table
html_table %>% # Convert HTML table to R dataframe
magrittr::set_colnames(c("Date", "Price", "Open", "High", "Low", "Volume", "Change")) # Rename cols to better names
The dataset has seven attributes:
We have succesfully scraped the website, however the types of the attributes may not be correct. Lets see what types we have.
str(oil) # Shows a "compact" summary of an R object.
## 'data.frame': 4838 obs. of 7 variables:
## $ Date : chr "May 20, 2019" "May 19, 2019" "May 17, 2019" "May 16, 2019" ...
## $ Price : num 63.4 63.6 62.8 62.9 62 ...
## $ Open : num 63.6 63.2 63.1 62.1 61.4 ...
## $ High : num 64 63.7 63.6 63.5 62.3 ...
## $ Low : num 62.6 63.2 62.5 62.1 60.9 ...
## $ Volume: chr "-" "-" "215.80K" "545.68K" ...
## $ Change: chr "-0.24%" "1.27%" "-0.17%" "1.37%" ...
Based on the results of calling str on our oil dataframe, we see that Date, Volume, and Change % are stored as characters. We need to fix the types of those attributes, and we can do so using type_convert from the readr package.
oil$Volume <- gsub('K', 'e3', oil$Volume) # replace all numbers ending in K with e3
oil$Volume <- gsub('M', 'e6', oil$Volume) # replace all numbers ending in M with e6
oil[oil == "-"] <- NA # Set missing values to NA
oil$Change <- as.numeric(sub("%", "",oil$Change,fixed=TRUE))/100 # Convert percentages to decimal
oil <- oil %>%
type_convert(col_types = cols(Date = col_date(format = "%b %d, %Y"))) # Convert to correct types.
str(oil)
## 'data.frame': 4838 obs. of 7 variables:
## $ Date : Date, format: "2019-05-20" "2019-05-19" ...
## $ Price : num 63.4 63.6 62.8 62.9 62 ...
## $ Open : num 63.6 63.2 63.1 62.1 61.4 ...
## $ High : num 64 63.7 63.6 63.5 62.3 ...
## $ Low : num 62.6 63.2 62.5 62.1 60.9 ...
## $ Volume: num NA NA 215800 545680 723000 ...
## $ Change: num -0.0024 0.0127 -0.0017 0.0137 0.0039 0.0121 -0.0101 -0.0006 -0.0068 0.0117 ...
Lets try plotting Crude Oil closing cost vs Year
oil %>%
ggplot(aes(x=Date, y=Price)) +
geom_line() +
labs(x = "Year", y="Closing Cost",
title = "Closing Cost of Crude Oil vs Year")
Based on this plot, the cost of crude oil generally increases in a linear fashion, but there have been some events which drastically changed the price of crude oil which explains the spikes.
Gold is a metallic yellow chemical element. Due to its color, brightness, and durability, gold has been percieved as valuable sinces the times of the earliest civilizations. As a result, gold still has meaning today and many people purchase gold today. We will be looking at the historical exchange rate between the US Dollar and an ounce of gold, provided by Investing.com
gold <- read_csv("XAU_USD Historical Data.csv") %>%
magrittr::set_colnames(c("Date", "Price", "Open", "High", "Low", "Change"))
## Parsed with column specification:
## cols(
## Date = col_character(),
## Price = col_number(),
## Open = col_number(),
## High = col_number(),
## Low = col_number(),
## `Change %` = col_character()
## )
gold$Change <- as.numeric(sub("%", "",gold$Change,fixed=TRUE))/100 # Convert percentages to decimal
gold <- gold %>%
type_convert(col_types = cols(Date = col_date(format = "%b %d, %Y"))) # Convert to correct types.
str(gold)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 6 variables:
## $ Date : Date, format: "2019-05-20" "2019-05-19" ...
## $ Price : num 1278 1278 1278 1286 1297 ...
## $ Open : num 1278 1279 1286 1297 1297 ...
## $ High : num 1279 1279 1289 1299 1301 ...
## $ Low : num 1274 1277 1275 1284 1293 ...
## $ Change: num -0.0003 0 -0.0064 -0.0082 -0.0005 -0.0036 0.0113 0.0011 0.0017 0.0018 ...
## - attr(*, "spec")=
## .. cols(
## .. Date = col_character(),
## .. Price = col_number(),
## .. Open = col_number(),
## .. High = col_number(),
## .. Low = col_number(),
## .. `Change %` = col_character()
## .. )
The dataset has seven attributes:
Now that we have data on the price of crude oil and the S&P 500 properly formatted, we can merge the dataframes together. In order to reduce confusion, we will rename attributes of the oil dataset to match their equivalents in the s&p 500 dataset.
oil <- oil[c("Date", "Open", "High", "Low", "Price", "Volume", "Change")] %>%
magrittr::set_colnames(c("Date", "Open", "High", "Low", "Close", "Volume", "Change"))
oil_snp <- inner_join((select(oil, "Date", "Close")), (select(snp, "Date", "Close")), by = "Date", suffix = c(".oil", ".s&p"))
We wil do the same for the Gold dataset.
gold <- gold[c("Date", "Open", "High", "Low", "Price", "Change")] %>%
magrittr::set_colnames(c("Date", "Open", "High", "Low", "Close", "Change"))
gold_snp <- inner_join((select(gold, "Date", "Close")), (select(snp, "Date", "Close")), by = "Date", suffix = c(".gold", ".s&p"))
Lets take a look at how oil and the S&P 500 change over the years.
oil_snp %>%
ggplot(aes(x=Date)) +
geom_line(aes(y=`Close.s&p`)) +
geom_line(aes(y=`Close.oil`)) +
labs(x = "Year", y = "Value", title = "Crude Oil and S&P 500 vs Year")
Something’s not quite right. The value of the S&P 500 is so much higher than the price of crude oil, that this plot doesn’t really show us how their changes are correlated. If we want to compare their changes over time, we need to standardize them.
oil_scaled <- oil %>%
mutate(CloseScaled = (Close - mean(Close))/sd(Close))
snp_scaled <- snp %>%
mutate(CloseScaled = (Close - mean(Close))/sd(Close))
oil_snp_scaled <- inner_join((select(oil_scaled, "Date", "CloseScaled")), select(snp_scaled, "Date", "CloseScaled"), by = "Date", suffix = c(".oil", ".s&p"))
oil_snp_scaled %>%
ggplot(aes(x=Date)) +
geom_line(aes(y=`CloseScaled.s&p`), colour = "Blue") +
geom_line(aes(y=`CloseScaled.oil`), colour = "Red") +
labs(x = "Year", y = "Value", title = "Crude Oil (Red) and S&P 500 (Blue) vs Year")
Now we are getting somewhere. It appears that the price of crude oil rises at a similar pace to that of the value of the S&P 500, however, certain years, the price of crude oil changes drastically. Since our end goal is to predict the value of the S&P 500, it makes sense to focus on the most recent years, as the price of crude oil has not had a major change in the last 4 years.
oil_scaled <- oil %>%
# We use the year function of the lubridate package (part of the tidyverse) to check for dates where year is at least 2016
filter(lubridate::year(Date) > 2016) %>%
mutate(CloseScaled = (Close - mean(Close))/sd(Close))
snp_scaled <- snp %>%
filter(lubridate::year(Date) > 2016) %>%
mutate(CloseScaled = (Close - mean(Close))/sd(Close))
oil_snp_scaled <- inner_join(select(oil_scaled, "Date", "CloseScaled"), select(snp_scaled, "Date", "CloseScaled"), by = "Date", suffix = c(".oil", ".s&p"))
oil_snp_scaled %>%
ggplot(aes(x=Date)) +
geom_line(aes(y=`CloseScaled.s&p`), colour = "Blue") +
geom_line(aes(y=`CloseScaled.oil`), colour = "Red") +
labs(x = "Year", y = "Value", title = "Crude Oil (Red) and S&P 500 (Blue) vs Year")
This plot shows that there exists correlation between the price of Crude Oil and the Value of the S&P 500.
Now lets take a look at Gold and S&P 500.
gold_snp %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = `Close.s&p`), colour = "Blue") +
geom_line(aes(y = `Close.gold`), colour = "Green") +
labs(x = "Year", y = "Value", title = "Gold (Green) and S&P 500 (Blue) vs Year")
Now lets scale Gold and look at all three together.
gold_scaled <- gold %>%
# We use the year function of the lubridate package (part of the tidyverse) to check for dates where year is at least 2016
filter(lubridate::year(Date) > 2016) %>%
mutate(CloseScaled = (Close - mean(Close))/sd(Close))
gold_snp_oil_scaled <- inner_join(select(oil_scaled, "Date", "CloseScaled"), select(snp_scaled, "Date", "CloseScaled"), by = "Date", suffix = c(".oil", ".s&p")) %>%
inner_join(., select(gold_scaled, "Date", "CloseScaled"), by = "Date") %>%
magrittr::set_colnames(c("Date", "CloseScaled.oil", "CloseScaled.s&p", "CloseScaled.gold"))
gold_snp_oil_scaled %>%
ggplot(aes(x=Date)) +
geom_line(aes(y=`CloseScaled.s&p`), colour = "Blue") +
geom_line(aes(y=`CloseScaled.gold`), colour = "Green") +
geom_line(aes(y=`CloseScaled.oil`), colour = "Red") +
labs(x = "Year", y = "Value", title = "Crude Oil (Red) Gold (Green) and S&P 500 (Blue) vs Year")
From this plot, we observe that in the last 4 years, The exchange rate of gold fluctuates far more rapidly than that of S&P 500 and crude oil. So we are just going to use crude oil as it is more consitent with changes in S&P 500.
Based on previous analysis, assuming no major changes occur in the economy, we can predict the S&P 500 using multiple linear regression. Lets take a look at a scatter plot between crude oil and S&P 500 from 2017 to present.
oil_snp <- oil_snp %>%
filter(lubridate::year(Date) > 2016)
oil_snp %>%
ggplot(aes(x = `Close.oil`, y = `Close.s&p`)) +
geom_point()
The plot looks like it has moderate linear correlation. Lets perform an ANOVA test to check for interactions between crude oil closing price and Date. Our Null hypothesis is that there is no interaction between crude oil closing price and Date. Our alternative hypothesis is that there is an interaction. To show relevant information about the anova, we can use the tidy function of the Broom package.
anova <- aov(`Close.s&p` ~ `Close.oil` + Date + `Close.oil`*Date, data = oil_snp)
broom::tidy(anova)
## # A tibble: 4 x 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Close.oil 1 12461178. 12461178. 2696. 8.97e-223
## 2 Date 1 5331471. 5331471. 1154. 3.23e-141
## 3 Close.oil:Date 1 65479. 65479. 14.2 1.84e- 4
## 4 Residuals 593 2740724. 4622. NA NA
Based on the results of the anova test, we see that the p-value for the interaction Close.oil:Date is less than .001, so we reject the null hypothesis for all acceptable significance levels. We know know to include the interaction term in our model.
Lets build the linear regression model with the lm function. To show relevant information about the linear model, we again use broom::tidy
model <- lm(`Close.s&p` ~ `Close.oil`*Date, data = oil_snp)
broom::tidy(model)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1897. 2042. 0.929 0.353
## 2 Close.oil -131. 37.6 -3.49 0.000524
## 3 Date 0.00820 0.116 0.0708 0.944
## 4 Close.oil:Date 0.00801 0.00213 3.76 0.000184
Now that we have our linear model, we can predict the S&P 500 in the near future using the price of crude oil.
snp %>%
filter(lubridate::year(Date) > 2016) %>%
ggplot(aes(x=Date, y=Close)) +
geom_line() +
geom_smooth(method = 'lm', colour = "red") +
labs(x="Year", y="Closing Value", title = "S&P 500 Closing Value by Year")