Introduction

Zillow’s housing market predictions are an integral part of its business model, helping the company better understand how the market will value properties. Our team is confident that our geospatial, machine-learning-based model, which considers not only the attributes of homes but local factors as well, can improve Zillow’s house price predictions for the Boulder County study area and provide a template that can be adapted to other localities.

This project outlines and analyzes the process by which we created our model in three broad stages: data gathering, regression modeling, and finally, prediction and validation. Our process involves splitting our total dataset into a training set consisting of 75% of the points, and a testing set consisting of the other 25%. These data are used to create our model, which predicts the 100 observations of the challenge set.

We found that our model’s predictive capabilities were bolstered by the addition of geographic features. While many variables from the given dataset were central to the model’s predictions, engineered features such as the distance to the nearest park and the mean distance to the five closest trail heads also proved very important.

# Load the libraries used throughout the analysis
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(dplyr)
library(ggcorrplot)
library(caret)
library(spdep)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(jtools)     
library(ggstance)
library(rpart)
library(ggplot2)
library(stargazer)
library(viridis)
library(broom)      # provides tidy() for the regression objects below

# set up knitr
knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE,
    include = TRUE,
    cache = TRUE,
    echo=TRUE
)

# Set root directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

# Load helper functions (e.g., nn_function, mapTheme, plotTheme) from the urbanSpatial repository
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

# Define q5 (quintile bins) and qbr (quantile break labels) for later use
q5 <- function(variable) {as.factor(ntile(variable, 5))}

qbr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}
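For orientation, a small hypothetical usage example (not part of the original pipeline): q5 assigns each value to a quintile, and qbr produces the matching break labels used in the map legends below.

# Toy data to illustrate the two helpers (values are arbitrary)
toy <- data.frame(price = c(100, 200, 300, 400, 500) * 1000)

q5(toy$price)      # factor of quintile assignments: 1 2 3 4 5
qbr(toy, "price")  # character vector of quantile break labels for legends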

# Set a color palette for later use
palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

Data Gathering

2019 ACS Data

The data gathering process begins by accessing tract-level geography, population, and demographic data from the 2019 American Community Survey (ACS). Variables of interest are isolated and wrangled to create features required for the analysis.

# Create a list of all 2019 acs 5-year variables
varlist_2019 <- load_variables(2019, "acs5", cache = TRUE)

# Census API key
census_api_key("94efffd19b56ad527e379faea1653ee74dc3de4a",overwrite = TRUE)


# Grab nine specific tract-level variables using the tidycensus "get_acs" command for each tract in Boulder County in 2019
tracts19 <- get_acs(geography = "tract",            
                    variables = c("B01001_001","B23025_004","B06011_001",
                                  "B06012_002","B02001_002","B25002_003",
                                  "B25013_006","B08013_001","B15012_009"), 
                    year=2019, 
                    state=08, 
                    county=013,
                    output = "wide",
                    geometry=TRUE) %>% 
            st_transform('ESRI:102254') %>%
            select( c("GEOID","B01001_001E","B23025_004E","B06011_001E",
                      "B06012_002E","B02001_002E","B25002_003E",
                    "B25013_006E","B08013_001E","B15012_009E","geometry") ) %>%
            rename(tot_pop = "B01001_001E",
                   empl_pop = "B23025_004E",
                   med_inc = "B06011_001E",
                   pvty_pop = "B06012_002E",
                   white_pop = "B02001_002E",
                   vac_occ = "B25002_003E",
                   own_occ_bach = "B25013_006E",
                   tt_work = "B08013_001E",
                   sci_bach = "B15012_009E") %>%
            mutate(area = as.numeric(st_area(geometry)/1000000))%>%
            mutate(pop_den = tot_pop/area)

Primary Training Dataset: Housing Data

The housing dataset, provided as part of the project requirements, contains sale prices along with a variety of structural features for each house. After other variables are added to the dataset, we have the features we need to create our model.

We then joined ACS demographic data that we theorized could predict housing prices. Each house is ascribed the demographic variables associated with the tract in which it is located.

# This loads all the data into "studentData".
studentData <- st_read("studentData.geojson", crs = 'ESRI:102254')%>%
  mutate(Age = 2021 - builtYear) %>%
  mutate(TotalBath = nbrThreeQtrBaths + nbrFullBaths + nbrHalfBaths)

# Repair any invalid geometries before spatial operations
studentData <- st_make_valid(studentData)

# Attach ACS data
studentData <- st_join(studentData, tracts19, join = st_within)

#Boulder County Boundary
BoulderCounty_Boundary <-
st_read("https://opendata.arcgis.com/datasets/964b8f3b3dbe401bb28d49ac93d29dc4_0.geojson")%>%
  select(geometry) %>%
  st_transform('ESRI:102254')

#Boulder Municipal Boundary
BoulderMuni_Boundary <-
  st_read("https://opendata.arcgis.com/datasets/9597d3916aba47e887ca563d5ac15938_0.geojson")%>%
  st_transform('ESRI:102254')%>%
  rename(Municipality = ZONEDESC)

Municipalities

Boulder County and the boundaries of the municipalities within it are drawn from open data geometry files. Fig. 1 below displays home sale prices as graduated points superimposed on Boulder County and its 10 municipalities.

#Map of Boulder County with housing sale prices and municipal boundaries
ggplot()+
  geom_sf(data = BoulderCounty_Boundary, fill = "grey70") +
  geom_sf(data = BoulderMuni_Boundary, aes(fill = Municipality), alpha = 0.5, colour = "white") +
  geom_sf(data = studentData, aes(colour = q5(price)),size=.85)+
  labs(title = "Housing Sales Under Municipalities", subtitle = "Boulder County, CO", caption = "Fig. 1")+
  scale_colour_manual(values = palette5,
                      labels=qbr(studentData,"price"),
                      name = "Sale Price") +
  mapTheme()

# Median Housing Price in Each Municipality
House_in_muni_boundary <- st_intersection(studentData, BoulderMuni_Boundary) %>%
  mutate(pct_pvty = pvty_pop/tot_pop * 100,
         pct_white = white_pop/tot_pop * 100,
         )

Table 1, below, summarizes key indicators by municipality. Each value is the mean of the attributes ascribed to homes in that municipality; for example, the poverty rate shown is not the municipality’s poverty rate, but the average tract-level poverty rate ascribed to the houses in that municipality. While the rates are unweighted averages, we conclude that house price is strongly related to municipality. The City of Boulder consistently has the highest housing prices, a trend we felt should be reflected in our model.

# Produce a table of price and demographic variable averages across municipalities
Municipality.Summary <-
  st_drop_geometry(House_in_muni_boundary) %>%
  group_by(Municipality) %>%
  summarize(Median.Price = median(price, na.rm = T),
            Mean.Price = mean(price, na.rm = T),
            Population = mean(tot_pop, na.rm = T),
            "Population Density per km^2" = mean(pop_den, na.rm = T),
            "Median Income" = mean(med_inc, na.rm = T),
            "Poverty Rate" = mean(pct_pvty, na.rm = T),
            "White Population in %" = mean(pct_white, na.rm = T),
            "Building Age" = mean(Age, na.rm = T)) %>%
  arrange(desc(Median.Price))

kable(Municipality.Summary, digits = 2) %>%
  kable_styling() %>%
  footnote(general_title = "\n",
           general = "Table 1. Housing prices grouped by municipalities")
| Municipality | Median.Price | Mean.Price | Population | Population Density per km^2 | Median Income | Poverty Rate | White Population in % | Building Age |
|---|---|---|---|---|---|---|---|---|
| Boulder | 950000 | 1228007.0 | 4785.49 | 1774.41 | 43834.62 | 12.27 | 88.95 | 51.38 |
| Louisville | 680000 | 736988.4 | 4959.54 | 1329.45 | 47542.11 | 5.29 | 90.15 | 36.35 |
| Superior | 656900 | 705382.1 | 4706.15 | 982.34 | 51891.10 | 4.72 | 77.15 | 17.30 |
| Lyons | 652500 | 625371.9 | 4016.00 | 28.63 | 51709.00 | 4.56 | 94.05 | 43.54 |
| Erie | 600000 | 624215.4 | 12923.00 | 316.98 | 55578.00 | 5.83 | 92.79 | 8.48 |
| Jamestown | 570000 | 566608.3 | 5563.08 | 21.77 | 47650.33 | 9.47 | 91.31 | 42.00 |
| Lafayette | 560500 | 618278.9 | 5515.73 | 863.10 | 45651.45 | 6.28 | 89.51 | 28.93 |
| Nederland | 536250 | 586593.5 | 5735.00 | 16.74 | 44740.00 | 9.59 | 94.68 | 41.31 |
| Longmont | 455000 | 493376.4 | 6057.30 | 1272.47 | 38180.19 | 8.51 | 89.09 | 29.96 |
| Ward | 295000 | 291250.0 | 5990.00 | 23.59 | 47166.00 | 9.92 | 90.88 | 81.71 |

Table 1. Housing prices grouped by municipalities
# Add municipal boundaries to data table 
studentData <- st_join(studentData, BoulderMuni_Boundary, join = st_within)

# Remove unimportant variables from the data table
studentData <- 
studentData %>%
  select(-c(OBJECTID, ZONECLASS, LASTUPDATE,LASTEDITOR, REG_PDF_URL, Shape_STArea__, 
            Shape_STLength__, ShapeSTArea,ShapeSTLength
            ))

# Create one dummy variable per municipality: 1 if the house lies within
# that municipality, 0 otherwise (houses outside any municipality get 0)
muni_dummies <- c(Loui_dummy = "Louisville", Ward_dummy = "Ward",
                  Jame_dummy = "Jamestown",  Nede_dummy = "Nederland",
                  Boul_dummy = "Boulder",    Erie_dummy = "Erie",
                  Lafa_dummy = "Lafayette",  Long_dummy = "Longmont",
                  Lyon_dummy = "Lyons",      Supe_dummy = "Superior")

for (d in names(muni_dummies)) {
  studentData[[d]] <- ifelse(!is.na(studentData$Municipality) &
                               studentData$Municipality == muni_dummies[[d]],
                             1, 0)
}

Parks, Landmarks, and Trail Heads

The Boulder area is rich in natural beauty and outdoor activities. We thought it likely that home buyers place a premium on proximity to amenities like parks, natural landmarks, and trail heads, and engineered features accordingly.

For parks, we downloaded a polygon layer of park area in Boulder County and created three “nearest neighbor” variables for each home: the distance to the nearest park space, the mean distance to the two nearest park spaces, and the mean distance to the three nearest. A fourth variable counts the number of park centroids within a 500 m buffer of each home. A map of trail heads produced a similar set of variables, although a 1,000 m buffer was used for the count variable rather than a 500 m buffer because trail heads are typically accessed by car. For landmarks, a single variable was added, expressing the distance between the home and the nearest landmark.
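The nn_function calls below come from the functions.r sourced at the top of this document. For readers without that file at hand, a minimal equivalent built on FNN::get.knnx (a sketch of its behavior as used here, not the canonical definition) looks like this:

# Mean Euclidean distance from each point in measureFrom to its
# k nearest neighbors in measureTo; both inputs are coordinate matrices
nn_sketch <- function(measureFrom, measureTo, k) {
  nn <- FNN::get.knnx(measureTo, measureFrom, k = k)
  rowMeans(nn$nn.dist)
}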

# Load Park data
GreenSpacePolygon <- st_read("County_Open_Space.geojson") %>%
                     st_transform('ESRI:102254')

# Remove NA park groups
Park <- GreenSpacePolygon[!is.na(GreenSpacePolygon$PARK_GROUP),] %>%
        st_centroid()
  
# Shorthand for extracting coordinate matrices from sf objects
st_c <- st_coordinates

# Add three nearest-neighbor distance variables for parks, plus the
# distance from each home to its single nearest park centroid
studentData <-
  studentData %>% 
  mutate(park_nn1 = nn_function(st_c(studentData), st_c(Park), 1),
         park_nn2 = nn_function(st_c(studentData), st_c(Park), 2),
         park_nn3 = nn_function(st_c(studentData), st_c(Park), 3),
         park_dist = as.numeric(st_distance(
           studentData, Park[st_nearest_feature(studentData, Park), ],
           by_element = TRUE)))

# Create a variable for the parks within a 500m buffer of any house
Park_in_buffer <-
  st_join(Park,st_buffer(studentData,500),join = st_within)

PKCount <-
  Park_in_buffer %>%
  count(MUSA_ID)%>%
  st_drop_geometry()

# Add the buffer park count variable
studentData <-
  left_join(studentData, PKCount, by = "MUSA_ID" )

studentData <-
  studentData %>%
  rename(PKCount500m = n)

# Remove NA values
studentData$PKCount500m[is.na(studentData$PKCount500m)] <- 0

# Load Landmarks data
landmarksPolygon <- st_union(st_read("Natural_Landmarks.geojson")) %>%
  st_transform('ESRI:102254')

# Attach distance to the (unioned) landmarks layer; one value per home
studentData <- mutate(studentData,
                      landmark_dist = as.numeric(st_distance(studentData, landmarksPolygon)))

# Loading Trail Heads Locations
TrailHead <- 
  st_read("https://opendata.arcgis.com/datasets/5ade4ef915c54430a32026bcb03fe1d7_0.geojson") %>%
  st_transform('ESRI:102254')%>%
  select(geometry)

#Apply buffer counts and nearest neighbor function on trail heads
Trailhead_in_buffer <-
  st_join(TrailHead,st_buffer(studentData,1000),join = st_within)

TrailHeadCount <-
  Trailhead_in_buffer %>%
  count(MUSA_ID)%>%
  st_drop_geometry()

studentData <-
  left_join(studentData, TrailHeadCount, by = "MUSA_ID" )

studentData <-
  studentData %>%
  rename(TrailHead1000m = n)

studentData$TrailHead1000m[is.na(studentData$TrailHead1000m)] <- 0

# This time calculate up to 5 nearest neighbors
studentData <-
  studentData %>% 
  mutate(
    head_nn1 = nn_function(st_c(studentData), st_c(TrailHead), 1),
    head_nn2 = nn_function(st_c(studentData), st_c(TrailHead), 2), 
    head_nn3 = nn_function(st_c(studentData), st_c(TrailHead), 3),
    head_nn4 = nn_function(st_c(studentData), st_c(TrailHead), 4), 
    head_nn5 = nn_function(st_c(studentData), st_c(TrailHead), 5))

# Distance from each home to its single nearest trail head
studentData <-
  studentData %>%
  mutate(trail_dist = as.numeric(st_distance(
    studentData, TrailHead[st_nearest_feature(studentData, TrailHead), ],
    by_element = TRUE)))

Playgrounds, Schools, Flood Plains

Additional variables considered by the model capture proximity to playgrounds, schools, and floodplains. Playgrounds are important amenities for families with young children, who we theorize make up a large portion of the home-buying contingent. Our playground indicator counts the number of playgrounds within a 500 m radius of the house.

Similarly, we know that the availability of quality schools contributes to higher house prices. Though no school-quality data is readily available for Boulder County, we use the distance to the nearest school as a proxy. We also engineered features related to private schools using the nearest-neighbor function.

Finally, distance to the floodplain is added because of the potential connection between risky proximity to floodable areas and disadvantaged areas. We use the distance between a given home and the floodplain as a variable in our model. However, the model could benefit from a more granular, risk-oriented flood feature; if a similar analysis is replicated in another locality, we recommend checking insurance databases for the availability of such data.

#Loading Playground Locations
Playground <- 
  st_read("Playground_Sites_Points.GEOJSON") %>%
  st_transform('ESRI:102254')%>%
  drop_na(PROPID)%>%
  select(geometry)

# Creating a variable to represent the number of playgrounds within a 500m buffer of a home
Playground.sf <-
  st_join(Playground,st_buffer(studentData,500),join = st_within)

# Call this PGCount
PGCount <-
  Playground.sf %>%
  count(MUSA_ID)%>%
  st_drop_geometry()

# Join PGCount to the rest of the variables
studentData <-
  left_join(studentData, PGCount, by = "MUSA_ID" )

studentData <-
  studentData %>%
  rename(pgcount500m = n)

studentData$pgcount500m[is.na(studentData$pgcount500m)] <- 0
  

#Loading School Locations
Schools <- 
  st_read("CDPHE_CDOE_School_Locations_and_District_Office_Locations.GEOJSON")%>%
  st_transform('ESRI:102254')%>%
  filter(COUNTY == "BOULDER")
  
# Differentiate private schools for further analysis
Private_School <-
  filter(Schools, startsWith(Type_, "Non-"))

# Create a variable for distance to the nearest school
studentData <-
  studentData %>% 
  mutate(
  school_nn1 = nn_function(st_c(studentData), st_c(Schools), 1))

# For private schools, create variables for the mean distance to the nearest
# one through three neighbors, plus the distance to the single nearest one
studentData <-
  studentData %>%
  mutate(
    privatesch_nn1 = nn_function(st_c(studentData),st_c(Private_School),1),
    privatesch_nn2 = nn_function(st_c(studentData),st_c(Private_School),2),
    privatesch_nn3 = nn_function(st_c(studentData),st_c(Private_School),3),
    privat_dist = as.numeric(st_distance(
      studentData,
      Private_School[st_nearest_feature(studentData, Private_School), ],
      by_element = TRUE)))

#Loading Flood Plain
Floodplain <-
 st_read("https://opendata.arcgis.com/datasets/30674682e55e4e1b8b0e407b0fe23b9a_0.geojson") %>%
 st_transform('ESRI:102254')%>%
  st_union()

# Add a variable for distance to the (unioned) floodplain
studentData <-
  studentData %>%
  mutate(flood_dist = as.numeric(st_distance(studentData, Floodplain)))


# This selects all numeric variables as preparation for the correlation analysis that follows.
# It also deletes an outlier.
cleanData <- 
  select_if(st_drop_geometry(studentData), is.numeric) %>%
  select(!c(year, ExtWallSec, IntWall, Roof_Cover, Stories, UnitCount))%>%
  slice(-2638)

# Re-attach each sale's census tract GEOID via its MUSA_ID
studentData.GEOID <-
  st_drop_geometry(select(studentData, GEOID, MUSA_ID))

cleanData <-
  left_join(cleanData, studentData.GEOID, by = "MUSA_ID")

# Recode missing AC and heating codes as 0
cleanData$Ac[is.na(cleanData$Ac)] <- 0
cleanData$Heating[is.na(cleanData$Heating)] <- 0

Figs. 1.2, 1.3, and 1.4 map selected variables across Boulder County. Most of the parks and trail heads are located outside of the residential areas, while most private schools are located in the City of Boulder and Longmont. At a glance, housing sale prices appear closely related to proximity to private schools.

#Parks
ggplot() + 
  geom_sf(data = BoulderCounty_Boundary, fill = "grey60") +
  stat_density2d(stat = "sf_coordinates",
                 data = data.frame(st_coordinates(Park)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis(name = "Density Level") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
   geom_sf(data = studentData, aes(colour = q5(price)),size=.85)+
  scale_colour_manual(values = palette5,
                      labels=qbr(studentData,"price"),
                      name = "Sale Price") +
  labs(title = "Density of Parks with Housing Sales", caption = "Fig 1.2") +
  mapTheme(title_size = 14)

#Trail Head Location
ggplot() + 
  geom_sf(data = BoulderCounty_Boundary, fill = "grey60") +
  geom_sf(data = TrailHead, colour = "purple") +
   geom_sf(data = studentData, aes(colour = q5(price)),size=.85)+
  scale_colour_manual(values = palette5,
                      labels=qbr(studentData,"price"),
                      name = "Sale Price") +
  labs(title = "Trail Head Location with Housing Sales", caption = "Fig 1.3") +
  mapTheme(title_size = 14)

#Private School
ggplot() + 
  geom_sf(data = BoulderCounty_Boundary, fill = "grey60") +
  stat_density2d(stat = "sf_coordinates",
                 data = data.frame(st_coordinates(Private_School)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis(name = "Density Level") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
   geom_sf(data = studentData, aes(colour = q5(price)),size=.85)+
  scale_colour_manual(values = palette5,
                      labels=qbr(studentData,"price"),
                      name = "Sale Price") +
  labs(title = "Density of Private School with Housing Sales", caption = "Fig 1.4") + mapTheme(title_size = 14)

# Predictions and Validation

Preparation for Machine Learning and the Regression Model

The model creates coefficients for several of the previously mentioned variables along with an additional constant value. To make a price prediction for a home, the values of the home’s variables (square footage, population of the census tract, distance to landmarks, etc.) are multiplied by the coefficients created by the model, and the sum of the results produces a house price. Our OLS model must first look at the relationships in the training set data so that it can optimize the coefficients and produce the best possible model.
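As a toy illustration of how one prediction is assembled (the coefficient names and values here are hypothetical, not our fitted model):

# Hypothetical coefficients and one home's feature values
coefs <- c(intercept = -1500000, TotalFinishedSF = 130, qualityCode = 37000)
home  <- c(intercept = 1,        TotalFinishedSF = 2000, qualityCode = 40)

sum(coefs * home)  # -1,500,000 + 130*2,000 + 37,000*40 = $240,000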

To test our model, we use a 75/25 split of the data: the model is trained on 75% of the observations and tested on the remaining 25%. Since we already know the prices of the points we are testing, we get a sense of the accuracy of our model before making predictions for the challenge set.

# Separate the challenge set (toPredict == 1), whose prices we must predict

test.Data <- filter(cleanData, toPredict == 1)

#Split the ‘toPredict’ == 0 into a separate training and test set 
#using a 75/25 split

train.Data <- filter(cleanData, toPredict == 0) %>%
  na.omit()
  

inTrain <- createDataPartition(
  y = paste(train.Data$Heating, train.Data$Ac,train.Data$GEOID), 
  p = .75, list = FALSE)
train.Data.1 <- train.Data[inTrain,] 
train.Data.2 <- train.Data[-inTrain,]

#Dataframe with variables of interest
train.Data.clean <-
  train.Data %>%
  select(price, 
         section_num,
         qualityCode,
         TotalFinishedSF,
         Age,
         Ac,
         Heating,
         med_inc,
         nbrRoomsNobath,
         vac_occ,
         tot_pop,
         pop_den,
         pvty_pop,
         head_nn5,
         park_nn1,
         Boul_dummy,
         Loui_dummy,
         pgcount500m,
         TotalBath,
         flood_dist,
         landmark_dist,
         privat_dist)

Fig. 2 graphs the relationship between housing price and the distance to the single nearest trail head.
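For completeness, a minimal sketch of the chunk behind Fig. 2 (assuming the head_nn1 feature engineered above):

# Scatterplot of price against distance to the single nearest trail head
ggplot(train.Data, aes(head_nn1, price)) +
  geom_point(size = .5, colour = "orange") +
  geom_smooth(method = "lm", colour = "red") +
  labs(x = "Distance to Nearest Trail Head (m)", y = "Sale Price ($)",
       title = "Price as a Function of Trail Head Proximity",
       caption = "Fig. 2")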

Correlation & Test Significance

Prior to the regression, relationships across all the variables are evaluated to test the significance of their relationships with one another. This process flags variables that are collinear, meaning that two or more variables carry such similar information that including all of them would be redundant. In Fig. 3, the Pearson correlation is computed: strong orange indicates strong positive correlation and strong green indicates strong negative correlation between selected variables.

Note that price is strongly correlated with section number, quality code, total finished square footage, number of bathrooms, location within the City of Boulder, and the count of playgrounds within a 500 m radius.
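The chunk behind Fig. 3 does not appear here; a minimal sketch using the ggcorrplot package loaded above, assuming the correlation matrix is computed on train.Data.clean:

# Pearson correlation matrix across the modeling variables;
# green shades mark negative correlation, orange shades positive
ggcorrplot(
  round(cor(train.Data.clean, use = "pairwise.complete.obs"), 1),
  colors = c("#25CB10", "white", "#FA7800"),
  type = "lower") +
  labs(title = "Correlation Across Numeric Variables", caption = "Fig. 3")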

Below is an in-depth look at Fig. 4, presenting correlations between price and four selected variables. There is a noticeably strong positive correlation between price and total finished square footage. Negative correlations are also clear between price and the distances to landmarks and private schools.
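A sketch of Fig. 4's four scatterplots follows (the first three variables come from the text; the fourth, med_inc, is our assumption):

# Price against four selected variables, one panel each
train.Data.clean %>%
  select(price, TotalFinishedSF, landmark_dist, privat_dist, med_inc) %>%
  gather(Variable, Value, -price) %>%
  ggplot(aes(Value, price)) +
  geom_point(size = .5, colour = "orange") +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a Function of Selected Variables", caption = "Fig. 4")

After these figures, the report reproduces the summary of a simple baseline OLS model, in which price is regressed on quality code, square footage, age, population density, and median income: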

## 
## Call:
## lm(formula = price ~ ., data = train.Data %>% dplyr::select(price, 
##     qualityCode, TotalFinishedSF, mainfloorSF, Age, pop_den, 
##     med_inc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1590341  -188808   -50635   116708  5488603 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.566e+06  3.802e+04 -41.184  < 2e-16 ***
## qualityCode      3.717e+04  9.059e+02  41.035  < 2e-16 ***
## TotalFinishedSF  1.319e+02  8.876e+00  14.860  < 2e-16 ***
## mainfloorSF      3.624e+01  1.048e+01   3.458 0.000547 ***
## Age              7.853e+03  2.504e+02  31.366  < 2e-16 ***
## pop_den          7.324e+01  6.805e+00  10.762  < 2e-16 ***
## med_inc          6.184e+00  5.942e-01  10.407  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406400 on 7416 degrees of freedom
## Multiple R-squared:  0.4735, Adjusted R-squared:  0.473 
## F-statistic:  1111 on 6 and 7416 DF,  p-value: < 2.2e-16

100-Fold Cross-Validation

The next step of the project involves cross-validation using k-folds. We chose 100-fold cross-validation, which randomly partitions the training set into 100 groups; in each iteration the model is trained on 99 groups and its errors are checked on the remaining group. During the research process we tested many combinations of variables, but this combination seemed reasonable and was particularly successful in making predictions. The results show that approximately 70% of the variance is explained by the model (depending on the iteration). With this result, we proceed to apply our predictions to the challenge set.

Table 3 displays the coefficient estimate and significance level for each variable in the regression. A p-value below 0.05 indicates that the variable’s relationship with house price is statistically significant. The R-squared value displayed below gives the share of the variance in price that the model accounts for.

# Multivariate regression 
reg1 <- lm(price ~ ., data = train.Data.1 %>% 
             dplyr::select(price, 
                 section_num,
                 qualityCode,
                 TotalFinishedSF,
                 Ac,
                 Heating,
                 Age,
                 med_inc,
                 landmark_dist,
                 nbrRoomsNobath,
                 pvty_pop,
                 privat_dist,
                 head_nn5,
                 park_nn1,
                 Loui_dummy,
                 Nede_dummy,
                 Boul_dummy,
                 Erie_dummy,
                 Long_dummy,
                 Lyon_dummy,
                 pgcount500m,
                 TotalBath,
                 flood_dist))


# Create a stargazer table to represent variation of selected variables
stargazer(reg1, type = "html", title = "Table 3. Variation of selected variable and error control of the regression model",out="table1.html",dep.var.labels = "Housing Price")
Table 3. Variation of selected variables and error control of the regression model
Dependent variable: Housing Price

| Predictor | Estimate | Std. Error |
|---|---|---|
| section_num | 412,961.100*** | 93,038.840 |
| qualityCode | 24,981.620*** | 996.088 |
| TotalFinishedSF | 162.067*** | 9.864 |
| Ac | -12,298.460*** | 3,741.127 |
| Heating | 3,609.143*** | 559.931 |
| Age | 3,086.934*** | 311.135 |
| med_inc | -1.650** | 0.800 |
| landmark_dist | -3.610 | 4.350 |
| nbrRoomsNobath | 513.097 | 3,578.184 |
| pvty_pop | -34.964* | 18.342 |
| privat_dist | -0.101 | 0.672 |
| head_nn5 | -28.451*** | 6.138 |
| park_nn1 | 32.724*** | 5.147 |
| Loui_dummy | 112,943.900*** | 22,651.080 |
| Nede_dummy | 299,635.200* | 153,023.700 |
| Boul_dummy | 435,012.200*** | 25,305.600 |
| Erie_dummy | -201,529.400*** | 22,732.790 |
| Long_dummy | -193,822.000*** | 18,916.530 |
| Lyon_dummy | -249,719.900*** | 75,196.160 |
| pgcount500m | -26,725.240* | 14,440.280 |
| TotalBath | 48,507.360*** | 7,439.177 |
| flood_dist | -28.024*** | 3.933 |
| Constant | -1,275,489.000 | 853,281.700 |
| Observations | 5,654 | |
| R2 | 0.600 | |
| Adjusted R2 | 0.599 | |
| Residual Std. Error | 365,214.800 (df = 5631) | |
| F Statistic | 384.366*** (df = 22; 5631) | |

Note: *p<0.1; **p<0.05; ***p<0.01
# Display a table of the model's coefficient estimates from the 75% training split
reg1.tidy <-
  tidy(reg1)

kable(reg1.tidy, digits = 2, caption = "Table 4. Regression model predicting variation in housing prices", 
      col.names = c("Predictor","Estimate","Standard Error", "T-Value","P-Value")
      ) %>%
  kable_styling()
Table 4. Regression model predicting variation in housing prices

| Predictor | Estimate | Standard Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | -1275488.53 | 853281.71 | -1.49 | 0.14 |
| section_num | 412961.06 | 93038.84 | 4.44 | 0.00 |
| qualityCode | 24981.62 | 996.09 | 25.08 | 0.00 |
| TotalFinishedSF | 162.07 | 9.86 | 16.43 | 0.00 |
| Ac | -12298.46 | 3741.13 | -3.29 | 0.00 |
| Heating | 3609.14 | 559.93 | 6.45 | 0.00 |
| Age | 3086.93 | 311.14 | 9.92 | 0.00 |
| med_inc | -1.65 | 0.80 | -2.06 | 0.04 |
| landmark_dist | -3.61 | 4.35 | -0.83 | 0.41 |
| nbrRoomsNobath | 513.10 | 3578.18 | 0.14 | 0.89 |
| pvty_pop | -34.96 | 18.34 | -1.91 | 0.06 |
| privat_dist | -0.10 | 0.67 | -0.15 | 0.88 |
| head_nn5 | -28.45 | 6.14 | -4.64 | 0.00 |
| park_nn1 | 32.72 | 5.15 | 6.36 | 0.00 |
| Loui_dummy | 112943.93 | 22651.08 | 4.99 | 0.00 |
| Nede_dummy | 299635.23 | 153023.72 | 1.96 | 0.05 |
| Boul_dummy | 435012.24 | 25305.60 | 17.19 | 0.00 |
| Erie_dummy | -201529.35 | 22732.80 | -8.87 | 0.00 |
| Long_dummy | -193821.97 | 18916.53 | -10.25 | 0.00 |
| Lyon_dummy | -249719.90 | 75196.16 | -3.32 | 0.00 |
| pgcount500m | -26725.24 | 14440.28 | -1.85 | 0.06 |
| TotalBath | 48507.36 | 7439.18 | 6.52 | 0.00 |
| flood_dist | -28.02 | 3.93 | -7.13 | 0.00 |
# This sets up k-fold cross-validation.
k = 100
fitControl <- trainControl(method = "cv", number = k)
set.seed(825)
# variables in the "select(...)" function are considered in the analysis here.
regression.100foldcv <- 
  train(price ~ ., data = train.Data.1%>% 
          select(price, 
                 section_num,
                 qualityCode,
                 TotalFinishedSF,
                 Ac,
                 Heating,
                 Age,
                 med_inc,
                 landmark_dist,
                 nbrRoomsNobath,
                 pvty_pop,
                 privat_dist,
                 head_nn5,
                 park_nn1,
                 Loui_dummy,
                 Nede_dummy,
                 Boul_dummy,
                 Erie_dummy,
                 Long_dummy,
                 Lyon_dummy,
                 pgcount500m,
                 TotalBath,
                 flood_dist
          ),
  method = "lm", trControl = fitControl, na.action = na.pass)

# Print the cross-validation results; the resulting Mean Absolute Error (MAE)
# tells us how successful our model is at predicting unseen data.

regression.100foldcv

MAE Histogram

While we are proud of our model’s ability to make predictions, it of course cannot be perfect. The following histogram displays the distribution of the Mean Absolute Error (MAE) across the folds of the 100-fold cross-validation exercise. MAE measures the magnitude of the average difference between the predicted price and the actual price of a home. In most folds, the predictions were off by around $200,000 on average.

#MAE Histogram
#This allows you to see the distribution of MAE of the folds

ggplot(regression.100foldcv$resample, aes(x=MAE))+
  geom_histogram(colour = "white",fill="orange")+
  labs(title="Distribution of MAE",x="Mean Absolute Error", y = "Count",
       subtitle = "k-fold cross-validation; k = 100",
       caption = "Fig. 5")+
  theme_apa()

Test Set Prediction

After a set of regression coefficients has been produced and checked by 100-fold cross-validation, the model is used to predict prices for the testing set. The results are displayed in Table 6.

The resulting Mean Absolute Percentage Error indicates that the model produced a decent output: predictions were, on average, within about 35% of the sale price. This bodes well for the potential of the model to predict mostly reasonable prices for the challenge set.
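For completeness, a minimal reconstruction of the chunk that scores the held-out 25% split (it mirrors the neighborhood-effects scoring step later in this report):

# Predict the held-out split with the baseline model and compute errors
train.Data.2 <-
  train.Data.2 %>%
  mutate(Regression = "Baseline Regression",
         SalePrice.Predict = predict(reg1, train.Data.2),
         SalePrice.Error = SalePrice.Predict - price,
         SalePrice.AbsError = abs(SalePrice.Predict - price),
         SalePrice.APE = abs(SalePrice.Predict - price) / price)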

Table 6. Mean Absolute Error and Mean Absolute Percentage Error on the Prediction Trial

| Regression | Mean Absolute Error ($) | Mean Absolute Percentage Error |
|---|---|---|
| Trial 1 | 193241.2 | 0.35 |

Plot Predictions

Fig. 6 compares predicted price to observed price. The logarithmic shape of the relationship shows that for lower prices, our model slightly under-predicts observed prices. Mid-level prices tend to be over-predicted, and the highest-priced homes are where our model struggles most: prices of homes well above the mean are significantly underestimated.

Fig. 7, which maps the absolute errors between observed and predicted housing prices across Boulder County, shows that absolute error is generally low throughout the county.

# Plot of predicted prices as a function of observed price
# Also included: theoretical line at which predictions = observations
ggplot(data = train.Data.2, aes(price,SalePrice.Predict)) +
  geom_point(size = .85,colour = "orange") + 
  geom_abline(slope=1,colour = "red",size = 1.2)+
  labs(x = "Obeserved Prices",y = "Predicted Prices", title = "Predicted Prices as a Function of Observed Prices",
       subtitle = "Boulder County, CO (all in $)", 
       caption = "Fig. 6")+
  theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.line = element_line(colour = "grey50", size = 1),
        panel.grid.major = element_line(linetype = "dotted",size = 1))

#Map of residuals
#Re-attach house point geometry (dropped when cleanData was built) so the
#degree of error can be mapped across Boulder County

train.Data.2 <-
  left_join(train.Data.2,
            as_tibble(studentData) %>% select(MUSA_ID, geometry),
            by = "MUSA_ID") %>%
  st_as_sf()

ggplot()+
  geom_sf(data = BoulderCounty_Boundary, fill = "grey70") +
  geom_sf(data = tracts19,colour = "Black") +
  geom_sf(data = train.Data.2, aes(colour = SalePrice.AbsError),size=.90)+
  scale_colour_gradient(low = "#00cfbb", high = "#6f5ad3",
    name = "Absolute Error ($)" ) +
  labs(title = "Absolute Errors between Observed and Predicted Housing Price", caption = "Fig. 7")+
  mapTheme()

Lag Price Errors

There is a slight positive trend relating sale price error to the spatial lag of errors, suggesting some spatial lag effect: the points with the greatest error are somewhat more likely to have nearby points with high error.

# Lag Price Errors

coords.test <-  st_coordinates(train.Data.2) 

# Get list of neighbors
neighborList.test <- knn2nb(knearneigh(coords.test, 5))

# Supplement neighbors list with spatial weights
spatialWeights.test <- nb2listw(neighborList.test, style="W")

# Get lag price error
train.Data.2.lag <-
train.Data.2 %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error))

# Plot lag price errors against predicted price error
ggplot(data = train.Data.2.lag, aes(lagPriceError, SalePrice.Error)) +
  geom_point(size = .85,colour = "orange") + 
  geom_smooth(method = "lm",colour = "red",size = 1.2)+
  labs(x = "Spatial Lag of Errors (Mean error of five nearest neighbors)",y = "Sale Price Error", title = "Lag Sale Price Errors against Predicted Sale Price Error",
       subtitle = "Boulder County, CO (all in $)",caption = "Fig. 8")+
  theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.line = element_line(colour = "grey50", size = 1),
        panel.grid.major = element_line(linetype = "dotted",size = 1))

Moran’s I

Moran’s I is a measure that assesses spatial autocorrelation: specifically, whether values are spatially clustered, random, or dispersed. The Moran’s I of our errors is calculated below and compared with the frequency distribution of 999 randomly permuted I values.

An I approaching -1 indicates dispersal, while an I approaching 1 indicates clustering; a random spatial distribution yields an I near 0. Because the permutations destroy any spatial structure, the permuted I values cluster tightly around zero. The observed Moran’s I is higher than all 999 permuted values, confirming the presence of spatial autocorrelation in our errors.
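The Moran’s I chunk itself is not shown here; a minimal sketch using spdep’s permutation test on the spatial weights built above:

# Permutation test: observed Moran's I of errors vs. 999 random permutations
moranTest <- moran.mc(train.Data.2$SalePrice.Error,
                      spatialWeights.test, nsim = 999)

ggplot(data.frame(I = moranTest$res[1:999]), aes(I)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic),
             colour = "#FA7800", size = 1) +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Observed Moran's I in orange",
       x = "Moran's I", y = "Count")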

Predicted Results by Municipality

The map below shows predicted values for the entire dataset overlaid on the municipalities. According to the model, most municipalities contain houses across a range of predicted prices. The City of Boulder is more segregated by home price, with higher-priced houses in the north and comparatively lower-priced houses in the south.

We used tracts to summarize our absolute percentage error and examine spatial patterns across Boulder County. As shown in Fig. 11, some tracts lack data points and are shown in gray. There is no clear, visually apparent pattern of error across tracts, other than that our model struggled more in the Longmont area. In our graphical analysis of census tracts in Fig. 11, there is also no evident correlation between mean absolute percentage error and mean housing prices. This suggests that census tracts do not create generalized error.
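The chunk behind Fig. 11 is not shown here; a minimal reconstruction that summarizes test-set MAPE by tract (object names assumed from the rest of this report):

# Mean absolute percentage error of the test set, aggregated to tracts;
# tracts with no test-set sales join as NA and render gray
tract.MAPE <-
  st_drop_geometry(train.Data.2) %>%
  group_by(GEOID) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = TRUE))

ggplot() +
  geom_sf(data = left_join(tracts19, tract.MAPE, by = "GEOID"),
          aes(fill = mean.MAPE)) +
  scale_fill_viridis(name = "Mean APE") +
  labs(title = "Mean Absolute Percentage Error by Census Tract",
       caption = "Fig. 11") +
  mapTheme()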

If we consider census tracts as neighborhoods, we can account for neighborhood effects in our regression model. Using this logic, we add census tract GEOID as one of our indicators along with the previously included variables.

# Create a linear model for neighborhood analysis
reg.nhood <- lm(price ~ ., data = as.data.frame(train.Data.1) %>% 
                                 select(price, GEOID,
                 section_num,
                 qualityCode,
                 TotalFinishedSF,
                 Ac,
                 Heating,
                 Age,
                 med_inc,
                 landmark_dist,
                 nbrRoomsNobath,
                 pvty_pop,
                 privat_dist,
                 head_nn5,
                 park_nn1,
                 Loui_dummy,
                 Nede_dummy,
                 Boul_dummy,
                 Erie_dummy,
                 Long_dummy,
                 Lyon_dummy,
                 pgcount500m,
                 TotalBath,
                 flood_dist))

# Score the test set with the neighborhood-effects model
boulder.test.nhood <-
  train.Data.2 %>%
  mutate(Regression = "Neighborhood Effects",
         SalePrice.Predict = predict(reg.nhood, train.Data.2),
         SalePrice.Error = SalePrice.Predict- price,
         SalePrice.AbsError = abs(SalePrice.Predict- price),
         SalePrice.APE = (abs(SalePrice.Predict- price)) / price)

Table 7 compares error between the baseline regression and a version of that regression that adds neighborhood effects. The neighborhood-effects model reduces mean absolute error from roughly $193,000 to $166,000 and MAPE from about 35% to 27%.

Fig. 13 plots the relationship between predicted and observed prices for the houses in the test set. The orange line represents a perfect prediction, in which predicted values exactly match observed values; the green lines represent the fitted regressions of the two models. The two panels have similar shapes, though the neighborhood-effects version tracks the perfect-prediction line somewhat more closely.

# Combine regular regression dataframe with the version of the regression that considers neighborhood effects
bothRegressions <- 
  rbind(select(train.Data.2, starts_with("SalePrice"), Regression, GEOID,price) %>%
 mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
select(boulder.test.nhood, starts_with("SalePrice"), Regression, GEOID,price) %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))

# Table output of absolute error and sale price
st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -GEOID) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable(caption = "Table 7")%>% kable_styling()
Table 7

| Regression | SalePrice.AbsError | SalePrice.APE |
|---|---|---|
| Baseline Regression | 193241.2 | 0.3463888 |
| Neighborhood Effects | 165988.6 | 0.2669913 |
# Plot baseline regression and sale price
bothRegressions %>%
  select(SalePrice.Predict, price, Regression) %>%
    ggplot(aes(price, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(SalePrice.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction", caption = "Fig. 13") +
  plotTheme()

The maps of Fig. 14 present geographic representations of demographics in Boulder County. These maps provide a framework for understanding how demographic variables might affect sale price or produce clustering. Every tract in Boulder County is majority white based on 2019 ACS data. For our map of median income across Boulder County census tracts, tracts are designated as “high income” (median income above the county median of $40,450) or “low income” (below it).

# Calculate percent white and median income for census tracts
tracts19.2 <- 
  tracts19%>%
  mutate(percentWhite = white_pop / tot_pop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(med_inc > 40450, "High Income", "Low Income"))

# Create two tract-level, boulder county maps, one for race and one for income 
grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(tracts19.2), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context", caption = "Fig. 14") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(tracts19.2), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

We examine income as a potential neighborhood effect. Table 8 shows a higher MAPE in low-income tracts than in high-income tracts under the baseline model (41% versus 33%). Adding neighborhood effects narrows that gap to 28% versus 26%, indicating that neighborhood effects account for some of the clustering of housing prices, particularly in low-income areas.

# Calculate and compare MAPE for low income and high income census tracts
st_join(bothRegressions, tracts19.2) %>% 
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Table 8. Test set MAPE by neighborhood racial context") %>%
  kable_styling()
Table 8. Test set MAPE by neighborhood income context

| Regression | High Income | Low Income |
|---|---|---|
| Baseline Regression | 33% | 41% |
| Neighborhood Effects | 26% | 28% |

Additional Discussion

The model we created was certainly improved by the addition of geographic features. While many variables from the given dataset were also central to the model’s predictions, variables such as distance to the nearest park and the mean distance to the five closest trail heads performed well too. For more information on individual variable performance, see the tables below.

Statistic Summary: Internal Characteristic Variables

| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| Section.Num | 11,364 | 1 | 0 | 1 | 1 | 1 | 9 |
| Quality.Code | 11,364 | 37 | 8 | 10 | 30 | 41 | 80 |
| TotalFinishedSF | 11,364 | 1,955 | 893 | 0 | 1,276 | 2,443.2 | 10,188 |
| Age | 11,364 | 34 | 27 | 0 | 15 | 49 | 161 |
| Ac | 7,488 | 210 | 1 | 200 | 210 | 210 | 220 |
| Heating | 11,260 | 814 | 15 | 800 | 810 | 810 | 1,000 |
| Total.Bath | 11,364 | 3 | 1 | 0 | 2 | 4 | 12 |
| Rooms.No.Bath | 11,364 | 7 | 2 | 0 | 6 | 9 | 28 |

Section.Num: Building section number by use
Quality.Code: Quality as determined by the county appraisal staff
TotalFinishedSF: Total number of finished square feet
Age: Age of the structure
Ac: Air conditioner type (lower N due to homes without AC units)
Heating: Heating system type (lower N due to homes without heating units)
Total.Bath: Number of bathrooms
Rooms.No.Bath: Number of rooms that are not bathrooms
Statistic Summary: Amenities

| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| LandMark.Dist | 11,364 | 4,632 | 2,654 | 0 | 2,389 | 6,407 | 11,945 |
| Private.Dist | 11,364 | 14,553 | 8,206 | 89 | 7,767 | 20,700 | 48,318 |
| Trailhead.Dist | 11,364 | 6,101 | 2,166 | 1,157 | 4,933 | 7,190 | 22,566 |
| Park.Dist | 11,364 | 3,243 | 2,031 | 94 | 1,731 | 4,620 | 17,427 |
| PGCount.500m | 11,364 | 0 | 0 | 0 | 0 | 0 | 3 |
| Flood.Dist | 11,364 | 2,062 | 1,816 | 0 | 534 | 3,168 | 9,501 |

Unit: meters. Source: Boulder County Open Data Portal.
LandMark.Dist: Distance to the nearest landmark
Private.Dist: Distance to the nearest private school
Trailhead.Dist: Distance to the nearest trail head
Park.Dist: Distance to the nearest park
PGCount.500m: Playground count within a 500 m radius of a sold house
Flood.Dist: Distance to the nearest floodplain
Statistic Summary: Spatial Structure

| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| Med.Income | 11,364 | 44,615 | 10,070 | 4,513 | 38,949 | 52,241 | 63,205 |
| White.Pop | 11,364 | 5,511 | 2,713 | 832 | 3,777 | 6,471 | 11,991 |
| Pop.Den | 11,364 | 1,049 | 975 | 2 | 317 | 1,636 | 8,239 |
| Pvty.Pop | 11,364 | 483 | 393 | 40 | 227 | 724 | 3,202 |

Source: 2019 5-year American Community Survey; data collected by census tract.
Med.Income: Median family income
White.Pop: Population identified as ‘white’
Pop.Den: Population density by census tract
Pvty.Pop: Population living in poverty

Some variables are categorical municipality dummies: a ‘1’ indicates the sold house lies within the municipality and a ‘0’ that it does not. These are Loui_dummy, Ward_dummy, Jame_dummy, Nede_dummy, Boul_dummy, Erie_dummy, Lafa_dummy, Long_dummy, Lyon_dummy, and Supe_dummy, representing Louisville, Ward, Jamestown, Nederland, Boulder, Erie, Lafayette, Longmont, Lyons, and Superior, respectively.

Conclusion

Throughout the cyclical iterations of the project’s development, our team faced difficult decisions about which variables to accept or reject. Ultimately, a mean absolute percentage error (MAPE) of around 30% allows for a modest level of confidence in our model. Geographically, the City of Boulder had the highest concentration of errors, which may stem from higher-density development or the unique real estate environment created by the presence of a large university in the area. The other municipalities generally had more accurate predictions. Overall, the model we created was certainly improved by the addition of geographic features. Given more research and an enhanced qualitative context, we believe the model has room for improvement, but our team would still recommend our own model to Zillow as a solid starting point.