Training on Historical Data and Testing on 2022 Data
3.1 Introduction
To better understand the impacts of orange line shutdown and the free bluebike program on bikeshare ridership, we first develop a model trained by hiostorical data of 2021 ridership only and test this model on 2022 to see what regions would have highest percentage errors. This modeling method can test our hypothesis that the rise in ridership in the summer of 2022 is related to orange line shutdown from a geospatial persoective and to free bluebike program from an equity point of view. Therefore, we’re specifically looking for the spatial distribution of predictive errors in areas near orange line stations and in low-income neighborhoods.
To develop the model using histroical data, we first split the 2021 ridership data into the train and test datasets and we use the log values for modeling. We also use transformers including SimpleImputer to fill missing values with the median and to standardize values. We then use random forest with a 5-fold cross validation to find the best parameter with 200 estimators and the depth of 0.
3.2 Accuracy
We then change the test data to the 2022 ridership and use the previous model to predict ridership in 2022. Compared to the percentage errors for 2021 ridership, the generalizability of the model is not so good when we directly apply the model to 2022 data. The test score decreases from 0.88 to 0.8.
3.3 Errors in 2021
We first test the model on the 0.3 split of 2021 data, and receive a test score of 0.88. This score is good enough for us to proceed with this model. We then calculate the percentage error of each station and aggregate errors to census tracts. The plot below shows the choropleth map based on a quintitle category of percentage errors.
3.4 Erros in 2022
The map below shows the spatial distribution of percentage errors for 2022 ridership data. Compared to the previous map, the percentage errors for 2022 ridership increase along the orange line especially in the south-west region which also has a higher concentration of low-income communities. This model supports our hypothesis that the increase of bikeshare ridership in 2022 is related to the orange line shutdown and free bluebike program.