Pronto! is Seattle’s bike sharing program, which launched in fall 2014. You may have seen the green bike docks around campus. It was also in the news frequently and eventually shut down.
You will be using data from the 2015 Pronto Cycle Share Data Challenge. These are available for download as a 75 MB ZIP file from https://s3.amazonaws.com/pronto-data/open_data_year_one.zip. Once unzipped, the folder containing all the files is around 900 MB. The open_data_year_one folder contains a README.txt file that you should reference for documentation. Place the open_data_year_one folder in the same directory as this template.
Questions for you to answer appear as quoted blocks of text. Put the code you use to address each question, along with any comments you have, below its block. Remember the guiding DRY principle: Don’t Repeat Yourself!
The next section asks you to load the data with a loop. As an advanced alternative, you may skip this and load the data using a vectorized method like lapply().
Make sure the open_data_year_one folder is in the same directory as this template. Use the list.files() function on the open_data_year_one folder to return a character vector giving all the files in that folder, and store it to an object called files_in_year_one. Then use subsetting on files_in_year_one to remove the entries for README.txt (which isn’t data) and for 2015_status_data.csv (which is massive and doesn’t have interesting information, so we’re going to exclude it). Thus, files_in_year_one should be a character vector with three entries.
library(stringr)
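# Option 1: drop entries by position (2 is 2015_status_data.csv, 5 is README.txt)
files_in_year_one <- list.files("open_data_year_one")[-c(2,5)]
# Option 2: keep only the file names that match one of these patterns
files_in_year_one <- str_subset(list.files("open_data_year_one"),
                                "station|trip|weather")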
We want to read the remaining .csv files into data frames stored in a list called data_list. Preallocate this using data_list <- vector("list", length(files_in_year_one)).
data_list <- vector("list", length(files_in_year_one))
We would like the names of the list entries to be simpler than the file names. For example, we want to read the 2015_station_data.csv file into data_list[["station_data"]], and 2015_trip_data.csv into data_list[["trip_data"]]. So, you should make a new vector called data_list_names, built from files_in_year_one, that gives the names of the list entries these CSV files will be read into. Use the str_sub() function in library(stringr) to keep the portion of the files_in_year_one entries starting from the sixth character (which will drop the 2015_ part) and stopping at the number of characters in each filename string minus 4 (which will drop the .csv part). Remember to load stringr with library() and use ?str_sub in the console to get help on the function.
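To make the two ways of giving the end position concrete, here is a small sketch on a single made-up file name (the object name example_file is hypothetical). It suggests that stopping at the string length minus 4 and stopping at position -5, counted from the end, select the same characters.

```r
library(stringr)

# Hypothetical file name that follows the same pattern as the data files
example_file <- "2015_trip_data.csv"

# End position given as the string length minus 4
by_length <- str_sub(example_file, 6, str_length(example_file) - 4)

# End position given as a negative index, counted from the end of the string
by_negative <- str_sub(example_file, 6, -5)

by_length
by_negative
```

Both calls should keep only the part between the 2015_ prefix and the .csv extension.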
data_list_names <- str_sub(files_in_year_one, 6, -5)
data_list_names <- str_remove_all(files_in_year_one, "2015_|\\.csv")
Set the names for data_list using the names() function and the data_list_names vector.
names(data_list) <- data_list_names
Then, write a for() loop that uses read_csv() from the readr package to read in each of the CSV files listed in files_in_year_one, iterating with seq_along() on that vector. Store each of these files to its corresponding entry in data_list. The data download demo might be a helpful reference.
You will want to use the cache=TRUE chunk option for this chunk; otherwise you’ll have to wait for the data to get read in every single time you knit. You will also want to make sure you are using read_csv() in the readr package and not base R’s read.csv(), as readr’s version is faster, gives you a progress bar, and won’t convert all character variables to factors automatically.
library(readr)
for (i in seq_along(files_in_year_one)) {
  data_list[[i]] <- read_csv(paste0("open_data_year_one/", files_in_year_one[i]))
}
##
## -- Column specification --------------------------------------------------------
## cols(
## id = col_double(),
## name = col_character(),
## terminal = col_character(),
## lat = col_double(),
## long = col_double(),
## dockcount = col_double(),
## online = col_character()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
## trip_id = col_double(),
## starttime = col_character(),
## stoptime = col_character(),
## bikeid = col_character(),
## tripduration = col_double(),
## from_station_name = col_character(),
## to_station_name = col_character(),
## from_station_id = col_character(),
## to_station_id = col_character(),
## usertype = col_character(),
## gender = col_character(),
## birthyear = col_double()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## Date = col_character(),
## Max_Gust_Speed_MPH = col_character(),
## Events = col_character()
## )
## i Use `spec()` for the full column specifications.
str(data_list)
## List of 3
## $ station_data: spec_tbl_df[,7] [54 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## ..$ id : num [1:54] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ name : chr [1:54] "3rd Ave & Broad St" "2nd Ave & Vine St" "6th Ave & Blanchard St" "2nd Ave & Blanchard St" ...
## ..$ terminal : chr [1:54] "BT-01" "BT-03" "BT-04" "BT-05" ...
## ..$ lat : num [1:54] 47.6 47.6 47.6 47.6 47.6 ...
## ..$ long : num [1:54] -122 -122 -122 -122 -122 ...
## ..$ dockcount: num [1:54] 18 16 16 14 18 20 18 20 20 16 ...
## ..$ online : chr [1:54] "10/13/2014" "10/13/2014" "10/13/2014" "10/13/2014" ...
## ..- attr(*, "spec")=
## .. .. cols(
## .. .. id = col_double(),
## .. .. name = col_character(),
## .. .. terminal = col_character(),
## .. .. lat = col_double(),
## .. .. long = col_double(),
## .. .. dockcount = col_double(),
## .. .. online = col_character()
## .. .. )
## $ trip_data : spec_tbl_df[,12] [142,846 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## ..$ trip_id : num [1:142846] 431 432 433 434 435 436 437 438 439 440 ...
## ..$ starttime : chr [1:142846] "10/13/2014 10:31" "10/13/2014 10:32" "10/13/2014 10:33" "10/13/2014 10:34" ...
## ..$ stoptime : chr [1:142846] "10/13/2014 10:48" "10/13/2014 10:48" "10/13/2014 10:48" "10/13/2014 10:48" ...
## ..$ bikeid : chr [1:142846] "SEA00298" "SEA00195" "SEA00486" "SEA00333" ...
## ..$ tripduration : num [1:142846] 986 926 884 866 924 ...
## ..$ from_station_name: chr [1:142846] "2nd Ave & Spring St" "2nd Ave & Spring St" "2nd Ave & Spring St" "2nd Ave & Spring St" ...
## ..$ to_station_name : chr [1:142846] "Occidental Park / Occidental Ave S & S Washington St" "Occidental Park / Occidental Ave S & S Washington St" "Occidental Park / Occidental Ave S & S Washington St" "Occidental Park / Occidental Ave S & S Washington St" ...
## ..$ from_station_id : chr [1:142846] "CBD-06" "CBD-06" "CBD-06" "CBD-06" ...
## ..$ to_station_id : chr [1:142846] "PS-04" "PS-04" "PS-04" "PS-04" ...
## ..$ usertype : chr [1:142846] "Annual Member" "Annual Member" "Annual Member" "Annual Member" ...
## ..$ gender : chr [1:142846] "Male" "Male" "Female" "Female" ...
## ..$ birthyear : num [1:142846] 1960 1970 1988 1977 1971 ...
## ..- attr(*, "spec")=
## .. .. cols(
## .. .. trip_id = col_double(),
## .. .. starttime = col_character(),
## .. .. stoptime = col_character(),
## .. .. bikeid = col_character(),
## .. .. tripduration = col_double(),
## .. .. from_station_name = col_character(),
## .. .. to_station_name = col_character(),
## .. .. from_station_id = col_character(),
## .. .. to_station_id = col_character(),
## .. .. usertype = col_character(),
## .. .. gender = col_character(),
## .. .. birthyear = col_double()
## .. .. )
## $ weather_data: spec_tbl_df[,21] [366 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## ..$ Date : chr [1:366] "10/13/2014" "10/14/2014" "10/15/2014" "10/16/2014" ...
## ..$ Max_Temperature_F : num [1:366] 71 63 62 71 64 68 73 66 64 60 ...
## ..$ Mean_Temperature_F : num [1:366] 62 59 58 61 60 64 64 60 58 58 ...
## ..$ Min_TemperatureF : num [1:366] 54 55 54 52 57 59 55 55 55 57 ...
## ..$ Max_Dew_Point_F : num [1:366] 55 52 53 49 55 59 57 57 52 55 ...
## ..$ MeanDew_Point_F : num [1:366] 51 51 50 46 51 57 55 54 49 53 ...
## ..$ Min_Dewpoint_F : num [1:366] 46 50 46 42 41 55 53 50 46 48 ...
## ..$ Max_Humidity : num [1:366] 87 88 87 83 87 90 94 90 87 88 ...
## ..$ Mean_Humidity : num [1:366] 68 78 77 61 72 83 74 78 70 81 ...
## ..$ Min_Humidity : num [1:366] 46 63 67 36 46 68 52 67 58 67 ...
## ..$ Max_Sea_Level_Pressure_In : num [1:366] 30 29.8 30 30 29.8 ...
## ..$ Mean_Sea_Level_Pressure_In: num [1:366] 29.8 29.8 29.7 29.9 29.8 ...
## ..$ Min_Sea_Level_Pressure_In : num [1:366] 29.6 29.5 29.5 29.8 29.7 ...
## ..$ Max_Visibility_Miles : num [1:366] 10 10 10 10 10 10 10 10 10 10 ...
## ..$ Mean_Visibility_Miles : num [1:366] 10 9 9 10 10 8 10 10 10 6 ...
## ..$ Min_Visibility_Miles : num [1:366] 4 3 3 10 6 2 6 5 6 2 ...
## ..$ Max_Wind_Speed_MPH : num [1:366] 13 10 18 9 8 10 10 12 15 14 ...
## ..$ Mean_Wind_Speed_MPH : num [1:366] 4 5 7 4 3 4 3 5 8 8 ...
## ..$ Max_Gust_Speed_MPH : chr [1:366] "21" "17" "25" "-" ...
## ..$ Precipitation_In : num [1:366] 0 0.11 0.45 0 0.14 0.31 0 0.44 0.1 1.43 ...
## ..$ Events : chr [1:366] "Rain" "Rain" "Rain" "Rain" ...
## ..- attr(*, "spec")=
## .. .. cols(
## .. .. Date = col_character(),
## .. .. Max_Temperature_F = col_double(),
## .. .. Mean_Temperature_F = col_double(),
## .. .. Min_TemperatureF = col_double(),
## .. .. Max_Dew_Point_F = col_double(),
## .. .. MeanDew_Point_F = col_double(),
## .. .. Min_Dewpoint_F = col_double(),
## .. .. Max_Humidity = col_double(),
## .. .. Mean_Humidity = col_double(),
## .. .. Min_Humidity = col_double(),
## .. .. Max_Sea_Level_Pressure_In = col_double(),
## .. .. Mean_Sea_Level_Pressure_In = col_double(),
## .. .. Min_Sea_Level_Pressure_In = col_double(),
## .. .. Max_Visibility_Miles = col_double(),
## .. .. Mean_Visibility_Miles = col_double(),
## .. .. Min_Visibility_Miles = col_double(),
## .. .. Max_Wind_Speed_MPH = col_double(),
## .. .. Mean_Wind_Speed_MPH = col_double(),
## .. .. Max_Gust_Speed_MPH = col_character(),
## .. .. Precipitation_In = col_double(),
## .. .. Events = col_character()
## .. .. )
file_paths <- paste0("open_data_year_one/", files_in_year_one)
data_list <- lapply(file_paths, read_csv)
##
## -- Column specification --------------------------------------------------------
## cols(
## id = col_double(),
## name = col_character(),
## terminal = col_character(),
## lat = col_double(),
## long = col_double(),
## dockcount = col_double(),
## online = col_character()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
## trip_id = col_double(),
## starttime = col_character(),
## stoptime = col_character(),
## bikeid = col_character(),
## tripduration = col_double(),
## from_station_name = col_character(),
## to_station_name = col_character(),
## from_station_id = col_character(),
## to_station_id = col_character(),
## usertype = col_character(),
## gender = col_character(),
## birthyear = col_double()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## Date = col_character(),
## Max_Gust_Speed_MPH = col_character(),
## Events = col_character()
## )
## i Use `spec()` for the full column specifications.
names(data_list) <- data_list_names
Run str() on data_list and look at how the variables came in from using read_csv(). Most should be okay, but some of the dates and times may be stored as character rather than dates or POSIXct date-time values. We also have lots of missing values for gender in the trip data because users who are not annual members do not report gender.
First, patch up the missing values for gender in data_list[["trip_data"]]: if a user is a Short-Term Pass Holder, then put "Unknown" as their gender. Don’t make new objects, but rather modify the entries in data_list directly (e.g. data_list[["trip_data"]] <- data_list[["trip_data"]] %>% mutate(...)).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_list[["trip_data"]] %>%
  count(gender, usertype)
## # A tibble: 4 x 3
## gender usertype n
## <chr> <chr> <int>
## 1 Female Annual Member 18245
## 2 Male Annual Member 67608
## 3 Other Annual Member 1507
## 4 <NA> Short-Term Pass Holder 55486
Now, use dplyr’s mutate_at() and/or mutate(), functions from the lubridate package, and the factor() function to (1) fix any dates/times and (2) convert the usertype and gender variables in the trip data to factors. Don’t make new objects, but rather modify the entries in data_list directly.
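As a rough illustration of the two conversions on a made-up three-row data frame (toy_trips and its values are invented, not taken from the Pronto files), across() inside mutate() can apply a lubridate parser to several columns and factor() to several others. Treat it as a sketch of the pattern rather than the expected answer.

```r
library(dplyr)
library(lubridate)

# Invented rows that mimic the character columns read in by read_csv()
toy_trips <- tibble(
  starttime = c("10/13/2014 10:31", "10/13/2014 10:32", "10/14/2014 9:05"),
  stoptime  = c("10/13/2014 10:48", "10/13/2014 10:50", "10/14/2014 9:20"),
  usertype  = c("Annual Member", "Short-Term Pass Holder", "Annual Member"),
  gender    = c("Male", "Unknown", "Female")
)

toy_trips <- toy_trips %>%
  mutate(
    # Parse month/day/year hour:minute strings into date-times
    across(c(starttime, stoptime), mdy_hm),
    # Convert the categorical columns to factors
    across(c(usertype, gender), factor)
  )

str(toy_trips)
```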
data_list[["trip_data"]] <- data_list[["trip_data"]] %>%
  mutate(gender = ifelse(usertype == "Short-Term Pass Holder", "Unknown", gender))
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
data_list[["trip_data"]] <- data_list[["trip_data"]] %>%
  mutate(across(c(starttime, stoptime), mdy_hm))
data_list[["trip_data"]] %>% glimpse()
## Rows: 142,846
## Columns: 12
## $ trip_id <dbl> 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 44~
## $ starttime <dttm> 2014-10-13 10:31:00, 2014-10-13 10:32:00, 2014-10-1~
## $ stoptime <dttm> 2014-10-13 10:48:00, 2014-10-13 10:48:00, 2014-10-1~
## $ bikeid <chr> "SEA00298", "SEA00195", "SEA00486", "SEA00333", "SEA~
## $ tripduration <dbl> 985.935, 926.375, 883.831, 865.937, 923.923, 808.805~
## $ from_station_name <chr> "2nd Ave & Spring St", "2nd Ave & Spring St", "2nd A~
## $ to_station_name <chr> "Occidental Park / Occidental Ave S & S Washington S~
## $ from_station_id <chr> "CBD-06", "CBD-06", "CBD-06", "CBD-06", "CBD-06", "C~
## $ to_station_id <chr> "PS-04", "PS-04", "PS-04", "PS-04", "PS-04", "PS-04"~
## $ usertype <chr> "Annual Member", "Annual Member", "Annual Member", "~
## $ gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", ~
## $ birthyear <dbl> 1960, 1970, 1988, 1977, 1971, 1974, 1978, 1983, 1974~
data_list[["weather_data"]] <- data_list[["weather_data"]] %>%
  mutate(Date = mdy(Date))
data_list[["weather_data"]] %>% glimpse()
## Rows: 366
## Columns: 21
## $ Date <date> 2014-10-13, 2014-10-14, 2014-10-15, 2014-1~
## $ Max_Temperature_F <dbl> 71, 63, 62, 71, 64, 68, 73, 66, 64, 60, 62,~
## $ Mean_Temperature_F <dbl> 62, 59, 58, 61, 60, 64, 64, 60, 58, 58, 55,~
## $ Min_TemperatureF <dbl> 54, 55, 54, 52, 57, 59, 55, 55, 55, 57, 50,~
## $ Max_Dew_Point_F <dbl> 55, 52, 53, 49, 55, 59, 57, 57, 52, 55, 49,~
## $ MeanDew_Point_F <dbl> 51, 51, 50, 46, 51, 57, 55, 54, 49, 53, 47,~
## $ Min_Dewpoint_F <dbl> 46, 50, 46, 42, 41, 55, 53, 50, 46, 48, 44,~
## $ Max_Humidity <dbl> 87, 88, 87, 83, 87, 90, 94, 90, 87, 88, 86,~
## $ Mean_Humidity <dbl> 68, 78, 77, 61, 72, 83, 74, 78, 70, 81, 76,~
## $ Min_Humidity <dbl> 46, 63, 67, 36, 46, 68, 52, 67, 58, 67, 62,~
## $ Max_Sea_Level_Pressure_In <dbl> 30.03, 29.84, 29.98, 30.03, 29.83, 29.96, 2~
## $ Mean_Sea_Level_Pressure_In <dbl> 29.79, 29.75, 29.71, 29.95, 29.78, 29.90, 2~
## $ Min_Sea_Level_Pressure_In <dbl> 29.65, 29.54, 29.51, 29.81, 29.73, 29.80, 2~
## $ Max_Visibility_Miles <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ Mean_Visibility_Miles <dbl> 10, 9, 9, 10, 10, 8, 10, 10, 10, 6, 10, 10,~
## $ Min_Visibility_Miles <dbl> 4, 3, 3, 10, 6, 2, 6, 5, 6, 2, 10, 8, 6, 10~
## $ Max_Wind_Speed_MPH <dbl> 13, 10, 18, 9, 8, 10, 10, 12, 15, 14, 15, 8~
## $ Mean_Wind_Speed_MPH <dbl> 4, 5, 7, 4, 3, 4, 3, 5, 8, 8, 9, 4, 6, 12, ~
## $ Max_Gust_Speed_MPH <chr> "21", "17", "25", "-", "-", "-", "18", "-",~
## $ Precipitation_In <dbl> 0.00, 0.11, 0.45, 0.00, 0.14, 0.31, 0.00, 0~
## $ Events <chr> "Rain", "Rain", "Rain", "Rain", "Rain", "Ra~
The terminal, to_station_id, and from_station_id columns in data_list[["station_data"]] and data_list[["trip_data"]] have a two or three character code followed by a hyphen and a numeric code. These character codes convey the broad geographic region of the stations (e.g. CBD is Central Business District, PS is Pioneer Square, ID is International District). Write a function called region_extract() that can extract these region codes by taking a character vector as input and returning another character vector as output that just has these initial character codes. For example, if I run region_extract(x = c("CBD-11", "ID-01")), it should give me as output a character vector with first entry "CBD" and second entry "ID" (i.e. "CBD" "ID").
Note: if you cannot get this working and need to move on with your life, try writing your function to just take the first two characters using str_sub() and use that.
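One possible way to think about the extraction, sketched under the assumption that every code has the form "letters, hyphen, digits": a regular expression can keep everything before the first hyphen. The helper name prefix_before_hyphen() is hypothetical and is only meant to show the pattern, not to stand in for the required function.

```r
library(stringr)

# Hypothetical helper: keep everything before the first hyphen
prefix_before_hyphen <- function(x) {
  # "^[^-]+" matches one or more non-hyphen characters at the start of the string
  str_extract(x, "^[^-]+")
}

prefix_before_hyphen(c("CBD-11", "ID-01"))
```

An entry with no hyphen would come back unchanged under this pattern, so it may be worth checking the real columns for IDs that do not follow the usual form.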
# YOUR WORK HERE
Then on data_list[["station_data"]] and data_list[["trip_data"]], make new columns called terminal_region, to_station_region, and from_station_region using your region_extract() function.
# YOUR WORK HERE
The Events column in data_list[["weather_data"]] mentions if there was rain, thunderstorms, fog, etc. On some days you can see multiple weather events. Add a column to this data frame called Rain that takes the value "Rain" if there was rain, and "No rain" otherwise. You will need to use some string parsing since "Rain" is not always at the beginning of the string. The function str_detect() may be useful here, but again, if you are running short on time, just look for "Rain" at the beginning using str_sub() as a working but imperfect approach. Then convert the Rain variable to a factor.
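The sketch below shows how str_detect() behaves on a few made-up event strings (toy_events is invented; the real Events column may contain other values). One detail worth noticing is that str_detect() returns NA for missing input, so this sketch assumes a missing entry means no rain and handles it explicitly.

```r
library(stringr)

# Made-up event strings, including one where "Rain" is not at the start
toy_events <- c("Rain", "Fog , Rain", "Fog", NA)

# str_detect() gives NA for missing input, so missing entries are
# treated as "no rain" here (an assumption about what NA means)
has_rain <- !is.na(toy_events) & str_detect(toy_events, "Rain")

# Build the labels and convert them to a factor with a fixed level order
toy_rain <- factor(ifelse(has_rain, "Rain", "No rain"),
                   levels = c("No rain", "Rain"))
toy_rain
```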
# YOUR WORK HERE
You have bike station region information now, and rainy weather information. Make a new data frame called trips_weather that joins data_list[["trip_data"]] with data_list[["weather_data"]] by trip start date so that the Rain column is added to the trip-level data (just the Rain column please, none of the rest of the weather info). You may need to do some date manipulation and extraction as seen in Week 5 slides to get a date variable from the starttime column that you can use in merging.
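A join of this kind might look like the following on two small invented tables (toy_trips, toy_weather, and the start_date column name are all hypothetical, and the rain values are made up). The idea is to derive a date from the date-time, keep only the key and the Rain column from the weather table, and then join on the date.

```r
library(dplyr)
library(lubridate)

# Invented trip-level rows with a date-time column
toy_trips <- tibble(
  trip_id   = 1:3,
  starttime = ymd_hm(c("2015-01-01 10:31", "2015-01-01 17:02", "2015-01-02 08:15"))
)

# Invented day-level weather rows with an extra column we do not want to carry over
toy_weather <- tibble(
  Date = ymd(c("2015-01-01", "2015-01-02")),
  Max_Temperature_F = c(45, 48),
  Rain = factor(c("Rain", "No rain"), levels = c("No rain", "Rain"))
)

toy_joined <- toy_trips %>%
  # Drop the time of day so trips can be matched to daily weather
  mutate(start_date = as_date(starttime)) %>%
  # Keep only the join key and Rain from the weather table
  left_join(select(toy_weather, Date, Rain), by = c("start_date" = "Date"))

toy_joined
```

A left join keeps every trip even if a day has no weather record, which seems like the safer default for trip-level data.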
# YOUR WORK HERE
Now for the grand finale. Write a function daily_rain_rides() that takes as input:
- region_code: a region code (e.g. "CBD", "UW")
- direction: indicates whether we are thinking of trips "from" or "to" a region
and inside the function does the following:
- Filters the data to trips that came from stations with that region code or went to stations with that region code (depending on the values of direction and region_code). For example, if I say region_code = "BT" (for Belltown) and direction = "from", then I want to keep rows for trips whose from_station_region is equal to "BT".
- Makes a data frame called temp_df with one row per day counting how many trips were in region_code going direction. This should have columns for trip starting date, how many trips there were that day, and whether there was rain or not that day. You’ll need to use group_by() and summarize().
- Uses temp_df to make a ggplot scatterplot (geom_point) with trip starting date on the horizontal axis, number of trips on the vertical axis, and points colored "black" for days with no rain and "deepskyblue" for days with rain. Make sure the legend is clear and that the x axis is easy to understand without being overly labeled (control this with scale_x_date). The title of the plot should be customized to say which region code is shown and which direction is analyzed (e.g. “Daily rides going to SLU”) using paste0(). Feel free to use whatever theming you like on the plot or other tweaks to make it look great.
- Returns the ggplot object with all its layers.
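The individual pieces described above can be tried out on a tiny invented data set before wrapping them in a function. In this sketch toy_trips, toy_daily, and the start_date column are hypothetical names, and the values are made up; it shows one way to pick the region column from direction with paste0(), count trips per day, and set the point colors and date axis.

```r
library(dplyr)
library(ggplot2)

# Made-up trip-level rows; real data would come from trips_weather
toy_trips <- tibble(
  start_date = as.Date("2015-01-01") + c(0, 0, 1, 31, 31, 31),
  from_station_region = c("BT", "BT", "BT", "BT", "CBD", "BT"),
  Rain = factor(c("Rain", "Rain", "No rain", "Rain", "Rain", "Rain"),
                levels = c("No rain", "Rain"))
)

direction <- "from"
# Build the name of the column to filter on from the direction
region_col <- paste0(direction, "_station_region")

toy_daily <- toy_trips %>%
  # .data[[...]] looks up a column by its name stored in a string
  filter(.data[[region_col]] == "BT") %>%
  group_by(start_date, Rain) %>%
  summarize(n_trips = n(), .groups = "drop")

toy_plot <- ggplot(toy_daily, aes(x = start_date, y = n_trips, color = Rain)) +
  geom_point() +
  # Named values tie each factor level to a specific color
  scale_color_manual(values = c("No rain" = "black", "Rain" = "deepskyblue")) +
  # One labeled break per month keeps the axis readable
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  ggtitle(paste0("Daily rides going ", direction, " BT"))

toy_plot
```

Grouping by both the date and Rain keeps the rain indicator in the summarized data, since each day has a single rain value.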
I have created a skeleton for this function. Fill in your code as needed and remember to set eval=TRUE in the chunk after you’ve written the function.
daily_rain_rides <- function(region_code, direction) {
  # WRITE YOUR WORK HERE
}
Then, test out your function: make three plots using daily_rain_rides(), trying out different values of the region code and direction to show it works.
# YOUR WORK HERE