Instructions

Fill in this lab worksheet at your own pace. Knit it periodically to check that things are working the same way they are when you are working in RStudio interactively. Ask questions, consult with others, use Google, etc. At the end of the class session, email what you have to yourself so you don’t lose progress. You do not need to turn in your work, but feedback will probided if you as Homework 7 on Canvas (both .Rmd and .html files). These will be evaluated by Chuck rather than peer reviewed.

Background

Last week we saw data from health inspections of restaurants in Seattle since 2012 and used them to practice working with character/string data and regular expressions. Load in the data directly from the URL (this will work because a CSV is just a text file) and use cache=TRUE so that we don’t have to repeat this each time we re-knit:

restaurants <- read_csv("https://clanfear.github.io/CSSS508/Lectures/Week8/restaurants.csv",  col_types = "ccccccccnnccicccciccciD")

As a reminder of what these data look like:

str(restaurants)
## Classes 'tbl_df', 'tbl' and 'data.frame':    258630 obs. of  23 variables:
##  $ Name                      : chr  "@ THE SHACK, LLC" "10 MERCER RESTAURANT" "10 MERCER RESTAURANT" "10 MERCER RESTAURANT" ...
##  $ Program_Identifier        : chr  "SHACK COFFEE" "10 MERCER RESTAURANT" "10 MERCER RESTAURANT" "10 MERCER RESTAURANT" ...
##  $ Inspection_Date           : chr  NA "01/24/2017" "01/24/2017" "01/24/2017" ...
##  $ Description               : chr  "Seating 0-12 - Risk Category I" "Seating 13-50 - Risk Category III" "Seating 13-50 - Risk Category III" "Seating 13-50 - Risk Category III" ...
##  $ Address                   : chr  "2920 SW AVALON WAY" "10 MERCER ST" "10 MERCER ST" "10 MERCER ST" ...
##  $ City                      : chr  "Seattle" "Seattle" "Seattle" "Seattle" ...
##  $ Zip_Code                  : chr  "98126" "98109" "98109" "98109" ...
##  $ Phone                     : chr  "(206) 938-5665" NA NA NA ...
##  $ Longitude                 : num  -122 -122 -122 -122 -122 ...
##  $ Latitude                  : num  47.6 47.6 47.6 47.6 47.6 ...
##  $ Inspection_Business_Name  : chr  NA "10 MERCER RESTAURANT" "10 MERCER RESTAURANT" "10 MERCER RESTAURANT" ...
##  $ Inspection_Type           : chr  NA "Routine Inspection/Field Review" "Routine Inspection/Field Review" "Routine Inspection/Field Review" ...
##  $ Inspection_Score          : int  NA 10 10 10 15 15 15 0 15 15 ...
##  $ Inspection_Result         : chr  NA "Unsatisfactory" "Unsatisfactory" "Unsatisfactory" ...
##  $ Inspection_Closed_Business: chr  NA "false" "false" "false" ...
##  $ Violation_Type            : chr  NA "blue" "blue" "red" ...
##  $ Violation_Description     : chr  NA "4300 - Non-food contact surfaces maintained and clean" "4800 - Physical facilities properly installed,..." "1200 - Proper shellstock ID; wild mushroom ID;  parasite destruction procedures for fish" ...
##  $ Violation_Points          : int  0 3 2 5 5 5 5 0 5 10 ...
##  $ Business_ID               : chr  "PR0048053" "PR0049572" "PR0049572" "PR0049572" ...
##  $ Inspection_Serial_Num     : chr  NA "DAHSIBSJT" "DAHSIBSJT" "DAHSIBSJT" ...
##  $ Violation_Record_ID       : chr  NA "IV43WZVLN" "IVCQ1ZIV0" "IVREK90PM" ...
##  $ Grade                     : int  NA 2 2 2 2 2 2 2 2 2 ...
##  $ Date                      : Date, format: NA "2017-01-24" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 23
##   .. ..$ Name                      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Program_Identifier        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Inspection_Date           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Description               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Address                   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ City                      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Zip_Code                  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Phone                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Longitude                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ Latitude                  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ Inspection_Business_Name  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Inspection_Type           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Inspection_Score          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Inspection_Result         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Inspection_Closed_Business: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Violation_Type            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Violation_Description     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Violation_Points          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Business_ID               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Inspection_Serial_Num     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Violation_Record_ID       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Grade                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Date                      :List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

There are often multiple rows per Business_ID per Date, such as when an establishment is given violation points for multiple problems. The Result and Score columns will have the same values on those rows for the same restaurant and date, but details of the violation type (“red” or “blue”), violation description, and violation points will differ from row to row, with different violations on different rows. Keep this duplication in mind as you work. You will need to drop extra rows to get one row per business per date, or even one row per business. You can do this using the dplyr concepts we’ve studied like the distinct() function, or group_by() and summarize() or filter() to collapse over multiple rows.

Preparing the data

Restaurants only

There are grocery stores without seating, school cafeterias, and other less relevant businesses in the data. We only want to look at restaurants. Identify and only keep businesses whose Description starts with Seating, e.g. "Seating 51-150 - Risk Category III". Call this new data frame with the rows filtered restaurants_only.

[YOUR CODE HERE]

Scores over time

Now make a data frame using restaurants_only called scores_over_time with exactly one row per Business_ID per inspection Date, with the business Name, its Address and ZIP, its Longitude and Latitude, and the value of Score on each inspection date. With data structured this way, you will be able analyze trends over time for each establishment. There should no longer be duplicate rows for when an establishment has multiple violations on a single date.

[YOUR CODE HERE]

Preparing to label bad scores

In order to label restaurants with bad scores (say, 40 and above), you’ll want to make a column called Label_40 on scores_over_time. It should have the Name if the Score is greater than or equal to 40, and be blank (i.e. "") if the Score is below that. Use mutate() and ifelse() to make this Label_40 column.

[YOUR CODE HERE]

Most recent scores

We’ll also want to look at just the most recent scores for each restaurant. Make a data frame called recent_scores from scores_over_time that has one row per Business_ID, with the business Name, its Address and ZIP, Longitude and Latitude, the most recent value of Score, the Date of that score, and Label_40. The slides from last week pertaining to looking at the most recent inspections of coffee shops have code that might help.

[YOUR CODE HERE]

Map-making

Mapping the recent scores

Now, use the ggmap package and the longitude and latitude information to plot the most recent inspection scores for restaurants on top of a map of Seattle. Try coloring and/or sizing the points according to their most recent inspection score (bigger points = higher score). You can use scale_color_gradient() to set the colors so that establishments with lower scores are white or gray, and establishments with higher scores are red, and scale_size= to set the sizes. Play with these options and map settings until you get something you think looks good.

[YOUR CODE AND PLOT HERE]

The U District

Now repeat the plot, but filter() the data to the U District area (latitude above 47.674 and below 47.651, longitude above -122.322 and below -122.301). Add some text labels using Label_40 for businesses whose scores were 40 or higher on their most recent inspection. See the ggplot2 docs on geom_text() and geom_label() for how you can get these to look good, perhaps trying out the ggrepel package to avoid overlaps.

[YOUR CODE AND PLOT HERE]

Capitol Hill

Repeat the above, but for Capitol Hill instead (latitude between 47.614 and 47.642, longitude between -122.329 and -122.302).

[YOUR CODE AND PLOT HERE]

Scores over time

Sub-sampling the data

Now we want to look at inspection scores over time for restaurants, but there are far too many to visualize. Pick something more limited to investigate and subset the scores_over_time data to include somewhere between around 5 and 25 establishments. To do this, you’ll want to make a vector that has just the Business_ID or Name values of restaurants of interest, and then filter() the scores_over_time data based on this. Some examples of angles you could choose for doing this subsetting:

  • Restaurants in your ZIP code
  • Your favorite chain restaurant
  • Diners
  • Coffee shops in a part of the city
  • A cuisine based on words in restaurant names (e.g. “Pho”)
  • Restaurants that have had a really bad score at some time – did they have previous bad scores, or were they mostly without problems before?

The string pattern matching tools from last week could be helpful depending on the criteria you choose.

[YOUR CODE TO SAMPLE HERE]

Mapping your subsample

Make a plot, appropriately cropped, showing the locations of the restaurants you’ve chosen with a dot for each restaurant and text labels.

[YOUR CODE AND PLOT HERE]