Instructions

Questions for you to answer are as quoted blocks of text. Put your code used to address these questions and interpretation below each block. Make sure your answers are NOT in block quotes like these. Load all libraries you want to use in the setup chunk; when you discover you want to lose a library later, add it to the setup chunk at the top. Part 1 is due Tuesday May 1st, at 11:59 PM. Part 2 is due Tuesday May 8th, at 11:59 PM. You will upload the entire template each time, with whatever progress you have made.

Part 1

Getting the data in

Download the data from https://raw.githubusercontent.com/clanfear/CSSS508/master/Homework/HW5/king_county_elections_2016.txt. It is a plain text file of data, about 60 MB in size. Values are separated with commas (you can see this by opening it with a good text editor, e.g. not Notepad). Save it somewhere on your computer—in the same folder as this template is easiest—and read the file into R. You will want to use the cache=TRUE chunk option for this (and potentially other chunks). cache=TRUE will allow R to read the file only once to save time.

This file is comma separated (a CSV), so we’ll use the read_csv function in readr without any fancy options:

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
king_raw <- read_csv("king_county_elections_2016.txt")
## Parsed with column specification:
## cols(
##   Precinct = col_character(),
##   Race = col_character(),
##   LEG = col_integer(),
##   CC = col_integer(),
##   CG = col_integer(),
##   CounterGroup = col_character(),
##   Party = col_character(),
##   CounterType = col_character(),
##   SumOfCount = col_double()
## )

Inspecting the data

Use a function like str() or glimpse() to look at the data. Describe the data in their current state. How many rows are there? What variables are there? What kinds of values do they take (don’t list them all if there are many)? Are the column types sensible?

data %>%
    select(variable) %>%
    distinct(variable) %>%
    head(10)

There are 643163 rows and 9 columns, as follows:

str(king_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    643163 obs. of  9 variables:
##  $ Precinct    : chr  "ADAIR" "ADAIR" "ADAIR" "ADAIR" ...
##  $ Race        : chr  "Advisory Vote 14" "Advisory Vote 14" "Advisory Vote 14" "Advisory Vote 14" ...
##  $ LEG         : int  45 45 45 45 45 45 45 45 45 45 ...
##  $ CC          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ CG          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ CounterGroup: chr  "Total" "Total" "Total" "Total" ...
##  $ Party       : chr  "NP" "NP" "NP" "NP" ...
##  $ CounterType : chr  "Maintained" "Registered Voters" "Repealed" "Times Blank Voted" ...
##  $ SumOfCount  : num  183 519 251 51 485 0 220 519 222 43 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 9
##   .. ..$ Precinct    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Race        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LEG         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ CC          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ CG          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ CounterGroup: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Party       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CounterType : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ SumOfCount  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

In addition to looking generally, look at each variable individually… except consider LEG, CC and CG at the same time (I will tell you now these three aren’t likely to be useful to you, but maybe guess what they are!). Remember these are real administrative data so they may be really strangely structured and some variables are indecipherable; in real world data work, you often have to get by with intuition or poking around online with regard to the nature of your data. Here useful way to look at 10 unique values of individual columns, given some data and a variable of interest:

data %>%
    select(variable) %>%
    distinct(variable) %>%
    head(10)

Precinct

Precinct has a precinct identifier, which is sometimes a city or neighborhood name, sometimes some text followed by an ID number:

king_raw %>%
    select(Precinct) %>%
    distinct(Precinct) %>%
    head(10)
## # A tibble: 10 x 1
##    Precinct     
##    <chr>        
##  1 ADAIR        
##  2 ALDARRA      
##  3 ALDER SPRINGS
##  4 ALDERWOOD    
##  5 ALG 30-0013  
##  6 ALG 30-0014  
##  7 ALG 30-3141  
##  8 ALPINE       
##  9 AMES LAKE    
## 10 ANGEL CITY
king_raw %>%
    select(Precinct) %>%
    distinct(Precinct) %>%
    tail(10)
## # A tibble: 10 x 1
##    Precinct   
##    <chr>      
##  1 WOD 45-3192
##  2 WOD 45-3193
##  3 WOD 45-3270
##  4 WOD 45-3530
##  5 WOLF       
##  6 WOODSIDE   
##  7 WYNOCHE    
##  8 WYNONA     
##  9 YPT 48-1233
## 10 YPT 48-1234
precinct_count <- king_raw %>%
    select(Precinct) %>%
    distinct(Precinct) %>%
    tally() %>%
    as.numeric()

There are 2517 distinct values appearing the Precinct column. Sounds about right for King County?

Race

Race appears to contain the specific races, some of which are positions candidates are running for, some of which are local propositions, and some are miscellaneous Tim Eyman ballot bloating garbage (“Advisory Vote of the People”):

king_raw %>%
    select(Race) %>%
    distinct(Race) %>%
    head(10)
## # A tibble: 10 x 1
##    Race                             
##    <chr>                            
##  1 Advisory Vote 14                 
##  2 Advisory Vote 15                 
##  3 Attorney General                 
##  4 Commissioner of Public Lands     
##  5 Congressional District 1         
##  6 Court of Appeals Judge Position 1
##  7 Governor                         
##  8 Initiative Measure 1433          
##  9 Initiative Measure 1464          
## 10 Initiative Measure 1491
king_raw %>%
    select(Race) %>%
    distinct(Race) %>%
    tail(10)
## # A tibble: 10 x 1
##    Race                                             
##    <chr>                                            
##  1 Legislative District 36 Representative Position 2
##  2 Legislative District 36 State Senator            
##  3 Legislative District 43 Representative Position 1
##  4 Legislative District 43 Representative Position 2
##  5 City of Shoreline Proposition 1                  
##  6 Legislative District 39 Representative Position 1
##  7 Legislative District 39 Representative Position 2
##  8 Legislative District 39 State Senator            
##  9 City of Snoqualmie Proposition 1                 
## 10 City of Tukwila Proposition 1

LEG, CC, CG

LEG, CC, and CG appear to be the numbers for legislative district, King County Council district, and Congressional district. We could argue that LEG, CC, and CG should be converted to character because their numerical values are irrelevant, but we’re not planning on using this information, so nothing bad will happen by being lazy and not changing it from integer to character. I could say col_types = "cccccccci" when using read_csv, though, to get all the column types perfect.

We note LEG and CC values are missing on over 200 rows:

king_raw %>%
    select(LEG, CC, CG) %>%
    summary()
##       LEG              CC             CG      
##  Min.   : 1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:32.00   1st Qu.:3.00   1st Qu.:7.00  
##  Median :37.00   Median :5.00   Median :7.00  
##  Mean   :34.91   Mean   :4.86   Mean   :7.03  
##  3rd Qu.:45.00   3rd Qu.:7.00   3rd Qu.:9.00  
##  Max.   :48.00   Max.   :9.00   Max.   :9.00  
##  NA's   :224     NA's   :224

Let’s look at a sample:

king_raw %>%
    filter(is.na(LEG)) %>%
    head(10)
## # A tibble: 10 x 9
##    Precinct    Race        LEG    CC    CG CounterGroup Party CounterType 
##    <chr>       <chr>     <int> <int> <int> <chr>        <chr> <chr>       
##  1 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Maintained  
##  2 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Registered ~
##  3 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Repealed    
##  4 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Times Blank~
##  5 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Times Count~
##  6 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Times Over ~
##  7 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Maintained  
##  8 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Registered ~
##  9 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Repealed    
## 10 ELECTIONS ~ Advisory~    NA    NA     9 Total        NP    Times Blank~
## # ... with 1 more variable: SumOfCount <dbl>

Interesting. It looks like there’s a “precinct” called “ELECTIONS OFFICE” which has some counts listed, but sure doesn’t sound like a real precinct. We could think about dropping rows for this “precinct” since it seems like might be a mistake, though it won’t make a big difference.

CounterGroup

CounterGroup only has one value (Total) and is completely, utterly useless. A waste of a column!

king_raw %>%
    select(CounterGroup) %>%
    distinct(CounterGroup)
## # A tibble: 1 x 1
##   CounterGroup
##   <chr>       
## 1 Total

Party

Party contains values for the political parties involved in each race:

king_raw %>%
    group_by(Party) %>%
    tally() %>%
    arrange(desc(n))
## # A tibble: 17 x 2
##    Party      n
##    <chr>  <int>
##  1 NP    565031
##  2 Dem    29412
##  3 Rep    27089
##  4 Lib     3276
##  5 CPN     2517
##  6 DPN     2517
##  7 GPN     2517
##  8 LPN     2517
##  9 RPN     2517
## 10 SPN     2517
## 11 SWN     2517
## 12 NOP      174
## 13 PAR      174
## 14 DFW      168
## 15 GP       166
## 16 ID        52
## 17 Dcr        2

We see quite a few different values here. Dem and Rep stand out as being Democrats and Republicans, and NP is so common that it must mean “no party” or “non-partisan”. Most of the other values have exactly the same number of values as there are distinct precincts, so these are probably Presidential or Senate candidates from third parties if every precinct is voting for them.

CounterType

CounterType appears to contain a mix of candidate names or position votes (e.g. “Bob Ferguson”, “Approved”, “No”) and overall summaries for the particular race in the precinct (e.g. “Registered Voters”, “Times Blank Voted”, “Times Counted”).

I would say this column has a lot of the info we want, and that it is currently structured “too long” since all of these are different rows corresponding to the same precinct for the same race.

king_raw %>%
    select(CounterType) %>%
    distinct() %>%
    head(20)
## # A tibble: 20 x 1
##    CounterType         
##    <chr>               
##  1 Maintained          
##  2 Registered Voters   
##  3 Repealed            
##  4 Times Blank Voted   
##  5 Times Counted       
##  6 Times Over Voted    
##  7 Bob Ferguson        
##  8 Joshua B. Trumbull  
##  9 Write-In            
## 10 Hilary Franz        
## 11 Steve McLaughlin    
## 12 Suzan DelBene       
## 13 Robert J. Sutherland
## 14 Michael J. Trickey  
## 15 Jay Inslee          
## 16 Bill Bryant         
## 17 No                  
## 18 Yes                 
## 19 Mike Kreidler       
## 20 Richard Schrock

SumOfCount

SumOfCount appears to be just counts of votes (or registered voters) associated with CounterType. This is the numeric information we want to use, and there are no missing values:

king_raw %>%
    select(SumOfCount) %>%
    summary()
##    SumOfCount    
##  Min.   :   0.0  
##  1st Qu.:   4.0  
##  Median : 121.0  
##  Mean   : 198.5  
##  3rd Qu.: 330.0  
##  Max.   :1126.0

The quantities of interest

We will focus on only the three major executive races in Washington in 2016:

  • President (and Vice-President)
  • Governor
  • Lieutenant Governor

With these races, we are interested in: 1. Turnout rates for each of these races in each precinct. We will measure turnout as times votes were counted (including for a candidate, blank, write-in, or “over vote”) divided by the number of registered voters. 2. Differences between precincts in Seattle and precincts elsewhere in King County. Again, these data are not documented, so you will have to figure out how to do this. 3. Precinct-level support for the Democratic candidates in King County in 2012 for each contest. We will measure support as the percentage of votes in a precinct for the Democratic candidate out of all votes for candidates or write-ins. Do not include blank votes or “over votes” (where the voter indicated multiple choices) in the overall vote count for the denominator.

You will perform most of the data management for #1 and #2 in Part 1. Part 2 will contain most of the work for #3 and also covers visualizing results.

The primary goal to accomplish over Parts 1 and 2 will be to get the data to one row per precinct with the following 7 columns:

  • Precinct identifier
  • Indicator for whether the precinct is in Seattle or not
  • Precinct size in terms of registered voters
  • Turnout rate
  • Percentage Democratic support for President
  • Percentage Democratic support for Governor
  • Percentage Democratic support for Lieutenant Governor

The sections below describe steps you may want to do to get your data organized, and provide some hints and suggestions for methods, in particular using dplyr and tidyr.

Filtering down the data

For what we want to do, there are a lot of rows that are not useful. We only want ones pertaining to races for President, Governor, and Lieutenant Governor. So let’s trim everything down. You will want to see how these things show up in the data. The easiest way may be to (1) display every unique value of Race and find which ones match our races of interest, then (2) filter the data to those races.

# info on the distinct races
races <- king_raw %>%
    select(Race) %>%
    distinct(Race) %>%
    arrange(Race)
# print it out as a character vector
as.character(races$Race)
##  [1] "Advisory Vote 14"                                            
##  [2] "Advisory Vote 15"                                            
##  [3] "Attorney General"                                            
##  [4] "Auburn School District No. 408 Proposition 1"                
##  [5] "City of Bellevue Proposition 1"                              
##  [6] "City of Bellevue Proposition 2"                              
##  [7] "City of Bothell Advisory Proposition 1"                      
##  [8] "City of Bothell Proposition 1"                               
##  [9] "City of Duvall Advisory Vote 1"                              
## [10] "City of Duvall Proposition 1"                                
## [11] "City of Issaquah Proposition 1"                              
## [12] "City of Kenmore Proposition 1"                               
## [13] "City of Seattle Initiative Measure 124"                      
## [14] "City of Shoreline Proposition 1"                             
## [15] "City of Snoqualmie Proposition 1"                            
## [16] "City of Tukwila Proposition 1"                               
## [17] "Commissioner of Public Lands"                                
## [18] "Congressional District 1"                                    
## [19] "Congressional District 7"                                    
## [20] "Congressional District 8"                                    
## [21] "Congressional District 9"                                    
## [22] "Court of Appeals Judge Position 1"                           
## [23] "District Court Southwest Electoral District Judge Position 3"
## [24] "District Court West Electoral District Judge Position 1"     
## [25] "District Court West Electoral District Judge Position 4"     
## [26] "Governor"                                                    
## [27] "Highline School District No. 401 Proposition 1"              
## [28] "Initiative Measure 1433"                                     
## [29] "Initiative Measure 1464"                                     
## [30] "Initiative Measure 1491"                                     
## [31] "Initiative Measure 1501"                                     
## [32] "Initiative Measure 732"                                      
## [33] "Initiative Measure 735"                                      
## [34] "Insurance Commissioner"                                      
## [35] "Kent School District No. 415 Proposition 1"                  
## [36] "King County Charter Amendment 1"                             
## [37] "King County Charter Amendment 2"                             
## [38] "King County Fire Protection District 27 Proposition 1"       
## [39] "Legislative District 1 Representative Position 1"            
## [40] "Legislative District 1 Representative Position 2"            
## [41] "Legislative District 1 State Senator"                        
## [42] "Legislative District 11 Representative Position 1"           
## [43] "Legislative District 11 Representative Position 2"           
## [44] "Legislative District 11 State Senator"                       
## [45] "Legislative District 30 Representative Position 1"           
## [46] "Legislative District 30 Representative Position 2"           
## [47] "Legislative District 31 Representative Position 1"           
## [48] "Legislative District 31 Representative Position 2"           
## [49] "Legislative District 32 Representative Position 1"           
## [50] "Legislative District 32 Representative Position 2"           
## [51] "Legislative District 33 Representative Position 1"           
## [52] "Legislative District 33 Representative Position 2"           
## [53] "Legislative District 34 Representative Position 1"           
## [54] "Legislative District 34 Representative Position 2"           
## [55] "Legislative District 36 Representative Position 1"           
## [56] "Legislative District 36 Representative Position 2"           
## [57] "Legislative District 36 State Senator"                       
## [58] "Legislative District 37 Representative Position 1"           
## [59] "Legislative District 37 Representative Position 2"           
## [60] "Legislative District 39 Representative Position 1"           
## [61] "Legislative District 39 Representative Position 2"           
## [62] "Legislative District 39 State Senator"                       
## [63] "Legislative District 41 Representative Position 1"           
## [64] "Legislative District 41 Representative Position 2"           
## [65] "Legislative District 41 State Senator"                       
## [66] "Legislative District 43 Representative Position 1"           
## [67] "Legislative District 43 Representative Position 2"           
## [68] "Legislative District 45 Representative Position 1"           
## [69] "Legislative District 45 Representative Position 2"           
## [70] "Legislative District 46 Representative Position 1"           
## [71] "Legislative District 46 Representative Position 2"           
## [72] "Legislative District 47 Representative Position 1"           
## [73] "Legislative District 47 Representative Position 2"           
## [74] "Legislative District 48 Representative Position 1"           
## [75] "Legislative District 48 Representative Position 2"           
## [76] "Legislative District 5 Representative Position 1"            
## [77] "Legislative District 5 Representative Position 2"            
## [78] "Legislative District 5 State Senator"                        
## [79] "Lieutenant Governor"                                         
## [80] "Proposed Eastside Regional Fire Authority Proposition 1"     
## [81] "Secretary of State"                                          
## [82] "Senate Joint Resolution 8210"                                
## [83] "Sound Transit Proposition 1"                                 
## [84] "South King Fire and Rescue Proposition 1"                    
## [85] "State Auditor"                                               
## [86] "State Supreme Court Justice Position 1"                      
## [87] "State Supreme Court Justice Position 5"                      
## [88] "State Supreme Court Justice Position 6"                      
## [89] "State Treasurer"                                             
## [90] "Superintendent of Public Instruction"                        
## [91] "Superior Court Judge Position 14"                            
## [92] "Superior Court Judge Position 26"                            
## [93] "Superior Court Judge Position 31"                            
## [94] "Superior Court Judge Position 44"                            
## [95] "Superior Court Judge Position 52"                            
## [96] "Superior Court Judge Position 53"                            
## [97] "US President & Vice President"                               
## [98] "US Senator"

Governor, Lieutenant Governor, and US President & Vice President are the ones we want. These are in positions 26, 79, and 97 of my sorted races output, respectively, so I will make a character vector holding those values specifically for easier subsetting. Never type more than you have to!

# make a character vector of relevant races
(rel_races <- races$Race[c(26, 79, 97)])
## [1] "Governor"                      "Lieutenant Governor"          
## [3] "US President & Vice President"
# subset the data to relevant races
king_rel_races <- king_raw %>%
    filter(Race %in% rel_races)

Seattle precincts

We want to determine which precincts are in Seattle and which are not. You will probably want to look at values of the Precinct variable and see if you can figure out what uniquely identifies Seattle precincts. You will then want to create a binary variable that identifies Seattle tracts (for instance, with values "Seattle" and "Not Seattle"). Hint: You can use substr() to grab a number of characters—a sub-string—from text (say, to test if they equal something); if you use this with ifelse() inside mutate() you can make a new variable based on whether the sub-string of Precinct equals a value.

How can I figure out which precincts are in Seattle? I’m going to make a dataset with the whole list and eyeball it to see if anything jumps out.

precincts <- king_rel_races %>%
    select(Precinct) %>%
    distinct(Precinct) %>%
    arrange(Precinct)

Scrolling to the “S” section, it looks like Seattle precincts all start with SEA followed by a space and a precinct number. Looking at a map on the King County website and zooming in enough to see the precinct numbers confirms it. Precincts near but not in Seattle like in Shoreline to the north or Tukwila to the south have a different naming system. Thus, I am confident that identifying Seattle precincts as those whose first four characters are SEA will work to flag those. Just the three characters SEA on its own won’t – there are precincts called SEALTH, SEAN, and SEAVIEW we don’t want to flag as in Seattle.

One way to proceed is to use the substr function (seen in Week 4 when checking if the second letter of some first names was “a”) to pull out the first four characters of Precinct and check if they are equal to SEA.

king_flag <- king_rel_races %>%
    mutate(Location = ifelse(substr(Precinct, start = 1, stop = 4) == "SEA ",
                             "Seattle",
                             "Not Seattle"))

An alternative way is to use the separate function in tidyr (which we used to take a character representation of song length and split it into minutes and seconds). We could split these precincts at the first space and then check if the stuff in the first part of the split says SEA:

king_flag_alt <- king_rel_races %>%
    separate(Precinct, into = c("part1", "part2"), sep = " ") %>%
    mutate(Location = ifelse(part1 == "SEA",
                             "Seattle",
                             "Not Seattle"))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6136
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].

Note that we get a warning message when doing this with separate because some precincts only have one word in them, so there is no space to split on. This is fine. (We’ll see more ways to match text patterns in Week 8.)

Sanity check: do we get the same answer either way? Let’s sum how many times the approaches disagree.

sum(king_flag$Location != king_flag_alt$Location)
## [1] 0

Both ways give the same answers, so we’re all good!

Registered voters and turnout rates

We want to calculate turnout rates as total votes (including normal votes, blank votes, over votes, write-ins) for the Presidential race divided by registered voters. Hint: You will want to look at CounterType and SumOfCount at the same time, within each Precinct and Race. Examine how the SumOfCount values for CounterType value "Times Counted" relate to all the other CounterType values.

First, I see there is a value in CounterType called “Times Counted”. It would be nice if this was the numerator we were after. I’m going to check this by summing SumOfCount up within each precinct and race of interest over all the rows besides where CounterType is “Registered Voters” or “Times Counted”. Then I’ll compare these to the “Times Counted” rows. We’ll use joins to do this:

# sum over rows besides "Registered Voters" or "Times Counted"
# within each precinct and race
times_counted_manual <- king_flag %>%
    select(Precinct, Race, CounterType, SumOfCount) %>%
    filter(CounterType != "Registered Voters" & CounterType != "Times Counted") %>%
    group_by(Precinct, Race) %>%
    summarize(votes_added_up = sum(SumOfCount))
head(times_counted_manual)
## # A tibble: 6 x 3
## # Groups:   Precinct [2]
##   Precinct Race                          votes_added_up
##   <chr>    <chr>                                  <dbl>
## 1 ADAIR    Governor                                485.
## 2 ADAIR    Lieutenant Governor                     485.
## 3 ADAIR    US President & Vice President           485.
## 4 ALDARRA  Governor                                625.
## 5 ALDARRA  Lieutenant Governor                     625.
## 6 ALDARRA  US President & Vice President           625.
# now just grab the "Times Counted" rows and merge
times_counted_compare <- king_flag %>%
    select(Precinct, Race, CounterType, SumOfCount) %>%
    filter(CounterType == "Times Counted") %>%
    # rename the column on filtered data for clarity
    rename(times_counted_value = SumOfCount) %>%
    left_join(times_counted_manual,
              by = c("Precinct", "Race")) %>%
    # compute differences
    mutate(diff = times_counted_value - votes_added_up)

summary(times_counted_compare$diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0

They’re always the same! That means “Times Counted” is including every possible kind of vote for each race, such as blanks, write-ins, accidental over-votes, or your usual ones. Now we can make a data frame that has registered voters and turnout rates (for the Presidential race) for each precinct:

turnout_rates <- king_flag %>%
    # filter to just the presidential election
    filter(Race == rel_races[3]) %>%
    # filter to just registered voters or times counted
    filter(CounterType %in% c("Registered Voters", "Times Counted")) %>%
    # just the columns we want
    select(Precinct, Location, CounterType, SumOfCount) %>%
    # use spread to put the two counts on the same row for each precinct
    spread(key = CounterType, value = SumOfCount) %>%
    # use new columns to compute turnout rate
    mutate(Turnout = `Times Counted` / `Registered Voters`)
head(turnout_rates)
## # A tibble: 6 x 5
##   Precinct      Location    `Registered Voters` `Times Counted` Turnout
##   <chr>         <chr>                     <dbl>           <dbl>   <dbl>
## 1 ADAIR         Not Seattle                519.            485.   0.934
## 2 ALDARRA       Not Seattle                763.            625.   0.819
## 3 ALDER SPRINGS Not Seattle                557.            476.   0.855
## 4 ALDERWOOD     Not Seattle                472.            404.   0.856
## 5 ALG 30-0013   Not Seattle                497.            319.   0.642
## 6 ALG 30-0014   Not Seattle                525.            364.   0.693

Part 2

Democratic support rates

We want to get measures of democratic support in each Precinct for each of our three races. You are asked to measure support as the percentage of votes in a precinct for the Democratic candidate out of all votes for candidates or write-ins, but this time do not to include blank votes or “over votes” (where the voter indicated multiple choices) in the overall vote count for the denominator. Hint: A good approach here is to compute the denominator (total votes) for each precinct, and then merge (e.g. left_join()) on the Democratic vote count for each race and divide by the total votes.

Computing candidate votes

You will probably want to follow a process like this:

  1. Make a new dataframe with the total number of votes cast for any actual candidates (including "Write-In") in each precinct and race. Hint: You will likely want to use filter() followed by group_by() and summarize() using the SumOfCount variable.
  1. MAke another dataframe with the total number of votes for democratic candidates in each precinct and race. You will want to check the Party of candidates and work only with the democratic observations to get these vote counts. Hint: There are different democratic parties for different races (e.g. "Dem" or "DPN").
  1. Merge the total votes data with the democratic votes data, then calculate a percent democratic votes variable for each race.

I observe that for the races of interest, the proper candidates all have rows where Party is not NP:

king_flag %>%
    select(Race, Party, CounterType) %>%
    distinct()
## # A tibble: 26 x 3
##    Race                Party CounterType      
##    <chr>               <chr> <chr>            
##  1 Governor            Dem   Jay Inslee       
##  2 Governor            NP    Registered Voters
##  3 Governor            NP    Times Blank Voted
##  4 Governor            NP    Times Counted    
##  5 Governor            NP    Times Over Voted 
##  6 Governor            NP    Write-In         
##  7 Governor            Rep   Bill Bryant      
##  8 Lieutenant Governor Dem   Cyrus Habib      
##  9 Lieutenant Governor NP    Registered Voters
## 10 Lieutenant Governor NP    Times Blank Voted
## # ... with 16 more rows

I could keep those rows or rows for “Write-in” and that would be perfect.

candidate_vote_rows <- king_flag %>%
    # keep just not NP rows, or write-in rows
    filter(Party != "NP" | CounterType == "Write-In") %>%
    select(Precinct, Location, Race, Party, CounterType, SumOfCount)

# sum over all votes for candidates within a precinct and race
total_candidate_votes <- candidate_vote_rows %>%
    group_by(Precinct, Location, Race) %>%
    summarize(total_candidate_votes  = sum(SumOfCount))

Let’s look at how to pull up the Democrat rows specifically:

candidate_vote_rows %>%
    select(Race, Party, CounterType) %>%
    distinct()
## # A tibble: 14 x 3
##    Race                          Party CounterType                        
##    <chr>                         <chr> <chr>                              
##  1 Governor                      Dem   Jay Inslee                         
##  2 Governor                      NP    Write-In                           
##  3 Governor                      Rep   Bill Bryant                        
##  4 Lieutenant Governor           Dem   Cyrus Habib                        
##  5 Lieutenant Governor           NP    Write-In                           
##  6 Lieutenant Governor           Rep   Marty McClendon                    
##  7 US President & Vice President CPN   Darrell L. Castle & Scott N. Bradl~
##  8 US President & Vice President DPN   Hillary Clinton & Tim Kaine        
##  9 US President & Vice President GPN   Jill Stein & Ajamu Baraka          
## 10 US President & Vice President LPN   Gary Johnson & Bill Weld           
## 11 US President & Vice President NP    Write-In                           
## 12 US President & Vice President RPN   Donald J. Trump & Michael R. Pence 
## 13 US President & Vice President SPN   Gloria Estela La Riva & Eugene Pur~
## 14 US President & Vice President SWN   Alyson Kennedy & Osborne Hart

Interesting. For Governor and Lieutenant Governor, the Democrat candidate has Party of "Dem". For President, it’s "DPN". Life is a rich tapestry! I’ll count as Democratic votes anything that is "Dem" or "DPN".

# subset to votes for Democrat candidate
democratic_vote_rows <- candidate_vote_rows %>%
    filter(Party %in% c("Dem", "DPN")) %>%
    select(Precinct, Location, Race, SumOfCount) %>%
    # rename the count to be informative
    rename(dem_votes = SumOfCount)

Now we can merge by precinct and race and do the math:

democrat_vote_rates <- democratic_vote_rows %>%
    left_join(total_candidate_votes,
              by = c("Precinct", "Location", "Race")) %>%
    mutate(`Democrat support` = dem_votes / total_candidate_votes) %>%
    select(Precinct, Location, Race, `Democrat support`)
head(democrat_vote_rates)
## # A tibble: 6 x 4
##   Precinct Location    Race                          `Democrat support`
##   <chr>    <chr>       <chr>                                      <dbl>
## 1 ADAIR    Not Seattle Governor                                   0.532
## 2 ADAIR    Not Seattle Lieutenant Governor                        0.545
## 3 ADAIR    Not Seattle US President & Vice President              0.562
## 4 ALDARRA  Not Seattle Governor                                   0.481
## 5 ALDARRA  Not Seattle Lieutenant Governor                        0.519
## 6 ALDARRA  Not Seattle US President & Vice President              0.546

Combining it all

Once you’ve calculated democratic voting percentages for each race you’ll want to put them back together with the precinct turnout rate data using a join. Then you will want to make sure your data are shaped as I recommend above: One row per precincts, with columns for each of the relevant measures. If your data are in a format where you have a row for each race within each precinct (“long format”), you may find the spread() command useful for turning multiple rows for each precinct into single precinct rows with different columns for each race.

We have registered voters and turnout in turnout_rates, and Democratic candidate support rates in democrat_vote_rates. Now we merge using left_join():

precinct_data <- turnout_rates %>%
    left_join(democrat_vote_rates,
              by = c("Precinct", "Location"))
head(precinct_data)
## # A tibble: 6 x 7
##   Precinct Location  `Registered Vote~ `Times Counted` Turnout Race       
##   <chr>    <chr>                 <dbl>           <dbl>   <dbl> <chr>      
## 1 ADAIR    Not Seat~              519.            485.   0.934 Governor   
## 2 ADAIR    Not Seat~              519.            485.   0.934 Lieutenant~
## 3 ADAIR    Not Seat~              519.            485.   0.934 US Preside~
## 4 ALDARRA  Not Seat~              763.            625.   0.819 Governor   
## 5 ALDARRA  Not Seat~              763.            625.   0.819 Lieutenant~
## 6 ALDARRA  Not Seat~              763.            625.   0.819 US Preside~
## # ... with 1 more variable: `Democrat support` <dbl>

We can make this wide using spread() and clean up the names a bit:

wide_precinct_data <- precinct_data %>%
    spread(key = Race, value = `Democrat support`) %>%
    rename(Governor = `Governor`,
           `Lt. Governor` = `Lieutenant Governor`,
           President = `US President & Vice President`)
wide_precinct_data
## # A tibble: 2,517 x 8
##    Precinct   Location  `Registered Vote~ `Times Counted` Turnout Governor
##    <chr>      <chr>                 <dbl>           <dbl>   <dbl>    <dbl>
##  1 ADAIR      Not Seat~              519.            485.   0.934    0.532
##  2 ALDARRA    Not Seat~              763.            625.   0.819    0.481
##  3 ALDER SPR~ Not Seat~              557.            476.   0.855    0.458
##  4 ALDERWOOD  Not Seat~              472.            404.   0.856    0.519
##  5 ALG 30-00~ Not Seat~              497.            319.   0.642    0.557
##  6 ALG 30-00~ Not Seat~              525.            364.   0.693    0.602
##  7 ALG 30-31~ Not Seat~              551.            364.   0.661    0.439
##  8 ALPINE     Not Seat~              629.            513.   0.816    0.483
##  9 AMES LAKE  Not Seat~              865.            726.   0.839    0.522
## 10 ANGEL CITY Not Seat~              779.            546.   0.701    0.795
## # ... with 2,507 more rows, and 2 more variables: `Lt. Governor` <dbl>,
## #   President <dbl>

Graphing the results

Turnout

Make a scatterplot where the horizontal axis is number of registered voters in the precinct, and the vertical axis is turnout rate. Color the precincts in Seattle one color, and use a different color for other precincts. Do you observe anything?

ggplot(data = wide_precinct_data,
       aes(x = `Registered Voters`, y = 100*Turnout,
           color = Location, group = Location)) +
    geom_point(alpha = 0.4, size = 1) +
    geom_smooth(method="loess") +
    scale_y_continuous(breaks=seq(0, 100, 10)) +
    scale_color_manual(values = c("orange", "navyblue")) +
    ggtitle("Turnout rates by precinct in King County, 2016") +
    ylab("Turnout\n(% of registered voters voting in Presidential race)") +
    theme_bw()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

I’m getting warnings when plotting because of precincts with zero registered voters whose turnout rate is NaN (coming from division by zero), which isn’t a big deal. I’ve also superimposed a smooth trend line for each location to see the average relationship between registered voters and turnout rates.

We can see that Seattle precincts are bigger on average, with almost all above 250 registered voters, while precincts can be quite a bit smaller elsewhere in King County. It looks like within Seattle, there is a slight negative relationship between the number of registered voters in a precinct and the proportion of whom actually vote in the Presidential race. The trend is instead flat-to-slightly-increasing for precincts outside of Seattle. However, for precincts of the same size (in terms of registered voters), Seattle precincts actually had slightly higher turnout rates on average than non-Seattle precincts.

The overall level of turnout seems pretty impressive, with many not-too-small precincts having rates of 80% or higher. However, keep in mind this is just calculated out of registered voters. We would need to use Census data and do something much more sophisticated if we wanted to account for all eligible voters residing in King County who are not registered.

Democratic support

Now let’s visualize the Democratic support rates for the three races within each precinct for sufficently large precincts. Limit the data to precincts with at least 500 registered voters. Make a line plot where the horizontal axis indicates precincts, and the vertical axis shows the Democratic support rates. There should be three lines in different colors (one for each race of interest).

Do not label the precincts on the horizontal axis (you will probably have to search to figure out how). You should, however, arrange them on the axis in order from smallest to largest in terms of support for the Democratic candidate for president — that is, the line plotting percentage support for Obama should be smoothly increasing from left to right. The order of the lines in the legend should follow the order of the lines at the right edge of the plot.

To do this, we need to use the “wide” version of the data (one row per precinct), and order Precinct based on Democratic support for the Presidential race (Hint: You will probably want to use fct_reorder() on Precinct). Then we can reshape back from “wide” to “tidy” form using gather() so that we have one variable giving the race—and another giving vote percentage—and can plot a separate line for each race.

tidy_big_precinct_data <- wide_precinct_data %>%
  filter(`Registered Voters` >= 500) %>% # Filter the data to big precincts
  mutate(Precinct=fct_reorder(Precinct, President)) %>% # Reorder the precincts
    gather(key = Race, value = `Democrat support`, # rotate down the columns for each race
           Governor, `Lt. Governor`, `President`) 

Finally we plot, suppressing labels for each precinct/order on the horizontal axis since it is not informative:

ggplot(data = tidy_big_precinct_data,
       aes(x = Precinct, y = `Democrat support`*100,
           group = Race, color = Race)) +
    geom_line(alpha = 0.5) +
    ggtitle("Democratic support in three races\nKing County, 2012, large precincts only") +
    scale_x_discrete(breaks=NULL, # no x-axis labels
                       name = "Precinct (ordered by Clinton support)") +
    scale_y_continuous(breaks = seq(30, 100, 10),
                       name = "Percent of votes within precinct for Democratic candidate") +
    scale_color_manual(values = c("black", "red", "blue")) +
    theme_bw()

This kind of plot is a way to represent three dimensional information (Democrat support rates for each of the three positions per precinct) in two dimensions, by putting the observations in order on the horizontal axis and showing the three values of interest on the same scale on the vertical axis. Here’s a version using points instead of lines if you find the line version hard to interpret:

ggplot(data = tidy_big_precinct_data,
       aes(x = Precinct, y = `Democrat support`*100,
           group = Race, color = Race)) +
    geom_point(alpha = 0.35, size = 0.75) +
    ggtitle("Democratic support in three races\nKing County, 2016, large precincts only") +
    scale_x_discrete(breaks = NULL, # no x-axis labels
                       name = "Precinct (ordered by Clinton support)") +
    scale_y_continuous(breaks = seq(30, 100, 10),
                       name = "Percent of votes within precinct for Democratic candidate") +
    scale_color_manual(values = c("black", "red", "blue")) +
    theme_bw()

From either chart, we see that support for Clinton was usually higher than support for the Democratic candidates in the other two races within each precinct, as the President line typically lies above the Governor and Lt. Governor lines. Particularly in precincts where support for Clinton was below 70% or so, we see considerably lower support for Inslee in the gubernatorial race by about 5-10%.

If you are concerned that this is an artifact of limiting only to large precincts, you can repeat this analysis without filtering based on registered voters and obtain a similar result with more noise.