Questions for you to answer are as quoted blocks of text. Put your code used to address these questions and interpretation below each block. Make sure your answers are NOT in block quotes like these. Load all libraries you want to use in the
setup
chunk; when you discover you want to use a library later, add it to the setup chunk at the top. You will turn in Part 1 for the coming week and Part 2 for the following week. You will upload the entire template each time, with whatever progress you have made.
Download the data from https://raw.githubusercontent.com/clanfear/CSSS508/master/Homework/HW5/king_county_elections_2016.txt. It is a plain text file of data, about 60 MB in size. Values are separated with commas (you can see this by opening it with a good text editor, e.g. Atom or Sublime but not Notepad). Save it in the same folder as this template and read the file into R. You will want to use the
cache=TRUE
chunk option for this (and potentially other chunks).cache=TRUE
will allow R to read the file only once to save time.
This file is comma separated (a CSV), so we’ll use the read_csv
function in readr
without any fancy options:
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
king_raw <- read_csv("king_county_elections_2016.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## Precinct = col_character(),
## Race = col_character(),
## LEG = col_double(),
## CC = col_double(),
## CG = col_double(),
## CounterGroup = col_character(),
## Party = col_character(),
## CounterType = col_character(),
## SumOfCount = col_double()
## )
Use
glimpse()
to look at the data. Describe the data in their current state. How many rows are there? What variables are there? What kinds of values do they take (don’t list them all if there are many)? Are the column types sensible?
glimpse(king_raw)
## Rows: 643,163
## Columns: 9
## $ Precinct <chr> "ADAIR", "ADAIR", "ADAIR", "ADAIR", "ADAIR", "ADAIR", ...
## $ Race <chr> "Advisory Vote 14", "Advisory Vote 14", "Advisory Vote...
## $ LEG <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45...
## $ CC <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ CG <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ CounterGroup <chr> "Total", "Total", "Total", "Total", "Total", "Total", ...
## $ Party <chr> "NP", "NP", "NP", "NP", "NP", "NP", "NP", "NP", "NP", ...
## $ CounterType <chr> "Maintained", "Registered Voters", "Repealed", "Times ...
## $ SumOfCount <dbl> 183, 519, 251, 51, 485, 0, 220, 519, 222, 43, 485, 0, ...
There are over 640,000 rows and 9 columns. There’s a mix of character and numeric data. Nothing looks unusual. It looks like what people voted for is in CounterType
and the vote tallies might be in SumOfCount
. There are repeated observations for each Precinct
and Race
.
In addition to looking generally, look at each variable individually… except consider
LEG
,CC
andCG
at the same time (I will tell you now these three aren’t likely to be useful to you, but maybe guess what they are!). Remember these are real administrative data so they may be really strangely structured and some variables are indecipherable; in real world data work, you often have to get by with intuition or poking around online with regard to the nature of your data. Here useful way to look at 10 unique values of individual columns, given somedata
and avariable
of interest:
data %>%
distinct(variable) %>%
head(10)
Another thing you may want to do (or instead) is get a frequency (count) of distinct values:
data %>% count(variable) %>% head(10)
Precinct
has a precinct identifier, which is sometimes a city or neighborhood name, sometimes some text followed by an ID number:
king_raw %>%
count(Precinct) %>%
head(10)
## # A tibble: 10 x 2
## Precinct n
## <chr> <int>
## 1 ADAIR 237
## 2 ALDARRA 257
## 3 ALDER SPRINGS 237
## 4 ALDERWOOD 237
## 5 ALG 30-0013 250
## 6 ALG 30-0014 250
## 7 ALG 30-3141 250
## 8 ALPINE 251
## 9 AMES LAKE 251
## 10 ANGEL CITY 244
king_raw %>%
count(Precinct) %>%
tail(10)
## # A tibble: 10 x 2
## Precinct n
## <chr> <int>
## 1 WOD 45-3192 243
## 2 WOD 45-3193 243
## 3 WOD 45-3270 243
## 4 WOD 45-3530 243
## 5 WOLF 251
## 6 WOODSIDE 250
## 7 WYNOCHE 244
## 8 WYNONA 257
## 9 YPT 48-1233 244
## 10 YPT 48-1234 244
precinct_count <- king_raw %>%
distinct(Precinct) %>%
nrow()
There are 2517 distinct values appearing the Precinct
column. Sounds about right for King County?
Race
appears to contain the specific races, some of which are positions candidates are running for, some of which are local propositions, and some are miscellaneous Tim Eyman ballot bloating garbage (“Advisory Vote of the People”):
king_raw %>%
count(Race) %>%
head(10)
## # A tibble: 10 x 2
## Race n
## <chr> <int>
## 1 Advisory Vote 14 15102
## 2 Advisory Vote 15 15102
## 3 Attorney General 17619
## 4 Auburn School District No. 408 Proposition 1 522
## 5 City of Bellevue Proposition 1 1014
## 6 City of Bellevue Proposition 2 1014
## 7 City of Bothell Advisory Proposition 1 156
## 8 City of Bothell Proposition 1 156
## 9 City of Duvall Advisory Vote 1 42
## 10 City of Duvall Proposition 1 42
king_raw %>%
count(Race) %>%
tail(10)
## # A tibble: 10 x 2
## Race n
## <chr> <int>
## 1 State Treasurer 17619
## 2 Superintendent of Public Instruction 17619
## 3 Superior Court Judge Position 14 17619
## 4 Superior Court Judge Position 26 17619
## 5 Superior Court Judge Position 31 17619
## 6 Superior Court Judge Position 44 17619
## 7 Superior Court Judge Position 52 17619
## 8 Superior Court Judge Position 53 17619
## 9 US President & Vice President 30204
## 10 US Senator 17619
LEG
, CC
, and CG
appear to be the numbers for legislative district, King County Council district, and Congressional district. We could argue that LEG
, CC
, and CG
should be converted to character because their numerical values are irrelevant, but we’re not planning on using this information, so nothing bad will happen by being lazy and not changing it from integer to character. I could say col_types = "cccccccci"
when using read_csv
, though, to get all the column types perfect.
We note LEG
and CC
values are missing on over 200 rows:
king_raw %>%
select(LEG, CC, CG) %>%
summary()
## LEG CC CG
## Min. : 1.00 Min. :1.00 Min. :1.00
## 1st Qu.:32.00 1st Qu.:3.00 1st Qu.:7.00
## Median :37.00 Median :5.00 Median :7.00
## Mean :34.91 Mean :4.86 Mean :7.03
## 3rd Qu.:45.00 3rd Qu.:7.00 3rd Qu.:9.00
## Max. :48.00 Max. :9.00 Max. :9.00
## NA's :224 NA's :224
Let’s look at a sample:
king_raw %>%
filter(is.na(LEG)) %>%
head(10)
## # A tibble: 10 x 9
## Precinct Race LEG CC CG CounterGroup Party CounterType SumOfCount
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 ELECTION~ Adviso~ NA NA 9 Total NP Maintained 10
## 2 ELECTION~ Adviso~ NA NA 9 Total NP Registered~ 58
## 3 ELECTION~ Adviso~ NA NA 9 Total NP Repealed 16
## 4 ELECTION~ Adviso~ NA NA 9 Total NP Times Blan~ 18
## 5 ELECTION~ Adviso~ NA NA 9 Total NP Times Coun~ 44
## 6 ELECTION~ Adviso~ NA NA 9 Total NP Times Over~ 0
## 7 ELECTION~ Adviso~ NA NA 9 Total NP Maintained 15
## 8 ELECTION~ Adviso~ NA NA 9 Total NP Registered~ 58
## 9 ELECTION~ Adviso~ NA NA 9 Total NP Repealed 13
## 10 ELECTION~ Adviso~ NA NA 9 Total NP Times Blan~ 16
Interesting. It looks like there’s a Precinct
called "ELECTIONS OFFICE"
which has some counts listed, but sure doesn’t sound like a real precinct. We could think about dropping rows for this Precinct
since it seems like might be a mistake, though it won’t make a big difference.
CounterGroup
only has one value (Total
) and is useless. A waste of a column!
king_raw %>%
count(CounterGroup)
## # A tibble: 1 x 2
## CounterGroup n
## <chr> <int>
## 1 Total 643163
Party
contains values for the political parties involved in each race:
king_raw %>%
count(Party) %>%
arrange(desc(n))
## # A tibble: 17 x 2
## Party n
## <chr> <int>
## 1 NP 565031
## 2 Dem 29412
## 3 Rep 27089
## 4 Lib 3276
## 5 CPN 2517
## 6 DPN 2517
## 7 GPN 2517
## 8 LPN 2517
## 9 RPN 2517
## 10 SPN 2517
## 11 SWN 2517
## 12 NOP 174
## 13 PAR 174
## 14 DFW 168
## 15 GP 166
## 16 ID 52
## 17 Dcr 2
We see quite a few different values here. "Dem"
and "Rep"
stand out as being Democrats and Republicans, and "NP"
is so common that it must mean “no party” or “non-partisan”. Most of the other values have exactly the same number of values as there are distinct precincts, so these are probably Presidential or Senate candidates from third parties if every precinct is voting for them. One might suspect that "DPN"
, for instance, is the Democratic Party National. The national party is separate from the state parties, so this makes sense.
king_raw %>%
count(CounterType) %>%
head(20)
## # A tibble: 20 x 2
## CounterType n
## <chr> <int>
## 1 Adam Smith 755
## 2 Alvin Rutledge 106
## 3 Alyson Kennedy & Osborne Hart 2517
## 4 Andrew Pilloud 209
## 5 Anthony Gipe 2517
## 6 Approved 5533
## 7 Barbara Madsen 2517
## 8 Barry Knowles 136
## 9 Benjamin Judah Phelps 174
## 10 Bill Bryant 2517
## 11 Bob Ferguson 2517
## 12 Bob Hasegawa 141
## 13 Brady Pinero Walkinshaw 1001
## 14 Brian J. Todd 298
## 15 Brooke Valentine 136
## 16 Cathy Moore 2517
## 17 Chad Magendanz 168
## 18 Charles (Charlie) Wiggins 2517
## 19 Chris Reykdal 2517
## 20 Chris Vance 2517
Notice something odd about CounterType in particular? It tells you what a given row of votes was for… but it also has
Registered Voters
andTimes Counted
. What isTimes Counted
?
CounterType
appears to contain a mix of candidate names or position votes (e.g. "Bob Ferguson"
, "Approved"
, "No"
) and overall summaries for the particular race in the precinct (e.g. "Registered Voters"
, "Times Blank Voted"
, "Times Counted"
).
I would say this column has a lot of the info we want, and that it is currently structured “too long” since all of these are different rows corresponding to the same precinct for the same race. Some of these are really summaries, like the number of registered (eligible) voters. What is "Times Counted"
though? Maybe the next column will give a hint?
SumOfCount
appears to be just counts of votes (or registered voters) associated with CounterType
. This is the numeric information we want to use, and there are no missing values:
king_raw %>%
select(SumOfCount) %>%
summary()
## SumOfCount
## Min. : 0.0
## 1st Qu.: 4.0
## Median : 121.0
## Mean : 198.5
## 3rd Qu.: 330.0
## Max. :1126.0
If we take a look CounterType
and SumOfCount
at the same time, we can figure out "Times Counted"
is.
king_raw %>%
select(Precinct, Race, CounterType, SumOfCount)
## # A tibble: 643,163 x 4
## Precinct Race CounterType SumOfCount
## <chr> <chr> <chr> <dbl>
## 1 ADAIR Advisory Vote 14 Maintained 183
## 2 ADAIR Advisory Vote 14 Registered Voters 519
## 3 ADAIR Advisory Vote 14 Repealed 251
## 4 ADAIR Advisory Vote 14 Times Blank Voted 51
## 5 ADAIR Advisory Vote 14 Times Counted 485
## 6 ADAIR Advisory Vote 14 Times Over Voted 0
## 7 ADAIR Advisory Vote 15 Maintained 220
## 8 ADAIR Advisory Vote 15 Registered Voters 519
## 9 ADAIR Advisory Vote 15 Repealed 222
## 10 ADAIR Advisory Vote 15 Times Blank Voted 43
## # ... with 643,153 more rows
For the first Precinct
and Race
, we see "Times Counted"
is 485. This is equal to the sum of SumOfCount
for everything except "Registered Voters"
. It is the total number of ballots cast! This will be handy.
We will focus on only the three major executive races in Washington in 2016:
- President (and Vice-President)
- Governor
- Lieutenant Governor
With these races, we are interested in: 1. Turnout rates for each of these races in each precinct. We will measure turnout as times votes were counted (including for a candidate, blank, write-in, or “over vote”) divided by the number of registered voters. 2. Differences between precincts in Seattle and precincts elsewhere in King County. Again, these data are not documented, so you will have to figure out how to do this. 3. Precinct-level support for the Democratic candidates in King County in 2012 for each contest. We will measure support as the percentage of votes in a precinct for the Democratic candidate out of all votes for candidates or write-ins. Do not include blank votes or “over votes” (where the voter indicated multiple choices) in the overall vote count for the denominator.
You will perform most of the data management for #1 and #2 in Part 1. Part 2 will contain most of the work for #3 and also covers visualizing results.
The goal to accomplish over Parts 1 and 2 will be to get the data to one row per precinct with the following 7 columns:
- Precinct identifier
- Indicator for whether the precinct is in Seattle or not
- Precinct size in terms of registered voters
- Turnout rate
- Percentage Democratic support for President
- Percentage Democratic support for Governor
- Percentage Democratic support for Lieutenant Governor
The sections below describe steps you may want to do to get your data organized, and provide some hints and suggestions for methods, in particular using
dplyr
andtidyr
.
For what we want to do, there are a lot of rows that are not useful. We only want ones pertaining to races for President, Governor, and Lieutenant Governor. So let’s trim everything down. You will want to see how these things show up in the data. The easiest way may be to (1) display every unique value of
Race
and find which ones match our races of interest, then (2) filter the data to those races.
# info on the distinct races
races <- king_raw %>%
select(Race) %>%
distinct(Race) %>%
arrange(Race)
# print it out as a character vector
as.character(races$Race)
## [1] "Advisory Vote 14"
## [2] "Advisory Vote 15"
## [3] "Attorney General"
## [4] "Auburn School District No. 408 Proposition 1"
## [5] "City of Bellevue Proposition 1"
## [6] "City of Bellevue Proposition 2"
## [7] "City of Bothell Advisory Proposition 1"
## [8] "City of Bothell Proposition 1"
## [9] "City of Duvall Advisory Vote 1"
## [10] "City of Duvall Proposition 1"
## [11] "City of Issaquah Proposition 1"
## [12] "City of Kenmore Proposition 1"
## [13] "City of Seattle Initiative Measure 124"
## [14] "City of Shoreline Proposition 1"
## [15] "City of Snoqualmie Proposition 1"
## [16] "City of Tukwila Proposition 1"
## [17] "Commissioner of Public Lands"
## [18] "Congressional District 1"
## [19] "Congressional District 7"
## [20] "Congressional District 8"
## [21] "Congressional District 9"
## [22] "Court of Appeals Judge Position 1"
## [23] "District Court Southwest Electoral District Judge Position 3"
## [24] "District Court West Electoral District Judge Position 1"
## [25] "District Court West Electoral District Judge Position 4"
## [26] "Governor"
## [27] "Highline School District No. 401 Proposition 1"
## [28] "Initiative Measure 1433"
## [29] "Initiative Measure 1464"
## [30] "Initiative Measure 1491"
## [31] "Initiative Measure 1501"
## [32] "Initiative Measure 732"
## [33] "Initiative Measure 735"
## [34] "Insurance Commissioner"
## [35] "Kent School District No. 415 Proposition 1"
## [36] "King County Charter Amendment 1"
## [37] "King County Charter Amendment 2"
## [38] "King County Fire Protection District 27 Proposition 1"
## [39] "Legislative District 1 Representative Position 1"
## [40] "Legislative District 1 Representative Position 2"
## [41] "Legislative District 1 State Senator"
## [42] "Legislative District 11 Representative Position 1"
## [43] "Legislative District 11 Representative Position 2"
## [44] "Legislative District 11 State Senator"
## [45] "Legislative District 30 Representative Position 1"
## [46] "Legislative District 30 Representative Position 2"
## [47] "Legislative District 31 Representative Position 1"
## [48] "Legislative District 31 Representative Position 2"
## [49] "Legislative District 32 Representative Position 1"
## [50] "Legislative District 32 Representative Position 2"
## [51] "Legislative District 33 Representative Position 1"
## [52] "Legislative District 33 Representative Position 2"
## [53] "Legislative District 34 Representative Position 1"
## [54] "Legislative District 34 Representative Position 2"
## [55] "Legislative District 36 Representative Position 1"
## [56] "Legislative District 36 Representative Position 2"
## [57] "Legislative District 36 State Senator"
## [58] "Legislative District 37 Representative Position 1"
## [59] "Legislative District 37 Representative Position 2"
## [60] "Legislative District 39 Representative Position 1"
## [61] "Legislative District 39 Representative Position 2"
## [62] "Legislative District 39 State Senator"
## [63] "Legislative District 41 Representative Position 1"
## [64] "Legislative District 41 Representative Position 2"
## [65] "Legislative District 41 State Senator"
## [66] "Legislative District 43 Representative Position 1"
## [67] "Legislative District 43 Representative Position 2"
## [68] "Legislative District 45 Representative Position 1"
## [69] "Legislative District 45 Representative Position 2"
## [70] "Legislative District 46 Representative Position 1"
## [71] "Legislative District 46 Representative Position 2"
## [72] "Legislative District 47 Representative Position 1"
## [73] "Legislative District 47 Representative Position 2"
## [74] "Legislative District 48 Representative Position 1"
## [75] "Legislative District 48 Representative Position 2"
## [76] "Legislative District 5 Representative Position 1"
## [77] "Legislative District 5 Representative Position 2"
## [78] "Legislative District 5 State Senator"
## [79] "Lieutenant Governor"
## [80] "Proposed Eastside Regional Fire Authority Proposition 1"
## [81] "Secretary of State"
## [82] "Senate Joint Resolution 8210"
## [83] "Sound Transit Proposition 1"
## [84] "South King Fire and Rescue Proposition 1"
## [85] "State Auditor"
## [86] "State Supreme Court Justice Position 1"
## [87] "State Supreme Court Justice Position 5"
## [88] "State Supreme Court Justice Position 6"
## [89] "State Treasurer"
## [90] "Superintendent of Public Instruction"
## [91] "Superior Court Judge Position 14"
## [92] "Superior Court Judge Position 26"
## [93] "Superior Court Judge Position 31"
## [94] "Superior Court Judge Position 44"
## [95] "Superior Court Judge Position 52"
## [96] "Superior Court Judge Position 53"
## [97] "US President & Vice President"
## [98] "US Senator"
Governor
, Lieutenant Governor
, and US President & Vice President
are the ones we want. These are in positions 26, 79, and 97 of my sorted races
output, respectively, so I will make a character vector holding those values specifically for easier subsetting.
# make a character vector of relevant races
(rel_races <- races$Race[c(26, 79, 97)])
## [1] "Governor" "Lieutenant Governor"
## [3] "US President & Vice President"
# subset the data to relevant races
king_rel_races <- king_raw %>%
filter(Race %in% rel_races)
We want to determine which precincts are in Seattle and which are not. You should look at values of the
Precinct
variable and see if you can figure out what uniquely identifies Seattle precincts. Hint: All Seattle tracts have the same naming scheme… but some non-Seattle tracts are similar so be careful!
You will then want to create a binary variable that identifies Seattle tracts (for instance, with values
"Seattle"
and"Not Seattle"
). One approach: You can usestringr::str_sub()
or base R’ssubstr()
to grab a number of characters—a sub-string—from text (say, to test if they equal something); if you use this withifelse()
insidemutate()
you can make a new variable based on whether the sub-string ofPrecinct
equals a value.
How can I figure out which precincts are in Seattle? I’m going to make a dataset with the whole list and eyeball it to see if anything jumps out.
precincts <- king_rel_races %>%
select(Precinct) %>%
distinct(Precinct) %>%
arrange(Precinct)
I’ve omitted the full list for space here, but if one scrolls to the “S” section, it looks like Seattle precincts all start with SEA
followed by a space and a precinct number. Looking at a map on the King County website and zooming in enough to see the precinct numbers confirms it. Precincts near but not in Seattle like in Shoreline to the north or Tukwila to the south have a different naming system. Thus, I am confident that identifying Seattle precincts as those whose first four characters are SEA
will work to flag those. Just the three characters SEA
on its own won’t – there are precincts called SEALTH
, SEAN
, and SEAVIEW
we don’t want to flag as in Seattle.
One way to proceed is to use the substr()
function (seen in Week 4 when checking if the second letter of some first names was “a”) to pull out the first four characters of Precinct
and check if they are equal to SEA
.
king_flag <- king_rel_races %>%
mutate(Location = ifelse(substr(Precinct, start = 1, stop = 4) == "SEA ",
"Seattle",
"Not Seattle"))
An alternative way is to use the str_detect()
function in stringr
. I haven’t introduced this yet, but it is convenient in cases like this. str_detect()
returns a TRUE
when it encounters a value containing a term we search for. It accepts regular expressions as search terms. The term "^SEA "
says “start at the beginning of each value, then look exclusively for S, E, A, and then a space.”
king_flag_alt <- king_rel_races %>%
mutate(Location = ifelse(stringr::str_detect(Precinct, "^SEA "),
"Seattle",
"Not Seattle"))
Do we get the same answer either way? Let’s sum how many times the approaches disagree.
sum(king_flag$Location != king_flag_alt$Location)
## [1] 0
Both ways give the same answers, so we’re all good!
We want to calculate turnout rates as total votes (including normal votes, blank votes, over votes, write-ins) for the Presidential race divided by registered voters. \(Turnout = \frac{Total Votes}{Registered Voters}\). Hint: You will want to look at
CounterType
andSumOfCount
at the same time, within eachPrecinct
andRace
. Examine how theSumOfCount
values forCounterType
value"Times Counted"
relate to all the otherCounterType
values.
First, I see there is a value in CounterType
called “Times Counted”. It would be nice if this was the numerator we were after. I’m going to check this by summing SumOfCount
up within each precinct and race of interest over all the rows besides where CounterType
is “Registered Voters” or “Times Counted”. Then I’ll compare these to the “Times Counted” rows. We’ll use joins to do this:
# sum over rows besides "Registered Voters" or "Times Counted"
# within each precinct and race
times_counted_manual <- king_flag %>%
select(Precinct, Race, CounterType, SumOfCount) %>%
filter(CounterType != "Registered Voters" &
CounterType != "Times Counted") %>%
group_by(Precinct, Race) %>%
summarize(votes_added_up = sum(SumOfCount))
## `summarise()` regrouping output by 'Precinct' (override with `.groups` argument)
head(times_counted_manual)
## # A tibble: 6 x 3
## # Groups: Precinct [2]
## Precinct Race votes_added_up
## <chr> <chr> <dbl>
## 1 ADAIR Governor 485
## 2 ADAIR Lieutenant Governor 485
## 3 ADAIR US President & Vice President 485
## 4 ALDARRA Governor 625
## 5 ALDARRA Lieutenant Governor 625
## 6 ALDARRA US President & Vice President 625
# now just grab the "Times Counted" rows and merge
times_counted_compare <- king_flag %>%
select(Precinct, Race, CounterType, SumOfCount) %>%
filter(CounterType == "Times Counted") %>%
# rename the column on filtered data for clarity
rename(times_counted_value = SumOfCount) %>%
left_join(times_counted_manual,
by = c("Precinct", "Race")) %>%
# compute differences
mutate(diff = times_counted_value - votes_added_up)
summary(times_counted_compare$diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
They’re always the same! That means “Times Counted” is indeed including every possible kind of vote for each race, such as blanks, write-ins, accidental over-votes, or your usual ones. Now we can make a data frame that has registered voters and turnout rates (for the Presidential race) for each precinct:
turnout_rates <- king_flag %>%
# filter to just the presidential election
filter(Race == rel_races[3]) %>%
# filter to just registered voters or times counted
filter(CounterType %in% c("Registered Voters", "Times Counted")) %>%
# just the columns we want
select(Precinct, Location, CounterType, SumOfCount) %>%
# use spread to put the two counts on the same row for each precinct
pivot_wider(names_from = CounterType, values_from = SumOfCount) %>%
# use new columns to compute turnout rate
mutate(Turnout = `Times Counted` / `Registered Voters`)
head(turnout_rates)
## # A tibble: 6 x 5
## Precinct Location `Registered Voters` `Times Counted` Turnout
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ADAIR Not Seattle 519 485 0.934
## 2 ALDARRA Not Seattle 763 625 0.819
## 3 ALDER SPRINGS Not Seattle 557 476 0.855
## 4 ALDERWOOD Not Seattle 472 404 0.856
## 5 ALG 30-0013 Not Seattle 497 319 0.642
## 6 ALG 30-0014 Not Seattle 525 364 0.693
That’s it for Part 1!
We want to get measures of democratic support in each Precinct for each of our three races. You are asked to measure support as the percentage of votes in a precinct for the Democratic candidate out of all votes for candidates or write-ins, but this time do not to include blank votes or “over votes” (where the voter indicated multiple choices) in the overall vote count for the denominator. Hint: A good approach here is to compute the total votes (denominator) for each precinct, and then merge (e.g.
left_join()
) on the Democratic vote count for each race (numerator) and divide by the total votes. That is, \(Dem Support = \frac{Dem Count}{Total Votes}\).
You will probably want to follow a process like this:
- Make a new dataframe with the total number of votes cast for any actual candidates (including
"Write-In"
) in each precinct and race. Hint: You will likely want to usefilter()
followed bygroup_by()
andsummarize()
using theSumOfCount
variable.
- Make another dataframe with the total number of votes for democratic candidates in each precinct and race. You will want to check the
Party
of candidates and work only with the democratic observations to get these vote counts. Hint: There are different democratic parties for different races (e.g."Dem"
and"DPN"
).
- Merge the total votes data with the democratic votes data, then calculate a percent democratic votes variable for each race.
# [YOUR CODE HERE]
Once you’ve calculated democratic voting percentages for each race you’ll want to put them back together with the precinct turnout rate data using a join. Then you will want to make sure your data are shaped as I recommend above: One row per precincts, with columns for each of the relevant measures. If your data are in a format where you have a row for each race within each precinct (“long format”), you may find the
pivot_wider()
command useful for turning multiple rows for each precinct into single precinct rows with different columns for each race.
# [YOUR CODE HERE]
Make a scatterplot where the horizontal axis (
x=
) is number of registered voters in the precinct, and the vertical axis (y=
) is turnout rate. Color (color=
) the precincts in Seattle one color, and use a different color for other precincts. Do you observe anything?
# [YOUR CODE HERE]
Now let’s visualize the Democratic support rates for the three races within each precinct for sufficently large precincts. Limit the data to precincts with at least 500 registered voters (use
filter()
). Make a line plot where the horizontal axis (x=
) indicates precincts, and the vertical axis (y=
) shows the Democratic support rates. There should be three lines in different colors (one for each race of interest).
Do not label the precincts on the horizontal axis:
scale_x_discrete(breaks=NULL)
is one method for doing this. You should, however, arrange them on the axis in order from smallest to largest in terms of support for the Democratic candidate for president—that is, the line plotting percentage support for Obama should be smoothly increasing from left to right. The order of the lines in the legend should follow the order of the lines at the right edge of the plot.
To do this, we need to use the “wide” version of the data (one row per precinct), and order
Precinct
based on Democratic support for the Presidential race (Hint: You will probably want to useforcats::fct_reorder()
onPrecinct
). Then we can reshape back from “wide” to “tidy” form usingpivot_longer()
so that we have one variable giving the race—and another giving vote percentage—and can plot a separate line for each race.
# [YOUR CODE HERE]