class: center, top, title-slide .title[ # Quantitative Data and Relationships ] .subtitle[ ## CRM Week 2 ] .author[ ### Charles Lanfear ] .date[ ### 22 Oct 2024
Updated: 17 Oct 2024 ] --- # Last Week ## Was it interesting? ## Was it almost fun? --- class: inverse .text-big-center[ Sorry about today! ] .footnote[ At least there's no calculus! Haha, got ya, IQA students! ] --- # Today * Populations and Samples * Crime Data * Measurement * Distributions * Conditional Distributions --- class: inverse # Populations and Samples ![:width 70%](img/sampling.png) --- # Terminology **Population**: All members of some given set of interest * Our goal is usually to learn something about this group -- **Sample**: A subset of the population selected for study -- **Sampling**: Process of selecting members of population into the sample -- **Sampling Frame**: Members of population eligible to be in your sample * The *actual* members we're drawing the sample from * Ideally as similar as possible to the population -- **Observation**: One of the sampled units -- **Variable**: A measure describing the observation(s) -- .text-center[ *Your data will contain only observations and variables, but their value is determined by everything else* ] --- # Example > To estimate the level of crime victimisation in the United Kingdom, the British Crime Survey surveys a random sample of 3,000 UK household residents per month and asks them about incidents of victimisation (e.g., burglary) in the past year. -- **Population**: Everyone in the United Kingdom -- **Sample**: 3000 household residents -- **Sampling**: Random selection from all UK households -- **Sampling Frame**: People living in a known household -- **Observation**: A person -- **Variable**: Past year burglary victimisation -- .text-center[ *What would these data be bad for?* ] --- # Types of Statistics .pull-left[ ### Descriptive * Describing a sample * Summarizing variables ] .pull-right[ ### Inferential * Learning about the population * Testing hypotheses ] -- To make **inferences** about populations using samples, your sample needs to be **representative** of your population—at least in the ways that matter -- Samples are **representative** when they have a similar composition to the population -- .text-center[ *Next week we'll be talking about inferential statistics* ] --- class: inverse background-color: #111111 # Crime Data .pull-left[ ![](img/shr_rates_anim.gif) ] .pull-right[ ![](img/shr_age_distribution.gif) ] --- ## Police Data An example: `data.police.uk` * All police-recorded crime in England, Wales, and Northern Ireland * A *population*! * Tens/hundreds of thousands of observations * Geolocation * Small area estimates possible -- .pull-left[ ### Advantages: * Captures wide array of offenses * Often easily available * Often *all* that is available ] .pull-right[ ### Challenges: * Misses unreported crime * *Reporting* varies widely * Quality varies widely ] --- ## The Dark Figure of Crime .pull-left-60[ Not all crimes are discovered * Victimless crimes * Missing person homicides ] .pull-right-40[ ![](img/dark_figure.jpg) ] --- count: false ## The Dark Figure of Crime .pull-left-60[ Not all crimes are discovered * Victimless crimes * Missing person homicides Not all crimes are reported * Varies by place * Varies by individuals * Varies by crime ] .pull-right-40[ ![](img/dark_figure.jpg) ] --- count: false ## The Dark Figure of Crime .pull-left-60[ Not all crimes are discovered * Victimless crimes * Missing person homicides Not all crimes are reported * Varies by place * Varies by individuals * Varies by crime Not all reported are recorded by police * Varies by agency and crime * Hierarchy rules * Political incentives * Differential treatment ] .pull-right-40[ ![](img/dark_figure.jpg) ] --- ## An Example Observed Crime Rate = Actual Rate ∗ Probability of Reporting > According to the U.C.R., the incidence of rape nearly doubled from 1973 to 1990. The N.C.V.S., by contrast, shows that it declined by around forty per cent during the same period. Researchers... found that the upward trend in the U.C.R. data correlated with upticks in the number of female police officers, and with the advent of rape crisis centers and reformed investigative styles. It could be, in short, that a modernized approach to the policing of rape drastically increased the frequency with which it was reported while reducing its incidence. [Matthew Hutson. 2020. "The Trouble with Crime Statistics." The New Yorker](https://www.newyorker.com/culture/annals-of-inquiry/the-trouble-with-crime-statistics) -- .text-center[ *Note the complex causal process generating the data!* ] --- ## Victimization Data .pull-left[ ### Advantages: * Captures unreported crime * Can get at exposure ] .pull-right[ ### Challenges: * Reluctance to answer * Hidden populations * Expense * Doesn't capture crime without victim ] -- An Example: **British Crime Survey** * Self-reported victimisation * Sample of 3,000 per month * Great for nationwide estimates * Too small for area estimates --- ## Self-Reported Offense Data .pull-left[ ### Advantages: * Captures unreported crime * Captures crime without victim * Can get at motivation ] .pull-right[ ### Challenges: * Reluctance to answer * Hidden, even dangerous, populations * Many crimes very rare * Expense ] -- An Example: **PADS+** * 716 youth in/around Peterborough * Representative sample at age 12 * Repeated follow-up—with incredible *retention* * Detailed information on life and context --- # Ideal Data *Ideal criminological data would cover every potential crime and what occurred as a result* -- .pull-left[ That is, for a given opportunity, was the crime... * Attempted? * Successful? * Discovered? * Reported to police? ] -- .pull-right[ And was the offender...<br><br> * Seen or found? * Informally sanctioned? * Arrested? * Incarcerated? ] -- *These are all important but nearly never all observed, especially not for a large number of incidents* --- class: inverse # Measurement ![:width 28%](img/craniograph.jpg) --- # What is Measurement? * Measurement is the process of connecting an abstract concept to empirical data -- * Research questions must be testable with empirical data -- * Testing thus relies heavily on measurement -- * No research design can overcome bad measures --- # Types of Measures ‍**Observables**: Objective, externally measurable variables * Individual height, occupation, stated beliefs * Neighborhood population, geographical area, crime rate<sup>1</sup> .footnote[[1] Hard to measure doesn't mean unobservable!] *Observables can be directly seen and/or experienced* -- <br> **Unobservables** (or **Constructs**): Subjective, internal, or only indirectly measurable * Individual self-control, wellness, social class, actual beliefs * Neighborhood social capital, disadvantage, collective efficacy *Unobservables do not exist in the real world: they are conceptual* -- .text-center[ In Criminology, we're *very often* interested in unobservables. ] --- # Measurement Principles ‍**Conceptualization**: *What is X?* * Define what a measure constitutes * Define relations to other concepts * Goal: Minimize ambiguity -- ‍**Operationalization**: *How do I measure X?* * Define observable(s)—**indicators**—that are related to the concept * If single indicator, it is a **proxy** for the concept + e.g. *Years of School* is a proxy for *Education* * If multiple indicators, they can be combined into *composite* measure<sup>1</sup> -- When reading articles, pay attention to how things are measured! .footnote[[1] Composites include things like indices, scales, factors, and principle components. Turning multiple measures into one composite is sometimes called *dimension reduction*.] --- # Individual Example ### Self-Control ‍Concept: The ability to delay gratification, tolerate frustration, and carefully consider before acting. -- ‍Indicators: *Would you agree, neither agree nor disagree, or disagree that you...* * "Get upset when you have to wait for something?" * "Act without stopping to think?" * "Like to do daring things?" * "Are impatient—want to have things right away?" * "Are careful about what you do?" -- ‍Assumption: *Shared variation in the indicators represents underlying self-control.* --- # Neighborhood Example ### Child-Centered Informal Social Control ‍Concept: The shared neighborhood norms and expectations for intervening against child misbehavior. ??? This is a complex one that we'll see a bunch in social disorg--mainly collective efficacy Idea is capturing neighborhood capacity to control child behavior in public spaces Isn't about what respondent would do--is about what people around would do -- ‍Indicators: *How likely<sup>1</sup> is it that people in your neighborhood would stop it if...* * "a group of neighborhood children were skipping school and hanging out on a street corner." * "some children were spray-painting graffiti on a local building." * "children were fighting out in the street." * "a child was showing disrespect to an adult." .footnote[[1] (1) Very Likely, (2) Likely, (3) Unlikely, (4) Very Unlikely] -- ‍Assumption: *Shared variation in the measures represents underlying expectations for social control of children.* --- # Estimation is hard! <br> ![:width 80%](img/measurement_representation.gif) .footnote[ From [Groves & Lyburg (2010) "Total Survey Error: Past, Present, and Future"](https://doi.org/10.1093/poq/nfq065) ] --- class: inverse # Types of Quantitative Data ![:width 70%](img/data.jpg) --- ## Continuous Data **Continuous** data take ordered, evenly-spaced, numeric values: * The difference between 1cm and 2cm is the same as 5cm and 6cm * The difference between 20C and 25C is the same as 25C and 30C -- If they have a *true non-arbitrary zero*, they are **ratio** data * If something is 0cm long, it has *no length* (ratio!) * If the temperature is 0C, there isn't *no temperature* (not ratio!) * The 0 is arbitrary: In Fahrenheit it is 32 degrees! -- * Multiplication only works with ratio data * 15cm is half as long as 30cm * 15C is not half as hot as 30C * 15C is 59F is 288K<sup>1</sup> * 30C is 86F is 303K .pull-right[ .footnote[[1] Kelvin *is* ratio though!] ] --- ## Discrete or Count Data **Counts** are a special type of numeric data: * True zero like ratio data * Strictly non-negative * Only whole (integer) values -- Events or incidents are counts * You can't have negative crimes or half a crime. -- We create **rates** by dividing counts by an **exposure** variable like population, time, or area<sup>1</sup> * **Exposures** measure what is "at risk" for an event * e.g., Homicides per 100,000 people .footnote[[1] Rates have to be used carefully—we'll talk about this in IQA later in the term] --- ## Ordinal Data **Ordinal** data have values that can be ranked or ordered, but have no units or meaningful intervals -- A common **Likert scale** item: * Strongly Agree * Disagree * Neither Agree nor Disagree * Disagree * Strongly Disagree -- The difference between "strongly agree" and "disagree" and between "disagree" and "neither agree nor disagree" may not be the same -- This means you cannot add/subtract or multiply/divide ordinal data -- You also can't easily compare one person's ordinal value to another's * There's no objective reference point --- ## Nominal (Categorical) Data **Nominal** or **categorical** data can't even be ranked. Every value is *qualitatively*, rather than quantitatively, different * Gender * Nationality * Occupation -- Sometimes we use nominal data to create other types: * Occupation into ranked socio-economic status or prestige * Nationality into high/middle/low income country -- **Binary** or **dichotomous** data are a special case of nominal or ordinal data when there are only two categories * e.g., alive or dead --- class: inverse # Describing Data ![:width 70%](img/distributions.png) --- # Central Tendency Measures of **central tendency** are ways to summarize a variable mathematically using a (usually) single number -- **Mean** * The "average" or "expected value" * Defined only for continuous data -- **Median** * The "middle" value or 50th **percentile** * Defined for continuous or ordinal data -- **Mode** * The most common single value<sup>1</sup> * Defined for all data .footnote[[1] For continuous data, the mode is the highest peak of probability density function] --- # Dispersion Measures of **dispersion** are ways to summarize how spread out data are in a variable -- We typically only define these for continuous data -- **Standard Deviation** * Roughly the average difference between data values and the mean * Low values indicate data are clustered around the mean * High values indicate data are spread away from the mean -- **Variance**<sup>1</sup> * The square of the standard deviation * Not typically interpretable * Mainly useful in calculations .footnote[[1] Note there are different variance formulae for the population (divide by N) and sample (divide by N-1). We nearly always use the second one.] --- # Probability Distributions **Probability distributions** are mathematical functions that give the **probability** of observing each value * This is just how likely it is for a variable to take any given value * Common notation: `\(P(X)\)` (probability of `\(X\)`) -- **Empirical distributions** or **sample distributions** * The set of all observed values of a variable -- **Theoretical distributions** * Imagined or expected frequencies based on mathematical parameters * Not based on empirical data * We often compare empirical distributions to theoretical ones -- .text-center[ *We'll talk about theoretical distributions next week—we'll focus on empirical ones today* ] --- ## Categorical Because they take a limited number of values, we typically summarize categorical data by the count or proportion of each value -- .text-72[ |Type of Crime | Frequency| Proportion| |:-------------|---------:|----------:| |Assault | 89| 0.36| |Robbery | 24| 0.10| |Theft | 137| 0.55| ] -- Bar plots are a good way to visualize categorical distributions ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-3-1.svg)<!-- --> --- ## Ordinal Ordinals can be summarized similarly to categoricals, but they have a median—where the cumulative proportion (percentile) passes 0.50. .text-72[ |Fear of Crime | Frequency| Proportion| Cumulative Proportion| |:-------------|---------:|----------:|---------------------:| |Very Low | 25| 0.08| 0.08| |Low | 100| 0.32| 0.40| |Moderate | 75| 0.24| 0.64| |High | 60| 0.19| 0.83| |Very High | 50| 0.16| 0.99| ] -- ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-5-1.svg)<!-- --> --- ## Continuous Continuous data can take a limitless number of values, so we summarize them with measures of central tendency and dispersion. | Mean| Median| Mode| Std. Dev.| N| |------:|------:|-----:|---------:|---:| | 131.14| 78.68| 33.57| 153.32| 400| -- We visualize continuous distributions with **histograms** and **density plots** .pull-left[ ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-7-1.svg)<!-- --> ] .pull-right[ ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-8-1.svg)<!-- --> ] --- # Visualise Your Data These all have the same mean, median, and standard deviation! |Distribution |Mean |Median |Std. Dev | |:------------|:----|:------|:--------| |Normal |1.0 |1.0 |1.0 | |Poisson |1.0 |1.0 |1.0 | |Uniform |1.0 |1.0 |1.0 | ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-10-1.svg)<!-- --> --- class: inverse # Quantitative Relationships ![:width 80%](img/age-crime-curve.png) --- # Types of Relationships ‍**Association**: *X and Y tend to "go together"* * e.g., ice cream sales and violent crime rise at same time * *May* imply common causes, e.g. temperature * *Does not* imply banning ice cream will reduce crime or vice versa * Purely observational—makes no assumptions -- ‍**Effect**: *X causes Y* * e.g., clearing a vacant lot reduces nearby violence * Implies *direction* and *cause*: * Clearing more lots will reduce violence there * Reducing violence will not clear lots * Requires strong assumptions to **identify** -- .text-center[ *For now, we'll focus on associations* ] --- # Distributions Associations are really about **conditional probabilities** * The probability of `\(Y\)` taking a particular value when `\(X\)` takes a particular value * Notation: `\(P(Y|X)\)` is "Probability of Y given X" * e.g., if a drug possession suspect is white, their probability of arrest is 0.3. * `\(P(arrest|white) = 0.3\)` -- A **conditional distribution** is the distribution a variable takes when a different variable takes some particular value -- A **joint distribution** is the full distribution of *both* variables * i.e., the **conditional distributions** of each variable at every value of the other variable -- .text-center[ *These are hard to think about but intuitive in action* ] --- # Cross-Tabs The simplest joint distributions are those between categorical variables, usually displayed as **cross-tabulations** .pull-left[ |Sex | Non-Violent| Violent| |:------|-----------:|-------:| |Female | 13| 7| |Male | 22| 28| ] .pull-right[ |Sex | Non-Violent| Violent| |:------|-----------:|-------:| |Female | 0.65| 0.35| |Male | 0.44| 0.56| ] -- We can also visualise them with a bar plot ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-13-1.svg)<!-- --> --- # Categorical x Continuous If our association is between a categorical and a continuous variable, we can calculate statistics within each category |Area | Mean Crime Rate| Std. Dev.| N| |:-----|---------------:|---------:|---:| |Rural | 5.40| 3.78| 131| |Urban | 23.72| 12.98| 119| -- Or draw a histogram or density for each category .pull-left[ ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-15-1.svg)<!-- --> ] .pull-right[ ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-16-1.svg)<!-- --> ] --- # Continuous x Continuous When both variables are continuous, we usually measure association with a **correlation** and visualise with a **scatterplot** | | Crime Rate| |:------------|----------:| |Pop. Density | 0.79| -- ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-18-1.svg)<!-- --> Here I've added a **regression line** to illustrate the correlation --- # Going Further We can go pretty far just visualising results! ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-19-1.svg)<!-- --> -- Soon we'll get into statistical models to measure and test relationships -- But they're pretty much all just based on cross-tabs, group means, or correlations --- # What does this show? ![:width 80%](img/age-crime-curve.png) --- class: inverse # Correlation Cautions ![:width 80%](img/caution.png) <br> .text-center[ *He **will** kill again* ] --- ## Anscombe's Quartet Visualisation is important—these all have the same correlation (0.82)! ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-20-1.svg)<!-- --> --- ## Dinosaur Deceit So do these! Their `\(X\)` and `\(Y\)` values also have the same means and standard deviations! ![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-21-1.svg)<!-- --> --- # Wrap-Up * Takeaways * Measurement is complicated but important * Quantitative analysis is mainly about distributions * **Always visualise your data!** * We'll pick up with more on relationships next week, and use that to lead into statistical inference