Quantitative Data and Relationships

class: center, top, title-slide

.title[
# Quantitative Data and Relationships
]
.subtitle[
## CRM Week 2
]
.author[
### Charles Lanfear
]
.date[
### 21 Oct 2025 Updated: 20 Oct 2025
]

---

# Today

* Populations and Samples

* Crime Data

* Measurement

* Distributions

* Conditional Distributions

---
class: inverse

# Populations and Samples

&nbsp;

![:width 70%](img/sampling.png)

---
# Terminology

**Population**: All members of some given set of interest

* Our goal is usually to learn something about this group

**Sample**: A subset of the population selected for study

**Sampling**: Process of selecting members of population into the sample

**Sampling Frame**: Members of population eligible to be in your sample

* The *actual* members we're drawing the sample from
* Ideally as similar as possible to the population

**Observation**: One of the sampled units

**Variable**: A measure describing the observation(s)

.text-center[
*Your data will contain only observations and variables, but their value is determined by everything else*
]

---
# Example

> To estimate the level of crime victimisation in the United Kingdom, the British Crime Survey surveys a random sample of 3,000 UK household residents per month and asks them about incidents of victimisation (e.g., burglary) in the past year.

**Population**: Everyone in the United Kingdom

**Sample**: 3000 household residents

**Sampling**: Random selection from all UK households

**Sampling Frame**: People living in a known household

**Observation**: A person

**Variable**: Past year burglary victimisation

.text-center[
*What would these data be bad for?*
]

---
# Types of Statistics

.pull-left[

### Descriptive

* Describing a sample
* Summarizing variables
]
.pull-right[

### Inferential

* Learning about the population
* Testing hypotheses
]

To make **inferences** about populations using samples, your sample needs to be **representative** of your population—at least in the ways that matter

Samples are **representative** when they have a similar composition to the population along the characteristics you are interested in

&nbsp;

.text-center[

*Next week we'll be talking about inferential statistics*

]

---

![](img/phdcn-sampling.jpg)
.footnote[Source: [Sampson et al. (2022)](https://link.springer.com/article/10.1007/s40865-022-00203-0) ]

---
class: inverse
background-color: #111111

# Crime Data

&nbsp;

.pull-left[
![](img/shr_rates_anim.gif)
]
.pull-right[
![](img/shr_age_distribution.gif)
]

---

## Police Data

An example: `data.police.uk`

* All police-recorded crime in England, Wales, and Northern Ireland
   * A *population*!
   * Tens/hundreds of thousands of observations
   * Geolocation
   * Small area estimates possible

.pull-left[
### Advantages:

* Captures wide array of offenses
* Often easily available
* Often *all* that is available
]
.pull-right[
### Challenges:

* Misses *unreported crime*
* *Reporting* varies widely
* Quality varies widely
]

---

## The Dark Figure of Crime

.pull-left-60[
Not all crimes are *discovered*

* Victimless crimes
* Missing person homicides

]
.pull-right-40[
![](img/dark_figure.jpg)
]

---
count: false

## The Dark Figure of Crime

.pull-left-60[
Not all crimes are *discovered*

* Victimless crimes
* Missing person homicides

Not all crimes are *reported*

* Varies by place
* Varies by individuals
* Varies by crime

]
.pull-right-40[
![](img/dark_figure.jpg)
]

---
count: false

## The Dark Figure of Crime

.pull-left-60[
Not all crimes are *discovered*

* Victimless crimes
* Missing person homicides

Not all crimes are *reported*

* Varies by place
* Varies by individuals
* Varies by crime

Not all reported are *recorded* by police

* Varies by agency and crime
* Differential treatment
* Hierarchy rules
* Political incentives

]
.pull-right-40[
![](img/dark_figure.jpg)
]

---

## An Example

Observed Crime Rate = Actual Rate ∗ Probability of Reporting

> According to the U.C.R., the incidence of rape nearly doubled from 1973 to 1990. The N.C.V.S., by contrast, shows that it declined by around forty per cent during the same period. Researchers... found that the upward trend in the U.C.R. data correlated with upticks in the number of female police officers, and with the advent of rape crisis centers and reformed investigative styles. It could be, in short, that a modernized approach to the policing of rape drastically increased the frequency with which it was reported while reducing its incidence.

[Matthew Hutson. 2020. "The Trouble with Crime Statistics." The New Yorker](https://www.newyorker.com/culture/annals-of-inquiry/the-trouble-with-crime-statistics)

&nbsp;

.text-center[
*Note the complex causal process generating the data!*
]

---
## Victimization Data

.pull-left[
### Advantages:

* Captures unreported crime
* Can get at exposure and effects
   * *How many times*
   * *How badly injured*

]
.pull-right[
### Challenges:

* Reluctance to answer
* Hidden populations
* Expense
* Doesn't capture crime without victim
]

An Example: **British Crime Survey**

* Self-reported victimisation
* Sample of 3,000 per month

* Great for nationwide estimates
   * Too small for area estimates

---
## Self-Reported Offense Data

.pull-left[

### Advantages:

* Captures unreported crime
* Captures crime without victim
* Can get at motivation
]

.pull-right[
### Challenges:

* Reluctance to answer
* Hidden, even dangerous, populations
* Many crimes very rare
* Expense
]

An Example: **PADS+**

* 716 youth in/around Peterborough
   * Representative sample at age 12
* Repeated follow-up—with incredible *retention*
* Detailed information on life and context

---
# Ideal Data

*Ideal criminological data would cover every potential crime and what occurred as a result*

.pull-left[
That is, for a given opportunity, was the crime...

* Attempted?
* Successful?
* Discovered?
* Reported to police?
]

.pull-right[
And was the offender...

* Seen or found?
* Informally sanctioned?
* Arrested?
* Incarcerated?
]

.text-center[
*These are all important but nearly never all observed, especially not for a large number of incidents*
]

---
# Types of Studies

* **Case study**: One unit observed once (perhaps over a period)

* **Cross-section**: Many units observed at one time point

* **Time series**: One unit observed at multiple times

* **Time-series cross-section**:

* Many units observed at different times
   * Not the same units at each time

* **Panel study**

* Many units observed at different times
   * Same units observed at each time

&nbsp;

.center[
*Each is useful for answering different questions*
]

---
class: inverse

# Measurement

&nbsp;

![:width 28%](img/craniograph.jpg)

---
# What is Measurement?

* Measurement is the process of connecting an abstract concept to empirical data

* Research questions must be testable with empirical data

* Testing thus relies heavily on measurement

* No research design can overcome bad measures

---
# Types of Measures

&zwj;**Observables**: Objective, externally measurable variables

* Individual height, occupation, stated beliefs, arrest history
 * Neighborhood population, geographical area, crime rate1

.footnote[[1] Hard to measure doesn't mean unobservable!]

*Observables can be directly seen and/or experienced*

**Unobservables** (or **Constructs**): Subjective, internal, or only indirectly measurable

* Individual self-control, wellness, social class, actual beliefs
   * Neighborhood social capital, disadvantage, collective efficacy

*Unobservables do not exist in the real world: they are conceptual*

.text-center[
*In Criminology, we're* very often *interested in unobservables.*
]

---
# Measurement Principles

&zwj;**Conceptualization**: *What is X?*

* Define what a measure constitutes
* Define relations to other concepts
* Goal: Minimize ambiguity

&zwj;**Operationalization**: *How do I measure X?*

* Define observable(s)—**indicators**—that are related to the concept
* If single indicator, it is a **proxy** for the concept
 + e.g. *Years of School* is a proxy for *Education*
* If multiple indicators, they can be combined into *composite* measure1

When reading articles, pay attention to how things are measured!

.footnote[[1] Composites include things like indices, scales, factors, and principle components. Turning multiple measures into one composite is sometimes called *dimension reduction*.]

---
# Indicators and Causes

.pull-left[
**Reflective**

The measure causes the indicators

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-2-1.svg)

*The value of the underlying unobservable variable is __reflected__ in the indicators*

]

.pull-right[
**Formative**

The indicators cause the measure

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-3-1.svg)

*The underyling unobservable variable is __formed__ by the combination of the indicators*

]

.center[
*These impact how we estimate measures—a topic for another time!*

*Let's see some examples*
]

---

# Individual Reflective

### Self-Control

&zwj;Concept: The ability to delay gratification, tolerate frustration, and carefully consider before acting.

&zwj;Indicators: *Would you agree, neither agree nor disagree, or disagree that you...*

* "get upset when you have to wait for something?"
   * "act without stopping to think?"
   * "like to do daring things?"
   * "are impatient—want to have things right away?"
   * "are careful about what you do?"
   
--

&zwj;Assumption: *Shared variation in the indicators represents underlying self-control.*

---

# Individual Formative

### Offending Scale

&zwj;Concept: A summary measure of frequency and severity of offending.

&zwj;Indicators: *In the past year, how many times have you...*

* "taken something that wasn't yours?"
   * "used force to take something from someone else?"
   * "punched, kicked, or pushed someone else?"
   * "attacked someone with a weapon like a knife or gun?"
   
--

&zwj;Assumption: *This combination of different types of offending counts provides a useful measure of something.*

---
# Neighborhood Reflective

### Child-Centered Informal Social Control

&zwj;Concept: The shared neighborhood norms and expectations for intervening against child misbehavior.

???

This is a complex one that we'll see a bunch in social disorg--mainly collective efficacy

Idea is capturing neighborhood capacity to control child behavior in public spaces

Isn't about what respondent would do--is about what people around would do

&zwj;Indicators: *How likely1 is it that people in your neighborhood would stop it if...*

* "a group of neighborhood children were skipping school and hanging out on a street corner."
   * "some children were spray-painting graffiti on a local building."
   * "children were fighting out in the street."
   * "a child was showing disrespect to an adult."

.footnote[[1] (1) Very Likely, (2) Likely, (3) Unlikely, (4) Very Unlikely]

&zwj;Assumption: *Shared variation in the measures represents underlying expectations for social control of children.*

---

# Neighborhood Formative

### Concentrated Disadvantage

&zwj;Concept: The spatial clustering of socioeconomic disadvantage.

&zwj;Indicators: *Census tract percentage of...*

* "families below the poverty line."
   * "individuals on public assistance."
   * "individuals unemployed."
   * "female-headed households."
   * "individuals under age 18."
   
--

&zwj;Assumption: *The combination of these measures provides a useful index of disadvantage.*

---

# Estimation is hard!

![:width 80%](img/measurement_representation.gif)

.footnote[
From [Groves & Lyburg (2010) "Total Survey Error: Past, Present, and Future"](https://doi.org/10.1093/poq/nfq065)
]

---
class: inverse

# Types of Quantitative Data

&nbsp;

![:width 70%](img/data.jpg)

---

## Continuous Data

**Continuous** data take ordered, evenly-spaced, numeric values:

* The difference between 1cm and 2cm is the same as 5cm and 6cm
* The difference between 20C and 25C is the same as 25C and 30C

If they have a *true non-arbitrary zero*, they are **ratio** data

* If something is 0cm long, it has *no length* (ratio!)

* If the temperature is 0C, there isn't *no temperature* (not ratio!)
   * The 0 is arbitrary: In Fahrenheit it is 32 degrees!

* Multiplication only works with ratio data

* 15cm is half as long as 30cm
 * 15C is not half as hot as 30C
 * 15C is 59F is 288K1
 * 30C is 86F is 303K

.pull-right[
.footnote[[1] Kelvin *is* ratio though!]
]

---

## Discrete or Count Data

**Counts** are a special type of numeric data:

* True zero like ratio data
* Strictly non-negative
* Only whole (integer) values

Events or incidents are counts

* You can't have negative crimes or half a crime.

We create **rates** by dividing counts by an **exposure** variable like population, time, or area1

* **Exposures** measure what is "at risk" for an event
* e.g., Homicides per 100,000 people

.footnote[[1] Rates have to be used carefully—we'll talk about this in IQA later in the term]

---

## Ordinal Data

**Ordinal** data have values that can be ranked or ordered, but have no units or meaningful intervals

A common **Likert scale** item:

* Strongly Agree
* Agree
* Neither Agree nor Disagree
* Disagree
* Strongly Disagree

The difference between "strongly agree" and "disagree" and between "disagree" and "neither agree nor disagree" may not be the same

This means you cannot add/subtract or multiply/divide ordinal data

You also can't easily compare one person's ordinal value to another's

* There's no objective reference point

---

## Nominal (Categorical) Data

**Nominal** or **categorical** data can't even be ranked. Every value is *qualitatively*, rather than quantitatively, different

* Gender
* Nationality
* Occupation

Sometimes we use nominal data to create other types:

* Occupation into ranked socio-economic status or prestige
* Nationality into high/middle/low income country

**Binary** or **dichotomous** data are a special case of nominal or ordinal data when there are only two categories

* e.g., alive or dead
* A useful mean can be calculated for binary data!

---
class: inverse

# Describing Data

&nbsp;

![:width 70%](img/distributions.png)

---

# Central Tendency

Measures of **central tendency** are ways to summarize a variable mathematically using a (usually) single number

**Mean**

* The "average" or "expected value"
* Defined only for continuous data

**Median**

* The "middle" value or 50th **percentile**
* Defined for continuous or ordinal data

**Mode**

* The most common single value1
* Defined for all data

.footnote[[1] For continuous data, the mode is the highest peak of probability density function]

---

# Dispersion

Measures of **dispersion** are ways to summarize how spread out data are in a variable

We typically only define these for continuous data

**Standard Deviation**

* Roughly the average difference between data values and the mean
* Low values indicate data are clustered around the mean
* High values indicate data are spread away from the mean

**Variance**1

* The square of the standard deviation
* Not typically interpretable
* Mainly useful in calculations

.footnote[[1] Note there are different variance formulae for the population (divide by N) and sample (divide by N-1). We nearly always use the second one.]

---

# Probability Distributions

**Probability distributions** are mathematical functions that give the **probability** of observing each value

* This is just how likely it is for a variable to take any given value
* Common notation: `\(P(X)\)` (probability of `\(X\)`)

**Empirical distributions** or **sample distributions**

* The set of all observed values of a variable

**Theoretical distributions**

* Imagined or expected frequencies based on mathematical parameters
* Not based on empirical data
* We often compare empirical distributions to theoretical ones

.text-center[
*We'll talk about theoretical distributions next week—we'll focus on empirical ones today*
]

---

## Categorical

Because they take a limited number of values, we typically summarize categorical data by the count or proportion of each value

.text-72[

|Type of Crime | Frequency| Proportion|
|:-------------|---------:|----------:|
|Assault       |        89|       0.36|
|Robbery       |        24|       0.10|
|Theft         |       137|       0.55|
]

Bar plots are a good way to visualize categorical distributions

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-5-1.svg)

---

## Ordinal

Ordinals can be summarized similarly to categoricals, but they have a median—where the cumulative proportion (percentile) passes 0.50.

.text-72[

|Fear of Crime | Frequency| Proportion| Cumulative Proportion|
|:-------------|---------:|----------:|---------------------:|
|Very Low      |        25|       0.08|                  0.08|
|Low           |       100|       0.32|                  0.40|
|Moderate      |        75|       0.24|                  0.64|
|High          |        60|       0.19|                  0.83|
|Very High     |        50|       0.16|                  0.99|
]

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-7-1.svg)

---

## Continuous

Continuous data can take a limitless number of values, so we summarize them with measures of central tendency and dispersion.

|   Mean| Median|  Mode| Std. Dev.|   N|
|------:|------:|-----:|---------:|---:|
| 131.14|  78.68| 33.57|    153.32| 400|

We visualize continuous distributions with **histograms** and **density plots**

.pull-left[
![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-9-1.svg)
]
.pull-right[
![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-10-1.svg)

]

---

# Visualise Your Data

These all have the same mean, median, and standard deviation!

|Distribution |Mean |Median |Std. Dev |
|:------------|:----|:------|:--------|
|Normal       |1.0  |1.0    |1.0      |
|Poisson      |1.0  |1.0    |1.0      |
|Uniform      |1.0  |1.0    |1.0      |

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-12-1.svg)

---
class: inverse
# Quantitative Relationships

&nbsp;

![:width 80%](img/age-crime-curve.png)

---
# Types of Relationships

&zwj;**Association**: *X and Y tend to "go together"*

* e.g., ice cream sales and violent crime rise at same time
* *May* imply common causes, e.g. temperature
* *Does not* imply banning ice cream will reduce crime or vice versa
* Purely observational—makes no assumptions

&zwj;**Effect**: *X causes Y*

* e.g., clearing a vacant lot reduces nearby violence 
* Implies *direction* and *cause*:
   * Clearing more lots will reduce violence there
   * Reducing violence will not clear lots
* Requires strong assumptions to **identify**

.text-center[
*For now, we'll focus on associations*
]

---

# Distributions

Associations are really about **conditional probabilities**

* Probability `\(Y\)` takes a particular value when `\(X\)` takes a particular value
   * Notation: `\(P(Y|X)\)` is "Probability of Y given X"
* e.g., if a suspect is white, their probability of arrest is 0.3.
   * `\(P(arrest|white) = 0.3\)`

A **conditional distribution** is the distribution a variable takes when a different variable takes some particular value

A **joint distribution** is the full distribution of *both* variables

* i.e., the **conditional distributions** of each variable at every value of the other variable

.text-center[
*These are hard to think about but intuitive in action*
]

---

# Cross-Tabs

The simplest joint distributions are those between categorical variables, usually displayed as **cross-tabulations**

.pull-left[

|Sex    | Non-Violent| Violent|
|:------|-----------:|-------:|
|Female |          13|       7|
|Male   |          22|      28|

]

.pull-right[

|Sex    | Non-Violent| Violent|
|:------|-----------:|-------:|
|Female |        0.65|    0.35|
|Male   |        0.44|    0.56|

]

We can also visualise them with a bar plot

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-15-1.svg)

---

# Categorical x Continuous

If our association is between a categorical and a continuous variable, we can calculate statistics within each category

|Area  | Mean Crime Rate| Std. Dev.|   N|
|:-----|---------------:|---------:|---:|
|Rural |            5.40|      3.78| 131|
|Urban |           23.72|     12.98| 119|

Or draw a histogram or density for each category

.pull-left[
![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-17-1.svg)
]

.pull-right[
![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-18-1.svg)
]

---

# Continuous x Continuous

When both variables are continuous, we usually measure association with a **correlation** and visualise with a **scatterplot**

|             | Crime Rate|
|:------------|----------:|
|Pop. Density |       0.79|

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-20-1.svg)

Here I've added a **regression line** to illustrate the correlation

---

# Going Further

We can go pretty far just visualising results!

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-21-1.svg)

Soon we'll get into statistical models to measure and test relationships

But they're pretty much all just based on cross-tabs, group means, or correlations

---
class: inverse
# What does this show?

&nbsp;

![:width 80%](img/age-crime-curve.png)

---
class: inverse
# Correlation Cautions

&nbsp;

![:width 80%](img/caution.png)

.text-center[
*He **will** kill again*
]

---

## Anscombe's Quartet

Visualisation is important—these all have the same correlation (0.82)!

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-22-1.svg)

---

## Dinosaur Deceit

So do these! Their `\(X\)` and `\(Y\)` values also have the same *means* and *standard deviations*!

![](slides_quantitative-data-relationships_files/figure-html/unnamed-chunk-23-1.svg)

---
class: inverse

# Wrap-Up

* Takeaways

* Measurement is complicated but important
   * Quantitative analysis is mainly about distributions
   * **Always visualise your data!**

* We'll pick up with more on relationships next week, and use that to lead into statistical inference