library(dplyr)
library(gt)
library(palmerpenguins)
library(shadowpop)
library(withr)

An outline of {shadowpop}
“What’s a shadow population?” I hear you ask.
It’s a bit like a control group in an experiment, drawn from an existing defined population dataset.
Imagine you are asked to report on activity relating to a particular population where there has been some kind of intervention or treatment applied. As part of the evaluation of the intervention, you would like to compare these activity outcomes against the activity of a similar, but separate, population. This secondary population is what I am calling a shadow population.
You want the shadow population to have a similar distribution of certain key (independent) variables to your primary population. For example, you may want to generate a population that lives in areas that have a similar overall socio-economic profile as your primary population. Or you may want the age or gender profile to be as similar as possible. Or any other variable that you think matters.
I’m assuming that the primary population and the shadow population can both be drawn from a single overall dataset, for example patients registered at a hospital or students at a college. It may not be possible to exactly mirror the distribution of these variables in the shadow population, but {shadowpop} helps you get as close as possible.
Using shadowpop
To better illustrate what shadowpop() does, we can use the {palmerpenguins} dataset.
I am going to restrict the example here to the Adélie penguins.
Let’s say we have a small sample of 5 Adélies from Biscoe island:
sample_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Biscoe") |>
  dplyr::slice_sample(n = 5) |>
  withr::with_seed(seed = 777)
sample_data |>
  gt()

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male | 2008 |
| Adelie | Biscoe | 37.8 | 18.3 | 174 | 3400 | female | 2007 |
| Adelie | Biscoe | 41.6 | 18.0 | 192 | 3950 | male | 2008 |
| Adelie | Biscoe | 36.5 | 16.6 | 181 | 2850 | female | 2008 |
| Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male | 2008 |
We would like to find a “shadow population” of 5 Adélies from Dream island that have a similar set of characteristics.
The penguins dataset doesn’t have an ID column to uniquely identify each observation; in this situation, shadowpop() will add one automatically.
(If your data has a unique ID/key field, you should specify its name via the id_col argument.)
Let’s create our source data. In this case we’re going to use all the Adélies found on Dream island (top five rows shown below):
source_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Dream")
head(source_data, n = 5) |>
  gt()

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Dream | 39.5 | 16.7 | 178 | 3250 | female | 2007 |
| Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | male | 2007 |
| Adelie | Dream | 39.5 | 17.8 | 188 | 3300 | female | 2007 |
| Adelie | Dream | 40.9 | 18.9 | 184 | 3900 | male | 2007 |
| Adelie | Dream | 36.4 | 17.0 | 195 | 3325 | female | 2007 |
Finally we need to decide which variables we want to use to generate the best match for our shadow population, and in which order.
Probably sex is important, and maybe body mass, and let’s also say we want to match by year if possible.
We hit a problem now because the body mass variable is a specific number of grams, so it’s very unlikely that exact matches will occur.
(Currently shadowpop() only works via exact matching of fields.)
There are also some NA body mass values in the source data which we might want to convert to a specific character value.
There are two ways we can solve the problem:
- convert the specific mass values to ranges and match on those,
- create a match by closeness, for example match any source penguins that are within +/-50g of the mass of the sample individual.
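To make the second option more concrete, here is a rough sketch of closeness matching done by hand with {dplyr}. It is not something shadowpop() does for you: the toy data frames, the `tolerance_g` value and all object names are made up for illustration, and the `relationship` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data, not taken from the penguins dataset
sample_toy <- data.frame(
  sample_id = 1:2,
  sex = c("male", "female"),
  body_mass_g = c(4050, 3400)
)
source_toy <- data.frame(
  source_id = 1:4,
  sex = c("male", "male", "female", "female"),
  body_mass_g = c(4100, 4300, 3425, 3600)
)

# Assumed tolerance: keep source rows within +/-50g of the sample individual
tolerance_g <- 50

close_matches <- sample_toy |>
  # Pair every sample row with every source row of the same sex
  dplyr::inner_join(
    source_toy,
    by = "sex",
    suffix = c("_sample", "_source"),
    relationship = "many-to-many"
  ) |>
  # Keep only the pairs whose body masses are close enough
  dplyr::filter(abs(body_mass_g_sample - body_mass_g_source) <= tolerance_g)

close_matches
```

Each sample individual can end up with zero, one or several candidate rows this way, so you would still need a rule for choosing between them.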
For now we will use the range option.
sample_data2 <- sample_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))

source_data2 <- source_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))

We can now use body_mass_range as one of our match variables, and create a shadow population:
match_vars <- c("sex", "body_mass_range", "year")
shadow_pop <- shadowpop(sample_data2, source_data2, match_vars) |>
  dplyr::select(!c("id", "body_mass_range"))
shadow_pop |>
  gt()

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male | 2008 |
| Adelie | Dream | 39.5 | 16.7 | 178 | 3250 | female | 2007 |
| Adelie | Dream | 38.3 | 19.2 | 189 | 3950 | male | 2008 |
| Adelie | Dream | 33.1 | 16.1 | 178 | 2900 | female | 2008 |
| Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male | 2008 |
sample_data |>
  gt()

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male | 2008 |
| Adelie | Biscoe | 37.8 | 18.3 | 174 | 3400 | female | 2007 |
| Adelie | Biscoe | 41.6 | 18.0 | 192 | 3950 | male | 2008 |
| Adelie | Biscoe | 36.5 | 16.6 | 181 | 2850 | female | 2008 |
| Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male | 2008 |
How well does the shadow population match against the sample population?
Pretty well! In fact, we can tell that each row in the sample data found a match in the source data across all three of the variables we asked it to match against (the exact body mass values don’t match of course, but there was a match in the 250g mass range for each of the sample individuals).
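If you would rather not check this by eye, a row-by-row comparison is easy to sketch in base R. The values below are copied from the two tables above, and the sketch assumes the shadow rows come back in the same order as the sample rows, as they appear to here. `mass_bin()` is a hypothetical helper that uses cut() as a compact stand-in for the case_when() binning used earlier.

```r
# Match variables copied from the sample and shadow tables above
sample_chk <- data.frame(
  sex = c("male", "female", "male", "female", "male"),
  body_mass_g = c(4050, 3400, 3950, 2850, 4300),
  year = c(2008, 2007, 2008, 2008, 2008)
)
shadow_chk <- data.frame(
  sex = c("male", "female", "male", "female", "male"),
  body_mass_g = c(4100, 3250, 3950, 2900, 4350),
  year = c(2008, 2007, 2008, 2008, 2008)
)

# Hypothetical helper: bin mass into the same 250g ranges, closed on the left.
# Unlike the case_when() version, NA stays NA rather than becoming "Missing".
mass_bin <- function(x) {
  bins <- cut(x, breaks = c(-Inf, seq(3000, 4500, by = 250), Inf), right = FALSE)
  as.character(bins)
}

# One logical column per match variable, one row per sample individual
match_check <- data.frame(
  sex_match = sample_chk$sex == shadow_chk$sex,
  mass_range_match = mass_bin(sample_chk$body_mass_g) == mass_bin(shadow_chk$body_mass_g),
  year_match = sample_chk$year == shadow_chk$year
)

match_check
```

Every cell in `match_check` should be TRUE if each sample individual found a full match.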
All other things being equal, the larger the source data, the more likely you are to get a good match for each member of the sample population.
The more variables you try to match against, the less likely you are to find an exact shadow in the source data across all chosen variables.
If shadowpop() doesn’t find a shadow match across all the initially specified match_cols, it will ignore the last one and try again to find a shadow row using the remaining variables. Then it will try again, this time ignoring the last two variables in the list, and so on until there are no more match variables to try.
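The fallback idea can be sketched for a single sample row as follows. This is only an illustration of the logic described above, not the package's actual code: `find_shadow_row()` and the toy data are hypothetical, and taking the first candidate row is an assumption made to keep the example deterministic.

```r
library(dplyr)

# Hypothetical helper, NOT the package's implementation:
# try all match variables, then drop the last one and retry, and so on
find_shadow_row <- function(sample_row, source, match_vars) {
  for (k in rev(seq_along(match_vars))) {
    vars <- match_vars[seq_len(k)]
    # Source rows that agree with the sample row on the first k variables
    candidates <- dplyr::semi_join(source, sample_row, by = vars)
    if (nrow(candidates) > 0) {
      # Assumption: simply take the first candidate
      return(list(row = candidates[1, , drop = FALSE], matched_on = vars))
    }
  }
  # No match even on the first variable alone
  list(row = NULL, matched_on = character(0))
}

# Toy data: no source row matches on all three variables
sample_row <- data.frame(sex = "male", body_mass_range = "4000-4249", year = 2008)
source_toy <- data.frame(
  id = 1:3,
  sex = c("female", "male", "male"),
  body_mass_range = c("4000-4249", "4000-4249", "3750-3999"),
  year = c(2008, 2007, 2008)
)

result <- find_shadow_row(sample_row, source_toy, c("sex", "body_mass_range", "year"))
result$matched_on
result$row
```

Because variables are dropped from the end of the list, the order you give them in matters: put the ones you care about most first.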