An outline of {shadowpop}

library(dplyr)
library(gt)
library(palmerpenguins)
library(shadowpop)
library(withr)

“What’s a shadow population?” I hear you ask.

It’s a bit like a control group in an experiment, drawn from an existing defined population dataset.

Imagine you are asked to report on activity relating to a particular population where there has been some kind of intervention or treatment applied. As part of the evaluation of the intervention, you would like to compare these activity outcomes against the activity of a similar, but separate, population. This secondary population is what I am calling a shadow population.

You want the shadow population to have a similar distribution of certain key (independent) variables to your primary population. For example, you may want to generate a population that lives in areas that have a similar overall socio-economic profile as your primary population. Or you may want the age or gender profile to be as similar as possible. Or any other variable that you think matters.

I’m assuming that the primary population and the shadow population can both be drawn from a single overall dataset, for example patients registered at a hospital or students at a college. It may not be possible to exactly mirror the distribution of these variables in the shadow population, but {shadowpop} helps you get as close as possible.

Using shadowpop

To better illustrate what shadowpop() does, we can use the {palmerpenguins} dataset.

Figure: A hand-drawn image of three penguins, one of each of three species: Chinstrap, Gentoo and Adélie. The Chinstrap penguin drawing has some magenta shading in the background, the Gentoo penguin has a dark green background, and the Adélie penguin has a bright orange background. (Palmer penguins package artwork)

Artwork by @allison_horst

I am going to restrict the example here to the Adélie penguins.

Let’s say we have a small sample of 5 Adélies from Biscoe island:

sample_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Biscoe") |>
  dplyr::slice_sample(n = 5) |>
  withr::with_seed(seed = 777)

sample_data |>
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Biscoe	41.1	18.2	192	4050	male	2008
Adelie	Biscoe	37.8	18.3	174	3400	female	2007
Adelie	Biscoe	41.6	18.0	192	3950	male	2008
Adelie	Biscoe	36.5	16.6	181	2850	female	2008
Adelie	Biscoe	40.1	18.9	188	4300	male	2008

and we would like to find a “shadow population” of 5 Adélies from Dream island that have a similar set of characteristics.

The penguins dataset doesn’t have an ID column to uniquely identify each observation; in this situation, shadowpop() will add one automatically.

(If your data has a unique ID/key field, you should specify its name via the id_col argument.)
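If you would rather create the key yourself before calling shadowpop(), a sequential ID is usually enough. This is a minimal base R sketch on a toy data frame; the column name bird_id and the values are made up for illustration, and the resulting name is what you would then pass via id_col.

```r
# Toy stand-in for a dataset that has no ID column (hypothetical values)
birds <- data.frame(
  sex = c("male", "female", "male"),
  year = c(2007, 2008, 2008)
)

# Add a sequential key; "bird_id" is an illustrative name, not a package default
birds$bird_id <- seq_len(nrow(birds))

# A usable key must have no duplicates and no missing values
stopifnot(anyDuplicated(birds$bird_id) == 0, !anyNA(birds$bird_id))
```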

Let’s create our source data. In this case we’re going to use all the Adélies found on Dream island (top five rows shown below):

source_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Dream")

head(source_data, n = 5) |>
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Dream	39.5	16.7	178	3250	female	2007
Adelie	Dream	37.2	18.1	178	3900	male	2007
Adelie	Dream	39.5	17.8	188	3300	female	2007
Adelie	Dream	40.9	18.9	184	3900	male	2007
Adelie	Dream	36.4	17.0	195	3325	female	2007

Finally, we need to decide which variables we want to use to generate the best match for our shadow population, and in which order.

Sex is probably important, and maybe body mass; let’s also say we want to match by year if possible.

We hit a problem now because the body mass variable is a specific number of grams, so it’s very unlikely that exact matches will occur.

(Currently shadowpop() only works via exact matching of fields.)

There are also some NA body mass values in the source data which we might want to convert to a specific character value.

There are two ways we can solve the problem:

1. Convert the specific mass values to ranges and match on those.
2. Create a match by closeness, for example match any source penguins that are within +/-50g of the mass of the sample individual.

For now we will use the range option.
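For reference, the closeness option could look something like the following. This is only a sketch in base R of the general idea, not something shadowpop() does (as noted, it currently matches exactly). The toy data, the tolerance_g name and the mass values are all made up for illustration.

```r
# Hypothetical body masses for two sample individuals
sample_mass <- c(4050, 3400)

# Hypothetical source population with an ID and a body mass (one value missing)
source_df <- data.frame(
  id = 1:5,
  body_mass_g = c(3250, 3900, NA, 4100, 3425)
)

# Assumed tolerance: a source row is "close" if it is within 50g either way
tolerance_g <- 50

# For each sample mass, collect the IDs of source rows within the tolerance.
# which() drops rows where the comparison is NA (missing body mass).
close_ids <- lapply(sample_mass, function(m) {
  source_df$id[which(abs(source_df$body_mass_g - m) <= tolerance_g)]
})

close_ids
```

Each element of close_ids holds the candidate source IDs for one sample individual; an empty element would mean nothing in the source data was close enough.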

sample_data2 <- sample_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))

source_data2 <- source_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))
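Because the same binning is applied to both data frames, it could also be wrapped in a small helper so the ranges are defined only once. The sketch below uses base R's cut() and is intended to produce the same labels as the case_when() calls above; the function name body_mass_to_range() is hypothetical.

```r
# body_mass_to_range(): hypothetical helper that bins body mass (in grams)
# into the same 250g ranges used above, labelling missing values "Missing"
body_mass_to_range <- function(mass_g) {
  ranges <- cut(
    mass_g,
    breaks = c(-Inf, 3000, 3250, 3500, 3750, 4000, 4250, 4500, Inf),
    labels = c(
      "<3000", "3000-3249", "3250-3499", "3500-3749",
      "3750-3999", "4000-4249", "4250-4499", "4500+"
    ),
    right = FALSE # intervals are closed on the left: [3000, 3250) and so on
  )
  ranges <- as.character(ranges)
  ranges[is.na(ranges)] <- "Missing"
  ranges
}

# Possible usage with the data frames from this post:
# sample_data2 <- sample_data |>
#   dplyr::mutate(body_mass_range = body_mass_to_range(body_mass_g))

body_mass_to_range(c(2850, 3000, 3249, 4500, NA))
```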

We can now use body_mass_range as one of our match variables, and create a shadow population:

match_vars <- c("sex", "body_mass_range", "year")

shadow_pop <- shadowpop(sample_data2, source_data2, match_vars) |>
  dplyr::select(!c("id", "body_mass_range"))

shadow_pop |>
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Dream	43.2	18.5	192	4100	male	2008
Adelie	Dream	39.5	16.7	178	3250	female	2007
Adelie	Dream	38.3	19.2	189	3950	male	2008
Adelie	Dream	33.1	16.1	178	2900	female	2008
Adelie	Dream	40.3	18.5	196	4350	male	2008

For comparison, here is the sample data again:

sample_data |>
  gt()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Biscoe	41.1	18.2	192	4050	male	2008
Adelie	Biscoe	37.8	18.3	174	3400	female	2007
Adelie	Biscoe	41.6	18.0	192	3950	male	2008
Adelie	Biscoe	36.5	16.6	181	2850	female	2008
Adelie	Biscoe	40.1	18.9	188	4300	male	2008

How well does the shadow population match against the sample population?

Pretty well! In fact, we can tell that each row in the sample data found a match in the source data across all three of the variables we asked it to match against (the exact body mass values don’t match of course, but there was a match in the 250g mass range for each of the sample individuals).
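The same check can be made in code rather than by eye. The sketch below copies the sex, year and body mass values from the two tables above into small data frames and compares them row by row. It assumes the shadow rows come back in the same order as the sample rows, which appears to be the case here but is not something this post confirms in general.

```r
# Match variables copied from the sample and shadow population tables above
sample_tbl <- data.frame(
  sex = c("male", "female", "male", "female", "male"),
  year = c(2008, 2007, 2008, 2008, 2008),
  body_mass_g = c(4050, 3400, 3950, 2850, 4300)
)
shadow_tbl <- data.frame(
  sex = c("male", "female", "male", "female", "male"),
  year = c(2008, 2007, 2008, 2008, 2008),
  body_mass_g = c(4100, 3250, 3950, 2900, 4350)
)

# Lower edges of the 250g bands used earlier. findInterval() returns the
# index of the band each mass falls into (0 means below 3000g).
band_edges <- seq(3000, 4500, by = 250)
mass_band <- function(x) findInterval(x, band_edges)

# Row-by-row comparison of each match variable
comparison <- data.frame(
  sex_match = sample_tbl$sex == shadow_tbl$sex,
  year_match = sample_tbl$year == shadow_tbl$year,
  mass_band_match = mass_band(sample_tbl$body_mass_g) ==
    mass_band(shadow_tbl$body_mass_g)
)
comparison$all_match <- with(comparison, sex_match & year_match & mass_band_match)

comparison
```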

All other things being equal, the larger the source data, the more likely you are to get a good match for each member of the sample population.

The more variables you try to match against, the less likely you are to find an exact shadow in the source data across all chosen variables.

If shadowpop() doesn’t find a shadow match across all the initially specified match_cols, it will ignore the last one and try again to find a shadow row using the remaining variables. Then it will try again, this time ignoring the last two variables in the list, and so on until there are no more match variables to try.
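To make the fallback behaviour concrete, here is a rough base R sketch of the idea. It is not the package's actual implementation: the helper name find_shadow_row(), the toy data, the choice to take the first candidate, and returning NA when nothing matches are all assumptions made for illustration.

```r
# Toy sample and source populations (hypothetical values)
sample_df <- data.frame(
  sex = c("male", "female"),
  body_mass_range = c("4000-4249", "3250-3499"),
  year = c(2008, 2007)
)
source_df <- data.frame(
  id = 1:4,
  sex = c("male", "female", "female", "male"),
  body_mass_range = c("4000-4249", "3250-3499", "3000-3249", "3750-3999"),
  year = c(2008, 2009, 2007, 2008)
)
match_vars <- c("sex", "body_mass_range", "year")

# find_shadow_row(): hypothetical helper, not part of {shadowpop}.
# Tries all match variables first, then drops the last one and retries,
# until a candidate is found or no variables remain (in which case NA).
find_shadow_row <- function(target, source, match_vars) {
  for (n_vars in rev(seq_along(match_vars))) {
    vars <- match_vars[seq_len(n_vars)]
    keep <- rep(TRUE, nrow(source))
    for (v in vars) {
      # %in% gives an exact match and never returns NA
      keep <- keep & (source[[v]] %in% target[[v]])
    }
    if (any(keep)) {
      # Assumption: take the first candidate; other choices are possible
      return(source$id[which(keep)[1]])
    }
  }
  NA_integer_
}

shadow_ids <- vapply(
  seq_len(nrow(sample_df)),
  function(i) find_shadow_row(sample_df[i, ], source_df, match_vars),
  integer(1)
)

shadow_ids
```

In this toy example the first sample row matches on all three variables, while the second only finds a match once year has been dropped. This is why the order of the match variables matters: the ones you care about most should come first.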