Building Cohorts • PicnicHealth

Introduction

Analysts often want to rerun the same analysis on different subsets of the data, such as specific cohorts or sub-cohorts defined by inclusion or exclusion criteria. One way to do this is to write functions that perform the analysis and take an entire PicnicHealth data set as an input argument. For example, to run an analysis on the entire cohort and then again on only those patients 30 years of age or younger, or on only those patients who have ever taken a specific drug, the analyst can first subset the data set to the patients of interest and then call the same analytic functions on that sub-cohort. Exclusion works the same way, e.g., removing patients who have ever taken a specific drug.

The general steps to this analytic approach would be:

1. Write functions that perform the analysis on the entire cohort and that take the entire cohort (data set) as an input argument.
2. Identify a list of patients, using person_id, who meet the desired inclusion or exclusion criteria.
3. Create a new data set that is a subset or sub-cohort of the entire cohort (data set) and that either includes or excludes the patients identified by person_id in step 2.
4. Rerun the analytic functions with the new sub-cohort data set as the input argument.
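
The four steps above can be sketched on toy data using only base R. This is an illustration under stated assumptions, not the package's implementation: the list ds_toy, its table and column names, and the function summarize_cohort() are all hypothetical stand-ins for a real data set and a real analysis.

```r
# Toy stand-in for a data set: a named list of data frames keyed by person_id.
# Table and column names are illustrative assumptions.
ds_toy <- list(
  person = data.frame(person_id = 1:6, age = c(25, 34, 47, 52, 61, 70)),
  drug   = data.frame(person_id = c(1, 1, 3, 5),
                      drug_name = c("drug_a", "drug_b", "drug_a", "drug_b"))
)

# Step 1: an analysis function that takes a whole data set as its argument.
summarize_cohort <- function(ds) {
  data.frame(n_patients = nrow(ds$person),
             median_age = median(ds$person$age))
}

# Step 2: identify the person_ids that meet the criterion (ever took drug_a).
drug_a_ids <- unique(ds_toy$drug$person_id[ds_toy$drug$drug_name == "drug_a"])

# Step 3: build the sub-cohort by subsetting every table on person_id.
sub_cohort <- lapply(ds_toy, function(tbl) {
  tbl[tbl$person_id %in% drug_a_ids, , drop = FALSE]
})

# Step 4: rerun the same analysis function on each cohort.
full_result <- summarize_cohort(ds_toy)
sub_result  <- summarize_cohort(sub_cohort)
```

Because the analysis function only ever sees "a data set", it does not need to know how the cohort was defined, which is what makes it reusable across sub-cohorts.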

The hands-on examples below use the PicnicHealth R package, which has a number of built-in functions to facilitate this approach. Note that the examples use a small, synthetic data set that is not necessarily clinically plausible and serves only to demonstrate functionality.

PicnicHealth R Package

Once a PicnicHealth data set is loaded into memory, we can begin to build cohorts based on different inclusion and exclusion criteria.

To demonstrate this, we will create a standard “Table 1” as a very simple example of an analysis that could be done for each of three different cohorts:

1. Entire cohort
2. Patients 50 years of age or younger
3. Patients older than 50 years of age

Loading Data

We can load a PicnicHealth data set into memory using the load_data_set() function. A PicnicHealth data set is an unzipped directory of CSV files and we need to specify the full path to that data set. See help("load_data_set") for more information.

library(PicnicHealth)
library(tidyverse)
ds = load_data_set("path/to/data")
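
To make the "directory of CSV files" idea concrete, here is a minimal base R sketch that reads every CSV in a directory into a named list of data frames. It is not how load_data_set() is implemented (the real function may also validate or type the tables); the directory, file names, and columns are invented for this example.

```r
# Write two tiny CSV files to a temporary directory to stand in for an
# unzipped data set (file names and columns are made up for this sketch).
data_dir <- file.path(tempdir(), "toy_data_set")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(person_id = 1:3, age = c(30, 45, 60)),
          file.path(data_dir, "person.csv"), row.names = FALSE)
write.csv(data.frame(person_id = c(1, 3), visit_year = c(2019, 2021)),
          file.path(data_dir, "visit.csv"), row.names = FALSE)

# Read every CSV in the directory into a list, one data frame per file,
# and name each element after its file (without the .csv extension).
csv_paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
toy_ds <- lapply(csv_paths, read.csv)
names(toy_ds) <- tools::file_path_sans_ext(basename(csv_paths))
```

With this shape, each table is reachable by name, e.g. toy_ds$person, and any operation can be applied to all tables at once with lapply().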

Cohort 1: Entire Cohort

Since a PicnicHealth data set already represents the entire cohort, we can proceed directly to the analysis: a simple call to table_one(), passing the entire data set ds as the input argument.

table_one(ds)

Characteristic	N = 50¹
Sex
Female	39 (78%)
Male	11 (22%)
Race
Black or African American	1 (2.0%)
More than one race	3 (6.0%)
Unknown	1 (2.0%)
White	45 (90%)
Ethnicity
Hispanic or Latino	2 (4.0%)
Not Hispanic or Latino	47 (94%)
Prefer not to say	1 (2.0%)
Age	44 (35, 56)
Total Years of Clinical Documents	16.6 (12.1, 19.2)
Total Years of Visits	15 (10, 19)
Number of Providers	0 (0, 0)
Number of Care Sites	0 (0, 0)
Number of Hospitalizations	1 (0, 3)
Total Hospital Days	1.00 (0.00, 3.00)
¹ n (%); Median (Q1, Q3)

Cohort 2: Patients 50 years of age or younger

To create a cohort of younger patients, we first identify the person_ids of patients 50 years of age or younger. The PicnicHealth R package has a helper function, get_person_age(), that makes this easy. We then pass the entire data set ds, along with the vector of patients we identified (young_patient_ids), to filter_data_set(), which creates a new data set that includes only those patients. filter_data_set() programmatically subsets all tables in the PicnicHealth data set list object, ds, keeping or removing the rows associated with the vector of person_ids depending on the operation argument; see help("filter_data_set") for more information. Because we did not specify that argument, its default, operation = "include", applies here, so the vector acts as an inclusion criterion and only rows for those patients are kept. We can then perform the analysis, a simple table_one() in this case, on the newly created sub-cohort.

young_patient_ids = ds %>%
  get_person_age() %>% 
  filter(age <= 50) %>%
  pull(person_id)

young_cohort = filter_data_set(ds, young_patient_ids)

table_one(young_cohort)

Characteristic	N = 29¹
Sex
Female	22 (76%)
Male	7 (24%)
Race
Black or African American	1 (3.4%)
More than one race	2 (6.9%)
White	26 (90%)
Ethnicity
Hispanic or Latino	2 (6.9%)
Not Hispanic or Latino	26 (90%)
Prefer not to say	1 (3.4%)
Age	38.3 (31.2, 40.7)
Total Years of Clinical Documents	16.4 (12.1, 19.1)
Total Years of Visits	15.8 (10.6, 19.2)
Number of Providers	0 (0, 0)
Number of Care Sites	0 (0, 0)
Number of Hospitalizations	2 (1, 3)
Total Hospital Days	2.00 (1.00, 3.00)
¹ n (%); Median (Q1, Q3)
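
As a rough mental model of what subsetting "all tables" by person_id means, the sketch below applies one filter to every data frame in a list. The helper filter_tables() is hypothetical and is not the package's filter_data_set(); it only mimics the include/exclude behavior described above, and the toy tables and columns are assumptions.

```r
# Hypothetical helper, NOT the package's filter_data_set(): it mimics the
# described include/exclude behavior on a plain list of data frames.
filter_tables <- function(ds, person_ids, operation = c("include", "exclude")) {
  operation <- match.arg(operation)  # defaults to "include", the first choice
  lapply(ds, function(tbl) {
    matched <- tbl$person_id %in% person_ids
    keep <- if (operation == "include") matched else !matched
    tbl[keep, , drop = FALSE]
  })
}

# Toy data set: two tables that share the person_id key.
toy_ds <- list(
  person = data.frame(person_id = 1:5, age = c(31, 44, 50, 58, 67)),
  visit  = data.frame(person_id = c(1, 2, 2, 4, 5, 5), visit_year = 2016:2021)
)

young_ids <- toy_ds$person$person_id[toy_ds$person$age <= 50]

# With the default operation, only rows for the listed patients survive,
# in every table of the list.
young_toy <- filter_tables(toy_ds, young_ids)
```

The point of filtering every table, rather than only the person table, is that downstream analyses can read any table and still see a consistent cohort.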

Cohort 3: Patients older than 50 years of age

If we now wanted the complementary sub-cohort of patients older than 50, we could proceed as above: identify the patients older than 50 and use filter_data_set() to include only them. Alternatively, instead of defining a new list of patients, we can pass operation = "exclude" to filter_data_set() to impose an exclusion criterion. This creates a new data set that excludes patients aged 50 or younger and therefore contains only patients older than 50. Note that we pass in the original, entire data set ds and the same young_patient_ids, but now exclude them. We can then perform the analysis, a simple table_one() in this case, on the newly created sub-cohort.

older_cohort = filter_data_set(ds, young_patient_ids, operation = "exclude")

table_one(older_cohort)

Characteristic	N = 21¹
Sex
Female	17 (81%)
Male	4 (19%)
Race
More than one race	1 (4.8%)
Unknown	1 (4.8%)
White	19 (90%)
Ethnicity
Not Hispanic or Latino	21 (100%)
Age	62 (54, 68)
Total Years of Clinical Documents	17.4 (12.6, 19.3)
Total Years of Visits	14 (10, 18)
Number of Providers	0 (0, 0)
Number of Care Sites	0 (0, 0)
Number of Hospitalizations	1 (0, 2)
Total Hospital Days	1.00 (0.00, 2.00)
¹ n (%); Median (Q1, Q3)
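
Including and excluding the same list of ids splits the cohort into two non-overlapping groups, which is consistent with the counts above (29 + 21 = 50). The base R sketch below checks this on a toy person table. The table is invented, and the missing age for patient 6 is an assumption added to illustrate one caveat: a patient whose age comparison evaluates to NA is never selected as "young", so an exclusion-based complement would place that patient in the "older" group. Whether get_person_age() can return missing ages is not stated here, so treat this as something to verify on real data.

```r
# Toy person table; patient 6 has a missing age (an assumption for this sketch).
person <- data.frame(person_id = 1:6, age = c(31, 44, 50, 58, 67, NA))

# which() drops NA comparisons, similar to how dplyr::filter() drops rows
# where the condition evaluates to NA.
young_ids <- person$person_id[which(person$age <= 50)]

included <- person[person$person_id %in% young_ids, ]
excluded <- person[!(person$person_id %in% young_ids), ]

# The two groups are disjoint and together cover every patient.
stopifnot(length(intersect(included$person_id, excluded$person_id)) == 0)
stopifnot(setequal(c(included$person_id, excluded$person_id), person$person_id))

# The patient with a missing age ends up in the excluded ("older") group.
excluded$person_id
```

If that behavior is not wanted, one option is to define the older cohort with its own explicit inclusion criterion (age > 50) instead of relying on the complement.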