Building Cohorts
building_cohorts.Rmd
Introduction
Analysts often need to rerun the same analysis on different subsets of the data: specific cohorts, or sub-cohorts defined by inclusion or exclusion criteria. One way to do this is to write analysis functions that take an entire PicnicHealth data set as an input argument. For example, to run an analysis on the entire cohort and then again on only those patients 30 years of age or younger, or on only those patients who have ever taken a specific drug, one can first subset the data set to those patients and then call the same analytic functions on the resulting sub-cohort. Exclusion works the same way, e.g., removing patients who have ever taken a specific drug.
The general steps to this analytic approach would be:
1. Write functions that perform the analysis and that take the entire cohort (data set) as an input argument.
2. Identify a list of patients, using `person_id`, that meet the desired inclusion or exclusion criteria.
3. Create a new data set that is a subset or sub-cohort of the entire cohort (data set) and that either includes or excludes the patients identified using `person_id` in step 2.
4. Re-run the analytic functions with the new sub-cohort data set as the input argument.
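The four steps above can be sketched with base R alone. This is a minimal illustration on a toy data set (a named list of data frames that share a `person_id` column), not the PicnicHealth API. The names `toy_ds`, `count_patients()`, and `keep_patients()` are hypothetical and invented for this example; `keep_patients()` only stands in for the package's own filtering function, and the assumption that every table carries a `person_id` column may not hold for a real data set.

```r
# Toy data set: a named list of tables that share a person_id column.
# All names and values here are invented for illustration.
toy_ds <- list(
  person = data.frame(person_id = 1:5, age = c(25, 30, 42, 57, 63)),
  drug   = data.frame(person_id = c(1, 1, 3, 4),
                      drug_name = c("drug_a", "drug_b", "drug_a", "drug_b"))
)

# Step 1: an analysis function that takes a whole data set as its argument.
count_patients <- function(ds) {
  length(unique(ds$person$person_id))
}

# Hypothetical stand-in for a data-set filter: keep only rows whose
# person_id is in `ids`, in every table of the list.
keep_patients <- function(ds, ids) {
  lapply(ds, function(tbl) tbl[tbl$person_id %in% ids, , drop = FALSE])
}

# Step 2: identify person_ids meeting the criterion (ever took drug_a).
drug_a_ids <- unique(toy_ds$drug$person_id[toy_ds$drug$drug_name == "drug_a"])

# Step 3: build the sub-cohort data set.
drug_a_cohort <- keep_patients(toy_ds, drug_a_ids)

# Step 4: rerun the same analysis on the full cohort and the sub-cohort.
n_all    <- count_patients(toy_ds)
n_drug_a <- count_patients(drug_a_cohort)
```

Because the analysis function only ever sees a data set, it does not need to know how the sub-cohort was defined; the selection logic stays in steps 2 and 3.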
See below for hands-on examples using the PicnicHealth R package, which has a number of built-in functions to facilitate this. Note that we use a small, synthetic data set that may not be clinically plausible in all cases and serves only to demonstrate functionality.
PicnicHealth R Package
Once a PicnicHealth data set is loaded into memory, we can begin to build cohorts based on different inclusion and exclusion criteria.
To demonstrate this, we will create a standard “Table 1” as a very simple example of an analysis that could be done for each of three different cohorts:
- Entire cohort
- Patients 50 years of age or younger
- Patients older than 50 years of age
Loading Data
We can load a PicnicHealth data set into memory using the `load_data_set()` function. A PicnicHealth data set is an unzipped directory of CSV files, and we need to specify the full path to that directory. See `help("load_data_set")` for more information.
library(PicnicHealth)
library(tidyverse)
ds = load_data_set("path/to/data")
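Because the loader expects a directory of CSV files rather than a single file, it can help to confirm the path before loading. The check below is a hedged sketch in base R: `check_data_set_dir()` is a hypothetical helper written for this example and is not part of the PicnicHealth package. It also assumes the CSV files sit at the top level of the directory, which may not match the real layout.

```r
# Hypothetical helper: stop early if `path` is not a directory
# containing at least one CSV file; otherwise return the CSV file names.
check_data_set_dir <- function(path) {
  if (!dir.exists(path)) {
    stop("Directory not found: ", path)
  }
  # Assumption: CSV files are at the top level of the directory.
  csv_files <- list.files(path, pattern = "\\.csv$", ignore.case = TRUE)
  if (length(csv_files) == 0) {
    stop("No CSV files found in: ", path)
  }
  csv_files
}
```

Calling `check_data_set_dir("path/to/data")` before `load_data_set("path/to/data")` would surface a mistyped path with a clear message.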
Cohort 1: Entire Cohort
Since a PicnicHealth data set is already the entire cohort, we can proceed directly to running our analysis. Here the analysis is a simple call to the `table_one()` function, with the entire data set `ds` passed as the input argument.
table_one(ds)
Characteristic | N = 50¹ |
---|---|
Sex | |
Female | 39 (78%) |
Male | 11 (22%) |
Race | |
Black or African American | 1 (2.0%) |
More than one race | 3 (6.0%) |
Unknown | 1 (2.0%) |
White | 45 (90%) |
Ethnicity | |
Hispanic or Latino | 2 (4.0%) |
Not Hispanic or Latino | 47 (94%) |
Prefer not to say | 1 (2.0%) |
Age | 44 (35, 56) |
Total Years of Clinical Documents | 16.6 (12.1, 19.2) |
Total Years of Visits | 15 (10, 19) |
Number of Providers | 0 (0, 0) |
Number of Care Sites | 0 (0, 0) |
Number of Hospitalizations | 1 (0, 3) |
Total Hospital Days | 1.00 (0.00, 3.00) |
¹ n (%); Median (Q1, Q3) |
Cohort 2: Patients 50 years of age or younger
To create a cohort of younger patients, we first identify the `person_id` values for patients 50 years of age or younger. The PicnicHealth R package has a helper function, `get_person_age()`, that makes this easy. Once we have that list of patients, we pass the entire data set `ds` along with the list, i.e. `young_patient_ids`, to the `filter_data_set()` function, which creates a new data set that includes only those patients. Note that `filter_data_set()` programmatically subsets all tables in the PicnicHealth data set list object `ds`, keeping only the rows associated with the vector of `person_id`s when used for inclusion, as here; see `help(filter_data_set)` for more information. Also note that we are implicitly using `operation = "include"` as an inclusion criterion, since it is the default and we did not specify that argument. We can then perform the analysis, a simple `table_one()` in this case, on the newly created sub-cohort.
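# Identify person_ids for patients aged 50 or younger
young_patient_ids = ds %>%
  get_person_age() %>%
  filter(age <= 50) %>%
  pull(person_id)

# Keep only those patients (operation = "include" is the default)
young_cohort = filter_data_set(ds, young_patient_ids)
table_one(young_cohort)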
Characteristic | N = 29¹ |
---|---|
Sex | |
Female | 22 (76%) |
Male | 7 (24%) |
Race | |
Black or African American | 1 (3.4%) |
More than one race | 2 (6.9%) |
White | 26 (90%) |
Ethnicity | |
Hispanic or Latino | 2 (6.9%) |
Not Hispanic or Latino | 26 (90%) |
Prefer not to say | 1 (3.4%) |
Age | 38.3 (31.2, 40.7) |
Total Years of Clinical Documents | 16.4 (12.1, 19.1) |
Total Years of Visits | 15.8 (10.6, 19.2) |
Number of Providers | 0 (0, 0) |
Number of Care Sites | 0 (0, 0) |
Number of Hospitalizations | 2 (1, 3) |
Total Hospital Days | 2.00 (1.00, 3.00) |
¹ n (%); Median (Q1, Q3) |
Cohort 3: Patients older than 50 years of age
To create the complementary sub-cohort of patients older than 50, we could proceed as above: identify a list of patients older than 50 and use `filter_data_set()` to include only those patients. Alternatively, instead of defining a new list of patients, we can use the `operation = "exclude"` argument of `filter_data_set()` to impose an exclusion criterion. This creates a new data set that excludes patients aged 50 or younger, i.e. one that contains only patients older than 50. We can then perform the analysis, a simple `table_one()` in this case, on the newly created sub-cohort. Note that we pass in the original, entire data set `ds` and the same list of `young_patient_ids`, but now exclude them.
older_cohort = filter_data_set(ds, young_patient_ids, operation = "exclude")
table_one(older_cohort)
Characteristic | N = 21¹ |
---|---|
Sex | |
Female | 17 (81%) |
Male | 4 (19%) |
Race | |
More than one race | 1 (4.8%) |
Unknown | 1 (4.8%) |
White | 19 (90%) |
Ethnicity | |
Not Hispanic or Latino | 21 (100%) |
Age | 62 (54, 68) |
Total Years of Clinical Documents | 17.4 (12.6, 19.3) |
Total Years of Visits | 14 (10, 18) |
Number of Providers | 0 (0, 0) |
Number of Care Sites | 0 (0, 0) |
Number of Hospitalizations | 1 (0, 2) |
Total Hospital Days | 1.00 (0.00, 2.00) |
¹ n (%); Median (Q1, Q3) |
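Including and excluding the same list of IDs, as done for Cohorts 2 and 3, splits the full cohort into two non-overlapping groups. The sketch below illustrates that idea in base R on invented ages, and then applies one analysis function unchanged to all three cohorts. It does not use or reproduce the package's implementation: `toy_people`, `cohorts`, and `summarise_age()` are hypothetical names made up for this example.

```r
# Invented ages for six hypothetical patients.
toy_people <- data.frame(person_id = 101:106,
                         age = c(34, 50, 51, 47, 68, 29))

# Inclusion list: patients aged 50 or younger (50 itself is included).
young_ids <- toy_people$person_id[toy_people$age <= 50]

# Excluding the same IDs leaves the complementary group, older than 50.
older_ids <- setdiff(toy_people$person_id, young_ids)

cohorts <- list(
  all   = toy_people,
  young = toy_people[toy_people$person_id %in% young_ids, ],
  older = toy_people[!toy_people$person_id %in% young_ids, ]
)

# One analysis function, applied unchanged to every cohort.
summarise_age <- function(people) {
  c(n = nrow(people), median_age = median(people$age))
}
results <- sapply(cohorts, summarise_age)
```

A quick check that the two sub-cohort sizes add up to the full cohort size, and that the two ID lists share no members, is a cheap way to catch an off-by-one in the age threshold (`<=` versus `<`).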