Introduction

Analysts often need to rerun the same analysis on different subsets of the data: specific cohorts, or sub-cohorts defined by inclusion or exclusion criteria. One way to do this is to write functions that perform the analysis and take an entire PicnicHealth data set as the input argument. For example, to run an analysis on the entire cohort and then again on only those patients 30 years of age or younger, or only those patients who have ever taken a specific drug, one would first subset the data set to the patients of interest and then call the same analytic functions on that sub-cohort. Exclusion works the same way, e.g., dropping patients who have ever taken a specific drug.

The general steps to this analytic approach would be:

  1. Write functions that perform the analysis on the entire cohort and that take the entire cohort (data set) as an input argument.
  2. Identify a list of patients using person_id that meet the desired inclusion or exclusion criteria.
  3. Create a new data set that is a subset or sub-cohort of the entire cohort (data set) that either includes or excludes patients identified using person_id in step 2.
  4. Re-run the analytic functions on the new sub-cohort data set as the input argument.
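
The four steps above can be sketched in base R on toy data. This is only an illustration of the pattern: toy_ds, summarize_cohort(), and subset_by_person() are hypothetical names invented here, and subset_by_person() is a stand-in that assumes every table carries a person_id column; it is not the package's implementation.

```r
# Toy stand-in for a data set: a named list of tables that share person_id.
# (Hypothetical structure for illustration; not real PicnicHealth data.)
toy_ds <- list(
  person = data.frame(person_id = 1:6, age = c(25, 41, 50, 51, 63, 70)),
  visit  = data.frame(person_id = c(1, 1, 2, 4, 5, 5, 5, 6),
                      visit_id  = 101:108)
)

# Step 1: an analysis function that takes a whole data set as its argument.
summarize_cohort <- function(ds) {
  data.frame(
    n_patients = nrow(ds$person),
    median_age = median(ds$person$age),
    n_visits   = nrow(ds$visit)
  )
}

# Hypothetical helper: keep (or drop) the rows for the given ids in every table.
subset_by_person <- function(ds, ids, operation = c("include", "exclude")) {
  operation <- match.arg(operation)
  lapply(ds, function(tbl) {
    keep <- tbl$person_id %in% ids
    if (operation == "exclude") keep <- !keep
    tbl[keep, , drop = FALSE]
  })
}

# Step 2: identify the person_ids that meet the criterion.
young_ids <- toy_ds$person$person_id[toy_ds$person$age <= 50]

# Step 3: build the sub-cohort data set.
toy_young <- subset_by_person(toy_ds, young_ids)

# Step 4: re-run the same analysis function on each cohort.
full_result  <- summarize_cohort(toy_ds)
young_result <- summarize_cohort(toy_young)
```

Because the analysis function only ever sees a data set, it does not need to know how the cohort was defined; changing the criterion in step 2 is enough to produce a new result.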

See below for some hands-on implementation examples using the PicnicHealth R package which has a number of built-in functions to help facilitate this. Note that we will use a small, synthetic data set that may not necessarily be clinically plausible in all cases and is only used to demonstrate functionality.

PicnicHealth R Package

Once a PicnicHealth data set is loaded into memory, we can begin to build cohorts based on different inclusion and exclusion criteria.

To demonstrate this, we will create a standard “Table 1” as a very simple example of an analysis that could be done for each of three different cohorts:

  • Entire cohort

  • Patients 50 years of age or younger

  • Patients older than 50 years of age

Loading Data

We can load a PicnicHealth data set into memory using the load_data_set() function. A PicnicHealth data set is an unzipped directory of CSV files and we need to specify the full path to that data set. See help("load_data_set") for more information.

library(PicnicHealth)
library(tidyverse)
ds = load_data_set("path/to/data")
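
For intuition only, a directory of CSV files can be pictured as a named list of data frames, one per file. The base-R sketch below builds such a list from a temporary directory; the file names and columns are hypothetical, and this is not a description of how load_data_set() works internally.

```r
# Write two tiny CSV files into a temporary directory (hypothetical file names).
data_dir <- file.path(tempdir(), "toy_data_set")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(person_id = 1:3),
          file.path(data_dir, "person.csv"), row.names = FALSE)
write.csv(data.frame(person_id = c(1, 1, 3), visit_id = 1:3),
          file.path(data_dir, "visit.csv"), row.names = FALSE)

# Read every CSV in the directory into a named list, one data frame per file.
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
toy_ds <- lapply(csv_files, read.csv)
names(toy_ds) <- tools::file_path_sans_ext(basename(csv_files))

# Each element is an ordinary data frame that can be inspected by name.
str(toy_ds$visit)
```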

Cohort 1: Entire Cohort

Since a PicnicHealth data set is already the entire cohort, we can proceed directly to running our analysis. Here the analysis is a simple call to the table_one() function, with the entire data set ds passed in as the input argument.

table_one(ds)
Characteristic                      N = 50¹
Sex
    Female                          39 (78%)
    Male                            11 (22%)
Race
    Black or African American       1 (2.0%)
    More than one race              3 (6.0%)
    Unknown                         1 (2.0%)
    White                           45 (90%)
Ethnicity
    Hispanic or Latino              2 (4.0%)
    Not Hispanic or Latino          47 (94%)
    Prefer not to say               1 (2.0%)
Age                                 44 (35, 56)
Total Years of Clinical Documents   16.6 (12.1, 19.2)
Total Years of Visits               15 (10, 19)
Number of Providers                 0 (0, 0)
Number of Care Sites                0 (0, 0)
Number of Hospitalizations          1 (0, 3)
Total Hospital Days                 1.00 (0.00, 3.00)
¹ n (%); Median (Q1, Q3)

Cohort 2: Patients 50 years of age or younger

To create a cohort of younger patients, we first identify the person_ids of patients 50 years of age or younger. The PicnicHealth R package has a helper function, get_person_age(), that makes this easy. We then pass the entire data set ds, along with the vector of patients we identified (young_patient_ids), to filter_data_set(), which creates a new data set that includes only those patients. filter_data_set() programmatically subsets every table in the PicnicHealth data set list object, ds, keeping or removing the rows associated with the vector of person_ids according to the operation argument; see help(filter_data_set) for more information. Because we did not specify operation, the default operation = "include" applies, so the vector acts as an inclusion criterion. We can then perform the analysis, a simple table_one() in this case, on the newly created sub-cohort.

young_patient_ids = ds %>%
  get_person_age() %>% 
  filter(age <= 50) %>%
  pull(person_id)

young_cohort = filter_data_set(ds, young_patient_ids)

table_one(young_cohort)
Characteristic                      N = 29¹
Sex
    Female                          22 (76%)
    Male                            7 (24%)
Race
    Black or African American       1 (3.4%)
    More than one race              2 (6.9%)
    White                           26 (90%)
Ethnicity
    Hispanic or Latino              2 (6.9%)
    Not Hispanic or Latino          26 (90%)
    Prefer not to say               1 (3.4%)
Age                                 38.3 (31.2, 40.7)
Total Years of Clinical Documents   16.4 (12.1, 19.1)
Total Years of Visits               15.8 (10.6, 19.2)
Number of Providers                 0 (0, 0)
Number of Care Sites                0 (0, 0)
Number of Hospitalizations          2 (1, 3)
Total Hospital Days                 2.00 (1.00, 3.00)
¹ n (%); Median (Q1, Q3)

Cohort 3: Patients older than 50 years of age

To create the complementary sub-cohort of patients older than 50, we could proceed as above: identify the patients older than 50 and use filter_data_set() to include only those patients. Alternatively, instead of defining a new list of patients, we can reuse young_patient_ids with operation = "exclude", which creates a new data set that excludes patients aged 50 or younger and therefore contains only patients older than 50. Note that we pass in the original, entire data set ds, not young_cohort. We can then perform the analysis, a simple table_one() in this case, on the newly created sub-cohort.

older_cohort = filter_data_set(ds, young_patient_ids, operation = "exclude")

table_one(older_cohort)
Characteristic                      N = 21¹
Sex
    Female                          17 (81%)
    Male                            4 (19%)
Race
    More than one race              1 (4.8%)
    Unknown                         1 (4.8%)
    White                           19 (90%)
Ethnicity
    Not Hispanic or Latino          21 (100%)
Age                                 62 (54, 68)
Total Years of Clinical Documents   17.4 (12.6, 19.3)
Total Years of Visits               14 (10, 18)
Number of Providers                 0 (0, 0)
Number of Care Sites                0 (0, 0)
Number of Hospitalizations          1 (0, 2)
Total Hospital Days                 1.00 (0.00, 2.00)
¹ n (%); Median (Q1, Q3)
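
One point worth checking when a cohort is built by exclusion: everyone who is not in the inclusion list ends up in the complement, which would include any patient whose age is missing. Whether get_person_age() can return missing ages is an assumption here and is not stated above. The base-R sketch below uses hypothetical toy ages to contrast "exclude the young ids" with an explicit age > 50 criterion, and to check that the include and exclude cohorts partition the full cohort.

```r
# Hypothetical toy ages; patient 4 has a missing age.
ages <- data.frame(person_id = 1:5, age = c(30, 50, 51, NA, 72))

# Inclusion list, built with the same criterion as young_patient_ids (age <= 50).
# which() drops the NA comparison, as dplyr::filter() also does.
young_ids <- ages$person_id[which(ages$age <= 50)]

# Complement by exclusion: everyone not in young_ids, including the NA-age patient.
older_by_exclusion <- setdiff(ages$person_id, young_ids)

# Explicit criterion: only patients with a known age above 50.
older_by_criterion <- ages$person_id[which(ages$age > 50)]

# Include and exclude cohorts never overlap and together cover the full cohort.
is_partition <- length(intersect(young_ids, older_by_exclusion)) == 0 &&
  setequal(union(young_ids, older_by_exclusion), ages$person_id)

# Patients who land in the "older" cohort only because their age is unknown.
unknown_age_ids <- setdiff(older_by_exclusion, older_by_criterion)
```

In the example tables above, the Sex counts give 29 patients in the younger cohort and 21 in the older cohort, which together account for all 50 patients in the entire cohort.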