
Introduction

Motivating Question: How can we use a PicnicHealth dataset to calculate periods of time in which individual patients have sufficient data density to be included in an analysis?

In any Real-World Data (RWD) dataset, some intervals in a patient’s timeline may have a high density of healthcare events such as visits, lab results, and procedures, particularly around the time of diagnosis or disease flare-ups. At other times, data may appear sparse, with few or no events. Sparse intervals may simply reflect changes in healthcare utilization, disease management, or disease severity, but they may also reflect missing records due to changes in insurance or healthcare providers.

To address patient-level variation in data density over time, PicnicHealth analysts derive observation periods: periods of time that meet minimum data density thresholds for completeness. We recommend a standard definition of observation periods that requires at least one outpatient visit with a primary care provider or a disease-specific specialist every 18 months. This threshold is based on PicnicHealth’s clinical expertise and experience across a wide range of disease cohorts, and it allows for real-world considerations such as individual patient variation in scheduling routine visits, disease management, and insurance reimbursement requirements for minimum length of time between visits. Note that this cadence is a general recommendation. It may be adapted with more targeted definitions, for example by including inpatient visits, changing the qualifying specialists, or changing the time period, depending on the nature of the disease of interest. See the Additional Considerations section below for other analytic approaches.

PicnicHealth’s default recommended algorithm:

  1. Filter the visit table to visit_type = "Outpatient Visit"
  2. Join to the document table on visit_id and person_id
  3. Filter to Primary Care Physicians (PCPs) and disease-specific specialists of interest via the specialty field:
  • PicnicHealth recommends defining PCPs using the following values in the specialty field:

    • "Adolescent Medicine"
    • "Family Medicine"
    • "Geriatric Medicine"
    • "Internal Medicine"
    • "Osteopathic Medicine"
    • "Pediatrics"
    • "Preventative Medicine"
  • Specialists of interest are disease-specific. For example, in an analysis of Immunoglobulin A Nephropathy patients, using the following values in the specialty field might be considered as relevant specialist visits:

    • "Gastroenterology"
    • "Hematology/Oncology"
    • "Nephrology"
    • "Pediatric Gastroenterology"
    • "Pediatric Hematology/Oncology"
    • "Pediatric Nephrology"
    • "Pediatric Rheumatology"
    • "Rheumatology"
    • "Urology"
  4. Generate observation periods from these qualifying visits:
  • Order each visit by visit_start_date for each patient
  • De-duplicate events on the same visit_start_date for each patient
  • Calculate the number of days from previous visit_start_date to the following visit_start_date
  • Calculate a rolling sum of event counts that occur within the window size of interest
    • PicnicHealth recommends 18 months or 548 days as a default window size for most cohorts and diseases
  • Extract the periods of time during which the minimum required number of events within the window size of interest is met, creating observation eras in which minimum expected healthcare utilization is satisfied
  5. As a final step, depending on the details of the analysis, the analyst can either use the single longest continuous observation period for each patient or include data from all observation periods, whether continuous or discontinuous, to compute total person-time.
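
The era-building step above can be sketched in base R on toy data. This is a simplified illustration, not the code behind get_observation_periods(): it assumes a minimum of one qualifying visit per window, so eras are formed by splitting each patient’s de-duplicated visit dates wherever the gap between consecutive visits exceeds the window size. The visits data frame, the era_id column, and the choice to keep single-visit eras are assumptions made for illustration only.

```r
# Toy qualifying visits (hypothetical data, not from a PicnicHealth data set)
visits <- data.frame(
  person_id = c("A", "A", "A", "A", "B", "B"),
  visit_start_date = as.Date(c(
    "2015-01-05", "2015-01-05", "2015-07-24", "2017-04-28",
    "2019-11-14", "2020-03-02"
  ))
)
window_size <- 548  # 18 months, in days

# Order visits by date within patient and drop duplicate dates
visits <- unique(visits[order(visits$person_id, visits$visit_start_date), ])

# Days since the same patient's previous visit (NA for the first visit)
gap_days <- ave(
  as.numeric(visits$visit_start_date),
  visits$person_id,
  FUN = function(d) c(NA, diff(d))
)

# A new era starts at the first visit or after a gap longer than the window
new_era <- is.na(gap_days) | gap_days > window_size
visits$era_id <- ave(as.integer(new_era), visits$person_id, FUN = cumsum)

# Collapse each patient-era to a single row
eras <- do.call(rbind, lapply(
  split(visits, list(visits$person_id, visits$era_id), drop = TRUE),
  function(v) data.frame(
    person_id = v$person_id[1],
    period_start = min(v$visit_start_date),
    period_end = max(v$visit_start_date),
    n_event_dates = nrow(v)
  )
))

# Inclusive length in days, matching the period_length convention used later
eras$period_length <- as.integer(eras$period_end - eras$period_start) + 1L
eras <- eras[order(eras$person_id, eras$period_start), ]
rownames(eras) <- NULL
eras
```

In this toy example, patient A’s 2017 visit falls more than 548 days after the previous one, so it starts a separate era. Whether an isolated visit should count as an observation period at all is an analytic decision that this sketch does not settle.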

See below for hands-on implementation examples of this algorithm using the PicnicHealth R package, which has a number of built-in functions to facilitate such analyses. Analysts programming in languages other than R, such as SAS or Python, may still benefit from examining the R code steps, help files, and the underlying code of the get_observation_periods() function in the PicnicHealth R package.

Additional Considerations

PicnicHealth’s Abstraction Process

PicnicHealth synthesizes Real-World Data (RWD) across a broad spectrum of sources such as structured and unstructured portions of patients’ medical records, both electronic and paper records, as well as linkage to claims databases, electronic Patient Reported Outcomes (ePROs), and more. As with any RWD dataset, there may be some intervals in a patient’s timeline when there is a high density of healthcare events such as visits, lab results, procedures, etc. and other intervals where the data appears sparse, with few or no events present. This may be a reflection of individual patient variation in healthcare resource utilization (HCRU) and may be dependent on the nature of the disease of interest. For example:

  • We may expect a high density of visits around acute disease flare-ups, such as vaso-occlusive events in the case of sickle cell disease, versus fewer visits when disease care is well-managed.
  • Some conditions may have a high density of visits immediately after diagnosis, quickly followed by additional imaging visits, pathology reports, and treatment procedures. That flurry of initial activity may gradually subside as initial treatment courses are completed. For example, in breast cancer, a positive screening mammogram may be quickly followed by lumpectomy, chemotherapy, and radiation treatment. Completing these first-line treatments may take several months, after which the density of visits may return to routine surveillance levels, although treatments such as hormone therapy may last for several years.
  • Disease severity may impact expected HCRU for particular conditions.

However, sparse data may also be due to missing records resulting from changes in insurance or healthcare providers, or patients not reporting all providers.

What sets PicnicHealth RWD solutions apart is that patients provide blanket consent to retrieve all records, from any provider, at any care site in the U.S., and across the entire patient’s health journey longitudinally, both retrospectively and prospectively. This allows PicnicHealth to build more complete patient timelines using a multi-pronged approach to identify mentions of referrals to additional providers and care sites as well as mentions of additional procedures and test results to target record retrieval and to maximize completeness of records. Machine learning predictions are surfaced in the medical records and leveraged by highly trained abstractors who transform records into timelines for patients and fit-for-purpose research datasets for our partners.

Other Considerations for Computing Observation Eras

Importantly, data density may also be intrinsically tied to study inclusion/exclusion criteria such as requiring a minimum number of vaso-occlusive events per year in the case of sickle cell disease or patients who have a history of a particular drug regimen that can serve as an external control arm. Below we discuss a few modifications and approaches to calculate observation eras and person-time for HCRU as well as assist in defining cohorts by implementation of inclusion/exclusion criteria.

  1. The simplest approach to define observation periods at the patient level is to calculate the time from a patient’s earliest visit to their most recent visit using data available in the visit table. Although this approach is simple, intuitive, and easy to explain, this may overestimate available person-years as some patients may have periods of sparse data such as in the early years of their available medical history long before disease onset or diagnosis.
  2. Another approach could be to require a minimum number of outpatient visits over time with a defined list of providers that form a primary care team as a simple measure that all available medical records have been retrieved and abstracted.
  3. Similarly, an analyst could require a minimum number of events of interest such as disease-specific procedures, or measurements such as blood draws in addition to PCP and specialist visit approaches above.
  4. Finally, as noted previously, the nature of the disease may have targeted impacts on how qualifying criteria should be considered. Diseases with frequent inpatient stays, extended hospitalizations, or high mortality rates may benefit from the inclusion of inpatient visits when calculating observation periods, as these patients may not have as regularly scheduled routine care as may be the case with acid sphingomyelinase deficiency disease (ASMD).
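
As a point of comparison, the first approach in the list above (earliest visit to most recent visit) can be sketched in a few lines of base R. The visits data frame here is hypothetical toy data, and the column names first_visit, last_visit, span_days, and span_years are illustrative choices, not part of the PicnicHealth data model.

```r
# Toy visit data (hypothetical, for illustration only)
visits <- data.frame(
  person_id = c("A", "A", "A", "B", "B"),
  visit_start_date = as.Date(c(
    "2006-12-21", "2012-07-06", "2023-10-02",
    "2019-11-14", "2020-03-02"
  ))
)

# Split visit dates by patient, then take the earliest and latest date
by_patient <- split(visits$visit_start_date, visits$person_id)
spans <- data.frame(
  person_id = names(by_patient),
  first_visit = do.call(c, unname(lapply(by_patient, min))),
  last_visit = do.call(c, unname(lapply(by_patient, max)))
)

# Inclusive span in days and approximate person-years
spans$span_days <- as.integer(spans$last_visit - spans$first_visit) + 1L
spans$span_years <- round(spans$span_days / 365.25, 1)
spans
```

Patient A’s span covers nearly 17 years even though only three visits fall inside it, which illustrates how this approach can overestimate available person-time.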

With the implementation of HIPAA record retention requirements, facilities generally maintain 7 years of records on average, with some facilities retaining up to 10 years. One may therefore expect a higher density of visits and more robust coverage in the most recent 5-10 years of data. This is highly disease dependent, but our general recommendation in most cases is to focus on the more data-dense, recent years of a patient’s healthcare journey, that is, periods of time in which patients meet a minimum expected threshold for healthcare utilization. This may be particularly important if study outcomes of interest involve more recent treatments or recent changes to standard of care.

PicnicHealth R Package

Once a PicnicHealth data set is loaded into memory, we can use the get_observation_periods() function in the PicnicHealth R package to robustly define and calculate the periods of time during which patients have a specified density of event data. See help(get_observation_periods) for more information. Note here that we use a very small sample dataset for demonstration purposes only.

Loading Data

We can load a PicnicHealth data set into memory using the load_data_set() function. A PicnicHealth data set is an unzipped directory of CSV files and we need to specify the full path to that data set. See help(load_data_set) for more information.

library(PicnicHealth)
library(tidyverse)
# Replace with the full path to the unzipped data set directory
ds <- load_data_set("path/to/data")

Basic Example

To calculate observation periods, we must specify:

  • the types of healthcare events of interest, and
  • the minimum data density cutoff.

Specifically, the events of interest are given via a table (data.frame or tibble), usually from the PicnicHealth data set object, and a specification of which column of dates to use. The minimum data density cutoff is specified as the minimum number of healthcare events that a patient must have over a specified period of time.

For example, we can use any visit in the visit table from the PicnicHealth data set object, and the "visit_start_date" column in that table. We can specify that the observation periods of interest are the intervals between events (in this case, visits) during which the patient has at least 2 events (technically, 2 distinct dates with events) in each 365-day period. Here we show a sample of output from the get_observation_periods() function:

get_observation_periods(
  table = ds$visit,
  dates = "visit_start_date",
  window_size = 365,
  min_utilization = 2
) %>% 
  dplyr::slice_head(n = 10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()

person_id                             period_start  period_end  period_length  n_event_dates
005da1fb-e3bd-45c2-871f-1714d563c6c5  2015-01-05    2015-07-24  201            3
005da1fb-e3bd-45c2-871f-1714d563c6c5  2017-04-28    2023-09-22  2339           61
012f2aac-d70e-4bee-adee-ad7924e43f98  2006-12-21    2007-07-07  199            3
012f2aac-d70e-4bee-adee-ad7924e43f98  2012-07-06    2012-10-22  109            2
012f2aac-d70e-4bee-adee-ad7924e43f98  2016-01-26    2023-10-02  2807           41
018c06b3-ff03-4618-9fae-ed4e65e0e352  2018-11-08    2019-11-14  372            4
018c06b3-ff03-4618-9fae-ed4e65e0e352  2021-03-10    2021-06-11  94             2
0628337f-bede-4ddb-a1a1-a04fd09f28a9  2019-11-14    2020-03-02  110            3
0628337f-bede-4ddb-a1a1-a04fd09f28a9  2022-05-23    2022-10-04  135            3
0d78fd7c-4d52-4247-8017-1427592032e4  2014-04-21    2014-07-17  88             2

PicnicHealth’s Default Algorithm

Here we demonstrate PicnicHealth’s default recommended algorithm, described in detail above. In brief, it requires at least one outpatient visit with a primary care provider or a disease-specific specialist every 18 months (548 days).

# Define primary care providers
primary_care_providers <- c(
  "Adolescent Medicine", "Family Medicine", "Geriatric Medicine", 
  "Internal Medicine", "Osteopathic Medicine", "Pediatrics", 
  "Preventative Medicine"
)
  
# Define specialists of interest using Huntington's Disease as an example
specialist_of_interest <- c(
  "Neurology", "Family Medicine", "Internal Medicine", "Physical Therapy", 
  "Nursing", "Psychiatry", "Sleep Medicine", "Speech Pathology"
)

observation_periods <- ds$visit %>%
  # Filter to outpatient visits
  filter(visit_type == "Outpatient Visit") %>%
  # Join to the `document` table to get the `specialty` column
  inner_join(x = .,
             y = ds$document,
             join_by(visit_id, person_id)) %>% 
  # Filter to primary care providers and specialists of interest
  filter(specialty %in% c(primary_care_providers, specialist_of_interest)) %>%
  # Calculate observation periods using PicnicHealth::get_observation_periods()
  get_observation_periods(
    table = .,
    dates = "visit_start_date",
    window_size = 548,
    min_utilization = 1
  ) 

# See the first 10 as a brief example
observation_periods %>%
  slice_head(n = 10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()

person_id                             period_start  period_end  period_length  n_event_dates
005da1fb-e3bd-45c2-871f-1714d563c6c5  2015-01-05    2023-09-22  3183           53
012f2aac-d70e-4bee-adee-ad7924e43f98  2007-01-16    2008-02-08  389            2
012f2aac-d70e-4bee-adee-ad7924e43f98  2012-07-06    2012-10-22  109            2
012f2aac-d70e-4bee-adee-ad7924e43f98  2016-01-26    2023-10-02  2807           38
018c06b3-ff03-4618-9fae-ed4e65e0e352  2017-05-31    2021-06-11  1473           7
0628337f-bede-4ddb-a1a1-a04fd09f28a9  2019-11-14    2020-11-13  366            4
0628337f-bede-4ddb-a1a1-a04fd09f28a9  2022-05-23    2023-07-06  410            3
0d78fd7c-4d52-4247-8017-1427592032e4  2014-04-21    2014-07-17  88             2
0d78fd7c-4d52-4247-8017-1427592032e4  2017-12-01    2022-12-16  1842           19
16ba6e82-437f-4954-8939-086ebcb72032  2011-06-15    2014-06-24  1106           5
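
The final step of the algorithm (choosing between the single longest period and all periods) can be sketched in base R. To keep the example self-contained, the data frame below re-creates a few rows of the sample output above by hand rather than calling the package. The names longest, total, and person_years are illustrative.

```r
# A few rows copied by hand from the sample output above
observation_periods <- data.frame(
  person_id = c(
    rep("012f2aac-d70e-4bee-adee-ad7924e43f98", 3),
    rep("0628337f-bede-4ddb-a1a1-a04fd09f28a9", 2)
  ),
  period_start = as.Date(c(
    "2007-01-16", "2012-07-06", "2016-01-26", "2019-11-14", "2022-05-23"
  )),
  period_end = as.Date(c(
    "2008-02-08", "2012-10-22", "2023-10-02", "2020-11-13", "2023-07-06"
  )),
  period_length = c(389, 109, 2807, 366, 410)
)

# Option 1: keep only the longest period per patient
# (ties, if any, would keep every period of maximal length)
max_length <- ave(
  observation_periods$period_length,
  observation_periods$person_id,
  FUN = max
)
longest <- observation_periods[observation_periods$period_length == max_length, ]

# Option 2: total person-time across all periods, continuous or not
total <- aggregate(period_length ~ person_id, data = observation_periods, FUN = sum)
total$person_years <- total$period_length / 365.25

longest
total
```

Option 1 suits analyses that need one uninterrupted follow-up window per patient. Option 2 retains more person-time but spans gaps in which events may have gone unrecorded.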

Behind the scenes

The get_observation_periods() algorithm for calculating observation periods begins by computing the “healthcare utilization” (HCU) of each patient by date, in relation to some “window size”. A patient’s healthcare utilization on date x is defined as the number of distinct days on which they had a healthcare event within the window centered on date x, i.e. within window_size/2 days before x and window_size/2 days after x (inclusive).
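
The definition above can be written as a small base R function. This is a sketch of the stated definition only, not the package’s internal code: the function name hcu() and its arguments are hypothetical, and treating the half-window as window_size / 2 without rounding is an assumption.

```r
# Healthcare utilization on each date in `x`: the number of distinct event
# dates within window_size / 2 days before or after that date (inclusive).
# hcu() is a hypothetical helper, not a PicnicHealth function.
hcu <- function(x, event_dates, window_size) {
  events <- as.numeric(unique(as.Date(event_dates)))
  half_window <- window_size / 2
  vapply(
    as.numeric(as.Date(x)),
    function(d) sum(abs(events - d) <= half_window),
    numeric(1)
  )
}

# Toy visit dates for one patient; the repeated date is counted once
visit_dates <- as.Date(c(
  "2020-01-10", "2020-01-10", "2020-03-01", "2020-09-15", "2022-06-01"
))

# Utilization on three query dates, using a 365-day window
query_dates <- as.Date(c("2020-02-01", "2020-06-01", "2021-06-01"))
utilization <- hcu(query_dates, visit_dates, window_size = 365)
utilization
```

The middle query date sits within half a window of three distinct visit dates, while the last one is more than half a window from every visit and so has a utilization of zero.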

We can visualize a patient’s healthcare utilization for visit data with the plot_visit_hcu() function:

# Pick one patient from the person table as an example
example_patient <- ds$person$person_id[13]
plot_visit_hcu(ds, patient = example_patient, window_size = 365, min_utilization = 2)

Here, we’ve used a window size of 365 days. The height of the line on the graph at a given date corresponds to the number of distinct event dates (in this case, distinct dates with visits) the patient had during the year-long window centered on that date. The red plus signs indicate the exact dates of the visits.

Once the healthcare utilization values have been calculated, the function identifies the continuous periods during which the HCU is at least the user-specified min_utilization. These are the observation periods.
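
One way to picture this thresholding step is to evaluate utilization on a daily grid and collect the runs of consecutive days that meet the cutoff. The sketch below does this in base R with a hypothetical hcu() helper and toy dates. It is an assumption-laden illustration: it restricts the grid to the span between the first and last visit and reports raw runs, and it does not attempt to reproduce how get_observation_periods() sets period boundaries.

```r
# Hypothetical helper: distinct event dates within window_size / 2 days of x
hcu <- function(x, event_dates, window_size) {
  events <- as.numeric(unique(as.Date(event_dates)))
  vapply(
    as.numeric(as.Date(x)),
    function(d) sum(abs(events - d) <= window_size / 2),
    numeric(1)
  )
}

# Toy visit dates for one patient (hypothetical)
visit_dates <- as.Date(c("2020-01-10", "2020-03-01", "2022-06-01", "2022-08-01"))
window_size <- 365
min_utilization <- 2

# Daily grid from the first to the last visit (an assumption of this sketch)
grid <- seq(min(visit_dates), max(visit_dates), by = "day")
meets_cutoff <- hcu(grid, visit_dates, window_size) >= min_utilization

# Run-length encode the TRUE/FALSE sequence to find continuous stretches
runs <- rle(meets_cutoff)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only the runs in which the cutoff is met
periods <- data.frame(
  period_start = grid[run_start[runs$values]],
  period_end = grid[run_end[runs$values]]
)
periods
```

The two pairs of visits are far enough apart that utilization drops below the cutoff between them, so the sketch returns two separate periods.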

Additional details

For more information, and details on the additional arguments to the function, see the documentation for get_observation_periods().