Build a (Tabular Baseline) Model over MEDS data

In this tutorial, we'll use the dataset we extracted previously and build a model to predict the task we defined. See below for the Jupyter notebook tutorial, rendered here, or check it out online on Google Colab or on our GitHub Repository.

Building a Model

In this part of the tutorial, we'll walk through building a model with MEDS data.

We will continue exactly where we've left off in the previous part of the tutorial. Our task is to predict a short ICU stay (of < 3 days) given the patient's input data. We have already transformed the dataset into the MEDS format and extracted a set of label files for the dataset.

Let's download the files we produced in the prior sections from the tutorial resources:

%%bash
wget -q -c https://github.com/Medical-Event-Data-Standard/MEDS_KDD_2025_Tutorial/raw/refs/heads/main/MEDS_data.zip

unzip -q -o MEDS_data.zip

apt-get -qq install tree
tree MEDS_data
MEDS_data
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   └── 0.parquet
│   └── tuning
│       └── 0.parquet
├── labels
│   └── short_LOS
│       ├── held_out
│       │   └── 0.parquet
│       ├── train
│       │   └── 0.parquet
│       └── tuning
│           └── 0.parquet
└── metadata
    ├── codes.parquet
    ├── dataset.json
    └── subject_splits.parquet

10 directories, 9 files

We can see that we have our core "data" files in MEDS_data/data, our metadata files in MEDS_data/metadata, and our task labels in MEDS_data/labels/short_LOS. We'll store these paths in variables for ease of use later.

from pathlib import Path
data_root = Path("MEDS_data/data")
labels_root = Path("MEDS_data/labels/short_LOS")
metadata_root = Path("MEDS_data/metadata")

To help us get started, let's look at some of our core data files and our label files:

import pandas as pd

train_data = pd.read_parquet(data_root / "train")

# Remember that MEDS lets you store arbitrary other columns beyond those in the
# official schema -- we'll simplify things here by just keeping the main 5:
train_data = train_data[
  ["subject_id", "time", "code", "numeric_value", "text_value"]
]
display(train_data.head())

train_labels = pd.read_parquet(labels_root / "train")

# Similarly, we'll only keep the relevant boolean label in our labels
train_labels = train_labels[["subject_id", "prediction_time", "boolean_value"]]

display(train_labels.head())
subject_id time code numeric_value text_value
0 10000032 NaT GENDER//F NaN None
1 10000032 2128-01-01 00:00:00 MEDS_BIRTH NaN None
2 10000032 2180-03-23 11:51:00 LAB//51087//UNK NaN None
3 10000032 2180-03-23 11:51:00 LAB//50940//UNK NaN None
4 10000032 2180-03-23 11:51:00 LAB//51079//UNK NaN NEG
subject_id prediction_time boolean_value
0 10002428 2156-04-13 16:24:18 False
1 10002428 2156-04-20 18:11:19 False
2 10002428 2156-05-01 21:53:00 True
3 10002428 2156-05-12 14:49:34 False
4 10002495 2141-05-23 20:18:01 False

Now that we have our files again and know what our data looks like, we will learn how to use MEDS datasets to train a couple of different kinds of models.

First, it is good practice to start with a baseline model over tabularized features representing the patient's data. We will start by extracting these features manually, which is viable on our demo dataset given its small size. We will also show how the MEDS-Tab tool can scale this process up to much larger datasets.

Once that is done, we will show how you could build a simple neural network model for this task, highlighting a few other tools in the process.

Let's dive in!

A Tabular Baseline

MEDS Format vs Tabularized Data

The first step in training a tabular model on a MEDS dataset is to prepare the data, as there are some important differences between the MEDS representation and the usual "tabular" representation:

  • MEDS has the "one row = one event" structure. Every patient will usually have many events (such as birth, death, hospital admission, lab results, diagnostic codes, etc.) associated with them, and each of these events will be represented in their own row.
  • Tabular data is typically treated as having a "one row = one training sample" structure, which would correspond to each patient -- along with their whole history of measurements -- being represented as a single row in the dataset.

For example, we can see that the first subject in our MEDS training dataset has quite a few measurements (rows) associated with them.

sample_subject = train_labels.iloc[0]["subject_id"]
display(train_data[train_data['subject_id'] == sample_subject])
subject_id time code numeric_value text_value
4860 10002428 NaT GENDER//F NaN None
4861 10002428 2075-01-01 00:00:00 MEDS_BIRTH NaN None
4862 10002428 2154-08-12 00:00:00 Weight (Lbs) NaN 98
4863 10002428 2154-08-12 00:00:00 BMI (kg/m2) NaN 19.6
4864 10002428 2154-08-12 00:00:00 Blood Pressure NaN 158/98
... ... ... ... ... ...
41436 10002428 2160-07-16 18:49:00 DIAGNOSIS//ICD//10//S4992XA NaN None
41437 10002428 2160-07-16 18:49:00 DIAGNOSIS//ICD//10//Z9181 NaN None
41438 10002428 2160-07-16 18:49:00 DIAGNOSIS//ICD//10//M3500 NaN None
41439 10002428 2160-07-16 18:49:00 TRANSFER_TO//admit//Emergency Department Obser... NaN None
41440 10002428 2160-07-16 19:15:04 TRANSFER_TO//discharge//UNK NaN None

36581 rows × 5 columns

If we want to train a simple baseline model (e.g., logistic regression, XGBoost, etc.), all these measurements will eventually have to be summarized into a single, fixed-size representation, and the overall structure of our tabularized dataset will look more like this:

subject_id   feature_1   feature_2   ...
10000032     ?           ?           ...
10001217     ?           ?           ...
...          ...         ...         ...

Note there's one more complexity here -- we don't actually want all of the subject's measurements included in that fixed-size tabular input. Instead, we just want those that are "permitted inputs" for a given task sample. Let's look at our label dataframe again:

train_labels[train_labels["subject_id"] == sample_subject]
subject_id prediction_time boolean_value
0 10002428 2156-04-13 16:24:18 False
1 10002428 2156-04-20 18:11:19 False
2 10002428 2156-05-01 21:53:00 True
3 10002428 2156-05-12 14:49:34 False

We can see that there are actually multiple "task samples" for this subject, each with a different prediction_time! Moreover, using all of the subject's available data for every one of these samples could lead to data leakage -- that is, "future" patient data being available for an earlier prediction time.

What this means is that we want to find a way to summarize all of the patient's data up to and including each of these time points separately, and use the appropriate summary to predict each sample's label -- adding another dimension to our "tabularization process".
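For instance, here is a minimal, purely illustrative sketch (reusing sample_subject from above) of restricting a subject's history to what is visible at a single prediction time:

# Keep only the measurements visible at one prediction time for sample_subject,
# so no "future" events can leak into that sample's features.
# (Note: static events with a null time, like GENDER, are dropped by this filter.)
sample_label = train_labels[train_labels["subject_id"] == sample_subject].iloc[0]
visible_history = train_data[
    (train_data["subject_id"] == sample_subject)
    & (train_data["time"] <= sample_label["prediction_time"])
]
print(f"{len(visible_history)} rows visible at {sample_label['prediction_time']}")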

Tabularization Strategy

Clearly, tabularization must follow this strategy: for each prediction time of interest, we pick a set of "features" we want to extract from the patient's data up to that time, and write code to extract those features. These features then become the columns of our tabular representation.

But with the MEDS format, we can be a bit more specific. In particular, we can ask "what is a feature when derived from MEDS data?" Given the structure of MEDS, an easy template to use is the following:

Given a patient's MEDS data and a prediction time, a "feature" is an aggregation function applied to a filtered set of measurements that occur in some relative or absolute window prior to the prediction time.

With this in mind, to define a feature, we just need to decide (1) what's our aggregation function, (2) what filtered set of measurements should we use, and (3) what is our window prior to the prediction time.
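As a purely illustrative sketch of this template (the function and argument names below are made up for exposition and are not part of any MEDS library), a single feature could be computed like this:

import pandas as pd

# One "feature" = an aggregation over a filtered set of codes within a window
# ending at the prediction time. All names here are illustrative only.
def extract_feature(data, subject_id, prediction_time, codes, window, agg):
    rows = data[
        (data["subject_id"] == subject_id)
        & (data["code"].isin(codes))
        & (data["time"] >= prediction_time - window)
        & (data["time"] <= prediction_time)
    ]
    return agg(rows["numeric_value"])

# e.g., the mean leukocyte count over the 24 hours before a prediction time:
# extract_feature(train_data, 10002428, some_prediction_time,
#                 ["LAB//51301//K/uL"], pd.Timedelta(days=1), pd.Series.mean)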

This is still a broad definition, but it lets us operationalize things very cleanly. Let's see this idea in action by constructing some features on the MEDS data we have.

Constructing the Features

So how do we featurize a MEDS dataset into a tabular form?

We could start by exploring all the event codes appearing in the MEDS dataset. To help us do that, here's a simple helper that will let us search through the columns of a dataframe interactively.

import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from functools import partial

def df_search(df: pd.DataFrame):
  # Create a Text widget for the search input
  search_input = widgets.Text(
      value='',
      placeholder='Enter search term...',
      description='Search:',
      disabled=False
  )

  # Function to filter the DataFrame based on the search term
  def filter_dataframe(df: pd.DataFrame, search_term: str):
      if not search_term:
        display(df)
      else:
        filtered_df = df[df.apply(lambda row: row.astype(str).str.contains(search_term, case=False).any(), axis=1)]
        display(filtered_df)

  # Create an interactive output to display the filtered DataFrame
  output = widgets.interactive_output(
      partial(filter_dataframe, df=df), {'search_term': search_input}
  )

  # Display the search input and the output
  display(search_input, output)

Let's start by exploring the codes.parquet dataframe within a MEDS dataset's metadata directory:

metadata_df = pd.read_parquet(metadata_root / "codes.parquet")
df_search(metadata_df)

We can also search through the code strings or text values associated with them in the training data directly:

df_search(train_data[["code", "text_value"]].drop_duplicates())

For MIMIC-IV data specifically, we can also use another resource from the raw data, which can help interpret some of the less informative codes in the training data (e.g., try searching for 220046 in metadata_df and train_data, as well as in the table below).

!wget -q -c https://physionet.org/files/mimic-iv-demo/1.0/icu/d_items.csv.gz
mimic_icu_d_items = pd.read_csv("d_items.csv.gz")
df_search(mimic_icu_d_items)

However, using every code appearing in the train dataset as a feature leads to two challenges (the quick check below illustrates the first):

  • There might be a lot of missing data, since most of these codes are rare and only appear for a few patients (especially once we narrow down our cohort as part of the task and label extraction process);
  • Many of these codes can have many measurements over time.
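As a quick, illustrative check of how skewed the code distribution is (using only objects already defined above):

# Count how often each code appears in the training split and report how many
# codes occur only a handful of times.
code_counts = train_data["code"].value_counts()
print(f"{len(code_counts)} distinct codes in the training split")
print(f"{(code_counts <= 5).mean():.0%} of codes appear 5 times or fewer")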

For example, let's look for the codes measuring the white blood cell (leukocyte) count, as those could indicate an infection.

display(metadata_df[metadata_df['description'].str.contains('leukocyte', case=False, na=False)].head(30))
code description parent_codes itemid possibly_cpt_code valueuom
33 LAB//51124//% Plasma cells/100 leukocytes in Peritoneal flui... [LOINC/40518-3] [51124] [None] [%]
85 LAB//52272//% Monocytes/100 leukocytes in Cerebral spinal fluid [LOINC/26486-1] [52272] [None] [%]
127 LAB//51118//% Mesothelial cells/100 leukocytes in Peritoneal... [LOINC/30432-9] [51118] [None] [%]
133 LAB//51448//% Mesothelial cells/100 leukocytes in Pleural fluid [LOINC/30431-1] [51448] [None] [%]
146 LAB//51263//% Plasma cells/100 leukocytes in Blood [LOINC/13047-6] [51263] [None] [%]
222 LAB//51376//% Macrophages/100 leukocytes in Synovial fluid [LOINC/33376-5] [51376] [None] [%]
310 LAB//51244//% Lymphocytes/100 leukocytes in Blood by Automat... [LOINC/736-9] [51244] [None] [%]
329 LAB//51427//% Lymphocytes/100 leukocytes in Body fluid [LOINC/11031-2] [51427] [None] [%]
458 LAB//51117//% Macrophages/100 leukocytes in Peritoneal fluid... [LOINC/40517-5] [51117] [None] [%]
492 LAB//51112//% Basophils/100 leukocytes in Peritoneal fluid [LOINC/35069-4] [51112] [None] [%]
493 LAB//51455//% Polymorphonuclear cells/100 leukocytes in Pleu... [LOINC/26519-9] [51455] [None] [%]
516 LAB//51301//K/uL Leukocytes [#/volume] in Blood by Automated count [LOINC/6690-2] [51301] [None] [K/uL]
525 LAB//51431//% Monocytes/100 leukocytes in Body fluid [LOINC/26487-9] [51431] [None] [%]
552 LAB//52220//% Leukocytes other/100 leukocytes in Blood [LOINC/26471-3] [52220] [None] [%]
558 LAB//51386//% Band form neutrophils/100 leukocytes in Body f... [LOINC/26510-8] [51386] [None] [%]
570 LAB//51116//% Lymphocytes/100 leukocytes in Peritoneal fluid [LOINC/26482-0] [51116] [None] [%]
581 LAB//51256//% Neutrophils/100 leukocytes in Blood by Automat... [LOINC/770-8] [51256] [None] [%]
639 LAB//51444//% Eosinophils/100 leukocytes in Pleural fluid [LOINC/30379-2] [51444] [None] [%]
686 LAB//51387//% Basophils/100 leukocytes in Body fluid [LOINC/28543-7] [51387] [None] [%]
760 LAB//51518///hpf Leukocyte clumps [Presence] in Urine sediment ... [LOINC/67848-2] [51518] [None] [/hpf]
779 LAB//51375//% Lymphocytes/100 leukocytes in Synovial fluid [LOINC/26483-8] [51375] [None] [%]
957 LAB//52135//% Immature granulocytes/100 leukocytes in Blood [LOINC/38518-7] [52135] [None] [%]
1016 LAB//51419//% Eosinophils/100 leukocytes in Body fluid [LOINC/26452-3] [51419] [None] [%]
1081 LAB//51428//% Macrophages/100 leukocytes in Body fluid [LOINC/30427-9] [51428] [None] [%]
1094 LAB//51259//% Leukocytes other/100 leukocytes in Blood [LOINC/26471-3] [51259] [None] [%]
1096 LAB//51516//#/hpf Leukocytes [#/area] in Urine sediment by Micro... [LOINC/5821-4] [51516] [None] [#/hpf]
1114 LAB//51254//% Monocytes/100 leukocytes in Blood by Automated... [LOINC/5905-5] [51254] [None] [%]
1174 LAB//51300//K/uL Leukocytes [#/volume] in Blood by Automated count [LOINC/6690-2] [51300] [None] [K/uL]
1181 LAB//51120//% Monocytes/100 leukocytes in Peritoneal fluid [LOINC/26488-7] [51120] [None] [%]
1221 LAB//51429//% Mesothelial cells/100 leukocytes in Body fluid [LOINC/28544-5] [51429] [None] [%]

There seem to be two codes corresponding to the leukocyte count, LAB//51300//K/uL and LAB//51301//K/uL.

If we group the dataset by these codes, we see that a patient can have many measurements of the same vital sign or biomarker.

mask = train_data['code'].isin(['LAB//51300//K/uL', 'LAB//51301//K/uL'])
display(train_data[mask].groupby('subject_id').describe()['numeric_value'].head())
count mean min 25% 50% 75% max std
subject_id
10000032 13.0 5.815384 3.0 4.800 5.60 6.900000 8.200000 1.618562
10001217 13.0 10.723078 5.4 8.400 9.70 12.600000 19.000000 3.827130
10002428 57.0 10.873684 3.8 7.100 9.10 11.300000 33.400002 6.166604
10002495 10.0 21.619999 11.0 11.575 23.05 29.099999 36.799999 9.659974
10002930 43.0 3.048837 1.9 2.400 2.90 3.300000 5.000000 0.846666

Since creating a new column for each possible timestamp (not to mention for each possible code) would be infeasible, how do we appropriately aggregate these measurements into a single column? This is where domain expertise and close collaboration with a clinician are extremely useful -- just as when selecting meaningful prediction tasks and deciding on a patient cohort.

However, as the focus of this notebook is the practical aspect of actually tabularizing selected features and building models, let's just look at a few examples. For the length of stay task, some features we could reasonably use include:

  • Demographic and admission information, including birth date (to calculate age at admission), gender, ICU admission type (subtypes of ICU admission by which the patients are filtered into our cohort)
  • Vital signs (min/max/mean over the first 24h), such as heart rate, blood pressure, temperature
  • Lab values (min/max/mean over the first 24h), e.g. white blood cell (leukocyte) count

Let's find the codes for those features in the dataset.

search_terms = [
  # Demographics and admission info
  'birth', 'gender', 'icu_admission',
  # Vital signs
  'heart rate', 'blood pressure', 'temperature',
  # Lab values
  r'leukocytes \[\#/volume\]',
]

code_list = []

for s in search_terms:
  print(f"Codes for \"{s}\":\n")
  search_metadata_df = metadata_df[(metadata_df['code'].str.contains(s, case=False)
    | metadata_df['description'].str.contains(s, case=False))]
  if not search_metadata_df.empty:
    for index, row in search_metadata_df.iterrows():
      print(row['code'], '\n\t', row['description'])
      code_list.append(row['code'])
    print()

  search_d_items_df = mimic_icu_d_items[(mimic_icu_d_items['label'].str.contains(s, case=False))]
  if not search_d_items_df.empty:
    # 'itemid' in the D_ITEMS dataframe contains an ID that is reused when
    # constructing the codes in the MEDS dataset,
    # so we search for this ID in training dataframe:
    for i, l in zip(search_d_items_df['itemid'], search_d_items_df['label']):
      search_itemid_in_train_df = train_data[
        (train_data['code'].str.contains(str(i)))]
      if not search_itemid_in_train_df.empty:
        code = search_itemid_in_train_df['code'].unique()[0]
        print(code, '\n\t', l)
        code_list.append(code)

  search_train_df = train_data[(train_data['code'].str.contains(s, case=False))]
  if not search_train_df.empty:
    for c in search_train_df['code'].unique():
      code_list.append(c)
      print(c)

  print("\n")
Codes for "birth":

MEDS_BIRTH


Codes for "gender":

GENDER//F
GENDER//M


Codes for "icu_admission":

ICU_ADMISSION//Medical Intensive Care Unit (MICU)
ICU_ADMISSION//Surgical Intensive Care Unit (SICU)
ICU_ADMISSION//Coronary Care Unit (CCU)
ICU_ADMISSION//Medical/Surgical Intensive Care Unit (MICU/SICU)
ICU_ADMISSION//Trauma SICU (TSICU)
ICU_ADMISSION//Cardiac Vascular Intensive Care Unit (CVICU)
ICU_ADMISSION//Neuro Surgical Intensive Care Unit (Neuro SICU)
ICU_ADMISSION//Neuro Stepdown
ICU_ADMISSION//Neuro Intermediate


Codes for "heart rate":

LAB//220046//bpm 
	 Heart rate Alarm - High
LAB//220047//bpm 
	 Heart Rate Alarm - Low
LAB//220045//bpm 
	 Heart Rate


Codes for "blood pressure":

LAB//227539//UNK 
	 ART Blood Pressure Alarm Source
LAB//227537//mmHg 
	 ART Blood Pressure Alarm - High
LAB//220058//mmHg 
	 Arterial Blood Pressure Alarm - High
LAB//227538//mmHg 
	 ART Blood Pressure Alarm - Low
LAB//223751//mmHg 
	 Non-Invasive Blood Pressure Alarm - High
LAB//223752//mmHg 
	 Non-Invasive Blood Pressure Alarm - Low
LAB//220056//mmHg 
	 Arterial Blood Pressure Alarm - Low
LAB//220179//mmHg 
	 Non Invasive Blood Pressure systolic
LAB//224643//mmHg 
	 Manual Blood Pressure Diastolic Left
LAB//220181//mmHg 
	 Non Invasive Blood Pressure mean
LAB//220052//mmHg 
	 Arterial Blood Pressure mean
LAB//227243//mmHg 
	 Manual Blood Pressure Systolic Right
LAB//220180//mmHg 
	 Non Invasive Blood Pressure diastolic
LAB//224167//mmHg 
	 Manual Blood Pressure Systolic Left
LAB//220050//mmHg 
	 Arterial Blood Pressure systolic
LAB//227242//mmHg 
	 Manual Blood Pressure Diastolic Right
LAB//220051//mmHg 
	 Arterial Blood Pressure diastolic
Blood Pressure
Blood Pressure Standing (1 min)
Blood Pressure Sitting
Blood Pressure Lying
Blood Pressure Standing (3 mins)


Codes for "temperature":

LAB//50825//UNK 
	 Body temperature
DIAGNOSIS//ICD//9//78065 
	 Hypothermia not associated with low environmental temperature
DIAGNOSIS//ICD//10//R680 
	 Hypothermia, not associated with low environmental temperature

LAB//224027//UNK 
	 Skin Temperature
LAB//224642//UNK 
	 Temperature Site
LAB//226329//°C 
	 Blood Temperature CCO (C)
LAB//223762//°C 
	 Temperature Celsius
LAB//223761//°F 
	 Temperature Fahrenheit


Codes for "leukocytes \[\#/volume\]":

LAB//51301//K/uL 
	 Leukocytes [#/volume] in Blood by Automated count
LAB//51300//K/uL 
	 Leukocytes [#/volume] in Blood by Automated count



Narrowing this list down based on which codes actually occur in our training data, we could end up with a selection of features like this:

birth_codes = ["MEDS_BIRTH"]

# Categorical codes
gender_codes = ["GENDER//F", "GENDER//M"]

icu_admission_codes = ['ICU_ADMISSION//Medical Intensive Care Unit (MICU)',
 'ICU_ADMISSION//Surgical Intensive Care Unit (SICU)',
 'ICU_ADMISSION//Coronary Care Unit (CCU)',
 'ICU_ADMISSION//Medical/Surgical Intensive Care Unit (MICU/SICU)',
 'ICU_ADMISSION//Trauma SICU (TSICU)',
 'ICU_ADMISSION//Cardiac Vascular Intensive Care Unit (CVICU)',
 'ICU_ADMISSION//Neuro Surgical Intensive Care Unit (Neuro SICU)',
 'ICU_ADMISSION//Neuro Stepdown',
 'ICU_ADMISSION//Neuro Intermediate',]

# Numeric codes
numeric_codes = ['LAB//223761//°F',
                 'LAB//220179//mmHg', 'LAB//220180//mmHg',
                 'LAB//220050//mmHg', 'LAB//220051//mmHg',
                 'LAB//51301//K/uL']

Filtering the cohorts

Remember from our previous discussion on tabularization strategy that a "feature" in a tabular dataset consists of three components:

  1. aggregation function for a
  2. filtered set of measurements, occurring in
  3. some window prior to the prediction time.

We have now selected the set of measurements that we will be filtering (component 2): birth_codes, gender_codes, icu_admission_codes and numeric_codes.

For the aggregation function (component 1), some features -- age, gender, and ICU admission type -- do not need one: they either have a unique value for a given patient or, in the case of ICU admission type, are constrained by our prediction task, since each ICU admission is treated as an independent training sample. For numerical features that can have many measurements, the data can be summarized in several ways; for simplicity we pick the min(), max() and mean() aggregation functions.

The choice of aggregation window (component 3) again depends on the feature. For age and gender, we use information available since the patient's birth, and we could also have built other features spanning the subject's whole history up to the prediction time (e.g. number of previous admissions, comorbidities, etc.). For simplicity, though, all of our numerical features will be aggregated over a 24-hour window ending at the prediction time -- roughly from the ICU admission up to the prediction time.
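For instance, a minimal, purely illustrative sketch of one such feature -- the leukocyte count aggregated over the 24 hours before sample_subject's first prediction time -- could look like:

# Aggregate one numeric code over the 24-hour window ending at the first
# prediction time for sample_subject (all objects were defined earlier in this notebook).
t = train_labels[train_labels["subject_id"] == sample_subject].iloc[0]["prediction_time"]
window_rows = train_data[
    (train_data["subject_id"] == sample_subject)
    & (train_data["code"] == "LAB//51301//K/uL")
    & (train_data["time"] >= t - pd.Timedelta(hours=24))
    & (train_data["time"] <= t)
]
print(window_rows["numeric_value"].agg(["min", "max", "mean"]))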

An example implementation of this tabularization strategy is shown below. It works for our small demo dataset, but at larger scales, efficiency and other practical considerations become more important.

import numpy as np
def tabularize_data(data, labels):
    featurized_data = []

    for index, row in labels.iterrows():
        features = {}

        subject_id = row['subject_id']
        window_end = row['prediction_time']
        window_start = window_end - pd.Timedelta(days=1)

        features['subject_id'] = subject_id
        features['prediction_time'] = window_end

        # Time-windowed data (within prediction window)
        window_events = data[(data['subject_id'] == subject_id)
                                    & ((data['time'] >= window_start) & (data['time'] <= window_end))]

        # Age calculation
        subject_birth = data[(data['subject_id'] == subject_id) & (data['code'].isin(birth_codes))]
        if len(subject_birth) > 0:
            birth_time = subject_birth['time'].iloc[0]
            if pd.notna(birth_time):
                age = (window_end - birth_time).days / 365.25  # Convert to years
                features['age'] = age
            else:
                features['age'] = np.nan
        else:
            features['age'] = np.nan

        # Gender (categorical)
        subject_gender = data[(data['subject_id'] == subject_id) & (data['code'].isin(gender_codes))]
        if len(subject_gender) > 0:
            gender_code = subject_gender['code'].iloc[0]
            gender = gender_code.split('//')[-1] if '//' in gender_code else gender_code
            features['gender'] = gender
        else:
            features['gender'] = 'Unknown'

        # ICU admission type (categorical)
        icu_admission_type_events = window_events[window_events['code'].isin(icu_admission_codes)]
        if len(icu_admission_type_events) > 0:
            icu_code = icu_admission_type_events['code'].iloc[0]
            icu_type = icu_code.split('//')[-1] if '//' in icu_code else icu_code
            features['icu_admission_type'] = icu_type
        else:
            features['icu_admission_type'] = 'Unknown'

        # Numeric codes
        for code in numeric_codes:
            code_events = window_events[window_events['code'] == code]
            # Record min/max/mean values
            if not code_events.empty:
                features[f'{code}_min'] = code_events['numeric_value'].min()
                features[f'{code}_max'] = code_events['numeric_value'].max()
                features[f'{code}_mean'] = code_events['numeric_value'].mean()
            else:
                features[f'{code}_min'] = np.nan
                features[f'{code}_max'] = np.nan
                features[f'{code}_mean'] = np.nan

        featurized_data.append(features)

    featurized_data = pd.DataFrame(featurized_data)
    featurized_data['gender'] = featurized_data['gender'].astype('category')
    featurized_data['icu_admission_type'] = featurized_data['icu_admission_type'].astype('category')

    return featurized_data

In the end, we obtain a featurized tabular dataset like this:

tabularized_train = tabularize_data(train_data, train_labels)
tabularized_train.head()
subject_id prediction_time age gender icu_admission_type LAB//223761//°F_min LAB//223761//°F_max LAB//223761//°F_mean LAB//220179//mmHg_min LAB//220179//mmHg_max ... LAB//220180//mmHg_mean LAB//220050//mmHg_min LAB//220050//mmHg_max LAB//220050//mmHg_mean LAB//220051//mmHg_min LAB//220051//mmHg_max LAB//220051//mmHg_mean LAB//51301//K/uL_min LAB//51301//K/uL_max LAB//51301//K/uL_mean
0 10002428 2156-04-13 16:24:18 81.278576 F Medical Intensive Care Unit (MICU) 98.599998 102.900002 100.299995 73.0 122.0 ... 51.918365 NaN NaN NaN NaN NaN NaN 22.4 27.900000 25.150000
1 10002428 2156-04-20 18:11:19 81.297741 F Medical Intensive Care Unit (MICU) 98.500000 101.099998 99.416664 48.0 163.0 ... 61.074074 73.0 146.0 106.207550 46.0 83.0 63.075470 10.5 13.200000 11.850000
2 10002428 2156-05-01 21:53:00 81.327858 F Medical Intensive Care Unit (MICU) 97.500000 99.599998 98.271423 77.0 145.0 ... 49.416668 NaN NaN NaN NaN NaN NaN 7.1 8.800000 7.950000
3 10002428 2156-05-12 14:49:34 81.357974 F Surgical Intensive Care Unit (SICU) 97.199997 102.099998 99.471436 NaN NaN ... NaN 88.0 148.0 110.074074 35.0 73.0 53.444443 11.0 11.000000 11.000000
4 10002495 2141-05-23 20:18:01 81.388090 M Coronary Care Unit (CCU) 98.000000 98.699997 98.349998 71.0 160.0 ... 56.648647 NaN NaN NaN NaN NaN NaN 28.5 36.799999 31.533333

5 rows × 23 columns

Again, this is just an example and by no means a complete set of features; there is much more useful data we could have selected and processed in more sophisticated ways -- ideally in collaboration with a domain expert.

However, with this basic tabularized dataset we are now ready to train our tabular model.

Training the Model

For our binary prediction task, let's first implement a basic logistic regression model.

For scikit-learn's implementation of logistic regression, we need a few additional steps to prepare the features. First, we handle missing values -- which logistic regression cannot consume directly -- by imputing the most frequent category for categorical features and the median for numerical ones (other strategies are of course possible). In addition, we encode the categorical features with one-hot encoding and scale the numerical features for stability.

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


categorical_features = ['gender', 'icu_admission_type']
numerical_features = [col for col in tabularized_train.columns if col not in ['subject_id', 'prediction_time'] + categorical_features]

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Now let's construct the model pipeline, consisting of the preprocessing stage and the logistic regression stage:

from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

We train the model simply by calling the model's fit function:

X_train = tabularized_train.drop(columns=['subject_id', 'prediction_time'])
y_train = train_labels['boolean_value']

model.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'LAB//223761//°F_min',
                                                   'LAB//223761//°F_max',
                                                   'LAB//223761//°F_mean',
                                                   'LAB//220179//mmHg_min',
                                                   'LAB//220179//mmHg_max',
                                                   'LAB//220179//mmHg_mean',
                                                   'LAB//220180//mmHg_min',
                                                   'LAB//220180//mmH...
                                                   'LAB//220051//mmHg_mean',
                                                   'LAB//51301//K/uL_min',
                                                   'LAB//51301//K/uL_max',
                                                   'LAB//51301//K/uL_mean']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(drop='first',
                                                                                 handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['gender',
                                                   'icu_admission_type'])])),
                ('classifier',
                 LogisticRegression(max_iter=1000, random_state=42))])

Evaluating the Model

It is simple to evaluate scikit-learn models by using the model's predict function on our (tabularized) tuning dataset.

tuning_data = pd.read_parquet(data_root / "tuning")
tuning_labels = pd.read_parquet(labels_root / "tuning")

X_tuning = tabularize_data(tuning_data, tuning_labels).drop(columns=['subject_id', 'prediction_time'])
y_tuning = tuning_labels['boolean_value']

y_pred = model.predict(X_tuning)

We can evaluate the model's performance using scikit-learn's built-in metrics -- most straightforwardly the classification accuracy, but we can also use other metrics like the Area Under the Receiver Operating Characteristic curve (AUROC).

from sklearn.metrics import accuracy_score, roc_auc_score

accuracy = accuracy_score(y_tuning, y_pred)
print(f"Accuracy: {accuracy}")

y_pred_proba = model.predict_proba(X_tuning)[:, 1]

roc_auc = roc_auc_score(y_tuning, y_pred_proba)
print(f"AUROC: {roc_auc}")
Accuracy: 0.7
AUROC: 0.5714285714285714

Alternatively, for binary prediction tasks, the MEDS ecosystem supports evaluation pipelines like MEDS-Evaluation, which can automatically provide a collection of metrics for prediction data formatted as the prediction schema (which is very similar to the MEDS label schema, but has two additional columns for predicted_boolean_value and predicted_boolean_probability):

y_pred_meds_eval = tuning_labels.copy()
y_pred_meds_eval['predicted_boolean_value'] = y_pred
y_pred_meds_eval['predicted_boolean_probability'] = y_pred_proba
display(y_pred_meds_eval.head())

y_pred_meds_eval.to_parquet("predictions.parquet")
subject_id prediction_time boolean_value integer_value float_value categorical_value predicted_boolean_value predicted_boolean_probability
0 10001725 2110-04-12 15:52:22 True NaN NaN None True 0.515569
1 10016810 2185-06-17 02:16:00 True NaN NaN None True 0.583523
2 10019172 2118-11-17 09:38:22 False NaN NaN None True 0.722865
3 10019777 2187-02-11 20:34:00 False NaN NaN None False 0.297040
4 10020740 2150-03-12 15:34:56 False NaN NaN None False 0.189860
%%bash
pip install -q meds-evaluation

mkdir -p evaluation_output
meds-evaluation-cli predictions_path='predictions.parquet' output_file='evaluation_output/evaluation_results.json'
evaluation_results = pd.read_json("evaluation_output/evaluation_results.json")
evaluation_results
samples_equally_weighted subjects_equally_weighted
binary_accuracy 0.700000 0.678571
f1_score 0.571429 0.640000
roc_auc_score 0.571429 0.583333
average_precision_score 0.402778 0.535095
calibration_error 0.391543 0.456051
brier_score 0.241134 0.263120

We have now built a very basic tabular model using a MEDS dataset! We could build upon it using a wider and better selection of features, and potentially more sophisticated preprocessing pipelines or more advanced tabular models (such as XGBoost).
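For example, a minimal sketch of swapping in an XGBoost classifier on the same tabularized features is shown below. This assumes the xgboost package is installed (it is not used elsewhere in this tutorial), and the hyperparameters are arbitrary placeholders rather than tuned values.

# XGBoost handles missing values natively and, with enable_categorical=True and
# the "hist" tree method, can consume the 'category' dtype columns directly,
# so we can skip the imputation/encoding/scaling pipeline used above.
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    tree_method="hist",
    enable_categorical=True,
    eval_metric="auc",
)
xgb_model.fit(X_train, y_train)

xgb_auroc = roc_auc_score(y_tuning, xgb_model.predict_proba(X_tuning)[:, 1])
print(f"XGBoost AUROC: {xgb_auroc}")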

Using MEDS-Tab Instead

You may have noticed that our feature selection process for tabular models has been quite involved, as it required a deep understanding of the dataset structure and case-by-case selection, filtering and preprocessing of each of the features.

Given how important tabular baselines are, this adds up to a lot of repeated work for every dataset and task we may want to model, and can introduce reproducibility challenges when feature selection is inconsistent across applications.

To help address these challenges, MEDS ecosystem contributors have developed MEDS-Tab, a package for automatically tabularizing MEDS data and training decision tree models with hyperparameter tuning.


In this case, the whole model-building process could take just a few commands, such as

meds-tab-tabularize-time-series \
    --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=joblib \
    "input_dir=${MEDS_RESHARD_DIR}/data" \
    "output_dir=$OUTPUT_DIR" \
    tabularization.min_code_inclusion_count=10 \
    tabularization.window_sizes=[1d,30d,365d,full] \
    tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]

Feel free to check out the GitHub repository, the usage guide or the preprint for more information.