Build a (Tabular Baseline) Model over MEDS data
In this tutorial, we'll use the dataset we extracted previously and build a model to predict the task we defined there. See below for the Jupyter notebook tutorial, rendered here, or check it out online on Google Colab or on our GitHub Repository
Building a Model
In this part of the tutorial, we'll walk through the process of building a model with MEDS data.
We will continue exactly where we've left off in the previous part of the tutorial. Our task is to predict a short ICU stay (of < 3 days) given the patient's input data. We have already transformed the dataset into the MEDS format and extracted a set of label files for the dataset.
Let's download the files we produced in the prior sections from the tutorial resources:
%%bash
wget -q -c https://github.com/Medical-Event-Data-Standard/MEDS_KDD_2025_Tutorial/raw/refs/heads/main/MEDS_data.zip
unzip -q -o MEDS_data.zip
apt-get -qq install tree
tree MEDS_data
MEDS_data
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   └── 0.parquet
│   └── tuning
│       └── 0.parquet
├── labels
│   └── short_LOS
│       ├── held_out
│       │   └── 0.parquet
│       ├── train
│       │   └── 0.parquet
│       └── tuning
│           └── 0.parquet
└── metadata
    ├── codes.parquet
    ├── dataset.json
    └── subject_splits.parquet

10 directories, 9 files
We can see that we have our core "data" files in MEDS_data/data, our metadata files in MEDS_data/metadata, and our task labels in MEDS_data/labels/short_LOS. We'll store these paths in variables for ease of use later.
from pathlib import Path
data_root = Path("MEDS_data/data")
labels_root = Path("MEDS_data/labels/short_LOS")
metadata_root = Path("MEDS_data/metadata")
To help us get started, let's look at some of our core data files and our label files:
import pandas as pd
train_data = pd.read_parquet(data_root / "train")
# Remember that MEDS lets you store arbitrary other columns beyond those in the
# official schema -- we'll simplify things here by just keeping the main 5:
train_data = train_data[
["subject_id", "time", "code", "numeric_value", "text_value"]
]
display(train_data.head())
train_labels = pd.read_parquet(labels_root / "train")
# Similarly, we'll only keep the relevant boolean label in our labels
train_labels = train_labels[["subject_id", "prediction_time", "boolean_value"]]
display(train_labels.head())
 | subject_id | time | code | numeric_value | text_value
---|---|---|---|---|---|
0 | 10000032 | NaT | GENDER//F | NaN | None |
1 | 10000032 | 2128-01-01 00:00:00 | MEDS_BIRTH | NaN | None |
2 | 10000032 | 2180-03-23 11:51:00 | LAB//51087//UNK | NaN | None |
3 | 10000032 | 2180-03-23 11:51:00 | LAB//50940//UNK | NaN | None |
4 | 10000032 | 2180-03-23 11:51:00 | LAB//51079//UNK | NaN | NEG |
 | subject_id | prediction_time | boolean_value
---|---|---|---|
0 | 10002428 | 2156-04-13 16:24:18 | False |
1 | 10002428 | 2156-04-20 18:11:19 | False |
2 | 10002428 | 2156-05-01 21:53:00 | True |
3 | 10002428 | 2156-05-12 14:49:34 | False |
4 | 10002495 | 2141-05-23 20:18:01 | False |
Now that we have our files again and know what our data looks like, we will learn how to use MEDS datasets to train a couple of different kinds of models.
First, it is a good practice to start from a baseline model over some tabularized features representing the patient's data. We will start by extracting these features manually, which is viable on our demo dataset given its small size. To show you how to scale this up to a much larger dataset, we will also show how you could use the MEDS-Tab tool to do this.
Once that is done, we will show how you could build a simple neural network model for this task, highlighting a few other tools in the process.
Let's dive in!
A Tabular Baseline
MEDS Format vs Tabularized Data
The first step in training a tabular model on a MEDS dataset is to prepare the data, as there are some important differences between MEDS representation and the usual "tabular" representation:
- MEDS has the "one row = one event" structure. Every patient will usually have many events (such as birth, death, hospital admission, lab results, diagnostic codes, etc.) associated with them, and each of these events will be represented in their own row.
- Tabular data is typically treated as having a "one row = one training sample" structure, which would correspond to each patient -- along with their whole history of measurements -- being represented as a single row in the dataset.
For example, we can see that the first subject in our training label file has quite a few measurements (rows) associated with them in the MEDS training data.
sample_subject = train_labels.iloc[0]["subject_id"]
display(train_data[train_data['subject_id'] == sample_subject])
 | subject_id | time | code | numeric_value | text_value
---|---|---|---|---|---|
4860 | 10002428 | NaT | GENDER//F | NaN | None |
4861 | 10002428 | 2075-01-01 00:00:00 | MEDS_BIRTH | NaN | None |
4862 | 10002428 | 2154-08-12 00:00:00 | Weight (Lbs) | NaN | 98 |
4863 | 10002428 | 2154-08-12 00:00:00 | BMI (kg/m2) | NaN | 19.6 |
4864 | 10002428 | 2154-08-12 00:00:00 | Blood Pressure | NaN | 158/98 |
... | ... | ... | ... | ... | ... |
41436 | 10002428 | 2160-07-16 18:49:00 | DIAGNOSIS//ICD//10//S4992XA | NaN | None |
41437 | 10002428 | 2160-07-16 18:49:00 | DIAGNOSIS//ICD//10//Z9181 | NaN | None |
41438 | 10002428 | 2160-07-16 18:49:00 | DIAGNOSIS//ICD//10//M3500 | NaN | None |
41439 | 10002428 | 2160-07-16 18:49:00 | TRANSFER_TO//admit//Emergency Department Obser... | NaN | None |
41440 | 10002428 | 2160-07-16 19:15:04 | TRANSFER_TO//discharge//UNK | NaN | None |
36581 rows × 5 columns
If we want to train a simple baseline model (e.g., logistic regression, XGBoost, etc.), all these measurements will eventually have to be summarized into a single, fixed-size representation, and the overall structure of our tabularized dataset will look more like this:
subject_id | feature_1 | feature_2 | ... |
---|---|---|---|
10000032 | ? | ? | ... |
10001217 | ? | ? | ... |
... | ... | ... | ... |
Note there's one more complexity here -- we don't want all of the subject's measurements included in that fixed-size tabular input. Instead, we just want those that are "permitted inputs" for a given task sample. Let's look at our label dataframe again:
train_labels[train_labels["subject_id"] == sample_subject]
 | subject_id | prediction_time | boolean_value
---|---|---|---|
0 | 10002428 | 2156-04-13 16:24:18 | False |
1 | 10002428 | 2156-04-20 18:11:19 | False |
2 | 10002428 | 2156-05-01 21:53:00 | True |
3 | 10002428 | 2156-05-12 14:49:34 | False |
We can see that there are actually multiple "task samples" for this subject, each with a different prediction_time! Moreover, naively using all available subject data for each of these samples could lead to data leakage -- that is, "future" patient data being available for an earlier prediction time.
What this means is that we want to find a way to summarize all of the patient's data up to and including each of these time points separately, and use the appropriate summary to predict each sample's label -- adding another dimension to our "tabularization process".
Tabularization Strategy
Clearly, tabularization has to follow a strategy along these lines: for each prediction time of interest, we pick a set of "features" we want to extract from the patient's data up to that time, and write code to extract them. Those features then become the columns of our tabular representation.
But with the MEDS format, we can be a bit more specific. In particular, we can ask "what is a feature when derived from MEDS data?" Given the structure of MEDS, an easy template to use is the following:
Given a patient's MEDS data and a prediction time, a "feature" is an aggregation function applied to a filtered set of measurements that occur in some relative or absolute window prior to the prediction time.
With this in mind, to define a feature, we just need to decide (1) what's our aggregation function, (2) what filtered set of measurements should we use, and (3) what is our window prior to the prediction time.
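For example, here is a minimal sketch of a single feature following this template, using the train_data and train_labels dataframes loaded above. The specific code string is only an illustrative choice; the features we actually use are selected later in this notebook.

example_code = "LAB//51301//K/uL"  # (2) the filtered set of measurements (illustrative choice)

label_row = train_labels.iloc[0]
window_end = label_row["prediction_time"]
window_start = window_end - pd.Timedelta(hours=24)  # (3) the window prior to the prediction time

in_window = train_data[
    (train_data["subject_id"] == label_row["subject_id"])
    & (train_data["code"] == example_code)
    & (train_data["time"] > window_start)
    & (train_data["time"] <= window_end)
]
feature_value = in_window["numeric_value"].mean()  # (1) the aggregation function
print(f"mean {example_code} in the 24h before the prediction time: {feature_value}")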
This is still a broad definition, but it lets us operationalize things very cleanly. Let's see this idea in action by constructing some features on the MEDS data we have.
Constructing the Features
So how do we featurize a MEDS dataset into a tabular form?
We could start by exploring all the event codes appearing in a MEDS dataset. To help us do that, here's a simple helper that will let us search through the columns of a dataframe interactively.
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from functools import partial
def df_search(df: pd.DataFrame):
    # Create a Text widget for the search input
    search_input = widgets.Text(
        value='',
        placeholder='Enter search term...',
        description='Search:',
        disabled=False
    )

    # Function to filter the DataFrame based on the search term
    def filter_dataframe(df: pd.DataFrame, search_term: str):
        if not search_term:
            display(df)
        else:
            filtered_df = df[df.apply(lambda row: row.astype(str).str.contains(search_term, case=False).any(), axis=1)]
            display(filtered_df)

    # Create an interactive output to display the filtered DataFrame
    output = widgets.interactive_output(
        partial(filter_dataframe, df=df), {'search_term': search_input}
    )

    # Display the search input and the output
    display(search_input, output)
Let's start by exploring the codes.parquet dataframe within a MEDS dataset's metadata directory:
metadata_df = pd.read_parquet(metadata_root / "codes.parquet")
df_search(metadata_df)
We can also search through the code strings or text values associated with them in the training data directly:
df_search(train_data[["code", "text_value"]].drop_duplicates())
For MIMIC-IV data specifically, we can also use another resource from the raw data, which may help interpret some of the potentially uninformative codes in the training data (e.g. try searching for 220046 in metadata_df and train_data, as well as in the table below).
!wget -q -c https://physionet.org/files/mimic-iv-demo/1.0/icu/d_items.csv.gz
mimic_icu_d_items = pd.read_csv("d_items.csv.gz")
df_search(mimic_icu_d_items)
However, using every code appearing in the training dataset as a feature leads to two challenges (both illustrated below):
- There might be a lot of missing data, since most of these codes are rare and would only appear for a few patients (especially when we narrow down our cohort as part of the task and label extraction process),
- A single code can also have many measurements over time for the same patient.
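To get a feel for the first challenge, here is a quick sketch (over the train_data dataframe loaded above) counting how many training subjects each code appears for; the exact numbers will depend on the demo data, but most codes occur for only a handful of patients, so one column per code would be mostly missing values.

n_subjects = train_data["subject_id"].nunique()
# For each code, count the number of distinct subjects it is observed for
subjects_per_code = (
    train_data.groupby("code")["subject_id"].nunique().sort_values(ascending=False)
)
print(f"{len(subjects_per_code)} distinct codes across {n_subjects} training subjects")
print((subjects_per_code / n_subjects).describe())

The second challenge -- many repeated measurements of the same code for a single patient -- is what we look at next.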
For example, let's look for the codes measuring the white blood cell (leukocyte) count, as those could indicate an infection.
display(metadata_df[metadata_df['description'].str.contains('leukocyte', case=False, na=False)].head(30))
 | code | description | parent_codes | itemid | possibly_cpt_code | valueuom
---|---|---|---|---|---|---|
33 | LAB//51124//% | Plasma cells/100 leukocytes in Peritoneal flui... | [LOINC/40518-3] | [51124] | [None] | [%] |
85 | LAB//52272//% | Monocytes/100 leukocytes in Cerebral spinal fluid | [LOINC/26486-1] | [52272] | [None] | [%] |
127 | LAB//51118//% | Mesothelial cells/100 leukocytes in Peritoneal... | [LOINC/30432-9] | [51118] | [None] | [%] |
133 | LAB//51448//% | Mesothelial cells/100 leukocytes in Pleural fluid | [LOINC/30431-1] | [51448] | [None] | [%] |
146 | LAB//51263//% | Plasma cells/100 leukocytes in Blood | [LOINC/13047-6] | [51263] | [None] | [%] |
222 | LAB//51376//% | Macrophages/100 leukocytes in Synovial fluid | [LOINC/33376-5] | [51376] | [None] | [%] |
310 | LAB//51244//% | Lymphocytes/100 leukocytes in Blood by Automat... | [LOINC/736-9] | [51244] | [None] | [%] |
329 | LAB//51427//% | Lymphocytes/100 leukocytes in Body fluid | [LOINC/11031-2] | [51427] | [None] | [%] |
458 | LAB//51117//% | Macrophages/100 leukocytes in Peritoneal fluid... | [LOINC/40517-5] | [51117] | [None] | [%] |
492 | LAB//51112//% | Basophils/100 leukocytes in Peritoneal fluid | [LOINC/35069-4] | [51112] | [None] | [%] |
493 | LAB//51455//% | Polymorphonuclear cells/100 leukocytes in Pleu... | [LOINC/26519-9] | [51455] | [None] | [%] |
516 | LAB//51301//K/uL | Leukocytes [#/volume] in Blood by Automated count | [LOINC/6690-2] | [51301] | [None] | [K/uL] |
525 | LAB//51431//% | Monocytes/100 leukocytes in Body fluid | [LOINC/26487-9] | [51431] | [None] | [%] |
552 | LAB//52220//% | Leukocytes other/100 leukocytes in Blood | [LOINC/26471-3] | [52220] | [None] | [%] |
558 | LAB//51386//% | Band form neutrophils/100 leukocytes in Body f... | [LOINC/26510-8] | [51386] | [None] | [%] |
570 | LAB//51116//% | Lymphocytes/100 leukocytes in Peritoneal fluid | [LOINC/26482-0] | [51116] | [None] | [%] |
581 | LAB//51256//% | Neutrophils/100 leukocytes in Blood by Automat... | [LOINC/770-8] | [51256] | [None] | [%] |
639 | LAB//51444//% | Eosinophils/100 leukocytes in Pleural fluid | [LOINC/30379-2] | [51444] | [None] | [%] |
686 | LAB//51387//% | Basophils/100 leukocytes in Body fluid | [LOINC/28543-7] | [51387] | [None] | [%] |
760 | LAB//51518///hpf | Leukocyte clumps [Presence] in Urine sediment ... | [LOINC/67848-2] | [51518] | [None] | [/hpf] |
779 | LAB//51375//% | Lymphocytes/100 leukocytes in Synovial fluid | [LOINC/26483-8] | [51375] | [None] | [%] |
957 | LAB//52135//% | Immature granulocytes/100 leukocytes in Blood | [LOINC/38518-7] | [52135] | [None] | [%] |
1016 | LAB//51419//% | Eosinophils/100 leukocytes in Body fluid | [LOINC/26452-3] | [51419] | [None] | [%] |
1081 | LAB//51428//% | Macrophages/100 leukocytes in Body fluid | [LOINC/30427-9] | [51428] | [None] | [%] |
1094 | LAB//51259//% | Leukocytes other/100 leukocytes in Blood | [LOINC/26471-3] | [51259] | [None] | [%] |
1096 | LAB//51516//#/hpf | Leukocytes [#/area] in Urine sediment by Micro... | [LOINC/5821-4] | [51516] | [None] | [#/hpf] |
1114 | LAB//51254//% | Monocytes/100 leukocytes in Blood by Automated... | [LOINC/5905-5] | [51254] | [None] | [%] |
1174 | LAB//51300//K/uL | Leukocytes [#/volume] in Blood by Automated count | [LOINC/6690-2] | [51300] | [None] | [K/uL] |
1181 | LAB//51120//% | Monocytes/100 leukocytes in Peritoneal fluid | [LOINC/26488-7] | [51120] | [None] | [%] |
1221 | LAB//51429//% | Mesothelial cells/100 leukocytes in Body fluid | [LOINC/28544-5] | [51429] | [None] | [%] |
There seem to be two codes corresponding to the leukocyte count: LAB//51300//K/uL and LAB//51301//K/uL.
If we group the dataset by these codes, we see that patients can have many measurements of the same vital sign or biomarker.
mask = train_data['code'].isin(['LAB//51300//K/uL', 'LAB//51301//K/uL'])
display(train_data[mask].groupby('subject_id').describe()['numeric_value'].head())
subject_id | count | mean | min | 25% | 50% | 75% | max | std
---|---|---|---|---|---|---|---|---
10000032 | 13.0 | 5.815384 | 3.0 | 4.800 | 5.60 | 6.900000 | 8.200000 | 1.618562 |
10001217 | 13.0 | 10.723078 | 5.4 | 8.400 | 9.70 | 12.600000 | 19.000000 | 3.827130 |
10002428 | 57.0 | 10.873684 | 3.8 | 7.100 | 9.10 | 11.300000 | 33.400002 | 6.166604 |
10002495 | 10.0 | 21.619999 | 11.0 | 11.575 | 23.05 | 29.099999 | 36.799999 | 9.659974 |
10002930 | 43.0 | 3.048837 | 1.9 | 2.400 | 2.90 | 3.300000 | 5.000000 | 0.846666 |
Since creating a new column for each possible timestamp (not to mention for each possible code) would be infeasible, how do we appropriately aggregate these measurements into a single column? This is where domain expertise and close collaboration with clinicians are extremely useful -- just like when selecting meaningful prediction tasks and deciding on a patient cohort.
However, as the focus of this notebook is the practical aspect of actually tabularizing selected features and building models, let's just look at a few examples. For the length of stay task, some features we could reasonably use include:
- Demographic and admission information, including birth date (to calculate age at admission), gender, ICU admission type (subtypes of ICU admission by which the patients are filtered into our cohort)
- Vital signs (min/max/mean over the first 24h), such as heart rate, blood pressure, temperature
- Lab values (min/max/mean over the first 24h), e.g. white blood cell (leukocyte) count
Let's find the codes for those features in the dataset.
search_terms = [
# Demographics and admission info
'birth', 'gender', 'icu_admission',
# Vital signs
'heart rate', 'blood pressure', 'temperature',
# Lab values
r'leukocytes \[\#/volume\]',
]
code_list = []
for s in search_terms:
    print(f"Codes for \"{s}\":\n")

    search_metadata_df = metadata_df[(metadata_df['code'].str.contains(s, case=False)
                                      | metadata_df['description'].str.contains(s, case=False))]
    if not search_metadata_df.empty:
        for index, row in search_metadata_df.iterrows():
            print(row['code'], '\n\t', row['description'])
            code_list.append(row['code'])
        print()

    search_d_items_df = mimic_icu_d_items[(mimic_icu_d_items['label'].str.contains(s, case=False))]
    if not search_d_items_df.empty:
        # 'itemid' in the D_ITEMS dataframe contains an ID that is reused when
        # constructing the codes in the MEDS dataset,
        # so we search for this ID in the training dataframe:
        for i, l in zip(search_d_items_df['itemid'], search_d_items_df['label']):
            search_itemid_in_train_df = train_data[
                (train_data['code'].str.contains(str(i)))]
            if not search_itemid_in_train_df.empty:
                code = search_itemid_in_train_df['code'].unique()[0]
                print(code, '\n\t', l)
                code_list.append(code)

    search_train_df = train_data[(train_data['code'].str.contains(s, case=False))]
    if not search_train_df.empty:
        for c in search_train_df['code'].unique():
            code_list.append(c)
            print(c)

    print("\n")
Codes for "birth": MEDS_BIRTH Codes for "gender": GENDER//F GENDER//M Codes for "icu_admission": ICU_ADMISSION//Medical Intensive Care Unit (MICU) ICU_ADMISSION//Surgical Intensive Care Unit (SICU) ICU_ADMISSION//Coronary Care Unit (CCU) ICU_ADMISSION//Medical/Surgical Intensive Care Unit (MICU/SICU) ICU_ADMISSION//Trauma SICU (TSICU) ICU_ADMISSION//Cardiac Vascular Intensive Care Unit (CVICU) ICU_ADMISSION//Neuro Surgical Intensive Care Unit (Neuro SICU) ICU_ADMISSION//Neuro Stepdown ICU_ADMISSION//Neuro Intermediate Codes for "heart rate": LAB//220046//bpm Heart rate Alarm - High LAB//220047//bpm Heart Rate Alarm - Low LAB//220045//bpm Heart Rate Codes for "blood pressure": LAB//227539//UNK ART Blood Pressure Alarm Source LAB//227537//mmHg ART Blood Pressure Alarm - High LAB//220058//mmHg Arterial Blood Pressure Alarm - High LAB//227538//mmHg ART Blood Pressure Alarm - Low LAB//223751//mmHg Non-Invasive Blood Pressure Alarm - High LAB//223752//mmHg Non-Invasive Blood Pressure Alarm - Low LAB//220056//mmHg Arterial Blood Pressure Alarm - Low LAB//220179//mmHg Non Invasive Blood Pressure systolic LAB//224643//mmHg Manual Blood Pressure Diastolic Left LAB//220181//mmHg Non Invasive Blood Pressure mean LAB//220052//mmHg Arterial Blood Pressure mean LAB//227243//mmHg Manual Blood Pressure Systolic Right LAB//220180//mmHg Non Invasive Blood Pressure diastolic LAB//224167//mmHg Manual Blood Pressure Systolic Left LAB//220050//mmHg Arterial Blood Pressure systolic LAB//227242//mmHg Manual Blood Pressure Diastolic Right LAB//220051//mmHg Arterial Blood Pressure diastolic Blood Pressure Blood Pressure Standing (1 min) Blood Pressure Sitting Blood Pressure Lying Blood Pressure Standing (3 mins) Codes for "temperature": LAB//50825//UNK Body temperature DIAGNOSIS//ICD//9//78065 Hypothermia not associated with low environmental temperature DIAGNOSIS//ICD//10//R680 Hypothermia, not associated with low environmental temperature LAB//224027//UNK Skin Temperature LAB//224642//UNK Temperature Site LAB//226329//°C Blood Temperature CCO (C) LAB//223762//°C Temperature Celsius LAB//223761//°F Temperature Fahrenheit Codes for "leukocytes \[\#/volume\]": LAB//51301//K/uL Leukocytes [#/volume] in Blood by Automated count LAB//51300//K/uL Leukocytes [#/volume] in Blood by Automated count
Simplifying this list to account for feature occurrence in our training data, we could get a selection of features like this:
birth_codes = ["MEDS_BIRTH"]
# Categorical codes
gender_codes = ["GENDER//F", "GENDER//M"]
icu_admission_codes = ['ICU_ADMISSION//Medical Intensive Care Unit (MICU)',
'ICU_ADMISSION//Surgical Intensive Care Unit (SICU)',
'ICU_ADMISSION//Coronary Care Unit (CCU)',
'ICU_ADMISSION//Medical/Surgical Intensive Care Unit (MICU/SICU)',
'ICU_ADMISSION//Trauma SICU (TSICU)',
'ICU_ADMISSION//Cardiac Vascular Intensive Care Unit (CVICU)',
'ICU_ADMISSION//Neuro Surgical Intensive Care Unit (Neuro SICU)',
'ICU_ADMISSION//Neuro Stepdown',
'ICU_ADMISSION//Neuro Intermediate',]
# Numeric codes
numeric_codes = ['LAB//223761//°F',
'LAB//220179//mmHg', 'LAB//220180//mmHg',
'LAB//220050//mmHg', 'LAB//220051//mmHg',
'LAB//51301//K/uL']
Filtering the cohorts
Remember from our previous discussion on tabularization strategy that a "feature" in a tabular dataset consists of three components:
- aggregation function for a
- filtered set of measurements, occurring in
- some window prior to the prediction time.
We have now selected the set of measurements that we will be filtering (component 2): birth_codes, gender_codes, icu_admission_codes and numeric_codes.
For the choice of aggregation function (component 1), some features do not need one at all: age, gender, and ICU admission type either have a unique value for a particular patient or, in the case of ICU admission, are constrained by our prediction task, since each separate ICU admission is treated as an independent training sample.
In the case of numerical features that can potentially have many measurements, we can summarize the data in several ways, but for simplicity we will pick the min(), max() and mean() aggregation functions.
The decision on the right aggregation windows (component 3) again depends on the features. For age and gender, we use information available since the patient's birth, and we could also have used other features spanning the subject's whole history up to the prediction time (e.g. number of previous admissions, comorbidities, etc.). For simplicity, though, all of our numerical features will be aggregated over a window spanning the 24 hours from the ICU admission time up to the prediction time.
An example implementation of such a tabularization strategy is shown below. This works for our small demo dataset, but at larger scales other practical and efficiency considerations become more important.
import numpy as np
def tabularize_data(data, labels):
    featurized_data = []

    for index, row in labels.iterrows():
        features = {}

        subject_id = row['subject_id']
        window_end = row['prediction_time']
        window_start = window_end - pd.Timedelta(days=1)

        features['subject_id'] = subject_id
        features['prediction_time'] = window_end

        # Time-windowed data (within prediction window)
        window_events = data[(data['subject_id'] == subject_id)
                             & ((data['time'] >= window_start) & (data['time'] <= window_end))]

        # Age calculation
        subject_birth = data[(data['subject_id'] == subject_id) & (data['code'].isin(birth_codes))]
        if len(subject_birth) > 0:
            birth_time = subject_birth['time'].iloc[0]
            if pd.notna(birth_time):
                age = (window_end - birth_time).days / 365.25  # Convert to years
                features['age'] = age
            else:
                features['age'] = np.nan
        else:
            features['age'] = np.nan

        # Gender (categorical)
        subject_gender = data[(data['subject_id'] == subject_id) & (data['code'].isin(gender_codes))]
        if len(subject_gender) > 0:
            gender_code = subject_gender['code'].iloc[0]
            gender = gender_code.split('//')[-1] if '//' in gender_code else gender_code
            features['gender'] = gender
        else:
            features['gender'] = 'Unknown'

        # ICU admission type (categorical)
        icu_admission_type_events = window_events[window_events['code'].isin(icu_admission_codes)]
        if len(icu_admission_type_events) > 0:
            icu_code = icu_admission_type_events['code'].iloc[0]
            icu_type = icu_code.split('//')[-1] if '//' in icu_code else icu_code
            features['icu_admission_type'] = icu_type
        else:
            features['icu_admission_type'] = 'Unknown'

        # Numeric codes
        for code in numeric_codes:
            code_events = window_events[window_events['code'] == code]

            # Record min/max/mean values
            if not code_events.empty:
                features[f'{code}_min'] = code_events['numeric_value'].min()
                features[f'{code}_max'] = code_events['numeric_value'].max()
                features[f'{code}_mean'] = code_events['numeric_value'].mean()
            else:
                features[f'{code}_min'] = np.nan
                features[f'{code}_max'] = np.nan
                features[f'{code}_mean'] = np.nan

        featurized_data.append(features)

    featurized_data = pd.DataFrame(featurized_data)
    featurized_data['gender'] = featurized_data['gender'].astype('category')
    featurized_data['icu_admission_type'] = featurized_data['icu_admission_type'].astype('category')

    return featurized_data
In the end, we obtain a featurized tabular dataset like this:
tabularized_train = tabularize_data(train_data, train_labels)
tabularized_train.head()
 | subject_id | prediction_time | age | gender | icu_admission_type | LAB//223761//°F_min | LAB//223761//°F_max | LAB//223761//°F_mean | LAB//220179//mmHg_min | LAB//220179//mmHg_max | ... | LAB//220180//mmHg_mean | LAB//220050//mmHg_min | LAB//220050//mmHg_max | LAB//220050//mmHg_mean | LAB//220051//mmHg_min | LAB//220051//mmHg_max | LAB//220051//mmHg_mean | LAB//51301//K/uL_min | LAB//51301//K/uL_max | LAB//51301//K/uL_mean
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002428 | 2156-04-13 16:24:18 | 81.278576 | F | Medical Intensive Care Unit (MICU) | 98.599998 | 102.900002 | 100.299995 | 73.0 | 122.0 | ... | 51.918365 | NaN | NaN | NaN | NaN | NaN | NaN | 22.4 | 27.900000 | 25.150000 |
1 | 10002428 | 2156-04-20 18:11:19 | 81.297741 | F | Medical Intensive Care Unit (MICU) | 98.500000 | 101.099998 | 99.416664 | 48.0 | 163.0 | ... | 61.074074 | 73.0 | 146.0 | 106.207550 | 46.0 | 83.0 | 63.075470 | 10.5 | 13.200000 | 11.850000 |
2 | 10002428 | 2156-05-01 21:53:00 | 81.327858 | F | Medical Intensive Care Unit (MICU) | 97.500000 | 99.599998 | 98.271423 | 77.0 | 145.0 | ... | 49.416668 | NaN | NaN | NaN | NaN | NaN | NaN | 7.1 | 8.800000 | 7.950000 |
3 | 10002428 | 2156-05-12 14:49:34 | 81.357974 | F | Surgical Intensive Care Unit (SICU) | 97.199997 | 102.099998 | 99.471436 | NaN | NaN | ... | NaN | 88.0 | 148.0 | 110.074074 | 35.0 | 73.0 | 53.444443 | 11.0 | 11.000000 | 11.000000 |
4 | 10002495 | 2141-05-23 20:18:01 | 81.388090 | M | Coronary Care Unit (CCU) | 98.000000 | 98.699997 | 98.349998 | 71.0 | 160.0 | ... | 56.648647 | NaN | NaN | NaN | NaN | NaN | NaN | 28.5 | 36.799999 | 31.533333 |
5 rows × 23 columns
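Before training, it is worth a quick sanity check that the tabularized rows line up one-to-one with the label rows, since we will take the training targets directly from train_labels below. This holds because tabularize_data builds exactly one feature row per label, in label order; a minimal check:

# tabularize_data() emits one row per label, in the same order as `labels`,
# so features and targets should align one-to-one.
assert len(tabularized_train) == len(train_labels)
assert (tabularized_train["subject_id"].values == train_labels["subject_id"].values).all()
assert (tabularized_train["prediction_time"].values == train_labels["prediction_time"].values).all()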
Again, this is just an example and by no means a complete set of features; there is much more useful data that we could have selected and processed in more sophisticated ways -- ideally in collaboration with a domain expert.
However, with this basic tabularized dataset we are now ready to train our tabular model.
Training the Model
For our binary prediction task, let's first implement a basic logistic regression model.
For scikit-learn's implementation of logistic regression we need to take a few additional steps to prepare the features. First, we handle missing values by imputing either the most frequent category or the median value (other strategies are of course possible), since missing values are not supported by the logistic regression model. In addition, we encode the categorical features using one-hot encoding, and scale the numerical features for stability.
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
categorical_features = ['gender', 'icu_admission_type']
numerical_features = [col for col in tabularized_train.columns if col not in ['subject_id', 'prediction_time'] + categorical_features]
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
]
)
Now let's construct the model pipeline, consisting of the preprocessing stage and the logistic regression stage:
from sklearn.linear_model import LogisticRegression
model = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
We train the model simply by calling the model's fit function:
X_train = tabularized_train.drop(columns=['subject_id', 'prediction_time'])
y_train = train_labels['boolean_value']
model.fit(X_train, y_train)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), ['age', 'LAB//223761//°F_min', 'LAB//223761//°F_max', 'LAB//223761//°F_mean', 'LAB//220179//mmHg_min', 'LAB//220179//mmHg_max', 'LAB//220179//mmHg_mean', 'LAB//220180//mmHg_min', 'LAB//220180//mmH... 'LAB//220051//mmHg_mean', 'LAB//51301//K/uL_min', 'LAB//51301//K/uL_max', 'LAB//51301//K/uL_mean']), ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))]), ['gender', 'icu_admission_type'])])), ('classifier', LogisticRegression(max_iter=1000, random_state=42))])
Evaluating the Model
It is simple to evaluate scikit-learn models by using the model's predict
function on our (tabularized) tuning dataset.
tuning_data = pd.read_parquet(data_root / "tuning")
tuning_labels = pd.read_parquet(labels_root / "tuning")
X_tuning = tabularize_data(tuning_data, tuning_labels).drop(columns=['subject_id', 'prediction_time'])
y_tuning = tuning_labels['boolean_value']
y_pred = model.predict(X_tuning)
We can evaluate the model's performance using some of scikit-learn's built-in metrics -- most straightforwardly the classification accuracy, but we can also use other metrics like Area Under the Receiver Operating Characteristic curve (AUROC).
from sklearn.metrics import accuracy_score, roc_auc_score
accuracy = accuracy_score(y_tuning, y_pred)
print(f"Accuracy: {accuracy}")
y_pred_proba = model.predict_proba(X_tuning)[:, 1]
roc_auc = roc_auc_score(y_tuning, y_pred_proba)
print(f"AUROC: {roc_auc}")
Accuracy: 0.7
AUROC: 0.5714285714285714
Alternatively, for binary prediction tasks, the MEDS ecosystem supports evaluation pipelines like MEDS-Evaluation, which can automatically provide a collection of metrics for prediction data formatted according to the prediction schema (which is very similar to the MEDS label schema, but has two additional columns for predicted_boolean_value and predicted_boolean_probability):
y_pred_meds_eval = tuning_labels.copy()
y_pred_meds_eval['predicted_boolean_value'] = y_pred
y_pred_meds_eval['predicted_boolean_probability'] = y_pred_proba
display(y_pred_meds_eval.head())
y_pred_meds_eval.to_parquet("predictions.parquet")
 | subject_id | prediction_time | boolean_value | integer_value | float_value | categorical_value | predicted_boolean_value | predicted_boolean_probability
---|---|---|---|---|---|---|---|---|
0 | 10001725 | 2110-04-12 15:52:22 | True | NaN | NaN | None | True | 0.515569 |
1 | 10016810 | 2185-06-17 02:16:00 | True | NaN | NaN | None | True | 0.583523 |
2 | 10019172 | 2118-11-17 09:38:22 | False | NaN | NaN | None | True | 0.722865 |
3 | 10019777 | 2187-02-11 20:34:00 | False | NaN | NaN | None | False | 0.297040 |
4 | 10020740 | 2150-03-12 15:34:56 | False | NaN | NaN | None | False | 0.189860 |
%%bash
pip install -q meds-evaluation
mkdir -p evaluation_output
meds-evaluation-cli predictions_path='predictions.parquet' output_file='evaluation_output/evaluation_results.json'
evaluation_results = pd.read_json("evaluation_output/evaluation_results.json")
evaluation_results
 | samples_equally_weighted | subjects_equally_weighted
---|---|---|
binary_accuracy | 0.700000 | 0.678571 |
f1_score | 0.571429 | 0.640000 |
roc_auc_score | 0.571429 | 0.583333 |
average_precision_score | 0.402778 | 0.535095 |
calibration_error | 0.391543 | 0.456051 |
brier_score | 0.241134 | 0.263120 |
We have now built a very basic tabular model using a MEDS dataset! We could build upon it using a wider and better selection of features, and potentially more sophisticated preprocessing pipelines or more advanced tabular models (such as XGBoost).
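As a sketch of that last point, we could reuse the same preprocessing and simply swap the classifier in the pipeline for a gradient-boosted tree model. This assumes the xgboost package is installed (pip install xgboost), and the hyperparameters below are illustrative defaults rather than tuned values:

# A sketch only: swap the logistic regression for an XGBoost classifier,
# reusing the preprocessor, training features, and tuning data from above.
from xgboost import XGBClassifier

xgb_model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", XGBClassifier(n_estimators=100, max_depth=3, random_state=42)),
])
xgb_model.fit(X_train, y_train.astype(int))

xgb_pred_proba = xgb_model.predict_proba(X_tuning)[:, 1]
print(f"XGBoost AUROC: {roc_auc_score(y_tuning, xgb_pred_proba)}")

Tree-based models also tolerate unscaled and missing features more gracefully, so the preprocessing pipeline could be simplified further for them.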
Using MEDS-Tab Instead
You may have noticed that our feature selection process for tabular models has been quite involved, as it required a deep understanding of the dataset structure and case-by-case selection, filtering and preprocessing of each of the features.
Considering the significance of tabular baselines, this adds up to a lot of repeated work for each potential dataset and task we may want to implement models for, and may introduce some reproducibility challenges when the feature selection is inconsistent between different applications.
To help address these challenges, MEDS ecosystem contributors have developed MEDS-Tab, a package for automatically tabularizing MEDS data and training decision tree models with hyperparameter tuning.
In this case, the whole model-building process could take just a few commands, such as
meds-tab-tabularize-time-series \
--multirun \
worker="range(0,$N_PARALLEL_WORKERS)" \
hydra/launcher=joblib \
"input_dir=${MEDS_RESHARD_DIR}/data" \
"output_dir=$OUTPUT_DIR" \
tabularization.min_code_inclusion_count=10 \
tabularization.window_sizes=[1d,30d,365d,full] \
tabularization.aggs=[static/present,static/first,code/count,value/count,value/sum,value/sum_sqd,value/min,value/max]
Feel free to check out the GitHub repository, the usage guide or the preprint for more information.