
Getting Familiar with MEDS

Before we start building datasets and models with the MEDS ecosystem, we need to understand what the MEDS schema is and how it can represent medical data. This tutorial does exactly that, with a simple introduction to the file format, layout, and what MEDS is all about. Check it out in the Jupyter notebook tutorial below, or see it on Google Colab or in our GitHub repository.

What is MEDS?

<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_dark.svg"> <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_light.svg"> <img width="200" height="200" alt="MEDS Logo" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_light.svg"> </picture> </p>

MEDS (the Medical Event Data Standard) is both a data standard for electronic health record (EHR) data and an open-source ecosystem built atop that standard, enabling transportable, efficient tools to be used across different AI applications. This tutorial gives a quick tour of both the standard and the ecosystem. Let's dive in!

MEDS Data Standard

The MEDS data standard embodies simplicity first and foremost -- rather than trying to capture all aspects of EHR data in a unified vocabulary, MEDS captures only the shared underlying structure that defines EHR data: namely, that it consists of a sequence of complex events occurring for a patient in continuous time. Let's see how this translates into our schema, visually:

<p align="center"> <img alt="MEDS Data Standard" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/medical-event-data-standard.github.io/refs/heads/main/static/img/data_figure.svg"> </p>

Seeing this visually is one thing, but let's check out some actual data! To do this, we'll use a simple, static dataset defined in the MEDS Testing Helpers repository, which we'll write to the directory MEDS_data in this notebook:

!pip install --quiet meds_testing_helpers~=0.3.0
from meds_testing_helpers.static_sample_data import SIMPLE_STATIC_SHARDED_BY_SPLIT
from meds_testing_helpers.dataset import MEDSDataset
from pathlib import Path

data_root = Path("MEDS_data")

data = MEDSDataset.from_yaml(SIMPLE_STATIC_SHARDED_BY_SPLIT)
data.write(data_root);

What's in this directory? We'll use the Linux tree command to print it:

%%bash
apt-get -qq install tree > /dev/null
tree MEDS_data
MEDS_data
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   ├── 0.parquet
│   │   └── 1.parquet
│   └── tuning
│       └── 0.parquet
└── metadata
    ├── codes.parquet
    ├── dataset.json
    └── subject_splits.parquet

5 directories, 7 files

As we can see, there are a variety of files here. Let's break them down by type: data and metadata.

MEDS Data Files (data/**.parquet)

MEDS data files are stored as a sharded set of Parquet files. In MEDS, a "shard name" is the path of a data file relative to the data/ sub-directory, with the .parquet extension dropped. So, in this case, we have 4 shards, with the following names:

  • held_out/0
  • train/0
  • train/1
  • tuning/0

Importantly, note that there are no requirements on shard names -- clearly, this dataset is "sharded by split" so that all patients in any given shard are within the same modeling split (either a train, tuning, or held out split), but this is not required in general.
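Since shard names are just relative paths, you can enumerate them directly from the directory layout. Here is a small sketch that recreates the layout above with empty placeholder files and derives the shard names from it:

```python
from pathlib import Path
import tempfile

# Recreate the directory layout from the tree output above with empty
# placeholder files, then derive shard names: the path relative to
# data/, minus the .parquet suffix.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    for shard in ["held_out/0", "train/0", "train/1", "tuning/0"]:
        fp = data_dir / f"{shard}.parquet"
        fp.parent.mkdir(parents=True, exist_ok=True)
        fp.touch()

    shard_names = sorted(
        fp.relative_to(data_dir).with_suffix("").as_posix()
        for fp in data_dir.rglob("*.parquet")
    )

print(shard_names)  # ['held_out/0', 'train/0', 'train/1', 'tuning/0']
```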

Each of these data files follows the schema above:

import pandas as pd

print("Train 0:")
display(pd.read_parquet(data_root / "data" / "train/0.parquet").head(5))
print("\nTuning 0:")
display(pd.read_parquet(data_root / "data" / "tuning/0.parquet").head(5))
Train 0:
subject_id time code numeric_value
0 239684 NaT EYE_COLOR//BROWN NaN
1 239684 NaT HEIGHT 175.271118
2 239684 1980-12-28 00:00:00 DOB NaN
3 239684 2010-05-11 17:41:51 ADMISSION//CARDIAC NaN
4 239684 2010-05-11 17:41:51 HR 102.599998

Tuning 0:
subject_id time code numeric_value
0 754281 NaT EYE_COLOR//BROWN NaN
1 754281 NaT HEIGHT 166.22261
2 754281 1988-12-19 00:00:00 DOB NaN
3 754281 2010-01-03 06:27:59 ADMISSION//PULMONARY NaN
4 754281 2010-01-03 06:27:59 HR 142.00000

MEDS Metadata Files (metadata/**)

In addition to the data files, we also have metadata files codes.parquet, dataset.json, and subject_splits.parquet. Let's check those out now:

codes.parquet

This helps users understand the vocabulary of codes in a given dataset -- and it can be a source to link codes in the dataset to external ontologies or vocabularies.

pd.read_parquet(data_root / "metadata" / "codes.parquet").head(5)
code description parent_codes
0 EYE_COLOR//BLUE Blue Eyes. Less common than brown. None
1 EYE_COLOR//BROWN Brown Eyes. The most common eye color. None
2 EYE_COLOR//HAZEL Hazel eyes. These are uncommon None
3 HR Heart Rate [LOINC/8867-4]
4 TEMP Body Temperature [LOINC/8310-5]
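Because the code column here matches the code column in the data files, attaching descriptions (or parent ontology codes) to events is an ordinary join. A sketch on hand-built frames mirroring the tables above:

```python
import pandas as pd

# Hand-built events and code metadata mirroring the outputs above.
events = pd.DataFrame(
    {"code": ["HR", "TEMP", "HR"], "numeric_value": [102.6, 37.2, 86.0]}
)
codes = pd.DataFrame(
    {
        "code": ["HR", "TEMP"],
        "description": ["Heart Rate", "Body Temperature"],
        "parent_codes": [["LOINC/8867-4"], ["LOINC/8310-5"]],
    }
)

# Left-join so every event row keeps its metadata, even if a code is
# missing from codes.parquet.
annotated = events.merge(codes, on="code", how="left")
print(annotated[["code", "description"]])
```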

dataset.json

This file holds basic metadata about the dataset, used for tracking and versioning results produced from it. In this case, as this is a testing dataset, it is empty, but you'll see other datasets later where it is filled in.

print(Path(data_root / "metadata" / "dataset.json").read_text())
{}
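When populated, this file is just a small JSON object. The field names and values below are illustrative guesses, not a definitive schema -- a real dataset should follow the MEDS dataset metadata specification:

```python
import json

# Illustrative only: field names and values here are invented for this
# sketch; consult the MEDS dataset metadata schema for the real fields.
metadata = {
    "dataset_name": "toy_hospital",
    "dataset_version": "1.0",
    "etl_name": "toy_etl",
    "etl_version": "0.1",
}

text = json.dumps(metadata, indent=2)
print(text)

# Round-trips cleanly, like any JSON document.
parsed = json.loads(text)
```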

subject_splits.parquet

This file allows a data owner to prescribe a specific data split for use in downstream modeling, to ensure results are maximally comparable. The split names used in MEDS are train, for the training set; tuning, for the set of patients you may use for hyperparameter tuning, etc., but that are not part of the final held-out test set (often called "dev" or "val"); and held_out, for the held-out set of patients (often called "test"). Users can, however, also include other, special splits corresponding to additional sets of patients held out for other reasons specific to that dataset (e.g., dedicated, existing internal held-out sets for different projects).

pd.read_parquet(data_root / "metadata" / "subject_splits.parquet").head(5)
subject_id split
0 239684 train
1 1195293 train
2 68729 train
3 814703 train
4 754281 tuning
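To actually use these splits, you can join them onto the event data by subject_id and filter. A sketch on hand-built frames (subject IDs taken from the outputs above):

```python
import pandas as pd

# Hand-built events and splits mirroring the outputs above.
events = pd.DataFrame(
    {"subject_id": [239684, 754281, 239684], "code": ["HR", "HR", "DOB"]}
)
splits = pd.DataFrame(
    {"subject_id": [239684, 754281], "split": ["train", "tuning"]}
)

# Attach each row's split, then keep only training-subject rows.
with_split = events.merge(splits, on="subject_id", how="left")
train_events = with_split[with_split["split"] == "train"]
print(train_events["subject_id"].unique())  # [239684]
```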

Using MEDS Data

How can we use these files to answer a simple question? Let's try plotting to see if there is a correlation in this dataset between the patient's height and their max heart rate.

from matplotlib import pyplot as plt
import numpy as np

# For this hypothetical question, we'll just use all the data
df = pd.concat(
    [pd.read_parquet(fp) for fp in (data_root / "data").rglob("*.parquet")]
)

max_HR_df = df[df.code == "HR"].groupby("subject_id")["numeric_value"].max()

subjects = max_HR_df.index

height = (
    df[df.code == "HEIGHT"]
    .set_index("subject_id")
    .loc[subjects]
    ["numeric_value"]
)

plt.scatter(height, max_HR_df)
plt.xlabel("Height (cm)")
plt.ylabel("Max HR (bpm)")
plt.show()

Alas, here we see no clear relationship -- though this does make sense as this dataset is, after all, purely random! But hopefully through this simple example you see how you can begin to manipulate MEDS formatted data to perform modeling tasks. Beyond that, maybe you even see how such processes might be easier to standardize across different EHR systems.
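To go beyond eyeballing the scatter plot, you could also quantify the relationship with a Pearson correlation coefficient. A sketch on synthetic, independently drawn height and max-HR arrays (made up here, since the tutorial data is random anyway):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=100)  # cm; synthetic
max_hr = rng.normal(140, 20, size=100)  # bpm; synthetic, independent of height

# Pearson correlation: expected to be near zero for independent draws.
r = np.corrcoef(height, max_hr)[0, 1]
print(round(r, 3))
```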

This is clearly an overly simplified example, but if you want to explore the basics of MEDS further and see how you can build real models on MEDS, with no external MEDS-specific packages required, you can check out this additional tutorial which explores these ideas in more depth!

MEDS Ecosystem

<p align="center"> <img alt="MEDS Ecosystem" width="60%" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/medical-event-data-standard.github.io/refs/heads/main/static/img/ecosystem_figure.svg"> </p>

It is exactly the ideas explored in our simple plotting task that power the growth of the MEDS ecosystem. That ecosystem contains a lot, including utilities for task extraction, tabular baseline generation, model building, evaluation, and more! Check out some of the projects that use MEDS here!