Getting Familiar with MEDS
Before we start building datasets and models using the MEDS ecosystem, we need to understand what the MEDS schema is and how it can be used to represent medical data. This tutorial does exactly that, with a simple introduction to the file format, layout, and what MEDS is all about. Check it out in the Jupyter notebook tutorial below, or see it on Google Colab or in our GitHub repository.
What is MEDS?
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_dark.svg"> <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_light.svg"> <img width="200" height="200" alt="MEDS Logo" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_light.svg"> </picture> </p>MEDS (The Medical Event Data Standard) is both a data standard for electronic health record (EHR) data and an open source ecosystem built atop that data standard that enables transportable, efficient tools to be used across different AI applications. This tutorial quickly explores that standard and ecosystem, so you can learn more. Let's dive in!
MEDS Data Standard
The MEDS data standard embodies simplicity first and foremost -- rather than trying to capture all aspects of EHR data in a unified vocabulary, MEDS tries to capture only the shared underlying structure that defines EHR data: namely, the fact that EHR data consists of a sequence of complex events occurring for a patient in continuous time. Let's see how this translates into our schema, visually:
<p align="center"> <img alt="MEDS Data Standard" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/medical-event-data-standard.github.io/refs/heads/main/static/img/data_figure.svg"> </p>Seeing this visually is one thing, but let's check out some actual data! To do this, we'll use a simple, static dataset defined in the MEDS Testing Helpers repository, which we'll write to the directory MEDS_data in this notebook:
!pip install --quiet meds_testing_helpers~=0.3.0
from meds_testing_helpers.static_sample_data import SIMPLE_STATIC_SHARDED_BY_SPLIT
from meds_testing_helpers.dataset import MEDSDataset
from pathlib import Path
data_root = Path("MEDS_data")
data = MEDSDataset.from_yaml(SIMPLE_STATIC_SHARDED_BY_SPLIT)
data.write(data_root);
What's in this directory? We'll use the Linux tree command to print it:
%%bash
apt-get -qq install tree > /dev/null
tree MEDS_data
MEDS_data
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   ├── 0.parquet
│   │   └── 1.parquet
│   └── tuning
│       └── 0.parquet
└── metadata
    ├── codes.parquet
    ├── dataset.json
    └── subject_splits.parquet

5 directories, 7 files
As we can see, there are a variety of files here. Let's break them down by type: data and metadata.
MEDS Data Files (data/**.parquet)
MEDS data files are stored as a sharded set of Parquet files. In MEDS, we use the term "shard name" to refer to a file's relative path under the data/ sub-directory, without the .parquet extension. So, in this case, we have 4 shards, with the following names:
- held_out/0
- train/0
- train/1
- tuning/0
Importantly, note that there are no requirements on shard names -- clearly, this dataset is "sharded by split" so that all patients in any given shard are within the same modeling split (either a train, tuning, or held out split), but this is not required in general.
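The mapping from file path to shard name can be sketched in a few lines. The helper below is purely illustrative (it is not part of any MEDS package), and the file paths simply mirror the tree output above:

```python
from pathlib import Path

# Illustrative helper: shard name = relative path under data/, minus
# the .parquet extension. Not part of any MEDS package.
def shard_name(fp: Path, root: Path) -> str:
    return fp.relative_to(root / "data").with_suffix("").as_posix()

root = Path("MEDS_data")
files = [
    root / "data" / "held_out" / "0.parquet",
    root / "data" / "train" / "0.parquet",
    root / "data" / "train" / "1.parquet",
    root / "data" / "tuning" / "0.parquet",
]
print([shard_name(fp, root) for fp in files])
# → ['held_out/0', 'train/0', 'train/1', 'tuning/0']
```
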
Each of these data files follows the schema above:
import pandas as pd
print("Train 0:")
display(pd.read_parquet(data_root / "data" / "train/0.parquet").head(5))
print("\nTuning 0:")
display(pd.read_parquet(data_root / "data" / "tuning/0.parquet").head(5))
Train 0:
|   | subject_id | time | code | numeric_value |
|---|---|---|---|---|
| 0 | 239684 | NaT | EYE_COLOR//BROWN | NaN |
| 1 | 239684 | NaT | HEIGHT | 175.271118 |
| 2 | 239684 | 1980-12-28 00:00:00 | DOB | NaN |
| 3 | 239684 | 2010-05-11 17:41:51 | ADMISSION//CARDIAC | NaN |
| 4 | 239684 | 2010-05-11 17:41:51 | HR | 102.599998 |
Tuning 0:
|   | subject_id | time | code | numeric_value |
|---|---|---|---|---|
| 0 | 754281 | NaT | EYE_COLOR//BROWN | NaN |
| 1 | 754281 | NaT | HEIGHT | 166.22261 |
| 2 | 754281 | 1988-12-19 00:00:00 | DOB | NaN |
| 3 | 754281 | 2010-01-03 06:27:59 | ADMISSION//PULMONARY | NaN |
| 4 | 754281 | 2010-01-03 06:27:59 | HR | 142.00000 |
MEDS Metadata Files (metadata/**)
In addition to the data files, we also have the metadata files codes.parquet, dataset.json, and subject_splits.parquet. Let's check those out now:
codes.parquet
This helps users understand the vocabulary of codes in a given dataset -- and it can be a source to link codes in the dataset to external ontologies or vocabularies.
pd.read_parquet(data_root / "metadata" / "codes.parquet").head(5)
|   | code | description | parent_codes |
|---|---|---|---|
| 0 | EYE_COLOR//BLUE | Blue Eyes. Less common than brown. | None |
| 1 | EYE_COLOR//BROWN | Brown Eyes. The most common eye color. | None |
| 2 | EYE_COLOR//HAZEL | Hazel eyes. These are uncommon | None |
| 3 | HR | Heart Rate | [LOINC/8867-4] |
| 4 | TEMP | Body Temperature | [LOINC/8310-5] |
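One common use of this file is to attach human-readable descriptions to the raw codes in a data shard. As a minimal sketch (with toy stand-ins for a data shard and codes.parquet whose rows echo the tables above; the TEMP measurement is invented), a plain left merge on the code column does the job:

```python
import pandas as pd

# Toy stand-ins for a data shard and metadata/codes.parquet; the rows
# echo the tables above (the TEMP measurement value is invented).
events = pd.DataFrame({
    "subject_id": [239684, 239684],
    "code": ["HR", "TEMP"],
    "numeric_value": [102.6, 37.1],
})
codes = pd.DataFrame({
    "code": ["HR", "TEMP"],
    "description": ["Heart Rate", "Body Temperature"],
    "parent_codes": [["LOINC/8867-4"], ["LOINC/8310-5"]],
})

# Left merge keeps every event row, attaching its description (if any)
annotated = events.merge(codes[["code", "description"]], on="code", how="left")
print(annotated["description"].tolist())
# → ['Heart Rate', 'Body Temperature']
```

The same pattern scales to linking codes out to external ontologies via the parent_codes column.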
dataset.json
This file has basic metadata about the dataset, used for tracking and versioning results that use the dataset. In this case, as this is a testing dataset, it is empty, but you'll see other datasets later where it is filled in.
print(Path(data_root / "metadata" / "dataset.json").read_text())
{}
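For a filled-in dataset, this file is just ordinary JSON you can read and write with the standard library. The field names below are purely illustrative (a hypothetical dataset name and version, not a guarantee of the exact MEDS metadata schema):

```python
import json
from pathlib import Path

# A hypothetical filled-in dataset.json; field names here are
# illustrative, not the authoritative MEDS metadata schema.
metadata = {"dataset_name": "MY_EHR", "dataset_version": "1.0"}

out = Path("dataset.json")
out.write_text(json.dumps(metadata, indent=2))

print(json.loads(out.read_text())["dataset_name"])
# → MY_EHR
```
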
subject_splits.parquet
This file allows a data owner to prescribe a specific data split for use in downstream modeling, to ensure results are maximally comparable. The split names used in MEDS are train, for the training set; tuning, for the set of patients you may use for hyperparameter tuning and similar purposes but that are not part of the final held-out test set (often called "dev" or "val"); and held_out, for the held-out set of patients (often called "test"). Users can, however, also include other, special splits corresponding to additional sets of patients held out for reasons specific to that dataset (e.g., dedicated, existing internal held-out sets for different projects).
pd.read_parquet(data_root / "metadata" / "subject_splits.parquet").head(5)
|   | subject_id | split |
|---|---|---|
| 0 | 239684 | train |
| 1 | 1195293 | train |
| 2 | 68729 | train |
| 3 | 814703 | train |
| 4 | 754281 | tuning |
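A typical use is restricting pooled data to the prescribed training subjects. Here is a minimal sketch with toy stand-ins for the splits table and a pooled data frame (the subject ids echo the table above; the event rows are invented):

```python
import pandas as pd

# Toy stand-ins for metadata/subject_splits.parquet and pooled data;
# subject ids echo the table above, event rows are invented.
splits = pd.DataFrame({
    "subject_id": [239684, 1195293, 754281],
    "split": ["train", "train", "tuning"],
})
events = pd.DataFrame({
    "subject_id": [239684, 239684, 754281],
    "code": ["HEIGHT", "HR", "HR"],
    "numeric_value": [175.3, 102.6, 142.0],
})

# Keep only events for subjects the data owner assigned to train
train_ids = set(splits.loc[splits["split"] == "train", "subject_id"])
train_events = events[events["subject_id"].isin(train_ids)]
print(sorted(train_events["code"].tolist()))
# → ['HEIGHT', 'HR']
```

Because splits live in metadata rather than being baked into the sharding, the same filter works no matter how the data files happen to be sharded.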
Using MEDS Data
How can we use these files to answer a simple question? Let's try plotting to see if there is a correlation in this dataset between the patient's height and their max heart rate.
from matplotlib import pyplot as plt

# For this hypothetical question, we'll just use all the data
df = pd.concat(
    [pd.read_parquet(fp) for fp in (data_root / "data").rglob("*.parquet")]
)

# Max heart rate per subject, from the HR measurement events
max_HR_df = df[df.code == "HR"].groupby("subject_id")["numeric_value"].max()
subjects = max_HR_df.index

# Each subject's (static, null-time) height, aligned to those subjects
height = (
    df[df.code == "HEIGHT"]
    .set_index("subject_id")
    .loc[subjects, "numeric_value"]
)

plt.scatter(height, max_HR_df)
plt.xlabel("Height (cm)")
plt.ylabel("Max HR (bpm)")
plt.show()
Alas, here we see no clear relationship -- though this does make sense, as this dataset is, after all, purely random! But hopefully through this simple example you see how you can begin to manipulate MEDS-formatted data to perform modeling tasks. Beyond that, maybe you even see how such processes might be easier to standardize across different EHR systems.
This is clearly an overly simplified example, but if you want to explore the basics of MEDS further and see how you can build real models on MEDS, with no external MEDS-specific packages required, you can check out this additional tutorial which explores these ideas in more depth!
MEDS Ecosystem
<p align="center"> <img alt="MEDS Ecosystem" width="60%" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/medical-event-data-standard.github.io/refs/heads/main/static/img/ecosystem_figure.svg"> </p>It is exactly the ideas explored in our simple plotting task that empower the growth of the MEDS ecosystem. There are a lot of things in that ecosystem, including utilities for task extraction, tabular baseline generation, model building, evaluation, and more! Check out some of the projects that use MEDS here!