Getting Familiar with MEDS
Before we start building datasets and models using the MEDS ecosystem, we need to understand what the MEDS schema is and how it can be used to represent medical data. This tutorial does exactly that, with a simple introduction to the file format, layout, and what MEDS is all about. Check it out in the Jupyter notebook tutorial below, or see it on Google Colab or in our GitHub repository.
What is MEDS?
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_dark.svg"> <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_light.svg"> <img width="200" height="200" alt="MEDS Logo" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/meds/refs/heads/main/static/logo_light.svg"> </picture> </p>MEDS (The Medical Event Data Standard) is both a data standard for electronic health record (EHR) data and an open source ecosystem built atop that data standard that enables transportable, efficient tools to be used across different AI applications. This tutorial quickly explores that standard and ecosystem, so you can learn more. Let's dive in!
MEDS Data Standard
The MEDS data standard embodies simplicity first and foremost -- rather than trying to capture all aspects of EHR data in a unified vocabulary, MEDS tries to capture only the shared underlying structure that defines EHR data: namely, the fact that EHR data consists of a sequence of complex events occurring for a patient in continuous time. Let's see how this translates into our schema, visually:
<p align="center"> <img alt="MEDS Data Standard" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/medical-event-data-standard.github.io/refs/heads/main/static/img/data_figure.svg"> </p>Seeing this visually is one thing, but let's check out some actual data! To do this, we'll use a simple, static dataset defined in the MEDS Testing Helpers repository, which we'll write to the directory MEDS_data in this notebook:
!pip install --quiet meds_testing_helpers~=0.3.0
from meds_testing_helpers.static_sample_data import SIMPLE_STATIC_SHARDED_BY_SPLIT
from meds_testing_helpers.dataset import MEDSDataset
from pathlib import Path
data_root = Path("MEDS_data")
data = MEDSDataset.from_yaml(SIMPLE_STATIC_SHARDED_BY_SPLIT)
data.write(data_root);
What's in this directory? We'll use the Linux tree command to print it:
%%bash
apt-get -qq install tree > /dev/null
tree MEDS_data
MEDS_data
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   ├── 0.parquet
│   │   └── 1.parquet
│   └── tuning
│       └── 0.parquet
└── metadata
    ├── codes.parquet
    ├── dataset.json
    └── subject_splits.parquet

5 directories, 7 files
As we can see, there are a variety of files here. Let's break them down by type: data and metadata.
MEDS Data Files (data/**.parquet)
MEDS data files are stored as a sharded set of Parquet files. In MEDS, we use the term "shard name" to refer to a file's relative path under the data/ sub-directory, without the .parquet extension. So, in this case, we have 4 shards, with the following names:
- held_out/0
- train/0
- train/1
- tuning/0
Importantly, note that there are no requirements on shard names -- clearly, this dataset is "sharded by split" so that all patients in any given shard are within the same modeling split (either a train, tuning, or held out split), but this is not required in general.
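The mapping from file path to shard name can be sketched in a few lines. The helper below is purely illustrative (it is not part of any MEDS package), and the file paths simply mirror the tree output above:

```python
from pathlib import Path

# Illustrative helper: shard name = relative path under data/, minus
# the .parquet extension. Not part of any MEDS package.
def shard_name(fp: Path, root: Path) -> str:
    return fp.relative_to(root / "data").with_suffix("").as_posix()

root = Path("MEDS_data")
files = [
    root / "data" / "held_out" / "0.parquet",
    root / "data" / "train" / "0.parquet",
    root / "data" / "train" / "1.parquet",
    root / "data" / "tuning" / "0.parquet",
]
print([shard_name(fp, root) for fp in files])
# → ['held_out/0', 'train/0', 'train/1', 'tuning/0']
```
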
Each of these data files follows the schema above:
import pandas as pd
print("Train 0:")
display(pd.read_parquet(data_root / "data" / "train/0.parquet").head(5))
print("\nTuning 0:")
display(pd.read_parquet(data_root / "data" / "tuning/0.parquet").head(5))
Train 0:
|   | subject_id | time | code | numeric_value |
|---|---|---|---|---|
| 0 | 239684 | NaT | EYE_COLOR//BROWN | NaN |
| 1 | 239684 | NaT | HEIGHT | 175.271118 |
| 2 | 239684 | 1980-12-28 00:00:00 | DOB | NaN |
| 3 | 239684 | 2010-05-11 17:41:51 | ADMISSION//CARDIAC | NaN |
| 4 | 239684 | 2010-05-11 17:41:51 | HR | 102.599998 |
Tuning 0:
|   | subject_id | time | code | numeric_value |
|---|---|---|---|---|
| 0 | 754281 | NaT | EYE_COLOR//BROWN | NaN |
| 1 | 754281 | NaT | HEIGHT | 166.22261 |
| 2 | 754281 | 1988-12-19 00:00:00 | DOB | NaN |
| 3 | 754281 | 2010-01-03 06:27:59 | ADMISSION//PULMONARY | NaN |
| 4 | 754281 | 2010-01-03 06:27:59 | HR | 142.00000 |
MEDS Metadata Files (metadata/**)
In addition to the data files, we also have the metadata files codes.parquet, dataset.json, and subject_splits.parquet. Let's check those out now:
codes.parquet
This helps users understand the vocabulary of codes in a given dataset -- and it can be a source to link codes in the dataset to external ontologies or vocabularies.
pd.read_parquet(data_root / "metadata" / "codes.parquet").head(5)
|   | code | description | parent_codes |
|---|---|---|---|
| 0 | EYE_COLOR//BLUE | Blue Eyes. Less common than brown. | None |
| 1 | EYE_COLOR//BROWN | Brown Eyes. The most common eye color. | None |
| 2 | EYE_COLOR//HAZEL | Hazel eyes. These are uncommon | None |
| 3 | HR | Heart Rate | [LOINC/8867-4] |
| 4 | TEMP | Body Temperature | [LOINC/8310-5] |
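One common use of this file is to attach human-readable descriptions to the raw codes in a data shard. As a minimal sketch (with toy stand-ins for a data shard and codes.parquet whose rows echo the tables above; the TEMP measurement is invented), a plain left merge on the code column does the job:

```python
import pandas as pd

# Toy stand-ins for a data shard and metadata/codes.parquet; the rows
# echo the tables above (the TEMP measurement value is invented).
events = pd.DataFrame({
    "subject_id": [239684, 239684],
    "code": ["HR", "TEMP"],
    "numeric_value": [102.6, 37.1],
})
codes = pd.DataFrame({
    "code": ["HR", "TEMP"],
    "description": ["Heart Rate", "Body Temperature"],
    "parent_codes": [["LOINC/8867-4"], ["LOINC/8310-5"]],
})

# Left merge keeps every event row, attaching its description (if any)
annotated = events.merge(codes[["code", "description"]], on="code", how="left")
print(annotated["description"].tolist())
# → ['Heart Rate', 'Body Temperature']
```

The same pattern scales to linking codes out to external ontologies via the parent_codes column.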
dataset.json
This file has basic metadata about the dataset, used for tracking and versioning results that use the dataset. In this case, as this is a testing dataset, it is empty, but you'll see other datasets later where it is filled in.
print(Path(data_root / "metadata" / "dataset.json").read_text())
{}
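For a filled-in dataset, this file is just ordinary JSON you can read and write with the standard library. The field names below are purely illustrative (a hypothetical dataset name and version, not a guarantee of the exact MEDS metadata schema):

```python
import json
from pathlib import Path

# A hypothetical filled-in dataset.json; field names here are
# illustrative, not the authoritative MEDS metadata schema.
metadata = {"dataset_name": "MY_EHR", "dataset_version": "1.0"}

out = Path("dataset.json")
out.write_text(json.dumps(metadata, indent=2))

print(json.loads(out.read_text())["dataset_name"])
# → MY_EHR
```
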
subject_splits.parquet
This file allows a data owner to prescribe a specific data split for use in downstream modeling, to ensure results are maximally comparable. The split names used in MEDS are train, for the training set; tuning, for the set of patients you may use for hyperparameter tuning and similar purposes but that are not part of the final held-out test set (often called "dev" or "val"); and held_out, for the held-out set of patients (often called "test"). Users can, however, also include other, special splits corresponding to additional sets of patients held out for reasons specific to that dataset (e.g., dedicated, existing internal held-out sets for different projects).
pd.read_parquet(data_root / "metadata" / "subject_splits.parquet").head(5)
|   | subject_id | split |
|---|---|---|
| 0 | 239684 | train |
| 1 | 1195293 | train |
| 2 | 68729 | train |
| 3 | 814703 | train |
| 4 | 754281 | tuning |
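A typical use is restricting pooled data to the prescribed training subjects. Here is a minimal sketch with toy stand-ins for the splits table and a pooled data frame (the subject ids echo the table above; the event rows are invented):

```python
import pandas as pd

# Toy stand-ins for metadata/subject_splits.parquet and pooled data;
# subject ids echo the table above, event rows are invented.
splits = pd.DataFrame({
    "subject_id": [239684, 1195293, 754281],
    "split": ["train", "train", "tuning"],
})
events = pd.DataFrame({
    "subject_id": [239684, 239684, 754281],
    "code": ["HEIGHT", "HR", "HR"],
    "numeric_value": [175.3, 102.6, 142.0],
})

# Keep only events for subjects the data owner assigned to train
train_ids = set(splits.loc[splits["split"] == "train", "subject_id"])
train_events = events[events["subject_id"].isin(train_ids)]
print(sorted(train_events["code"].tolist()))
# → ['HEIGHT', 'HR']
```

Because splits live in metadata rather than being baked into the sharding, the same filter works no matter how the data files happen to be sharded.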
Using MEDS Data
How can we use these files to answer a simple question? Let's try plotting to see if there is a correlation in this dataset between the patient's height and their max heart rate.
from matplotlib import pyplot as plt

# For this hypothetical question, we'll just use all the data
df = pd.concat(
    [pd.read_parquet(fp) for fp in (data_root / "data").rglob("*.parquet")]
)

# Max heart rate per subject, from the HR measurement events
max_HR_df = df[df.code == "HR"].groupby("subject_id")["numeric_value"].max()
subjects = max_HR_df.index

# Each subject's (static, null-time) height, aligned to those subjects
height = (
    df[df.code == "HEIGHT"]
    .set_index("subject_id")
    .loc[subjects, "numeric_value"]
)

plt.scatter(height, max_HR_df)
plt.xlabel("Height (cm)")
plt.ylabel("Max HR (bpm)")
plt.show()
Alas, here we see no clear relationship -- though this does make sense, as this dataset is, after all, purely random! But hopefully through this simple example you see how you can begin to manipulate MEDS-formatted data to perform modeling tasks. Beyond that, maybe you even see how such processes might be easier to standardize across different EHR systems.
This is clearly an overly simplified example, but if you want to explore the basics of MEDS further and see how you can build real models on MEDS, with no external MEDS-specific packages required, you can check out this additional tutorial which explores these ideas in more depth!
MEDS Ecosystem
<p align="center"> <img alt="MEDS Ecosystem" width="60%" src="https://raw.githubusercontent.com/Medical-Event-Data-Standard/medical-event-data-standard.github.io/refs/heads/main/static/img/ecosystem_figure.svg"> </p>It is exactly the ideas explored in our simple plotting task that empower the growth of the MEDS ecosystem. There are a lot of things in that ecosystem, including utilities for task extraction, tabular baseline generation, model building, evaluation, and more! Check out some of the projects that use MEDS here!