# What is MEDS?
MEDS is a data standard for structured, longitudinal medical record data, built for reproducible, efficient Machine Learning (ML)/Artificial Intelligence (AI) research in healthcare. It is designed to be simple, flexible, and interoperable with existing tools and standards. MEDS is entirely open-source and community-driven, and we welcome contributions from all interested parties!
The critical aspects of the MEDS standard can be seen visually in the image below:

Here, we show both the required organization of MEDS files on disk, as well as the schema of the core data and metadata elements for MEDS datasets. In the rest of this document, we will explore these key concepts in more detail, in particular covering:

1. Requirements for a MEDS compliant dataset
2. MEDS dataset conventions and best practices
3. Future roadmap and how to contribute
4. A glossary of some key terminology and concepts
## Requirements for a MEDS Compliant Dataset
For a dataset to be compliant with the MEDS standard at a given version (versioning is given by the PyPI package version), it must satisfy several requirements:
- It must be stored in a directory structure that is compliant with the MEDS directory structure specification.
- It must store the required data files in the required PyArrow Parquet format.
- It must store the required metadata files in the required JSON and PyArrow Parquet formats.
### MEDS Directory Structure Specification
The MEDS directory structure is a simple, hierarchical directory structure that is designed to be easy to use and understand. The root directory of a MEDS dataset is referred to as the MEDS root directory, and all paths within the MEDS dataset are relative to this root directory. There are two required subdirectories of the MEDS root directory: `data/` and `metadata/`. The `data/` directory contains the MEDS data files, and the `metadata/` directory contains the MEDS metadata files:
```
├─data/
│ └─**.parquet
│
└─metadata/
  ├─codes.parquet
  ├─dataset.json
  └─subject_splits.parquet
```
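As a minimal sketch of what this layout implies, the following snippet (standard library only; the root path is hypothetical, and this checks only file presence, not schemas) verifies the required files exist:

```python
from pathlib import Path


def check_meds_layout(meds_root: Path) -> list[str]:
    """Return a list of layout problems; an empty list means the basic structure is OK."""
    problems = []
    # data/ should hold the (possibly nested) parquet data shards.
    if not any((meds_root / "data").rglob("*.parquet")):
        problems.append("data/ must contain at least one parquet data file")
    # metadata/ must hold the three required metadata files.
    for name in ("codes.parquet", "dataset.json", "subject_splits.parquet"):
        if not (meds_root / "metadata" / name).is_file():
            problems.append(f"missing metadata/{name}")
    return problems


print(check_meds_layout(Path("/path/to/meds/dataset")))  # hypothetical root
```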
### MEDS Data File Specification
As is shown above, data files are stored as parquet files in a (potentially multi-level) nested layout within the `data/` folder (and all such parquet files must be data files). Each of these individual data files is a single shard of the dataset, and must satisfy the following specifications:

- It must be compliant with the MEDS data schema.
- All data for a given subject must be stored in the same shard.
- Shards must be sorted by `subject_id` and `time` within the shard; ordering within these groups is unspecified. (One way to enforce this ordering is shown in the sketch following this list.)
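Here is one way to enforce that ordering with PyArrow (a minimal sketch; the codes and values are invented). Placing nulls at the start sorts null-time, i.e. static, rows to the front of each subject's data, matching the convention described below:

```python
from datetime import datetime

import pyarrow as pa

shard = pa.table(
    {
        "subject_id": pa.array([2, 1, 1], type=pa.int64()),
        "time": pa.array(
            [datetime(2020, 1, 1), None, datetime(2019, 5, 2)],
            type=pa.timestamp("us"),
        ),
        "code": pa.array(["LAB//HR", "GENDER//F", "ICD9CM/487.0"]),
        "numeric_value": pa.array([72.0, None, None], type=pa.float32()),
    }
)

# Sort by subject, then time; "at_start" puts null-time (static) rows
# at the beginning of each subject's data.
shard = shard.sort_by(
    [("subject_id", "ascending"), ("time", "ascending")],
    null_placement="at_start",
)
```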
The MEDS data schema is an Apache Arrow schema that specifies the required columns and data types for MEDS data files. It currently includes the following columns:

- `subject_id`: A unique identifier for each subject in the dataset, of type `int64`.
- `time`: The time at which the measurement corresponding to this row occurred, of type `timestamp[us]`.
- `code`: A code representing the measurement that occurred (e.g., a diagnosis or medication code), of type `string`.
- `numeric_value`: If the measurement has a numeric value associated with it (e.g., a lab result), this column contains that value, of type `float32`.
All columns except `subject_id` and `code` may contain nulls. If the `time` column is null, it indicates a static measurement, and such rows should be sorted to the beginning of their associated subject's data. If the `numeric_value` column is null, it indicates that the measurement does not have an associated numeric value.
Note that MEDS data files can contain additional columns beyond the required columns to store additional identifiers, other data modalities, etc. to support the specific needs of a given dataset.
The MEDS data PyArrow schema can be imported from the MEDS PyPI package to validate MEDS data files.
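For illustration, a lightweight check can also be written directly against the column and type requirements listed above (a sketch; the shard path is hypothetical, and the official schema object from the `meds` package should be preferred in practice):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The four required columns and their types, per the spec above.
REQUIRED_COLUMNS = {
    "subject_id": pa.int64(),
    "time": pa.timestamp("us"),
    "code": pa.string(),
    "numeric_value": pa.float32(),
}

schema = pq.read_schema("data/train/0.parquet")  # hypothetical shard path
for name, dtype in REQUIRED_COLUMNS.items():
    idx = schema.get_field_index(name)
    assert idx != -1, f"missing required column: {name}"
    assert schema.field(idx).type == dtype, f"{name} must be of type {dtype}"
```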
The path from the MEDS data folder (`$MEDS_ROOT/data/`) to the shard file, `/`-separated and without the `.parquet` extension, is the shard name.
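For example, the shard stored at `$MEDS_ROOT/data/train/0.parquet` has the shard name `train/0`. A small helper along these lines (a sketch using only the standard library):

```python
from pathlib import Path


def shard_name(shard_path: Path, data_dir: Path) -> str:
    """Strip the data directory prefix and the .parquet extension."""
    return shard_path.relative_to(data_dir).with_suffix("").as_posix()


# E.g., shard_name(Path("/meds/data/train/0.parquet"), Path("/meds/data"))
# returns "train/0".
```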
### MEDS Metadata File Specification
As shown above, there are three key MEDS metadata files: `codes.parquet`, `dataset.json`, and `subject_splits.parquet`.
#### `codes.parquet`
This file contains metadata about the `code` vocabulary featured in the data files. It must contain the following three columns:

- `code`: The code value, of type `string`.
- `description`: An optional free-text, human-readable description of the code, of type `string`.
- `parent_codes`: An optional list of links to parent codes in this dataset or external ontology nodes associated with this code, of type `list[string]`.
Much like the data schema, the `codes.parquet` file can contain additional columns beyond the required columns.
It is not guaranteed that all codes will have descriptions or parent codes, or even appear as a row in the metadata file at all! Further, the parent codes listed in this file are not guaranteed to be exhaustive or complete.
One common use of the `parent_codes` column is to link to external ontologies in the OMOP vocabulary space. Such linkages should be formatted as `"$VOCABULARY_NAME/$CONCEPT_NAME"`; for example, a `parent_code` of `"ICD9CM/487.0"` would be a reference to ICD9 code 487.0.
The formal schema for the `codes.parquet` file can be imported from the `meds` package and is documented here.
Some libraries and models will rely on the `codes.parquet` file for various tasks, such as producing embedding vectors of codes based on free-text descriptions, performing ontology expansion, or storing code value statistics for normalization.
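As a concrete (invented) example, the following sketch writes a one-row `codes.parquet` file whose `parent_codes` entry uses the OMOP-style linkage format described above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

codes_schema = pa.schema(
    [
        ("code", pa.string()),
        ("description", pa.string()),
        ("parent_codes", pa.list_(pa.string())),
    ]
)

codes = pa.Table.from_pylist(
    [
        {
            "code": "ICD9CM/487.0",
            "description": "Influenza with pneumonia",
            "parent_codes": ["ICD9CM/487"],  # link to the parent ontology node
        }
    ],
    schema=codes_schema,
)
pq.write_table(codes, "metadata/codes.parquet")
```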
#### `dataset.json`
This file contains metadata about the dataset itself, including the following:

- `dataset_name`: The name of the dataset, of type `string`.
- `dataset_version`: The version of the dataset, of type `string`. Ensuring the version numbers used are meaningful and unique is important for reproducibility, but is ultimately not enforced by the MEDS schema and is left to the dataset creator.
- `etl_name`: The name of the ETL process used to generate the dataset, of type `string`.
- `etl_version`: The version of the ETL process used to generate the dataset, of type `string`.
- `meds_version`: The version of the MEDS standard used to generate the dataset, of type `string`.
- `created_at`: The timestamp at which the dataset was created, of type `string` in ISO 8601 format (note that this is not an official timestamp type, but rather a string representation of a timestamp, as this is a JSON file).
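For example, a `dataset.json` file might be written as follows (a sketch; all field values are invented):

```python
import json
from datetime import datetime, timezone

dataset_metadata = {
    "dataset_name": "my_hospital_ehr",  # invented
    "dataset_version": "1.0.0",         # invented
    "etl_name": "my_meds_etl",          # invented
    "etl_version": "0.2.1",             # invented
    "meds_version": "0.3.3",            # invented; use your actual MEDS version
    "created_at": datetime.now(timezone.utc).isoformat(),  # ISO 8601 string
}

with open("metadata/dataset.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)
```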
The formal JSON schema for the `dataset.json` file can be imported from the `meds` package and is documented here.
#### `subject_splits.parquet`
This file maps subject IDs to pre-defined splits of the dataset, such as training, hyperparameter tuning, and held-out sets. In the MEDS splits file, each row contains a `subject_id` (`int64`) column and a `split` (`string`) column, where `split` is the name of the split in which that subject lives. For the three canonical AI/ML splits, MEDS uses the following split names:
- `train`: The training split. This data can be used for any purpose during model building, and in supervised training, labels over this split will be visible to the model.
- `tuning`: The hyperparameter tuning split, sometimes called the "dev" or "val" split in other contexts. This data can be used for tuning hyperparameters or for training the final model, but should not be used for final evaluation of model performance. Users who need more splits or a different split ratio may merge this split into the training split and re-shuffle themselves. Not all datasets will specify this split, as it is optional.
- `held_out`: The final evaluation held-out split, sometimes called the "test" split in other contexts. This data should not be used for training or tuning, and should only be used for final evaluation of model performance. No data about these subjects should be assumed to be available during data pre-processing, training, or tuning.
In addition to these splits, any additional custom splits desired by the user may be included. No additional columns are allowed in this file. The parquet schema for this file can be imported from the `meds` package and is documented here.
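A minimal sketch of writing this file with PyArrow, using the three canonical split names plus one custom split (the subject IDs and custom split name are invented):

```python
import pyarrow as pa
import pyarrow.parquet as pq

splits = pa.Table.from_pylist(
    [
        {"subject_id": 1, "split": "train"},
        {"subject_id": 2, "split": "tuning"},
        {"subject_id": 3, "split": "held_out"},
        {"subject_id": 4, "split": "special_cohort"},  # custom split
    ],
    schema=pa.schema([("subject_id", pa.int64()), ("split", pa.string())]),
)
pq.write_table(splits, "metadata/subject_splits.parquet")
```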
### Labeled cohorts over a MEDS dataset
In addition to the data and metadata files, MEDS also provides a schema for defining labeled cohorts over a MEDS dataset. Label files do not have a required on-disk organization, though it is recommended to store them in a `labels/$COHORT_NAME/**.parquet` format within the MEDS root directory. Labeled cohorts within MEDS consist of a set of sharded parquet files (the sharding need not be identical to the data shards). Each of these files is a table in which each row corresponds to one "sample" in the cohort (a sample is a single unit of prediction, and there may be multiple samples corresponding to a single subject in the MEDS dataset). Each row in the table must contain the following columns:
- `subject_id`: The subject ID of the subject for this sample, of type `int64`.
- `prediction_time`: The upper bound (inclusive) of the time window of data which can be observed when this prediction is made, of type `timestamp[us]`. E.g., your model may use data for all events that occur at or before this time to make a prediction for this sample.
- `boolean_value`: If this task is a binary classification task, this column contains the binary label for the sample, of type `bool`; otherwise this column is `null`.
- `integer_value`: If this task is an ordinal regression or a classification task with integral labels, this column contains the numeric label for the sample, of type `int64`; otherwise this column is `null`.
- `float_value`: If this task is a regression task, this column contains the numeric label for the sample, of type `float64`; otherwise this column is `null`.
- `categorical_value`: If this task is a classification task with categorical labels, this column contains the categorical label for the sample, of type `string`; otherwise this column is `null`.
The formal schema for the labeled cohort files can be imported from the `meds` package and is documented here.
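For instance, a binary classification cohort might be written as below (a sketch; the cohort name, times, and labels are invented). Only `boolean_value` is populated; the other label columns stay null:

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

label_schema = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("integer_value", pa.int64()),
        ("float_value", pa.float64()),
        ("categorical_value", pa.string()),
    ]
)

# Two samples; keys omitted from a row (e.g., integer_value) become null.
labels = pa.Table.from_pylist(
    [
        {
            "subject_id": 1,
            "prediction_time": datetime(2020, 1, 2, 8, 30),
            "boolean_value": False,
        },
        {
            "subject_id": 3,
            "prediction_time": datetime(2021, 6, 5, 12, 0),
            "boolean_value": True,
        },
    ],
    schema=label_schema,
)
pq.write_table(labels, "labels/in_hospital_mortality/0.parquet")
```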
## MEDS Dataset Conventions and Best Practices
### Recommended constants
The `meds` Python package defines a number of constants that are useful for building maximally compatible datasets. These include:

- Subdirectory and file names for the required files, such as `meds.data_subdirectory` and `meds.subject_splits_filepath`
- Constants for column names and dtypes, such as `meds.subject_id_column` and `meds.subject_id_dtype`
- Codes for birth and death events: `meds.birth_code = "MEDS_BIRTH"` and `meds.death_code = "MEDS_DEATH"`
- The three sentinel split names: `meds.train_split = "train"`, `meds.tuning_split = "tuning"`, and `meds.held_out_split = "held_out"`
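A sketch of how these constants might be used in practice (the root path is hypothetical, and this assumes the constants behave as described above, i.e., as path fragments and column-name strings):

```python
from pathlib import Path

import pyarrow.compute as pc
import pyarrow.parquet as pq

import meds

meds_root = Path("/path/to/meds/dataset")  # hypothetical

# Locate the data shards and the splits file via the published constants.
shard_paths = sorted((meds_root / meds.data_subdirectory).rglob("*.parquet"))
splits = pq.read_table(meds_root / meds.subject_splits_filepath)

# Collect the IDs of training subjects using the sentinel split name.
train_ids = splits.filter(pc.equal(splits["split"], meds.train_split))[
    meds.subject_id_column
]
```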
## Future Roadmap and How to Contribute
While MEDS already enables a lot of exciting research, we have a number of plans to make it even better. Key ongoing efforts include, but are not limited to, those listed below. If you'd like to contribute to MEDS, either through any of these efforts or other approaches, please feel free to reach out on our GitHub. We welcome any and all contributions!
### Multi-modal data support
MEDS is built for supporting longitudinal, structured EHR data, but it is clear that health AI covers more than just this kind of data. We are actively working to extend MEDS to support additional data modalities, including free-text, imaging, waveform, and other data types. There are already several tools that make use of some free-text data through a proposed `text_value` column, but official support is still in the works. Stay tuned or get involved on our GitHub to help with these efforts!
### Visualization and data exploration tools
In order to model data effectively, you first have to understand it, and few things help data understanding more than high-quality visualization and data exploration tools. The ecosystem for such tools in the EHR data landscape is very limited, and we are actively working to build out a set of tools for the MEDS format that can help researchers better understand their data.
### Standardized support for complex data pre-processing steps
While MEDS is designed to be simple and flexible, there are a number of complex data pre-processing steps that are common in health AI research but not yet supported out of the box by existing tools, such as vocabulary conversion, unit standardization, structured-data summarization to free-text, use of large language models (LLMs), or data QA testing. We are actively working to build support for these steps through either dedicated MEDS-Transforms stages or standalone tools on a case-by-case basis. Feel free to reach out if these efforts would help your research or you'd like to contribute!
### More extensive data validation and error checking
Health data is known to be highly noisy and to suffer from high rates of errors, be they physiologically impossible measurements, mis-labeled data, or low-information-content observations. We are actively working to build out standardized tools that can help automatically clean MEDS datasets to a limited degree, helping researchers make their data more meaningful and reliable in a transparent, reproducible way.
## Key Terminology and Concepts
- A _subject_ in a MEDS dataset is the primary entity being described by the sequences of care observations in the underlying dataset. In most cases, subjects will, naturally, be individuals, and the sequences of care observations will cover all known observations about those individuals in a source health dataset. However, in some cases, data may be organized so that we cannot describe all the data for an individual reliably in a dataset, but instead can only describe subsequences of an individual's data, such as in datasets that only link an individual's data observations together if they are within the same hospital admission, regardless of how many admissions that individual has in the dataset (such as the eICU dataset). In these cases, a subject in the MEDS dataset may refer to a hospital admission rather than an individual.
- A _measurement_ in a MEDS dataset is a particular observation made on a subject, either statically or dynamically at a point in time. Measurements are the fundamental unit of data in MEDS datasets, and the core data schema is a longitudinal sequence of measurements for each subject in the dataset. Measurements generally fall into one of three categories, which may require different handling:
  - _Static_ measurements are those that are recorded in the source dataset absent a specific timestamp and are assumed to be observed and applicable across all observations in the patient record. Note this is not the same as things that are conceptually assumed to be static; e.g., a patient's race may be recorded at each visit in a health record, and thus would be treated as a dynamic measurement in that dataset specifically. Likewise, some datasets may have static measurements that we would conceptually expect to plausibly change over time, such as a patient's gender or the institution of care.
  - _Time-derived_ measurements are measurements that vary in time, but are directly programmatically determinable from the timestamp of the observation and the subject's static or historic data. For example, a patient's age at the time of a measurement is a time-derived measurement, as it can be calculated from the patient's date of birth and the timestamp of the observation. Similarly, the time of day that a set of labs is taken is a time-derived measurement. Time-derived measurements are often not directly recorded in the raw data, but may be inferred or added during analysis.
  - _Dynamic_ measurements are those that are recorded in the source dataset with a specific timestamp indicating when the observation was made. These measurements are assumed to be observed at a single unique point in time and are not necessarily applicable across all observations in the patient record. As these are recorded observations, they are generally assumed to not be programmatically determinable in the manner of time-derived measurements.
- An _event_ in a MEDS dataset is a set of measurements that are observed at a single unique point in time. Measurements within an event are not necessarily independent of each other. Further, while _events_ can be meaningfully ordered in time, measurements within an event should not be assumed a priori to come with any meaningful ordering. In some cases, _event_ will be used interchangeably with _measurement_, but when the two terms are used distinctly, _event_ will refer to those measurements that share a unique timepoint, and _measurement_ will refer to the individual observations within an event (see the sketch after this list).
- Within a measurement, a _code_ is the categorical descriptor of what is being observed in that measurement. _Codes_ are not required to follow any particular coding vocabulary and should be assumed to be institution-specific unless otherwise specified.
- A _shard_ in a MEDS dataset is a single file containing a subset of the data for the dataset. Shards are used to split the data into manageable chunks for processing and storage. All data for a given subject must be stored in the same shard.
- A _sample_ in a labeled cohort is one unit of prediction. This may be at the subject level or, more commonly, at the subject-event level, where a prediction is made for a subset of key events in a subject's record. For example, we may wish to make a prediction of in-hospital mortality at the 24-hour mark after admission for each admission of a subject in a dataset. In this case, each admission that meets the inclusion/exclusion criteria would constitute a sample in the cohort.
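As referenced in the _event_ entry above, grouping a MEDS data table on `(subject_id, time)` recovers events from individual measurements. A minimal PyArrow sketch (the rows are invented):

```python
from datetime import datetime

import pyarrow as pa

measurements = pa.table(
    {
        "subject_id": pa.array([1, 1, 1], type=pa.int64()),
        "time": pa.array(
            [
                datetime(2020, 1, 1, 9),
                datetime(2020, 1, 1, 9),
                datetime(2020, 1, 2, 9),
            ],
            type=pa.timestamp("us"),
        ),
        "code": pa.array(["LAB//HR", "LAB//RR", "ICD9CM/487.0"]),
        "numeric_value": pa.array([72.0, 16.0, None], type=pa.float32()),
    }
)

# Each output row is one event: all measurements sharing (subject_id, time)
# are collected into code_list / numeric_value_list columns.
events = pa.TableGroupBy(measurements, ["subject_id", "time"]).aggregate(
    [("code", "list"), ("numeric_value", "list")]
)
```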