# What is MEDS?
MEDS is a data standard for structured, longitudinal medical record data, built for reproducible, efficient Machine Learning (ML)/Artificial Intelligence (AI) research in healthcare. It is designed to be simple, flexible, and interoperable with existing tools and standards. MEDS is entirely open-source and community-driven, and we welcome contributions from all interested parties!
The critical aspects of the MEDS standard can be seen visually in the image below:

Here, we show both the required organization of MEDS files on disk, as well as the schema of the core data and metadata elements for MEDS datasets. In the rest of this document, we will explore these key concepts in more detail, in particular covering:

1. Requirements for a MEDS compliant dataset
2. MEDS dataset conventions and best practices
3. Future roadmap and how to contribute
4. A glossary of some key terminology and concepts
## Requirements for a MEDS Compliant Dataset
For a dataset to be compliant with the MEDS standard at a given version (versioning is given by the PyPI package version), it must satisfy several requirements:
- It must be stored in a directory structure that is compliant with the MEDS directory structure specification.
- It must store the required data files in the required PyArrow Parquet format.
- It must store the required metadata files in the required JSON and PyArrow Parquet formats.
### MEDS Directory Structure Specification
The MEDS directory structure is a simple, hierarchical directory structure that is designed to be easy to use and understand. The root directory of a MEDS dataset is referred to as the MEDS root directory, and all paths within the MEDS dataset are relative to this root directory. There are two required subdirectories of the MEDS root directory: `data/` and `metadata/`. The `data/` directory contains the MEDS data files, and the `metadata/` directory contains the MEDS metadata files:
```
├─data/
│ └─**.parquet
│
└─metadata/
  ├─codes.parquet
  ├─dataset.json
  └─subject_splits.parquet
```
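As a minimal sketch of what this layout implies, the following snippet (standard library only; the root path is hypothetical, and this checks only file presence, not schemas) verifies the required files exist:

```python
from pathlib import Path


def check_meds_layout(meds_root: Path) -> list[str]:
    """Return a list of layout problems; an empty list means the basic structure is OK."""
    problems = []
    # data/ should hold the (possibly nested) parquet data shards.
    if not any((meds_root / "data").rglob("*.parquet")):
        problems.append("data/ must contain at least one parquet data file")
    # metadata/ must hold the three required metadata files.
    for name in ("codes.parquet", "dataset.json", "subject_splits.parquet"):
        if not (meds_root / "metadata" / name).is_file():
            problems.append(f"missing metadata/{name}")
    return problems


print(check_meds_layout(Path("/path/to/meds/dataset")))  # hypothetical root
```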
### MEDS Data File Specification
As is shown above, data files are stored as parquet files in a (potentially multi-level) nested layout within the `data/` folder (and all such parquet files must be data files). Each of these individual data files is a single shard of the dataset, and must satisfy the following specifications:

- It must be compliant with the MEDS data schema.
- All data for a given subject must be stored in the same shard.
- Shards must be sorted by `subject_id` and `time` within the shard; ordering within these groups is unspecified. (One way to enforce this ordering is shown in the sketch following this list.)
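Here is one way to enforce that ordering with PyArrow (a minimal sketch; the codes and values are invented). Placing nulls at the start sorts null-time, i.e. static, rows to the front of each subject's data, matching the convention described below:

```python
from datetime import datetime

import pyarrow as pa

shard = pa.table(
    {
        "subject_id": pa.array([2, 1, 1], type=pa.int64()),
        "time": pa.array(
            [datetime(2020, 1, 1), None, datetime(2019, 5, 2)],
            type=pa.timestamp("us"),
        ),
        "code": pa.array(["LAB//HR", "GENDER//F", "ICD9CM/487.0"]),
        "numeric_value": pa.array([72.0, None, None], type=pa.float32()),
    }
)

# Sort by subject, then time; "at_start" puts null-time (static) rows
# at the beginning of each subject's data.
shard = shard.sort_by(
    [("subject_id", "ascending"), ("time", "ascending")],
    null_placement="at_start",
)
```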
The MEDS data schema is an Apache Arrow schema that specifies the required columns and data types for MEDS data files. It currently includes the following columns:

- `subject_id`: A unique identifier for each subject in the dataset, of type `int64`.
- `time`: The time at which the measurement corresponding to this row occurred, of type `timestamp[us]`.
- `code`: A code representing the measurement that occurred (e.g., a diagnosis or medication code), of type `string`.
- `numeric_value`: If the measurement has a numeric value associated with it (e.g., a lab result), this column contains that value, of type `float32`.
All columns except `subject_id` and `code` may contain nulls. If the `time` column is null, it indicates a static measurement, and such rows should be sorted to the beginning of their associated subject's data. If the `numeric_value` column is null, it indicates that the measurement does not have an associated numeric value.
Note that MEDS data files can contain additional columns beyond the required columns to store additional identifiers, other data modalities, etc. to support the specific needs of a given dataset.
The MEDS data PyArrow schema can be imported from the MEDS PyPI package to validate MEDS data files.
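For illustration, a lightweight check can also be written directly against the column and type requirements listed above (a sketch; the shard path is hypothetical, and the official schema object from the `meds` package should be preferred in practice):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The four required columns and their types, per the spec above.
REQUIRED_COLUMNS = {
    "subject_id": pa.int64(),
    "time": pa.timestamp("us"),
    "code": pa.string(),
    "numeric_value": pa.float32(),
}

schema = pq.read_schema("data/train/0.parquet")  # hypothetical shard path
for name, dtype in REQUIRED_COLUMNS.items():
    idx = schema.get_field_index(name)
    assert idx != -1, f"missing required column: {name}"
    assert schema.field(idx).type == dtype, f"{name} must be of type {dtype}"
```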
The path from the MEDS data folder (`$MEDS_ROOT/data/`) to the shard file, `/`-separated and without the `.parquet` extension, is the shard name.
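For example, the shard stored at `$MEDS_ROOT/data/train/0.parquet` has the shard name `train/0`. A small helper along these lines (a sketch using only the standard library):

```python
from pathlib import Path


def shard_name(shard_path: Path, data_dir: Path) -> str:
    """Strip the data directory prefix and the .parquet extension."""
    return shard_path.relative_to(data_dir).with_suffix("").as_posix()


# E.g., shard_name(Path("/meds/data/train/0.parquet"), Path("/meds/data"))
# returns "train/0".
```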
### MEDS Metadata File Specification
As shown above, there are three key MEDS metadata files: `codes.parquet`, `dataset.json`, and `subject_splits.parquet`.
#### `codes.parquet`
This file contains metadata about the `code` vocabulary featured in the data files. It must contain the following three columns:

- `code`: The code value, of type `string`.
- `description`: An optional free-text, human-readable description of the code, of type `string`.
- `parent_codes`: An optional list of links to parent codes in this dataset or external ontology nodes associated with this code, of type `list[string]`.
Much like the data schema, the `codes.parquet` file can contain additional columns beyond the required columns.
It is not guaranteed that all codes will have descriptions or parent codes, or even appear as a row in the metadata file at all! Further, the parent codes listed in this file are not guaranteed to be exhaustive or complete.
One common use of the `parent_codes` column is to link to external ontologies in the OMOP vocabulary space. Such linkages should be formatted as `"$VOCABULARY_NAME/$CONCEPT_NAME"`; for example, a `parent_code` of `"ICD9CM/487.0"` would be a reference to ICD9 code 487.0.
The formal schema for the `codes.parquet` file can be imported from the `meds` package and is documented here.
Some libraries and models will rely on the `codes.parquet` file for various tasks, such as producing embedding vectors of codes based on free-text descriptions, performing ontology expansion, or storing code value statistics for normalization.
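As a concrete (invented) example, the following sketch writes a one-row `codes.parquet` file whose `parent_codes` entry uses the OMOP-style linkage format described above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

codes_schema = pa.schema(
    [
        ("code", pa.string()),
        ("description", pa.string()),
        ("parent_codes", pa.list_(pa.string())),
    ]
)

codes = pa.Table.from_pylist(
    [
        {
            "code": "ICD9CM/487.0",
            "description": "Influenza with pneumonia",
            "parent_codes": ["ICD9CM/487"],  # link to the parent ontology node
        }
    ],
    schema=codes_schema,
)
pq.write_table(codes, "metadata/codes.parquet")
```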
#### `dataset.json`
This file contains metadata about the dataset itself, including the following:

- `dataset_name`: The name of the dataset, of type `string`.
- `dataset_version`: The version of the dataset, of type `string`. Ensuring the version numbers used are meaningful and unique is important for reproducibility, but is ultimately not enforced by the MEDS schema and is left to the dataset creator.
- `etl_name`: The name of the ETL process used to generate the dataset, of type `string`.
- `etl_version`: The version of the ETL process used to generate the dataset, of type `string`.
- `meds_version`: The version of the MEDS standard used to generate the dataset, of type `string`.
- `created_at`: The timestamp at which the dataset was created, of type `string` in ISO 8601 format (note that this is not an official timestamp type, but rather a string representation of a timestamp, as this is a JSON file).
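For example, a `dataset.json` file might be written as follows (a sketch; all field values are invented):

```python
import json
from datetime import datetime, timezone

dataset_metadata = {
    "dataset_name": "my_hospital_ehr",  # invented
    "dataset_version": "1.0.0",         # invented
    "etl_name": "my_meds_etl",          # invented
    "etl_version": "0.2.1",             # invented
    "meds_version": "0.3.3",            # invented; use your actual MEDS version
    "created_at": datetime.now(timezone.utc).isoformat(),  # ISO 8601 string
}

with open("metadata/dataset.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)
```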
The formal JSON schema for the `dataset.json` file can be imported from the `meds` package and is documented here.
#### `subject_splits.parquet`
This file maps subject IDs to pre-defined splits of the dataset, such as training, hyperparameter tuning, and held-out sets. In the MEDS splits file, each row contains a `subject_id` (`int64`) column and a `split` (`string`) column, where `split` is the name of the split in which that subject lives. For the three canonical AI/ML splits, MEDS uses the following split names:
- `train`: The training split. This data can be used for any purpose during model building, and in supervised training, labels over this split will be visible to the model.
- `tuning`: The hyperparameter tuning split, sometimes called the "dev" or "val" split in other contexts. This data can be used for tuning hyperparameters or for training the final model, but should not be used for final evaluation of model performance. Users who need more splits or a different split ratio may merge this split into the training split and re-shuffle themselves. Not all datasets will specify this split, as it is optional.
- `held_out`: The final evaluation held-out split, sometimes called the "test" split in other contexts. This data should not be used for training or tuning, and should only be used for final evaluation of model performance. No data about these subjects should be assumed to be available during data pre-processing, training, or tuning.
In addition to these splits, any additional custom splits desired by the user may be included. No additional columns are allowed in this file. The parquet schema for this file can be imported from the `meds` package and is documented here.
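A minimal sketch of writing this file with PyArrow, using the three canonical split names plus one custom split (the subject IDs and custom split name are invented):

```python
import pyarrow as pa
import pyarrow.parquet as pq

splits = pa.Table.from_pylist(
    [
        {"subject_id": 1, "split": "train"},
        {"subject_id": 2, "split": "tuning"},
        {"subject_id": 3, "split": "held_out"},
        {"subject_id": 4, "split": "special_cohort"},  # custom split
    ],
    schema=pa.schema([("subject_id", pa.int64()), ("split", pa.string())]),
)
pq.write_table(splits, "metadata/subject_splits.parquet")
```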
### Labeled cohorts over a MEDS dataset
In addition to the data and metadata files, MEDS also provides a schema for defining labeled cohorts over a MEDS dataset. Label files do not have a required on-disk organization, though it is recommended to store them in a `labels/$COHORT_NAME/**.parquet` format within the MEDS root directory. Labeled cohorts within MEDS consist of a set of sharded parquet files (the sharding need not be identical to the data shards). Each of these files is a table in which each row corresponds to one "sample" in the cohort (a sample is a single unit of prediction, and there may be multiple samples corresponding to a single subject in the MEDS dataset). Each row in the table must contain the following columns:
- `subject_id`: The subject ID of the subject for this sample, of type `int64`.
- `prediction_time`: The upper bound (inclusive) of the time window of data which can be observed when this prediction is made, of type `timestamp[us]`. E.g., your model may use data for all events that occur at or before this time to make a prediction for this sample.
- `boolean_value`: If this task is a binary classification task, this column contains the binary label for the sample, of type `bool`; otherwise this column is `null`.
- `integer_value`: If this task is an ordinal regression or a classification task with integral labels, this column contains the numeric label for the sample, of type `int64`; otherwise this column is `null`.
- `float_value`: If this task is a regression task, this column contains the numeric label for the sample, of type `float64`; otherwise this column is `null`.
- `categorical_value`: If this task is a classification task with categorical labels, this column contains the categorical label for the sample, of type `string`; otherwise this column is `null`.
The formal schema for the labeled cohort files can be imported from the `meds` package and is documented here.
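For instance, a binary classification cohort might be written as below (a sketch; the cohort name, times, and labels are invented). Only `boolean_value` is populated; the other label columns stay null:

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

label_schema = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("integer_value", pa.int64()),
        ("float_value", pa.float64()),
        ("categorical_value", pa.string()),
    ]
)

# Two samples; keys omitted from a row (e.g., integer_value) become null.
labels = pa.Table.from_pylist(
    [
        {
            "subject_id": 1,
            "prediction_time": datetime(2020, 1, 2, 8, 30),
            "boolean_value": False,
        },
        {
            "subject_id": 3,
            "prediction_time": datetime(2021, 6, 5, 12, 0),
            "boolean_value": True,
        },
    ],
    schema=label_schema,
)
pq.write_table(labels, "labels/in_hospital_mortality/0.parquet")
```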
## MEDS Dataset Conventions and Best Practices
### Recommended constants
The `meds` Python package defines a number of constants that are useful for building maximally compatible datasets. These include:

- Subdirectory and file names for the required files, such as `meds.data_subdirectory` and `meds.subject_splits_filepath`
- Constants for column names and dtypes, such as `meds.subject_id_column` and `meds.subject_id_dtype`
- Codes for birth and death events: `meds.birth_code = "MEDS_BIRTH"` and `meds.death_code = "MEDS_DEATH"`
- The three sentinel split names: `meds.train_split = "train"`, `meds.tuning_split = "tuning"`, and `meds.held_out_split = "held_out"`
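A sketch of how these constants might be used in practice (the root path is hypothetical, and this assumes the constants behave as described above, i.e., as path fragments and column-name strings):

```python
from pathlib import Path

import pyarrow.compute as pc
import pyarrow.parquet as pq

import meds

meds_root = Path("/path/to/meds/dataset")  # hypothetical

# Locate the data shards and the splits file via the published constants.
shard_paths = sorted((meds_root / meds.data_subdirectory).rglob("*.parquet"))
splits = pq.read_table(meds_root / meds.subject_splits_filepath)

# Collect the IDs of training subjects using the sentinel split name.
train_ids = splits.filter(pc.equal(splits["split"], meds.train_split))[
    meds.subject_id_column
]
```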
## Future Roadmap and How to Contribute
While MEDS already enables a lot of exciting research, we have a number of plans to make it even better. Key ongoing efforts include, but are not limited to, those listed below. If you'd like to contribute to MEDS, either through any of these efforts or other approaches, please feel free to reach out on our GitHub. We welcome any and all contributions!
### Multi-modal data support
MEDS is built for supporting longitudinal, structured EHR data, but it is clear that health AI covers more than just this kind of data. We are actively working to extend MEDS to support additional data modalities, including free-text, imaging, waveform, and other data types. There are already several tools that make use of some free-text data through a proposed `text_value` column, but official support is still in the works. Stay tuned or get involved on our GitHub to help with these efforts!
### Visualization and data exploration tools
In order to model data effectively, you first have to understand it, and few things help data understanding more than high-quality visualization and data exploration tools. The ecosystem for such tools in the EHR data landscape is very limited, and we are actively working to build out a set of tools for the MEDS format that can help researchers better understand their data.
### Standardized support for complex data pre-processing steps
While MEDS is designed to be simple and flexible, there are a number of complex data pre-processing steps that are common in health AI research but not yet supported out of the box by existing tools, such as vocabulary conversion, unit standardization, structured-data summarization to free-text, use of large language models (LLMs), or data QA testing. We are actively working to build support for these steps through either dedicated MEDS-Transforms stages or standalone tools on a case-by-case basis. Feel free to reach out if these efforts would help your research or you'd like to contribute!
### More extensive data validation and error checking
Health data is known to be highly noisy and to suffer from high rates of errors, be they physiologically impossible measurements, mis-labeled data, or low-information-content observations. We are actively working to build out standardized tools that can help automatically clean MEDS datasets to a limited degree, helping researchers make their data more meaningful and reliable in a transparent, reproducible way.
## Key Terminology and Concepts
- A _subject_ in a MEDS dataset is the primary entity being described by the sequences of care observations in the underlying dataset. In most cases, subjects will, naturally, be individuals, and the sequences of care observations will cover all known observations about those individuals in a source health dataset. However, in some cases, data may be organized so that we cannot describe all the data for an individual reliably in a dataset, but instead can only describe subsequences of an individual's data, such as in datasets that only link an individual's data observations together if they are within the same hospital admission, regardless of how many admissions that individual has in the dataset (such as the eICU dataset). In these cases, a subject in the MEDS dataset may refer to a hospital admission rather than an individual.
- A _measurement_ in a MEDS dataset is a particular observation made on a subject, either statically or dynamically at a point in time. Measurements are the fundamental unit of data in MEDS datasets, and the core data schema is a longitudinal sequence of measurements for each subject in the dataset. Measurements generally fall into one of three categories, which may require different handling:
  - _Static_ measurements are those that are recorded in the source dataset absent a specific timestamp and are assumed to be observed and applicable across all observations in the patient record. Note this is not the same as things that are conceptually assumed to be static; e.g., a patient's race may be recorded at each visit in a health record, and thus would be treated as a dynamic measurement in that dataset specifically. Likewise, some datasets may have static measurements that we would conceptually expect to plausibly change over time, such as a patient's gender or the institution of care.
  - _Time-derived_ measurements are measurements that vary in time, but are directly programmatically determinable from the timestamp of the observation and the subject's static or historic data. For example, a patient's age at the time of a measurement is a time-derived measurement, as it can be calculated from the patient's date of birth and the timestamp of the observation. Similarly, the time of day that a set of labs is taken is a time-derived measurement. Time-derived measurements are often not directly recorded in the raw data, but may be inferred or added during analysis.
  - _Dynamic_ measurements are those that are recorded in the source dataset with a specific timestamp indicating when the observation was made. These measurements are assumed to be observed at a single unique point in time and are not necessarily applicable across all observations in the patient record. As these are recorded observations, they are generally assumed to not be programmatically determinable in the manner of time-derived measurements.
- An _event_ in a MEDS dataset is a set of measurements that are observed at a single unique point in time. Measurements within an event are not necessarily independent of each other. Further, while _events_ can be meaningfully ordered in time, measurements within an event should not be assumed a priori to come with any meaningful ordering. In some cases, _event_ will be used interchangeably with _measurement_, but when the two terms are used distinctly, _event_ will refer to those measurements that share a unique timepoint, and _measurement_ will refer to the individual observations within an event (see the sketch after this list).
- Within a measurement, a _code_ is the categorical descriptor of what is being observed in that measurement. _Codes_ are not required to follow any particular coding vocabulary and should be assumed to be institution-specific unless otherwise specified.
- A _shard_ in a MEDS dataset is a single file containing a subset of the data for the dataset. Shards are used to split the data into manageable chunks for processing and storage. All data for a given subject must be stored in the same shard.
- A _sample_ in a labeled cohort is one unit of prediction. This may be at the subject level or, more commonly, at the subject-event level, where a prediction is made for a subset of key events in a subject's record. For example, we may wish to make a prediction of in-hospital mortality at the 24-hour mark after admission for each admission of a subject in a dataset. In this case, each admission that meets the inclusion/exclusion criteria would constitute a sample in the cohort.
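As referenced in the _event_ entry above, grouping a MEDS data table on `(subject_id, time)` recovers events from individual measurements. A minimal PyArrow sketch (the rows are invented):

```python
from datetime import datetime

import pyarrow as pa

measurements = pa.table(
    {
        "subject_id": pa.array([1, 1, 1], type=pa.int64()),
        "time": pa.array(
            [
                datetime(2020, 1, 1, 9),
                datetime(2020, 1, 1, 9),
                datetime(2020, 1, 2, 9),
            ],
            type=pa.timestamp("us"),
        ),
        "code": pa.array(["LAB//HR", "LAB//RR", "ICD9CM/487.0"]),
        "numeric_value": pa.array([72.0, 16.0, None], type=pa.float32()),
    }
)

# Each output row is one event: all measurements sharing (subject_id, time)
# are collected into code_list / numeric_value_list columns.
events = pa.TableGroupBy(measurements, ["subject_id", "time"]).aggregate(
    [("code", "list"), ("numeric_value", "list")]
)
```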