Converting to MEDS
In this tutorial, we'll extract a dataset in the MEDS format for downstream use. See below for the Jupyter notebook tutorial, rendered here, or check it out online on Google Colab or in our GitHub Repository.
Converting a Custom Dataset to MEDS
Part 1: Loading the raw data
In this tutorial, we'll use the publicly available MIMIC-IV Demo v2.2 dataset as our fictional "raw data source". Naturally, MIMIC has been used extensively in the public space, so its structure is actually very well understood and widely used; however, for the sake of this tutorial, let's act as though it isn't and we're seeing it for the first time.
The first thing we need to do is load the raw data (or, generally, a small random chunk of it, so we can iterate quickly -- though here we'll just use the entire demo dataset, given its size) and take a look at it. To do that, we'll download the raw files from PhysioNet and store them in a newly created `raw_data` directory (note this will take some time):
%%bash
mkdir -p raw_data
wget \
--quiet \
--no-host-directories \
--recursive \
--no-parent \
--cut-dirs=3 \
--directory-prefix \
raw_data \
https://physionet.org/files/mimic-iv-demo/2.2/
Now that the files have downloaded, what do they actually contain?
%%bash
apt-get -qq install tree > /dev/null
tree raw_data
raw_data
├── demo_subject_id.csv
├── hosp
│   ├── admissions.csv.gz
│   ├── d_hcpcs.csv.gz
│   ├── diagnoses_icd.csv.gz
│   ├── d_icd_diagnoses.csv.gz
│   ├── d_icd_procedures.csv.gz
│   ├── d_labitems.csv.gz
│   ├── drgcodes.csv.gz
│   ├── emar.csv.gz
│   ├── emar_detail.csv.gz
│   ├── hcpcsevents.csv.gz
│   ├── index.html
│   ├── labevents.csv.gz
│   ├── microbiologyevents.csv.gz
│   ├── omr.csv.gz
│   ├── patients.csv.gz
│   ├── pharmacy.csv.gz
│   ├── poe.csv.gz
│   ├── poe_detail.csv.gz
│   ├── prescriptions.csv.gz
│   ├── procedures_icd.csv.gz
│   ├── provider.csv.gz
│   ├── services.csv.gz
│   └── transfers.csv.gz
├── icu
│   ├── caregiver.csv.gz
│   ├── chartevents.csv.gz
│   ├── datetimeevents.csv.gz
│   ├── d_items.csv.gz
│   ├── icustays.csv.gz
│   ├── index.html
│   ├── ingredientevents.csv.gz
│   ├── inputevents.csv.gz
│   ├── outputevents.csv.gz
│   └── procedureevents.csv.gz
├── index.html
├── LICENSE.txt
├── README.txt
├── robots.txt
└── SHA256SUMS.txt

2 directories, 39 files
We can see there are a number of data files here, including:

- `hosp/*.csv.gz`
- `icu/*.csv.gz`

as well as a variety of other, likely non-data files. To understand any clinical dataset, you should generally rely on both the provided documentation and a local subject-matter expert who is familiar with both the clinical and operational context of the dataset; in practice, however, we rarely have the latter. For our purposes, let's take a look at the provided MIMIC-IV documentation to try to understand these various files.
Part 2: MEDS Extraction, conceptually
For now, we'll focus on only a few files, to keep things simple (note that each file below links to its specific data source documentation):

- `hosp/patients.csv.gz`
- `hosp/admissions.csv.gz`
- `hosp/procedures_icd.csv.gz`
- `icu/icustays.csv.gz`
- `icu/chartevents.csv.gz`
To start understanding how we should think about extracting a MEDS view of this data, let's inspect some of the data using pandas:
import pandas as pd
from pathlib import Path

DATA_ROOT = Path("raw_data")

dfs = {}
for fn in [
    "hosp/patients.csv.gz",
    "hosp/admissions.csv.gz",
    "hosp/procedures_icd.csv.gz",
    "icu/icustays.csv.gz",
    "icu/chartevents.csv.gz",
]:
    fp = DATA_ROOT / fn
    df = pd.read_csv(fp)
    print(f"{fn}:")
    display(df.head(2))
    dfs[fn.split(".")[0]] = df
hosp/patients.csv.gz:
| | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10014729 | F | 21 | 2125 | 2011 - 2013 | NaN |
| 1 | 10003400 | F | 72 | 2134 | 2011 - 2013 | 2137-09-02 |
hosp/admissions.csv.gz:
| | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admit_provider_id | admission_location | discharge_location | insurance | language | marital_status | race | edregtime | edouttime | hospital_expire_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10004235 | 24181354 | 2196-02-24 14:38:00 | 2196-03-04 14:02:00 | NaN | URGENT | P03YMR | TRANSFER FROM HOSPITAL | SKILLED NURSING FACILITY | Medicaid | ENGLISH | SINGLE | BLACK/CAPE VERDEAN | 2196-02-24 12:15:00 | 2196-02-24 17:07:00 | 0 |
| 1 | 10009628 | 25926192 | 2153-09-17 17:08:00 | 2153-09-25 13:20:00 | NaN | URGENT | P41R5N | TRANSFER FROM HOSPITAL | HOME HEALTH CARE | Medicaid | ? | MARRIED | HISPANIC/LATINO - PUERTO RICAN | NaN | NaN | 0 |
hosp/procedures_icd.csv.gz:
| | subject_id | hadm_id | seq_num | chartdate | icd_code | icd_version |
|---|---|---|---|---|---|---|
| 0 | 10011398 | 27505812 | 3 | 2146-12-15 | 3961 | 9 |
| 1 | 10011398 | 27505812 | 2 | 2146-12-15 | 3615 | 9 |
icu/icustays.csv.gz:
| | subject_id | hadm_id | stay_id | first_careunit | last_careunit | intime | outtime | los |
|---|---|---|---|---|---|---|---|---|
| 0 | 10018328 | 23786647 | 31269608 | Neuro Stepdown | Neuro Stepdown | 2154-04-24 23:03:44 | 2154-05-02 15:55:21 | 7.702512 |
| 1 | 10020187 | 24104168 | 37509585 | Neuro Surgical Intensive Care Unit (Neuro SICU) | Neuro Stepdown | 2169-01-15 04:56:00 | 2169-01-20 15:47:50 | 5.452662 |
icu/chartevents.csv.gz:
| | subject_id | hadm_id | stay_id | caregiver_id | charttime | storetime | itemid | value | valuenum | valueuom | warning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:45:00 | 225054 | On | NaN | NaN | 0.0 |
| 1 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:43:00 | 223769 | 100 | 100.0 | % | 0.0 |
We can see there is a lot of data contained in just these files! How can we hope to go about unifying it all into the simple MEDS format in a reasonable time?
To do so, we'll follow the assumptions of the MEDS-Extract library, which organizes the mapping of EHR data elements into the MEDS format via the following questions. For each row of each input source, we ask:
- What is happening in this row?
- To whom is it happening?
- When is it happening?
Once we can answer each of these three questions, we're ready to extract a full MEDS dataset over our inputs.
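As a concrete (if simplified) illustration, answering these three questions for a single raw row yields a single MEDS measurement. The row contents and the `LAB//...` code convention below are made up for illustration, not taken from the dataset:

```python
# A hypothetical raw row (column names mirror MIMIC-style tables; the values
# and the "LAB//..." code convention are illustrative only).
raw_row = {"subject_id": 10003400, "charttime": "2134-05-01 09:00:00", "itemid": 50912}

measurement = {
    "subject_id": raw_row["subject_id"],  # To whom is it happening?
    "time": raw_row["charttime"],         # When is it happening?
    "code": f"LAB//{raw_row['itemid']}",  # What is happening?
}
```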
Part 2.1: Mapping the `hosp/patients` table
To see these in action, let's work through our files in order, starting with `hosp/patients.csv.gz`:
dfs['hosp/patients'].head(2)
| | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10014729 | F | 21 | 2125 | 2011 - 2013 | NaN |
| 1 | 10003400 | F | 72 | 2134 | 2011 - 2013 | 2137-09-02 |
We can see that this dataframe clearly captures some static data about the patients in the population, any external date-of-death information present about the subject, and metadata about how this subject's data is transformed when included in MIMIC via the anchor year group. The latter aspect won't feature in the MEDS representation, so this means we only have the following pieces of information to represent about the patient:

- The information in the `gender` column, which for this dataset we will assign to a static measurement, as it is recorded as such within the raw dataset.
- The information in the `anchor_age` column, indicating the patient's date of birth (after some transformation).
- The information in the `dod` column, which contains a de-identified date of death for the patient, if applicable.
Ultimately, this tells us that for each row of the `hosp/patients` table, we'll want to construct 3 MEDS events:

- A measurement for subject `subject_id` with a `null` timestamp and a code indicating the value in the `gender` column.
- A measurement for subject `subject_id` with a timestamp given by the `dod` column (if it is not null), the `MEDS_DEATH` code (as this is a death event), and no values.
- A measurement for subject `subject_id` with a timestamp given by the difference between the `anchor_year` and the `anchor_age`, converted to a date-time, with the `MEDS_BIRTH` code and no values.
Let's record the information for these events in a simple, declarative format that we'll encode in YAML. For now, just think of this as an approximate format -- it isn't technically precise just yet. But, as we build up our specification, we'll see how we can turn it into a complete description of the extraction process. In particular, we'll have an outer level of the YAML correspond to the file we're talking about (in this case `hosp/patients`) and an inner block for each of the 3 events we've identified, describing which columns they'll use to construct their timestamps and codes (we don't have values for any of these events, but we'll add them in later).
Specification so far:
hosp/patients:
gender:
subject_id: subject_id
code: gender
time: null
death:
subject_id: subject_id
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
subject_id: subject_id
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
Note: the `patients` table has already revealed two common complications when converting clinical data (to any format, not just MEDS):

- The `dod` column only provides date-level resolution, not time-level resolution. This means that we don't know whether the patient died at 12:01 a.m. on that date or at 11:59 p.m., despite these two times being separated by nearly 24 full hours! This can cause issues with measurement ordering, the validity of temporal prediction tasks (e.g., predicting imminent mortality), etc. Ultimately, some choice needs to be made about how we want to represent this in MEDS. By design, MEDS does not allow you to specify a date-only timestamp, as such a timestamp does not permit a total ordering of measurements across different events. Here, as we know that death is a final event and is often (if not universally) the last event recorded for the patient, it makes sense to place it at the latest possible time within that date (i.e., add an implicit 11:59:59 p.m. onto the end of that timestamp column).
- As this dataset records an "age" (via `anchor_age`) rather than an explicit date of birth, we have a similar, but even greater, lack of temporal resolution for the date of birth. Here, we need to choose when within that year to assign the patient's date of birth; again, there is no "right" answer, but we need to make a choice. For this event, we'll choose January 1st of that year, to keep things simple.
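The two timestamp conventions above can be sketched as small helper functions. This is a minimal sketch, not part of MEDS-Extract, and `birth_time`/`death_time` are hypothetical names:

```python
from datetime import datetime


def birth_time(anchor_year: int, anchor_age: int) -> datetime:
    # The birth year is only known to year-level resolution
    # (anchor_year - anchor_age); Jan 1, 12:00:01 a.m. is this
    # tutorial's assumed placement within that year.
    return datetime(anchor_year - anchor_age, 1, 1, 0, 0, 1)


def death_time(dod: str) -> datetime:
    # Place a date-only death record at the latest possible time
    # within that date, since death is a "final" event.
    d = datetime.strptime(dod, "%Y-%m-%d")
    return d.replace(hour=23, minute=59, second=59)
```

For the second sample patient above (`anchor_year` 2134, `anchor_age` 72), this would place the birth at the start of 2062.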
Part 2.2: The `hosp/admissions` table
Next, let's inspect the `admissions` table:
dfs['hosp/admissions'].head(2)
| | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admit_provider_id | admission_location | discharge_location | insurance | language | marital_status | race | edregtime | edouttime | hospital_expire_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10004235 | 24181354 | 2196-02-24 14:38:00 | 2196-03-04 14:02:00 | NaN | URGENT | P03YMR | TRANSFER FROM HOSPITAL | SKILLED NURSING FACILITY | Medicaid | ENGLISH | SINGLE | BLACK/CAPE VERDEAN | 2196-02-24 12:15:00 | 2196-02-24 17:07:00 | 0 |
| 1 | 10009628 | 25926192 | 2153-09-17 17:08:00 | 2153-09-25 13:20:00 | NaN | URGENT | P41R5N | TRANSFER FROM HOSPITAL | HOME HEALTH CARE | Medicaid | ? | MARRIED | HISPANIC/LATINO - PUERTO RICAN | NaN | NaN | 0 |
Here, we have a lot of additional pieces of data -- records of admissions, discharges, possible competing records of deaths, admission types, locations for both admissions and discharges, patient information at time of admission (e.g., insurance, language, marital status, race), and emergency department (`ed*`) registration & discharge information. One new piece of complexity worth noting is that many of these events are "interval"-style events -- namely, events that present with both a start and an end time (e.g., an admission and discharge, an ED registration and an ED out, etc.). The "MEDS way" to handle such events is to simply include both a separate, appropriately timed start event and an end event -- that way, each interaction is represented separately in its appropriate place in the patient timeline. This comes through naturally when we focus on asking our three questions from above. With this perspective, we can quickly identify a list of measurements these columns represent:
- There is (or may be, if the timestamp is not null) a "hospital admission" of type `admission_type` to location `admission_location` at the time given by `admittime` for the subject given in `subject_id` (note we are not tracking the `admit_provider_id`, as MEDS does not currently formalize the notion of the treating provider).
- At the time of the hospital admission, some patient demographics are collected about `subject_id`, including their:
  - `insurance`
  - `language`
  - `marital_status`
  - `race`
- There may be a "hospital discharge" to the location `discharge_location` at time `dischtime` for `subject_id`.
- There may be a "death" event at time `deathtime` for `subject_id`.
- There may be an "ED Registration" event at time `edregtime` for `subject_id`.
- There may be an "ED Out" event at time `edouttime` for `subject_id`.
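To make the interval-splitting concrete, here is a minimal pandas sketch (not the MEDS-Extract implementation) that turns one admissions-style row into separate admission and discharge measurements:

```python
import pandas as pd

# One admissions-style row (values from the first sample row shown above).
row = {
    "subject_id": 10004235,
    "admittime": "2196-02-24 14:38:00",
    "dischtime": "2196-03-04 14:02:00",
    "admission_type": "URGENT",
    "admission_location": "TRANSFER FROM HOSPITAL",
    "discharge_location": "SKILLED NURSING FACILITY",
}

# Split the interval into two appropriately timed point measurements.
events = pd.DataFrame(
    [
        {
            "subject_id": row["subject_id"],
            "time": row["admittime"],
            "code": f"HOSPITAL_ADMISSION//{row['admission_type']}//{row['admission_location']}",
        },
        {
            "subject_id": row["subject_id"],
            "time": row["dischtime"],
            "code": f"HOSPITAL_DISCHARGE//{row['discharge_location']}",
        },
    ]
)
```

Each event now lands in its own place in the patient timeline, answering the "what / to whom / when" questions independently.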
Given these event descriptions, we can update our specification as follows:
hosp/patients:
gender:
subject_id: subject_id
code: gender
time: null
death:
subject_id: subject_id
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
subject_id: subject_id
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
subject_id: subject_id
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
subject_id: subject_id
code: "LANGUAGE//${language}"
time: admittime
marital_status:
subject_id: subject_id
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
subject_id: subject_id
code: "INSURANCE//${insurance}"
time: admittime
race:
subject_id: subject_id
code: "RACE//${race}"
time: admittime
discharge:
subject_id: subject_id
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death:
subject_id: subject_id
code: MEDS_DEATH
time: deathtime
ed_reg:
subject_id: subject_id
code: ED_REGISTRATION
time: edregtime
ed_out:
subject_id: subject_id
code: ED_OUT
time: edouttime
Heads up that we're being a bit imprecise with our syntax here, as this is just (for now) a mental aid -- namely, we're sometimes using plain strings to represent column names (e.g., `code: gender` and `subject_id: subject_id`) and sometimes using strings explicitly indicated with double-quotes to indicate compound codes using python's string interpolation syntax (e.g., `code: "HOSPITAL_DISCHARGE//${discharge_location}"`). We'll formalize this later, but for now, use context to disambiguate which we mean.
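For intuition, the double-quoted `${...}` strings behave like Python's `string.Template` substitution over the row's columns:

```python
from string import Template

# One admissions-style row (value from the sample output above).
row = {"discharge_location": "HOME HEALTH CARE"}

# "${discharge_location}" is a column reference inside the compound code string.
code = Template("HOSPITAL_DISCHARGE//${discharge_location}").substitute(row)
```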
Note that, much like before, we've seen some other areas where challenges arise and assumptions need to be made in mapping this table:

- Multifactorial measurements: Here, there are several measurements that come with different parts. We have admissions occurring with types and to locations, alongside demographic data being measured like language, marital status, race, and insurance type. How should we map all of these to a set of distinct measurements, and with what codes? In general, this question comes down to a trade-off between more simultaneous measurements vs. more complex codes -- i.e., you can either produce more measurements for each distinct aspect of the code at the same time-point, or you can add more pieces of information into a single code string, thereby increasing the size of your vocabulary. This data extraction step shows both strategies in action, for good reason:
  - For admission type and location, we include them in the core hospital admission code. This makes sense because every admission has to have a type and a location -- so they are natural "modifiers" to the admission measurement conceptually, as opposed to being distinct measurements. We'd also almost never have a situation where a model would need to know that an admission happened, but not know of what type or to where.
  - On the other hand, the patient demographic information has been separated into distinct measurements, all at the same point in time -- each aspect of the demographic data is thus recorded separately, so if we wish to later filter out rare or unknown recordings for one aspect of the demographic data in isolation from the others, this will be easy to do at a measurement level. Ultimately, however, it may also be reasonable (or even work better in some modeling tasks) to instead produce a joint code string across all demographic information (e.g., `LANGUAGE//${language}//INSURANCE//${insurance}//...`). If you want to try that out yourself, let us know if it works better!

  The existence of these multifactorial codes also highlights a convention we'll take in this guide, which is to compose "structured code strings" using the double-slash (`"//"`) as a separator, as this is unlikely to occur in a raw code string. This is not a formal requirement, so feel free to use a different approach in your data -- but what is important to note is that you likely do not want code strings to collide across different measurement sources. So, if you just used `code: race` and `code: language`, for example, and `UNK` was an option for both `race` and `language`, your models wouldn't be able to differentiate between those two options unless you use a unique prefix (like we do here).
- Competing Measurement Sources: There's another death time in this file, in addition to the `dod` recorded in the `patients` table! This is, unfortunately, a common enough problem in EHR data. Luckily, its solution is pretty straightforward -- simply decide which source takes precedence (ideally this will be a universal property, not a data-dependent one) and favor that over the other. Here, as the `deathtime` in this dataset has full datetime resolution, it will be preferred over the `dod` in the other file. We'll merely denote that with a comment in our specification for now.
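A minimal pandas sketch of this precedence rule (toy data; the real pipeline resolves this as part of extraction, not like this) might look like:

```python
import pandas as pd

# Toy versions of the two competing sources; column names follow MIMIC-IV.
patients = pd.DataFrame({"subject_id": [1, 2], "dod": ["2137-09-02", "2140-01-01"]})
admissions = pd.DataFrame({"subject_id": [1], "deathtime": ["2137-09-02 04:15:00"]})

merged = patients.merge(admissions, on="subject_id", how="left")

# Prefer the full-resolution `deathtime`; fall back to the date-only `dod`,
# placed at the latest possible time within that date.
merged["death_time"] = merged["deathtime"]
date_only = merged["death_time"].isna() & merged["dod"].notna()
merged.loc[date_only, "death_time"] = merged.loc[date_only, "dod"] + " 23:59:59"
```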
At this point, our specification is also getting pretty verbose. Let's pull out the shared aspects across all event blocks into the upper level -- for now, this is just the `subject_id` specification -- so we can get rid of some wasted space:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
While there are other ways we could further condense this (e.g., using a list of objects rather than a dictionary of objects within each data source), they would cost us more in clarity than they'd save in space, so we'll keep this format for now.
Part 2.3: The `hosp/procedures_icd` table
Let's move on to our next table, `procedures_icd`:
dfs['hosp/procedures_icd'].head(2)
| | subject_id | hadm_id | seq_num | chartdate | icd_code | icd_version |
|---|---|---|---|---|---|---|
| 0 | 10011398 | 27505812 | 3 | 2146-12-15 | 3961 | 9 |
| 1 | 10011398 | 27505812 | 2 | 2146-12-15 | 3615 | 9 |
Here, we have a bit of an easier time -- there's clearly only one measurement being recorded here: the ICD code itself, recorded for `subject_id` at the time given by `chartdate`. However, much like for the `dod` column in the `hosp/patients` table, this is only a date, not a full datetime, so we need to decide what timestamp within the date to assign. Here, the situation is not quite so simple; unlike death, which is clearly a "final" event, procedures can happen throughout the day, and we don't know where it would be best to assign the recordings of their ICD codes. Ultimately, as we are more likely to want to predict things that are based on these procedures or heavily indicated by them, it is better to put them later in the day rather than earlier to avoid temporal leakage -- though note that this can still cause leakage in tasks that attempt to predict these procedure codes themselves! Regardless, we'll assign them the time of 11:59:59 p.m. on the given day. We'll also want to ensure we capture both the `icd_code` and `icd_version` in these measurements, as both are necessary to fully define the assigned ICD code.
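A small sketch of both choices (the compound `PROCEDURE//ICD...` code plus the end-of-day timestamp); `procedure_event` is a hypothetical helper, not part of MEDS-Extract:

```python
from datetime import datetime, time


def procedure_event(icd_version: int, icd_code: str, chartdate: str) -> dict:
    # Compose the code from both the ICD version and the ICD code, and
    # place the date-only chartdate at 11:59:59 p.m. (this tutorial's
    # assumed convention for date-only procedure records).
    return {
        "code": f"PROCEDURE//ICD{icd_version}//{icd_code}",
        "time": datetime.combine(
            datetime.strptime(chartdate, "%Y-%m-%d").date(), time(23, 59, 59)
        ),
    }
```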
Before we show our new specification, note that there is one additional complexity here we should take into account: `seq_num`. This is actually an important piece of information, as it indicates the relative prioritization of the codes assigned to the patient (a lower `seq_num` indicating a higher-priority code). This is a common paradigm for diagnostic codes in U.S. healthcare datasets, so we do want to include it; however, it doesn't feel quite right to include it in the code, as it is not a real part of the measurement about the patient. Instead, for this example, we'll use the fact that MEDS datasets are permitted to include any other desired columns beyond the required ones, so we can just track it directly as an extra column:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
hosp/procedures_icd:
procedure_icd:
code: "PROCEDURE//ICD${icd_version}//${icd_code}"
time: chartdate @ 11:59 p.m.
seq_num: seq_num
Part 2.4: The `icu/icustays` table
Now, let's look at `icustays`:
dfs['icu/icustays'].head(2)
| | subject_id | hadm_id | stay_id | first_careunit | last_careunit | intime | outtime | los |
|---|---|---|---|---|---|---|---|---|
| 0 | 10018328 | 23786647 | 31269608 | Neuro Stepdown | Neuro Stepdown | 2154-04-24 23:03:44 | 2154-05-02 15:55:21 | 7.702512 |
| 1 | 10020187 | 24104168 | 37509585 | Neuro Surgical Intensive Care Unit (Neuro SICU) | Neuro Stepdown | 2169-01-15 04:56:00 | 2169-01-20 15:47:50 | 5.452662 |
This table is much like the `hosp/admissions` table -- we have some "interval"-style events being recorded here (namely, ICU stays), which we'll separate into endpoints, resulting in:

- An ICU admission event for `subject_id` at `intime` to the `first_careunit`.
- An ICU discharge event for `subject_id` at `outtime` from the `last_careunit`.
Note two things:

- The `los` here is actually a derived property -- it isn't something we want to record in the MEDS data directly (especially not in the ICU admission event, because that could risk future leakage).
- We're actually being a bit inconsistent here -- really, we should likely try to find another table in the MIMIC source that captures the sequence of care units the patient is seen within, so that we can record transfers to a care unit universally, rather than having an ICU admission to a care unit and an ICU discharge from a care unit. For now, this is outside the scope of our tutorial (but if you're interested, the right table to use for this is the `hosp/transfers` table, which is actually the ground-truth source for the `icu/icustays` table).
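We can sanity-check that `los` is derived by recomputing it (in fractional days) from the first `icustays` row shown above:

```python
from datetime import datetime

# Recompute length-of-stay from intime/outtime for the first icustays row.
intime = datetime.strptime("2154-04-24 23:03:44", "%Y-%m-%d %H:%M:%S")
outtime = datetime.strptime("2154-05-02 15:55:21", "%Y-%m-%d %H:%M:%S")
los_days = (outtime - intime).total_seconds() / 86400  # ≈ 7.702512
```

This matches the stored `los` value of 7.702512, confirming it carries no information beyond the two timestamps.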
When we add this to our spec, we obtain:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
hosp/procedures_icd:
procedure_icd:
code: "PROCEDURE//ICD${icd_version}//${icd_code}"
time: chartdate @ 11:59 p.m.
seq_num: seq_num
icu/icustays:
admission:
code: "ICU_ADMISSION//${first_careunit}"
time: intime
discharge:
code: "ICU_DISCHARGE//${last_careunit}"
time: outtime
Part 2.5: The `icu/chartevents` table
Finally, let's look at `chartevents`:
dfs['icu/chartevents'].head(2)
| | subject_id | hadm_id | stay_id | caregiver_id | charttime | storetime | itemid | value | valuenum | valueuom | warning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:45:00 | 225054 | On | NaN | NaN | 0.0 |
| 1 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:43:00 | 223769 | 100 | 100.0 | % | 0.0 |
This table clearly has rows that capture a variety of more nuanced measurements. Some have numerical results, units of measure, etc. We also have another complexity here, in that there is some uncertainty in the timestamp, with both `charttime` and `storetime` being included. Ultimately, though, there is still just one kind of measurement being recorded here: namely, a "chart event" (often a lab test), identified via the "Item ID" `itemid`, recorded at either `charttime` or `storetime`, with a value given by the `value`, `valuenum`, and `valueuom` columns. Let's see how to add that to our specification (for brevity, we'll just show the new bit first, before we put it all together):
icu/chartevents:
chartevent:
time: charttime
code: "CHARTEVENT//${itemid}//${valueuom}"
numeric_value: valuenum
Note here that we've made a few assumptions:

- We've defaulted to favoring `charttime` here -- this is because, according to the data documentation, `charttime` is the closest proxy to when the data was actually recorded. However, this could benefit from further investigation and empirical validation!
- We are omitting the `warning` column -- this is because we don't know when a warning would actually have been noted by the care team; it does not represent an automated process as part of the chart event measurement, but rather a manual observation by the care team after the data has been recorded.
In addition, this format has one undesired property: if `valueuom` is empty or `NaN`, the code string will have a trailing `//` (because we've included `valueuom` in the template, even though it will only be populated for measurements with a numeric value). We can try to remedy this later, though it is not a high-priority issue, as it only results in a superficial change.
All told, this gives us a final "specification" for the data extraction (at a conceptual level) as follows:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
hosp/procedures_icd:
procedure_icd:
code: "PROCEDURE//ICD${icd_version}//${icd_code}"
time: chartdate @ 11:59 p.m.
seq_num: seq_num
icu/icustays:
admission:
code: "ICU_ADMISSION//${first_careunit}"
time: intime
discharge:
code: "ICU_DISCHARGE//${last_careunit}"
time: outtime
icu/chartevents:
chartevent:
time: charttime
code: "CHARTEVENT//${itemid}//${valueuom}"
numeric_value: valuenum
Then, our question becomes: how can we use this model to actually extract the data?
Part 3: Using MEDS-Extract to Automate Extraction
So far, all we've built up is a conceptual map on how to think about extracting data to MEDS. Hopefully, in doing so, you've come to see how the simplicity of MEDS gives rise to likewise simple extraction pipelines -- rather than requiring hours or days to understand the various input files, you can often map the rows of input tables into a conceptual specification for MEDS extraction in minutes, even when presented with more complex cases that require some assumptions to be made.
However, as it turns out, not only is this conceptual specification useful theoretically, it also is very close to a precise technical specification that the MEDS-Extract package can use to extract your data in the MEDS format for you.
The MEDS-Extract library leverages MEDS-Transforms to run a full ETL pipeline, with the secret sauce in the middle being the "MEDS-Extract Specification Syntax YAML" (MESSY) file -- which tells you how to map your messy input data into the MEDS format in alignment with this conceptual model.
This file is (as the name implies) in the YAML format and looks much like our specification above. It consists of blocks mapping input source table name to named measurements within the rows of that table, each measurement block having some sentinel properties which map to a prescribed extraction syntax that controls how the input data is parsed. It does, unfortunately, have some limitations that will make certain operations in our conceptual specification a bit harder. Let's dig in!
The MESSY File Format
1. The Outer Structure
First, much like our conceptual specification above, the MESSY file will have a block per input source, within which we'll go through and identify all the measurements we want to extract from that source. In this case, that means we'll have a block for each of the tables we've listed above:
hosp/patients:
...
hosp/admissions:
...
hosp/procedures_icd:
...
icu/icustays:
...
icu/chartevents:
...
Also, much like our specification above, we can specify shared properties at the top level -- so we can add back in our `subject_id` indicator as well, though in the MESSY format, we need to name it `subject_id_col` (for no particularly good reason):
subject_id_col: subject_id
hosp/patients:
...
hosp/admissions:
...
hosp/procedures_icd:
...
icu/icustays:
...
icu/chartevents:
...
2. Measurement blocks
Within each table source, we also need to specify all of the measurements we want to extract. Again, our format will look quite similar, though with a few differences. Our conceptual specification had measurements that looked like each of the following prototypical examples:
gender:
  code: gender
  time: null
death:
  code: MEDS_DEATH
  time: dod @ 11:59 p.m.
birth:
  code: MEDS_BIRTH
  time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
admission:
  code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
  time: admittime
death:
  code: MEDS_DEATH
  time: deathtime
procedure_icd:
  code: "PROCEDURE//ICD${icd_version}//${icd_code}"
  time: chartdate @ 11:59 p.m.
  seq_num: seq_num
chartevent:
  time: charttime
  code: "CHARTEVENT//${itemid}//${valueuom}"
  numeric_value: valuenum
Let's walk through each to see which features we'll need to change:
Specifying Time Format Strings
A key missing piece here is that we've indicated some columns are "time" columns, but we haven't said how those should be parsed from the (string) input types in our CSV files! This wouldn't be an issue if our inputs were Parquet files or something else with typed timestamp columns, but for CSVs we need to address it. Luckily, this is simple: we can just add a time_format key to each block with a time format string used to parse the column. Refer to the chrono crate documentation for how these format strings should be specified. In this case, we want the following format string for most use cases: time_format: "%Y-%m-%d %H:%M:%S".
What if a column isn't so nicely formatted, and multiple format strings appear in the data? You can also pass a list of format strings to the time_format key; they will be tried in the specified order until one works for a given input; e.g., time_format: ["%Y-%m-%d %H:%M:%S", "%Y"].
We'll omit this added detail from our measurement configs for now in the interest of brevity, but see it added in at the end.
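To build intuition for how a list-valued time_format behaves, here's a minimal Python sketch -- not MEDS-Extract's actual implementation -- that tries each format in order. (For simple formats like these, Python's strptime accepts the same directives as chrono.)

```python
from datetime import datetime

def parse_with_fallback(raw: str, time_formats: list[str]) -> datetime:
    # Try each format string in order; the first one that parses wins.
    for fmt in time_formats:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"No format in {time_formats} matched {raw!r}")

formats = ["%Y-%m-%d %H:%M:%S", "%Y"]
print(parse_with_fallback("2125-03-30 14:02:11", formats))  # matches the first format
print(parse_with_fallback("2104", formats))  # falls back to the "%Y" format
```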
Disambiguating column references from string literals
We can see in the gender and death measurements that, in our conceptual specification, we sometimes used strings to refer to column names and sometimes as string literals. For the code and time fields only, the MESSY file disambiguates column references with col(...) and treats everything else as a string literal. Arbitrary string literals are only allowed for the code field; the only literal the time field accepts is null. So, we'll need to make some changes to these blocks to account for this (note that as we're making changes iteratively, they won't be fully valid until we're done). In some cases, it isn't yet clear how to make the change we're describing, so we'll mark those cases with ??? indicators.
gender: # We're all done with this format -- this block is complete!
  code: col(gender)
  time: null
death:
  code: MEDS_DEATH
  time: ??? # dod @ 11:59 p.m.
birth:
  code: MEDS_BIRTH
  time: ??? # anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
admission:
  code: ??? # "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
  time: col(admittime)
death: # We're all done with this format -- this block is complete!
  code: MEDS_DEATH
  time: col(deathtime)
procedure_icd:
  code: ??? # "PROCEDURE//ICD${icd_version}//${icd_code}"
  time: ??? # chartdate @ 11:59 p.m.
  seq_num: seq_num # Note that this doesn't need a col(...) specifier
chartevent:
  code: ??? # "CHARTEVENT//${itemid}//${valueuom}"
  time: col(charttime)
  numeric_value: valuenum # Note that this doesn't need a col(...) specifier
Note that resolving this piece has actually "completed" some blocks: gender and death (the admissions-table version) are now feature complete, and can be omitted from the later sections of this tutorial.
String interpolation in the code column
Another feature we see a lot is string interpolation in the code column; e.g., CHARTEVENT//${itemid}//${valueuom}. How can we handle that?
Unfortunately, as of now, MEDS-Extract does not allow generic string interpolation; but it does allow you to specify a list of parts that will be concatenated together with the // separator. This is done by specifying a list of the literals and columns (with the col(...) syntax to denote the latter) directly in the YAML file. Let's see it in action!
death:
  code: MEDS_DEATH
  time: ??? # dod @ 11:59 p.m.
birth:
  code: MEDS_BIRTH
  time: ??? # anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
admission: # We're all done with this format -- this block is complete!
  code:
    - HOSPITAL_ADMISSION
    - col(admission_type)
    - col(admission_location)
  time: col(admittime)
procedure_icd:
  code:
    - PROCEDURE
    - ??? # ICD${icd_version}
    - col(icd_code)
  time: ??? # chartdate @ 11:59 p.m.
  seq_num: seq_num
chartevent: # We're all done with this format -- this block is complete!
  code:
    - CHARTEVENT
    - col(itemid)
    - col(valueuom)
  time: col(charttime)
  numeric_value: valuenum
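Conceptually, a list-valued code resolves each part per row -- col(name) reads a column, anything else is a literal -- and joins the results with //. Here's a rough pandas sketch of that behavior (illustrative only, not the library's internals):

```python
import pandas as pd

def build_code(df: pd.DataFrame, parts: list[str]) -> pd.Series:
    # Resolve each part: col(name) reads a column, anything else is a
    # string literal. Then join the per-row pieces with "//".
    resolved = []
    for part in parts:
        if part.startswith("col(") and part.endswith(")"):
            resolved.append(df[part[4:-1]].astype(str))
        else:
            resolved.append(pd.Series([part] * len(df), index=df.index))
    out = resolved[0]
    for piece in resolved[1:]:
        out = out + "//" + piece
    return out

# itemid/valueuom values here are made up for illustration.
df = pd.DataFrame({"itemid": [220045], "valueuom": ["bpm"]})
print(build_code(df, ["CHARTEVENT", "col(itemid)", "col(valueuom)"]).iloc[0])
# CHARTEVENT//220045//bpm
```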
With this change, we've knocked out two blocks, but we see there is a tricky issue with a third -- the list syntax can't express the code we want for procedure_icd, because ICD${icd_version} fuses a literal and a column into a single part. This is unfortunate, but for now it is unavoidable, so we'll have to change what we want the code string to be, separating the ICD part from the version with another //:
procedure_icd:
  code:
    - PROCEDURE
    - ICD
    - col(icd_version)
    - col(icd_code)
  time: ??? # chartdate @ 11:59 p.m.
  seq_num: seq_num
Timestamp Resolution and basic arithmetic
Now we come to the tricky ones: each remaining source of uncertainty involves one of two problems -- either (a) we need to resolve a date to a specific time of day (e.g., chartdate @ 11:59 p.m.) or (b) we need to perform some simple arithmetic (e.g., anchor_year - anchor_age).
MEDS-Extract does not currently support either of these operations. So, they need to happen in a "pre-MEDS" step, where we have some custom code go through and perform these operations for us on the raw dataframes, before we call MEDS-Extract. There are some other operations that might be required that MEDS-Extract can't handle currently that you should know about (even though we don't need them here), such as:
- Joining multiple tables together to ensure the subject_id is present in all cases.
- Adjusting "offset" time columns into true datetime columns (this is really just a case of arithmetic and datetime parsing, but it warrants an explicit mention).
- Any data filtering that needs to happen before MEDS extraction occurs (though often data cleaning can happen after the MEDS conversion process as well).
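As an example of the "offset" adjustment, converting an offset column is just datetime parsing plus arithmetic. A hypothetical pandas sketch (the offset column here is made up; the MIMIC demo doesn't need this step):

```python
import pandas as pd

# Hypothetical raw table: event times stored as minute offsets from the
# admission time rather than as absolute datetimes.
df = pd.DataFrame({
    "admittime": ["2125-03-20 08:00:00"],
    "event_offset_minutes": [90],
})

# Parse the anchor datetime, then add the offset to get a true datetime.
df["event_time"] = (
    pd.to_datetime(df["admittime"], format="%Y-%m-%d %H:%M:%S")
    + pd.to_timedelta(df["event_offset_minutes"], unit="m")
)
print(df["event_time"].iloc[0])  # 2125-03-20 09:30:00
```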
Let's write a simple pre-MEDS step we can run here.
Pre-MEDS
Our Pre-MEDS step will have a few simple goals:
- Subtract the anchor age from the anchor year to get a "year of birth".
- Resolve the timestamps in hosp/procedures_icd to end-of-day.
- Remove the duplication between the dod column in hosp/patients and the deathtime column in hosp/admissions, favoring the latter where both are specified.
We'll write this using pandas for now, but you can use whatever tooling you prefer for your data.
import pandas as pd
from datetime import timedelta

def get_year_of_birth(df: pd.DataFrame) -> pd.DataFrame:
    # Subtract the anchor age from the anchor year to get a year of birth.
    df["year_of_birth"] = (
        df["anchor_year"].astype(int) - df["anchor_age"].astype(int)
    ).astype(str)
    return df

def put_procedure_at_EOD(df: pd.DataFrame) -> pd.DataFrame:
    # Resolve the date-only chartdate column to 11:59:59 p.m. on that day.
    df["chartdate"] = (
        pd.to_datetime(df["chartdate"], format="%Y-%m-%d")
        + timedelta(hours=23, minutes=59, seconds=59)
    )
    return df

def remove_dod_duplication_and_put_at_EOD(
    patients_df: pd.DataFrame,
    admissions_df: pd.DataFrame,
) -> pd.DataFrame:
    # Null out dod for subjects who have a (more precise) deathtime in the
    # admissions table, then push the remaining death dates to end-of-day.
    subjects_with_deathtime = (
        admissions_df[~admissions_df["deathtime"].isna()]["subject_id"]
    )
    idx = patients_df["subject_id"].isin(subjects_with_deathtime)
    patients_df.loc[idx, "dod"] = None
    patients_df["dod"] = (
        pd.to_datetime(patients_df["dod"], format="%Y-%m-%d")
        + timedelta(hours=23, minutes=59, seconds=59)
    )
    return patients_df
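To convince ourselves the deduplication rule behaves as intended, here's a quick standalone check on toy data (the subject IDs and dates here are made up): subject 1 has a deathtime in admissions, so their dod is nulled out; subject 2 does not, so their dod is kept and pushed to end-of-day.

```python
import pandas as pd

patients = pd.DataFrame({"subject_id": [1, 2], "dod": ["2143-03-30", "2146-02-09"]})
admissions = pd.DataFrame({"subject_id": [1], "deathtime": ["2143-03-30 18:22:00"]})

# Same logic as remove_dod_duplication_and_put_at_EOD, inlined for the check.
has_deathtime = admissions.loc[~admissions["deathtime"].isna(), "subject_id"]
patients.loc[patients["subject_id"].isin(has_deathtime), "dod"] = None
patients["dod"] = pd.to_datetime(patients["dod"], format="%Y-%m-%d") + pd.Timedelta(
    hours=23, minutes=59, seconds=59
)
print(patients)  # subject 1's dod is NaT; subject 2's is 2146-02-09 23:59:59
```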
We'll store the output of our pre-MEDS stage in an "intermediate directory" called intermediate_dir -- that way we can always re-use our raw data.
from pathlib import Path

INTERMEDIATE_DIR = Path("intermediate_dir")

for name, df in dfs.items():
    if name == "hosp/patients":
        df = get_year_of_birth(df)
        df = remove_dod_duplication_and_put_at_EOD(df, dfs["hosp/admissions"])
    elif name == "hosp/procedures_icd":
        df = put_procedure_at_EOD(df)
    out_fp = INTERMEDIATE_DIR / f"{name}.parquet"
    out_fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_fp)
%%bash
tree intermediate_dir
intermediate_dir ├── hosp │ ├── admissions.parquet │ ├── patients.parquet │ └── procedures_icd.parquet └── icu ├── chartevents.parquet └── icustays.parquet 2 directories, 5 files
pd.read_parquet(INTERMEDIATE_DIR / "hosp/patients.parquet").head(5)
|   | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod | year_of_birth |
|---|---|---|---|---|---|---|---|
| 0 | 10014729 | F | 21 | 2125 | 2011 - 2013 | NaT | 2104 |
| 1 | 10003400 | F | 72 | 2134 | 2011 - 2013 | NaT | 2062 |
| 2 | 10002428 | F | 80 | 2155 | 2011 - 2013 | NaT | 2075 |
| 3 | 10032725 | F | 38 | 2143 | 2011 - 2013 | 2143-03-30 23:59:59 | 2105 |
| 4 | 10027445 | F | 48 | 2142 | 2011 - 2013 | 2146-02-09 23:59:59 | 2094 |
pd.read_parquet(INTERMEDIATE_DIR / "hosp/procedures_icd.parquet").head(5)
|   | subject_id | hadm_id | seq_num | chartdate | icd_code | icd_version |
|---|---|---|---|---|---|---|
| 0 | 10011398 | 27505812 | 3 | 2146-12-15 23:59:59 | 3961 | 9 |
| 1 | 10011398 | 27505812 | 2 | 2146-12-15 23:59:59 | 3615 | 9 |
| 2 | 10011398 | 27505812 | 1 | 2146-12-15 23:59:59 | 3614 | 9 |
| 3 | 10014729 | 23300884 | 4 | 2125-03-23 23:59:59 | 3897 | 9 |
| 4 | 10014729 | 23300884 | 1 | 2125-03-20 23:59:59 | 3403 | 9 |
The final MESSY File
Now that we've resolved our remaining issues, let's put together our final, complete MESSY file!
YAML_contents = """
subject_id_col: subject_id
hosp/patients:
  gender:
    code: col(gender)
    time: null
  death:
    code: MEDS_DEATH
    time: col(dod)
  birth:
    code: MEDS_BIRTH
    time: col(year_of_birth)
    time_format: "%Y"
hosp/admissions:
  ed_registration:
    code: ED_REGISTRATION
    time: col(edregtime)
    time_format: "%Y-%m-%d %H:%M:%S"
  ed_out:
    code: ED_OUT
    time: col(edouttime)
    time_format: "%Y-%m-%d %H:%M:%S"
  admission:
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
      - col(admission_location)
    time: col(admittime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
  discharge:
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(dischtime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
hosp/procedures_icd:
  procedure_icd:
    code:
      - PROCEDURE
      - ICD
      - col(icd_version)
      - col(icd_code)
    time: col(chartdate)
    seq_num: seq_num
icu/icustays:
  admission:
    code:
      - ICU_ADMISSION
      - col(first_careunit)
    time: col(intime)
    time_format: "%Y-%m-%d %H:%M:%S"
  discharge:
    code:
      - ICU_DISCHARGE
      - col(last_careunit)
    time: col(outtime)
    time_format: "%Y-%m-%d %H:%M:%S"
icu/chartevents:
  chartevent:
    code:
      - CHARTEVENT
      - col(itemid)
      - col(valueuom)
    time: col(charttime)
    time_format: "%Y-%m-%d %H:%M:%S"
    numeric_value: valuenum
"""
YAML_fp = Path("MESSY.yaml")
YAML_fp.write_text(YAML_contents)
print(YAML_fp.read_text())
subject_id_col: subject_id
hosp/patients:
  gender:
    code: col(gender)
    time: null
  death:
    code: MEDS_DEATH
    time: col(dod)
  birth:
    code: MEDS_BIRTH
    time: col(year_of_birth)
    time_format: "%Y"
hosp/admissions:
  ed_registration:
    code: ED_REGISTRATION
    time: col(edregtime)
    time_format: "%Y-%m-%d %H:%M:%S"
  ed_out:
    code: ED_OUT
    time: col(edouttime)
    time_format: "%Y-%m-%d %H:%M:%S"
  admission:
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
      - col(admission_location)
    time: col(admittime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
  discharge:
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(dischtime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
hosp/procedures_icd:
  procedure_icd:
    code:
      - PROCEDURE
      - ICD
      - col(icd_version)
      - col(icd_code)
    time: col(chartdate)
    seq_num: seq_num
icu/icustays:
  admission:
    code:
      - ICU_ADMISSION
      - col(first_careunit)
    time: col(intime)
    time_format: "%Y-%m-%d %H:%M:%S"
  discharge:
    code:
      - ICU_DISCHARGE
      - col(last_careunit)
    time: col(outtime)
    time_format: "%Y-%m-%d %H:%M:%S"
icu/chartevents:
  chartevent:
    code:
      - CHARTEVENT
      - col(itemid)
      - col(valueuom)
    time: col(charttime)
    time_format: "%Y-%m-%d %H:%M:%S"
    numeric_value: valuenum
Using the MESSY File -- how do you run MEDS-Extract?
With the MESSY file specified, running MEDS-Extract is easy. There are two steps. First, install the package:
!pip --quiet install MEDS-extract
Next, use the typical MEDS-Transforms syntax for running a dependent pipeline, and pass in the override variables you want. In our case, the command will look like the below:
%%bash
MEDS_transform-pipeline \
pkg://MEDS_extract.configs._extract.yaml \
--overrides \
input_dir=intermediate_dir \
output_dir=output_dir \
event_conversion_config_fp=MESSY.yaml \
dataset.name=KDD_Tutorial \
dataset.version=1.0
If we've done things right, the above cell should complete with no errors -- if not, we'll need to debug. Thankfully, MEDS-Extract writes out some nice logs to help with this, which we can find in the output directory, under output_dir/.logs/pipeline.log:
!cat output_dir/.logs/pipeline.log
INFO:root:Running MEDS-Transforms Pipeline Runner with the following arguments: INFO:root: pipeline_config_fp: pkg://MEDS_extract.configs._extract.yaml INFO:root: stage_runner_fp: None INFO:root: do_profile: False INFO:root: overrides: ['input_dir=intermediate_dir', 'output_dir=output_dir', 'event_conversion_config_fp=MESSY.yaml', 'dataset.name=KDD_Tutorial', 'dataset.version=1.0'] INFO:MEDS_transforms.runner:No parallelization configuration provided. INFO:MEDS_transforms.runner:Running stage: shard_events INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml shard_events stage=shard_events input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:41,164][MEDS_extract.shard_events.shard_events][INFO] - Running with config: dataset: root_dir: ??? name: KDD_Tutorial version: '1.0' code_modifiers: [] input_dir: intermediate_dir output_dir: output_dir _default_description: 'This is a MEDS pipeline ETL. Please set a more detailed description at the top of your specific pipeline configuration file.' log_dir: ${stage_cfg.output_dir}/.logs do_overwrite: false seed: 1 worker: 0 polling_time: 300 stages: - shard_events - split_and_shard_subjects - convert_to_subject_sharded - convert_to_MEDS_events - merge_to_MEDS_cohort - extract_code_metadata - finalize_MEDS_metadata - finalize_MEDS_data stage: shard_events etl_metadata: pipeline_name: MEDS-Transforms Pipeline dataset_name: ${dataset.name} dataset_version: ${dataset.version} package_name: ${get_package_name:} package_version: ${get_package_version:} code_modifiers: null etl_metadata.pipeline_name: extract description: "This pipeline extracts raw MEDS events in longitudinal, sparse form\ \ from an input dataset meeting select\ncriteria and converts them to the flattened,\ \ MEDS format. 
It can be run in its entirety, with controllable\nlevels of parallelism,\ \ or in stages. Arguments:\n - `event_conversion_config_fp`: The path to the event\ \ conversion configuration file. This file defines\n the events to extract from\ \ the various rows of the various input files encountered in the global input\n\ \ directory.\n - `input_dir`: The path to the directory containing the raw input\ \ files.\n - `output_dir`: The path to the directory where the output cohort will\ \ be written. It will be written in\n various subfolders of this dir depending\ \ on the stage, as intermediate stages cache their output during\n computation\ \ for efficiency of re-running and distributing." event_conversion_config_fp: MESSY.yaml shards_map_fp: ${output_dir}/metadata/.shards.json cloud_io_storage_options: {} stage_cfg: row_chunksize: 200000000 infer_schema_length: 10000 data_input_dir: ${input_dir}/data metadata_input_dir: ${input_dir}/metadata reducer_output_dir: null train_only: false output_dir: ${output_dir}/shard_events Stage: shard_events Stage config: row_chunksize: 200000000 infer_schema_length: 10000 data_input_dir: ${input_dir}/data metadata_input_dir: ${input_dir}/metadata reducer_output_dir: null train_only: false output_dir: ${output_dir}/shard_events [2025-07-17 17:49:41,167][MEDS_extract.shard_events.shard_events][INFO] - Reading event conversion config from MESSY.yaml to identify needed columns. [2025-07-17 17:49:41,195][MEDS_extract.shard_events.shard_events][INFO] - Starting event sub-sharding. 
Sub-sharding 5 files: * /content/intermediate_dir/hosp/admissions.parquet * /content/intermediate_dir/hosp/procedures_icd.parquet * /content/intermediate_dir/icu/chartevents.parquet * /content/intermediate_dir/icu/icustays.parquet * /content/intermediate_dir/hosp/patients.parquet [2025-07-17 17:49:41,196][MEDS_extract.shard_events.shard_events][INFO] - Will read raw data from /content/intermediate_dir/$IN_FILE.parquet and write sub-sharded data to output_dir/shard_events/$IN_FILE/$ROW_START-$ROW_END.parquet [2025-07-17 17:49:41,198][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/hosp/admissions.parquet to output_dir/shard_events/hosp/admissions. [2025-07-17 17:49:41,206][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/hosp/admissions.parquet to determine row count. [2025-07-17 17:49:41,207][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:41,234][MEDS_extract.shard_events.shard_events][INFO] - Read 275 rows from /content/intermediate_dir/hosp/admissions.parquet. [2025-07-17 17:49:41,234][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/hosp/admissions.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:41,235][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/hosp/admissions.parquet row-chunk [0-275) to output_dir/shard_events/hosp/admissions/[0-275).parquet. [2025-07-17 17:49:41,236][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/hosp/admissions.parquet [2025-07-17 17:49:41,236][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:41,236][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:41,237][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/hosp/admissions/[0-275).parquet [2025-07-17 17:49:41,261][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.024981 [2025-07-17 17:49:41,262][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/hosp/procedures_icd.parquet to output_dir/shard_events/hosp/procedures_icd. [2025-07-17 17:49:41,262][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/hosp/procedures_icd.parquet to determine row count. [2025-07-17 17:49:41,263][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:41,273][MEDS_extract.shard_events.shard_events][INFO] - Read 722 rows from /content/intermediate_dir/hosp/procedures_icd.parquet. [2025-07-17 17:49:41,274][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/hosp/procedures_icd.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:41,274][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/hosp/procedures_icd.parquet row-chunk [0-722) to output_dir/shard_events/hosp/procedures_icd/[0-722).parquet. [2025-07-17 17:49:41,275][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/hosp/procedures_icd.parquet [2025-07-17 17:49:41,275][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:41,276][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:41,276][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/hosp/procedures_icd/[0-722).parquet [2025-07-17 17:49:41,284][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.009133 [2025-07-17 17:49:41,290][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/icu/chartevents.parquet to output_dir/shard_events/icu/chartevents. [2025-07-17 17:49:41,291][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/icu/chartevents.parquet to determine row count. [2025-07-17 17:49:41,291][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:41,304][MEDS_extract.shard_events.shard_events][INFO] - Read 668862 rows from /content/intermediate_dir/icu/chartevents.parquet. [2025-07-17 17:49:41,304][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/icu/chartevents.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:41,305][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/icu/chartevents.parquet row-chunk [0-668862) to output_dir/shard_events/icu/chartevents/[0-668862).parquet. [2025-07-17 17:49:41,305][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/icu/chartevents.parquet [2025-07-17 17:49:41,306][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:41,306][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:41,307][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/icu/chartevents/[0-668862).parquet [2025-07-17 17:49:42,184][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.878552 [2025-07-17 17:49:42,188][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/icu/icustays.parquet to output_dir/shard_events/icu/icustays. [2025-07-17 17:49:42,188][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/icu/icustays.parquet to determine row count. [2025-07-17 17:49:42,189][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:42,193][MEDS_extract.shard_events.shard_events][INFO] - Read 140 rows from /content/intermediate_dir/icu/icustays.parquet. [2025-07-17 17:49:42,193][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/icu/icustays.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:42,193][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/icu/icustays.parquet row-chunk [0-140) to output_dir/shard_events/icu/icustays/[0-140).parquet. [2025-07-17 17:49:42,194][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/icu/icustays.parquet [2025-07-17 17:49:42,194][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:42,195][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:42,195][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/icu/icustays/[0-140).parquet [2025-07-17 17:49:42,207][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.013370 [2025-07-17 17:49:42,209][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/hosp/patients.parquet to output_dir/shard_events/hosp/patients. [2025-07-17 17:49:42,209][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/hosp/patients.parquet to determine row count. [2025-07-17 17:49:42,210][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:42,218][MEDS_extract.shard_events.shard_events][INFO] - Read 100 rows from /content/intermediate_dir/hosp/patients.parquet. [2025-07-17 17:49:42,218][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/hosp/patients.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:42,218][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/hosp/patients.parquet row-chunk [0-100) to output_dir/shard_events/hosp/patients/[0-100).parquet. [2025-07-17 17:49:42,219][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/hosp/patients.parquet [2025-07-17 17:49:42,220][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:42,221][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:42,221][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/hosp/patients/[0-100).parquet [2025-07-17 17:49:42,230][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.011276 [2025-07-17 17:49:42,231][MEDS_extract.shard_events.shard_events][INFO] - Sub-sharding completed in 0:00:01.034039 INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: split_and_shard_subjects INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml split_and_shard_subjects stage=split_and_shard_subjects input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:45,804][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading event conversion config from MESSY.yaml (needed for subject ID columns) [2025-07-17 17:49:45,820][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Event conversion config: subject_id_col: subject_id hosp/patients: gender: code: col(gender) time: null death: code: MEDS_DEATH time: col(dod) birth: code: MEDS_BIRTH time: col(year_of_birth) time_format: '%Y' hosp/admissions: ed_registration: code: ED_REGISTRATION time: col(edregtime) time_format: '%Y-%m-%d %H:%M:%S' ed_out: code: ED_OUT time: col(edouttime) time_format: '%Y-%m-%d %H:%M:%S' admission: code: - HOSPITAL_ADMISSION - col(admission_type) - col(admission_location) time: col(admittime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id discharge: code: - HOSPITAL_DISCHARGE - col(discharge_location) time: col(dischtime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id hosp/procedures_icd: procedure_icd: code: - PROCEDURE - ICD - coi(icd_version) - col(icd_code) time: col(chartdate) seq_num: seq_num icu/icustays: 
admission: code: - ICU_ADMISSION - col(first_careunit) time: col(intime) time_format: '%Y-%m-%d %H:%M:%S' discharge: code: - ICU_DISCHARGE - col(last_careunit) time: col(outtime) time_format: '%Y-%m-%d %H:%M:%S' icu/chartevents: chartevent: code: - CHARTEVENT - col(itemid) - col(valueuom) time: col(charttime) time_format: '%Y-%m-%d %H:%M:%S' numeric_value: valuenum [2025-07-17 17:49:45,821][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from hosp/patients files: - /content/output_dir/shard_events/hosp/patients/[0-100).parquet [2025-07-17 17:49:45,823][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from hosp/admissions files: - /content/output_dir/shard_events/hosp/admissions/[0-275).parquet [2025-07-17 17:49:45,823][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from hosp/procedures_icd files: - /content/output_dir/shard_events/hosp/procedures_icd/[0-722).parquet [2025-07-17 17:49:45,824][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from icu/icustays files: - /content/output_dir/shard_events/icu/icustays/[0-140).parquet [2025-07-17 17:49:45,824][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from icu/chartevents files: - /content/output_dir/shard_events/icu/chartevents/[0-668862).parquet [2025-07-17 17:49:45,824][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Joining all subject IDs from 5 dataframes [2025-07-17 17:49:45,925][numexpr.utils][INFO] - NumExpr defaulting to 2 threads. 
[2025-07-17 17:49:46,127][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Found 100 unique subject IDs of type int64 [2025-07-17 17:49:46,128][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Sharding and splitting subjects [2025-07-17 17:49:46,129][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Split train/0 has 80 subjects. [2025-07-17 17:49:46,129][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Split tuning/0 has 10 subjects. [2025-07-17 17:49:46,129][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Split held_out/0 has 10 subjects. [2025-07-17 17:49:46,130][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Writing sharded subjects to /content/output_dir/metadata/.shards.json [2025-07-17 17:49:46,131][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Done writing sharded subjects INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: convert_to_subject_sharded INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml convert_to_subject_sharded stage=convert_to_subject_sharded input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:48,010][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Starting subject sharding. 
[2025-07-17 17:49:48,010][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Reading event conversion config from MESSY.yaml
[2025-07-17 17:49:48,026][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Event conversion config:
subject_id_col: subject_id
hosp/patients:
  gender:
    code: col(gender)
    time: null
  death:
    code: MEDS_DEATH
    time: col(dod)
  birth:
    code: MEDS_BIRTH
    time: col(year_of_birth)
    time_format: '%Y'
hosp/admissions:
  ed_registration:
    code: ED_REGISTRATION
    time: col(edregtime)
    time_format: '%Y-%m-%d %H:%M:%S'
  ed_out:
    code: ED_OUT
    time: col(edouttime)
    time_format: '%Y-%m-%d %H:%M:%S'
  admission:
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
      - col(admission_location)
    time: col(admittime)
    time_format: '%Y-%m-%d %H:%M:%S'
    hadm_id: hadm_id
  discharge:
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(dischtime)
    time_format: '%Y-%m-%d %H:%M:%S'
    hadm_id: hadm_id
hosp/procedures_icd:
  procedure_icd:
    code:
      - PROCEDURE
      - ICD
      - coi(icd_version)
      - col(icd_code)
    time: col(chartdate)
    seq_num: seq_num
icu/icustays:
  admission:
    code:
      - ICU_ADMISSION
      - col(first_careunit)
    time: col(intime)
    time_format: '%Y-%m-%d %H:%M:%S'
  discharge:
    code:
      - ICU_DISCHARGE
      - col(last_careunit)
    time: col(outtime)
    time_format: '%Y-%m-%d %H:%M:%S'
icu/chartevents:
  chartevent:
    code:
      - CHARTEVENT
      - col(itemid)
      - col(valueuom)
    time: col(charttime)
    time_format: '%Y-%m-%d %H:%M:%S'
    numeric_value: valuenum
[2025-07-17 17:49:48,028][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/hosp/patients/[0-100).parquet')]
[2025-07-17 17:49:48,030][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,030][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/hosp/patients.parquet
[2025-07-17 17:49:48,036][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.008897
[2025-07-17 17:49:48,038][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/hosp/procedures_icd/[0-722).parquet')]
[2025-07-17 17:49:48,039][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,039][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/hosp/procedures_icd.parquet
[2025-07-17 17:49:48,042][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.003652
[2025-07-17 17:49:48,044][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/hosp/admissions/[0-275).parquet')]
[2025-07-17 17:49:48,044][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,044][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/hosp/admissions.parquet
[2025-07-17 17:49:48,047][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.003944
[2025-07-17 17:49:48,049][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/icu/chartevents/[0-668862).parquet')]
[2025-07-17 17:49:48,050][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,050][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/icu/chartevents.parquet
[2025-07-17 17:49:48,105][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.056211
[2025-07-17 17:49:48,107][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/icu/icustays/[0-140).parquet')]
[2025-07-17 17:49:48,107][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,107][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/icu/icustays.parquet
[2025-07-17 17:49:48,110][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.003717
... (the same read/write cycle repeats for each table in the held_out and train splits) ...
[2025-07-17 17:49:48,385][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Created a subject-sharded view.
INFO:MEDS_transforms.runner:Command error:
/usr/local/lib/python3.11/dist-packages/MEDS_extract/convert_to_subject_sharded/convert_to_subject_sharded.py:75: PerformanceWarning: Resolving the schema of a LazyFrame is a potentially expensive operation. Use `LazyFrame.collect_schema()` to get the schema without this warning.
  typed_subjects = pl.Series(subjects, dtype=dfs[0].schema[input_subject_id_column])
INFO:MEDS_transforms.runner:Running stage: convert_to_MEDS_events
INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml convert_to_MEDS_events stage=convert_to_MEDS_events input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0
INFO:MEDS_transforms.runner:Command output:
[2025-07-17 17:49:50,205][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Starting event conversion.
[2025-07-17 17:49:50,205][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Reading event conversion config from MESSY.yaml
[2025-07-17 17:49:50,221][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Event conversion config:
... (the same config as echoed above) ...
[2025-07-17 17:49:50,227][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/hosp/patients.parquet
[2025-07-17 17:49:50,228][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,228][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for hosp/patients
[2025-07-17 17:49:50,229][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting gender
[2025-07-17 17:49:50,230][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column gender
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding null literate for time
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null codes via col("gender").is_not_null()
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting death
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - dod should already be of Datetime type
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("dod").is_not_null()
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting birth
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column year_of_birth in possible formats %Y
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("year_of_birth").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,233][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/hosp/patients.parquet
[2025-07-17 17:49:50,241][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.013788
[2025-07-17 17:49:50,242][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/icu/chartevents.parquet
[2025-07-17 17:49:50,243][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,243][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for icu/chartevents
[2025-07-17 17:49:50,244][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting chartevent
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column valueuom
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column itemid
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column charttime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("charttime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,245][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/icu/chartevents.parquet
[2025-07-17 17:49:50,695][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.452166
[2025-07-17 17:49:50,696][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/hosp/procedures_icd.parquet
[2025-07-17 17:49:50,696][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,697][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for hosp/procedures_icd
[2025-07-17 17:49:50,697][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting procedure_icd
[2025-07-17 17:49:50,698][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column icd_code
[2025-07-17 17:49:50,698][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - chartdate should already be of Datetime type
[2025-07-17 17:49:50,698][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("chartdate").is_not_null()
[2025-07-17 17:49:50,699][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/hosp/procedures_icd.parquet
[2025-07-17 17:49:50,702][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.006156
[2025-07-17 17:49:50,704][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/hosp/admissions.parquet
[2025-07-17 17:49:50,704][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,705][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for hosp/admissions
[2025-07-17 17:49:50,706][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting ed_registration
[2025-07-17 17:49:50,706][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column edregtime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("edregtime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting ed_out
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column edouttime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("edouttime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting admission
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column admission_type
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column admission_location
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column admittime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,709][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("admittime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,709][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting discharge
[2025-07-17 17:49:50,709][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column discharge_location
[2025-07-17 17:49:50,710][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column dischtime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,710][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("dischtime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,710][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/hosp/admissions.parquet
[2025-07-17 17:49:50,716][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.011979
[2025-07-17 17:49:50,718][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/icu/icustays.parquet
[2025-07-17 17:49:50,718][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,718][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for icu/icustays
[2025-07-17 17:49:50,719][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting admission
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column first_careunit
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column intime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("intime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting discharge
[2025-07-17 17:49:50,721][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column last_careunit
[2025-07-17 17:49:50,721][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column outtime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,721][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("outtime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,721][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/icu/icustays.parquet
[2025-07-17 17:49:50,725][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.007194
... (the same extraction log repeats for each table in the held_out and tuning splits) ...
[2025-07-17 17:49:50,872][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Subsharded into converted events.
INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: merge_to_MEDS_cohort INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml merge_to_MEDS_cohort stage=merge_to_MEDS_cohort input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:52,716][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Mapping computation over a maximum of 3 shards [2025-07-17 17:49:52,717][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/convert_to_MEDS_events/held_out/0 into /content/output_dir/merge_to_MEDS_cohort/held_out/0.parquet [2025-07-17 17:49:52,717][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_MEDS_events/held_out/0 [2025-07-17 17:49:52,718][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Reading 5 files: - /content/output_dir/convert_to_MEDS_events/held_out/0/hosp/patients.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/hosp/admissions.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/hosp/procedures_icd.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/icu/icustays.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/icu/chartevents.parquet [2025-07-17 17:49:52,720][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:52,720][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/merge_to_MEDS_cohort/held_out/0.parquet [2025-07-17 17:49:52,752][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.034519 [2025-07-17 17:49:52,752][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/convert_to_MEDS_events/tuning/0 into /content/output_dir/merge_to_MEDS_cohort/tuning/0.parquet [2025-07-17 
17:49:52,753][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_MEDS_events/tuning/0 [2025-07-17 17:49:52,753][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Reading 5 files: - /content/output_dir/convert_to_MEDS_events/tuning/0/hosp/patients.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/hosp/admissions.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/hosp/procedures_icd.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/icu/icustays.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/icu/chartevents.parquet [2025-07-17 17:49:52,754][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:52,754][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/merge_to_MEDS_cohort/tuning/0.parquet [2025-07-17 17:49:52,799][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.046454 [2025-07-17 17:49:52,800][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/convert_to_MEDS_events/train/0 into /content/output_dir/merge_to_MEDS_cohort/train/0.parquet [2025-07-17 17:49:52,801][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_MEDS_events/train/0 [2025-07-17 17:49:52,801][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Reading 5 files: - /content/output_dir/convert_to_MEDS_events/train/0/hosp/patients.parquet - /content/output_dir/convert_to_MEDS_events/train/0/hosp/admissions.parquet - /content/output_dir/convert_to_MEDS_events/train/0/hosp/procedures_icd.parquet - /content/output_dir/convert_to_MEDS_events/train/0/icu/icustays.parquet - /content/output_dir/convert_to_MEDS_events/train/0/icu/chartevents.parquet [2025-07-17 17:49:52,802][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:52,802][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/merge_to_MEDS_cohort/train/0.parquet [2025-07-17 
17:49:53,295][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.494605 [2025-07-17 17:49:53,295][MEDS_transforms.mapreduce.stage][INFO] - Finished mapping in 0:00:00.581387 INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: extract_code_metadata INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml extract_code_metadata stage=extract_code_metadata input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:55,490][MEDS_extract.extract_code_metadata.extract_code_metadata][INFO] - Reading event conversion config from MESSY.yaml [2025-07-17 17:49:55,515][MEDS_extract.extract_code_metadata.extract_code_metadata][INFO] - Event conversion config: subject_id_col: subject_id hosp/patients: gender: code: col(gender) time: null death: code: MEDS_DEATH time: col(dod) birth: code: MEDS_BIRTH time: col(year_of_birth) time_format: '%Y' hosp/admissions: ed_registration: code: ED_REGISTRATION time: col(edregtime) time_format: '%Y-%m-%d %H:%M:%S' ed_out: code: ED_OUT time: col(edouttime) time_format: '%Y-%m-%d %H:%M:%S' admission: code: - HOSPITAL_ADMISSION - col(admission_type) - col(admission_location) time: col(admittime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id discharge: code: - HOSPITAL_DISCHARGE - col(discharge_location) time: col(dischtime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id hosp/procedures_icd: procedure_icd: code: - PROCEDURE - ICD - coi(icd_version) - col(icd_code) time: col(chartdate) seq_num: seq_num icu/icustays: admission: code: - ICU_ADMISSION - col(first_careunit) time: col(intime) time_format: '%Y-%m-%d %H:%M:%S' discharge: code: - ICU_DISCHARGE - col(last_careunit) time: col(outtime) time_format: '%Y-%m-%d %H:%M:%S' icu/chartevents: chartevent: code: - CHARTEVENT - col(itemid) - col(valueuom) time: 
col(charttime) time_format: '%Y-%m-%d %H:%M:%S' numeric_value: valuenum [2025-07-17 17:49:55,523][MEDS_extract.extract_code_metadata.extract_code_metadata][INFO] - No _metadata blocks in the event_conversion_config.yaml found. Exiting... INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: finalize_MEDS_metadata INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml finalize_MEDS_metadata stage=finalize_MEDS_metadata input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:58,056][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Validating code metadata [2025-07-17 17:49:58,056][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - No code metadata found at output_dir/extract_code_metadata/codes.parquet. Making empty metadata file. [2025-07-17 17:49:58,133][numexpr.utils][INFO] - NumExpr defaulting to 2 threads. 
[2025-07-17 17:49:58,336][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Writing finalized metadata df to /content/output_dir/metadata/codes.parquet [2025-07-17 17:49:58,337][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Creating dataset metadata [2025-07-17 17:49:58,340][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Writing finalized dataset metadata to /content/output_dir/metadata/dataset.json [2025-07-17 17:49:58,340][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Creating subject splits from {str(shards_map_fp.resolve())} [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Split train has 80 subjects [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Split tuning has 10 subjects [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Split held_out has 10 subjects [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Writing finalized subject splits to /content/output_dir/metadata/subject_splits.parquet INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: finalize_MEDS_data INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml finalize_MEDS_data stage=finalize_MEDS_data input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:50:00,222][MEDS_transforms.mapreduce.shard_iteration][INFO] - Mapping computation over a maximum of 3 shards [2025-07-17 17:50:00,223][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/merge_to_MEDS_cohort/held_out/0.parquet into /content/output_dir/data/held_out/0.parquet [2025-07-17 17:50:00,224][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input 
dataframe from output_dir/merge_to_MEDS_cohort/held_out/0.parquet [2025-07-17 17:50:00,224][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:50:00,235][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/data/held_out/0.parquet [2025-07-17 17:50:00,242][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.018772 [2025-07-17 17:50:00,243][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/merge_to_MEDS_cohort/tuning/0.parquet into /content/output_dir/data/tuning/0.parquet [2025-07-17 17:50:00,243][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/merge_to_MEDS_cohort/tuning/0.parquet [2025-07-17 17:50:00,244][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:50:00,249][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/data/tuning/0.parquet [2025-07-17 17:50:00,260][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.016874 [2025-07-17 17:50:00,261][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/merge_to_MEDS_cohort/train/0.parquet into /content/output_dir/data/train/0.parquet [2025-07-17 17:50:00,261][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/merge_to_MEDS_cohort/train/0.parquet [2025-07-17 17:50:00,262][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:50:00,302][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/data/train/0.parquet [2025-07-17 17:50:00,404][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.143007 [2025-07-17 17:50:00,405][MEDS_transforms.mapreduce.stage][INFO] - Finished mapping in 0:00:00.184304 INFO:MEDS_transforms.runner:Command error:
Note that even if everything ran correctly, the log will still end with "Command error:" followed by nothing -- this simply reports that no error output was written for the internal stages of the process.
What do the output files themselves actually look like? Let's see:
%%bash
tree output_dir
output_dir
├── convert_to_MEDS_events
│   ├── event_conversion_config.yaml
│   ├── held_out
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   ├── train
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   └── tuning
│       └── 0
│           ├── hosp
│           │   ├── admissions.parquet
│           │   ├── patients.parquet
│           │   └── procedures_icd.parquet
│           └── icu
│               ├── chartevents.parquet
│               └── icustays.parquet
├── convert_to_subject_sharded
│   ├── held_out
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   ├── train
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   └── tuning
│       └── 0
│           ├── hosp
│           │   ├── admissions.parquet
│           │   ├── patients.parquet
│           │   └── procedures_icd.parquet
│           └── icu
│               ├── chartevents.parquet
│               └── icustays.parquet
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   └── 0.parquet
│   └── tuning
│       └── 0.parquet
├── extract_code_metadata
│   └── event_conversion_config.yaml
├── finalize_MEDS_metadata
├── merge_to_MEDS_cohort
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   └── 0.parquet
│   └── tuning
│       └── 0.parquet
├── metadata
│   ├── codes.parquet
│   ├── dataset.json
│   └── subject_splits.parquet
├── shard_events
│   ├── hosp
│   │   ├── admissions
│   │   │   └── [0-275).parquet
│   │   ├── patients
│   │   │   └── [0-100).parquet
│   │   └── procedures_icd
│   │       └── [0-722).parquet
│   └── icu
│       ├── chartevents
│       │   └── [0-668862).parquet
│       └── icustays
│           └── [0-140).parquet
└── split_and_shard_subjects

46 directories, 46 files
There's a lot here -- thankfully, most of these files are internal, partial outputs that MEDS-Extract writes so it can resume after a failure on larger datasets. They aren't needed for our purposes here, but they're invaluable when you're working with hundreds of thousands to billions of measurements!
To see just the final files, we can look in the data
and metadata
sub-folders:
%%bash
tree output_dir/data
output_dir/data
├── held_out
│   └── 0.parquet
├── train
│   └── 0.parquet
└── tuning
    └── 0.parquet

3 directories, 3 files
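Each of these shard files is an ordinary Parquet file in the MEDS long format: one row per measurement, with `subject_id`, `time`, `code`, and `numeric_value` columns. As an illustrative sketch (using pandas with entirely made-up values; a real shard is loaded the same way with `pd.read_parquet("output_dir/data/train/0.parquet")`):

```python
import pandas as pd

# A miniature stand-in for what a MEDS data shard contains: one row per
# measurement, sorted by subject and time. All values below are invented.
shard = pd.DataFrame(
    {
        "subject_id": [1, 1, 1],
        "time": pd.to_datetime(
            ["2130-01-01 10:00:00", "2130-01-01 11:30:00", "2130-01-03 09:00:00"]
        ),
        "code": [
            "HOSPITAL_ADMISSION//EW EMER.//EMERGENCY ROOM",
            "CHARTEVENT//220045//bpm",
            "HOSPITAL_DISCHARGE//HOME",
        ],
        # numeric_value is only populated for measurements with a numeric
        # component (here, the chartevent); otherwise it is null.
        "numeric_value": [None, 88.0, None],
    }
)

print(shard)
```

Note that categorical events carry all their information in the `code` string, while numeric measurements additionally fill `numeric_value`.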
%%bash
tree output_dir/metadata
output_dir/metadata
├── codes.parquet
├── dataset.json
└── subject_splits.parquet

0 directories, 3 files
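These metadata files are small and easy to inspect directly: `dataset.json` is plain JSON recording the `dataset.name` and `dataset.version` we passed on the command line, and `subject_splits.parquet` maps each subject to its split. A small sketch of what the splits table looks like (illustrative values; load the real file with `pd.read_parquet("output_dir/metadata/subject_splits.parquet")`):

```python
import pandas as pd

# Illustrative stand-in for metadata/subject_splits.parquet: one row per
# subject, mapping subject_id to its split label. Values are invented.
splits = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "split": ["train", "train", "tuning", "held_out"],
    }
)

# The real file from this run would show 80 / 10 / 10 subjects, matching
# the split sizes reported in the finalize_MEDS_metadata log above.
print(splits.groupby("split").size())
```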
Going Forward
While you've just built a complete MEDS dataset from the MIMIC demo data in this tutorial, you've only used a small subset of the files we listed above. In the rest of the tutorial, we'll use the full MIMIC demo dataset, which we'll download as needed in the other notebooks, rather than the output of this notebook. Note that it is also built with a slightly different configuration file than the one constructed here -- but rest assured, it is very similar to what you put together. You can see how it is processed by looking at the dedicated MIMIC-IV ETL Package, or specifically at the analogous MESSY file used for all the sources in that repository!
Additional Details and Resources
You can also check out MEDS-Extract's documentation, as well as another example on synthetic data, via the included links!
Even more importantly, what if you don't like MEDS-Extract and don't want to use it? Then don't! The three guiding questions of the extraction process (What is happening?, To whom is it happening?, and When is it happening?) can be turned into an extraction pipeline in whatever way you like -- the MEDS ecosystem is designed to be data-centric, so it doesn't matter how you got to a MEDS dataset, just that you did, and then tools can run from there!
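For instance, here is a minimal hand-rolled sketch of answering those three questions for an admissions-like file, with no MEDS-Extract involved (the raw table is made up, and joining code components with "//" mirrors MEDS-Extract's convention but is an assumption here, not a requirement of MEDS itself):

```python
import pandas as pd

# A made-up raw table, standing in for something like hosp/admissions.csv.gz.
raw = pd.DataFrame(
    {
        "subject_id": [1, 2],                                  # To whom?
        "admittime": ["2130-01-01 10:00:00", "2131-06-05 08:30:00"],  # When?
        "admission_type": ["EW EMER.", "ELECTIVE"],            # What?
    }
)

# Turn each raw row into one MEDS measurement row.
meds = pd.DataFrame(
    {
        "subject_id": raw["subject_id"],
        "time": pd.to_datetime(raw["admittime"], format="%Y-%m-%d %H:%M:%S"),
        "code": "HOSPITAL_ADMISSION//" + raw["admission_type"],
        "numeric_value": float("nan"),  # admissions carry no numeric component
    }
).sort_values(["subject_id", "time"])

print(meds)
```

However you produce it, a table shaped like `meds` (written out as Parquet, sorted by subject and time) is all downstream MEDS tools need.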