Converting to MEDS
In this tutorial, we'll extract a dataset in the MEDS format for downstream use. See below for the Jupyter notebook tutorial, rendered here, or check it out online on Google Colab or in our GitHub Repository.
Converting a Custom Dataset to MEDS
Part 1: Loading the raw data
In this tutorial, we'll use the publicly available MIMIC-IV Demo v2.2 dataset as our fictional "raw data source". Naturally, MIMIC has been used extensively in the public space, so its structure is actually very well understood and widely used; however, for the sake of this tutorial, let's act as though it isn't and we're seeing it for the first time.
The first thing we need to do is load the raw data (or, generally, a small random chunk of it, so we can iterate quickly -- though here we'll just use the entire demo dataset, given its size) and take a look at it. To do that, we'll download the raw files from PhysioNet and store them in a newly created `raw_data` directory (note this will take some time):
%%bash
mkdir -p raw_data
wget \
--quiet \
--no-host-directories \
--recursive \
--no-parent \
--cut-dirs=3 \
--directory-prefix \
raw_data \
https://physionet.org/files/mimic-iv-demo/2.2/
Now that the files have downloaded, what do they actually contain?
%%bash
apt-get -qq install tree > /dev/null
tree raw_data
raw_data
├── demo_subject_id.csv
├── hosp
│   ├── admissions.csv.gz
│   ├── d_hcpcs.csv.gz
│   ├── diagnoses_icd.csv.gz
│   ├── d_icd_diagnoses.csv.gz
│   ├── d_icd_procedures.csv.gz
│   ├── d_labitems.csv.gz
│   ├── drgcodes.csv.gz
│   ├── emar.csv.gz
│   ├── emar_detail.csv.gz
│   ├── hcpcsevents.csv.gz
│   ├── index.html
│   ├── labevents.csv.gz
│   ├── microbiologyevents.csv.gz
│   ├── omr.csv.gz
│   ├── patients.csv.gz
│   ├── pharmacy.csv.gz
│   ├── poe.csv.gz
│   ├── poe_detail.csv.gz
│   ├── prescriptions.csv.gz
│   ├── procedures_icd.csv.gz
│   ├── provider.csv.gz
│   ├── services.csv.gz
│   └── transfers.csv.gz
├── icu
│   ├── caregiver.csv.gz
│   ├── chartevents.csv.gz
│   ├── datetimeevents.csv.gz
│   ├── d_items.csv.gz
│   ├── icustays.csv.gz
│   ├── index.html
│   ├── ingredientevents.csv.gz
│   ├── inputevents.csv.gz
│   ├── outputevents.csv.gz
│   └── procedureevents.csv.gz
├── index.html
├── LICENSE.txt
├── README.txt
├── robots.txt
└── SHA256SUMS.txt

2 directories, 39 files
We can see there are a number of data files here, including:

- `hosp/*.csv.gz`
- `icu/*.csv.gz`

as well as a variety of other, likely non-data files. To understand any clinical dataset, you should generally rely on both the provided documentation and a local subject-matter expert who is familiar with both the clinical and operational context of the dataset; in practice, however, we rarely have the latter. For our purposes, let's take a look at the provided MIMIC-IV documentation to try to understand these various files.
Part 2: MEDS Extraction, conceptually
For now, we'll focus on only a few files, to keep things simple (note that each file below links to its specific data source documentation):

- `hosp/patients.csv.gz`
- `hosp/admissions.csv.gz`
- `hosp/procedures_icd.csv.gz`
- `icu/icustays.csv.gz`
- `icu/chartevents.csv.gz`
To start understanding how we should think about extracting a MEDS view of this data, let's inspect some of the data using pandas:
import pandas as pd
from pathlib import Path

DATA_ROOT = Path("raw_data")

dfs = {}
for fn in [
    "hosp/patients.csv.gz",
    "hosp/admissions.csv.gz",
    "hosp/procedures_icd.csv.gz",
    "icu/icustays.csv.gz",
    "icu/chartevents.csv.gz",
]:
    fp = DATA_ROOT / fn
    df = pd.read_csv(fp)
    print(f"{fn}:")
    display(df.head(2))
    dfs[fn.split(".")[0]] = df
hosp/patients.csv.gz:
| | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10014729 | F | 21 | 2125 | 2011 - 2013 | NaN |
| 1 | 10003400 | F | 72 | 2134 | 2011 - 2013 | 2137-09-02 |
hosp/admissions.csv.gz:
| | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admit_provider_id | admission_location | discharge_location | insurance | language | marital_status | race | edregtime | edouttime | hospital_expire_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10004235 | 24181354 | 2196-02-24 14:38:00 | 2196-03-04 14:02:00 | NaN | URGENT | P03YMR | TRANSFER FROM HOSPITAL | SKILLED NURSING FACILITY | Medicaid | ENGLISH | SINGLE | BLACK/CAPE VERDEAN | 2196-02-24 12:15:00 | 2196-02-24 17:07:00 | 0 |
| 1 | 10009628 | 25926192 | 2153-09-17 17:08:00 | 2153-09-25 13:20:00 | NaN | URGENT | P41R5N | TRANSFER FROM HOSPITAL | HOME HEALTH CARE | Medicaid | ? | MARRIED | HISPANIC/LATINO - PUERTO RICAN | NaN | NaN | 0 |
hosp/procedures_icd.csv.gz:
| | subject_id | hadm_id | seq_num | chartdate | icd_code | icd_version |
|---|---|---|---|---|---|---|
| 0 | 10011398 | 27505812 | 3 | 2146-12-15 | 3961 | 9 |
| 1 | 10011398 | 27505812 | 2 | 2146-12-15 | 3615 | 9 |
icu/icustays.csv.gz:
| | subject_id | hadm_id | stay_id | first_careunit | last_careunit | intime | outtime | los |
|---|---|---|---|---|---|---|---|---|
| 0 | 10018328 | 23786647 | 31269608 | Neuro Stepdown | Neuro Stepdown | 2154-04-24 23:03:44 | 2154-05-02 15:55:21 | 7.702512 |
| 1 | 10020187 | 24104168 | 37509585 | Neuro Surgical Intensive Care Unit (Neuro SICU) | Neuro Stepdown | 2169-01-15 04:56:00 | 2169-01-20 15:47:50 | 5.452662 |
icu/chartevents.csv.gz:
| | subject_id | hadm_id | stay_id | caregiver_id | charttime | storetime | itemid | value | valuenum | valueuom | warning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:45:00 | 225054 | On | NaN | NaN | 0.0 |
| 1 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:43:00 | 223769 | 100 | 100.0 | % | 0.0 |
We can see there is a lot of data contained in just these files! How can we hope to go about unifying it all into the simple MEDS format in a reasonable time?
To do so, we'll follow the assumptions of the MEDS-Extract library, which organizes the mapping of EHR data elements into the MEDS format via the following questions. For each row of each input source, we ask:
- What is happening in this row?
- To whom is it happening?
- When is it happening?
Once we can answer each of these three questions, we're ready to extract a full MEDS dataset over our inputs.
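As a concrete (if simplified) illustration, answering these three questions for a single raw row yields a single MEDS measurement. The row contents and the `LAB//...` code convention below are made up for illustration, not taken from the dataset:

```python
# A hypothetical raw row (column names mirror MIMIC-style tables; the values
# and the "LAB//..." code convention are illustrative only).
raw_row = {"subject_id": 10003400, "charttime": "2134-05-01 09:00:00", "itemid": 50912}

measurement = {
    "subject_id": raw_row["subject_id"],  # To whom is it happening?
    "time": raw_row["charttime"],         # When is it happening?
    "code": f"LAB//{raw_row['itemid']}",  # What is happening?
}
```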
Part 2.1: Mapping the `hosp/patients` table
To see these in action, let's work through our files in order, starting with `hosp/patients.csv.gz`:
dfs['hosp/patients'].head(2)
| | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10014729 | F | 21 | 2125 | 2011 - 2013 | NaN |
| 1 | 10003400 | F | 72 | 2134 | 2011 - 2013 | 2137-09-02 |
We can see that this dataframe clearly captures some static data about the patients in the population, any external date-of-death information present about the subject, and metadata about how this subject's data is transformed when included in MIMIC via the anchor year group. The latter aspect won't feature in the MEDS representation, so this means we only have the following pieces of information to represent about the patient:

- The information in the `gender` column, which for this dataset we will assign to a static measurement, as it is recorded as such within the raw dataset.
- The information in the `anchor_age` column, indicating the patient's date of birth (after some transformation).
- The information in the `dod` column, which contains a de-identified date of death for the patient, if applicable.
Ultimately, this tells us that for each row of the `hosp/patients` table, we'll want to construct 3 MEDS events:

- A measurement for subject `subject_id` with a `null` timestamp and a code indicating the value in the `gender` column.
- A measurement for subject `subject_id` with a timestamp given by the `dod` column (if it is not null), the `MEDS_DEATH` code (as this is a death event), and no values.
- A measurement for subject `subject_id` with a timestamp given by the difference between the `anchor_year` and the `anchor_age`, converted to a date-time, with the `MEDS_BIRTH` code and no values.
Let's record the information for these events in a simple, declarative format that we'll encode in YAML. For now, just think of this as an approximate format -- it isn't technically precise just yet. But, as we build up our specification, we'll see how we can turn it into a complete description of the extraction process. In particular, we'll have an outer level of the YAML correspond to the file we're talking about (in this case `hosp/patients`) and an inner block for each of the 3 events we've identified, describing which columns they'll use to construct their timestamps and codes (we don't have values for any of these events, but we'll add them in later).
Specification so far:
hosp/patients:
gender:
subject_id: subject_id
code: gender
time: null
death:
subject_id: subject_id
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
subject_id: subject_id
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
Note: the `patients` table has already revealed two common complications when converting clinical data (to any format, not just MEDS):

- The `dod` column only provides date-level resolution, not time-level resolution. This means that we don't know whether the patient died at 12:01 a.m. on that date or at 11:59 p.m., despite these two times being separated by nearly 24 full hours! This can cause issues with measurement ordering, the validity of temporal prediction tasks (e.g., predicting imminent mortality), etc. Ultimately, some choice needs to be made about how we want to represent this in MEDS. By design, MEDS does not allow you to specify a date-only timestamp, as such a timestamp does not permit a total ordering of measurements across different events. Here, as we know that death is a final event and is often (if not universally) the last event recorded for the patient, it makes sense to place it at the latest possible time within that date (i.e., add an implicit 11:59:59 p.m. onto the end of that timestamp column).
- As this dataset records an "age" (via `anchor_age`) rather than an explicit date of birth, we have a similar, but even greater, lack of temporal resolution for the date of birth. Here, we need to choose when within that year to assign the patient's date of birth; again, there is no "right" answer, but we need to make a choice. For this event, we'll choose January 1st of that year, to keep things simple.
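The two timestamp conventions above can be sketched as small helper functions. This is a minimal sketch, not part of MEDS-Extract, and `birth_time`/`death_time` are hypothetical names:

```python
from datetime import datetime


def birth_time(anchor_year: int, anchor_age: int) -> datetime:
    # The birth year is only known to year-level resolution
    # (anchor_year - anchor_age); Jan 1, 12:00:01 a.m. is this
    # tutorial's assumed placement within that year.
    return datetime(anchor_year - anchor_age, 1, 1, 0, 0, 1)


def death_time(dod: str) -> datetime:
    # Place a date-only death record at the latest possible time
    # within that date, since death is a "final" event.
    d = datetime.strptime(dod, "%Y-%m-%d")
    return d.replace(hour=23, minute=59, second=59)
```

For the second sample patient above (`anchor_year` 2134, `anchor_age` 72), this would place the birth at the start of 2062.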
Part 2.2: The `hosp/admissions` table
Next, let's inspect the `admissions` table:
dfs['hosp/admissions'].head(2)
| | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admit_provider_id | admission_location | discharge_location | insurance | language | marital_status | race | edregtime | edouttime | hospital_expire_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10004235 | 24181354 | 2196-02-24 14:38:00 | 2196-03-04 14:02:00 | NaN | URGENT | P03YMR | TRANSFER FROM HOSPITAL | SKILLED NURSING FACILITY | Medicaid | ENGLISH | SINGLE | BLACK/CAPE VERDEAN | 2196-02-24 12:15:00 | 2196-02-24 17:07:00 | 0 |
| 1 | 10009628 | 25926192 | 2153-09-17 17:08:00 | 2153-09-25 13:20:00 | NaN | URGENT | P41R5N | TRANSFER FROM HOSPITAL | HOME HEALTH CARE | Medicaid | ? | MARRIED | HISPANIC/LATINO - PUERTO RICAN | NaN | NaN | 0 |
Here, we have a lot of additional pieces of data -- records of admissions, discharges, possible competing records of deaths, admission types, locations for both admissions and discharges, patient information at time of admission (e.g., insurance, language, marital status, race), and emergency department (`ed*`) registration & discharge information. One new piece of complexity worth noting is that many of these events are "interval"-style events -- namely, events that present with both a start and an end time (e.g., an admission and discharge, an ED registration and an ED out, etc.). The "MEDS way" to handle such events is to simply include both a separate, appropriately timed start event and an end event -- that way, each interaction is represented separately in its appropriate place in the patient timeline. This comes through naturally when we focus on asking our three questions from above. With this perspective, we can quickly identify a list of measurements these columns represent:
- There is (or may be, if the timestamp is not null) a "hospital admission" of type `admission_type` to location `admission_location` at the time given by `admittime` for the subject given in `subject_id` (note we are not tracking the `admit_provider_id`, as MEDS does not currently formalize the notion of the treating provider).
- At the time of the hospital admission, some patient demographics are collected about `subject_id`, including their:
  - `insurance`
  - `language`
  - `marital_status`
  - `race`
- There may be a "hospital discharge" to the location `discharge_location` at time `dischtime` for `subject_id`.
- There may be a "death" event at time `deathtime` for `subject_id`.
- There may be an "ED Registration" event at time `edregtime` for `subject_id`.
- There may be an "ED Out" event at time `edouttime` for `subject_id`.
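To make the interval-splitting concrete, here is a minimal pandas sketch (not the MEDS-Extract implementation) that turns one admissions-style row into separate admission and discharge measurements:

```python
import pandas as pd

# One admissions-style row (values from the first sample row shown above).
row = {
    "subject_id": 10004235,
    "admittime": "2196-02-24 14:38:00",
    "dischtime": "2196-03-04 14:02:00",
    "admission_type": "URGENT",
    "admission_location": "TRANSFER FROM HOSPITAL",
    "discharge_location": "SKILLED NURSING FACILITY",
}

# Split the interval into two appropriately timed point measurements.
events = pd.DataFrame(
    [
        {
            "subject_id": row["subject_id"],
            "time": row["admittime"],
            "code": f"HOSPITAL_ADMISSION//{row['admission_type']}//{row['admission_location']}",
        },
        {
            "subject_id": row["subject_id"],
            "time": row["dischtime"],
            "code": f"HOSPITAL_DISCHARGE//{row['discharge_location']}",
        },
    ]
)
```

Each event now lands in its own place in the patient timeline, answering the "what / to whom / when" questions independently.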
Given these event descriptions, we can update our specification as follows:
hosp/patients:
gender:
subject_id: subject_id
code: gender
time: null
death:
subject_id: subject_id
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
subject_id: subject_id
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
subject_id: subject_id
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
subject_id: subject_id
code: "LANGUAGE//${language}"
time: admittime
marital_status:
subject_id: subject_id
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
subject_id: subject_id
code: "INSURANCE//${insurance}"
time: admittime
race:
subject_id: subject_id
code: "RACE//${race}"
time: admittime
discharge:
subject_id: subject_id
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death:
subject_id: subject_id
code: MEDS_DEATH
time: deathtime
ed_reg:
subject_id: subject_id
code: ED_REGISTRATION
time: edregtime
ed_out:
subject_id: subject_id
code: ED_OUT
time: edouttime
Heads up that we're being a bit imprecise with our syntax here, as this is just (for now) a mental aid -- namely, we're sometimes using plain strings to represent column names (e.g., `code: gender` and `subject_id: subject_id`) and sometimes using strings explicitly indicated with double-quotes to indicate compound codes using python's string interpolation syntax (e.g., `code: "HOSPITAL_DISCHARGE//${discharge_location}"`). We'll formalize this later, but for now, use context to disambiguate which we mean.
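For intuition, the double-quoted `${...}` strings behave like Python's `string.Template` substitution over the row's columns:

```python
from string import Template

# One admissions-style row (value from the sample output above).
row = {"discharge_location": "HOME HEALTH CARE"}

# "${discharge_location}" is a column reference inside the compound code string.
code = Template("HOSPITAL_DISCHARGE//${discharge_location}").substitute(row)
```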
Note that, much like before, we've seen some other areas where challenges arise and assumptions need to be made in mapping this table:

- Multifactorial measurements: Here, there are several measurements that come with different parts. We have admissions occurring with types and to locations, alongside demographic data being measured like language, marital status, race, and insurance type. How should we map all of these to a set of distinct measurements, and with what codes? In general, this question comes down to a trade-off between more simultaneous measurements vs. more complex codes -- i.e., you can either produce more measurements for each distinct aspect of the code at the same time-point, or you can add more pieces of information into a single code string, thereby increasing the size of your vocabulary. This data extraction step shows both strategies in action, for good reason:
  - For admission type and location, we include them in the core hospital admission code. This makes sense because every admission has to have a type and a location -- so they are natural "modifiers" to the admission measurement conceptually, as opposed to being distinct measurements. We'd also almost never have a situation where a model would need to know that an admission happened, but not know of what type or to where.
  - On the other hand, the patient demographic information has been separated into distinct measurements, all at the same point in time -- each aspect of the demographic data is thus recorded separately, so if we wish to later filter out rare or unknown recordings for one aspect of the demographic data in isolation from the others, this will be easy to do at a measurement level. Ultimately, however, it may also be reasonable (or even work better in some modeling tasks) to instead produce a joint code string across all demographic information (e.g., `LANGUAGE//${language}//INSURANCE//${insurance}//...`). If you want to try that out yourself, let us know if it works better!

  The existence of these multifactorial codes also highlights a convention we'll take in this guide, which is to compose "structured code strings" using the double-slash (`"//"`) as a separator, as this is unlikely to occur in a raw code string. This is not a formal requirement, so feel free to use a different approach in your data -- but what is important to note is that you likely do not want code strings to collide across different measurement sources. So, if you just used `code: race` and `code: language`, for example, and `UNK` was an option for both `race` and `language`, your models wouldn't be able to differentiate between those two options unless you use a unique prefix (like we do here).
- Competing Measurement Sources: There's another death time in this file, in addition to the `dod` recorded in the `patients` table! This is, unfortunately, a common enough problem in EHR data. Luckily, its solution is pretty straightforward -- simply decide which source takes precedence (ideally this will be a universal property, not a data-dependent one) and favor that over the other. Here, as the `deathtime` in this dataset has full datetime resolution, it will be preferred over the `dod` in the other file. We'll merely denote that with a comment in our specification for now.
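A minimal pandas sketch of this precedence rule (toy data; the real pipeline resolves this as part of extraction, not like this) might look like:

```python
import pandas as pd

# Toy versions of the two competing sources; column names follow MIMIC-IV.
patients = pd.DataFrame({"subject_id": [1, 2], "dod": ["2137-09-02", "2140-01-01"]})
admissions = pd.DataFrame({"subject_id": [1], "deathtime": ["2137-09-02 04:15:00"]})

merged = patients.merge(admissions, on="subject_id", how="left")

# Prefer the full-resolution `deathtime`; fall back to the date-only `dod`,
# placed at the latest possible time within that date.
merged["death_time"] = merged["deathtime"]
date_only = merged["death_time"].isna() & merged["dod"].notna()
merged.loc[date_only, "death_time"] = merged.loc[date_only, "dod"] + " 23:59:59"
```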
At this point, our specification is also getting pretty verbose. Let's pull out the shared aspects across all event blocks into the upper level -- for now, this is just the `subject_id` specification -- so we can get rid of some wasted space:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
While there are other ways we could further condense this (e.g., using a list of objects rather than a dictionary of objects within each data source), they would cost us more in clarity than they'd save in space, so we'll keep this format for now.
Part 2.3: The `hosp/procedures_icd` table
Let's move on to our next table, `procedures_icd`:
dfs['hosp/procedures_icd'].head(2)
| | subject_id | hadm_id | seq_num | chartdate | icd_code | icd_version |
|---|---|---|---|---|---|---|
| 0 | 10011398 | 27505812 | 3 | 2146-12-15 | 3961 | 9 |
| 1 | 10011398 | 27505812 | 2 | 2146-12-15 | 3615 | 9 |
Here, we have a bit of an easier time -- there's clearly only one measurement being recorded here: the ICD code itself, recorded for `subject_id` at the time given by `chartdate`. However, much like for the `dod` column in the `hosp/patients` table, this is only a date, not a full datetime, so we need to decide what timestamp within the date to assign. Here, the situation is not quite so simple; unlike death, which is clearly a "final" event, procedures can happen throughout the day, and we don't know where it would be best to assign the recordings of their ICD codes. Ultimately, as we are more likely to want to predict things that are based on these procedures or heavily indicated by them, it is better to put them later in the day rather than earlier to avoid temporal leakage -- though note that this can still cause leakage in tasks that attempt to predict these procedure codes themselves! Regardless, we'll assign them the time of 11:59:59 p.m. on the given day. We'll also want to ensure we capture both the `icd_code` and `icd_version` in these measurements, as both are necessary to fully define the assigned ICD code.
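A small sketch of both choices (the compound `PROCEDURE//ICD...` code plus the end-of-day timestamp); `procedure_event` is a hypothetical helper, not part of MEDS-Extract:

```python
from datetime import datetime, time


def procedure_event(icd_version: int, icd_code: str, chartdate: str) -> dict:
    # Compose the code from both the ICD version and the ICD code, and
    # place the date-only chartdate at 11:59:59 p.m. (this tutorial's
    # assumed convention for date-only procedure records).
    return {
        "code": f"PROCEDURE//ICD{icd_version}//{icd_code}",
        "time": datetime.combine(
            datetime.strptime(chartdate, "%Y-%m-%d").date(), time(23, 59, 59)
        ),
    }
```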
Before we show our new specification, note that there is one additional complexity here we should take into account: `seq_num`. This is actually an important piece of information, as it indicates the relative prioritization of the codes assigned to the patient (a lower `seq_num` indicating a higher-priority code). This is a common paradigm for diagnostic codes in U.S. healthcare datasets, so we do want to include it; however, it doesn't feel quite right to include it in the code, as it is not a real part of the measurement about the patient. Instead, for this example, we'll use the fact that MEDS datasets are permitted to include any other desired columns beyond the required ones, so we can just track it directly as an extra column:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
hosp/procedures_icd:
procedure_icd:
code: "PROCEDURE//ICD${icd_version}//${icd_code}"
time: chartdate @ 11:59 p.m.
seq_num: seq_num
Part 2.4: The `icu/icustays` table
Now, let's look at `icustays`:
dfs['icu/icustays'].head(2)
| | subject_id | hadm_id | stay_id | first_careunit | last_careunit | intime | outtime | los |
|---|---|---|---|---|---|---|---|---|
| 0 | 10018328 | 23786647 | 31269608 | Neuro Stepdown | Neuro Stepdown | 2154-04-24 23:03:44 | 2154-05-02 15:55:21 | 7.702512 |
| 1 | 10020187 | 24104168 | 37509585 | Neuro Surgical Intensive Care Unit (Neuro SICU) | Neuro Stepdown | 2169-01-15 04:56:00 | 2169-01-20 15:47:50 | 5.452662 |
This table is much like the `hosp/admissions` table -- we have some "interval"-style events being recorded here (namely, ICU stays), which we'll separate into endpoints, resulting in:

- An ICU admission event for `subject_id` at `intime` to the `first_careunit`.
- An ICU discharge event for `subject_id` at `outtime` from the `last_careunit`.
Note two things:

- The `los` here is actually a derived property -- it isn't something we want to record in the MEDS data directly (especially not in the ICU admission event, because that could risk future leakage).
- We're actually being a bit inconsistent here -- really, we should likely try to find another table in the MIMIC source that captures the sequence of care units the patient is seen within, so that we can record transfers to a care unit universally, rather than having an ICU admission to a care unit and an ICU discharge from a care unit. For now, this is outside the scope of our tutorial (but if you're interested, the right table to use for this is the `hosp/transfers` table, which is actually the ground-truth source for the `icu/icustays` table).
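We can sanity-check that `los` is derived by recomputing it (in fractional days) from the first `icustays` row shown above:

```python
from datetime import datetime

# Recompute length-of-stay from intime/outtime for the first icustays row.
intime = datetime.strptime("2154-04-24 23:03:44", "%Y-%m-%d %H:%M:%S")
outtime = datetime.strptime("2154-05-02 15:55:21", "%Y-%m-%d %H:%M:%S")
los_days = (outtime - intime).total_seconds() / 86400  # ≈ 7.702512
```

This matches the stored `los` value of 7.702512, confirming it carries no information beyond the two timestamps.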
When we add this to our spec, we obtain:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
hosp/procedures_icd:
procedure_icd:
code: "PROCEDURE//ICD${icd_version}//${icd_code}"
time: chartdate @ 11:59 p.m.
seq_num: seq_num
icu/icustays:
admission:
code: "ICU_ADMISSION//${first_careunit}"
time: intime
discharge:
code: "ICU_DISCHARGE//${last_careunit}"
time: outtime
Part 2.5: The `icu/chartevents` table
Finally, let's look at `chartevents`:
dfs['icu/chartevents'].head(2)
| | subject_id | hadm_id | stay_id | caregiver_id | charttime | storetime | itemid | value | valuenum | valueuom | warning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:45:00 | 225054 | On | NaN | NaN | 0.0 |
| 1 | 10005817 | 20626031 | 32604416 | 6770.0 | 2132-12-16 00:00:00 | 2132-12-15 23:43:00 | 223769 | 100 | 100.0 | % | 0.0 |
This table clearly has rows that capture a variety of more nuanced measurements. Some have numerical results, units of measure, etc. We also have another complexity here, in that there is some uncertainty in the timestamp, with both `charttime` and `storetime` being included. Ultimately, though, there is still just one kind of measurement being recorded here: namely, a "chart event" (often a lab test), identified via the "Item ID" `itemid`, recorded at either `charttime` or `storetime`, with a value given by the `value`, `valuenum`, and `valueuom` columns. Let's see how to add that to our specification (for brevity, we'll just show the new bit first, before we put it all together):
icu/chartevents:
chartevent:
time: charttime
code: "CHARTEVENT//${itemid}//${valueuom}"
numeric_value: valuenum
Note here that we've made a few assumptions:

- We've defaulted to favoring `charttime` here -- this is because, according to the data documentation, `charttime` is the closest proxy to when the data was actually recorded. However, this could benefit from further investigation and empirical validation!
- We are omitting the `warning` column -- this is because we don't know when a warning would actually have been noted by the care team; it does not represent an automated process as part of the chart event measurement, but rather a manual observation by the care team after the data has been recorded.
In addition, this format has one undesired property: if `valueuom` is empty or `NaN`, the code string will have a trailing `//` (because we've included `valueuom` in the template, even though it will only be populated for measurements with a numeric value). We can try to remedy this later, though it is not a high-priority issue, as it only results in a superficial change.
All told, this gives us a final "specification" for the data extraction (at a conceptual level) as follows:
subject_id: subject_id
hosp/patients:
gender:
code: gender
time: null
death: # Superseded by the `death` measurement in hosp/admissions
code: MEDS_DEATH
time: dod @ 11:59 p.m.
birth:
code: MEDS_BIRTH
time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
hosp/admissions:
admission:
code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
time: admittime
language:
code: "LANGUAGE//${language}"
time: admittime
marital_status:
code: "MARITAL_STATUS//${marital_status}"
time: admittime
insurance:
code: "INSURANCE//${insurance}"
time: admittime
race:
code: "RACE//${race}"
time: admittime
discharge:
code: "HOSPITAL_DISCHARGE//${discharge_location}"
time: dischtime
death: # Takes precedence over the `death` measurement in hosp/patients
code: MEDS_DEATH
time: deathtime
ed_reg:
code: ED_REGISTRATION
time: edregtime
ed_out:
code: ED_OUT
time: edouttime
hosp/procedures_icd:
procedure_icd:
code: "PROCEDURE//ICD${icd_version}//${icd_code}"
time: chartdate @ 11:59 p.m.
seq_num: seq_num
icu/icustays:
admission:
code: "ICU_ADMISSION//${first_careunit}"
time: intime
discharge:
code: "ICU_DISCHARGE//${last_careunit}"
time: outtime
icu/chartevents:
chartevent:
time: charttime
code: "CHARTEVENT//${itemid}//${valueuom}"
numeric_value: valuenum
Then, our question becomes: how can we use this model to actually extract the data?
Part 3: Using MEDS-Extract to Automate Extraction
So far, all we've built up is a conceptual map on how to think about extracting data to MEDS. Hopefully, in doing so, you've come to see how the simplicity of MEDS gives rise to likewise simple extraction pipelines -- rather than requiring hours or days to understand the various input files, you can often map the rows of input tables into a conceptual specification for MEDS extraction in minutes, even when presented with more complex cases that require some assumptions to be made.
However, as it turns out, not only is this conceptual specification useful theoretically, it also is very close to a precise technical specification that the MEDS-Extract package can use to extract your data in the MEDS format for you.
The MEDS-Extract library leverages MEDS-Transforms to run a full ETL pipeline, with the secret sauce in the middle being the "MEDS-Extract Specification Syntax YAML" (MESSY) file -- which tells you how to map your messy input data into the MEDS format in alignment with this conceptual model.
This file is (as the name implies) in the YAML format and looks much like our specification above. It consists of blocks mapping input source table name to named measurements within the rows of that table, each measurement block having some sentinel properties which map to a prescribed extraction syntax that controls how the input data is parsed. It does, unfortunately, have some limitations that will make certain operations in our conceptual specification a bit harder. Let's dig in!
The MESSY File Format
1. The Outer Structure
First, much like our conceptual specification above, the MESSY file will have a block per input source, within which we'll go through and identify all the measurements we want to extract from that source. In this case, that means we'll have a block for each of the tables we've listed above:
hosp/patients:
...
hosp/admissions:
...
hosp/procedures_icd:
...
icu/icustays:
...
icu/chartevents:
...
Also, much like our specification above, we can specify shared properties at the top level -- so we can add back in our `subject_id` indicator as well, though in the MESSY format, we need to name it `subject_id_col` (for no particularly good reason):
subject_id_col: subject_id
hosp/patients:
...
hosp/admissions:
...
hosp/procedures_icd:
...
icu/icustays:
...
icu/chartevents:
...
2. Measurement blocks
Within each table source, we also need to specify all of the measurements we want to extract. Again, our format will look quite similar, though with a few differences. Our conceptual specification had measurements that looked like each of the following prototypical examples:
gender:
  code: gender
  time: null
death:
  code: MEDS_DEATH
  time: dod @ 11:59 p.m.
birth:
  code: MEDS_BIRTH
  time: anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
admission:
  code: "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
  time: admittime
death:
  code: MEDS_DEATH
  time: deathtime
procedure_icd:
  code: "PROCEDURE//ICD${icd_version}//${icd_code}"
  time: chartdate @ 11:59 p.m.
  seq_num: seq_num
chartevent:
  time: charttime
  code: "CHARTEVENT//${itemid}//${valueuom}"
  numeric_value: valuenum
Let's walk through each to see which features we'll need to change:
Specifying Time Format Strings
A key missing piece here is that we've indicated some columns are "time" columns, but we haven't said how those should be parsed from the (string) input types in our CSV files! This wouldn't be an issue if our inputs were Parquet files or something else with typed timestamp columns, but for CSVs we need to address it. Luckily, this is simple: we can just add a time_format key to each block with a time format string used to parse the column. Refer to the chrono crate documentation for how these format strings should be specified. In this case, we want the following format string for most use cases: time_format: "%Y-%m-%d %H:%M:%S".
What if a column isn't so nicely formatted, and multiple format strings appear in the data? You can also pass a list of format strings to the time_format key; they will be tried in the specified order until one works for a given input; e.g., time_format: ["%Y-%m-%d %H:%M:%S", "%Y"].
We'll omit this added detail from our measurement configs for now in the interest of brevity, but see it added in at the end.
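To build intuition for how a list-valued time_format behaves, here's a minimal Python sketch -- not MEDS-Extract's actual implementation -- that tries each format in order. (For simple formats like these, Python's strptime accepts the same directives as chrono.)

```python
from datetime import datetime

def parse_with_fallback(raw: str, time_formats: list[str]) -> datetime:
    # Try each format string in order; the first one that parses wins.
    for fmt in time_formats:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"No format in {time_formats} matched {raw!r}")

formats = ["%Y-%m-%d %H:%M:%S", "%Y"]
print(parse_with_fallback("2125-03-30 14:02:11", formats))  # matches the first format
print(parse_with_fallback("2104", formats))  # falls back to the "%Y" format
```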
Disambiguating column references from string literals
We can see in the gender and death measurements that, in our conceptual specification, we sometimes used strings to refer to column names and sometimes as string literals. For the code and time fields only, the MESSY file disambiguates column references with col(...) and treats everything else as a string literal. Arbitrary string literals are only allowed for the code field; the only literal the time field accepts is null. So, we'll need to make some changes to these blocks to account for this (note that as we're making changes iteratively, they won't be fully valid until we're done). In some cases, it isn't yet clear how to make the change we're describing, so we'll mark those cases with ??? indicators.
gender: # We're all done with this format -- this block is complete!
  code: col(gender)
  time: null
death:
  code: MEDS_DEATH
  time: ??? # dod @ 11:59 p.m.
birth:
  code: MEDS_BIRTH
  time: ??? # anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
admission:
  code: ??? # "HOSPITAL_ADMISSION//${admission_type}//${admission_location}"
  time: col(admittime)
death: # We're all done with this format -- this block is complete!
  code: MEDS_DEATH
  time: col(deathtime)
procedure_icd:
  code: ??? # "PROCEDURE//ICD${icd_version}//${icd_code}"
  time: ??? # chartdate @ 11:59 p.m.
  seq_num: seq_num # Note that this doesn't need a col(...) specifier
chartevent:
  code: ??? # "CHARTEVENT//${itemid}//${valueuom}"
  time: col(charttime)
  numeric_value: valuenum # Note that this doesn't need a col(...) specifier
Note that resolving this piece has actually "completed" some blocks: gender and death (the admissions-table version) are now feature complete, and can be omitted from the later sections of this tutorial.
String interpolation in the code column
Another feature we see a lot is string interpolation in the code column; e.g., CHARTEVENT//${itemid}//${valueuom}. How can we handle that?
Unfortunately, as of now, MEDS-Extract does not allow generic string interpolation; but it does allow you to specify a list of parts that will be concatenated together with the // separator. This is done by specifying a list of the literals and columns (with the col(...) syntax to denote the latter) directly in the YAML file. Let's see it in action!
death:
  code: MEDS_DEATH
  time: ??? # dod @ 11:59 p.m.
birth:
  code: MEDS_BIRTH
  time: ??? # anchor_year - anchor_age @ Jan 1, 12:00:01 a.m.
admission: # We're all done with this format -- this block is complete!
  code:
    - HOSPITAL_ADMISSION
    - col(admission_type)
    - col(admission_location)
  time: col(admittime)
procedure_icd:
  code:
    - PROCEDURE
    - ??? # ICD${icd_version}
    - col(icd_code)
  time: ??? # chartdate @ 11:59 p.m.
  seq_num: seq_num
chartevent: # We're all done with this format -- this block is complete!
  code:
    - CHARTEVENT
    - col(itemid)
    - col(valueuom)
  time: col(charttime)
  numeric_value: valuenum
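Conceptually, a list-valued code resolves each part per row -- col(name) reads a column, anything else is a literal -- and joins the results with //. Here's a rough pandas sketch of that behavior (illustrative only, not the library's internals):

```python
import pandas as pd

def build_code(df: pd.DataFrame, parts: list[str]) -> pd.Series:
    # Resolve each part: col(name) reads a column, anything else is a
    # string literal. Then join the per-row pieces with "//".
    resolved = []
    for part in parts:
        if part.startswith("col(") and part.endswith(")"):
            resolved.append(df[part[4:-1]].astype(str))
        else:
            resolved.append(pd.Series([part] * len(df), index=df.index))
    out = resolved[0]
    for piece in resolved[1:]:
        out = out + "//" + piece
    return out

# itemid/valueuom values here are made up for illustration.
df = pd.DataFrame({"itemid": [220045], "valueuom": ["bpm"]})
print(build_code(df, ["CHARTEVENT", "col(itemid)", "col(valueuom)"]).iloc[0])
# CHARTEVENT//220045//bpm
```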
With this change, we've knocked out two blocks, but we see there is a tricky issue with a third -- the list syntax can't express the code we want for procedure_icd, because ICD${icd_version} fuses a literal and a column into a single part. This is unfortunate, but for now it is unavoidable, so we'll have to change what we want the code string to be, separating the ICD part from the version with another //:
procedure_icd:
  code:
    - PROCEDURE
    - ICD
    - col(icd_version)
    - col(icd_code)
  time: ??? # chartdate @ 11:59 p.m.
  seq_num: seq_num
Timestamp Resolution and basic arithmetic
Now we come to the tricky ones: each remaining source of uncertainty involves one of two problems -- either (a) we need to resolve a date to a specific time of day (e.g., chartdate @ 11:59 p.m.) or (b) we need to perform some simple arithmetic (e.g., anchor_year - anchor_age).
MEDS-Extract does not currently support either of these operations. So, they need to happen in a "pre-MEDS" step, where we have some custom code go through and perform these operations for us on the raw dataframes, before we call MEDS-Extract. There are some other operations that might be required that MEDS-Extract can't handle currently that you should know about (even though we don't need them here), such as:
- Joining multiple tables together to ensure the subject_id is present in all cases.
- Adjusting "offset" time columns into true datetime columns (this is really just a case of arithmetic and datetime parsing, but it warrants an explicit mention).
- Any data filtering that needs to happen before MEDS extraction occurs (though often data cleaning can happen after the MEDS conversion process as well).
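As an example of the "offset" adjustment, converting an offset column is just datetime parsing plus arithmetic. A hypothetical pandas sketch (the offset column here is made up; the MIMIC demo doesn't need this step):

```python
import pandas as pd

# Hypothetical raw table: event times stored as minute offsets from the
# admission time rather than as absolute datetimes.
df = pd.DataFrame({
    "admittime": ["2125-03-20 08:00:00"],
    "event_offset_minutes": [90],
})

# Parse the anchor datetime, then add the offset to get a true datetime.
df["event_time"] = (
    pd.to_datetime(df["admittime"], format="%Y-%m-%d %H:%M:%S")
    + pd.to_timedelta(df["event_offset_minutes"], unit="m")
)
print(df["event_time"].iloc[0])  # 2125-03-20 09:30:00
```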
Let's write a simple pre-MEDS step we can run here.
Pre-MEDS
Our Pre-MEDS step will have a few simple goals:
- Subtract the anchor age from the anchor year to get a "year of birth".
- Resolve the timestamps in hosp/procedures_icd to end-of-day.
- Remove the duplication between the dod column in hosp/patients and the deathtime column in hosp/admissions, favoring the latter where both are specified.
We'll write this using pandas for now, but you can use whatever tooling you prefer for your data.
import pandas as pd
from datetime import timedelta

def get_year_of_birth(df: pd.DataFrame) -> pd.DataFrame:
    # Subtract the anchor age from the anchor year to get a year of birth.
    df["year_of_birth"] = (
        df["anchor_year"].astype(int) - df["anchor_age"].astype(int)
    ).astype(str)
    return df

def put_procedure_at_EOD(df: pd.DataFrame) -> pd.DataFrame:
    # Resolve the date-only chartdate column to 11:59:59 p.m. on that day.
    df["chartdate"] = (
        pd.to_datetime(df["chartdate"], format="%Y-%m-%d")
        + timedelta(hours=23, minutes=59, seconds=59)
    )
    return df

def remove_dod_duplication_and_put_at_EOD(
    patients_df: pd.DataFrame,
    admissions_df: pd.DataFrame,
) -> pd.DataFrame:
    # Null out dod for subjects who have a (more precise) deathtime in the
    # admissions table, then push the remaining death dates to end-of-day.
    subjects_with_deathtime = (
        admissions_df[~admissions_df["deathtime"].isna()]["subject_id"]
    )
    idx = patients_df["subject_id"].isin(subjects_with_deathtime)
    patients_df.loc[idx, "dod"] = None
    patients_df["dod"] = (
        pd.to_datetime(patients_df["dod"], format="%Y-%m-%d")
        + timedelta(hours=23, minutes=59, seconds=59)
    )
    return patients_df
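To convince ourselves the deduplication rule behaves as intended, here's a quick standalone check on toy data (the subject IDs and dates here are made up): subject 1 has a deathtime in admissions, so their dod is nulled out; subject 2 does not, so their dod is kept and pushed to end-of-day.

```python
import pandas as pd

patients = pd.DataFrame({"subject_id": [1, 2], "dod": ["2143-03-30", "2146-02-09"]})
admissions = pd.DataFrame({"subject_id": [1], "deathtime": ["2143-03-30 18:22:00"]})

# Same logic as remove_dod_duplication_and_put_at_EOD, inlined for the check.
has_deathtime = admissions.loc[~admissions["deathtime"].isna(), "subject_id"]
patients.loc[patients["subject_id"].isin(has_deathtime), "dod"] = None
patients["dod"] = pd.to_datetime(patients["dod"], format="%Y-%m-%d") + pd.Timedelta(
    hours=23, minutes=59, seconds=59
)
print(patients)  # subject 1's dod is NaT; subject 2's is 2146-02-09 23:59:59
```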
We'll store the output of our pre-MEDS stage in an "intermediate directory" called intermediate_dir -- that way we can always re-use our raw data.
from pathlib import Path

INTERMEDIATE_DIR = Path("intermediate_dir")

for name, df in dfs.items():
    if name == "hosp/patients":
        df = get_year_of_birth(df)
        df = remove_dod_duplication_and_put_at_EOD(df, dfs["hosp/admissions"])
    elif name == "hosp/procedures_icd":
        df = put_procedure_at_EOD(df)
    out_fp = INTERMEDIATE_DIR / f"{name}.parquet"
    out_fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_fp)
%%bash
tree intermediate_dir
intermediate_dir ├── hosp │ ├── admissions.parquet │ ├── patients.parquet │ └── procedures_icd.parquet └── icu ├── chartevents.parquet └── icustays.parquet 2 directories, 5 files
pd.read_parquet(INTERMEDIATE_DIR / "hosp/patients.parquet").head(5)
|   | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod | year_of_birth |
|---|---|---|---|---|---|---|---|
| 0 | 10014729 | F | 21 | 2125 | 2011 - 2013 | NaT | 2104 |
| 1 | 10003400 | F | 72 | 2134 | 2011 - 2013 | NaT | 2062 |
| 2 | 10002428 | F | 80 | 2155 | 2011 - 2013 | NaT | 2075 |
| 3 | 10032725 | F | 38 | 2143 | 2011 - 2013 | 2143-03-30 23:59:59 | 2105 |
| 4 | 10027445 | F | 48 | 2142 | 2011 - 2013 | 2146-02-09 23:59:59 | 2094 |
pd.read_parquet(INTERMEDIATE_DIR / "hosp/procedures_icd.parquet").head(5)
|   | subject_id | hadm_id | seq_num | chartdate | icd_code | icd_version |
|---|---|---|---|---|---|---|
| 0 | 10011398 | 27505812 | 3 | 2146-12-15 23:59:59 | 3961 | 9 |
| 1 | 10011398 | 27505812 | 2 | 2146-12-15 23:59:59 | 3615 | 9 |
| 2 | 10011398 | 27505812 | 1 | 2146-12-15 23:59:59 | 3614 | 9 |
| 3 | 10014729 | 23300884 | 4 | 2125-03-23 23:59:59 | 3897 | 9 |
| 4 | 10014729 | 23300884 | 1 | 2125-03-20 23:59:59 | 3403 | 9 |
The final MESSY File
Now that we've resolved our remaining issues, let's put together our final, complete MESSY file!
YAML_contents = """
subject_id_col: subject_id
hosp/patients:
  gender:
    code: col(gender)
    time: null
  death:
    code: MEDS_DEATH
    time: col(dod)
  birth:
    code: MEDS_BIRTH
    time: col(year_of_birth)
    time_format: "%Y"
hosp/admissions:
  ed_registration:
    code: ED_REGISTRATION
    time: col(edregtime)
    time_format: "%Y-%m-%d %H:%M:%S"
  ed_out:
    code: ED_OUT
    time: col(edouttime)
    time_format: "%Y-%m-%d %H:%M:%S"
  admission:
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
      - col(admission_location)
    time: col(admittime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
  discharge:
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(dischtime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
hosp/procedures_icd:
  procedure_icd:
    code:
      - PROCEDURE
      - ICD
      - col(icd_version)
      - col(icd_code)
    time: col(chartdate)
    seq_num: seq_num
icu/icustays:
  admission:
    code:
      - ICU_ADMISSION
      - col(first_careunit)
    time: col(intime)
    time_format: "%Y-%m-%d %H:%M:%S"
  discharge:
    code:
      - ICU_DISCHARGE
      - col(last_careunit)
    time: col(outtime)
    time_format: "%Y-%m-%d %H:%M:%S"
icu/chartevents:
  chartevent:
    code:
      - CHARTEVENT
      - col(itemid)
      - col(valueuom)
    time: col(charttime)
    time_format: "%Y-%m-%d %H:%M:%S"
    numeric_value: valuenum
"""
YAML_fp = Path("MESSY.yaml")
YAML_fp.write_text(YAML_contents)
print(YAML_fp.read_text())
subject_id_col: subject_id
hosp/patients:
  gender:
    code: col(gender)
    time: null
  death:
    code: MEDS_DEATH
    time: col(dod)
  birth:
    code: MEDS_BIRTH
    time: col(year_of_birth)
    time_format: "%Y"
hosp/admissions:
  ed_registration:
    code: ED_REGISTRATION
    time: col(edregtime)
    time_format: "%Y-%m-%d %H:%M:%S"
  ed_out:
    code: ED_OUT
    time: col(edouttime)
    time_format: "%Y-%m-%d %H:%M:%S"
  admission:
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
      - col(admission_location)
    time: col(admittime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
  discharge:
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(dischtime)
    time_format: "%Y-%m-%d %H:%M:%S"
    hadm_id: hadm_id
hosp/procedures_icd:
  procedure_icd:
    code:
      - PROCEDURE
      - ICD
      - col(icd_version)
      - col(icd_code)
    time: col(chartdate)
    seq_num: seq_num
icu/icustays:
  admission:
    code:
      - ICU_ADMISSION
      - col(first_careunit)
    time: col(intime)
    time_format: "%Y-%m-%d %H:%M:%S"
  discharge:
    code:
      - ICU_DISCHARGE
      - col(last_careunit)
    time: col(outtime)
    time_format: "%Y-%m-%d %H:%M:%S"
icu/chartevents:
  chartevent:
    code:
      - CHARTEVENT
      - col(itemid)
      - col(valueuom)
    time: col(charttime)
    time_format: "%Y-%m-%d %H:%M:%S"
    numeric_value: valuenum
Using the MESSY File -- how do you run MEDS-Extract?
With the MESSY file specified, running MEDS-Extract is easy. There are two steps. First, install the package:
!pip --quiet install MEDS-extract
Next, use the typical MEDS-Transforms syntax for running a dependent pipeline, and pass in the override variables you want. In our case, the command will look like the below:
%%bash
MEDS_transform-pipeline \
pkg://MEDS_extract.configs._extract.yaml \
--overrides \
input_dir=intermediate_dir \
output_dir=output_dir \
event_conversion_config_fp=MESSY.yaml \
dataset.name=KDD_Tutorial \
dataset.version=1.0
If we've done things right, the above cell should complete with no errors -- if not, we'll need to debug. Thankfully, MEDS-Extract writes out some nice logs to help with this, which we can find in the output directory, under output_dir/.logs/pipeline.log:
!cat output_dir/.logs/pipeline.log
INFO:root:Running MEDS-Transforms Pipeline Runner with the following arguments: INFO:root: pipeline_config_fp: pkg://MEDS_extract.configs._extract.yaml INFO:root: stage_runner_fp: None INFO:root: do_profile: False INFO:root: overrides: ['input_dir=intermediate_dir', 'output_dir=output_dir', 'event_conversion_config_fp=MESSY.yaml', 'dataset.name=KDD_Tutorial', 'dataset.version=1.0'] INFO:MEDS_transforms.runner:No parallelization configuration provided. INFO:MEDS_transforms.runner:Running stage: shard_events INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml shard_events stage=shard_events input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:41,164][MEDS_extract.shard_events.shard_events][INFO] - Running with config: dataset: root_dir: ??? name: KDD_Tutorial version: '1.0' code_modifiers: [] input_dir: intermediate_dir output_dir: output_dir _default_description: 'This is a MEDS pipeline ETL. Please set a more detailed description at the top of your specific pipeline configuration file.' log_dir: ${stage_cfg.output_dir}/.logs do_overwrite: false seed: 1 worker: 0 polling_time: 300 stages: - shard_events - split_and_shard_subjects - convert_to_subject_sharded - convert_to_MEDS_events - merge_to_MEDS_cohort - extract_code_metadata - finalize_MEDS_metadata - finalize_MEDS_data stage: shard_events etl_metadata: pipeline_name: MEDS-Transforms Pipeline dataset_name: ${dataset.name} dataset_version: ${dataset.version} package_name: ${get_package_name:} package_version: ${get_package_version:} code_modifiers: null etl_metadata.pipeline_name: extract description: "This pipeline extracts raw MEDS events in longitudinal, sparse form\ \ from an input dataset meeting select\ncriteria and converts them to the flattened,\ \ MEDS format. 
It can be run in its entirety, with controllable\nlevels of parallelism,\ \ or in stages. Arguments:\n - `event_conversion_config_fp`: The path to the event\ \ conversion configuration file. This file defines\n the events to extract from\ \ the various rows of the various input files encountered in the global input\n\ \ directory.\n - `input_dir`: The path to the directory containing the raw input\ \ files.\n - `output_dir`: The path to the directory where the output cohort will\ \ be written. It will be written in\n various subfolders of this dir depending\ \ on the stage, as intermediate stages cache their output during\n computation\ \ for efficiency of re-running and distributing." event_conversion_config_fp: MESSY.yaml shards_map_fp: ${output_dir}/metadata/.shards.json cloud_io_storage_options: {} stage_cfg: row_chunksize: 200000000 infer_schema_length: 10000 data_input_dir: ${input_dir}/data metadata_input_dir: ${input_dir}/metadata reducer_output_dir: null train_only: false output_dir: ${output_dir}/shard_events Stage: shard_events Stage config: row_chunksize: 200000000 infer_schema_length: 10000 data_input_dir: ${input_dir}/data metadata_input_dir: ${input_dir}/metadata reducer_output_dir: null train_only: false output_dir: ${output_dir}/shard_events [2025-07-17 17:49:41,167][MEDS_extract.shard_events.shard_events][INFO] - Reading event conversion config from MESSY.yaml to identify needed columns. [2025-07-17 17:49:41,195][MEDS_extract.shard_events.shard_events][INFO] - Starting event sub-sharding. 
Sub-sharding 5 files: * /content/intermediate_dir/hosp/admissions.parquet * /content/intermediate_dir/hosp/procedures_icd.parquet * /content/intermediate_dir/icu/chartevents.parquet * /content/intermediate_dir/icu/icustays.parquet * /content/intermediate_dir/hosp/patients.parquet [2025-07-17 17:49:41,196][MEDS_extract.shard_events.shard_events][INFO] - Will read raw data from /content/intermediate_dir/$IN_FILE.parquet and write sub-sharded data to output_dir/shard_events/$IN_FILE/$ROW_START-$ROW_END.parquet [2025-07-17 17:49:41,198][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/hosp/admissions.parquet to output_dir/shard_events/hosp/admissions. [2025-07-17 17:49:41,206][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/hosp/admissions.parquet to determine row count. [2025-07-17 17:49:41,207][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:41,234][MEDS_extract.shard_events.shard_events][INFO] - Read 275 rows from /content/intermediate_dir/hosp/admissions.parquet. [2025-07-17 17:49:41,234][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/hosp/admissions.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:41,235][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/hosp/admissions.parquet row-chunk [0-275) to output_dir/shard_events/hosp/admissions/[0-275).parquet. [2025-07-17 17:49:41,236][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/hosp/admissions.parquet [2025-07-17 17:49:41,236][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:41,236][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:41,237][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/hosp/admissions/[0-275).parquet [2025-07-17 17:49:41,261][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.024981 [2025-07-17 17:49:41,262][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/hosp/procedures_icd.parquet to output_dir/shard_events/hosp/procedures_icd. [2025-07-17 17:49:41,262][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/hosp/procedures_icd.parquet to determine row count. [2025-07-17 17:49:41,263][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:41,273][MEDS_extract.shard_events.shard_events][INFO] - Read 722 rows from /content/intermediate_dir/hosp/procedures_icd.parquet. [2025-07-17 17:49:41,274][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/hosp/procedures_icd.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:41,274][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/hosp/procedures_icd.parquet row-chunk [0-722) to output_dir/shard_events/hosp/procedures_icd/[0-722).parquet. [2025-07-17 17:49:41,275][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/hosp/procedures_icd.parquet [2025-07-17 17:49:41,275][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:41,276][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:41,276][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/hosp/procedures_icd/[0-722).parquet [2025-07-17 17:49:41,284][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.009133 [2025-07-17 17:49:41,290][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/icu/chartevents.parquet to output_dir/shard_events/icu/chartevents. [2025-07-17 17:49:41,291][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/icu/chartevents.parquet to determine row count. [2025-07-17 17:49:41,291][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:41,304][MEDS_extract.shard_events.shard_events][INFO] - Read 668862 rows from /content/intermediate_dir/icu/chartevents.parquet. [2025-07-17 17:49:41,304][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/icu/chartevents.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:41,305][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/icu/chartevents.parquet row-chunk [0-668862) to output_dir/shard_events/icu/chartevents/[0-668862).parquet. [2025-07-17 17:49:41,305][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/icu/chartevents.parquet [2025-07-17 17:49:41,306][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:41,306][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:41,307][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/icu/chartevents/[0-668862).parquet [2025-07-17 17:49:42,184][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.878552 [2025-07-17 17:49:42,188][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/icu/icustays.parquet to output_dir/shard_events/icu/icustays. [2025-07-17 17:49:42,188][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/icu/icustays.parquet to determine row count. [2025-07-17 17:49:42,189][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:42,193][MEDS_extract.shard_events.shard_events][INFO] - Read 140 rows from /content/intermediate_dir/icu/icustays.parquet. [2025-07-17 17:49:42,193][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/icu/icustays.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:42,193][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/icu/icustays.parquet row-chunk [0-140) to output_dir/shard_events/icu/icustays/[0-140).parquet. [2025-07-17 17:49:42,194][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/icu/icustays.parquet [2025-07-17 17:49:42,194][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:42,195][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:42,195][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/icu/icustays/[0-140).parquet [2025-07-17 17:49:42,207][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.013370 [2025-07-17 17:49:42,209][MEDS_extract.shard_events.shard_events][INFO] - Processing intermediate_dir/hosp/patients.parquet to output_dir/shard_events/hosp/patients. [2025-07-17 17:49:42,209][MEDS_extract.shard_events.shard_events][INFO] - Performing preliminary read of /content/intermediate_dir/hosp/patients.parquet to determine row count. [2025-07-17 17:49:42,210][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. [2025-07-17 17:49:42,218][MEDS_extract.shard_events.shard_events][INFO] - Read 100 rows from /content/intermediate_dir/hosp/patients.parquet. [2025-07-17 17:49:42,218][MEDS_extract.shard_events.shard_events][INFO] - Splitting intermediate_dir/hosp/patients.parquet into 1 row-chunks of size 200000000. [2025-07-17 17:49:42,218][MEDS_extract.shard_events.shard_events][INFO] - Writing file 1/1: intermediate_dir/hosp/patients.parquet row-chunk [0-100) to output_dir/shard_events/hosp/patients/[0-100).parquet. [2025-07-17 17:49:42,219][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from intermediate_dir/hosp/patients.parquet [2025-07-17 17:49:42,220][MEDS_extract.shard_events.shard_events][INFO] - Ignoring infer_schema_length=10000 for Parquet files. 
[2025-07-17 17:49:42,221][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:42,221][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/shard_events/hosp/patients/[0-100).parquet [2025-07-17 17:49:42,230][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.011276 [2025-07-17 17:49:42,231][MEDS_extract.shard_events.shard_events][INFO] - Sub-sharding completed in 0:00:01.034039 INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: split_and_shard_subjects INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml split_and_shard_subjects stage=split_and_shard_subjects input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:45,804][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading event conversion config from MESSY.yaml (needed for subject ID columns) [2025-07-17 17:49:45,820][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Event conversion config: subject_id_col: subject_id hosp/patients: gender: code: col(gender) time: null death: code: MEDS_DEATH time: col(dod) birth: code: MEDS_BIRTH time: col(year_of_birth) time_format: '%Y' hosp/admissions: ed_registration: code: ED_REGISTRATION time: col(edregtime) time_format: '%Y-%m-%d %H:%M:%S' ed_out: code: ED_OUT time: col(edouttime) time_format: '%Y-%m-%d %H:%M:%S' admission: code: - HOSPITAL_ADMISSION - col(admission_type) - col(admission_location) time: col(admittime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id discharge: code: - HOSPITAL_DISCHARGE - col(discharge_location) time: col(dischtime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id hosp/procedures_icd: procedure_icd: code: - PROCEDURE - ICD - coi(icd_version) - col(icd_code) time: col(chartdate) seq_num: seq_num icu/icustays: 
admission: code: - ICU_ADMISSION - col(first_careunit) time: col(intime) time_format: '%Y-%m-%d %H:%M:%S' discharge: code: - ICU_DISCHARGE - col(last_careunit) time: col(outtime) time_format: '%Y-%m-%d %H:%M:%S' icu/chartevents: chartevent: code: - CHARTEVENT - col(itemid) - col(valueuom) time: col(charttime) time_format: '%Y-%m-%d %H:%M:%S' numeric_value: valuenum [2025-07-17 17:49:45,821][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from hosp/patients files: - /content/output_dir/shard_events/hosp/patients/[0-100).parquet [2025-07-17 17:49:45,823][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from hosp/admissions files: - /content/output_dir/shard_events/hosp/admissions/[0-275).parquet [2025-07-17 17:49:45,823][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from hosp/procedures_icd files: - /content/output_dir/shard_events/hosp/procedures_icd/[0-722).parquet [2025-07-17 17:49:45,824][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from icu/icustays files: - /content/output_dir/shard_events/icu/icustays/[0-140).parquet [2025-07-17 17:49:45,824][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Reading subject IDs from icu/chartevents files: - /content/output_dir/shard_events/icu/chartevents/[0-668862).parquet [2025-07-17 17:49:45,824][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Joining all subject IDs from 5 dataframes [2025-07-17 17:49:45,925][numexpr.utils][INFO] - NumExpr defaulting to 2 threads. 
[2025-07-17 17:49:46,127][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Found 100 unique subject IDs of type int64 [2025-07-17 17:49:46,128][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Sharding and splitting subjects [2025-07-17 17:49:46,129][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Split train/0 has 80 subjects. [2025-07-17 17:49:46,129][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Split tuning/0 has 10 subjects. [2025-07-17 17:49:46,129][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Split held_out/0 has 10 subjects. [2025-07-17 17:49:46,130][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Writing sharded subjects to /content/output_dir/metadata/.shards.json [2025-07-17 17:49:46,131][MEDS_extract.split_and_shard_subjects.split_and_shard_subjects][INFO] - Done writing sharded subjects INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: convert_to_subject_sharded INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml convert_to_subject_sharded stage=convert_to_subject_sharded input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:48,010][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Starting subject sharding. 
[2025-07-17 17:49:48,010][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Reading event conversion config from MESSY.yaml
[2025-07-17 17:49:48,026][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Event conversion config:
subject_id_col: subject_id
hosp/patients:
  gender:
    code: col(gender)
    time: null
  death:
    code: MEDS_DEATH
    time: col(dod)
  birth:
    code: MEDS_BIRTH
    time: col(year_of_birth)
    time_format: '%Y'
hosp/admissions:
  ed_registration:
    code: ED_REGISTRATION
    time: col(edregtime)
    time_format: '%Y-%m-%d %H:%M:%S'
  ed_out:
    code: ED_OUT
    time: col(edouttime)
    time_format: '%Y-%m-%d %H:%M:%S'
  admission:
    code:
      - HOSPITAL_ADMISSION
      - col(admission_type)
      - col(admission_location)
    time: col(admittime)
    time_format: '%Y-%m-%d %H:%M:%S'
    hadm_id: hadm_id
  discharge:
    code:
      - HOSPITAL_DISCHARGE
      - col(discharge_location)
    time: col(dischtime)
    time_format: '%Y-%m-%d %H:%M:%S'
    hadm_id: hadm_id
hosp/procedures_icd:
  procedure_icd:
    code:
      - PROCEDURE
      - ICD
      - coi(icd_version)
      - col(icd_code)
    time: col(chartdate)
    seq_num: seq_num
icu/icustays:
  admission:
    code:
      - ICU_ADMISSION
      - col(first_careunit)
    time: col(intime)
    time_format: '%Y-%m-%d %H:%M:%S'
  discharge:
    code:
      - ICU_DISCHARGE
      - col(last_careunit)
    time: col(outtime)
    time_format: '%Y-%m-%d %H:%M:%S'
icu/chartevents:
  chartevent:
    code:
      - CHARTEVENT
      - col(itemid)
      - col(valueuom)
    time: col(charttime)
    time_format: '%Y-%m-%d %H:%M:%S'
    numeric_value: valuenum
[2025-07-17 17:49:48,028][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/hosp/patients/[0-100).parquet')]
[2025-07-17 17:49:48,030][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,030][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/hosp/patients.parquet
[2025-07-17 17:49:48,036][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.008897
[2025-07-17 17:49:48,038][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/hosp/procedures_icd/[0-722).parquet')]
[2025-07-17 17:49:48,039][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,039][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/hosp/procedures_icd.parquet
[2025-07-17 17:49:48,042][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.003652
[2025-07-17 17:49:48,044][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/hosp/admissions/[0-275).parquet')]
[2025-07-17 17:49:48,044][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,044][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/hosp/admissions.parquet
[2025-07-17 17:49:48,047][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.003944
[2025-07-17 17:49:48,049][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/icu/chartevents/[0-668862).parquet')]
[2025-07-17 17:49:48,050][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,050][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/icu/chartevents.parquet
[2025-07-17 17:49:48,105][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.056211
[2025-07-17 17:49:48,107][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from [PosixPath('output_dir/shard_events/icu/icustays/[0-140).parquet')]
[2025-07-17 17:49:48,107][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:48,107][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_subject_sharded/tuning/0/icu/icustays.parquet
[2025-07-17 17:49:48,110][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.003717
... (the same read/write cycle repeats for each table in the held_out and train splits) ...
[2025-07-17 17:49:48,385][MEDS_extract.convert_to_subject_sharded.convert_to_subject_sharded][INFO] - Created a subject-sharded view.
INFO:MEDS_transforms.runner:Command error:
/usr/local/lib/python3.11/dist-packages/MEDS_extract/convert_to_subject_sharded/convert_to_subject_sharded.py:75: PerformanceWarning: Resolving the schema of a LazyFrame is a potentially expensive operation. Use `LazyFrame.collect_schema()` to get the schema without this warning.
  typed_subjects = pl.Series(subjects, dtype=dfs[0].schema[input_subject_id_column])
INFO:MEDS_transforms.runner:Running stage: convert_to_MEDS_events
INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml convert_to_MEDS_events stage=convert_to_MEDS_events input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0
INFO:MEDS_transforms.runner:Command output:
[2025-07-17 17:49:50,205][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Starting event conversion.
[2025-07-17 17:49:50,205][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Reading event conversion config from MESSY.yaml
[2025-07-17 17:49:50,221][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Event conversion config:
... (the same config as echoed above) ...
[2025-07-17 17:49:50,227][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/hosp/patients.parquet
[2025-07-17 17:49:50,228][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,228][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for hosp/patients
[2025-07-17 17:49:50,229][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting gender
[2025-07-17 17:49:50,230][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column gender
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding null literate for time
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null codes via col("gender").is_not_null()
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting death
[2025-07-17 17:49:50,231][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - dod should already be of Datetime type
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("dod").is_not_null()
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting birth
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column year_of_birth in possible formats %Y
[2025-07-17 17:49:50,232][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("year_of_birth").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,233][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/hosp/patients.parquet
[2025-07-17 17:49:50,241][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.013788
[2025-07-17 17:49:50,242][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/icu/chartevents.parquet
[2025-07-17 17:49:50,243][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,243][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for icu/chartevents
[2025-07-17 17:49:50,244][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting chartevent
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column valueuom
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column itemid
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column charttime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,245][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("charttime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,245][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/icu/chartevents.parquet
[2025-07-17 17:49:50,695][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.452166
[2025-07-17 17:49:50,696][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/hosp/procedures_icd.parquet
[2025-07-17 17:49:50,696][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,697][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for hosp/procedures_icd
[2025-07-17 17:49:50,697][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting procedure_icd
[2025-07-17 17:49:50,698][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column icd_code
[2025-07-17 17:49:50,698][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - chartdate should already be of Datetime type
[2025-07-17 17:49:50,698][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("chartdate").is_not_null()
[2025-07-17 17:49:50,699][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/hosp/procedures_icd.parquet
[2025-07-17 17:49:50,702][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.006156
[2025-07-17 17:49:50,704][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/hosp/admissions.parquet
[2025-07-17 17:49:50,704][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,705][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for hosp/admissions
[2025-07-17 17:49:50,706][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting ed_registration
[2025-07-17 17:49:50,706][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column edregtime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("edregtime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting ed_out
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column edouttime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,707][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("edouttime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting admission
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column admission_type
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column admission_location
[2025-07-17 17:49:50,708][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column admittime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,709][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("admittime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,709][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting discharge
[2025-07-17 17:49:50,709][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column discharge_location
[2025-07-17 17:49:50,710][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column dischtime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,710][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("dischtime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,710][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/hosp/admissions.parquet
[2025-07-17 17:49:50,716][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.011979
[2025-07-17 17:49:50,718][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_subject_sharded/train/0/icu/icustays.parquet
[2025-07-17 17:49:50,718][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset
[2025-07-17 17:49:50,718][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting events for icu/icustays
[2025-07-17 17:49:50,719][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting admission
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column first_careunit
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column intime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("intime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,720][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Building computational graph for extracting discharge
[2025-07-17 17:49:50,721][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Extracting column last_careunit
[2025-07-17 17:49:50,721][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Adding time column outtime in possible formats %Y-%m-%d %H:%M:%S
[2025-07-17 17:49:50,721][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Filtering out rows with null times via col("outtime").str.strptime(["raise"]).coalesce().is_not_null()
[2025-07-17 17:49:50,721][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/convert_to_MEDS_events/train/0/icu/icustays.parquet
[2025-07-17 17:49:50,725][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.007194
... (the same extraction log repeats for each table in the held_out and tuning splits) ...
[2025-07-17 17:49:50,872][MEDS_extract.convert_to_MEDS_events.convert_to_MEDS_events][INFO] - Subsharded into converted events.
INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: merge_to_MEDS_cohort INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml merge_to_MEDS_cohort stage=merge_to_MEDS_cohort input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:52,716][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Mapping computation over a maximum of 3 shards [2025-07-17 17:49:52,717][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/convert_to_MEDS_events/held_out/0 into /content/output_dir/merge_to_MEDS_cohort/held_out/0.parquet [2025-07-17 17:49:52,717][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_MEDS_events/held_out/0 [2025-07-17 17:49:52,718][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Reading 5 files: - /content/output_dir/convert_to_MEDS_events/held_out/0/hosp/patients.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/hosp/admissions.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/hosp/procedures_icd.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/icu/icustays.parquet - /content/output_dir/convert_to_MEDS_events/held_out/0/icu/chartevents.parquet [2025-07-17 17:49:52,720][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:52,720][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/merge_to_MEDS_cohort/held_out/0.parquet [2025-07-17 17:49:52,752][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.034519 [2025-07-17 17:49:52,752][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/convert_to_MEDS_events/tuning/0 into /content/output_dir/merge_to_MEDS_cohort/tuning/0.parquet [2025-07-17 
17:49:52,753][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_MEDS_events/tuning/0 [2025-07-17 17:49:52,753][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Reading 5 files: - /content/output_dir/convert_to_MEDS_events/tuning/0/hosp/patients.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/hosp/admissions.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/hosp/procedures_icd.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/icu/icustays.parquet - /content/output_dir/convert_to_MEDS_events/tuning/0/icu/chartevents.parquet [2025-07-17 17:49:52,754][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:52,754][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/merge_to_MEDS_cohort/tuning/0.parquet [2025-07-17 17:49:52,799][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.046454 [2025-07-17 17:49:52,800][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/convert_to_MEDS_events/train/0 into /content/output_dir/merge_to_MEDS_cohort/train/0.parquet [2025-07-17 17:49:52,801][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/convert_to_MEDS_events/train/0 [2025-07-17 17:49:52,801][MEDS_extract.merge_to_MEDS_cohort.merge_to_MEDS_cohort][INFO] - Reading 5 files: - /content/output_dir/convert_to_MEDS_events/train/0/hosp/patients.parquet - /content/output_dir/convert_to_MEDS_events/train/0/hosp/admissions.parquet - /content/output_dir/convert_to_MEDS_events/train/0/hosp/procedures_icd.parquet - /content/output_dir/convert_to_MEDS_events/train/0/icu/icustays.parquet - /content/output_dir/convert_to_MEDS_events/train/0/icu/chartevents.parquet [2025-07-17 17:49:52,802][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:49:52,802][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/merge_to_MEDS_cohort/train/0.parquet [2025-07-17 
17:49:53,295][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.494605 [2025-07-17 17:49:53,295][MEDS_transforms.mapreduce.stage][INFO] - Finished mapping in 0:00:00.581387 INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: extract_code_metadata INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml extract_code_metadata stage=extract_code_metadata input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:55,490][MEDS_extract.extract_code_metadata.extract_code_metadata][INFO] - Reading event conversion config from MESSY.yaml [2025-07-17 17:49:55,515][MEDS_extract.extract_code_metadata.extract_code_metadata][INFO] - Event conversion config: subject_id_col: subject_id hosp/patients: gender: code: col(gender) time: null death: code: MEDS_DEATH time: col(dod) birth: code: MEDS_BIRTH time: col(year_of_birth) time_format: '%Y' hosp/admissions: ed_registration: code: ED_REGISTRATION time: col(edregtime) time_format: '%Y-%m-%d %H:%M:%S' ed_out: code: ED_OUT time: col(edouttime) time_format: '%Y-%m-%d %H:%M:%S' admission: code: - HOSPITAL_ADMISSION - col(admission_type) - col(admission_location) time: col(admittime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id discharge: code: - HOSPITAL_DISCHARGE - col(discharge_location) time: col(dischtime) time_format: '%Y-%m-%d %H:%M:%S' hadm_id: hadm_id hosp/procedures_icd: procedure_icd: code: - PROCEDURE - ICD - coi(icd_version) - col(icd_code) time: col(chartdate) seq_num: seq_num icu/icustays: admission: code: - ICU_ADMISSION - col(first_careunit) time: col(intime) time_format: '%Y-%m-%d %H:%M:%S' discharge: code: - ICU_DISCHARGE - col(last_careunit) time: col(outtime) time_format: '%Y-%m-%d %H:%M:%S' icu/chartevents: chartevent: code: - CHARTEVENT - col(itemid) - col(valueuom) time: 
col(charttime) time_format: '%Y-%m-%d %H:%M:%S' numeric_value: valuenum [2025-07-17 17:49:55,523][MEDS_extract.extract_code_metadata.extract_code_metadata][INFO] - No _metadata blocks in the event_conversion_config.yaml found. Exiting... INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: finalize_MEDS_metadata INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml finalize_MEDS_metadata stage=finalize_MEDS_metadata input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:49:58,056][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Validating code metadata [2025-07-17 17:49:58,056][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - No code metadata found at output_dir/extract_code_metadata/codes.parquet. Making empty metadata file. [2025-07-17 17:49:58,133][numexpr.utils][INFO] - NumExpr defaulting to 2 threads. 
[2025-07-17 17:49:58,336][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Writing finalized metadata df to /content/output_dir/metadata/codes.parquet [2025-07-17 17:49:58,337][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Creating dataset metadata [2025-07-17 17:49:58,340][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Writing finalized dataset metadata to /content/output_dir/metadata/dataset.json [2025-07-17 17:49:58,340][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Creating subject splits from {str(shards_map_fp.resolve())} [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Split train has 80 subjects [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Split tuning has 10 subjects [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Split held_out has 10 subjects [2025-07-17 17:49:58,341][MEDS_extract.finalize_MEDS_metadata.finalize_MEDS_metadata][INFO] - Writing finalized subject splits to /content/output_dir/metadata/subject_splits.parquet INFO:MEDS_transforms.runner:Command error: INFO:MEDS_transforms.runner:Running stage: finalize_MEDS_data INFO:MEDS_transforms.runner:Running command: MEDS_transform-stage pkg://MEDS_extract.configs._extract.yaml finalize_MEDS_data stage=finalize_MEDS_data input_dir=intermediate_dir output_dir=output_dir event_conversion_config_fp=MESSY.yaml dataset.name=KDD_Tutorial dataset.version=1.0 INFO:MEDS_transforms.runner:Command output: [2025-07-17 17:50:00,222][MEDS_transforms.mapreduce.shard_iteration][INFO] - Mapping computation over a maximum of 3 shards [2025-07-17 17:50:00,223][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/merge_to_MEDS_cohort/held_out/0.parquet into /content/output_dir/data/held_out/0.parquet [2025-07-17 17:50:00,224][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input 
dataframe from output_dir/merge_to_MEDS_cohort/held_out/0.parquet [2025-07-17 17:50:00,224][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:50:00,235][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/data/held_out/0.parquet [2025-07-17 17:50:00,242][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.018772 [2025-07-17 17:50:00,243][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/merge_to_MEDS_cohort/tuning/0.parquet into /content/output_dir/data/tuning/0.parquet [2025-07-17 17:50:00,243][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/merge_to_MEDS_cohort/tuning/0.parquet [2025-07-17 17:50:00,244][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:50:00,249][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/data/tuning/0.parquet [2025-07-17 17:50:00,260][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.016874 [2025-07-17 17:50:00,261][MEDS_transforms.mapreduce.mapper][INFO] - Processing /content/output_dir/merge_to_MEDS_cohort/train/0.parquet into /content/output_dir/data/train/0.parquet [2025-07-17 17:50:00,261][MEDS_transforms.mapreduce.rwlock][INFO] - Reading input dataframe from output_dir/merge_to_MEDS_cohort/train/0.parquet [2025-07-17 17:50:00,262][MEDS_transforms.mapreduce.rwlock][INFO] - Read dataset [2025-07-17 17:50:00,302][MEDS_transforms.mapreduce.rwlock][INFO] - Writing final output to output_dir/data/train/0.parquet [2025-07-17 17:50:00,404][MEDS_transforms.mapreduce.rwlock][INFO] - Succeeded in 0:00:00.143007 [2025-07-17 17:50:00,405][MEDS_transforms.mapreduce.stage][INFO] - Finished mapping in 0:00:00.184304 INFO:MEDS_transforms.runner:Command error:
Note that even if everything ran correctly, the log will still end with "Command error:" followed by nothing -- this simply reports that no error output was written for the internal stages of the process.
What do the output files themselves actually look like? Let's see:
%%bash
tree output_dir
output_dir
├── convert_to_MEDS_events
│   ├── event_conversion_config.yaml
│   ├── held_out
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   ├── train
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   └── tuning
│       └── 0
│           ├── hosp
│           │   ├── admissions.parquet
│           │   ├── patients.parquet
│           │   └── procedures_icd.parquet
│           └── icu
│               ├── chartevents.parquet
│               └── icustays.parquet
├── convert_to_subject_sharded
│   ├── held_out
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   ├── train
│   │   └── 0
│   │       ├── hosp
│   │       │   ├── admissions.parquet
│   │       │   ├── patients.parquet
│   │       │   └── procedures_icd.parquet
│   │       └── icu
│   │           ├── chartevents.parquet
│   │           └── icustays.parquet
│   └── tuning
│       └── 0
│           ├── hosp
│           │   ├── admissions.parquet
│           │   ├── patients.parquet
│           │   └── procedures_icd.parquet
│           └── icu
│               ├── chartevents.parquet
│               └── icustays.parquet
├── data
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   └── 0.parquet
│   └── tuning
│       └── 0.parquet
├── extract_code_metadata
│   └── event_conversion_config.yaml
├── finalize_MEDS_metadata
├── merge_to_MEDS_cohort
│   ├── held_out
│   │   └── 0.parquet
│   ├── train
│   │   └── 0.parquet
│   └── tuning
│       └── 0.parquet
├── metadata
│   ├── codes.parquet
│   ├── dataset.json
│   └── subject_splits.parquet
├── shard_events
│   ├── hosp
│   │   ├── admissions
│   │   │   └── [0-275).parquet
│   │   ├── patients
│   │   │   └── [0-100).parquet
│   │   └── procedures_icd
│   │       └── [0-722).parquet
│   └── icu
│       ├── chartevents
│       │   └── [0-668862).parquet
│       └── icustays
│           └── [0-140).parquet
└── split_and_shard_subjects

46 directories, 46 files
There's a lot here -- thankfully, most of these files are internal, partial outputs that MEDS-Extract writes so it can resume after a failure on larger datasets. They aren't needed for our purposes here, but they're invaluable when you're working with hundreds of thousands to billions of measurements!
To see just the final files, we can look in the data
and metadata
sub-folders:
%%bash
tree output_dir/data
output_dir/data
├── held_out
│   └── 0.parquet
├── train
│   └── 0.parquet
└── tuning
    └── 0.parquet

3 directories, 3 files
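Each of these shard files is an ordinary Parquet file in the MEDS long format: one row per measurement, with `subject_id`, `time`, `code`, and `numeric_value` columns. As an illustrative sketch (using pandas with entirely made-up values; a real shard is loaded the same way with `pd.read_parquet("output_dir/data/train/0.parquet")`):

```python
import pandas as pd

# A miniature stand-in for what a MEDS data shard contains: one row per
# measurement, sorted by subject and time. All values below are invented.
shard = pd.DataFrame(
    {
        "subject_id": [1, 1, 1],
        "time": pd.to_datetime(
            ["2130-01-01 10:00:00", "2130-01-01 11:30:00", "2130-01-03 09:00:00"]
        ),
        "code": [
            "HOSPITAL_ADMISSION//EW EMER.//EMERGENCY ROOM",
            "CHARTEVENT//220045//bpm",
            "HOSPITAL_DISCHARGE//HOME",
        ],
        # numeric_value is only populated for measurements with a numeric
        # component (here, the chartevent); otherwise it is null.
        "numeric_value": [None, 88.0, None],
    }
)

print(shard)
```

Note that categorical events carry all their information in the `code` string, while numeric measurements additionally fill `numeric_value`.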
%%bash
tree output_dir/metadata
output_dir/metadata
├── codes.parquet
├── dataset.json
└── subject_splits.parquet

0 directories, 3 files
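These metadata files are small and easy to inspect directly: `dataset.json` is plain JSON recording the `dataset.name` and `dataset.version` we passed on the command line, and `subject_splits.parquet` maps each subject to its split. A small sketch of what the splits table looks like (illustrative values; load the real file with `pd.read_parquet("output_dir/metadata/subject_splits.parquet")`):

```python
import pandas as pd

# Illustrative stand-in for metadata/subject_splits.parquet: one row per
# subject, mapping subject_id to its split label. Values are invented.
splits = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "split": ["train", "train", "tuning", "held_out"],
    }
)

# The real file from this run would show 80 / 10 / 10 subjects, matching
# the split sizes reported in the finalize_MEDS_metadata log above.
print(splits.groupby("split").size())
```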
Going Forward
While you've just built a complete MEDS dataset from the MIMIC demo data in this tutorial, you've only used a small subset of the files we listed above. In the rest of the tutorial, we'll use the full MIMIC demo dataset, which we'll download as needed in the other notebooks, rather than the output of this notebook. Note that it is also built with a slightly different configuration file than the one constructed here -- but rest assured, it is very similar to what you put together. You can see how it is processed by looking at the dedicated MIMIC-IV ETL Package, or specifically at the analogous MESSY file used for all the sources in that repository!
Additional Details and Resources
You can also check out MEDS-Extract's documentation, as well as another example on synthetic data, via the included links!
Even more importantly, what if you don't like MEDS-Extract and don't want to use it? Then don't! The three guiding questions of the extraction process (What is happening?, To whom is it happening?, and When is it happening?) can be turned into an extraction pipeline in whatever way you like -- the MEDS ecosystem is designed to be data-centric, so it doesn't matter how you got to a MEDS dataset, just that you did, and then tools can run from there!
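For instance, here is a minimal hand-rolled sketch of answering those three questions for an admissions-like file, with no MEDS-Extract involved (the raw table is made up, and joining code components with "//" mirrors MEDS-Extract's convention but is an assumption here, not a requirement of MEDS itself):

```python
import pandas as pd

# A made-up raw table, standing in for something like hosp/admissions.csv.gz.
raw = pd.DataFrame(
    {
        "subject_id": [1, 2],                                  # To whom?
        "admittime": ["2130-01-01 10:00:00", "2131-06-05 08:30:00"],  # When?
        "admission_type": ["EW EMER.", "ELECTIVE"],            # What?
    }
)

# Turn each raw row into one MEDS measurement row.
meds = pd.DataFrame(
    {
        "subject_id": raw["subject_id"],
        "time": pd.to_datetime(raw["admittime"], format="%Y-%m-%d %H:%M:%S"),
        "code": "HOSPITAL_ADMISSION//" + raw["admission_type"],
        "numeric_value": float("nan"),  # admissions carry no numeric component
    }
).sort_values(["subject_id", "time"])

print(meds)
```

However you produce it, a table shaped like `meds` (written out as Parquet, sorted by subject and time) is all downstream MEDS tools need.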