Task Objective

The objective of this task is to develop systems that automatically identify predefined medical sections in unstructured clinical notes. The task combines segmentation and classification: systems must segment each note into sections and assign each section to one of the predefined categories.
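To make the combined segmentation and classification setting concrete, the following minimal sketch represents a system's output as a list of labeled token spans and expands it into one label per token. The names `Section` and `to_token_labels`, the label strings, and the "O" tag for uncovered tokens are assumptions made for illustration; they are not the official data format of the task.

```python
from typing import List, NamedTuple


class Section(NamedTuple):
    start: int  # index of the first token in the section (inclusive)
    end: int    # index one past the last token (exclusive)
    label: str  # one of the predefined section categories (hypothetical names)


def to_token_labels(n_tokens: int, sections: List[Section]) -> List[str]:
    """Expand labeled spans into one label per token; uncovered tokens get 'O'."""
    labels = ["O"] * n_tokens
    for sec in sections:
        if not (0 <= sec.start < sec.end <= n_tokens):
            raise ValueError(f"span out of range: {sec}")
        for i in range(sec.start, sec.end):
            if labels[i] != "O":
                raise ValueError(f"overlapping sections at token {i}")
            labels[i] = sec.label
    return labels


# Toy prediction for a six-token note, invented for illustration only.
example = [Section(0, 3, "PRESENT_ILLNESS"), Section(3, 5, "EXPLORATION")]
print(to_token_labels(6, example))
```

The span boundaries capture the segmentation decision and the label captures the classification decision, so an error in either one changes the per-token output.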


Data Description

For this task, the organizers selected a subset of the CodiEsp corpus [1]. CodiEsp is a collection of unstructured Spanish clinical case reports from different medical specialties. The corpus was originally used in a Named Entity Recognition (NER) task (eHealth CLEF 2020) that aimed to identify procedures and diagnoses, labeled with the Spanish version of ICD-10, in a subset of 1,000 documents. An additional collection of 2,751 unannotated documents was also provided as a background set. The present corpus is a randomly selected subset of the CodiEsp background set, consisting of 1,038 distinct notes. The following table presents some of its relevant statistics.

Statistic | Value
Number of Notes | 1,038
Number of Tokens | 360,224
Average Note Length (in Tokens) | 347.04±235.52
Average Number of Sections per Note | 6.94±3.36
Average Number of Unique Sections per Note | 4.38±0.99
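The reported average note length can be cross-checked against the totals in the table. The short sketch below does that arithmetic and also shows one way a "mean±deviation" figure could be produced from per-note counts. It assumes the ± values are standard deviations and uses the population form (`statistics.pstdev`); the source states neither, and the toy counts are invented for illustration.

```python
import statistics

# Totals taken from the statistics table.
n_notes = 1038
n_tokens = 360_224

avg_note_length = n_tokens / n_notes
print(f"average note length: {avg_note_length:.2f} tokens")  # 347.04


def mean_and_std(values):
    """Return (mean, population standard deviation) for per-note counts."""
    return statistics.mean(values), statistics.pstdev(values)


# Toy per-note token counts, invented for illustration only.
toy_lengths = [100, 200, 300]
mean, std = mean_and_std(toy_lengths)
print(f"{mean:.2f}±{std:.2f}")
```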


Annotated Clinical Sections

Section | Criterion
Present Illness (PI) | Description of the reason for consultation, including treatments, diagnoses, and explorations performed prior to admission. Anamnesis is also included in this section when it is collected in the clinical case.
Derived from/to (D) | Reference to any transfer from or to any department, center, or primary care physician who has made the transfer request, together with its justification, if any.
Past Medical History (MH) | Description of previous pathologies. A mention of the absence of previous medical history is also considered part of this section.
Family History (FH) | Description of family members' pathologies. A mention of the absence of family history is also considered part of this section.
Exploration (E) | Mention of the physical examination, specific studies and their results, and laboratory tests. This section also includes autopsies and their results.
Treatment (T) | Treatments or procedures performed on the patient to treat the patient's condition, including "dieting".
Evolution (E) | Evolution of the patient's health status. It may include differential diagnoses.


Annotated Clinical Note Example

Sections annotated in this example: Present Illness, Derived from/to, Exploration, Treatment, Evolution.

Un paciente varón de 25 años miope magno es remitido con el diagnóstico de membrana neovascular subretiniana (MNVSR) en el ojo izquierdo (OI). Había recibido prednisona oral y triamcinolona transeptal. A su llegada a nuestro centro, su mejor agudeza visual corregida (MAVC) era de 0,8 en el ojo derecho (OD) y 0,25 en el OI. La angiofluoresceingrafía (AFG) revelaba puntos hiperfluorescentes en polo posterior de la arteria oftálmica (AO) y en OI una hemorragia macular con MNVSR adyacente. La tomografía de coherencia óptica (TCO) mostraba una NVC yuxtamacular con edema intrarretiniano leve. Se realizó TFD en el OI y a los 6 meses de seguimiento la MAVC mejoró a 0,8 en OI, con pigmentación del borde nasal de la lesión objetivable por retinografía y ausencia de edema en la TCO.

Seis meses más tarde, la MAVC había descendido en el OI a 0,1, la lesión había crecido, había edema subretiniano y se fugaba contraste. Se indicó tratamiento mediante inyección intravítrea de ranibizumab (Lucentis®) en el OI como uso compasivo mediante la habitual dosis de carga consistente en tres inyecciones separadas un mes. Por persistencia de líquido subretiniano, se indicó una cuarta inyección que inactivó la lesión y aumentó la MAVC de 0,25 a 0,7. La MAVC permaneció estable a pesar de la fibrosis subretiniana macular en revisiones posteriores. Un año después comenzó con metamorfopsias en el OD. Su MAVC era 0,9 en OD y la AFG mostró una NVC. La presencia de manchas hiperfluorescentes en AO, junto a la asociación a NVC, condujo al diagnóstico de CPI. Se inició tratamiento con Lucentis® intravítreo en el OD, tres inyecciones, con el resultado de inactividad angiográfica y tomográfica, con una MAVC de 0,8 en AO.


Dataset Division in Splits

The dataset was divided into train, development, and test sets in proportions of 0.75, 0.125, and 0.125, respectively. Notes were allocated by random sampling stratified by category and annotator, to ensure a similar proportion of categories in all sets and to account for differences in annotator expertise.

Split | % of Total Dataset | Number of Notes
Train | 75% | 781
Dev | 12.5% | 127
Test | 12.5% | 130
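A stratified random split of this kind can be sketched as follows. This is an illustration under stated assumptions, not the organizers' procedure: the function `stratified_split`, the note fields `category` and `annotator`, and the use of a single stratum key per note are all hypothetical, and the source does not say how a note's category is defined.

```python
import random
from collections import defaultdict


def stratified_split(notes, key, fractions=(0.75, 0.125, 0.125), seed=0):
    """Shuffle notes within each stratum, then cut each stratum into train/dev/test."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for note in notes:
        strata[key(note)].append(note)

    train, dev, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_dev = round(len(group) * fractions[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


# Toy notes, invented for illustration: two strata of 40 notes each.
toy_notes = [
    {"id": i, "category": "A", "annotator": "ann1" if i < 40 else "ann2"}
    for i in range(80)
]
train, dev, test = stratified_split(
    toy_notes, key=lambda n: (n["category"], n["annotator"])
)
print(len(train), len(dev), len(test))
```

Because rounding is applied per stratum, realized proportions can deviate slightly from the nominal ones. The published counts show the same effect: 781, 127, and 130 notes out of 1,038 correspond to roughly 75.2%, 12.2%, and 12.5%.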


Annotation Process

Initially, guidelines were created for identifying patterns and categorizing each section in unstructured clinical notes. A group of experts went through several rounds of annotating a small set of notes and updating the guidelines accordingly. Once the annotation process had matured, two doctors trained in clinical report annotation for different tasks performed a double annotation of the notes. The annotation process was iterative, and the task's evaluation metric was used to measure inter-annotator agreement, which reached 75%.


References

[1] A. Miranda-Escalada, A. Gonzalez-Agirre, J. Armengol-Estapé, M. Krallinger, Overview of Automatic Clinical Coding: Annotations, Guidelines, and Solutions for non-English Clinical Cases at CodiEsp Track of CLEF eHealth 2020. Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings, 2020. https://zenodo.org/record/3837305.


Acquisition of the Dataset