Evaluation

Evaluation Scripts

The evaluation scripts have been released and are available in the Starting Kit of the CodaLab competition.

Evaluation Metric

Section identification in unstructured clinical notes presents several challenges that also complicate the evaluation of the task. Two aspects stand out. First, the end of one section is always tied to the beginning of the next, so commonly used evaluation approaches count two sections as wrong when a single boundary is misplaced, even if the error is only one word. Second, sections are not delimited by paragraphs, lines, or phrases, meaning that one sentence may contain more than one section, which greatly increases the difficulty of the segmentation task.

As a result, we conducted a thorough analysis of existing metrics and designed the B2 evaluation metric to better evaluate actual performance on the task.

B2 Evaluation Metric Details

The B2 metric is an adaptation of the boundary distance B developed by C. Fournier [1]. It employs a variation of the edit distance with three operations (addition/deletion, substitution, and transposition) and is able to distinguish segment types. Its main advantage is the transposition operation, in which the boundary between two sections can be moved by a limited, configurable number of borders instead of performing an insertion and a deletion.

Additions/deletions (A or D) for full misses. An addition applies when the prediction misses a section and adding it matches the gold standard. A deletion applies when the system predicts a non-existent section and deleting it matches the gold standard.
Substitutions (S) when a boundary type is confused with another.
n-wise transpositions (T) for near misses: cases where a section type is correctly identified but the predicted boundary is displaced by $n$ words.

After an extensive analysis of the initially annotated examples, each operation's weight function was adjusted, creating a new measure called B2, which is defined by the following formulae.

Substitutions are heavily penalized because they correspond to clear errors:

$w_S(n\_substitutions) = 1.3 \cdot n\_substitutions$
Additions and deletions generally represent disagreements about whether a fragment belongs to a separate section or to an existing contiguous one. This error is common given the characteristics of the documents.
Exactly two additions typically indicate the insertion of a section in the middle of another one. From the standpoint of section identification, this should be considered a single error, although the algorithm counts it as two: the start of the new section and the continuation of the previously existing one. In less frequent situations this is not the case, for instance when the new section extends to the end of the previous section, so the extra addition that resumes the previous section is not needed. Consequently, and considering the limitations of the first version of B, the following weighting was applied:

$w_A(n\_additions) = \begin{cases} 0 & \text{if } n\_additions = 0 \\ 0.75 + \frac{\tanh{(n\_additions - 1.5) - 2}}{4} & \text{otherwise} \end{cases}$
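
The two weight functions above can be sketched in Python as follows. This is a minimal illustration, not the released evaluation script: the function names are hypothetical, the zero case is assumed to mean "no additions or deletions", and the argument of $\tanh$ is read as $(n\_additions - 1.5) - 2$, as the braces in the formula suggest.

```python
import math


def w_substitutions(n_substitutions: int) -> float:
    """Substitution weight: a flat 1.3 per substitution."""
    return 1.3 * n_substitutions


def w_additions(n_additions: int) -> float:
    """Addition/deletion weight for a note.

    Assumption: the zero case applies when there are no additions/deletions.
    Assumption: tanh is applied to (n_additions - 1.5) - 2.
    """
    if n_additions == 0:
        return 0.0
    return 0.75 + math.tanh((n_additions - 1.5) - 2) / 4


if __name__ == "__main__":
    # Print the weight for a few addition counts to see how it saturates.
    for n in range(0, 7):
        print(n, round(w_additions(n), 3))
```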
Transpositions are minor divergences in the length of a section. They range from one or two words to complete sentences, so the upper limit on the number of borders a boundary can be moved was set to $n_t=40$. This is a fairly high limit that can cover an entire paragraph, but transpositions with different displacement lengths do not represent the same error. We therefore weight each individual transposition by the number of borders moved.
The maximum value a weighted transposition can reach is $\sim0.68$, when the boundary is moved the maximum number of borders allowed, and the weight approaches $0$ as the displacement gets smaller.

$w_t(n\_borders, n_t) = \begin{cases} 0 & \text{if } n\_borders \leq 2 \\ 0.35 + \tanh(\frac{n\_borders-15}{10}) / 3 & \text{if } 2 < n\_borders \leq n_t \\ \end{cases}$
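
As a quick check of the transposition weight, the sketch below evaluates it at a few displacements. The function name is hypothetical, and raising an error for displacements above $n_t$ is an assumption: the formula leaves that case undefined, presumably because such a move would not be treated as a transposition.

```python
import math


def w_transposition(n_borders: int, n_t: int = 40) -> float:
    """Weight of a single transposition that moved a boundary n_borders borders."""
    if n_borders > n_t:
        # Assumption: displacements beyond n_t are not transpositions at all.
        raise ValueError("displacement exceeds the transposition limit n_t")
    if n_borders <= 2:
        return 0.0
    return 0.35 + math.tanh((n_borders - 15) / 10) / 3


if __name__ == "__main__":
    # Small displacements are nearly free; the weight grows towards ~0.68 at n_t.
    for n in (1, 3, 15, 40):
        print(n, round(w_transposition(n), 3))
```

With the default limit of 40, the largest weight comes out at roughly 0.68, which matches the maximum stated above.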

The following equation is used to calculate B2, i.e., one minus the incorrectness between the two annotations. $s_1$ and $s_2$ are lists of the same size (the number of borders in the instance); each position holds the set of section boundaries placed at that border, if any. $B_M$, $A_e$, $S_e$ and $T_e$ are calculated from $s_1$ and $s_2$, where $B_M$ is the set of matching boundary pairs, $A_e$ and $S_e$ are the sets of additions/deletions and substitutions respectively, and $T_e$ is the list of transpositions. Each transposition is a list containing, among other items, the index of the border where the section boundary was (position 1) and the index of the border to which it has been moved (position 2).
The B2 metric considers matching boundary pairs ($B_M$) correct and transpositions semi-correct section divisions. The complement of each transposition's weight is used to adjust how much it contributes to the correctness calculation, so that small transpositions count more than larger ones.

$B2(s_1, s_2, n_t) = 1 - \frac{w_A(|A_e|) + w_T(T_e, n_t) + w_S(|S_e|)}{|A_e| + |T_e| + |S_e| + |B_M| + (|T_e| - w_T(T_e, n_t))}$
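
To make the formula concrete, the sketch below computes B2 for one note from edit counts that are assumed to have been extracted already; it does not implement the boundary edit distance itself. All names are hypothetical. Two further assumptions are made: $w_T(T_e, n_t)$ is taken to be the sum of the individual transposition weights, and each transposition is reduced to the number of borders it moved. The weight functions follow the definitions given earlier, with the zero case of $w_A$ read as "no additions".

```python
import math
from typing import Sequence


def w_substitutions(n_substitutions: int) -> float:
    return 1.3 * n_substitutions


def w_additions(n_additions: int) -> float:
    # Assumption: the weight is zero when there are no additions/deletions.
    if n_additions == 0:
        return 0.0
    return 0.75 + math.tanh((n_additions - 1.5) - 2) / 4


def w_transposition(n_borders: int) -> float:
    if n_borders <= 2:
        return 0.0
    return 0.35 + math.tanh((n_borders - 15) / 10) / 3


def b2_score(
    n_additions: int,
    n_substitutions: int,
    n_matches: int,
    displacements: Sequence[int],
    n_t: int = 40,
) -> float:
    """B2 for one note, given |A_e|, |S_e|, |B_M| and per-transposition displacements."""
    if any(d > n_t for d in displacements):
        raise ValueError("a transposition cannot move more than n_t borders")
    n_transpositions = len(displacements)
    # Assumption: w_T is the sum of the individual transposition weights.
    w_t_total = sum(w_transposition(d) for d in displacements)
    numerator = (
        w_additions(n_additions) + w_t_total + w_substitutions(n_substitutions)
    )
    denominator = (
        n_additions + n_transpositions + n_substitutions + n_matches
        + (n_transpositions - w_t_total)
    )
    if denominator == 0:
        # Assumption: no boundaries in either annotation counts as full agreement.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    # Two additions, five exact matches, and transpositions of 3 and 20 borders.
    print(round(b2_score(2, 0, 5, [3, 20]), 4))
```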

While the formula above evaluates a single note, it is not designed to evaluate the whole dataset directly. To assess overall performance on the dataset, a weighted average of the per-note scores is calculated. We use the number of sections in the ground-truth note as the weight, since we found that other complexity indicators, such as word count, were not truly representative of the actual difficulty of each note.
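
The dataset-level aggregation can be sketched as a plain weighted mean. The function and parameter names are hypothetical, and the per-note B2 scores are assumed to have been computed beforehand.

```python
from typing import Sequence


def dataset_b2(note_scores: Sequence[float], gold_section_counts: Sequence[int]) -> float:
    """Average per-note B2 scores, weighted by the number of gold sections per note."""
    if len(note_scores) != len(gold_section_counts):
        raise ValueError("one section count is required per note score")
    total_sections = sum(gold_section_counts)
    if total_sections == 0:
        raise ValueError("at least one gold section is required")
    weighted_sum = sum(
        score * count for score, count in zip(note_scores, gold_section_counts)
    )
    return weighted_sum / total_sections


if __name__ == "__main__":
    # A note with three gold sections weighs three times as much as a note with one.
    print(dataset_b2([1.0, 0.5], [1, 3]))
```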

References

[1] C. Fournier, Evaluating Text Segmentation using Boundary Edit Distance, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 1702–1712. https://aclanthology.org/P13-1167.