Evaluation scores
End-to-end evaluation
The end-to-end evaluation was performed with the MeasEval dataset (SemEval-2021 Task 8). The scores in the following table are micro averages. MeasEval was annotated to allow approximated entities, which are not supported in grobid-quantities.
Type (Ref) | Matching method | Precision | Recall | F1-score | Support |
---|---|---|---|---|---|
Quantities | strict | 54.82 | 60.50 | 57.52 | 1246 |
Quantities | soft | 66.93 | 73.87 | 70.23 | 1246 |
Quantified substance | strict | 13.10 | 10.58 | 11.71 | 710 |
Quantified substance | soft | 23.38 | 18.89 | 20.89 | 710 |
Note: Measured Entity (ME) recognition is still experimental in Grobid-quantities.
To reproduce the end-to-end evaluation, run the scripts/measeval_e2e_eval.py script (use requirements.txt to install the correct dependencies).
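The tables above distinguish "strict" matching (exact entity boundaries) from "soft" matching (any overlap with a gold entity). A minimal sketch of the two matching modes and the resulting micro-averaged scores — this is an illustration, not the actual MeasEval scorer:

```python
# Illustrative strict vs. soft span matching for entity-extraction
# evaluation. Spans are (start, end) character offsets; this is NOT
# the official MeasEval evaluation script.

def matches(pred, gold, mode):
    """strict: exact boundaries must coincide; soft: any character overlap."""
    if mode == "strict":
        return pred == gold
    return pred[0] < gold[1] and gold[0] < pred[1]  # overlap test

def micro_prf(preds, golds, mode="strict"):
    # True positives: predicted spans that match at least one gold span.
    tp = sum(any(matches(p, g, mode) for g in golds) for p in preds)
    fp = len(preds) - tp
    fn = sum(not any(matches(p, g, mode) for p in preds) for g in golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

golds = [(0, 5), (10, 14)]
preds = [(0, 5), (11, 16)]                # second span only overlaps
print(micro_prf(preds, golds, "strict"))  # (0.5, 0.5, 0.5)
print(micro_prf(preds, golds, "soft"))    # (1.0, 1.0, 1.0)
```

Soft matching always yields scores at least as high as strict matching, which is the pattern visible in the table above.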
Machine Learning Named Entity Recognition Evaluation
The scores (P: Precision, R: Recall, F1: F1-score) for all models are computed either via 10-fold cross-validation or on a holdout dataset. The holdout dataset of Grobid-quantities is composed of the following examples:
- Quantities ML: 10 articles
- Units ML: UNISCOR dataset with around 1600 examples
- Values ML: 950 examples
For the deep learning models (BidLSTM_CRF, BidLSTM_CRF_FEATURES, and BERT_CRF), we provide the average over 5 runs.
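The 5-run aggregation can be sketched as follows; the run values below are hypothetical, and the population standard deviation is used here (the reported figures may use the sample variant):

```python
# Sketch of aggregating per-run F1 scores into the reported mean and
# standard deviation. The run values are made up for illustration.
from statistics import mean, pstdev

run_f1 = [0.9145, 0.9160, 0.9138, 0.9171, 0.9146]  # 5 training runs
print(round(mean(run_f1), 4))    # 0.9152
print(round(pstdev(run_f1), 4))  # 0.0012
```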
The models are organised as follows:
- BidLSTM_CRF is an RNN model based on the work of (Lample et al., 2016), with a CRF as the final layer
- BidLSTM_CRF_FEATURES is an extension of BidLSTM_CRF that allows the use of layout features
- BERT_CRF is a BERT-based model obtained by fine-tuning a SciBERT encoder. Like the others, it uses a CRF as the final layer.
Results
The evaluation was performed on the holdout dataset of the grobid-quantities corpus. Average values are computed as micro averages. To reproduce it, see evaluation_doc.
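Micro averaging pools the true/false positive and false negative counts over all labels before computing P/R/F1, so frequent labels dominate the average. A minimal sketch with illustrative counts (not the real ones):

```python
# Minimal sketch of micro-averaging: pool TP/FP/FN across all labels
# before computing precision, recall, and F1. Counts are illustrative.

def micro_average(counts):
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

counts = {
    "<unitLeft>":    {"tp": 400, "fp": 40, "fn": 64},
    "<valueAtomic>": {"tp": 500, "fp": 90, "fn": 81},
}
p, r, f1 = micro_average(counts)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8738 0.8612 0.8675
```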
Quantities
Labels | CRF | | | BERT_CRF | | | | Support
---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | P | R | F1 | St.dev |
<unitLeft> | 90.26 | 83.84 | 86.93 | 93.13 | 89.96 | 91.52 | 0.0086 | 464
<unitRight> | 36.36 | 30.77 | 33.33 | 23.67 | 40.00 | 29.70 | 0.0139 | 13
<valueAtomic> | 75.75 | 77.97 | 76.84 | 85.46 | 87.99 | 86.70 | 0.0041 | 581
<valueBase> | 80.77 | 60.00 | 68.85 | 98.75 | 90.29 | 94.33 | 0.0163 | 35
<valueLeast> | 76.24 | 61.11 | 67.84 | 84.58 | 72.22 | 77.91 | 0.0212 | 126
<valueList> | 27.27 | 11.32 | 16.00 | 61.10 | 39.62 | 47.79 | 0.0262 | 53
<valueMost> | 68.35 | 55.67 | 61.36 | 78.93 | 71.75 | 75.16 | 0.0179 | 97
<valueRange> | 91.18 | 88.57 | 89.86 | 100.00 | 91.43 | 95.52 | 0.0000 | 35
All (micro avg) | 79.49 | 73.72 | 76.50 | 86.50 | 83.97 | 85.22 | 0.0031 | 1404
Labels | BidLSTM_CRF | | | | BidLSTM_CRF_FEATURES | | | | Support
---|---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | St.dev | P | R | F1 | St.dev |
<unitLeft> | 87.58 | 89.96 | 88.75 | 0.0074 | 86.95 | 89.57 | 88.24 | 0.0097 | 464
<unitRight> | 25.01 | 30.77 | 27.50 | 0.0193 | 23.99 | 30.77 | 26.91 | 0.0146 | 13
<valueAtomic> | 79.52 | 85.71 | 82.49 | 0.0044 | 78.33 | 86.57 | 82.24 | 0.0062 | 581
<valueBase> | 83.84 | 97.14 | 89.97 | 0.0185 | 80.99 | 97.14 | 88.32 | 0.0115 | 35
<valueLeast> | 83.79 | 62.38 | 71.45 | 0.0294 | 84.37 | 60.00 | 70.06 | 0.0335 | 126
<valueList> | 80.12 | 13.58 | 23.05 | 0.0326 | 69.29 | 14.34 | 23.37 | 0.0715 | 53
<valueMost> | 75.91 | 70.92 | 73.22 | 0.0311 | 75.54 | 67.01 | 70.99 | 0.0370 | 97
<valueRange> | 92.87 | 94.86 | 93.84 | 0.0783 | 95.58 | 97.14 | 96.35 | 0.0673 | 35
All (micro avg) | 82.12 | 81.28 | 81.70 | 0.0048 | 81.26 | 81.11 | 81.19 | 0.0090 | 1404
Units
Units were evaluated using the UNISCOR dataset. For more information, see the UNISCOR section.
Labels | CRF | | | BERT_CRF | | | | Support
---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | P | R | F1 | St.dev |
<base> | 80.64 | 82.71 | 81.66 | 73.63 | 76.26 | 74.89 | 0.0231 | 3228
<pow> | 71.94 | 74.34 | 73.12 | 80.20 | 57.35 | 66.75 | 0.0752 | 1773
<prefix> | 92.60 | 86.48 | 89.43 | 77.61 | 88.05 | 82.12 | 0.0338 | 1287
All (micro avg) | 80.39 | 81.12 | 80.76 | 75.55 | 73.34 | 74.41 | 0.0178 | 6288
Labels | BidLSTM_CRF | | | | BidLSTM_CRF_FEATURES | | | | Support
---|---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | St.dev | P | R | F1 | St.dev |
<base> | 52.17 | 46.16 | 48.93 | 0.0494 | 51.99 | 48.00 | 49.88 | 0.0259 | 3228
<pow> | 94.25 | 56.89 | 70.94 | 0.0125 | 94.20 | 56.92 | 70.96 | 0.0062 | 1773
<prefix> | 81.36 | 82.88 | 82.01 | 0.0119 | 82.11 | 82.94 | 82.43 | 0.0201 | 1287
All (micro avg) | 68.12 | 56.70 | 61.85 | 0.0282 | 67.76 | 57.67 | 62.29 | 0.0173 | 6288
Values
Labels | CRF | | | BERT_CRF | | | | Support
---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | P | R | F1 | St.dev |
<alpha> | 96.90 | 99.21 | 98.04 | 99.21 | 99.37 | 99.29 | 0.0017 | 464
<base> | 100.00 | 92.31 | 96.00 | 100.00 | 100.00 | 100.00 | 0.0000 | 13
<number> | 99.14 | 99.63 | 99.38 | 99.43 | 99.46 | 99.44 | 0.0005 | 581
<pow> | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 0.0000 | 35
All (micro avg) | 98.86 | 99.48 | 99.17 | 99.42 | 99.46 | 99.44 | 0.0004 | 1093
Labels | BidLSTM_CRF | | | | BidLSTM_CRF_FEATURES | | | | Support
---|---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | St.dev | P | R | F1 | St.dev |
<alpha> | 97.82 | 99.53 | 98.66 | 0.0035 | 93.13 | 89.96 | 91.52 | 0.0086 | 464
<base> | 97.78 | 67.69 | 79.46 | 0.0937 | 23.67 | 40.00 | 29.70 | 0.0139 | 13
<number> | 98.92 | 99.33 | 99.13 | 0.0008 | 85.46 | 87.99 | 86.70 | 0.0041 | 581
<pow> | 69.11 | 73.85 | 71.29 | 0.1456 | 98.75 | 90.29 | 94.33 | 0.0163 | 35
All (micro avg) | 98.34 | 98.59 | 98.47 | 0.0023 | 86.50 | 83.97 | 85.22 | 0.0031 | 1093
Other published results
The paper "Automatic Identification and Normalisation of Physical Measurements in Scientific Literature", published in September 2019, reported macro-averaged evaluation scores.
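Macro-averaged scores are not directly comparable with the micro averages reported above: a macro average weights every label equally, so rare labels count as much as frequent ones. A small sketch with illustrative per-label F1 values:

```python
# Sketch contrasting macro averaging (used in the cited paper) with
# the micro averaging used in the tables above. The per-label F1
# values are illustrative only.

def macro_f1(per_label_f1):
    # Macro: unweighted mean over labels; a poorly performing rare
    # label drags the average down as much as a frequent one would.
    return sum(per_label_f1) / len(per_label_f1)

print(round(macro_f1([0.92, 0.30, 0.87]), 4))  # 0.6967
```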