Evaluation scores

End-to-end evaluation

The end-to-end evaluation was performed on the MeasEval dataset (SemEval-2021 Task 8). The scores in the following table are micro averages. MeasEval was annotated to allow approximate entities, which are not supported in grobid-quantities.

| Type (Ref)                | Matching method | Precision | Recall | F1-score | Support |
|---------------------------|-----------------|-----------|--------|----------|---------|
| Quantities (QUANT)        | strict          | 54.09     | 54.47  | 54.28    | 1137    |
| Quantities (QUANT)        | soft            | 67.02     | 67.49  | 67.26    | 1137    |
| Quantified substance (ME) | strict          | 13.82     | 9.67   | 11.38    | 615     |
| Quantified substance (ME) | soft            | 21.63     | 15.13  | 17.80    | 615     |
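
The exact matching rules are those implemented by the evaluation script; as a rough sketch, "strict" is assumed here to require identical span boundaries, while "soft" is assumed to accept any overlap between a predicted and a reference span of the same type:

```python
# Illustrative sketch of the assumed difference between the two matching
# methods; spans are (start, end) character offsets of entities of one type.
def strict_match(pred, gold):
    """Assumed 'strict' method: identical boundaries."""
    return pred == gold


def soft_match(pred, gold):
    """Assumed 'soft' method: any character overlap."""
    return max(pred[0], gold[0]) < min(pred[1], gold[1])


# A prediction that misses one trailing character fails 'strict' but passes 'soft'.
print(strict_match((10, 25), (10, 26)))  # False
print(soft_match((10, 25), (10, 26)))    # True
```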

Note: the recognition of the Measured Entity (ME) is still experimental in Grobid-quantities.

To reproduce the end-to-end evaluation, run the scripts/measeval_e2e_eval.py script (use the requirements.txt to install the required dependencies first).
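
A minimal reproduction sketch, assuming the commands are run from the repository root and that the requirements.txt mentioned above lives there; the script's own command-line options, if any, are not shown:

```python
# Reproduction sketch: install the dependencies, then run the end-to-end
# evaluation script mentioned above. Paths are assumptions about the layout
# of your checkout; adjust them if needed.
import subprocess
import sys

subprocess.run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
               check=True)
subprocess.run([sys.executable, "scripts/measeval_e2e_eval.py"], check=True)
```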

Machine Learning Named Entity Recognition Evaluation

The scores (P: Precision, R: Recall, F1: F1-score) for all models are computed either with 10-fold cross-validation or on a holdout dataset. The holdout dataset of Grobid-quantities is composed of the following examples:

  • Quantities ML: 10 articles
  • Units ML: UNISCOR dataset with around 1600 examples
  • Values ML: 950 examples

For the deep learning models (BidLSTM_CRF, BidLSTM_CRF_FEATURES and BERT_CRF), we provide the average over 5 runs.
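
A minimal sketch of how the 5-run averages and the St.dev columns in the tables below can be obtained (the per-run F1 values here are placeholders, not actual evaluation output):

```python
# Average and standard deviation over repeated training/evaluation runs (sketch).
from statistics import mean, stdev

f1_per_run = [91.2, 91.8, 91.5, 91.0, 92.0]  # hypothetical F1 scores of 5 runs

print(f"F1 = {mean(f1_per_run):.2f} (st.dev {stdev(f1_per_run):.4f})")
```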

The models are organised as follows:

  • BidLSTM_CRF is an RNN model based on the work of (Lample et al., 2016), with a CRF layer as the final classification layer (a generic sketch of this architecture is given after this list)
  • BidLSTM_CRF_FEATURES is an extension of BidLSTM_CRF that allows the use of additional layout features
  • BERT_CRF is a BERT-based model obtained by fine-tuning a SciBERT encoder; as in the other models, the final classification layer is a CRF
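
As a point of reference for the BidLSTM_CRF description above, here is a generic, minimal sketch of a BiLSTM-CRF tagger in PyTorch using the pytorch-crf package. It only illustrates the architecture family (embeddings, bidirectional LSTM, CRF output layer); it is not the code used by grobid-quantities, and all dimensions are arbitrary.

```python
# Generic BiLSTM-CRF sketch (not grobid-quantities' actual implementation).
# Requires: pip install torch pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF


class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)  # CRF output layer

    def forward(self, tokens, tags=None, mask=None):
        # tokens: (batch, seq_len) token ids; tags: (batch, seq_len) label ids
        emissions = self.hidden2tag(self.lstm(self.embedding(tokens))[0])
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi decoding of the most likely tag sequence
        return self.crf.decode(emissions, mask=mask)
```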

Results

The evaluation was performed on the holdout set of the grobid-quantities dataset. Average values are computed as micro averages. To reproduce it, see evaluation_doc.
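
For reference, a minimal sketch of what "micro average" is taken to mean here: true/false positives and false negatives are pooled across all labels before computing precision, recall and F1, so frequent labels weigh more than rare ones. The counts below are placeholders, not actual evaluation output.

```python
# Micro-averaged precision/recall/F1 from pooled per-label counts (sketch).
def micro_prf(counts):
    """counts: {label: (tp, fp, fn)} -> (precision, recall, f1), in [0, 1]."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts for two labels
print(micro_prf({"<unitLeft>": (400, 40, 64), "<valueList>": (6, 16, 47)}))
```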

Quantities

| Labels          | CRF P | CRF R | CRF F1 | BERT_CRF P | BERT_CRF R | BERT_CRF F1 | BERT_CRF St.dev | Support |
|-----------------|-------|-------|--------|------------|------------|-------------|-----------------|---------|
| <unitLeft>      | 90.26 | 83.84 | 86.93  | 93.13      | 89.96      | 91.52       | 0.0086          | 464     |
| <unitRight>     | 36.36 | 30.77 | 33.33  | 23.67      | 40.00      | 29.70       | 0.0139          | 13      |
| <valueAtomic>   | 75.75 | 77.97 | 76.84  | 85.46      | 87.99      | 86.70       | 0.0041          | 581     |
| <valueBase>     | 80.77 | 60.00 | 68.85  | 98.75      | 90.29      | 94.33       | 0.0163          | 35      |
| <valueLeast>    | 76.24 | 61.11 | 67.84  | 84.58      | 72.22      | 77.91       | 0.0212          | 126     |
| <valueList>     | 27.27 | 11.32 | 16.00  | 61.10      | 39.62      | 47.79       | 0.0262          | 53      |
| <valueMost>     | 68.35 | 55.67 | 61.36  | 78.93      | 71.75      | 75.16       | 0.0179          | 97      |
| <valueRange>    | 91.18 | 88.57 | 89.86  | 100.00     | 91.43      | 95.52       | 0.0000          | 35      |
| All (micro avg) | 79.49 | 73.72 | 76.50  | 86.50      | 83.97      | 85.22       | 0.0031          | 1404    |
| Labels          | BidLSTM_CRF P | BidLSTM_CRF R | BidLSTM_CRF F1 | BidLSTM_CRF St.dev | BidLSTM_CRF_FEATURES P | BidLSTM_CRF_FEATURES R | BidLSTM_CRF_FEATURES F1 | BidLSTM_CRF_FEATURES St.dev | Support |
|-----------------|---------------|---------------|----------------|--------------------|------------------------|------------------------|-------------------------|-----------------------------|---------|
| <unitLeft>      | 87.58         | 89.96         | 88.75          | 0.0074             | 86.95                  | 89.57                  | 88.24                   | 0.0097                      | 464     |
| <unitRight>     | 25.01         | 30.77         | 27.50          | 0.0193             | 23.99                  | 30.77                  | 26.91                   | 0.0146                      | 13      |
| <valueAtomic>   | 79.52         | 85.71         | 82.49          | 0.0044             | 78.33                  | 86.57                  | 82.24                   | 0.0062                      | 581     |
| <valueBase>     | 83.84         | 97.14         | 89.97          | 0.0185             | 80.99                  | 97.14                  | 88.32                   | 0.0115                      | 35      |
| <valueLeast>    | 83.79         | 62.38         | 71.45          | 0.0294             | 84.37                  | 60.00                  | 70.06                   | 0.0335                      | 126     |
| <valueList>     | 80.12         | 13.58         | 23.05          | 0.0326             | 69.29                  | 14.34                  | 23.37                   | 0.0715                      | 53      |
| <valueMost>     | 75.91         | 70.92         | 73.22          | 0.0311             | 75.54                  | 67.01                  | 70.99                   | 0.0370                      | 97      |
| <valueRange>    | 92.87         | 94.86         | 93.84          | 0.0783             | 95.58                  | 97.14                  | 96.35                   | 0.0673                      | 35      |
| All (micro avg) | 82.12         | 81.28         | 81.70          | 0.0048             | 81.26                  | 81.11                  | 81.19                   | 0.0090                      | 1404    |

Units

Units were evaluated using the UNISCOR dataset. For more information, see the UNISCOR section.

| Labels          | CRF P | CRF R | CRF F1 | BERT_CRF P | BERT_CRF R | BERT_CRF F1 | BERT_CRF St.dev | Support |
|-----------------|-------|-------|--------|------------|------------|-------------|-----------------|---------|
| <base>          | 80.64 | 82.71 | 81.66  | 73.63      | 76.26      | 74.89       | 0.0231          | 3228    |
| <pow>           | 71.94 | 74.34 | 73.12  | 80.20      | 57.35      | 66.75       | 0.0752          | 1773    |
| <prefix>        | 92.60 | 86.48 | 89.43  | 77.61      | 88.05      | 82.12       | 0.0338          | 1287    |
| All (micro avg) | 80.39 | 81.12 | 80.76  | 75.55      | 73.34      | 74.41       | 0.0178          | 6288    |
| Labels          | BidLSTM_CRF P | BidLSTM_CRF R | BidLSTM_CRF F1 | BidLSTM_CRF St.dev | BidLSTM_CRF_FEATURES P | BidLSTM_CRF_FEATURES R | BidLSTM_CRF_FEATURES F1 | BidLSTM_CRF_FEATURES St.dev | Support |
|-----------------|---------------|---------------|----------------|--------------------|------------------------|------------------------|-------------------------|-----------------------------|---------|
| <base>          | 52.17         | 46.16         | 48.93          | 0.0494             | 51.99                  | 48.00                  | 49.88                   | 0.0259                      | 3228    |
| <pow>           | 94.25         | 56.89         | 70.94          | 0.0125             | 94.20                  | 56.92                  | 70.96                   | 0.0062                      | 1773    |
| <prefix>        | 81.36         | 82.88         | 82.01          | 0.0119             | 82.11                  | 82.94                  | 82.43                   | 0.0201                      | 1287    |
| All (micro avg) | 68.12         | 56.70         | 61.85          | 0.0282             | 67.76                  | 57.67                  | 62.29                   | 0.0173                      | 6288    |

Values

| Labels          | CRF P  | CRF R  | CRF F1 | BERT_CRF P | BERT_CRF R | BERT_CRF F1 | BERT_CRF St.dev | Support |
|-----------------|--------|--------|--------|------------|------------|-------------|-----------------|---------|
| <alpha>         | 96.90  | 99.21  | 98.04  | 99.21      | 99.37      | 99.29       | 0.0017          | 464     |
| <base>          | 100.00 | 92.31  | 96.00  | 100.00     | 100.00     | 100.00      | 0.0000          | 13      |
| <number>        | 99.14  | 99.63  | 99.38  | 99.43      | 99.46      | 99.44       | 0.0005          | 581     |
| <pow>           | 100.00 | 100.00 | 100.00 | 100.00     | 100.00     | 100.00      | 0.0000          | 35      |
| All (micro avg) | 98.86  | 99.48  | 99.17  | 99.42      | 99.46      | 99.44       | 0.0004          | 1093    |
| Labels          | BidLSTM_CRF P | BidLSTM_CRF R | BidLSTM_CRF F1 | BidLSTM_CRF St.dev | BidLSTM_CRF_FEATURES P | BidLSTM_CRF_FEATURES R | BidLSTM_CRF_FEATURES F1 | BidLSTM_CRF_FEATURES St.dev | Support |
|-----------------|---------------|---------------|----------------|--------------------|------------------------|------------------------|-------------------------|-----------------------------|---------|
| <alpha>         | 97.82         | 99.53         | 98.66          | 0.0035             | 93.13                  | 89.96                  | 91.52                   | 0.0086                      | 464     |
| <base>          | 97.78         | 67.69         | 79.46          | 0.0937             | 23.67                  | 40.00                  | 29.70                   | 0.0139                      | 13      |
| <number>        | 98.92         | 99.33         | 99.13          | 0.0008             | 85.46                  | 87.99                  | 86.70                   | 0.0041                      | 581     |
| <pow>           | 69.11         | 73.85         | 71.29          | 0.1456             | 98.75                  | 90.29                  | 94.33                   | 0.0163                      | 35      |
| All (micro avg) | 98.34         | 98.59         | 98.47          | 0.0023             | 86.50                  | 83.97                  | 85.22                   | 0.0031                      | 1093    |

Other published results

Note: the paper "Automatic Identification and Normalisation of Physical Measurements in Scientific Literature", published in September 2019, reported macro-averaged evaluation scores.
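
For comparison with the micro averages reported above, a macro average is the unweighted mean of the per-label scores, so rare labels count as much as frequent ones. A minimal sketch, using three per-label F1 scores from the Quantities CRF column above as an example:

```python
# Macro-averaged F1: unweighted mean of per-label F1 scores (sketch).
from statistics import mean

per_label_f1 = {"<unitLeft>": 86.93, "<valueAtomic>": 76.84, "<valueList>": 16.00}
print(f"macro-averaged F1 = {mean(per_label_f1.values()):.2f}")
```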