Evaluation scores
End-to-end evaluation
The end-to-end evaluation was performed with the MeasEval dataset (SemEval-2021 Task 8). The scores in the following table are micro averages. MeasEval was annotated to allow approximated entities, which are not supported in grobid-quantities.
Type (Ref) | Matching method | Precision | Recall | F1-score | Support |
---|---|---|---|---|---|
Quantities | strict | 54.82 | 60.50 | 57.52 | 1246 |
Quantities | soft | 66.93 | 73.87 | 70.23 | 1246 |
Quantified substance | strict | 13.10 | 10.58 | 11.71 | 710 |
Quantified substance | soft | 23.38 | 18.89 | 20.89 | 710 |
Note: Measured Entity (ME) recognition is still experimental in Grobid-quantities.
To reproduce the end-to-end evaluation, run the scripts/measeval_e2e_eval.py script (use requirements.txt to install the correct dependencies).
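The tables above distinguish "strict" matching (exact entity boundaries) from "soft" matching (any overlap with a gold entity). A minimal sketch of the two matching modes and the resulting micro-averaged scores — this is an illustration, not the actual MeasEval scorer:

```python
# Illustrative strict vs. soft span matching for entity-extraction
# evaluation. Spans are (start, end) character offsets; this is NOT
# the official MeasEval evaluation script.

def matches(pred, gold, mode):
    """strict: exact boundaries must coincide; soft: any character overlap."""
    if mode == "strict":
        return pred == gold
    return pred[0] < gold[1] and gold[0] < pred[1]  # overlap test

def micro_prf(preds, golds, mode="strict"):
    # True positives: predicted spans that match at least one gold span.
    tp = sum(any(matches(p, g, mode) for g in golds) for p in preds)
    fp = len(preds) - tp
    fn = sum(not any(matches(p, g, mode) for p in preds) for g in golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

golds = [(0, 5), (10, 14)]
preds = [(0, 5), (11, 16)]                # second span only overlaps
print(micro_prf(preds, golds, "strict"))  # (0.5, 0.5, 0.5)
print(micro_prf(preds, golds, "soft"))    # (1.0, 1.0, 1.0)
```

Soft matching always yields scores at least as high as strict matching, which is the pattern visible in the table above.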
Machine Learning Named Entity Recognition Evaluation
The scores (P: Precision, R: Recall, F1: F1-score) for all models are computed either via 10-fold cross-validation or on a holdout dataset. The holdout dataset of Grobid-quantities is composed of the following examples:
- Quantities ML: 10 articles
- Units ML: UNISCOR dataset with around 1600 examples
- Values ML: 950 examples
For the deep learning models (BidLSTM_CRF, BidLSTM_CRF_FEATURES, and BERT_CRF), we provide the average over 5 runs.
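The 5-run aggregation can be sketched as follows; the run values below are hypothetical, and the population standard deviation is used here (the reported figures may use the sample variant):

```python
# Sketch of aggregating per-run F1 scores into the reported mean and
# standard deviation. The run values are made up for illustration.
from statistics import mean, pstdev

run_f1 = [0.9145, 0.9160, 0.9138, 0.9171, 0.9146]  # 5 training runs
print(round(mean(run_f1), 4))    # 0.9152
print(round(pstdev(run_f1), 4))  # 0.0012
```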
The models are organised as follows:
- BidLSTM_CRF is an RNN model based on the work of (Lample et al., 2016), with a CRF as the final layer
- BidLSTM_CRF_FEATURES is an extension of BidLSTM_CRF that allows the use of layout features
- BERT_CRF is a BERT-based model obtained by fine-tuning a SciBERT encoder. Like the others, it uses a CRF as the final layer.
Results
The evaluation was performed on the holdout dataset of the grobid-quantities corpus. Average values are computed as micro averages. To reproduce it, see evaluation_doc.
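Micro averaging pools the true/false positive and false negative counts over all labels before computing P/R/F1, so frequent labels dominate the average. A minimal sketch with illustrative counts (not the real ones):

```python
# Minimal sketch of micro-averaging: pool TP/FP/FN across all labels
# before computing precision, recall, and F1. Counts are illustrative.

def micro_average(counts):
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

counts = {
    "<unitLeft>":    {"tp": 400, "fp": 40, "fn": 64},
    "<valueAtomic>": {"tp": 500, "fp": 90, "fn": 81},
}
p, r, f1 = micro_average(counts)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8738 0.8612 0.8675
```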
Quantities
Labels | CRF | | | BERT_CRF | | | | Support
---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | P | R | F1 | St.dev |
<unitLeft> | 90.26 | 83.84 | 86.93 | 93.13 | 89.96 | 91.52 | 0.0086 | 464
<unitRight> | 36.36 | 30.77 | 33.33 | 23.67 | 40.00 | 29.70 | 0.0139 | 13
<valueAtomic> | 75.75 | 77.97 | 76.84 | 85.46 | 87.99 | 86.70 | 0.0041 | 581
<valueBase> | 80.77 | 60.00 | 68.85 | 98.75 | 90.29 | 94.33 | 0.0163 | 35
<valueLeast> | 76.24 | 61.11 | 67.84 | 84.58 | 72.22 | 77.91 | 0.0212 | 126
<valueList> | 27.27 | 11.32 | 16.00 | 61.10 | 39.62 | 47.79 | 0.0262 | 53
<valueMost> | 68.35 | 55.67 | 61.36 | 78.93 | 71.75 | 75.16 | 0.0179 | 97
<valueRange> | 91.18 | 88.57 | 89.86 | 100.00 | 91.43 | 95.52 | 0.0000 | 35
All (micro avg) | 79.49 | 73.72 | 76.50 | 86.50 | 83.97 | 85.22 | 0.0031 | 1404
Labels | BidLSTM_CRF | | | | BidLSTM_CRF_FEATURES | | | | Support
---|---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | St.dev | P | R | F1 | St.dev |
<unitLeft> | 87.58 | 89.96 | 88.75 | 0.0074 | 86.95 | 89.57 | 88.24 | 0.0097 | 464
<unitRight> | 25.01 | 30.77 | 27.50 | 0.0193 | 23.99 | 30.77 | 26.91 | 0.0146 | 13
<valueAtomic> | 79.52 | 85.71 | 82.49 | 0.0044 | 78.33 | 86.57 | 82.24 | 0.0062 | 581
<valueBase> | 83.84 | 97.14 | 89.97 | 0.0185 | 80.99 | 97.14 | 88.32 | 0.0115 | 35
<valueLeast> | 83.79 | 62.38 | 71.45 | 0.0294 | 84.37 | 60.00 | 70.06 | 0.0335 | 126
<valueList> | 80.12 | 13.58 | 23.05 | 0.0326 | 69.29 | 14.34 | 23.37 | 0.0715 | 53
<valueMost> | 75.91 | 70.92 | 73.22 | 0.0311 | 75.54 | 67.01 | 70.99 | 0.0370 | 97
<valueRange> | 92.87 | 94.86 | 93.84 | 0.0783 | 95.58 | 97.14 | 96.35 | 0.0673 | 35
All (micro avg) | 82.12 | 81.28 | 81.70 | 0.0048 | 81.26 | 81.11 | 81.19 | 0.0090 | 1404
Units
Units were evaluated using the UNISCOR dataset. For more information, see the UNISCOR section.
Labels | CRF | | | BERT_CRF | | | | Support
---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | P | R | F1 | St.dev |
<base> | 80.64 | 82.71 | 81.66 | 73.63 | 76.26 | 74.89 | 0.0231 | 3228
<pow> | 71.94 | 74.34 | 73.12 | 80.20 | 57.35 | 66.75 | 0.0752 | 1773
<prefix> | 92.60 | 86.48 | 89.43 | 77.61 | 88.05 | 82.12 | 0.0338 | 1287
All (micro avg) | 80.39 | 81.12 | 80.76 | 75.55 | 73.34 | 74.41 | 0.0178 | 6288
Labels | BidLSTM_CRF | | | | BidLSTM_CRF_FEATURES | | | | Support
---|---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | St.dev | P | R | F1 | St.dev |
<base> | 52.17 | 46.16 | 48.93 | 0.0494 | 51.99 | 48.00 | 49.88 | 0.0259 | 3228
<pow> | 94.25 | 56.89 | 70.94 | 0.0125 | 94.20 | 56.92 | 70.96 | 0.0062 | 1773
<prefix> | 81.36 | 82.88 | 82.01 | 0.0119 | 82.11 | 82.94 | 82.43 | 0.0201 | 1287
All (micro avg) | 68.12 | 56.70 | 61.85 | 0.0282 | 67.76 | 57.67 | 62.29 | 0.0173 | 6288
Values
Labels | CRF | | | BERT_CRF | | | | Support
---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | P | R | F1 | St.dev |
<alpha> | 96.90 | 99.21 | 98.04 | 99.21 | 99.37 | 99.29 | 0.0017 | 464
<base> | 100.00 | 92.31 | 96.00 | 100.00 | 100.00 | 100.00 | 0.0000 | 13
<number> | 99.14 | 99.63 | 99.38 | 99.43 | 99.46 | 99.44 | 0.0005 | 581
<pow> | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 0.0000 | 35
All (micro avg) | 98.86 | 99.48 | 99.17 | 99.42 | 99.46 | 99.44 | 0.0004 | 1093
Labels | BidLSTM_CRF | | | | BidLSTM_CRF_FEATURES | | | | Support
---|---|---|---|---|---|---|---|---|---
Metrics | P | R | F1 | St.dev | P | R | F1 | St.dev |
<alpha> | 97.82 | 99.53 | 98.66 | 0.0035 | 93.13 | 89.96 | 91.52 | 0.0086 | 464
<base> | 97.78 | 67.69 | 79.46 | 0.0937 | 23.67 | 40.00 | 29.70 | 0.0139 | 13
<number> | 98.92 | 99.33 | 99.13 | 0.0008 | 85.46 | 87.99 | 86.70 | 0.0041 | 581
<pow> | 69.11 | 73.85 | 71.29 | 0.1456 | 98.75 | 90.29 | 94.33 | 0.0163 | 35
All (micro avg) | 98.34 | 98.59 | 98.47 | 0.0023 | 86.50 | 83.97 | 85.22 | 0.0031 | 1093
Other published results
The paper "Automatic Identification and Normalisation of Physical Measurements in Scientific Literature", published in September 2019, reported macro-averaged evaluation scores.
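Macro-averaged scores are not directly comparable with the micro averages reported above: a macro average weights every label equally, so rare labels count as much as frequent ones. A small sketch with illustrative per-label F1 values:

```python
# Sketch contrasting macro averaging (used in the cited paper) with
# the micro averaging used in the tables above. The per-label F1
# values are illustrative only.

def macro_f1(per_label_f1):
    # Macro: unweighted mean over labels; a poorly performing rare
    # label drags the average down as much as a frequent one would.
    return sum(per_label_f1) / len(per_label_f1)

print(round(macro_f1([0.92, 0.30, 0.87]), 4))  # 0.6967
```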