Grobid-quantities is a Java application, based on Grobid (GeneRation Of BIbliographic Data), a machine learning framework for parsing and structuring raw documents such as PDF or plain text. Grobid-quantities is designed for large-scale processing tasks in batch or via a web REST API.
The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task.
The models are trained using Conditional Random Fields (CRF) and recurrent neural networks (RNN), the latter in the form of a bidirectional LSTM with a CRF as activation layer (BidLSTM_CRF).
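The cascade idea can be pictured as a chain of stages, where each stage handles only its own task and passes its output to the next one. The sketch below is purely illustrative: the stage names, the toy lexicon and the regex-based logic are hypothetical stand-ins for trained models, not grobid-quantities code.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-ins for trained sequence-labelling models: each stage
# solves one narrow task and hands its output to the next stage.


def find_measurements(text: str) -> List[Dict]:
    """Stage 1: locate raw measurement mentions (value followed by unit)."""
    pattern = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z]+)")
    return [dict(m.groupdict()) for m in pattern.finditer(text)]


def parse_values(measurements: List[Dict]) -> List[Dict]:
    """Stage 2: turn each raw value string into a number."""
    return [{**m, "parsed": float(m["value"])} for m in measurements]


def attach_unit_types(measurements: List[Dict]) -> List[Dict]:
    """Stage 3: look up the measurement type in a tiny toy lexicon."""
    lexicon = {"g": "mass", "km": "length", "s": "time"}
    return [{**m, "type": lexicon.get(m["unit"], "unknown")} for m in measurements]


def run_cascade(text: str, stages: List[Callable]):
    """Feed the output of each stage into the next one."""
    result = text
    for stage in stages:
        result = stage(result)
    return result


results = run_cascade(
    "We added 10 g and walked 3.5 km.",
    [find_measurements, parse_values, attach_unit_types],
)
```

Keeping each stage small makes it possible to retrain or replace one model without touching the others, which is the main appeal of the cascade design.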
- quantities are modelled using three different types: atomic values in case of single measurements (e.g., 10 grams), intervals (from 3 to 5 km) and ranges (100 +- 4) for continuous values, and lists of discrete values where the measurement unit is shared.
- units are decomposed and restructured. Complementary information, such as the unit system and the type of measurement, is attached by lookup in an internal lexicon.
- values are parsed, supporting different representations:
  - numeric
  - alphabetic
  - power of 10 (1.5 x 10^-5)
  - date/time expressions
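To make the value representations above more concrete, here is a minimal sketch of how numeric, alphabetic and power-of-10 strings might be mapped to numbers. It is a toy under stated assumptions: `parse_value`, `parse_alphabetic` and the small word tables are hypothetical, date/time expressions are left out, and the actual module relies on trained models rather than these hand-written rules.

```python
import re

# Hypothetical word tables; they cover only a handful of English number words.
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}

# Matches strings such as "1.5 x 10^-5" (also accepts "×" or "*").
POWER_OF_TEN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*[x×*]\s*10\s*\^\s*(-?\d+)\s*$"
)


def parse_alphabetic(text: str) -> float:
    """Convert a simple spelled-out number such as 'two thousand'."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unsupported number word: {word!r}")
    return float(total + current)


def parse_value(text: str) -> float:
    """Try power-of-10 notation, then plain numerals, then number words."""
    match = POWER_OF_TEN.match(text)
    if match:
        base, exponent = match.groups()
        # Build scientific notation ("1.5e-5") and let float() parse it.
        return float(f"{base}e{exponent}")
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return parse_alphabetic(text)
```

For instance, `parse_value("1.5 x 10^-5")`, `parse_value("1,000")` and `parse_value("two thousand")` each return a float, while an unsupported word raises `ValueError`.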
The identified measurements are normalised to the International System of Units (SI) using the Java library Units of Measurement.
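As a rough illustration of what normalising to SI involves, the following sketch multiplies a parsed value by a conversion factor looked up in a small table. The table, the `Quantity` class and the `normalise` function are hypothetical and hand-written for this example; grobid-quantities delegates this step to the Units of Measurement library.

```python
from dataclasses import dataclass

# Hypothetical lexicon: unit symbol -> (SI unit, multiplicative factor).
SI_FACTORS = {
    "km": ("m", 1000.0),
    "cm": ("m", 0.01),
    "g": ("kg", 0.001),
    "min": ("s", 60.0),
    "h": ("s", 3600.0),
}


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str


def normalise(quantity: Quantity) -> Quantity:
    """Convert a quantity to its SI unit using the toy factor table."""
    if quantity.unit not in SI_FACTORS:
        raise KeyError(f"no SI conversion known for unit {quantity.unit!r}")
    si_unit, factor = SI_FACTORS[quantity.unit]
    return Quantity(quantity.value * factor, si_unit)


# An interval such as "from 3 to 5 km" is normalised bound by bound.
lower = normalise(Quantity(3, "km"))
upper = normalise(Quantity(5, "km"))
```

A purely multiplicative table like this one cannot handle units with an offset, such as degrees Celsius, which is one reason to rely on a dedicated units library in practice.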
Grobid-quantities also contains a module implementing the identification of the “quantified” object/substance related to the measure. This module is still experimental.
The following screenshot illustrates an example of a measurement that is extracted, parsed and normalised; the quantified substance, streptomycin, is additionally recognised:
If you contribute to grobid-quantities, you agree to share your contribution following these licenses.
The References page contains citations, acknowledgements and reference resources related to the project.