Grobid-quantities is a Java application, based on Grobid (GeneRation Of BIbliographic Data), a machine learning framework for parsing and structuring raw documents such as PDF or plain text. Grobid-quantities is designed for large-scale processing tasks in batch or via a web REST API.

The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task.

Grobid-quantities cascade schema

The models are trained using either Conditional Random Fields (CRF) or recurrent neural networks (RNN); the RNN models use a bidirectional LSTM with a CRF as the final activation layer (BidLSTM_CRF).

Quantities are modelled using three different types:
  1. atomic values in case of single measurements (e.g., 10 grams),
  2. intervals (e.g., from 3 to 5 km) and ranges (e.g., 100 ± 4) for continuous values, and
  3. lists of discrete values where the measurement unit is shared.
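
To make the three types concrete, here is a minimal Python sketch of how such a data model could look. It is not the project's actual Java model: the class and field names (`Quantity`, `AtomicMeasurement`, `least`, `most`, `base`, `spread`) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Quantity:
    """One parsed value with its optional raw unit (hypothetical structure)."""
    value: float
    unit: Optional[str] = None


@dataclass
class AtomicMeasurement:
    """A single measurement, e.g. '10 grams'."""
    quantity: Quantity


@dataclass
class IntervalMeasurement:
    """Continuous values given by two bounds, e.g. 'from 3 to 5 km'."""
    least: Quantity
    most: Quantity


@dataclass
class RangeMeasurement:
    """Continuous values given as a base and a spread, e.g. '100 +- 4'."""
    base: Quantity
    spread: Quantity

    def bounds(self) -> Tuple[float, float]:
        # Convert base/spread into explicit lower and upper bounds.
        return (self.base.value - self.spread.value,
                self.base.value + self.spread.value)


@dataclass
class ListMeasurement:
    """Discrete values that share one unit, e.g. '2, 3 and 5 mm'."""
    quantities: List[Quantity]


atomic = AtomicMeasurement(Quantity(10.0, "g"))
interval = IntervalMeasurement(Quantity(3.0, "km"), Quantity(5.0, "km"))
value_range = RangeMeasurement(Quantity(100.0), Quantity(4.0))
values = ListMeasurement([Quantity(v, "mm") for v in (2.0, 3.0, 5.0)])
```

Keeping intervals and ranges as separate classes preserves how the value was written in the text, while `bounds()` shows that a range can still be reduced to an interval when needed.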

Units are decomposed and restructured. Complementary information, such as the unit system and the type of measurement, is attached by lookup in an internal lexicon.
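
As a rough illustration of decomposition followed by lexicon lookup, the Python sketch below splits a raw unit string into (prefix, base, power) triples and attaches lexicon entries. The lexicon content, the prefix table, the accepted syntax (`/`, `*`, `^`) and the function names are all assumptions made for this example; the actual internal lexicon and unit parser are far richer.

```python
import re

# Tiny illustrative lexicon (assumed content, not the project's lexicon).
LEXICON = {
    "m": {"system": "SI", "type": "length"},
    "g": {"system": "SI", "type": "mass"},
    "s": {"system": "SI", "type": "time"},
    "h": {"system": "non-SI", "type": "time"},
}
# Prefix symbol mapped to its power of ten (subset, for illustration).
PREFIXES = {"k": 3, "c": -2, "m": -3}

TOKEN = re.compile(r"([a-zA-Z]+)(?:\^(-?\d+))?")


def decompose(unit):
    """Split a unit such as 'km/h' or 'm/s^2' into (prefix, base, power) triples."""
    parts = []
    numerator, _, denominator = unit.partition("/")
    for side, sign in ((numerator, 1), (denominator, -1)):
        for token in filter(None, (t.strip() for t in side.split("*"))):
            match = TOKEN.fullmatch(token)
            if match is None:
                raise ValueError(f"cannot parse unit token: {token!r}")
            symbol = match.group(1)
            power = int(match.group(2) or 1)
            if symbol in LEXICON:
                # Exact hit first: 'm' is the metre, not the milli prefix.
                prefix, base = "", symbol
            elif symbol[0] in PREFIXES and symbol[1:] in LEXICON:
                prefix, base = symbol[0], symbol[1:]
            else:
                raise ValueError(f"unknown unit: {symbol!r}")
            parts.append((prefix, base, sign * power))
    return parts


def describe(unit):
    """Attach the lexicon information to each decomposed part."""
    return [
        {"prefix": prefix, "base": base, "power": power, **LEXICON[base]}
        for prefix, base, power in decompose(unit)
    ]


speed = describe("km/h")
```

Here `describe("km/h")` yields one entry for the kilometre (prefix `k`, base `m`, power 1) and one for the hour (power -1), each carrying the system and measurement type found in the toy lexicon.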

Values are parsed, supporting different representations:
  1. numeric (2, 1000),
  2. alphabetic (two, thousand),
  3. power of 10 (1.5 x 10^-5),
  4. date/time expressions
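
The first three representations can be illustrated with a small Python sketch. This is a simplified, rule-based stand-in written for this page, not the project's parser: the word tables, the regular expression and the `parse_value` function are assumptions, and date/time expressions are deliberately left out.

```python
import re

# Small word tables for the alphabetic form (illustrative subset only).
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
MULTIPLIERS = {"hundred": 100, "thousand": 1000, "million": 1000000}

# Matches forms such as '1.5 x 10^-5' (also accepts '×' or '*').
POWER_OF_TEN = re.compile(r"(-?\d+(?:\.\d+)?)\s*[x×*]\s*10\s*\^\s*(-?\d+)")


def parse_value(text):
    """Parse a numeric, alphabetic or power-of-10 expression into a float."""
    cleaned = text.strip().lower()

    # Power of 10: mantissa x 10^exponent.
    match = POWER_OF_TEN.fullmatch(cleaned)
    if match:
        return float(match.group(1)) * 10 ** int(match.group(2))

    # Plain numeric form.
    try:
        return float(cleaned)
    except ValueError:
        pass

    # Alphabetic form: digit words are added, multipliers scale the total.
    words = cleaned.split()
    if not words:
        raise ValueError("empty value")
    total = 0
    for word in words:
        if word in WORDS:
            total += WORDS[word]
        elif word in MULTIPLIERS:
            total = max(total, 1) * MULTIPLIERS[word]
        else:
            raise ValueError(f"unsupported value: {text!r}")
    return float(total)


parsed = [parse_value(s) for s in ("2", "1000", "two", "thousand", "1.5 x 10^-5")]
```

The alphabetic branch only handles simple cases such as "two" or "two thousand"; compound number words would need a more careful grammar.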

The identified measurements are normalised to the International System of Units (SI) using the Java library Units of Measurement.
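
Conceptually, normalisation maps each parsed value and unit to an SI unit through a conversion factor. The Python sketch below shows only that idea with a hand-written table; it does not use or reproduce the Units of Measurement library, and the `TO_SI` table and `normalise` function are hypothetical names introduced for this example.

```python
# Source unit mapped to (SI unit, multiplicative factor). Illustrative subset.
TO_SI = {
    "g": ("kg", 0.001),
    "kg": ("kg", 1.0),
    "km": ("m", 1000.0),
    "m": ("m", 1.0),
    "min": ("s", 60.0),
    "h": ("s", 3600.0),
}


def normalise(value, unit):
    """Return (value expressed in the SI unit, SI unit symbol)."""
    try:
        si_unit, factor = TO_SI[unit]
    except KeyError:
        raise ValueError(f"no SI conversion known for unit {unit!r}") from None
    return value * factor, si_unit


distance = normalise(5, "km")   # kilometres to metres
duration = normalise(2, "h")    # hours to seconds
mass = normalise(10, "g")       # grams to kilograms
```

A factor table like this covers only purely multiplicative conversions; units with an offset (for example degrees Celsius to kelvin) need a richer conversion model, which is one reason to rely on a dedicated library.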

Grobid-quantities also contains a module implementing the identification of the “quantified” object/substance related to the measure. This module is still experimental.

The following screenshot illustrates an example of a measurement that is extracted, parsed and normalised; the quantified substance, streptomycin, is additionally recognised:

Grobid-quantities extraction from text


Contact: Patrice Lopez, Luca Foppiano


GROBID and grobid-quantities are distributed under the Apache 2.0 license. The documentation is distributed under the CC0 license. The annotated data are licensed under CC BY 4.0.

If you contribute to grobid-quantities, you agree to share your contribution following these licenses.

The References page contains citations, acknowledgements and reference resources related to the project.