Introduction

Grobid-quantities is a Java application, based on Grobid (GeneRation Of BIbliographic Data), a machine learning framework for parsing and structuring raw documents such as PDF or plain text. Grobid-quantities is designed for large-scale processing tasks in batch or via a web REST API.

The machine learning engine follows a cascade architecture, in which each model specialises in a specific task. The models are trained using the CRF (Conditional Random Fields) algorithm.

Quantities are modelled using three different types:
  1. atomic values, for single measurements (e.g. 10 grams),
  2. intervals (e.g. from 3 to 5 km) and ranges (e.g. 100 ± 4), for continuous values, and
  3. lists of discrete values.
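
To make the three quantity types concrete, here is a minimal Python sketch of how they could be represented. The class and field names (`Quantity`, `AtomicValue`, `Interval`, `Range`, `ValueList`, `bounds`) are hypothetical and chosen for illustration only. They are not the classes used inside grobid-quantities.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Quantity:
    """A numeric value with its raw unit string (hypothetical shape)."""
    value: float
    unit: str = ""


@dataclass
class AtomicValue:
    """A single measurement, e.g. '10 grams'."""
    quantity: Quantity


@dataclass
class Interval:
    """Continuous values between two bounds, e.g. 'from 3 to 5 km'."""
    least: Quantity
    most: Quantity


@dataclass
class Range:
    """A base value with a spread around it, e.g. '100 ± 4'."""
    base: Quantity
    spread: Quantity

    def bounds(self) -> Tuple[float, float]:
        """Return the lowest and highest values covered by the range."""
        return (self.base.value - self.spread.value,
                self.base.value + self.spread.value)


@dataclass
class ValueList:
    """Several discrete values, e.g. '10 and 20 mg'."""
    quantities: List[Quantity]


# One instance per type, mirroring the examples given in the list above.
examples = [
    AtomicValue(Quantity(10, "grams")),
    Interval(Quantity(3, "km"), Quantity(5, "km")),
    Range(Quantity(100), Quantity(4)),
    ValueList([Quantity(10, "mg"), Quantity(20, "mg")]),
]
```

Keeping `Range` separate from `Interval` preserves the way the value was written in the text. The `bounds` helper converts a range to explicit limits when needed.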

Units are decomposed and restructured. Complementary information, such as the unit system and the type of measurement, is attached by lookup in an internal lexicon.
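
The idea of decomposing a unit and enriching it from a lexicon can be sketched as follows. This is an illustration under stated assumptions, not the grobid-quantities implementation: the `LEXICON` entries, the `UnitInfo` type and the `decompose` function are hypothetical, and the parser only handles simple expressions with at most one `/`.

```python
import re
from typing import List, NamedTuple, Tuple


class UnitInfo(NamedTuple):
    """Complementary information attached to a unit symbol (hypothetical)."""
    system: str
    measurement_type: str


# Hypothetical mini-lexicon; a real one would be far larger.
LEXICON = {
    "m": UnitInfo("SI base", "length"),
    "s": UnitInfo("SI base", "time"),
    "kg": UnitInfo("SI base", "mass"),
    "ft": UnitInfo("imperial", "length"),
}

# A unit symbol optionally followed by an integer exponent, e.g. 's^2'.
TOKEN = re.compile(r"([A-Za-z]+)(?:\^(-?\d+))?")


def decompose(unit: str) -> List[Tuple[str, int]]:
    """Split a unit such as 'm/s^2' into (symbol, exponent) pairs."""
    numerator, _, denominator = unit.partition("/")
    parts = []
    # Symbols after the slash get a negated exponent.
    for side, sign in ((numerator, 1), (denominator, -1)):
        for symbol, power in TOKEN.findall(side):
            parts.append((symbol, sign * int(power or 1)))
    return parts


def describe(unit: str) -> List[Tuple[str, int, UnitInfo]]:
    """Attach lexicon information to every known component of the unit."""
    return [(symbol, power, LEXICON[symbol])
            for symbol, power in decompose(unit)
            if symbol in LEXICON]
```

For instance, `decompose("m/s^2")` yields the pairs `("m", 1)` and `("s", -2)`, and `describe` then adds the unit system and measurement type for each symbol found in the lexicon.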

Values are parsed, supporting different representations:
  1. numeric (2, 1000),
  2. alphabetic (two, thousand),
  3. power of 10 (1.5 x 10^-5),
  4. date/time expressions.
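
As a rough illustration of handling several value representations, the sketch below parses numeric, alphabetic and power-of-10 strings into a float. It is a simplified assumption of how such parsing could work, not the parser used by grobid-quantities. The names `NUMBER_WORDS`, `POWER_OF_TEN` and `parse_value` are hypothetical, the word list is deliberately tiny, and date/time expressions are not covered.

```python
import re

# Tiny illustrative word list; compound forms like 'two thousand' are not handled.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "hundred": 100, "thousand": 1000, "million": 1_000_000,
}

# Matches forms such as '1.5 x 10^-5' (the input is lowercased before matching).
POWER_OF_TEN = re.compile(
    r"^([+-]?\d+(?:\.\d+)?)\s*[x×*]\s*10\s*\^\s*([+-]?\d+)$"
)


def parse_value(raw: str) -> float:
    """Convert a raw value string into a float.

    Raises ValueError when the string matches none of the supported forms.
    """
    text = raw.strip().lower()

    # Power of 10: mantissa multiplied by ten raised to an integer exponent.
    match = POWER_OF_TEN.match(text)
    if match:
        mantissa, exponent = match.groups()
        return float(mantissa) * 10 ** int(exponent)

    # Alphabetic: single number words from the lexicon.
    if text in NUMBER_WORDS:
        return float(NUMBER_WORDS[text])

    # Numeric: plain integers or decimals.
    return float(text)
```

The most specific pattern is tried first, so a power-of-10 expression is never mistaken for a plain number.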

The identified measurements are normalised toward the International System of Units (SI) using the Java library Units of Measurement.
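
Normalisation toward SI amounts to mapping each unit to its SI counterpart and converting the value accordingly. The sketch below shows the principle with a hand-written conversion table. It does not use or reproduce the Units of Measurement library, and the `TO_SI` table and `normalise` function are hypothetical names introduced for this example.

```python
from typing import Tuple

# symbol -> (SI unit, multiplicative factor, additive offset)
# Conversion: si_value = value * factor + offset
TO_SI = {
    "g": ("kg", 0.001, 0.0),
    "km": ("m", 1000.0, 0.0),
    "min": ("s", 60.0, 0.0),
    "°C": ("K", 1.0, 273.15),
}


def normalise(value: float, unit: str) -> Tuple[float, str]:
    """Convert a value to its SI unit using the table above.

    Raises KeyError when the unit is not in the table.
    """
    si_unit, factor, offset = TO_SI[unit]
    return value * factor + offset, si_unit
```

With this table, `normalise(5, "km")` returns the value in metres and `normalise(10, "g")` returns it in kilograms. The additive offset is only needed for units such as degrees Celsius, whose zero point differs from that of the SI unit.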

Grobid-quantities also contains a module implementing the identification of the “quantified” object/substance related to the measure. This module is currently experimental.

The following screenshot illustrates an example of a measurement that is extracted, parsed and normalised. The quantified substance, streptomycin, is additionally recognised:

Grobid-quantities extraction from text

Contacts

Contact: Patrice Lopez (<patrice.lopez@science-miner.com>), Luca Foppiano (<luca@foppiano.org>)

License

GROBID and grobid-quantities are distributed under the Apache 2.0 license. The documentation is distributed under the CC-0 license. The annotated data are licensed under CC BY 4.0.

If you contribute to grobid-quantities, you agree to share your contribution under these licenses.

The References page contains citations, acknowledgements and reference resources related to the project.