update doc for release
kermitt2 committed Apr 24, 2020
1 parent d8eefb1 commit 544f804
Showing 8 changed files with 63 additions and 56 deletions.
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -4,7 +4,7 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [Unreleased]
## [0.6.0] – 2020-04-24

### Added

@@ -19,7 +19,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

+ Improve CORS configuration #527 (thank you @lfoppiano)
+ Documentation improvements
+ Update of segmentation and fulltext model
+ Update of segmentation and fulltext model and training data
+ Better handling of affiliation block fragments
+ Improved DOI string recognition
+ More robust n-fold cross validation (case of shared grobid-home)
42 changes: 24 additions & 18 deletions doc/Benchmarking.md
@@ -289,36 +289,40 @@ However note that for some simpler NER-style tasks or especially for text classi…
**Summary**

Architectures:

- [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-587280497): using normal dropout after the BidLSTM of the feature channel

- [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-588570868): using normal dropouts between embeddings and BidLSTM of the feature channel

- [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-589442381): using recurrent dropouts on the BidLSTM of the feature channel
- Ignored features: using the standard BidLSTM-CRF without the use of any layout feature information

- Ignored features: using the standard BidLSTM-CRF without the use of any layout feature information

`Trainable=true` indicates that the feature embeddings are trainable.

All metrics have been calculated by running n-fold cross-validation with n = 10.
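
To make the protocol concrete, here is a minimal sketch of the 10-fold procedure in Python; `train` and `evaluate_f1` are hypothetical placeholders for the actual training and scoring routines (they are not DeLFT or Grobid APIs), and only the fold arithmetic is meant to be illustrative:

```python
# Minimal sketch of 10-fold cross-validation as used for the table below.
# `train` and `evaluate_f1` are hypothetical stand-ins, not real APIs.
import random

def train(samples):                # placeholder: fit a model on `samples`
    return object()

def evaluate_f1(model, samples):   # placeholder: return an f-score in [0, 1]
    return 0.0

def cross_validate(samples, n_folds=10, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    folds = [samples[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        held_out = folds[i]
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(evaluate_f1(train(training), held_out))
    return sum(scores) / n_folds   # the reported figure: mean f-score over folds
```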

|Model | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-589447087) | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-593787846) (Trainable = true) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-589439496) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-593788260) (Trainable = true) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-589523067) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-594249488)(Trainable = true) | [Ignore features](https://github.com/kermitt2/delft/pull/82#issuecomment-586652333) | CRF Wapiti
|Model | CRF Wapiti | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-589447087) | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-593787846) (Trainable = true) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-589439496) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-593788260) (Trainable = true) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-589523067) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-594249488) (Trainable = true) | [Ignore features](https://github.com/kermitt2/delft/pull/82#issuecomment-586652333) |
|-- | -- | -- | -- | -- | -- | -- | -- | -- |
|Affiliation-address | 0.8709 | 0.8714 | 0.8721 | 0.872 | **0.873** | 0.8677 | 0.8668 | 0.8587 |
|Citation | 0.9516 | **0.9522** | 0.9501 | 0.9503 | 0.9518 | 0.951 | 0.95 | 0.9448 |
|Date | 0.9628 | 0.96 | 0.9606 | 0.9616 | 0.9631 | 0.961 | 0.9663 | **0.9833** |
|Figure | 0.5594 | 0.5397 | 0.5907 | 0.4714 | 0.5515 | 0.6219 | 0.2949 | **0.9839** |
|Header | 0.7107 | 0.7102 | 0.7139 | 0.7156 | 0.7215 | 0.713 | 0.6764 | **0.7425** |
|Software | 0.8112 | **0.8128** | 0.807 | 0.8039 | 0.8038 | 0.8084 | 0.7915 | 0.7764 |
|Superconductors [85 papers] | 0.7774 | 0.772 | 0.7767 | **0.7814** | 0.7766 | 0.7791 | 0.7663 | 0.6528 |
|Quantities | 0.8809 | 0.8752 | **0.883** | 0.8701 | 0.8724 | 0.8727 | 0.8733 | 0.8014 |
|Unit | 0.9838 | 0.9834 | 0.9829 | 0.9826 | 0.9816 | 0.9846 | 0.9801 | **0.9886** |
|Values | 0.979 | **0.9874** | 0.9854 | 0.9852 | 0.9851 | 0.9853 | 0.9827 | 0.8457 |
|Affiliation-address | 0.8587 | 0.8709 | 0.8714 | 0.8721 | 0.872 | **0.873** | 0.8677 | 0.8668 |
|Citation | 0.9448 | 0.9516 | **0.9522** | 0.9501 | 0.9503 | 0.9518 | 0.951 | 0.95 |
|Date | **0.9833** | 0.9628 | 0.96 | 0.9606 | 0.9616 | 0.9631 | 0.961 | 0.9663 |
|Figure | **0.9839** | 0.5594 | 0.5397 | 0.5907 | 0.4714 | 0.5515 | 0.6219 | 0.2949 |
|Header | **0.7425** | 0.7107 | 0.7102 | 0.7139 | 0.7156 | 0.7215 | 0.713 | 0.6764 |
|Software | 0.7764 | 0.8112 | **0.8128** | 0.807 | 0.8039 | 0.8038 | 0.8084 | 0.7915 |
|Superconductors [85 papers] | 0.6528 | 0.7774 | 0.772 | 0.7767 | **0.7814** | 0.7766 | 0.7791 | 0.7663 |
|Quantities | 0.8014 | 0.8809 | 0.8752 | **0.883** | 0.8701 | 0.8724 | 0.8727 | 0.8733 |
|Unit | **0.9886** | 0.9838 | 0.9834 | 0.9829 | 0.9826 | 0.9816 | 0.9846 | 0.9801 |
|Values | 0.8457 | 0.979 | **0.9874** | 0.9854 | 0.9852 | 0.9851 | 0.9853 | 0.9827 |
| | | | | | | | | |
|Average | 0.84877 | 0.84643 | 0.85224 | 0.83941 | 0.84804 | 0.85447 | 0.81483 | **0.85781** |
|**Average** | **0.85781** | 0.84877 | 0.84643 | 0.85224 | 0.83941 | 0.84804 | 0.85447 | 0.81483 |


### Runtime

To appreciate the runtime impact of Deep Learning models over CRF Wapiti, we report here some relevant comparisons. The following runtimes were obtained based on a Ubuntu 16.04 server Intel i7-4790 (4 CPU), 4.00 GHz with 16 GB memory. The runtimes for the Deep Learning architectures are based on the same machine with a nvidia GPU GeForce 1080Ti (11 GB). We run here a [software mention recognizer](https://github.com/ourresearch/software-mentions) model with Grobid as reference model, but any Grobid model would exhibit similar relative difference.
To appreciate the runtime impact of Deep Learning models over CRF Wapiti, we report here some relevant comparisons. The following runtimes were obtained on an Ubuntu 16.04 server with an Intel i7-4790 (4 CPUs) at 4.00 GHz and 16 GB of memory. The runtimes for the Deep Learning architectures were obtained on the same machine with an Nvidia GeForce 1080Ti GPU (11 GB). We run here a [software mention recognizer](https://github.com/ourresearch/software-mentions) model with Grobid as the reference model, but any Grobid model would exhibit similar relative differences.

|CRF ||
|CRF Wapiti ||
|--- | --- |
|threads | tokens/s |
|1 | 23,685 |
@@ -349,8 +353,10 @@ To appreciate the runtime impact of Deep Learning models over CRF Wapiti, we rep…
| 5 | 4,729|
| 6 | 5,060|

Batch size is a parameter constrained by the capacity of the available GPU. An improvement of the performance of the deep learning architecture requires increasing the number of GPU and the amount of memory of these GPU, similarly as improving CRF capacity requires increasing the number of available threads and CPU. Running a Deep Learning architectures on CPU is around 50 times slower than on GPU (although it depends on the amount of RAM available with the CPU, which can allow to increase the batch size significantly).
Additional remarks:

- Batch size is a parameter constrained by the capacity of the available GPU. Improving the performance of the Deep Learning architectures requires increasing the number of GPUs and the amount of memory on these GPUs, just as improving CRF Wapiti capacity requires increasing the number of available threads and CPUs. Running a Deep Learning architecture on CPU is around 50 times slower than on GPU (although this depends on the amount of RAM available to the CPU, which can allow the batch size to be increased significantly); see the sketch after these remarks.

Note that the BERT-CRF architecture in DeLFT is a strongly optimized version of the official version of BERT (which does not support sequence labelling as such), with a final CRF activation layer instead of a softmax (a CRF activation layer improves f-score in average by +0.30 for sequence labelling task). Above we run SciBERT, a BERT base model trained on scientific literature. Also note that given their limit of the size of the input sequence (512 tokens), BERT models are challenging to apply to several Grobid tasks which are working at document or paragraph levels.
- The BERT-CRF architecture in DeLFT is a modified and heavily optimized version of the Google Research [reference distribution of BERT](https://github.com/google-research/bert) (which does not support sequence labelling as such), with a final CRF activation layer instead of a softmax (a CRF activation layer improves the f-score on average by +0.30 for sequence labelling tasks). Above we run SciBERT, a BERT base model trained on scientific literature. Also note that, given the limit on the size of their input sequence (512 tokens), BERT models are challenging to apply to several Grobid tasks, which work at document or paragraph level.

Finally an important aspect is that we present here the runtime for a single model. When using a cascade of models as in the Grobid core PDF structuring task, involving 9 different sequence labelling models, the possibility to use efficiently the batch size with the DL architecture is very challenging. In practice, as the batches will be often filled by 1 or a few input sequences, the runtime for a single document will be significantly longer (up to 100 times slower), and adapting the processing of multiple PDF in parallel with DL batches will require an important development effort.
- Finally, we present here the runtime for a single model. When using a cascade of models, as in the Grobid core PDF structuring task, which involves 9 different sequence labelling models, it is very challenging to use the batch size efficiently with the DL architectures. In practice, as batches will often be filled by only one or a few input sequences, the runtime for a single document will be significantly longer (up to 100 times slower), and adapting the parallel processing of multiple PDFs to DL batches will require a significant development effort.
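
To illustrate the batch-filling remark above, here is a back-of-the-envelope sketch in plain Python; the capacity and latency figures are invented for the example, not measurements:

```python
# Illustrative arithmetic only: why document-by-document processing
# under-fills GPU batches in a cascade of models. Figures are made up.
BATCH_CAPACITY = 32        # sequences a full GPU batch can hold (assumed)
BATCH_LATENCY_S = 0.05     # time to run one batch, full or nearly empty (assumed)

def throughput(sequences_in_batch):
    """Sequences processed per second at a given batch fill level."""
    return sequences_in_batch / BATCH_LATENCY_S

full = throughput(BATCH_CAPACITY)  # batches filled across many documents
single = throughput(2)             # a lone document yields only a few sequences
print(f"full batches: {full:.0f} seq/s, single document: {single:.0f} seq/s")
print(f"slowdown: x{full / single:.0f}")  # the GPU runs mostly empty
```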
2 changes: 1 addition & 1 deletion doc/Consolidation.md
@@ -4,7 +4,7 @@ In GROBID, we call __consolidation__ the usage of an external bibliographical se…

Consolidation has two main benefits:

* The consolidation service improves very significantly the retrieval of header information (+.12 to .13 in f-score, e.g. from 74.59 f-score in average for all fields with Ratcliff/Obershelp similarity at 0.95, to 88.89 f-score, using biblio-glutton and GROBID version 0.5.6-SNAPSHOT for the PMC 1942 dataset, see the [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)).
* The consolidation service very significantly improves the retrieval of header information (+.12 to +.13 in f-score, e.g. from an average f-score of 74.59 for all fields with Ratcliff/Obershelp similarity at 0.95 to an f-score of 88.89, using biblio-glutton and GROBID version `0.5.6` on the PMC 1942 dataset; see the [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)).

* The consolidation service matches the extracted bibliographical references against known publications and completes the parsed references with additional metadata, in particular the DOI, making it possible to build a citation graph and to link the extracted references to external services.
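
As a concrete illustration, consolidation can be switched on per request when GROBID runs as a service. The sketch below uses Python `requests` and assumes a local service on the default port (8070) with the `consolidateHeader`/`consolidateCitations` form parameters described in the service documentation; adjust host, port and file name to your deployment:

```python
# Minimal sketch: request fulltext extraction with consolidation enabled.
# Assumes a GROBID service at localhost:8070 (the default); parameter
# names follow the GROBID service documentation (0 = off, 1 = on).
import requests

with open("article.pdf", "rb") as pdf:  # any local PDF to process
    response = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": pdf},
        data={"consolidateHeader": "1", "consolidateCitations": "1"},
    )
response.raise_for_status()
print(response.text)  # TEI XML enriched with consolidated metadata (e.g. DOIs)
```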

30 changes: 15 additions & 15 deletions doc/Grobid-batch.md
@@ -18,7 +18,7 @@ The following command displays some help for the batch commands:

Be sure to replace `<current version>` with the current version of GROBID that you have installed and built. For example:
```bash
> java -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -h
> java -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -h
```

The available batch commands are listed below. For these commands, at least `-Xmx1G` is used to set the JVM memory to avoid *OutOfMemoryException*, given the current size of the Grobid models and the craziness of some PDFs. For complete fulltext processing, which involves all the GROBID models, `-Xmx4G` is recommended (although allocating less memory is usually fine).
@@ -40,7 +40,7 @@ The needed parameters for that command are:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -62,7 +62,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -76,7 +76,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format"
```

### processAuthorsHeader
@@ -88,7 +88,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors"
```

### processAuthorsCitation
@@ -100,7 +100,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors"
```

### processAffiliation
@@ -112,7 +112,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation"
```

### processRawReference
@@ -124,7 +124,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string"
```

### processReferences
@@ -140,7 +140,7 @@ Example:

Example:
```bash
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -156,7 +156,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36
```

WARNING: extension of the ST.36 files to be processed must be .xml
@@ -172,7 +172,7 @@ WARNING: extension of the ST.36 files to be processed must be .xml

Example:
```
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT
```

WARNING: extension of the text files to be processed must be .txt, and expected encoding is UTF-8
@@ -188,7 +188,7 @@ WARNING: extension of the text files to be processed must be .txt, and expected…

Example:
```
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF
```

WARNING: extension of the PDF files to be processed must be .pdf
@@ -204,7 +204,7 @@ WARNING: extension of the PDF files to be processed must be .pdf

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -220,7 +220,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -238,7 +238,7 @@ The needed parameters for that command are:

Example:
```
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation
```

WARNING: extension of the PDF files to be processed must be .pdf
12 changes: 6 additions & 6 deletions doc/Grobid-java-library.md
@@ -33,20 +33,20 @@ Here is an example of the grobid-core dependency:
<dependency>
    <groupId>org.grobid</groupId>
    <artifactId>grobid-core</artifactId>
    <version>0.5.6</version>
    <version>0.6.0</version>
</dependency>
```

If you want to work on a SNAPSHOT development version, you need to include in your pom file the path to the Grobid jar file,
for instance as follow (if necessary replace `0.5.6` by the valid `<current version>`):
for instance as follows (if necessary, replace `0.6.0` with the valid `<current version>`):

```xml
<dependency>
    <groupId>org.grobid</groupId>
    <artifactId>grobid-core</artifactId>
    <version>0.5.6</version>
    <version>0.6.0</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/lib/grobid-core-0.5.6.jar</systemPath>
    <systemPath>${project.basedir}/lib/grobid-core-0.6.0.jar</systemPath>
</dependency>
```

@@ -64,8 +64,8 @@ Add the following snippet in your build.gradle file:

and add the Grobid dependency as well:
```
compile 'org.grobid:grobid-core:0.5.6'
compile 'org.grobid:grobid-trainer:0.5.6'
compile 'org.grobid:grobid-core:0.6.0'
compile 'org.grobid:grobid-trainer:0.6.0'
```

