update doc for release
kermitt2 committed Apr 24, 2020
1 parent d8eefb1 commit 544f804
Showing 8 changed files with 63 additions and 56 deletions.
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -4,7 +4,7 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [Unreleased]
## [0.6.0] – 2020-04-24

### Added

@@ -19,7 +19,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

+ Improve CORS configuration #527 (thank you @lfoppiano)
+ Documentation improvements
+ Update of segmentation and fulltext model
+ Update of segmentation and fulltext model and training data
+ Better handling of affiliation block fragments
+ Improved DOI string recognition
+ More robust n-fold cross validation (case of shared grobid-home)
42 changes: 24 additions & 18 deletions doc/Benchmarking.md
@@ -289,36 +289,40 @@ However note that for some simpler NER-style tasks or especially for text classi…
**Summary**

Architectures:

- [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-587280497): using normal dropout after the BidLSTM of the feature channel

- [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-588570868): using normal dropouts between embeddings and BidLSTM of the feature channel

- [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-589442381): using recurrent dropouts on the BidLSTM of the feature channel
- Ignored features: using the standard BidLSTM-CRF without the use of any layout feature information

- Ignored features: using the standard BidLSTM-CRF without the use of any layout feature information

`Trainable=true` indicates that the feature embeddings are trainable.

All metrics have been calculated by running n-fold cross-validation with n = 10.
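
To make the protocol concrete, here is a minimal sketch of the 10-fold procedure in Python; `train` and `evaluate_f1` are hypothetical placeholders for the actual training and scoring routines (they are not DeLFT or Grobid APIs), and only the fold arithmetic is meant to be illustrative:

```python
# Minimal sketch of 10-fold cross-validation as used for the table below.
# `train` and `evaluate_f1` are hypothetical stand-ins, not real APIs.
import random

def train(samples):                # placeholder: fit a model on `samples`
    return object()

def evaluate_f1(model, samples):   # placeholder: return an f-score in [0, 1]
    return 0.0

def cross_validate(samples, n_folds=10, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    folds = [samples[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        held_out = folds[i]
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(evaluate_f1(train(training), held_out))
    return sum(scores) / n_folds   # the reported figure: mean f-score over folds
```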

|Model | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-589447087) | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-593787846) (Trainable = true) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-589439496) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-593788260) (Trainable = true) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-589523067) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-594249488)(Trainable = true) | [Ignore features](https://github.com/kermitt2/delft/pull/82#issuecomment-586652333) | CRF Wapiti
|Model | CRF Wapiti | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-589447087) | [Architecture 1](https://github.com/kermitt2/delft/pull/82#issuecomment-593787846) (Trainable = true) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-589439496) | [Architecture 2](https://github.com/kermitt2/delft/pull/82#issuecomment-593788260) (Trainable = true) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-589523067) | [Architecture 3](https://github.com/kermitt2/delft/pull/82#issuecomment-594249488) (Trainable = true) | [Ignore features](https://github.com/kermitt2/delft/pull/82#issuecomment-586652333) |
|-- | -- | -- | -- | -- | -- | -- | -- | -- |
|Affiliation-address | 0.8709 | 0.8714 | 0.8721 | 0.872 | **0.873** | 0.8677 | 0.8668 | 0.8587 |
|Citation | 0.9516 | **0.9522** | 0.9501 | 0.9503 | 0.9518 | 0.951 | 0.95 | 0.9448 |
|Date | 0.9628 | 0.96 | 0.9606 | 0.9616 | 0.9631 | 0.961 | 0.9663 | **0.9833** |
|Figure | 0.5594 | 0.5397 | 0.5907 | 0.4714 | 0.5515 | 0.6219 | 0.2949 | **0.9839** |
|Header | 0.7107 | 0.7102 | 0.7139 | 0.7156 | 0.7215 | 0.713 | 0.6764 | **0.7425** |
|Software | 0.8112 | **0.8128** | 0.807 | 0.8039 | 0.8038 | 0.8084 | 0.7915 | 0.7764 |
|Superconductors [85 papers] | 0.7774 | 0.772 | 0.7767 | **0.7814** | 0.7766 | 0.7791 | 0.7663 | 0.6528 |
|Quantities | 0.8809 | 0.8752 | **0.883** | 0.8701 | 0.8724 | 0.8727 | 0.8733 | 0.8014 |
|Unit | 0.9838 | 0.9834 | 0.9829 | 0.9826 | 0.9816 | 0.9846 | 0.9801 | **0.9886** |
|Values | 0.979 | **0.9874** | 0.9854 | 0.9852 | 0.9851 | 0.9853 | 0.9827 | 0.8457 |
|Affiliation-address | 0.8587 | 0.8709 | 0.8714 | 0.8721 | 0.872 | **0.873** | 0.8677 | 0.8668 |
|Citation | 0.9448 | 0.9516 | **0.9522** | 0.9501 | 0.9503 | 0.9518 | 0.951 | 0.95 |
|Date | **0.9833** | 0.9628 | 0.96 | 0.9606 | 0.9616 | 0.9631 | 0.961 | 0.9663 |
|Figure | **0.9839** | 0.5594 | 0.5397 | 0.5907 | 0.4714 | 0.5515 | 0.6219 | 0.2949 |
|Header | **0.7425** | 0.7107 | 0.7102 | 0.7139 | 0.7156 | 0.7215 | 0.713 | 0.6764 |
|Software | 0.7764 | 0.8112 | **0.8128** | 0.807 | 0.8039 | 0.8038 | 0.8084 | 0.7915 |
|Superconductors [85 papers] | 0.6528 | 0.7774 | 0.772 | 0.7767 | **0.7814** | 0.7766 | 0.7791 | 0.7663 |
|Quantities | 0.8014 | 0.8809 | 0.8752 | **0.883** | 0.8701 | 0.8724 | 0.8727 | 0.8733 |
|Unit | **0.9886** | 0.9838 | 0.9834 | 0.9829 | 0.9826 | 0.9816 | 0.9846 | 0.9801 |
|Values | 0.8457 | 0.979 | **0.9874** | 0.9854 | 0.9852 | 0.9851 | 0.9853 | 0.9827 |
| | | | | | | | | |
|Average | 0.84877 | 0.84643 | 0.85224 | 0.83941 | 0.84804 | 0.85447 | 0.81483 | **0.85781** |
|**Average** | **0.85781** | 0.84877 | 0.84643 | 0.85224 | 0.83941 | 0.84804 | 0.85447 | 0.81483 |


### Runtime

To appreciate the runtime impact of Deep Learning models over CRF Wapiti, we report here some relevant comparisons. The following runtimes were obtained based on a Ubuntu 16.04 server Intel i7-4790 (4 CPU), 4.00 GHz with 16 GB memory. The runtimes for the Deep Learning architectures are based on the same machine with a nvidia GPU GeForce 1080Ti (11 GB). We run here a [software mention recognizer](https://github.com/ourresearch/software-mentions) model with Grobid as reference model, but any Grobid model would exhibit similar relative difference.
To appreciate the runtime impact of Deep Learning models over CRF Wapiti, we report here some relevant comparisons. The following runtimes were obtained on an Ubuntu 16.04 server with an Intel i7-4790 (4 CPUs) at 4.00 GHz and 16 GB of memory. The runtimes for the Deep Learning architectures were obtained on the same machine with an Nvidia GeForce 1080Ti GPU (11 GB). We run here a [software mention recognizer](https://github.com/ourresearch/software-mentions) model with Grobid as the reference model, but any Grobid model would exhibit similar relative differences.

|CRF ||
|CRF Wapiti ||
|--- | --- |
|threads | tokens/s |
|1 | 23,685 |
@@ -349,8 +353,10 @@ To appreciate the runtime impact of Deep Learning models over CRF Wapiti, we rep…
| 5 | 4,729|
| 6 | 5,060|

Batch size is a parameter constrained by the capacity of the available GPU. An improvement of the performance of the deep learning architecture requires increasing the number of GPU and the amount of memory of these GPU, similarly as improving CRF capacity requires increasing the number of available threads and CPU. Running a Deep Learning architectures on CPU is around 50 times slower than on GPU (although it depends on the amount of RAM available with the CPU, which can allow to increase the batch size significantly).
Additional remarks:

- Batch size is a parameter constrained by the capacity of the available GPU. Improving the performance of the Deep Learning architectures requires increasing the number of GPUs and the amount of memory on these GPUs, just as improving CRF Wapiti capacity requires increasing the number of available threads and CPUs. Running a Deep Learning architecture on CPU is around 50 times slower than on GPU (although this depends on the amount of RAM available to the CPU, which can allow the batch size to be increased significantly); see the sketch after these remarks.

Note that the BERT-CRF architecture in DeLFT is a strongly optimized version of the official version of BERT (which does not support sequence labelling as such), with a final CRF activation layer instead of a softmax (a CRF activation layer improves f-score in average by +0.30 for sequence labelling task). Above we run SciBERT, a BERT base model trained on scientific literature. Also note that given their limit of the size of the input sequence (512 tokens), BERT models are challenging to apply to several Grobid tasks which are working at document or paragraph levels.
- The BERT-CRF architecture in DeLFT is a modified and heavily optimized version of the Google Research [reference distribution of BERT](https://github.com/google-research/bert) (which does not support sequence labelling as such), with a final CRF activation layer instead of a softmax (a CRF activation layer improves the f-score on average by +0.30 for sequence labelling tasks). Above we run SciBERT, a BERT base model trained on scientific literature. Also note that, given the limit on the size of their input sequence (512 tokens), BERT models are challenging to apply to several Grobid tasks, which work at document or paragraph level.

Finally an important aspect is that we present here the runtime for a single model. When using a cascade of models as in the Grobid core PDF structuring task, involving 9 different sequence labelling models, the possibility to use efficiently the batch size with the DL architecture is very challenging. In practice, as the batches will be often filled by 1 or a few input sequences, the runtime for a single document will be significantly longer (up to 100 times slower), and adapting the processing of multiple PDF in parallel with DL batches will require an important development effort.
- Finally, we present here the runtime for a single model. When using a cascade of models, as in the Grobid core PDF structuring task, which involves 9 different sequence labelling models, it is very challenging to use the batch size efficiently with the DL architectures. In practice, as batches will often be filled by only one or a few input sequences, the runtime for a single document will be significantly longer (up to 100 times slower), and adapting the parallel processing of multiple PDFs to DL batches will require a significant development effort.
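
To illustrate the batch-filling remark above, here is a back-of-the-envelope sketch in plain Python; the capacity and latency figures are invented for the example, not measurements:

```python
# Illustrative arithmetic only: why document-by-document processing
# under-fills GPU batches in a cascade of models. Figures are made up.
BATCH_CAPACITY = 32        # sequences a full GPU batch can hold (assumed)
BATCH_LATENCY_S = 0.05     # time to run one batch, full or nearly empty (assumed)

def throughput(sequences_in_batch):
    """Sequences processed per second at a given batch fill level."""
    return sequences_in_batch / BATCH_LATENCY_S

full = throughput(BATCH_CAPACITY)  # batches filled across many documents
single = throughput(2)             # a lone document yields only a few sequences
print(f"full batches: {full:.0f} seq/s, single document: {single:.0f} seq/s")
print(f"slowdown: x{full / single:.0f}")  # the GPU runs mostly empty
```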
2 changes: 1 addition & 1 deletion doc/Consolidation.md
@@ -4,7 +4,7 @@ In GROBID, we call __consolidation__ the usage of an external bibliographical se…

Consolidation has two main benefits:

* The consolidation service improves very significantly the retrieval of header information (+.12 to .13 in f-score, e.g. from 74.59 f-score in average for all fields with Ratcliff/Obershelp similarity at 0.95, to 88.89 f-score, using biblio-glutton and GROBID version 0.5.6-SNAPSHOT for the PMC 1942 dataset, see the [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)).
* The consolidation service very significantly improves the retrieval of header information (+.12 to +.13 in f-score, e.g. from an average f-score of 74.59 for all fields with Ratcliff/Obershelp similarity at 0.95 to an f-score of 88.89, using biblio-glutton and GROBID version `0.5.6` on the PMC 1942 dataset; see the [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)).

* The consolidation service matches the extracted bibliographical references against known publications and completes the parsed references with additional metadata, in particular the DOI, making it possible to build a citation graph and to link the extracted references to external services.
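
As a concrete illustration, consolidation can be switched on per request when GROBID runs as a service. The sketch below uses Python `requests` and assumes a local service on the default port (8070) with the `consolidateHeader`/`consolidateCitations` form parameters described in the service documentation; adjust host, port and file name to your deployment:

```python
# Minimal sketch: request fulltext extraction with consolidation enabled.
# Assumes a GROBID service at localhost:8070 (the default); parameter
# names follow the GROBID service documentation (0 = off, 1 = on).
import requests

with open("article.pdf", "rb") as pdf:  # any local PDF to process
    response = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": pdf},
        data={"consolidateHeader": "1", "consolidateCitations": "1"},
    )
response.raise_for_status()
print(response.text)  # TEI XML enriched with consolidated metadata (e.g. DOIs)
```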

30 changes: 15 additions & 15 deletions doc/Grobid-batch.md
@@ -18,7 +18,7 @@ The following command displays some help for the batch commands:

Be sure to replace `<current version>` with the current version of GROBID that you have installed and built. For example:
```bash
> java -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -h
> java -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -h
```

The available batch commands are listed below. For these commands, at least `-Xmx1G` is used to set the JVM memory to avoid *OutOfMemoryException*, given the current size of the Grobid models and the craziness of some PDFs. For complete fulltext processing, which involves all the GROBID models, `-Xmx4G` is recommended (although allocating less memory is usually fine).
@@ -40,7 +40,7 @@ The needed parameters for that command are:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -62,7 +62,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -76,7 +76,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format"
```

### processAuthorsHeader
@@ -88,7 +88,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors"
```

### processAuthorsCitation
@@ -100,7 +100,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors"
```

### processAffiliation
@@ -112,7 +112,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation"
```

### processRawReference
@@ -124,7 +124,7 @@ Example:

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string"
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string"
```

### processReferences
@@ -140,7 +140,7 @@ Example:

Example:
```bash
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -156,7 +156,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36
```

WARNING: extension of the ST.36 files to be processed must be .xml
@@ -172,7 +172,7 @@ WARNING: extension of the ST.36 files to be processed must be .xml

Example:
```
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT
```

WARNING: extension of the text files to be processed must be .txt, and expected encoding is UTF-8
@@ -188,7 +188,7 @@ WARNING: extension of the text files to be processed must be .txt, and expected…

Example:
```
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF
> java -Xmx1G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF
```

WARNING: extension of the PDF files to be processed must be .pdf
@@ -204,7 +204,7 @@ WARNING: extension of the PDF files to be processed must be .pdf

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -220,7 +220,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank
```

WARNING: the expected extension of the PDF files to be processed is .pdf
@@ -238,7 +238,7 @@ The needed parameters for that command are:

Example:
```
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.5.6-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation
> java -Xmx2G -jar grobid-core/build/libs/grobid-core-0.6.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation
```

WARNING: extension of the PDF files to be processed must be .pdf
12 changes: 6 additions & 6 deletions doc/Grobid-java-library.md
@@ -33,20 +33,20 @@ Here is an example of the grobid-core dependency:
<dependency>
    <groupId>org.grobid</groupId>
    <artifactId>grobid-core</artifactId>
    <version>0.5.6</version>
    <version>0.6.0</version>
</dependency>
```

If you want to work on a SNAPSHOT development version, you need to include in your pom file the path to the Grobid jar file,
for instance as follow (if necessary replace `0.5.6` by the valid `<current version>`):
for instance as follows (if necessary, replace `0.6.0` with the valid `<current version>`):

```xml
<dependency>
    <groupId>org.grobid</groupId>
    <artifactId>grobid-core</artifactId>
    <version>0.5.6</version>
    <version>0.6.0</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/lib/grobid-core-0.5.6.jar</systemPath>
    <systemPath>${project.basedir}/lib/grobid-core-0.6.0.jar</systemPath>
</dependency>
```

@@ -64,8 +64,8 @@ Add the following snippet in your build.gradle file:

and add the Grobid dependency as well:
```
compile 'org.grobid:grobid-core:0.5.6'
compile 'org.grobid:grobid-trainer:0.5.6'
compile 'org.grobid:grobid-core:0.6.0'
compile 'org.grobid:grobid-trainer:0.6.0'
```

