
Commit

last doc updates before release
Former-commit-id: 5411f72
kermitt2 committed Aug 5, 2017
1 parent 276c7fa commit 5ff63c6
Showing 6 changed files with 176 additions and 18 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -13,6 +13,8 @@ Thumbs.db
tei-alt
raw-alt

grobid-core/dependency-reduced-pom.xml

grobid-core/src/test/resources/org/grobid/core/annotations/resTeiStAXParser/out.tei.xml

grobid-home/models/affiliation-address/model.crf.old
15 changes: 12 additions & 3 deletions Readme.md
@@ -36,8 +36,9 @@ GROBID can be considered production-ready. Deployments in production include
The key aspects of GROBID are the following:

+ Written in Java, with JNI call to native CRF libraries.
+ High performance - on a 2011 low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds. [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to process 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second.
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Speed - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds.
+ Speed and Scalability: [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to extract the bibliographical references of 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second. The complete processing of 395,000 PDF (IOP) with full text structuring was performed in 12h46m with 16 threads, i.e. 0.11s per PDF (~1.72s per PDF with a single thread).
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Robust and fast PDF processing based on Xpdf and dedicated post-processing.
+ Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, currently the state of the art in bibliographical information extraction and labeling. The specialized CRF models are cascaded to build the complete document structure.
+ Full encoding in [__TEI__](http://www.tei-c.org/Guidelines/P5), both for the training corpus and the parsed results.
@@ -61,7 +62,15 @@ _Warning_: Some quota and query limitations apply to the demo server! If you are

## Latest version

The latest stable release of GROBID is version ```0.4.1```. Compared to the previous version ```0.4.0```, this version brings:
The latest stable release of GROBID is version ```0.4.2```. Compared to the previous version ```0.4.1```, this version brings:

+ F-score improvements on the PubMed Central sample: fulltext +10-14%, header +0.5%, citations +0.5%
+ More robust PDF parsing
+ Identification of equations (with PDF coordinates)
+ End-to-end evaluation with Pub2TEI conversions
+ Many fixes and refactorings

New in previous release ```0.4.1```:

+ Support for Windows thanks to the contributions of Christopher Boumenot!
+ Support for Docker.
8 changes: 4 additions & 4 deletions doc/Introduction.md
@@ -32,8 +32,9 @@ GROBID can be considered production-ready. Deployments in production include
The key aspects of GROBID are the following:

+ Written in Java, with JNI call to native CRF libraries.
+ High performance - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds. [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to process 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second.
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Speed - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds.
+ Speed and Scalability: [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to extract the bibliographical references of 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second. The complete processing of 395,000 PDF (IOP) with full text structuring was performed in 12h46m with 16 threads, i.e. 0.11s per PDF (~1.72s per PDF with a single thread).
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Robust and fast PDF processing based on Xpdf and dedicated post-processing.
+ Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, currently the state of the art in bibliographical information extraction and labeling. The specialized CRF models are cascaded to build the complete document structure.
+ Full encoding in [__TEI__](http://www.tei-c.org/Guidelines/P5), both for the training corpus and the parsed results.
@@ -44,8 +45,7 @@ The key aspects of GROBID are the following:

The GROBID extraction and parsing algorithms use the [Wapiti CRF library](http://wapiti.limsi.fr). The [CRF++ library](http://crfpp.googlecode.com/svn/trunk/doc/index.html) is no longer supported as of GROBID version 0.4. The C++ libraries are transparently integrated via JNI, with dynamic calls based on the current OS.

GROBID should run properly "out of the box" on MacOS X and Linux (32 and 64 bits). GROBID currently does not run on Windows environments because the required and up-to-date CRF native binaries are not yet compiled for this platform (contributors willing to work on Windows support are very welcome!).

GROBID should run properly "out of the box" on MacOS X, Linux (32 and 64 bits) and Windows.

## Credits

@@ -38,23 +38,21 @@ public class BasicStructureBuilder {
static public Pattern introductionStrict =
Pattern.compile("^\\b*(1\\.\\sPROBLEMS?|1\\.(\\n)?\\sIntroduction?|1\\.(\\n)?\\sContent?|1\\.\\sINTRODUCTION|I\\.(\\s)+Introduction|1\\.\\sProblems?|I\\.\\sEinleitung?|1\\.\\sEinleitung?|1\\sEinleitung?|1\\sIntroduction?)",
Pattern.CASE_INSENSITIVE);

static public Pattern abstract_ = Pattern.compile("^\\b*\\.?(abstract?|résumé?|summary?|zusammenfassung?)",
Pattern.CASE_INSENSITIVE);
static public Pattern keywords = Pattern.compile("^\\b*\\.?(keyword?|key\\s*word?|mots\\s*clefs?)",
Pattern.CASE_INSENSITIVE);

static public Pattern references =
/*static public Pattern keywords = Pattern.compile("^\\b*\\.?(keyword?|key\\s*word?|mots\\s*clefs?)",
Pattern.CASE_INSENSITIVE);*/
/*static public Pattern references =
Pattern.compile("^\\b*(References?|REFERENCES?|Bibliography|BIBLIOGRAPHY|" +
"References?\\s+and\\s+Notes?|References?\\s+Cited|REFERENCE?\\s+CITED|REFERENCES?\\s+AND\\s+NOTES?|Références|Literatur|" +
"LITERATURA|Literatur|Referências|BIBLIOGRAFIA|Literaturverzeichnis|Referencias|LITERATURE CITED|References and Notes)", Pattern.CASE_INSENSITIVE);
static public Pattern header = Pattern.compile("^((\\d\\d?)|([A-Z](I|V|X)*))(\\.(\\d)*)*\\s(\\D+)");
"LITERATURA|Literatur|Referências|BIBLIOGRAFIA|Literaturverzeichnis|Referencias|LITERATURE CITED|References and Notes)", Pattern.CASE_INSENSITIVE);*/
/*static public Pattern header = Pattern.compile("^((\\d\\d?)|([A-Z](I|V|X)*))(\\.(\\d)*)*\\s(\\D+)");*/
// static public Pattern header2 = Pattern.compile("^\\d\\s\\D+");
static public Pattern figure = Pattern.compile("(figure\\s|fig\\.|sch?ma)", Pattern.CASE_INSENSITIVE);
/*static public Pattern figure = Pattern.compile("(figure\\s|fig\\.|sch?ma)", Pattern.CASE_INSENSITIVE);
static public Pattern table = Pattern.compile("^(T|t)able\\s|tab|tableau", Pattern.CASE_INSENSITIVE);
static public Pattern equation = Pattern.compile("^(E|e)quation\\s");
private static Pattern acknowledgement = Pattern.compile("(acknowledge?ments?|acknowledge?ment?)",
Pattern.CASE_INSENSITIVE);
Pattern.CASE_INSENSITIVE);*/
static public Pattern headerNumbering1 = Pattern.compile("^(\\d+)\\.?\\s");
static public Pattern headerNumbering2 = Pattern.compile("^((\\d+)\\.)+(\\d+)\\s");
static public Pattern headerNumbering3 = Pattern.compile("^((\\d+)\\.)+\\s");
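The surviving `headerNumbering` patterns drive section-title detection. A small self-contained illustration of what each expression accepts (the sample headings are invented for the demo):

```java
import java.util.regex.Pattern;

public class HeaderNumberingDemo {
    // Same regular expressions as in BasicStructureBuilder.
    static final Pattern headerNumbering1 = Pattern.compile("^(\\d+)\\.?\\s");
    static final Pattern headerNumbering2 = Pattern.compile("^((\\d+)\\.)+(\\d+)\\s");
    static final Pattern headerNumbering3 = Pattern.compile("^((\\d+)\\.)+\\s");

    public static void main(String[] args) {
        // Top-level numbering, with or without a trailing dot:
        System.out.println(headerNumbering1.matcher("1. Introduction").find()); // true
        System.out.println(headerNumbering1.matcher("2 Methods").find());       // true
        // Multi-level numbering without a trailing dot:
        System.out.println(headerNumbering2.matcher("2.1 Results").find());     // true
        // Multi-level numbering with a trailing dot:
        System.out.println(headerNumbering3.matcher("3.1. Discussion").find()); // true
    }
}
```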
@@ -367,15 +367,15 @@ static public Pair<String, LayoutTokenization> getBodyTextFeatured(Document doc,
//nn++;
continue;
}
text = text.replace(" ", "");
text = text.replaceAll("\\s+", "");
if (text.length() == 0) {
n++;
mm++;
nn++;
continue;
}

if (text.equals("\n") || text.equals("\r")) {
if (text.equals("\n") || text.equals("\r") || text.equals("\t")) {
newline = true;
previousNewline = true;
n++;
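The switch from `replace(" ", "")` to `replaceAll("\\s+", "")` matters for the empty-token check that follows it: the old call stripped only the ASCII space character, so a token consisting of tabs or newlines kept a non-zero length, while the regex form reduces any pure-whitespace token to the empty string and lets the `length() == 0` branch skip it. A quick check of the difference:

```java
public class WhitespaceStripDemo {
    public static void main(String[] args) {
        String token = " \t \n ";
        // Old behaviour: only the literal space character is removed.
        String oldResult = token.replace(" ", "");
        // New behaviour: all whitespace (space, tab, newline) is removed.
        String newResult = token.replaceAll("\\s+", "");
        System.out.println(oldResult.length()); // 2 -> token would NOT be skipped
        System.out.println(newResult.length()); // 0 -> token is skipped
    }
}
```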
149 changes: 149 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/utilities/UnicodeUtil.java
@@ -0,0 +1,149 @@
package org.grobid.core.utilities;

/**
* Class for holding static methods for processing related to unicode.
*
* @author Patrice Lopez
*/
public class UnicodeUtil {

// As Java's \s does not cover the Unicode White_Space property (\s matches
// only [ \t\n\x0B\f\r]), here are the 26 code points of the "official" stable
// \p{White_Space} Unicode property
private static String whitespace_chars = "\\u0009" // CHARACTER TABULATION \t
+ "\\u000A" // LINE FEED (LF) \n -> new line
+ "\\u000B" // LINE TABULATION \v -> new line
+ "\\u000C" // FORM FEED (FF) -> break page
+ "\\u000D" // CARRIAGE RETURN (CR) \r
+ "\\u0020" // SPACE
+ "\\u0085" // NEXT LINE (NEL) -> new line
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u2028" // LINE SEPARATOR
+ "\\u2029" // PARAGRAPH SEPARATOR
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000"; // IDEOGRAPHIC SPACE

// a more restrictive selection of horizontal white space characters than the
// Unicode p{White_Space} property (which includes new line and vertical spaces)
private static String my_whitespace_chars = "\\u0009" // CHARACTER TABULATION \t
+ "\\u0020" // SPACE
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000"; // IDEOGRAPHIC SPACE

// all the horizontal low lines
private static String horizontal_low_lines_chars = "\\u005F" // low Line
+ "\\u203F" // undertie
+ "\\u2040" // character tie
+ "\\u2054" // inverted undertie
+ "\\uFE4D" // dashed low line
+ "\\uFE4E" // centreline low line
+ "\\uFE4F" // wavy low line
+ "\\uFF3F" // fullwidth low line
+ "\\uFE33" // Presentation Form For Vertical Low Line
+ "\\uFE34"; // Presentation Form For Vertical Wavy Low Line

// all the vertical lines
private static String vertical_lines_chars = "\\u007C" // vertical line
+ "\\u01C0" // Latin Letter Dental
+ "\\u05C0" // Hebrew Punctuation Paseq
+ "\\u2223" // Divides
+ "\\u2758"; // Light Vertical Bar

// all new lines
private static String new_line_chars = "\\u000C" // form feed \f - normally a page break
+ "\\u000A" // line feed \n
+ "\\u000D" // carriage return \r
+ "\\u000B" // line tabulation \v - concretely it's a new line
+ "\\u0085"; // next line (NEL)

// all bullets
private static String bullet_chars = "\\u2022" // bullet
+ "\\u2023" // triangular bullet
+ "\\u25E6" // white bullet
+ "\\u2043" // hyphen bullet
+ "\\u204C" // black leftwards bullet
+ "\\u204D" // black rightwards bullet
+ "\\u2219" // bullet operator (use in math stuff)
+ "\\u25D8" // inverse bullet
+ "\\u29BE" // circled white bullet
+ "\\u29BF" // circled bullet
+ "\\u23FA" // black circle for record
+ "\\u25CF" // black circle
+ "\\u26AB" // medium black circle
+ "\\u2B24"; // black large circle

private UnicodeUtil() {}

/**
* Normalise the space, EOL and punctuation unicode characters.
*
* In particular, all the characters which are treated as space in
* C++ (http://en.cppreference.com/w/cpp/string/byte/isspace)
* will be replaced by a plain space character,
* so that the token can be used to generate a robust feature vector
* readable as Wapiti input.
*
* @param token the token to be normalised
* @return the normalised string, suitable for Wapiti feature generation
*/
public static String normaliseToken(String token) {
if (token == null)
return null;

// see https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html
// for Unicode character properties supported by Java

// normalise all horizontal space separator characters
token = token.replaceAll("["+my_whitespace_chars+"]", " ");

// normalise all EOL - special handling of "\r\n" as one single newline
token = token.replace("\r\n", "\n").replaceAll("["+new_line_chars+"\\p{Zl}\\p{Zp}]", "\n");

// normalize dash via the unicode dash punctuation property
// note: we don't add the "hyphen bullet" character \\u2043 because it's actually a bullet
token = token.replaceAll("\\p{Pd}", "-");

// normalize horizontal low lines
token = token.replaceAll("[" + horizontal_low_lines_chars + "]", "_");

// normalize vertical lines
token = token.replaceAll("[" + vertical_lines_chars + "]", "|");

// bullet normalisation
token = token.replaceAll("[" + bullet_chars + "]", "|");

return token;
}

}
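A minimal standalone sketch of what `normaliseToken` achieves, reproducing just the whitespace and dash steps with a couple of the characters listed above (this is a simplified reproduction for illustration, not the full method):

```java
public class NormaliseSketch {
    public static void main(String[] args) {
        // Token containing NO-BREAK SPACE (U+00A0) and EM DASH (U+2014).
        String token = "pp.\u00A012\u201415";
        // Java's \s does not match U+00A0, so this leaves the token unchanged:
        System.out.println(token.replaceAll("\\s", " ").equals(token)); // true
        // Explicit character class, as in my_whitespace_chars:
        token = token.replaceAll("[\\u0009\\u0020\\u00A0\\u2009]", " ");
        // Dash normalisation via the Unicode dash-punctuation property:
        token = token.replaceAll("\\p{Pd}", "-");
        System.out.println(token); // pp. 12-15
    }
}
```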
