
Commit

last doc updates before release
Former-commit-id: 5411f72
kermitt2 committed Aug 5, 2017
1 parent 276c7fa commit 5ff63c6
Showing 6 changed files with 176 additions and 18 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -13,6 +13,8 @@ Thumbs.db
tei-alt
raw-alt

grobid-core/dependency-reduced-pom.xml

grobid-core/src/test/resources/org/grobid/core/annotations/resTeiStAXParser/out.tei.xml

grobid-home/models/affiliation-address/model.crf.old
15 changes: 12 additions & 3 deletions Readme.md
@@ -36,8 +36,9 @@ GROBID can be considered production-ready. Deployments in production include
The key aspects of GROBID are the following:

+ Written in Java, with JNI call to native CRF libraries.
+ High performance - on a 2011 low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds. [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to process 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second.
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Speed - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds.
+ Speed and Scalability: [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to extract the bibliographical references of 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second. The complete processing of 395,000 PDF (IOP) with full text structuring was performed in 12h46m with 16 threads, i.e. 0.11s per PDF (~1.72s per PDF with a single thread).
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Robust and fast PDF processing based on Xpdf and dedicated post-processing.
+ Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, currently the state of the art in bibliographical information extraction and labeling. The specialized CRF models are cascaded to build the complete document structure.
+ Full encoding in [__TEI__](http://www.tei-c.org/Guidelines/P5), both for the training corpus and the parsed results.
@@ -61,7 +62,15 @@ _Warning_: Some quota and query limitations apply to the demo server! If you are

## Latest version

The latest stable release of GROBID is version ```0.4.1```. Compared to the previous version ```0.4.0```, this version brings:
The latest stable release of GROBID is version ```0.4.2```. Compared to the previous version ```0.4.1```, this version brings:

+ F-score improvements on the PubMed Central sample: fulltext +10-14%, header +0.5%, citations +0.5%
+ More robust PDF parsing
+ Identification of equations (with PDF coordinates)
+ End-to-end evaluation with Pub2TEI conversions
+ Many fixes and refactorings

New in previous release ```0.4.1```:

+ Support for Windows thanks to the contributions of Christopher Boumenot!
+ Support for Docker.
8 changes: 4 additions & 4 deletions doc/Introduction.md
@@ -32,8 +32,9 @@ GROBID can be considered production-ready. Deployments in production include
The key aspects of GROBID are the following:

+ Written in Java, with JNI call to native CRF libraries.
+ High performance - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds. [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to process 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second.
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Speed - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes (or 3 PDF per second with the RESTful API), parsing of 3000 references in 18 seconds.
+ Speed and Scalability: [INIST](http://www.inist.fr/lang=en) recently scaled the GROBID REST service to extract the bibliographical references of 1 million PDF in 1 day on a Xeon 10 CPU E5-2660 with 10 GB memory (3 GB used on average) and 9 threads - around 11.5 PDF per second. The complete processing of 395,000 PDF (IOP) with full text structuring was performed in 12h46m with 16 threads, i.e. 0.11s per PDF (~1.72s per PDF with a single thread).
+ Lazy loading of models and resources. Depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2 GB of memory in multithreaded usage, extracting citations uses around 3 GB, and extracting the full PDF structure around 4 GB.
+ Robust and fast PDF processing based on Xpdf and dedicated post-processing.
+ Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, currently the state of the art in bibliographical information extraction and labeling. The specialized CRF models are cascaded to build the complete document structure.
+ Full encoding in [__TEI__](http://www.tei-c.org/Guidelines/P5), both for the training corpus and the parsed results.
@@ -44,8 +45,7 @@ The key aspects of GROBID are the following:

The GROBID extraction and parsing algorithms use the [Wapiti CRF library](http://wapiti.limsi.fr). The [CRF++ library](http://crfpp.googlecode.com/svn/trunk/doc/index.html) is no longer supported as of GROBID version 0.4. The C++ libraries are transparently integrated via JNI, with dynamic calls based on the current OS.

GROBID should run properly "out of the box" on MacOS X and Linux (32 and 64 bits). GROBID currently does not run on Windows environments because the required and up-to-date CRF native binaries are not yet compiled for this platform (contributors willing to work on Windows support are very welcome!).

GROBID should run properly "out of the box" on MacOS X, Linux (32 and 64 bits) and Windows.

## Credits

@@ -38,23 +38,21 @@ public class BasicStructureBuilder {
static public Pattern introductionStrict =
Pattern.compile("^\\b*(1\\.\\sPROBLEMS?|1\\.(\\n)?\\sIntroduction?|1\\.(\\n)?\\sContent?|1\\.\\sINTRODUCTION|I\\.(\\s)+Introduction|1\\.\\sProblems?|I\\.\\sEinleitung?|1\\.\\sEinleitung?|1\\sEinleitung?|1\\sIntroduction?)",
Pattern.CASE_INSENSITIVE);

static public Pattern abstract_ = Pattern.compile("^\\b*\\.?(abstract?|résumé?|summary?|zusammenfassung?)",
Pattern.CASE_INSENSITIVE);
static public Pattern keywords = Pattern.compile("^\\b*\\.?(keyword?|key\\s*word?|mots\\s*clefs?)",
Pattern.CASE_INSENSITIVE);

static public Pattern references =
/*static public Pattern keywords = Pattern.compile("^\\b*\\.?(keyword?|key\\s*word?|mots\\s*clefs?)",
Pattern.CASE_INSENSITIVE);*/
/*static public Pattern references =
Pattern.compile("^\\b*(References?|REFERENCES?|Bibliography|BIBLIOGRAPHY|" +
"References?\\s+and\\s+Notes?|References?\\s+Cited|REFERENCE?\\s+CITED|REFERENCES?\\s+AND\\s+NOTES?|Références|Literatur|" +
"LITERATURA|Literatur|Referências|BIBLIOGRAFIA|Literaturverzeichnis|Referencias|LITERATURE CITED|References and Notes)", Pattern.CASE_INSENSITIVE);
static public Pattern header = Pattern.compile("^((\\d\\d?)|([A-Z](I|V|X)*))(\\.(\\d)*)*\\s(\\D+)");
"LITERATURA|Literatur|Referências|BIBLIOGRAFIA|Literaturverzeichnis|Referencias|LITERATURE CITED|References and Notes)", Pattern.CASE_INSENSITIVE);*/
/*static public Pattern header = Pattern.compile("^((\\d\\d?)|([A-Z](I|V|X)*))(\\.(\\d)*)*\\s(\\D+)");*/
// static public Pattern header2 = Pattern.compile("^\\d\\s\\D+");
static public Pattern figure = Pattern.compile("(figure\\s|fig\\.|sch?ma)", Pattern.CASE_INSENSITIVE);
/*static public Pattern figure = Pattern.compile("(figure\\s|fig\\.|sch?ma)", Pattern.CASE_INSENSITIVE);
static public Pattern table = Pattern.compile("^(T|t)able\\s|tab|tableau", Pattern.CASE_INSENSITIVE);
static public Pattern equation = Pattern.compile("^(E|e)quation\\s");
private static Pattern acknowledgement = Pattern.compile("(acknowledge?ments?|acknowledge?ment?)",
Pattern.CASE_INSENSITIVE);
Pattern.CASE_INSENSITIVE);*/
static public Pattern headerNumbering1 = Pattern.compile("^(\\d+)\\.?\\s");
static public Pattern headerNumbering2 = Pattern.compile("^((\\d+)\\.)+(\\d+)\\s");
static public Pattern headerNumbering3 = Pattern.compile("^((\\d+)\\.)+\\s");
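The surviving `headerNumbering` patterns drive section-title detection. A small self-contained illustration of what each expression accepts (the sample headings are invented for the demo):

```java
import java.util.regex.Pattern;

public class HeaderNumberingDemo {
    // Same regular expressions as in BasicStructureBuilder.
    static final Pattern headerNumbering1 = Pattern.compile("^(\\d+)\\.?\\s");
    static final Pattern headerNumbering2 = Pattern.compile("^((\\d+)\\.)+(\\d+)\\s");
    static final Pattern headerNumbering3 = Pattern.compile("^((\\d+)\\.)+\\s");

    public static void main(String[] args) {
        // Top-level numbering, with or without a trailing dot:
        System.out.println(headerNumbering1.matcher("1. Introduction").find()); // true
        System.out.println(headerNumbering1.matcher("2 Methods").find());       // true
        // Multi-level numbering without a trailing dot:
        System.out.println(headerNumbering2.matcher("2.1 Results").find());     // true
        // Multi-level numbering with a trailing dot:
        System.out.println(headerNumbering3.matcher("3.1. Discussion").find()); // true
    }
}
```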
@@ -367,15 +367,15 @@ static public Pair<String, LayoutTokenization> getBodyTextFeatured(Document doc,
//nn++;
continue;
}
text = text.replace(" ", "");
text = text.replaceAll("\\s+", "");
if (text.length() == 0) {
n++;
mm++;
nn++;
continue;
}

if (text.equals("\n") || text.equals("\r")) {
if (text.equals("\n") || text.equals("\r") || text.equals("\t")) {
newline = true;
previousNewline = true;
n++;
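The switch from `replace(" ", "")` to `replaceAll("\\s+", "")` matters for the empty-token check that follows it: the old call stripped only the ASCII space character, so a token consisting of tabs or newlines kept a non-zero length, while the regex form reduces any pure-whitespace token to the empty string and lets the `length() == 0` branch skip it. A quick check of the difference:

```java
public class WhitespaceStripDemo {
    public static void main(String[] args) {
        String token = " \t \n ";
        // Old behaviour: only the literal space character is removed.
        String oldResult = token.replace(" ", "");
        // New behaviour: all whitespace (space, tab, newline) is removed.
        String newResult = token.replaceAll("\\s+", "");
        System.out.println(oldResult.length()); // 2 -> token would NOT be skipped
        System.out.println(newResult.length()); // 0 -> token is skipped
    }
}
```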
149 changes: 149 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/utilities/UnicodeUtil.java
@@ -0,0 +1,149 @@
package org.grobid.core.utilities;

/**
* Class for holding static methods for processing related to unicode.
*
* @author Patrice Lopez
*/
public class UnicodeUtil {

// As Java's \s does not cover the Unicode White_Space property (\s matches
// only [ \t\n\x0B\f\r]), here are the 26 code points of the "official" stable
// \p{White_Space} Unicode property
private static String whitespace_chars = "\\u0009" // CHARACTER TABULATION \t
+ "\\u000A" // LINE FEED (LF) \n -> new line
+ "\\u000B" // LINE TABULATION \v -> new line
+ "\\u000C" // FORM FEED (FF) -> break page
+ "\\u000D" // CARRIAGE RETURN (CR) \r
+ "\\u0020" // SPACE
+ "\\u0085" // NEXT LINE (NEL) -> new line
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u2028" // LINE SEPARATOR
+ "\\u2029" // PARAGRAPH SEPARATOR
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000"; // IDEOGRAPHIC SPACE

// a more restrictive selection of horizontal white space characters than the
// Unicode p{White_Space} property (which includes new line and vertical spaces)
private static String my_whitespace_chars = "\\u0009" // CHARACTER TABULATION \t
+ "\\u0020" // SPACE
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000"; // IDEOGRAPHIC SPACE

// all the horizontal low lines
private static String horizontal_low_lines_chars = "\\u005F" // low Line
+ "\\u203F" // undertie
+ "\\u2040" // character tie
+ "\\u2054" // inverted undertie
+ "\\uFE4D" // dashed low line
+ "\\uFE4E" // centreline low line
+ "\\uFE4F" // wavy low line
+ "\\uFF3F" // fullwidth low line
+ "\\uFE33" // Presentation Form For Vertical Low Line
+ "\\uFE34"; // Presentation Form For Vertical Wavy Low Line

// all the vertical lines
private static String vertical_lines_chars = "\\u007C" // vertical line
+ "\\u01C0" // Latin Letter Dental
+ "\\u05C0" // Hebrew Punctuation Paseq
+ "\\u2223" // Divides
+ "\\u2758"; // Light Vertical Bar

// all new lines
private static String new_line_chars = "\\u000C" // form feed \f - normally a page break
+ "\\u000A" // line feed \n
+ "\\u000D" // carriage return \r
+ "\\u000B" // line tabulation \v - concretely it's a new line
+ "\\u0085"; // next line (NEL)

// all bullets
private static String bullet_chars = "\\u2022" // bullet
+ "\\u2023" // triangular bullet
+ "\\u25E6" // white bullet
+ "\\u2043" // hyphen bullet
+ "\\u204C" // black leftwards bullet
+ "\\u204D" // black rightwards bullet
+ "\\u2219" // bullet operator (use in math stuff)
+ "\\u25D8" // inverse bullet
+ "\\u29BE" // circled white bullet
+ "\\u29BF" // circled bullet
+ "\\u23FA" // black circle for record
+ "\\u25CF" // black circle
+ "\\u26AB" // medium black circle
+ "\\u2B24"; // black large circle

private UnicodeUtil() {}

/**
* Normalise the space, EOL and punctuation unicode characters.
*
* In particular, all the characters which are treated as space in
* C++ (http://en.cppreference.com/w/cpp/string/byte/isspace)
* will be replaced by a plain space character,
* so that the token can be used to generate a robust feature vector
* readable as Wapiti input.
*
* @param token the token to be normalised
* @return the normalised string, suitable for Wapiti feature generation
*/
public static String normaliseToken(String token) {
if (token == null)
return null;

// see https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html
// for Unicode character properties supported by Java

// normalise all horizontal space separator characters
token = token.replaceAll("["+my_whitespace_chars+"]", " ");

// normalise all EOL - special handling of "\r\n" as one single newline
token = token.replace("\r\n", "\n").replaceAll("["+new_line_chars+"\\p{Zl}\\p{Zp}]", "\n");

// normalize dash via the unicode dash punctuation property
// note: we don't add the "hyphen bullet" character \\u2043 because it's actually a bullet
token = token.replaceAll("\\p{Pd}", "-");

// normalize horizontal low lines
token = token.replaceAll("[" + horizontal_low_lines_chars + "]", "_");

// normalize vertical lines
token = token.replaceAll("[" + vertical_lines_chars + "]", "|");

// bullet normalisation
token = token.replaceAll("[" + bullet_chars + "]", "|");

return token;
}

}
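A minimal standalone sketch of what `normaliseToken` achieves, reproducing just the whitespace and dash steps with a couple of the characters listed above (this is a simplified reproduction for illustration, not the full method):

```java
public class NormaliseSketch {
    public static void main(String[] args) {
        // Token containing NO-BREAK SPACE (U+00A0) and EM DASH (U+2014).
        String token = "pp.\u00A012\u201415";
        // Java's \s does not match U+00A0, so this leaves the token unchanged:
        System.out.println(token.replaceAll("\\s", " ").equals(token)); // true
        // Explicit character class, as in my_whitespace_chars:
        token = token.replaceAll("[\\u0009\\u0020\\u00A0\\u2009]", " ");
        // Dash normalisation via the Unicode dash-punctuation property:
        token = token.replaceAll("\\p{Pd}", "-");
        System.out.println(token); // pp. 12-15
    }
}
```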
