Maven
<repositories>
   <repository>
      <id>repository.groupdocs.com</id>
      <name>GroupDocs Repository</name>
      <url>https://releases.groupdocs.com/java/repo/</url>
   </repository>
</repositories>

<dependencies>
   <dependency>
      <groupId>com.groupdocs</groupId>
      <artifactId>groupdocs-parser</artifactId>
      <version>23.11</version>
   </dependency>
</dependencies>
Gradle
repositories {
    maven {
        url 'https://releases.groupdocs.com/java/repo/'
    }
}

// The 'compile' configuration was removed in Gradle 7; 'implementation' replaces it
implementation group: 'com.groupdocs', name: 'groupdocs-parser', version: '23.11'
Ivy
<ivysettings>
    <settings defaultResolver="chain"/>
    <resolvers>
        <chain name="chain">
            <ibiblio name="GroupDocs Repository" m2compatible="true" root="https://releases.groupdocs.com/java/repo/"/>
        </chain>
    </resolvers>
</ivysettings>

<dependency org="com.groupdocs" name="groupdocs-parser" rev="23.11">
   <artifact name="groupdocs-parser" ext="jar"/>
</dependency>
SBT
// Resolver.url defaults to Ivy-style patterns; "name" at "url" declares a Maven-layout repository
resolvers += "GroupDocs Repository" at "https://releases.groupdocs.com/java/repo/"

libraryDependencies += "com.groupdocs" % "groupdocs-parser" % "23.11"

Text Extraction & Parsing Java High Code API


Product Page | Docs | Demos | API Reference | Examples | Blog | Free Support | Temporary License

GroupDocs.Parser for Java is an on-premise API that enables your Java applications to parse and extract data from a wide range of file formats. It extracts hyperlinks, tables, barcodes, text, and images, and can also pull data from ZIP archives, email archives, PDF portfolios, and databases. You can also define your own templates containing fixed, regex, and linked field positions for accurate data extraction.
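To make the idea of a regex field concrete, here is a minimal sketch in plain Java that pulls named values out of already-extracted text. It deliberately uses only `java.util.regex` and is not the GroupDocs.Parser template API; the class name `RegexFieldSketch`, the helper `extractFields`, the field names, and the sample invoice text are all hypothetical and chosen for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFieldSketch {

    // Hypothetical helper (not part of GroupDocs.Parser): applies each named
    // pattern to the text and keeps the first capture group of the first match.
    public static Map<String, String> extractFields(String text, Map<String, Pattern> fields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> field : fields.entrySet()) {
            Matcher m = field.getValue().matcher(text);
            if (m.find()) {
                result.put(field.getKey(), m.group(1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for text extracted from a document
        String text = "Invoice Number: INV-0042\nDate: 2023-11-24\nTotal: 199.50 USD\n";

        // Each field is a name plus a pattern with exactly one capture group
        Map<String, Pattern> fields = new LinkedHashMap<>();
        fields.put("invoiceNumber", Pattern.compile("Invoice Number:\\s*(\\S+)"));
        fields.put("date", Pattern.compile("Date:\\s*(\\d{4}-\\d{2}-\\d{2})"));
        fields.put("total", Pattern.compile("Total:\\s*([0-9.]+)"));

        System.out.println(extractFields(text, fields));
    }
}
```

A field whose pattern does not match is simply left out of the result, so callers can tell a missing value apart from an empty one.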

Text Extraction & Parsing Java On-Premise API Features

  • Document parsing via user-defined template
    • Create a user-defined template with data field & table definitions.
    • Parse documents (for example, invoices) via user-defined templates and extract fields and tables.
  • Supports extraction of various text elements, such as:
    • Plain text extraction
    • Formatted text extraction as simple text, HTML or Markdown (MD)
    • Structured text extraction in XML form
    • Text area extraction by specific coordinates or text style
    • Extract text around (in context of) a specific word
  • Supports various extraction modes, such as:
    • Accurate Text Extraction Mode: The default text extraction mode with the best possible text quality.
    • Raw Text Extraction Mode: A faster extraction mode whose text quality is lower than that of the accurate mode.
  • Extract the text of the whole document or extract only the desired document page.
  • Ability to search documents using specific keywords or regular expressions.
  • Supports metadata extraction & image extraction from Microsoft Word®, Excel®, PowerPoint®, PDF® & other document types.
  • Extract table of contents (TOC) from Microsoft Office® Word® & EPUB eBook formats.
  • Ability to extract data from containers (archives) such as ZIP files, PDF portfolios, and OST containers.
  • Ability to iterate through the form fields and extract PDF Form data.
  • Extract data from databases (e.g. SQLite) via JDBC.
  • Extract information from Microsoft OneNote® notebooks.
  • Extract all hyperlinks from the whole document, from a specific page, or from a specific page area only.
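
The keyword search and "text around a specific word" features listed above can be pictured with a small plain-Java sketch that works on text that has already been extracted. This is only an illustration of the concept using `java.util.regex`, not the GroupDocs.Parser search API; the class `KeywordContextSketch`, the helper `findWithContext`, and the `radius` parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordContextSketch {

    // Hypothetical helper (not part of GroupDocs.Parser): returns each match of
    // the pattern together with up to `radius` characters of surrounding text
    // on either side, clipped to the bounds of the input.
    public static List<String> findWithContext(String text, Pattern pattern, int radius) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            int from = Math.max(0, m.start() - radius);
            int to = Math.min(text.length(), m.end() + radius);
            hits.add(text.substring(from, to));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for text extracted from a document
        String text = "The parser reads the file. A second parser handles images.";

        // A plain keyword is just a regular expression without special characters
        Pattern keyword = Pattern.compile("parser", Pattern.CASE_INSENSITIVE);

        for (String hit : findWithContext(text, keyword, 6)) {
            System.out.println("..." + hit + "...");
        }
    }
}
```

Passing a compiled `Pattern` keeps keyword and regular-expression searches on the same code path; a literal keyword containing special characters could be wrapped with `Pattern.quote` first.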

Supported Document Parser File Formats

Microsoft Word®: DOC/DOT/DOCX/DOCM/DOTX/DOTM/RTF/TXT
OpenOffice Writer®: ODT/OTT
Microsoft Excel®: XLS/XLT/XLSX/XLSM/XLSB/XLTX/XLTM/XLA/XLAM
OpenOffice Calc®: ODS/OTS/CSV
Apple® iWork: NUMBERS
Microsoft PowerPoint®: PPT/PPS/POT/PPTX/PPTM/POTX/POTM/PPSX/PPSM
OpenOffice Impress®: ODP/OTP
Microsoft Outlook®: PST/OST/EML/MSG
Apple® Mail Message: EMLX
Microsoft OneNote®: ONE
Fixed Layout: PDF
Postscript: PS
Markup: XHTML/MHTML/MD/XML
eBook: CHM/EPUB/FB2
Archive: ZIP/RAR/TAR/GZ/BZ2
Image: BMP/GIF/JPG/JPEG/JPE/JP2/PNG/TIF/TIFF/DJVU/J2K/WEBP
Vector: SVG/SVGZ
Adobe Photoshop®: PSD
Medical Imaging: DICOM
Metafile: EMF/WMF
Database: JDBC

For details and limitations, please visit Supported Document Formats.

System Requirements

  • Microsoft Windows: Windows Desktop & Server (x86, x64), Microsoft Azure
  • macOS: Mac OS X
  • Linux: Ubuntu, OpenSUSE, CentOS, and others
  • Java Versions: J2SE 7.0 (1.7), J2SE 8.0 (1.8) or above (for example Java 10)

GroupDocs.Parser for Java does not require any external software or third-party tools to be installed. Follow one of the methods described in Installation and Configuration.

Get Started

GroupDocs hosts all Java APIs at the GroupDocs Repository. You can use the GroupDocs.Parser for Java API directly in your Maven projects with a simple configuration. For detailed instructions, please visit the Installation from GroupDocs Repository using Maven documentation page.

Sample Java code for page-by-page text extraction from a PDF

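import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;

// Constants.SamplePdf is assumed to hold the path to a sample PDF file
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    int pageCount = documentInfo.getPageCount();
    // Iterate over pages (page indexes are zero-based)
    for (int p = 0; p < pageCount; p++) {
        // Print the page number
        System.out.println(String.format("Page %d/%d", p + 1, pageCount));
        // Extract the page text into the reader
        try (TextReader reader = parser.getText(p)) {
            // The reader is null when text extraction isn't supported for the document
            System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
        }
    }
}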


Version     Release Date
23.11       November 24, 2023
23.10       October 21, 2023
23.9        September 17, 2023
23.2        March 1, 2023
22.11       November 30, 2022
22.6        June 8, 2022
22.3        March 17, 2022
20.5        January 25, 2022
20.12       January 25, 2022
18.9        January 25, 2022
18.11       January 25, 2022
21.2        February 27, 2021
20.8        August 19, 2020
20.6        June 30, 2020
20.3        April 1, 2020
20.1        February 4, 2020
19.11       December 3, 2019
19.5        May 29, 2019
18.12       December 11, 2018
18.10       October 10, 2018
18.7        July 3, 2018