GroupDocs.Parser for Java 24.6

<repositories>
   <repository>
      <id>repository.groupdocs.com</id>
      <name>GroupDocs Repository</name>
      <url>https://releases.groupdocs.com/java/repo/</url>
   </repository>
</repositories>

<dependencies>
   <dependency>
      <groupId>com.groupdocs</groupId>
      <artifactId>groupdocs-parser</artifactId>
      <version>24.6</version>
   </dependency>
</dependencies>


repositories {
    maven {
        url 'https://releases.groupdocs.com/java/repo/'
    }
}

implementation(group: 'com.groupdocs', name: 'groupdocs-parser', version: '24.6')


<ivysettings>
    <settings defaultResolver="chain"/>
    <resolvers>
        <chain name="chain">
            <ibiblio name="GroupDocs Repository" m2compatible="true" root="https://releases.groupdocs.com/java/repo/"/>
        </chain>
    </resolvers>
</ivysettings>

<dependency org="com.groupdocs" name="groupdocs-parser" rev="24.6">
   <artifact name="groupdocs-parser" ext="jar"/>
</dependency>


resolvers += Resolver.url("GroupDocs Repository", url("https://releases.groupdocs.com/java/repo/"))

libraryDependencies += "com.groupdocs" % "groupdocs-parser" % "24.6"


Text Extraction & Parsing Java High Code API

GroupDocs.Parser for Java is an on-premise API that enables your Java applications to parse and extract data from a wide range of file formats. It extracts text, hyperlinks, tables, barcodes, and images, and it can also pull data out of ZIP archives, email archives, PDF portfolios, and databases. You can also define your own templates containing fixed, regex, and linked field positions for accurate data extraction.
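To make the idea of a regex field more concrete, here is a minimal, library-free sketch. It does not use the GroupDocs.Parser template API; the class `RegexFieldSketch`, the helper `extractFields`, the field names, and the invoice text are all hypothetical and only illustrate how a named pattern can pick one value out of text that has already been extracted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFieldSketch {

    // Hypothetical helper (not part of GroupDocs.Parser): applies each named
    // pattern to the text and keeps capture group 1 of the first match.
    // Every pattern must therefore declare at least one capture group.
    // Fields whose pattern does not match are left out of the result.
    static Map<String, String> extractFields(String text, Map<String, Pattern> fields) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> field : fields.entrySet()) {
            Matcher m = field.getValue().matcher(text);
            if (m.find()) {
                values.put(field.getKey(), m.group(1).trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for text extracted from an invoice
        String text = "Invoice Number: INV-0042\nDate: 2024-06-27\nTotal: 118.50 USD\n";

        // The "template": field name -> pattern whose first group is the value
        Map<String, Pattern> fields = new LinkedHashMap<>();
        fields.put("invoiceNumber", Pattern.compile("Invoice Number:\\s*(\\S+)"));
        fields.put("date", Pattern.compile("Date:\\s*(\\d{4}-\\d{2}-\\d{2})"));
        fields.put("total", Pattern.compile("Total:\\s*([0-9.]+)"));

        System.out.println(extractFields(text, fields));
    }
}
```

Fixed and linked fields locate values by position on the page rather than by pattern, so they are not covered by this text-only sketch.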

Text Extraction & Parsing Java On-Premise API Features

Document parsing via user-defined template
- Create a user-defined template with data field & table definitions.
- Parse documents such as invoices via user-defined templates and extract data such as fields and tables.
Supports extraction of various text elements, such as:
- Plain text extraction
- Formatted text extraction as simple text, HTML or Markdown (MD)
- Structured text extraction in the XML form
- Text area extraction by specific coordinates or text style
- Extract text around (in context of) a specific word
Supports various extraction modes, such as:
- Accurate Text Extraction Mode: The default text extraction mode with the best possible text quality.
- Raw Text Extraction Mode: A faster extraction mode whose text quality is lower than that of the accurate mode.
Extract the text of the whole document or extract only the desired document page.
Ability to search documents using specific keywords or via regular expression.
Supports metadata extraction & image extraction from Microsoft Word®, Excel®, PowerPoint®, PDF® & other document types.
Extract table of contents (TOC) from Microsoft Office® Word® & EPUB eBook formats.
Ability to extract data from containers (archives) such as ZIP files, PDF portfolios, and OST containers.
Ability to iterate through the form fields and extract PDF Form data.
Extract data from databases (e.g. SQLite) via JDBC.
Extract information from Microsoft OneNote® notebooks.
Extract all hyperlinks from the whole document, from a specific page, or from a specific page area only.
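The keyword and regular-expression search features, and the extraction of text around a specific word, can be illustrated on plain text with the standard `java.util.regex` package. The sketch below is not the GroupDocs.Parser search API; the class `SearchContextSketch`, the helper `search`, and the sample sentence are hypothetical, and the input is assumed to be text that was already extracted from a document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchContextSketch {

    // Hypothetical helper (not part of GroupDocs.Parser): returns every match
    // of the pattern together with up to `radius` characters of surrounding
    // text on each side, clipped to the bounds of the input.
    static List<String> search(String text, Pattern pattern, int radius) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            int from = Math.max(0, m.start() - radius);
            int to = Math.min(text.length(), m.end() + radius);
            hits.add(text.substring(from, to));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for extracted document text
        String text = "Payment is due within 30 days. Late payment incurs a fee.";

        // Keyword search: quote the keyword so it is matched literally,
        // and ignore case so "Payment" and "payment" are both found
        Pattern keyword = Pattern.compile(Pattern.quote("payment"), Pattern.CASE_INSENSITIVE);
        for (String hit : search(text, keyword, 10)) {
            System.out.println("keyword hit: " + hit);
        }

        // Regular-expression search: a number followed by the word "days"
        Pattern regex = Pattern.compile("\\d+ days");
        for (String hit : search(text, regex, 10)) {
            System.out.println("regex hit: " + hit);
        }
    }
}
```

Treating a keyword as a quoted pattern lets one code path serve both search styles; the only difference is whether the search term is escaped first.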

Supported Document Parser File Formats

Microsoft Word®: DOC/DOT/DOCX/DOCM/DOTX/DOTM/RTF/TXT
OpenOffice Writer®: ODT/OTT
Microsoft Excel®: XLS/XLT/XLSX/XLSM/XLSB/XLTX/XLTM/XLA/XLAM
OpenOffice Calc®: ODS/OTS/CSV
Apple® iWork: NUMBERS
Microsoft PowerPoint®: PPT/PPS/POT/PPTX/PPTM/POTX/POTM/PPSX/PPSM
OpenOffice Impress®: ODP/OTP
Microsoft Outlook®: PST/OST/EML/MSG
Apple® Mail Message: EMLX
Microsoft OneNote®: ONE
Fixed Layout: PDF
Postscript: PS
Markup: XHTML/MHTML/MD/XML
eBook: CHM/EPUB/FB2
Archive: ZIP/RAR/TAR/GZ/BZ2
Image: BMP/GIF/JPG/JPEG/JPE/JP2/PNG/TIF/TIFF/DJVU/J2K/WEBP
Vector: SVG/SVGZ
Adobe Photoshop®: PSD
Medical Imaging: DICOM
Metafile: EMF/WMF
Database: JDBC

For details and limitations, please visit Supported Document Formats.

System Requirements

Microsoft Windows: Windows Desktop & Server (x86, x64), Microsoft Azure
macOS: Mac OS X
Linux: Ubuntu, OpenSUSE, CentOS, and others
Java Versions: J2SE 7.0 (1.7), J2SE 8.0 (1.8) or above (for example Java 10)

GroupDocs.Parser for Java does not require any external software or third-party tools to be installed. Just follow one of the methods described in Installation and Configuration.

Get Started

GroupDocs hosts all Java APIs at the GroupDocs Repository. You can use the GroupDocs.Parser for Java API directly in your Maven projects with a simple configuration. For detailed instructions, please visit the Installation from GroupDocs Repository using Maven documentation page.

Sample Java code for text extraction from each page of a PDF

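import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;

// Create an instance of the Parser class (Constants.SamplePdf is the path to a sample PDF file)
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    // Iterate over the pages (page indexes are zero-based)
    for (int p = 0; p < documentInfo.getPageCount(); p++) {
        // Print the one-based page number
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
        // Extract the text of the page into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print the page text; a null reader means text extraction isn't supported
            System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
        }
    }
}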

Version	Release Date
24.6	June 27, 2024
24.3	March 29, 2024
23.11	November 24, 2023
23.10	October 21, 2023
23.9	September 17, 2023
23.2	March 1, 2023
22.11	November 30, 2022
22.6	June 8, 2022
22.3	March 17, 2022
20.5	January 25, 2022
20.12	January 25, 2022
18.9	January 25, 2022
18.11	January 25, 2022
21.2	February 27, 2021
20.8	August 19, 2020
20.6	June 30, 2020
20.3	April 1, 2020
20.1	February 4, 2020
19.11	December 3, 2019
19.5	May 29, 2019
18.12	December 11, 2018
18.10	October 10, 2018
18.7	July 3, 2018