Maven
<repositories>
    <repository>
        <id>repository.groupdocs.com</id>
        <name>GroupDocs Repository</name>
        <url>https://releases.groupdocs.com/java/repo/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>com.groupdocs</groupId>
        <artifactId>groupdocs-parser</artifactId>
        <version>24.6</version>
    </dependency>
</dependencies>
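Gradle
repositories {
    maven {
        url 'https://releases.groupdocs.com/java/repo/'
    }
}
// On Gradle versions before 3.4, use 'compile' instead of 'implementation'
dependencies {
    implementation(group: 'com.groupdocs', name: 'groupdocs-parser', version: '24.6')
}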
Ivy
<ivysettings>
    <settings defaultResolver="chain"/>
    <resolvers>
        <chain name="chain">
            <ibiblio name="GroupDocs Repository" m2compatible="true" root="https://releases.groupdocs.com/java/repo/"/>
        </chain>
    </resolvers>
</ivysettings>
<dependency org="com.groupdocs" name="groupdocs-parser" rev="24.6">
    <artifact name="groupdocs-parser" ext="jar"/>
</dependency>
sbt
// The repository uses the Maven layout, so it is declared with "at"
resolvers += "GroupDocs Repository" at "https://releases.groupdocs.com/java/repo/"
libraryDependencies += "com.groupdocs" % "groupdocs-parser" % "24.6"
Text Extraction & Parsing Java High Code API
Product Page | Docs | Demos | API Reference | Examples | Blog | Free Support | Temporary License
GroupDocs.Parser for Java is an on-premise API that enables your Java applications to parse and extract data from a variety of file formats. It lets you extract hyperlinks, tables, barcodes, text, and images, as well as data from ZIP archives, email archives, PDF portfolios, and databases. GroupDocs.Parser for Java can also be used to define templates containing fixed, regex, and linked field positions for accurate data extraction.
Text Extraction & Parsing Java On-Premise API Features
- Document parsing via user-defined templates:
  - Create a user-defined template with data field and table definitions.
  - Parse documents via user-defined templates to extract data such as invoices and tables.
- Supports extraction of various text elements, such as:
  - Plain text extraction
  - Formatted text extraction as simple text, HTML, or Markdown (MD)
  - Structured text extraction in XML form
  - Text area extraction by specific coordinates or text style
  - Extraction of text around (in the context of) a specific word
- Supports various extraction modes, such as:
  - Accurate Text Extraction Mode: the default mode, with the best possible text quality.
  - Raw Text Extraction Mode: faster than the accurate mode, but with lower text quality.
- Extract the text of the whole document or extract only the desired document page.
- Search documents by specific keywords or by regular expression.
- Supports metadata extraction & image extraction from Microsoft Word®, Excel®, PowerPoint®, PDF® & other document types.
- Extract table of contents (TOC) from Microsoft Office® Word® & EPUB eBook formats.
- Extract data from containers (archives) such as ZIP files, PDF portfolios, and OST containers.
- Iterate through form fields and extract PDF form data.
- Extract data from databases (e.g. SQLite) via JDBC.
- Extract information from Microsoft OneNote® notebooks.
- Extract all hyperlinks from the whole document, from a specific page, or from a specific page area.
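To make the keyword/regex search and "text around a word" features more concrete, here is a minimal sketch of the underlying idea using only the Java standard library. It is not the GroupDocs.Parser search API: `ContextSearch`, `findWithContext`, and the `radius` parameter are hypothetical names chosen for illustration, and the input is assumed to be plain text that has already been extracted from a document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of GroupDocs.Parser): regex search with surrounding context. */
final class ContextSearch {

    /**
     * Returns every match of {@code regex} in {@code text} (case-insensitive),
     * padded with up to {@code radius} characters of context on each side.
     */
    static List<String> findWithContext(String text, String regex, int radius) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            // Clamp the context window to the bounds of the text
            int from = Math.max(0, m.start() - radius);
            int to = Math.min(text.length(), m.end() + radius);
            hits.add(text.substring(from, to));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for text previously extracted from a document
        String text = "Invoice No. 2024-117 was issued on June 27. Invoice total: 1,250.00 USD.";
        for (String hit : findWithContext(text, "invoice", 10)) {
            System.out.println("..." + hit + "...");
        }
    }
}
```

In practice the `text` argument would come from whatever extraction step precedes the search; a plain keyword works as the `regex` argument as long as it contains no regex metacharacters.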
Supported Document Parser File Formats
Microsoft Word®: DOC/DOT/DOCX/DOCM/DOTX/DOTM/RTF/TXT
OpenOffice Writer®: ODT/OTT
Microsoft Excel®: XLS/XLT/XLSX/XLSM/XLSB/XLTX/XLTM/XLA/XLAM
OpenOffice Calc®: ODS/OTS/CSV
Apple® iWork: NUMBERS
Microsoft PowerPoint®: PPT/PPS/POT/PPTX/PPTM/POTX/POTM/PPSX/PPSM
OpenOffice Impress®: ODP/OTP
Microsoft Outlook®: PST/OST/EML/MSG
Apple® Mail Message: EMLX
Microsoft OneNote®: ONE
Fixed Layout: PDF
Postscript: PS
Markup: XHTML/MHTML/MD/XML
eBook: CHM/EPUB/FB2
Archive: ZIP/RAR/TAR/GZ/BZ2
Image: BMP/GIF/JPG/JPEG/JPE/JP2/PNG/TIF/TIFF/DJVU/J2K/WEBP
Vector: SVG/SVGZ
Adobe Photoshop®: PSD
Medical Imaging: DICOM
Metafile: EMF/WMF
Database: JDBC
For details and limitations, see Supported Document Formats.
System Requirements
- Microsoft Windows: Windows Desktop & Server (x86, x64), Microsoft Azure
- macOS: Mac OS X
- Linux: Ubuntu, OpenSUSE, CentOS, and others
- Java Versions: J2SE 7.0 (1.7), J2SE 8.0 (1.8), or above (for example, Java 10)
GroupDocs.Parser for Java does not require any external software or third-party tools to be installed. Follow one of the methods described in Installation and Configuration.
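As a rough illustration of the Java version requirement, the sketch below checks at startup that the running JVM is Java 7 or later. It uses only the standard `java.specification.version` system property; `JavaVersionCheck` and `featureVersion` are hypothetical names and are not part of GroupDocs.Parser.

```java
/** Hypothetical pre-flight check (not part of GroupDocs.Parser). */
final class JavaVersionCheck {

    /** Maps a specification version such as "1.7", "1.8", "10", or "17" to 7, 8, 10, or 17. */
    static int featureVersion(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Legacy scheme: "1.x" denotes Java x; newer releases report the number directly
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        int version = featureVersion(System.getProperty("java.specification.version"));
        if (version < 7) {
            throw new IllegalStateException("Java 7 or later is required, found Java " + version);
        }
        System.out.println("Java " + version + " meets the stated minimum (Java 7).");
    }
}
```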
Get Started
GroupDocs hosts all Java APIs in the GroupDocs Repository. You can use the GroupDocs.Parser for Java API directly in your Maven projects with a simple configuration. For detailed instructions, see the Installation from GroupDocs Repository using Maven documentation page.
Sample Java code for page-by-page text extraction from a PDF
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    // Iterate over the pages (the page index starts at 0)
    for (int p = 0; p < documentInfo.getPageCount(); p++) {
        // Print the page number
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
        // Extract the page text into a reader
        try (TextReader reader = parser.getText(p)) {
            // Print the text of the document page
            System.out.println(reader.readToEnd());
        }
    }
}
Version | Release Date |
---|---|
24.6 | June 27, 2024 |
24.3 | March 29, 2024 |
23.11 | November 24, 2023 |
23.10 | October 21, 2023 |
23.9 | September 17, 2023 |
23.2 | March 1, 2023 |
22.11 | November 30, 2022 |
22.6 | June 8, 2022 |
22.3 | March 17, 2022 |
20.5 | January 25, 2022 |
20.12 | January 25, 2022 |
18.9 | January 25, 2022 |
18.11 | January 25, 2022 |
21.2 | February 27, 2021 |
20.8 | August 19, 2020 |
20.6 | June 30, 2020 |
20.3 | April 1, 2020 |
20.1 | February 4, 2020 |
19.11 | December 3, 2019 |
19.5 | May 29, 2019 |
18.12 | December 11, 2018 |
18.10 | October 10, 2018 |
18.7 | July 3, 2018 |