GroupDocs.Parser for .NET 24.6 (MSI) adds an OCR (Optical Character Recognition) feature for extracting text from images and scanned PDFs. With this release, .NET developers can convert image-based content into searchable text.
New Functionality: Image and PDF Text Extraction
The .NET parser API now supports extracting text from images and from PDFs that contain no plain text layer. The feature uses OCR to convert image-based content into editable text. Here is how you can extract text from a PDF document in C#:
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

// Create an instance of the Parser class
using (Parser parser = new Parser("scanned.pdf"))
{
    // Create TextOptions with OCR enabled (second argument)
    TextOptions options = new TextOptions(false, true);
    // Extract text using OCR
    using (TextReader reader = parser.GetText(options))
    {
        // Print the text or a 'not supported' message
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}
This code sample illustrates extracting text from images:
// Create an instance of the Parser class
using (Parser parser = new Parser("scanned.jpg"))
{
    // Extract text using OCR
    using (TextReader reader = parser.GetText())
    {
        // Print the text or a 'not supported' message
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}
To use this feature, your project must target .NET Core 3.1 or later. OCR currently supports the English language only.
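Both snippets above repeat the same null check on the reader returned by GetText. As a minimal sketch, that check could be factored into a small helper. Note that OcrTextHelper and ReadAllOrFallback are hypothetical names introduced here for illustration; they are not part of the GroupDocs.Parser API, and the helper relies only on System.IO.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of GroupDocs.Parser.
public static class OcrTextHelper
{
    // Returns everything the reader contains, or the fallback message
    // when the reader is null (text extraction is not supported).
    public static string ReadAllOrFallback(TextReader reader, string fallback)
    {
        return reader == null ? fallback : reader.ReadToEnd();
    }
}
```

With this in place, the print line in either snippet could be written as Console.WriteLine(OcrTextHelper.ReadAllOrFallback(reader, "Text extraction isn't supported")); so the fallback message lives in one place.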
API Changes
The OcrConnectorBase class gains three new properties in this release of the .NET API: IsTextAreasSupported, IsTextPageSupported, and IsTextSupported.
For the full list of new features, enhancements, and bug fixes in this release, see the GroupDocs.Parser for .NET 24.6 Release Notes.