Latest Release (December 2025)
We’re happy to announce the first release of GroupDocs.Parser for Python via .NET – a powerful document parsing and data extraction library that enables Python developers to extract text, images, attachments, barcodes, and structured content from a wide range of document formats.
What’s New
This initial release exposes the GroupDocs.Parser .NET parsing engine to Python developers through a single, unified API for document parsing and data extraction.
Key Features
- Rich text extraction & search – Extract plain or formatted text from PDFs, Office documents, emails, e‑books, archives and more, with page‑level access and advanced search options (case‑sensitive, whole‑word, regex)
- Structured content & templates – Parse document structure (headings, paragraphs, tables, text areas) and use templates to pull out strongly‑typed fields from invoices, receipts and other business documents
- Images, attachments & barcodes – Extract embedded images, file attachments and barcodes from supported document and image formats
- OCR for scanned documents – Use OCR to read text from scanned PDFs and raster images, optionally combining it with spell‑checking for better recognition quality
- Wide format & platform support – Work with dozens of document, image and archive formats on Windows, Linux and macOS
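To make the three search options named above concrete, here is a minimal sketch of what case-sensitive, whole-word and regex matching mean when applied to text that has already been extracted. It uses only Python's standard `re` module and does not call the GroupDocs.Parser search API; the helper `find_matches` and the sample string are hypothetical and exist purely for illustration.

```python
import re


def find_matches(text, query, *, case_sensitive=False, whole_word=False, regex=False):
    """Return (offset, matched_text) pairs for every match of query in text.

    Hypothetical helper built on the standard library; not part of GroupDocs.Parser.
    """
    # Treat the query as a literal string unless regex matching is requested.
    pattern = query if regex else re.escape(query)
    if whole_word:
        # Require a word boundary on both sides of the match.
        pattern = rf"\b(?:{pattern})\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [(m.start(), m.group()) for m in re.finditer(pattern, text, flags)]


sample = "Invoice total: 120 USD. Subtotal: 100 USD. TOTAL due on 2025-12-31."

# Default search: case-insensitive, matches inside longer words too.
print(find_matches(sample, "total"))
# Whole-word search skips the "total" inside "Subtotal".
print(find_matches(sample, "total", whole_word=True))
# Case-sensitive whole-word search keeps only the lowercase occurrence.
print(find_matches(sample, "total", case_sensitive=True, whole_word=True))
# Regex search for an ISO-style date.
print(find_matches(sample, r"\d{4}-\d{2}-\d{2}", regex=True))
```

The library's own search options may differ in naming and behaviour; this sketch only illustrates the matching semantics.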
Supported Document Formats
GroupDocs.Parser for Python via .NET supports the following document families:
- Word processing – DOC, DOCX, RTF, TXT, ODT and others
- PDF & markup – PDF, HTML/MHTML, Markdown, XML
- Spreadsheets – XLS, XLSX, ODS, CSV and related formats
- Presentations – PPT, PPTX, ODP and similar formats
- Email & notes – PST, OST, EML, MSG, ONE
- eBooks & web content – EPUB, MOBI, AZW3, CHM, FB2
- Images – JPEG, PNG, TIFF, GIF, BMP, SVG and more
- Archives & containers – ZIP, RAR, 7Z, TAR, GZ, BZ2
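When processing a folder of mixed files, it can help to check a file's extension against the families listed above before handing it to the parser. The following is a rough sketch using only the standard library. The `FORMAT_FAMILIES` table and the `format_family` helper are hypothetical, and the extension spellings (for example `md` for Markdown, or `jpg` alongside `jpeg`) are assumptions. Because the lists above end in "and others", a miss here means only "not in the listed subset".

```python
from pathlib import Path

# Extensions taken from the families listed above. Spellings such as "md",
# "jpg" and "tif" are assumptions, and the table is not exhaustive.
FORMAT_FAMILIES = {
    "Word processing": {"doc", "docx", "rtf", "txt", "odt"},
    "PDF & markup": {"pdf", "html", "mhtml", "md", "xml"},
    "Spreadsheets": {"xls", "xlsx", "ods", "csv"},
    "Presentations": {"ppt", "pptx", "odp"},
    "Email & notes": {"pst", "ost", "eml", "msg", "one"},
    "eBooks & web content": {"epub", "mobi", "azw3", "chm", "fb2"},
    "Images": {"jpeg", "jpg", "png", "tiff", "tif", "gif", "bmp", "svg"},
    "Archives & containers": {"zip", "rar", "7z", "tar", "gz", "bz2"},
}


def format_family(path):
    """Return the listed family for a file name, or None if it is not listed."""
    # Only the final suffix is considered, so "backup.tar.gz" is looked up as "gz".
    extension = Path(path).suffix.lower().lstrip(".")
    for family, extensions in FORMAT_FAMILIES.items():
        if extension in extensions:
            return family
    return None


for name in ["report.PDF", "inbox.pst", "backup.tar.gz", "notes.xyz"]:
    print(name, "->", format_family(name))
```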
Quick Start
Download the package for your platform from the GroupDocs Releases website and install it using pip:
pip install groupdocs_parser_net-25.12-*.whl
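The `*` in the command above relies on the shell to expand the wildcard, which may not happen in every environment (the Windows command prompt, for instance, passes it through unchanged). As a possible workaround, the sketch below resolves the downloaded wheel with `pathlib` and builds the equivalent pip command for the current interpreter. The helpers `find_wheel` and `pip_install_command` are hypothetical, and the sketch assumes exactly one matching wheel sits in the given directory.

```python
import sys
from pathlib import Path

# Same file name pattern as in the pip command above.
WHEEL_PATTERN = "groupdocs_parser_net-25.12-*.whl"


def find_wheel(directory="."):
    """Return the single wheel in directory that matches WHEEL_PATTERN."""
    matches = sorted(Path(directory).glob(WHEEL_PATTERN))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file matching {WHEEL_PATTERN!r} "
            f"in {str(directory)!r}, found {len(matches)}"
        )
    return matches[0]


def pip_install_command(wheel):
    """Build a pip command that installs the wheel into the running interpreter."""
    return [sys.executable, "-m", "pip", "install", str(wheel)]
```

To run the installation, pass the resulting list to `subprocess.run(..., check=True)`, for example `subprocess.run(pip_install_command(find_wheel("downloads")), check=True)`, where `downloads` is a placeholder for the folder holding the wheel.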
Extract text from a document:
from groupdocs.parser import Parser

# Create a Parser instance for your document
with Parser("sample.pdf") as parser:
    # Extract text from the document
    text = parser.GetText()
    # Print all extracted text to the console
    print(text)