GroupDocs.Parser for .NET 17.02 Release Notes

Major Features

This release introduces the following features:

  • Support for extracting text from EPUB documents
  • Ability to search with a regular expression
  • Ability to search for whole words only
  • Ability to extract a highlight up to the line’s start or end, or limited to a given number of words

All Changes

Key | Summary | Issue Type
TEXTNET-525 | Implement the ability to extract a text from EPUB files | New feature
TEXTNET-340 | Implement the ability to search a text with a regular expression | New feature
TEXTNET-492 | Implement the ability to search the whole word | New feature
TEXTNET-494 | Implement the ability to extract a highlight to line’s start (end) | New feature
TEXTNET-495 | Implement the ability to extract a highlight with the limited words count | New feature
TEXTNET-528 | Implement the ability to use all highlight extraction modes with search functionality | Enhancement

Public API and Backward Incompatible Changes

Implement the ability to use all highlight extraction modes with search functionality

This enhancement makes all highlight extraction modes available to the search functionality.

Public API changes
Added (int leftLength, int rightLength, HighlightMode highlightMode) constructor to SearchHighlightOptions class.
Added HighlightMode property to SearchHighlightOptions class.
Added WordSeparators property (and constructor to initialize it) to SearchOptions class.
Added WordSeparators property (and constructor to initialize it) to RegexSearchOptions class.
Added CreateLeftHighlightOptions and CreateRightHighlightOptions methods to SearchOptions class.
Added CreateLeftHighlightOptions and CreateRightHighlightOptions methods to RegexSearchOptions class.
Added void Search(SearchOptions options, ISearchHandler handler, IList keywords) to ISearchable interface.

The following example searches a document and limits each highlight to the start or end of the line:

C#

using (WordsTextExtractor extractor = new WordsTextExtractor(@"document.docx"))
{
  ListSearchHandler handler = new ListSearchHandler();
  SearchHighlightOptions highlightOptions = SearchHighlightOptions.CreateLineOptions(100, 100);
  extractor.Search(new SearchOptions(highlightOptions), handler, null, new string[] { "test text", "keyword" });

  if (handler.List.Count == 0)
  {
    Console.WriteLine("Not found");
  }
  else
  {
    for (int i = 0; i < handler.List.Count; i++)
    {
      Console.Write(handler.List[i].LeftText);
      Console.Write("_");
      Console.Write(handler.List[i].FoundText);
      Console.Write("_");
      Console.Write(handler.List[i].RightText);
      Console.WriteLine("---");
    }
  }
}

Implement the ability to search a text with a regular expression

This feature makes it possible to search text in documents with regular expressions.

Public API changes
Added IRegexSearchable interface.
Added RegexSearchOptions class.

The following example searches a document with a regular expression that matches four-digit numbers starting with 19:

C#

using (WordsTextExtractor extractor = new WordsTextExtractor(@"document.docx"))
{
  ListSearchHandler handler = new ListSearchHandler();
  extractor.SearchWithRegex("19[0-9]{2}", handler, new RegexSearchOptions(SearchHighlightOptions.CreateFixedLengthOptions(10)));

  if (handler.List.Count == 0)
  {
    Console.WriteLine("Not found");
  }
  else
  {
    for (int i = 0; i < handler.List.Count; i++)
    {
      Console.Write(handler.List[i].LeftText);
      Console.Write("_");
      Console.Write(handler.List[i].FoundText);
      Console.Write("_");
      Console.Write(handler.List[i].RightText);
      Console.WriteLine("---");
    }
  }
}
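
To make the shape of the results concrete, here is a minimal standalone sketch of a regular expression search with fixed-length context on both sides. It uses only System.Text.RegularExpressions and is not the GroupDocs implementation; the RegexContextSketch class and its Find method are hypothetical names introduced for illustration, and the tuple fields merely mirror the LeftText, FoundText and RightText values printed above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
public static class RegexContextSketch
{
    // Returns (left, found, right) for every match of pattern in text,
    // with at most contextLength characters of context on each side.
    public static List<(string Left, string Found, string Right)> Find(
        string text, string pattern, int contextLength)
    {
        var results = new List<(string Left, string Found, string Right)>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            // Clamp the context window to the bounds of the text.
            int leftStart = Math.Max(0, m.Index - contextLength);
            int rightStart = m.Index + m.Length;
            int rightLength = Math.Min(contextLength, text.Length - rightStart);

            results.Add((
                text.Substring(leftStart, m.Index - leftStart),
                m.Value,
                text.Substring(rightStart, rightLength)));
        }
        return results;
    }

    public static void Main()
    {
        string text = "The bridge opened in 1932 and was rebuilt in 2004.";
        foreach (var hit in Find(text, "19[0-9]{2}", 10))
        {
            Console.WriteLine($"{hit.Left}_{hit.Found}_{hit.Right}");
        }
    }
}
```

For the sample sentence this prints `opened in _1932_ and was r`: the pattern matches 1932 but not 2004, and each side of the match is cut to ten characters.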

Implement the ability to search the whole word

This feature makes it possible to search documents for whole words only, so that a keyword is not matched inside a longer word.

Public API changes
Added two constructors with isWholeWord and wordsSeparators arguments to SearchOptions class.
Added IsWholeWord and WordSeparators properties to SearchOptions class.

C#

using (WordsTextExtractor extractor = new WordsTextExtractor(@"document.docx"))
{
  SearchOptions searchOptions = new SearchOptions(SearchHighlightOptions.CreateFixedLengthOptions(15), true, true);
  ListSearchHandler handler = new ListSearchHandler();
  extractor.Search(searchOptions, handler, null, new string[] { "test", "keyword" });

  if (handler.List.Count == 0)
  {
    Console.WriteLine("Not found");
  }
  else
  {
    for (int i = 0; i < handler.List.Count; i++)
    {
      Console.Write(handler.List[i].LeftText);
      Console.Write("_");
      Console.Write(handler.List[i].FoundText);
      Console.Write("_");
      Console.Write(handler.List[i].RightText);
      Console.WriteLine("---");
    }
  }
}
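
The difference between a plain substring search and a whole-word search can be sketched with the standard .NET regex engine. This is an illustration under stated assumptions, not the library's algorithm: it relies on the `\b` word-boundary anchor, whereas GroupDocs.Parser uses the configurable WordSeparators listed above. The WholeWordSketch class and its methods are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
public static class WholeWordSketch
{
    // Counts every occurrence of keyword, including those inside longer words.
    public static int CountSubstring(string text, string keyword)
        => Regex.Matches(text, Regex.Escape(keyword)).Count;

    // Counts only occurrences delimited by word boundaries on both sides.
    public static int CountWholeWord(string text, string keyword)
        => Regex.Matches(text, @"\b" + Regex.Escape(keyword) + @"\b").Count;

    public static void Main()
    {
        string text = "A test of the testing tools: contest, test.";
        Console.WriteLine(CountSubstring(text, "test")); // 4: test, testing, contest, test
        Console.WriteLine(CountWholeWord(text, "test")); // 2: the first and last "test"
    }
}
```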

Implement the ability to extract a highlight to line’s start (end)

This feature limits a highlight to the start or end of the line. A highlight is a fragment of text, typically used to show the context around found text.

Public API changes
Added CreateLineOptions methods to HighlightOptions class.
Added CreateLineOptions methods to SearchHighlightOptions class.
Added Line value to HighlightMode enum.

C#

using (WordsTextExtractor extractor = new WordsTextExtractor(@"document.docx")) {
  IList<string> highlights = extractor.ExtractHighlights(
    HighlightOptions.CreateLineOptions(HighlightDirection.Left, 15),
    HighlightOptions.CreateLineOptions(HighlightDirection.Right, 20));

  for (int i = 0; i < highlights.Count; i++) {
    Console.WriteLine(highlights[i]);
  }
}
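
Conceptually, a line-limited highlight is the text between a position and the nearest line break in the chosen direction. The following sketch shows that idea on a plain string. It is not the GroupDocs implementation: the LineHighlightSketch class is hypothetical, and it assumes lines are separated by '\n' only.

```csharp
using System;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
public static class LineHighlightSketch
{
    // Text from the start of the current line up to (not including) position.
    public static string LeftToLineStart(string text, int position)
    {
        // LastIndexOf returns -1 when there is no earlier line break,
        // which makes the line start at index 0.
        int start = position == 0 ? 0 : text.LastIndexOf('\n', position - 1) + 1;
        return text.Substring(start, position - start);
    }

    // Text from position up to (not including) the end of the current line.
    public static string RightToLineEnd(string text, int position)
    {
        int end = text.IndexOf('\n', position);
        if (end < 0) end = text.Length; // last line has no trailing break
        return text.Substring(position, end - position);
    }

    public static void Main()
    {
        string text = "first line\nsecond line here\nthird";
        int position = text.IndexOf("line", 11); // "line" on the second line
        Console.WriteLine(LeftToLineStart(text, position)); // "second "
        Console.WriteLine(RightToLineEnd(text, position));  // "line here"
    }
}
```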

Implement the ability to extract a highlight with the limited words count

This feature limits a highlight to a given number of words. A highlight is a fragment of text, typically used to show the context around found text.

Public API changes
Added CreateWordsCount methods to HighlightOptions class.
Added Mode property to HighlightOptions class.
Added HighlightMode enum.
Added WordSeparators class.
Added WordSeparators property to HighlightOptions class.
Added CreateFixedLengthOptions methods to HighlightOptions class.
Added CreateWordsCountOptions methods to HighlightOptions class.
Added CreateWordsCountOptions methods to SearchHighlightOptions class.
Constructors of SearchHighlightOptions class are marked as Obsolete (use CreateXXX static methods instead).
CreateFixedLength method in HighlightOptions is marked as Obsolete (use CreateFixedLengthOptions method instead).

The following example extracts highlights limited to five words from the given position:

C#

using (WordsTextExtractor extractor = new WordsTextExtractor(@"document.docx")) {
  IList<string> highlights = extractor.ExtractHighlights(
    HighlightOptions.CreateWordsCountOptions(HighlightDirection.Left, 15, 5),
    HighlightOptions.CreateWordsCountOptions(HighlightDirection.Right, 20, 5));

  for (int i = 0; i < highlights.Count; i++) {
    Console.WriteLine(highlights[i]);
  }
}
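
The idea of a word-count limit can be sketched on a plain string: starting at a position, walk left or right until the requested number of words has been collected. This is an illustration only, not the library's algorithm. The WordsCountSketch class is hypothetical, and it treats any whitespace character as a word separator, whereas GroupDocs.Parser exposes a configurable WordSeparators class.

```csharp
using System;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
public static class WordsCountSketch
{
    // Up to wordCount words to the right of position (position is included).
    public static string RightWords(string text, int position, int wordCount)
    {
        int i = position;
        int words = 0;
        while (i < text.Length && words < wordCount)
        {
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;  // skip separators
            if (i >= text.Length) break;
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++; // consume one word
            words++;
        }
        return text.Substring(position, i - position);
    }

    // Up to wordCount words to the left of position (position is excluded).
    public static string LeftWords(string text, int position, int wordCount)
    {
        int i = position;
        int words = 0;
        while (i > 0 && words < wordCount)
        {
            while (i > 0 && char.IsWhiteSpace(text[i - 1])) i--;  // skip separators
            if (i == 0) break;
            while (i > 0 && !char.IsWhiteSpace(text[i - 1])) i--; // consume one word
            words++;
        }
        return text.Substring(i, position - i);
    }

    public static void Main()
    {
        string text = "one two three four five six";
        Console.WriteLine(RightWords(text, 4, 2));        // "two three"
        Console.WriteLine("[" + LeftWords(text, 14, 2) + "]"); // "[two three ]"
    }
}
```

The left highlight keeps the separator that sits directly before the position, so joining left text, found text and right text reproduces the original fragment.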

Implement the ability to extract a text from EPUB files

This feature makes it possible to extract text from EPUB documents.

Public API changes
Added EpubTextExtractor class.
Added EpubPackage class.

The following example extracts text from a document line by line:

C#

// stream: an open, readable Stream containing the EPUB document
using (var extractor = new EpubTextExtractor(stream)) {
  string line = extractor.ExtractLine();
  while (line != null) {
    Console.WriteLine(line);
    line = extractor.ExtractLine();
  }
}

The following example extracts all text from a document:

C#

using (var extractor = new EpubTextExtractor(stream)) {
  Console.WriteLine(extractor.ExtractAll());
}

The GetTextReader method is another way to extract text from content documents. It returns a TextReader:

C#

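// package: assumed to be an already opened EpubPackage (its creation is not shown in this example)
using (TextReader reader = package.GetTextReader(0)) {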
  string line = reader.ReadLine();
  while (line != null) {
    Console.WriteLine(line);
    line = reader.ReadLine();
  }
}