GroupDocs.Search for .NET 25.2 Release Notes

Full List of Issues Covering all Changes in this Release

Key | Summary | Category
SEARCHNET-3445 | Implement the ability of custom splitting of text into words | Feature

Public API and Backward Incompatible Changes

Implement the ability of custom splitting of text into words

This feature lets you plug in your own text segmentation, for example for Chinese, Japanese, or Korean text, by delegating word splitting to an external library. For details, see the documentation article about Custom text segmenter.

Public API changes

Interface IWordSplitter has been added to GroupDocs.Search.Common namespace.
Method System.Collections.Generic.IEnumerable<System.String> Split(System.String) has been added to GroupDocs.Search.Common.IWordSplitter interface.

Property GroupDocs.Search.Common.IWordSplitter WordSplitter has been added to GroupDocs.Search.Events.FileIndexingEventArgs class.
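To illustrate the interface contract listed above without any external dependency, here is a minimal sketch of a splitter that emits every CJK ideograph as its own word and groups other letters and digits into words. The class name CjkCharacterWordSplitter and the helper IsCjkIdeograph are hypothetical and not part of GroupDocs.Search. The sketch also assumes that a splitter may simply drop whitespace and punctuation; how the indexer treats such tokens is not stated in these notes.

```csharp
using System.Collections.Generic;
using System.Text;
using GroupDocs.Search.Common;

// Hypothetical example class; only IWordSplitter and its Split method come from the library.
public class CjkCharacterWordSplitter : IWordSplitter
{
    public IEnumerable<string> Split(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            yield break;
        }

        StringBuilder current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsCjkIdeograph(c))
            {
                // Flush any pending non-CJK word, then emit the ideograph on its own
                if (current.Length > 0)
                {
                    yield return current.ToString();
                    current.Clear();
                }
                yield return c.ToString();
            }
            else if (char.IsLetterOrDigit(c))
            {
                // Accumulate letters and digits of other scripts into one word
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // Whitespace or punctuation ends the current word (assumption: it is dropped)
                yield return current.ToString();
                current.Clear();
            }
        }

        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }

    // Simplification: covers only the CJK Unified Ideographs block (U+4E00..U+9FFF)
    private static bool IsCjkIdeograph(char c)
    {
        return c >= '\u4E00' && c <= '\u9FFF';
    }
}
```

Per-character splitting is a crude fallback compared with a dictionary-based segmenter, but it shows the shape of the contract: one string in, a sequence of words out. An instance could be assigned to e.WordSplitter in a FileIndexing handler in the same way as the Jieba-based splitter in the use case that follows.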

Use cases
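// Namespaces used by the snippets in this section.
// Assumption: JiebaSegmenter comes from the jieba.NET library (namespace JiebaNet.Segmenter).
using System.Collections.Generic;
using GroupDocs.Search;
using GroupDocs.Search.Common;
using GroupDocs.Search.Results;
using JiebaNet.Segmenter;

// Implementing a custom word splitter
public class JiebaWordSplitter : IWordSplitter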
{
    private readonly JiebaSegmenter segmenter;

    public JiebaWordSplitter()
    {
        segmenter = new JiebaSegmenter();
    }

    public IEnumerable<string> Split(string text)
    {
        IEnumerable<string> segments = segmenter.Cut(text, cutAll: false);
        return segments;
    }
}

...

string indexFolder = @"c:\MyIndex\";
string documentsFolder = @"c:\MyDocuments\";

// Creating an index in the specified folder
Index index = new Index(indexFolder);

// Using Jieba segmenter to break text into words
JiebaWordSplitter jiebaWordSplitter = new JiebaWordSplitter();
index.Events.FileIndexing += (s, e) =>
{
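    // Ordinal, case-insensitive comparison avoids culture-dependent matching of the file name
    if (e.DocumentFullPath.EndsWith("Chinese.txt", System.StringComparison.OrdinalIgnoreCase))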
    {
        // We know that the text in this document is in Chinese
        e.WordSplitter = jiebaWordSplitter;
    }
};

// Indexing documents
index.Add(documentsFolder);

// Searching in the index
string query = "考虑"; // Consider
SearchResult result = index.Search(query);