public final class WordsTextExtractor extends TextExtractor implements ISearchable, IHighlightExtractor, IRegexSearchable, IStructuredExtractor, IDocumentContentExtractor
Provides the text extractor for text documents.
Supported formats:
Extension | Description |
---|---|
.DOC | Microsoft Word Text document |
.DOT | Microsoft Word Text template |
.DOCX | Microsoft Office Open XML Text document |
.DOCM | Microsoft Word 2007 macro-enabled document |
.RTF | Rich Text Format text file |
.ODT | OpenDocument text |
.TXT | Plain text |
.HTML (.XHTML, .HTM) | Hypertext Markup Language document |
.MHTML (.MHT) | Web Archive Single File |
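As a rough illustration of the format list, the sketch below checks a file name against the listed extensions before a document is passed to the extractor. `SupportedFormats` and `isSupported` are hypothetical helpers written for this example and are not part of the library. A check by extension is only a heuristic; the extractor itself reports unsupported input through UnsupportedDocumentFormatException.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the library: a pre-check by file extension.
final class SupportedFormats {
    // Extensions taken from the format list, lower-cased and without the dot.
    private static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList(
            "doc", "dot", "docx", "docm", "rtf", "odt", "txt",
            "html", "xhtml", "htm", "mhtml", "mht"));

    private SupportedFormats() {
    }

    // Returns true when the file name ends with one of the listed extensions.
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("report.DOCX")); // true
        System.out.println(isSupported("slides.pptx")); // false
    }
}
```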
Extracting text from a text document:
// Create a text extractor for text documents ("stream" is an InputStream opened on the document)
WordsTextExtractor extractor = new WordsTextExtractor(stream);
// Extract all text and print it
System.out.println(extractor.extractAll());
Constructor and Description |
---|
WordsTextExtractor(InputStream stream) Initializes a new instance of the WordsTextExtractor class. |
WordsTextExtractor(InputStream stream, LoadOptions loadOptions) Initializes a new instance of the WordsTextExtractor class. |
WordsTextExtractor(String fileName) Initializes a new instance of the WordsTextExtractor class. |
WordsTextExtractor(String fileName, LoadOptions loadOptions) Initializes a new instance of the WordsTextExtractor class. |
Modifier and Type | Method and Description |
---|---|
protected void | dispose(boolean disposing) Releases the unmanaged resources used by the extractor. |
List<String> | extractHighlights(HighlightOptions... highlightOptions) Extracts highlights. |
void | extractStructured(StructuredHandler handler) Extracts structured text. |
protected String | extractText() Extracts all characters from the current position to the end of the text extractor and returns them as one string. |
DocumentContent | getDocumentContent() Gets access to the document's content. |
protected String | prepareLine() Returns a line of the text. |
void | reset() Resets the current document. |
void | search(SearchOptions options, ISearchHandler handler, ISearchEngine searchEngine, List<String> keywords) Searches for the keywords. |
void | search(SearchOptions options, ISearchHandler handler, List<String> keywords) Searches for the keywords. |
void | searchWithRegex(String expression, ISearchHandler handler, RegexSearchOptions searchOptions) Searches for matches of the regular expression. |
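To give a feel for what a regular-expression search over extracted text involves, the sketch below runs `java.util.regex` over a plain string and reports each match to a callback. It does not call searchWithRegex: `MatchHandler`, `RegexSearchSketch` and `find` are hypothetical stand-ins written for this example, and the library's ISearchHandler and RegexSearchOptions types are not modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a search handler; not the library's ISearchHandler.
interface MatchHandler {
    void onMatch(int index, String text);
}

final class RegexSearchSketch {
    private RegexSearchSketch() {
    }

    // Reports every match of the expression in the text to the handler.
    static void find(String text, String expression, MatchHandler handler) {
        Matcher matcher = Pattern.compile(expression).matcher(text);
        while (matcher.find()) {
            handler.onMatch(matcher.start(), matcher.group());
        }
    }

    public static void main(String[] args) {
        // In practice the text could be the result of extractor.extractAll().
        String text = "Invoice 2018-04 was paid; invoice 2018-11 is open.";
        List<String> hits = new ArrayList<>();
        find(text, "\\d{4}-\\d{2}", (index, match) -> hits.add(index + ":" + match));
        System.out.println(hits);
    }
}
```

The callback style mirrors the handler parameter in the method signatures: the caller decides what to do with each hit instead of receiving a finished list.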
Inherited methods: checkDisposed, close, dispose, extractAll, extractLine, extractTextLine, getEncoding, getMediaType, getPassword, isDisposed, setEncoding, setMediaType
public WordsTextExtractor(String fileName)
Initializes a new instance of the WordsTextExtractor class.
Parameters:
fileName - The path to the file.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public WordsTextExtractor(String fileName, LoadOptions loadOptions)
Initializes a new instance of the WordsTextExtractor class.
Parameters:
fileName - The path to the file.
loadOptions - The options for loading the file.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public WordsTextExtractor(InputStream stream)
Initializes a new instance of the WordsTextExtractor class.
Parameters:
stream - The stream of the document.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public WordsTextExtractor(InputStream stream, LoadOptions loadOptions)
Initializes a new instance of the WordsTextExtractor class.
Parameters:
stream - The stream of the document.
loadOptions - The options for loading the file.
Throws:
InvalidPasswordException - The password is incorrect.
UnsupportedDocumentFormatException - The file format isn't supported.
GroupDocsParserException - The file is corrupted.
public DocumentContent getDocumentContent()
Gets access to the document's content.
Specified by:
getDocumentContent in interface IDocumentContentExtractor
Returns:
An instance of the DocumentContent class.
public void extractStructured(StructuredHandler handler)
Extracts structured text.
Specified by:
extractStructured in interface IStructuredExtractor
Parameters:
handler - The structured text extraction handler.
public void search(SearchOptions options, ISearchHandler handler, List<String> keywords)
Searches for the keywords.
Specified by:
search in interface ISearchable
Parameters:
options - Options for searching.
handler - An instance of the search handler.
keywords - A collection of words to search for.
public void search(SearchOptions options, ISearchHandler handler, ISearchEngine searchEngine, List<String> keywords)
Searches for the keywords.
Specified by:
search in interface ISearchable
Parameters:
options - Options for searching.
handler - An instance of the search handler.
searchEngine - An instance of the search engine.
keywords - A collection of words to search for.
public void searchWithRegex(String expression, ISearchHandler handler, RegexSearchOptions searchOptions)
Searches for matches of the regular expression.
Specified by:
searchWithRegex in interface IRegexSearchable
Parameters:
expression - A regular expression.
handler - An instance of the search handler.
searchOptions - Options for searching.
public List<String> extractHighlights(HighlightOptions... highlightOptions)
Extracts highlights.
Specified by:
extractHighlights in interface IHighlightExtractor
Parameters:
highlightOptions - A collection of HighlightOptions.
public void reset()
Resets the current document. After a reset, the extractLine method returns the first line of the document.
reset in class TextExtractor
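The effect of reset() on line-by-line reading can be pictured with a small stand-in. `LineCursor` below is a hypothetical class written for this illustration. It only mirrors the documented behaviour (after a reset, the next line read is the first line of the document) and says nothing about how the library implements it. Returning null once the lines run out is an assumption of the sketch.

```java
// Hypothetical stand-in that mimics the documented reset() behaviour.
final class LineCursor {
    private final String[] lines;
    private int position; // index of the next line to return

    LineCursor(String text) {
        this.lines = text.split("\n");
    }

    // Returns the next line, or null when no lines remain (sketch assumption).
    String extractLine() {
        return position < lines.length ? lines[position++] : null;
    }

    // Moves the cursor back to the start of the text.
    void reset() {
        position = 0;
    }

    public static void main(String[] args) {
        LineCursor cursor = new LineCursor("first\nsecond\nthird");
        System.out.println(cursor.extractLine()); // first
        System.out.println(cursor.extractLine()); // second
        cursor.reset();
        System.out.println(cursor.extractLine()); // first
    }
}
```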
protected void dispose(boolean disposing)
Releases the unmanaged resources used by the extractor.
dispose in class TextExtractor
Parameters:
disposing - true if invoked from dispose; otherwise, false.
protected String extractText()
Extracts all characters from the current position to the end of the text extractor and returns them as one string.
extractText in class TextExtractor
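"From the current position to the end" implies that text already consumed is not returned again. The sketch below illustrates that reading with a hypothetical `TextCursor` class over a plain string. It is not the library's implementation, and `skip` is an invented helper used only to move the position forward.

```java
// Hypothetical stand-in illustrating "from the current position to the end".
final class TextCursor {
    private final String text;
    private int position; // offset of the next unread character

    TextCursor(String text) {
        this.text = text;
    }

    // Advances the position by up to count characters (invented helper).
    void skip(int count) {
        long target = (long) position + Math.max(0, count);
        position = (int) Math.min((long) text.length(), target);
    }

    // Returns everything not yet read and moves the position to the end.
    String extractText() {
        String rest = text.substring(position);
        position = text.length();
        return rest;
    }

    public static void main(String[] args) {
        TextCursor cursor = new TextCursor("Hello, world");
        cursor.skip(7);
        System.out.println(cursor.extractText()); // world
        System.out.println(cursor.extractText().isEmpty()); // true
    }
}
```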
protected String prepareLine()
Returns a line of the text.
prepareLine in class TextExtractor
Copyright © 2018. All rights reserved.