public final class ChmTextExtractor extends TextExtractor implements IHighlightExtractor, IPageTextExtractor, ISearchable, IRegexSearchable, IStructuredExtractor
Provides the text extractor for CHM documents.
Extracts the text of a document line by line:
// Create a text extractor for CHM documents
ChmTextExtractor extractor = new ChmTextExtractor(stream);
// Extract a line of the text
String line = extractor.extractLine();
// If the line is null, then the end of the file is reached
while (line != null) {
    // Print a line to the console
    System.out.println(line);
    // Extract another line
    line = extractor.extractLine();
}
Extracts all characters from a document:
// Create a text extractor for CHM documents
ChmTextExtractor extractor = new ChmTextExtractor(stream);
// Extract all text and print it to the console
System.out.println(extractor.extractAll());
Constructor and Description |
---|
ChmTextExtractor(InputStream stream): Initializes a new instance of the ChmTextExtractor class from a stream. |
ChmTextExtractor(String fileName): Initializes a new instance of the ChmTextExtractor class from a file path. |
Modifier and Type | Method and Description |
---|---|
protected void | dispose(boolean disposing): Releases the unmanaged resources used by the extractor. |
List<String> | extractHighlights(HighlightOptions... highlightOptions): Extracts highlights. |
String | extractPage(int pageIndex): Extracts all characters from the page with the index pageIndex and returns them as a string. |
void | extractStructured(StructuredHandler handler): Extracts structured text. |
protected String | extractText(): Extracts all characters from the current position to the end of the text extractor and returns them as one string. |
int | getPageCount(): Gets the total number of pages. |
List<TableOfContentsItem> | getTableOfContents(): Gets a collection of table of contents items. |
protected String | prepareLine(): Returns a line of the text. |
void | reset(): Resets the current document. |
void | search(SearchOptions options, ISearchHandler handler, ISearchEngine searchEngine, List<String> keywords): Searches for the keywords. |
void | search(SearchOptions options, ISearchHandler handler, List<String> keywords): Searches for the keywords. |
void | searchWithRegex(String expression, ISearchHandler handler, RegexSearchOptions searchOptions): Searches for matches of the regular expression. |
Methods inherited from class TextExtractor: checkDisposed, close, dispose, extractAll, extractLine, extractTextLine, getEncoding, getMediaType, getPassword, isDisposed, setEncoding, setMediaType
public ChmTextExtractor(String fileName)
Initializes a new instance of the ChmTextExtractor class.
fileName - The path to the file.
public ChmTextExtractor(InputStream stream)
Initializes a new instance of the ChmTextExtractor class.
stream - The stream of the document.
public int getPageCount()
Gets the total number of pages.
getPageCount in interface IPageTextExtractor
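The page count is typically used together with extractPage(int pageIndex) to walk a document page by page. The following is a minimal sketch of that loop, not the library's own sample. PageSource is a hypothetical stand-in that mirrors only the two page methods documented on this page, so the code compiles without the library, and the zero-based page index is an assumption that this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class PageIterationSketch {
    /** Hypothetical stand-in mirroring getPageCount() and extractPage(int). */
    interface PageSource {
        int getPageCount();
        String extractPage(int pageIndex);
    }

    /** Collects the text of every page; assumes page indexes start at 0. */
    static List<String> extractAllPages(PageSource source) {
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < source.getPageCount(); i++) {
            pages.add(source.extractPage(i));
        }
        return pages;
    }

    public static void main(String[] args) {
        // In-memory fake so the sketch runs without a CHM file.
        final String[] fake = {"Introduction", "Usage"};
        PageSource source = new PageSource() {
            @Override public int getPageCount() { return fake.length; }
            @Override public String extractPage(int pageIndex) { return fake[pageIndex]; }
        };
        for (String page : extractAllPages(source)) {
            System.out.println(page);
        }
    }
}
```

With a real extractor, the same loop would call getPageCount() and extractPage(i) on the ChmTextExtractor instance instead of the fake.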
public List<TableOfContentsItem> getTableOfContents()
Gets a collection of table of contents items.
public void search(SearchOptions options, ISearchHandler handler, List<String> keywords)
Searches for the keywords.
search in interface ISearchable
options - Options for searching.
handler - An instance of the search handler.
keywords - A collection of words to search for.
public void search(SearchOptions options, ISearchHandler handler, ISearchEngine searchEngine, List<String> keywords)
Searches for the keywords.
search in interface ISearchable
options - Options for searching.
handler - An instance of the search handler.
searchEngine - An instance of the search engine.
keywords - A collection of words to search for.
public void searchWithRegex(String expression, ISearchHandler handler, RegexSearchOptions searchOptions)
Searches for matches of the regular expression.
searchWithRegex in interface IRegexSearchable
expression - A regular expression.
handler - An instance of the search handler.
searchOptions - Options for searching.
public List<String> extractHighlights(HighlightOptions... highlightOptions)
Extracts highlights.
extractHighlights in interface IHighlightExtractor
highlightOptions - A collection of HighlightOptions.
public void extractStructured(StructuredHandler handler)
Extracts structured text.
extractStructured in interface IStructuredExtractor
handler - Structured text extraction handler.
public void reset()
Resets the current document. After a reset, the extractLine method returns the first line of the document.
reset in class TextExtractor
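The reset behavior matters when a document has to be read in more than one pass, for example counting the lines first and then reading them again. The sketch below illustrates that pattern under stated assumptions. LineReader is a hypothetical in-memory class that only mimics the documented contract of extractLine() (null at the end of the document) and reset(). It is not the library's implementation.

```java
import java.util.Arrays;
import java.util.List;

public class ResetSketch {
    /** Hypothetical in-memory reader that mimics extractLine() and reset(). */
    static final class LineReader {
        private final List<String> lines;
        private int position = 0;

        LineReader(List<String> lines) {
            this.lines = lines;
        }

        /** Returns the next line, or null when the end is reached. */
        String extractLine() {
            return position < lines.size() ? lines.get(position++) : null;
        }

        /** Moves back to the start so the next extractLine() returns the first line. */
        void reset() {
            position = 0;
        }
    }

    /** First pass: consumes the reader and counts its remaining lines. */
    static int countLines(LineReader reader) {
        int count = 0;
        while (reader.extractLine() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        LineReader reader = new LineReader(Arrays.asList("first", "second", "third"));
        int total = countLines(reader);
        // Without reset(), the next extractLine() call would return null here.
        reader.reset();
        System.out.println(total + " lines, starting with: " + reader.extractLine());
    }
}
```

With a real extractor, the equivalent second pass would call reset() on the ChmTextExtractor instance before calling extractLine() again.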
public String extractPage(int pageIndex)
Extracts all characters from the page with the index pageIndex and returns them as a string.
extractPage in interface IPageTextExtractor
pageIndex - The index of the page.
protected String prepareLine()
Returns a line of the text.
prepareLine in class TextExtractor
protected String extractText()
Extracts all characters from the current position to the end of the text extractor and returns them as one string.
extractText in class TextExtractor
protected void dispose(boolean disposing)
Releases the unmanaged resources used by the extractor.
dispose in class TextExtractor
disposing - true if invoked from dispose; otherwise, false.
Copyright © 2019. All rights reserved.