public class StructuredHandler extends Object
Represents a handler for extracting structured text from a document.
Extracting headers from a document:
class Headers {
    private class Handler extends StructuredHandler {
        // Handle the List event to prevent processing of lists
        protected void onStartList(ListProperties properties) {
            properties.setSkipElement(true); // ignore lists
        }

        // Handle the Table event to prevent processing of tables
        protected void onStartTable(TableProperties properties) {
            properties.setSkipElement(true); // ignore tables
        }

        // Handle the ElementText event to collect the text
        protected void onText(TextProperties properties, String value) {
            sb.append(value);
        }

        // Handle the Paragraph event to process a paragraph
        protected void onStartParagraph(ParagraphProperties properties) {
            int h1 = (int) ParagraphStyle.Heading1;
            int h6 = (int) ParagraphStyle.Heading6;
            int style = properties.getStyle();
            if (h1 <= style && style <= h6) {
                // start each header on a new line
                if (sb.length() > 0) {
                    sb.append("\r\n");
                }
                // indent the header by one space per level (h1 gets no indentation)
                sb.append(new String(new char[style - h1]).replace('\0', ' '));
            } else {
                // skip the paragraph if it is neither a header nor a title
                properties.setSkipElement(properties.getStyle() != ParagraphStyle.Title);
            }
        }
    }

    private StringBuilder sb = new StringBuilder();

    public void extract(java.io.InputStream stream) {
        IStructuredExtractor extractor = new WordsTextExtractor(stream);
        Handler handler = new Handler();
        // Extract the text with its structure
        extractor.extractStructured(handler);
        System.out.println(sb.toString());
    }
}
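The indentation step in `onStartParagraph` can be looked at in isolation. The following is a minimal sketch that assumes heading styles are consecutive integers; the class `HeadingIndent` and the constants `HEADING_1` and `HEADING_6` are hypothetical stand-ins and are not the library's `ParagraphStyle` values.

```java
// Illustrative only: HEADING_1 and HEADING_6 are hypothetical values,
// not the library's ParagraphStyle constants.
final class HeadingIndent {
    static final int HEADING_1 = 1;
    static final int HEADING_6 = 6;

    /** Returns true when the style falls inside the heading range. */
    static boolean isHeading(int style) {
        return HEADING_1 <= style && style <= HEADING_6;
    }

    /** Builds one space of indentation per level below the top heading. */
    static String indent(int style) {
        if (!isHeading(style)) {
            throw new IllegalArgumentException("not a heading style: " + style);
        }
        // A new char[] is zero-filled, so replacing '\0' with ' ' yields n spaces.
        return new String(new char[style - HEADING_1]).replace('\0', ' ');
    }

    public static void main(String[] args) {
        System.out.println("[" + indent(HEADING_1) + "Title]");  // prints "[Title]"
        System.out.println("[" + indent(3) + "Subsection]");     // prints "[  Subsection]"
    }
}
```

A top-level heading gets an empty prefix, and each deeper level adds one space, which is why the extracted headers appear as an indented outline.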
Extracting hyperlinks from a document:
class Hyperlinks {
    private class Handler extends StructuredHandler {
        // Handle the Hyperlink event to process the start of a hyperlink
        protected void onStartHyperlink(HyperlinkProperties properties) {
            sb = new StringBuilder();
            currentLink = properties.getLink();
        }

        // Handle the ElementClose event to process the closing of a hyperlink
        protected void onEndElement() {
            // the element being closed is a hyperlink
            if (get_Item(0).getClass() == HyperlinkProperties.class) {
                if (sb != null) {
                    hyperlinks.add(String.format("%s (%s)", sb.toString(), currentLink));
                }
                sb = null;
                currentLink = null;
            }
        }

        // Handle the ElementText event to collect the text
        protected void onText(TextProperties properties, String value) {
            // collect text only while a hyperlink is open
            if (sb != null) {
                sb.append(value);
            }
        }
    }

    java.util.List<String> hyperlinks = new java.util.ArrayList<String>();
    StringBuilder sb = null;
    String currentLink = null;

    public void extract(java.io.InputStream stream) {
        IStructuredExtractor extractor = new WordsTextExtractor(stream);
        // Use the custom handler; a plain StructuredHandler would collect nothing
        StructuredHandler handler = new Handler();
        // Extract the text with its structure
        extractor.extractStructured(handler);
        for (String hl : hyperlinks) {
            System.out.println(hl);
        }
    }
}
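The `onEndElement` callback takes no arguments, so the hyperlink example calls `get_Item(0)` to find out which element is closing. One way to picture this is a stack of open elements in which index 0 is the innermost one. The sketch below is a self-contained model of that idea and not the library's implementation: `MiniHandler`, `Props`, `ParagraphProps`, `LinkProps`, and `LinkCollector` are hypothetical names, and the assumption that index 0 denotes the current element is inferred from the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for element property classes.
class Props {}
class ParagraphProps extends Props {}
class LinkProps extends Props {
    final String link;
    LinkProps(String link) { this.link = link; }
}

// A toy handler that tracks open elements; NOT the real StructuredHandler.
class MiniHandler {
    private final Deque<Props> open = new ArrayDeque<>();

    public int getDepth() { return open.size(); }

    // Assumption: index 0 is the innermost open element, 1 is its parent, and so on.
    public Props getItem(int index) {
        Iterator<Props> it = open.iterator(); // iterates from the most recently pushed
        for (int i = 0; i < index && it.hasNext(); i++) it.next();
        return it.hasNext() ? it.next() : null;
    }

    public void startElement(Props p) { open.push(p); onStartElement(p); }
    public void text(String value) { onText(value); }
    // The callback runs before the pop, so it can still see the closing element.
    public void endElement() { onEndElement(); open.pop(); }

    protected void onStartElement(Props p) {}
    protected void onText(String value) {}
    protected void onEndElement() {}
}

class LinkCollector extends MiniHandler {
    final List<String> links = new ArrayList<>();
    private StringBuilder sb;

    @Override protected void onStartElement(Props p) {
        if (p instanceof LinkProps) sb = new StringBuilder();
    }
    @Override protected void onText(String value) {
        if (sb != null) sb.append(value); // collect only while a link is open
    }
    @Override protected void onEndElement() {
        Props current = getItem(0);
        if (current instanceof LinkProps) {
            links.add(sb + " (" + ((LinkProps) current).link + ")");
            sb = null;
        }
    }

    public static void main(String[] args) {
        LinkCollector h = new LinkCollector();
        h.startElement(new ParagraphProps());
        h.text("See ");
        h.startElement(new LinkProps("https://example.com"));
        h.text("the site");
        h.endElement(); // closes the link
        h.endElement(); // closes the paragraph
        System.out.println(h.links); // prints "[the site (https://example.com)]"
    }
}
```

In this model the text "See " is ignored because no link is open when it arrives, and the paragraph's closing event is ignored because the innermost element at that point is not a link.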
Constructor and Description |
---|
StructuredHandler() Initializes a new instance of the StructuredHandler class. |
Modifier and Type | Method and Description |
---|---|
void | endElement() Processes the closing of the element. |
StructuredElementProperties | get_Item(int index) Gets an element. |
int | getDepth() Gets the depth of the current element. |
protected void | onEndElement() Starts to process the closing of the element. |
protected void | onLineBreak(LineBreakProperties properties) Starts to process the line break element. |
protected void | onStartDocument(DocumentProperties properties) Starts to process the document. |
protected void | onStartElement(StructuredElementProperties properties) Starts to process the element. |
protected void | onStartGroup(GroupProperties properties) Starts to process the group element. |
protected void | onStartHyperlink(HyperlinkProperties properties) Starts to process the hyperlink element. |
protected void | onStartingElement(StructuredElementProperties properties) Prepares to process the element. |
protected void | onStartList(ListProperties properties) Starts to process the list element. |
protected void | onStartListItem(ListItemProperties properties) Starts to process the list item element. |
protected void | onStartPage(PageProperties properties) Starts to process the page. |
protected void | onStartParagraph(ParagraphProperties properties) Starts to process the paragraph element. |
protected void | onStartSection(SectionProperties properties) Starts to process the section element. |
protected void | onStartSlide(SlideProperties properties) Starts to process the slide. |
protected void | onStartTable(TableProperties properties) Starts to process the table element. |
protected void | onStartTableCell(TableCellProperties properties) Starts to process the table cell element. |
protected void | onStartTableRow(TableRowProperties properties) Starts to process the table row element. |
protected void | onText(TextProperties properties, String value) Starts to process the element's text. |
void | startElement(StructuredElementProperties properties) Processes the element. |
void | text(TextProperties properties, String value) Processes the element's text. |
public StructuredHandler()
Initializes a new instance of the StructuredHandler class.

public int getDepth()
Gets the depth of the current element.

public StructuredElementProperties get_Item(int index)
Gets an element.
index - Depth of the element.
Returns a StructuredElementProperties instance.

public void startElement(StructuredElementProperties properties)
Processes the element.
properties - Properties of the element.

public void endElement()
Processes the closing of the element.

public void text(TextProperties properties, String value)
Processes the element's text.
properties - Properties of the element's text.
value - The text of the element.

protected void onStartingElement(StructuredElementProperties properties)
Prepares to process the element.
properties - Properties of the element.

protected void onStartElement(StructuredElementProperties properties)
Starts to process the element.
properties - Properties of the element.

protected void onStartDocument(DocumentProperties properties)
Starts to process the document.
properties - Properties of the document.

protected void onStartPage(PageProperties properties)
Starts to process the page.
properties - Properties of the page.

protected void onStartSlide(SlideProperties properties)
Starts to process the slide.
properties - Properties of the slide.

protected void onStartParagraph(ParagraphProperties properties)
Starts to process the paragraph element.
properties - Properties of the paragraph element.

protected void onStartHyperlink(HyperlinkProperties properties)
Starts to process the hyperlink element.
properties - Properties of the hyperlink element.

protected void onStartList(ListProperties properties)
Starts to process the list element.
properties - Properties of the list element.

protected void onStartListItem(ListItemProperties properties)
Starts to process the list item element.
properties - Properties of the list item element.

protected void onStartTable(TableProperties properties)
Starts to process the table element.
properties - Properties of the table element.

protected void onStartTableRow(TableRowProperties properties)
Starts to process the table row element.
properties - Properties of the table row element.

protected void onStartTableCell(TableCellProperties properties)
Starts to process the table cell element.
properties - Properties of the table cell element.

protected void onLineBreak(LineBreakProperties properties)
Starts to process the line break element.
properties - Properties of the line break element.

protected void onStartGroup(GroupProperties properties)
Starts to process the group element.
properties - Properties of the group element.

protected void onStartSection(SectionProperties properties)
Starts to process the section element.
properties - Properties of the section element.

protected void onText(TextProperties properties, String value)
Starts to process the element's text.
properties - Properties of the element's text.
value - The text of the element.

protected void onEndElement()
Starts to process the closing of the element.
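
The public `startElement`, `text`, and `endElement` methods pair up with the protected `on...` callbacks, and the examples use `setSkipElement(true)` inside a callback to ignore an element. The sketch below models one plausible way such a skip flag could suppress everything nested inside the skipped element. It is an assumption-laden toy, not the library's dispatch logic: `Element`, `SkippingHandler`, and `NoTables` are hypothetical names, and whether the real handler fires `onEndElement` for skipped elements is not stated in this reference.

```java
// Hypothetical element descriptor with a skip flag, loosely modeled on setSkipElement.
class Element {
    final String kind;
    private boolean skip;
    Element(String kind) { this.kind = kind; }
    void setSkipElement(boolean skip) { this.skip = skip; }
    boolean isSkipElement() { return skip; }
}

// Toy handler base class; the real StructuredHandler may behave differently.
abstract class SkippingHandler {
    private int depth = 0;      // number of currently open elements
    private int skipFrom = -1;  // depth at which skipping began, or -1 when not skipping

    public final void startElement(Element e) {
        depth++;
        if (skipFrom < 0) {
            onStartElement(e);                       // the callback may request a skip
            if (e.isSkipElement()) skipFrom = depth;
        }
    }

    public final void text(String value) {
        if (skipFrom < 0) onText(value);             // text inside a skipped element is dropped
    }

    public final void endElement() {
        if (skipFrom < 0) onEndElement();
        else if (skipFrom == depth) skipFrom = -1;   // the skipped element itself has closed
        depth--;
    }

    protected void onStartElement(Element e) {}
    protected void onText(String value) {}
    protected void onEndElement() {}
}

class NoTables extends SkippingHandler {
    final StringBuilder out = new StringBuilder();

    @Override protected void onStartElement(Element e) {
        e.setSkipElement(e.kind.equals("table"));    // ignore tables and their content
    }
    @Override protected void onText(String value) { out.append(value); }

    public static void main(String[] args) {
        NoTables h = new NoTables();
        h.startElement(new Element("paragraph")); h.text("kept "); h.endElement();
        h.startElement(new Element("table"));
        h.startElement(new Element("cell")); h.text("dropped "); h.endElement();
        h.endElement();
        h.startElement(new Element("paragraph")); h.text("also kept"); h.endElement();
        System.out.println(h.out); // prints "kept also kept"
    }
}
```

Recording the depth at which the skip began lets the model resume normal callbacks exactly when the skipped element closes, regardless of how deeply its children are nested.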
Copyright © 2019. All rights reserved.