Document Indexers
QuestAgent comes with document indexers for the most popular document formats (at least on the Internet). Indexers extract content properties from the document being indexed. In QuestManager's Content Field Mappings panel you can specify how you want to handle these properties (index and/or store).
HTML file handler
By default, the HTML handler recognizes the following document properties:
| Property | Description |
| --- | --- |
| body | Document body content. |
| title | Document title. Content appearing in TITLE tag. |
| description | The content of Description Meta tag or content appearing in the first 200 characters of file. |
| keywords | The content of Keywords Meta tag. |
| img.alt | Image alternate text. |
| area.alt | Area alternate text. |
| h1 ... h6 | Header tags. |
Meta elements can be defined in the document header (between <HEAD> ... </HEAD> tags) or anywhere in the document body.
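As an illustration of how the properties in the table above could be collected, here is a minimal Python sketch built on the standard library's html.parser module. It is not QuestAgent's implementation. The names PropertyExtractor and extract_properties are hypothetical, and the description fallback assumes that "the first 200 characters" means the first 200 characters of the extracted body text.

```python
from collections import Counter
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
TEXT_TAGS = HEADER_TAGS | {"title", "body"}


class PropertyExtractor(HTMLParser):
    """Hypothetical collector for the properties listed in the table."""

    def __init__(self):
        super().__init__()
        self.props = {}             # property name -> list of text fragments
        self.open_tags = Counter()  # how many times each text tag is open

    def _add(self, name, text):
        self.props.setdefault(name, []).append(text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        meta_name = (attrs.get("name") or "").lower()
        if tag == "meta" and meta_name in ("description", "keywords"):
            # Standard Meta tag: the value is in the content attribute.
            self._add(meta_name, attrs.get("content") or "")
        elif tag in ("img", "area") and attrs.get("alt"):
            self._add(tag + ".alt", attrs["alt"])
        elif tag in TEXT_TAGS:
            self.open_tags[tag] += 1

    def handle_endtag(self, tag):
        if self.open_tags[tag] > 0:
            self.open_tags[tag] -= 1

    def handle_data(self, data):
        # Text counts for every text tag that is currently open,
        # so header text is recorded under both its header tag and body.
        for tag, count in self.open_tags.items():
            if count > 0:
                self._add(tag, data)


def extract_properties(html):
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space and collapse runs of whitespace.
    props = {name: " ".join(" ".join(parts).split())
             for name, parts in parser.props.items()}
    if not props.get("description"):
        # Assumption: fall back to the first 200 characters of the body text.
        props["description"] = props.get("body", "")[:200]
    return props
```

Calling extract_properties on a page returns a dictionary keyed by property name, for example "title", "keywords", "img.alt" or "h1". Which of these are indexed or stored is still decided in the Content Field Mappings panel.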
In addition, the HTML handler supports (non-standard) inline Meta tagging, which can appear anywhere in the document body:
<META name="propertyName">some content</META>
This allows intensive content tagging. Inline Meta tags can also be nested:
this is a document body...
<META name="propOuter">
in outer tag...
<META name="propInner">in inner tag...</META>
in outer tag...
</META>
this is a document body...
For the example above, the extracted Meta content is:
propOuter = in outer tag... in inner tag... in outer tag...
propInner = in inner tag...
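The nesting rule shown above (text inside an inner tag also belongs to every enclosing tag) can be sketched with a stack of open tag names. The following Python example is only an illustration based on the standard html.parser module, not the handler's actual code. InlineMetaExtractor and extract_inline_meta are hypothetical names, and treating any Meta tag with a content attribute as a standard (non-inline) tag is an assumption.

```python
from html.parser import HTMLParser


class InlineMetaExtractor(HTMLParser):
    """Hypothetical collector for nested inline <META name="..."> tags."""

    def __init__(self):
        super().__init__()
        self.open_names = []  # stack of inline Meta names currently open
        self.properties = {}  # property name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Assumption: a Meta tag with a content attribute is a standard
        # header-style tag and has no closing tag, so it is skipped here.
        if "content" in attrs or not attrs.get("name"):
            return
        self.open_names.append(attrs["name"])
        self.properties.setdefault(attrs["name"], [])

    def handle_endtag(self, tag):
        if tag == "meta" and self.open_names:
            self.open_names.pop()

    def handle_data(self, data):
        # Text belongs to every inline Meta tag that is open at this point.
        for name in self.open_names:
            self.properties[name].append(data)


def extract_inline_meta(html):
    parser = InlineMetaExtractor()
    parser.feed(html)
    parser.close()
    # Concatenate fragments and collapse whitespace, including line breaks.
    return {name: " ".join("".join(parts).split())
            for name, parts in parser.properties.items()}
```

Fed the nested sample above, extract_inline_meta returns the same two values listed for propOuter and propInner.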
Text file handler
The text file handler provides two document properties: body and description.
PDF file handler
The PDF handler recognizes the following document properties:
| Property | Description |
| --- | --- |
| body | Document body content. |
| title | Document title. |
| subject | Document subject. |
| keywords | Keywords describing document content. |
| author | Document author. |
MS Word handler (cross platform)
A fast cross-platform MS Word document reader. The complete list of supported document properties can be found in the Content Field Mappings panel.
MS Excel handler (cross platform)
A fast cross-platform MS Excel document reader. The complete list of supported document properties can be found in the Content Field Mappings panel.
MS Word handler (MS Office bridge)
MS Excel handler (MS Office bridge)
MS PowerPoint handler (MS Office bridge)
The MS Office bridge handlers work only on the Windows platform with MS Office installed. They use MS Office components to extract document content.
These handlers are accurate but very slow. Use them only if the cross-platform extractor does not work for you (there is currently no cross-platform PowerPoint handler) or if you suspect a problem with text extraction.
The complete list of supported document properties can be found in the Content Field Mappings panel.
Open Office handler
The Open Office handler reads documents created by OpenOffice.org Writer, Calc and Impress.
The complete list of supported document properties can be found in the Content Field Mappings panel.
XML handler (XPath)
The content of any XML tag or attribute can be indexed and/or stored. The XML handler accepts XPath syntax for document property names. (Please visit http://www.w3c.org for details on XPath.)
- If the property name matches a tag attribute, the attribute's value will be handled.
- If the property name matches a tag, the XML handler will collect all text appearing between the opening and closing tags.
Note that property names (XPath expressions) are evaluated against the document root element.
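The two rules above can be sketched in Python with the standard xml.etree.ElementTree module, which supports only a limited subset of XPath. This is an illustration under assumptions, not the handler's implementation: extract_property is a hypothetical helper, attribute steps such as @id are resolved by hand because ElementTree cannot select attribute values directly, and the sample document is a shortened copy of the personnel example (without the DOCTYPE line).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the personnel sample, used here for illustration only.
SAMPLE = """<?xml version="1.0"?>
<personnel>
  <person id="J.MILLER">
    <name><family>MILLER</family> <given>John</given></name>
    <email>john@somedomain.com</email>
    <link subordinates="S.SMITH T.PHILLIPS"/>
  </person>
</personnel>"""


def extract_property(root, xpath):
    """Evaluate a property name (XPath expression) against the root element.

    Returns a list with one string per matching attribute or tag.
    """
    head, _, last = xpath.rpartition("/")
    if last.startswith("@"):
        # Attribute rule: take the attribute value of each matching element.
        attr = last[1:]
        elements = root.findall(head or ".")
        return [el.get(attr) for el in elements if el.get(attr) is not None]
    # Tag rule: collect all text between the opening and closing tags,
    # including text of nested tags, with whitespace collapsed.
    return [" ".join("".join(el.itertext()).split())
            for el in root.findall(xpath)]


root = ET.fromstring(SAMPLE)
person_ids = extract_property(root, "person/@id")
full_names = extract_property(root, "person/name")
```

With this sample, person_ids is ["J.MILLER"] and full_names is ["MILLER John"]. Paths are relative to the root element (personnel), which mirrors the note that expressions are evaluated against the document root element.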
By default, XML Handler accepts the following properties:
The following example shows specified data fields along with recognized content.
XML sample file:
<?xml version="1.0"?>
<!DOCTYPE personnel SYSTEM "personal.dtd">
<personnel>
<person id="J.MILLER" >
<name><family>MILLER</family> <given>John</given></name>
<email>john@somedomain.com</email>
<link subordinates="S.SMITH T.PHILLIPS"/>
</person>
...
</personnel>
Extracted content:
XML handler (Simple)
The Simple XML handler collects all node text and handles it as the body field.
If you need controlled text extraction from XML, use the XPath-based XML handler.
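For comparison with the XPath-based handler, collecting all node text into a single body value can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. The function name extract_body is hypothetical, and collapsing whitespace is an assumption about how the text is cleaned up.

```python
import xml.etree.ElementTree as ET


def extract_body(xml_text):
    """Return all text nodes of the document as one whitespace-normalized string."""
    root = ET.fromstring(xml_text)
    # itertext() walks every element in document order and yields its text.
    return " ".join("".join(root.itertext()).split())
```

Attribute values are not included by this sketch, since only text between tags is gathered.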