Document Indexers

QuestAgent comes with document indexers for the most popular document formats (at least on the Internet). Indexers extract content properties from the document being indexed. In QuestManager's Content Field Mappings panel you can specify how you want to handle these properties (index them, store them, or both).

HTML file handler

By default, HTML handler recognizes the following document properties:

Property     Description
body         Document body content.
title        Document title. Content appearing in TITLE tag.
description  The content of Description Meta tag or content appearing in the first 200 characters of file.
keywords     The content of Keywords Meta tag.
img.alt      Image alternate text.
area.alt     Area alternate text.
h1 ... h6    Header tags.

Meta elements can be defined in document header (between <HEAD> ... </HEAD> tags) or anywhere in document body.

In addition, the HTML handler supports (non-standard) inline Meta tagging that can appear anywhere in the document body:

 <META name="propertyName">some content</META>

This allows intensive content tagging. Inline Meta tags can also be nested:

 this is a document body...
 <META name="propOuter">
 in outer tag...
 <META name="propInner">in inner tag...</META> 
 in outer tag...
 </META> 
 this is a document body...

For the example above, the extracted Meta content is:

 propOuter = in outer tag... in inner tag... in outer tag...
 propInner = in inner tag...
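
The nesting rule above can be illustrated with a short Python sketch. This is not QuestAgent's implementation: the names `InlineMetaExtractor` and `extract_inline_meta` are hypothetical, and the sketch assumes that a META tag without a `content` attribute is an inline tag whose text is credited to every inline tag that is currently open.

```python
from html.parser import HTMLParser

# The sample document from the example above.
SAMPLE = """this is a document body...
<META name="propOuter">
in outer tag...
<META name="propInner">in inner tag...</META>
in outer tag...
</META>
this is a document body..."""


class InlineMetaExtractor(HTMLParser):
    """Hypothetical helper: collects text wrapped in inline META tags."""

    def __init__(self):
        super().__init__()
        self.open_names = []   # stack of inline META names currently open
        self.properties = {}   # property name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Assumption: a META tag with a content attribute is a regular
        # (non-inline) Meta element and is ignored by this sketch.
        if name and "content" not in attributes:
            self.open_names.append(name)
            self.properties.setdefault(name, [])

    def handle_endtag(self, tag):
        if tag == "meta" and self.open_names:
            self.open_names.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            # Text belongs to every open inline tag, outer ones included.
            for name in self.open_names:
                self.properties[name].append(text)


def extract_inline_meta(html):
    parser = InlineMetaExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.properties.items()}


if __name__ == "__main__":
    for name, value in extract_inline_meta(SAMPLE).items():
        print(name, "=", value)
```

Keeping a stack of open property names is what lets the inner text count towards both `propInner` and `propOuter`, matching the extracted content listed above.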

Text file handler

Text file reader provides two document properties: body and description.

PDF file handler

PDF handler recognizes the following document properties:

Property  Description
body      Document body content.
title     Document title.
subject   Document subject.
keywords  Keywords describing document content.
author    Document author.

MS Word handler (cross platform)

A fast, cross-platform MS Word document reader. The complete list of supported document properties can be found in the Content Field Mappings panel.

MS Excel handler (cross platform)

A fast, cross-platform MS Excel document reader. The complete list of supported document properties can be found in the Content Field Mappings panel.

MS Word handler (MS Office bridge)

MS Excel handler (MS Office bridge)

MS PowerPoint handler (MS Office bridge)

The MS Office bridge handlers work only on the Windows platform with MS Office installed. They use MS Office components to extract document content.

These handlers are accurate but very slow. Use them only if the cross-platform extractor does not work for you (there is currently no cross-platform PowerPoint handler) or you suspect that there is a problem with text extraction.

The complete list of supported document properties can be found in content field mappings panel.

Open Office handler

The Open Office handler reads documents created by OpenOffice.org Writer, Calc and Impress.

The complete list of supported document properties can be found in content field mappings panel.

XML handler (XPath)

The content of any XML tag or attribute can be indexed and/or stored. The XML handler accepts XPath syntax for document property names. (Please visit http://www.w3c.org for details on XPath.)

  • If a property name matches a tag attribute, the attribute's value is handled.
  • If a property name matches a tag, the XML handler collects all text appearing between the opening and closing tags.

Note that property names (XPath expressions) are evaluated against document root element.

By default, XML Handler accepts the following properties:

Property (XPath expression)  Extracted content
/                            Collect and handle all text below root element.
//@*                         Collect and handle all attributes.

The following example shows specified data fields along with recognized content.

XML sample file:

 <?xml version="1.0"?>
 <!DOCTYPE personnel SYSTEM "personal.dtd">
 <personnel>
   <person id="J.MILLER" >
     <name><family>MILLER</family> <given>John</given></name>
     <email>john@somedomain.com</email>
     <link subordinates="S.SMITH T.PHILLIPS"/>
   </person>
   ...
 </personnel>

Extracted content:

Property (XPath expression)           Extracted content
/personnel/person/name                MILLER John
/personnel/person/link/@subordinates  S.SMITH T.PHILLIPS
//email                               john@somedomain.com
//person/@id                          J.MILLER
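
The mapping above can be approximated with Python's standard `xml.etree.ElementTree` module. This is only a sketch under stated assumptions, not the XML handler's own code: `extract_property` is a hypothetical helper, ElementTree supports just a subset of XPath, and the sketch covers the element and `/@attribute` forms used in the example (the `//@*` default is not handled). The sample omits the XML declaration, the DOCTYPE and the elided `...` content.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file above (one person, no DOCTYPE).
SAMPLE = """<personnel>
  <person id="J.MILLER">
    <name><family>MILLER</family> <given>John</given></name>
    <email>john@somedomain.com</email>
    <link subordinates="S.SMITH T.PHILLIPS"/>
  </person>
</personnel>"""


def _clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_property(root, xpath):
    """Hypothetical helper: return the values matched by one property name."""
    path, attr = xpath, None
    # Split off a trailing attribute step such as "/@id".
    if "/@" in xpath:
        path, attr = xpath.rsplit("/@", 1)

    if path.startswith("//"):
        # Descendant search, evaluated against the document root element.
        elements = root.findall("." + path)
    else:
        # Absolute path: the first step must name the root element itself.
        steps = [step for step in path.split("/") if step]
        if steps and steps[0] != root.tag:
            return []
        relative = "/".join(steps[1:])
        elements = root.findall("./" + relative) if relative else [root]

    if attr is None:
        # Tag match: collect all text between the opening and closing tags.
        values = ["".join(element.itertext()) for element in elements]
    else:
        # Attribute match: take the attribute value where it is present.
        values = [element.get(attr) for element in elements
                  if element.get(attr) is not None]
    return [_clean(value) for value in values if _clean(value)]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for name in ("/personnel/person/name",
                 "/personnel/person/link/@subordinates",
                 "//email",
                 "//person/@id"):
        print(name, "->", extract_property(root, name))
```

The absolute paths are rewritten into paths relative to the root because ElementTree does not accept absolute paths on an element. A full XPath engine would not need that step.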

XML handler (Simple)

The simple XML handler collects all node text and handles it as the body field.

If you need controlled text extraction from XML, use the XPath-based XML handler.
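
As a rough illustration of what the simple handler does, collecting all node text into a single body value might look like the Python sketch below. The function name `extract_body` is hypothetical, and the whitespace normalization is an assumption rather than documented behaviour.

```python
import xml.etree.ElementTree as ET


def extract_body(xml_text):
    """Hypothetical helper: gather the text of every node into one string."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree in document order and yields all text nodes.
    all_text = "".join(root.itertext())
    # Assumption: collapse whitespace so the body reads as plain text.
    return " ".join(all_text.split())


if __name__ == "__main__":
    print(extract_body("<note><to>Ann</to> <body>See you at noon</body></note>"))
```

Attribute values are not text nodes, so this sketch ignores them.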