
Index Profile Settings

Collection Info

The Collection Info panel lets you set the following index profile properties:

Description: Description or title of the document collection you're indexing.

Primary Language: Sets the primary language of your document collection. The primary language determines the term filtering and lexical analysis rules used for indexing and searching. See term filtering for details.

Default Encoding: Default file encoding for text files. HTML files without a Content-Type meta tag are read using the default encoding specified here. To declare the encoding inside an HTML document, add a Content-Type meta tag. For example:

 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">
 <title>This is a title</title>
 </head> 
 ... 
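
The fallback described above can be sketched in Python. This is a minimal illustration, not QuestAgent's actual reader: the helper names `pick_encoding` and `read_html` are hypothetical, and the sketch only looks for the Content-Type meta tag form shown in the example.

```python
import re

# Hypothetical helper, not part of QuestAgent: finds charset=... inside
# the content attribute of a <meta ...> tag.
META_CHARSET = re.compile(
    rb"<meta[^>]+content\s*=\s*[\"'][^\"'>]*charset\s*=\s*([\w.:-]+)",
    re.IGNORECASE,
)

def pick_encoding(raw: bytes, default_encoding: str) -> str:
    """Return the charset declared in a Content-Type meta tag, else the default."""
    match = META_CHARSET.search(raw[:4096])  # the declaration sits near the top
    if match:
        return match.group(1).decode("ascii")
    return default_encoding

def read_html(raw: bytes, default_encoding: str) -> str:
    """Decode an HTML file with its declared encoding or the profile default."""
    return raw.decode(pick_encoding(raw, default_encoding), errors="replace")
```

With the meta tag from the example, `pick_encoding` returns `EUC-JP`. A file with no such tag is decoded with whatever default is passed in.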

QaIndexProfilePropertiesInfo.png

Collection Root Directory

The Root Directory Specification panel defines the directory you want to index. The indexing engine starts from this directory and collects all allowed files and subdirectories.

To exclude certain files or directories use file filtering rules.

QaIndexProfilePropertiesRootDirectory.png

File Filtering Rules

File filtering rules tell the indexer which files to index (allowed files) and which files to ignore (disallowed files).

QaIndexProfilePropertiesFileFiltering.png

Filtering Rules Syntax

QuestAgent indexing engine parses file filtering rules using a simple syntax:

  • Keyword allow or disallow sets the action for all patterns that follow. (Tip: + can be used instead of allow, and - instead of disallow.)
  • All other tokens are used as patterns.

Rule matching algorithm:

  • Document URI (that is being tested against filtering rules) is created as file path relative to collection root directory. Slash-separated (Unix style) path is used. (For example, if collection root is c:\col and file path is c:\col\foo\bar.htm then URI is foo/bar.htm.)
  • Document URI is tested against pattern list in order specified in file filtering rules.
  • All comparisons are case insensitive.
  • If the file path matches a pattern, then:
    • the rule (allow or disallow) associated with that pattern is applied, and
    • no more rules are processed.
  • If no pattern matches, the default rule (allow) applies.

Note that the default filtering rule is allow. This means a document is indexed unless one of your disallow patterns matches its file path first. To index only the files that match your allow patterns, add disallow * to the end of your rule set.
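
The matching algorithm above can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's actual code: the function names are hypothetical, + and - are assumed to be separate tokens, patterns that appear before any keyword are assumed to count as allow, and Python's `fnmatch` wildcards (where * also matches /) stand in for QuestAgent's pattern syntax.

```python
from fnmatch import fnmatchcase
from pathlib import PureWindowsPath

ACTIONS = {"allow": True, "+": True, "disallow": False, "-": False}

def parse_rules(text):
    """Turn rule text into an ordered list of (pattern, allowed) pairs."""
    rules = []
    allowed = True  # assumption: patterns before any keyword count as "allow"
    for token in text.split():
        if token.lower() in ACTIONS:
            allowed = ACTIONS[token.lower()]
        else:
            rules.append((token.lower(), allowed))
    return rules

def to_uri(root, path):
    """Build the document URI: path relative to the root, slash-separated."""
    return PureWindowsPath(path).relative_to(PureWindowsPath(root)).as_posix()

def is_allowed(uri, rules):
    """First matching pattern decides; with no match the default is allow."""
    uri = uri.lower()  # all comparisons are case insensitive
    for pattern, allowed in rules:
        if fnmatchcase(uri, pattern):
            return allowed
    return True

rules = parse_rules("- */draft/* + *.htm *.html - *")
print(to_uri(r"c:\col", r"c:\col\foo\bar.htm"))   # foo/bar.htm
print(is_allowed("foo/bar.htm", rules))           # True
print(is_allowed("foo/draft/bar.htm", rules))     # False
print(is_allowed("foo/notes.txt", rules))         # False
```

Because the first match wins, the order of the patterns matters: moving the final `- *` to the front would disallow every file.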

Example 1

Set of filtering rules    Explanation
disallow */test/*         ignore all paths containing '/test/',
allow *.htm *.html        index files that match *.htm or *.html,
disallow *_abc.pdf        ignore files that match the *_abc.pdf pattern,
allow *.pdf               index all other PDF documents,
disallow *                ignore all other files.

Example 2

Note the following:

*module.htm matches both "module.htm" and "newmodule.htm"
*/module.htm matches just the file "module.htm"
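
The difference between the two patterns can be checked with a small Python sketch. It assumes shell-style wildcards as implemented by Python's `fnmatch`, where * matches any run of characters including /. Whether QuestAgent's matcher behaves identically in every case is an assumption, and the `docs/` directory is made up for the illustration.

```python
from fnmatch import fnmatchcase

uris = ["docs/module.htm", "docs/newmodule.htm"]

# "*module.htm" only requires the URI to end in "module.htm".
loose = [u for u in uris if fnmatchcase(u, "*module.htm")]

# "*/module.htm" requires a slash right before the file name.
strict = [u for u in uris if fnmatchcase(u, "*/module.htm")]

print(loose)   # ['docs/module.htm', 'docs/newmodule.htm']
print(strict)  # ['docs/module.htm']
```

Under this matcher, a file named module.htm directly in the collection root (URI module.htm) would not match */module.htm, because its URI contains no slash. This page does not say how QuestAgent treats that case.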

Index Fields and Indexing Rules

If handled, text and metadata extracted from a document are always associated with an index field. This panel tells the indexing engine how to handle that content.

QaIndexProfilePropertiesIndexFields.png

Predefined Fields

The list of predefined index fields defines the structure of the index database and how you want to search or use the data:

  • Indexing option index tokenized means that content will be tokenized and passed through text analysis filters. (If you want to search for data in field content, you'll probably want to use this option.)
  • Indexing option index as is means that content should be indexed without any analysis. This is useful if index field does not contain plain text but some more specific data (e.g. product identifier extracted from HTML Meta tag).
  • Use do not index option if you're not going to search data in that field.
  • Store field value only if you're going to show field content in search results list.

Note: Field names are case sensitive (for both indexing and searching).

Default Handling Rule

Default handling rule defines what to do with extracted content when an appropriate predefined field does not exist:

  • Mapped only – Ignores all content fields that are not mapped to an existing predefined field. (Default option.)
  • Known only – Looks up a predefined index field with the same name as the content field and uses its rules. This covers the case where a document handler extracts some data but no mapping exists for the extracted field.
  • Index unknown – Indexes the content tokenized, using the name of the content field.
  • Index and store unknown – As above, but the content is also stored.
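
One possible reading of these four options, sketched in Python. Everything here is an assumption made for illustration: the `FieldRule` type, the mode names, and in particular the guess that the two "unknown" modes still prefer a predefined field of the same name when one exists.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldRule:
    indexing: str   # "tokenized", "as_is" or "none" (hypothetical labels)
    stored: bool

def resolve(content_field, mappings, predefined, mode):
    """Return {index_field: FieldRule} for one extracted content field."""
    # Explicit mappings to existing predefined fields always win.
    targets = [f for f in mappings.get(content_field, []) if f in predefined]
    if targets:
        return {f: predefined[f] for f in targets}
    if mode == "mapped_only":
        return {}  # unmapped content is ignored
    # Assumed: every other mode falls back to a predefined field of the same name.
    if content_field in predefined:
        return {content_field: predefined[content_field]}
    if mode == "known_only":
        return {}
    if mode in ("index_unknown", "index_and_store_unknown"):
        stored = mode == "index_and_store_unknown"
        return {content_field: FieldRule("tokenized", stored)}
    raise ValueError(f"unknown mode: {mode}")
```

For example, with no mapping and no predefined field named `author`, `resolve("author", {}, {}, "index_unknown")` yields a tokenized, non-stored `author` field, while `"mapped_only"` yields nothing.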

File Types

The File Types panel lets you enable or disable indexing for each file type. Because some document types have more than one handler, you can also select the profile you want to use.

Note that you can control these settings on a global level too. Settings in profile (if changed) override global settings.

QaIndexProfilePropertiesFileTypes.png

Content Field Mappings

Content field mappings define the relationship between the content and properties that content handlers extract from a document and the index fields in the index database. Using data mappings, you can control how document data appears in the index (that is, in which index field).

Each document type has its own set of data mappings. The following parameters define data mapping:

  • Property is the name of a document property extracted by the content handler, or a rule for extraction (as in the case of the XPath XML handler).
  • Boost defines the importance of data extracted from the document property. Boost matters when the content of two or more content properties is mapped to the same index field. In that case, boosting determines the relative importance of items found in the different data fields.
  • Index field is the field in index database you're mapping the content to.
  • Description is optional and is provided for your convenience as a comment about the mapping.

Note: One content field can be mapped to more than one index field.
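
As a data structure, a set of mappings could look like the Python sketch below. The `Mapping` class, the `route` function and the property and field names are all hypothetical; the sketch only illustrates that one property can feed several index fields and that several properties can feed one field with different boosts.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    prop: str             # document property extracted by the content handler
    index_field: str      # target field in the index database
    boost: float = 1.0    # relative importance of this property's content
    description: str = "" # optional comment

def route(extracted, mappings):
    """Group extracted property values by index field, keeping each boost."""
    by_field = defaultdict(list)
    for m in mappings:
        if m.prop in extracted:
            by_field[m.index_field].append((extracted[m.prop], m.boost))
    return dict(by_field)

mappings = [
    Mapping("title", "title", 2.0, "page title, searchable on its own"),
    Mapping("title", "content", 2.0, "title also counts as body text"),
    Mapping("body", "content", 1.0),
]
print(route({"title": "Guide", "body": "Text"}, mappings))
# {'title': [('Guide', 2.0)], 'content': [('Guide', 2.0), ('Text', 1.0)]}
```

Here the title is mapped to two index fields, and within the content field the title text carries twice the boost of the body text.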

QaIndexProfilePropertiesContentFieldMappings.png

Text Analysis Options

Text analysis options define how text extracted from documents is tokenized and filtered before it's indexed. These options apply to those index fields where the indexing rule is set to index tokenized.

QaIndexProfilePropertiesTextAnalysis.png

Available options

Case insensitive: If you want case sensitive search, you have to disable both this option and the “use stemming” option.
Ignore accents: Enable to ignore accents and umlauts.
Acronym filter: Treats acronym terms such as “A.B.C.” as equal to “ABC”.
Term length filter: Rejects terms that are not in the specified length range.
Ignore stop words: Enable if you want to exclude stop words from the index. You can also edit the (global) stop words list for the language in use.
Stemming: Stemming is an algorithm that maps inflected words to their basic forms (stems). For example, words -> word, mapping -> map, etc.
CJK support: Regardless of the collection language, enable this option if your index contains Chinese/Japanese/Korean text.
Split term on delimiters: Tokenizes input text using the specified delimiter characters.

These analysis filters are applied in the order they appear in the configuration panel. Note: You need to recreate the index whenever you change analysis settings.
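
To make the idea of an ordered filter chain concrete, here is a minimal Python sketch covering the first five options. It is not QuestAgent's analyzer: tokenizing on whitespace, the punctuation handling, the tiny stop word list and the length limits are all assumptions made for the illustration, and stemming, CJK support and delimiter splitting are left out.

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "of", "and"}  # placeholder list, not the real one

def strip_accents(term):
    """Drop combining marks, so "café" becomes "cafe"."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collapse_acronym(term):
    """Turn "a.b.c." into "abc"; leave other terms unchanged."""
    return term.replace(".", "") if re.fullmatch(r"(?:\w\.){2,}", term) else term

def analyze(text, min_len=2, max_len=30):
    terms = []
    for raw in text.split():                  # assumed whitespace tokenizer
        term = raw.strip(",;:!?\"'()")
        term = term.lower()                   # case insensitive
        term = strip_accents(term)            # ignore accents
        term = collapse_acronym(term)         # acronym filter
        term = term.rstrip(".")               # sentence-final period
        if not (min_len <= len(term) <= max_len):
            continue                          # term length filter
        if term in STOP_WORDS:
            continue                          # ignore stop words
        terms.append(term)
    return terms

print(analyze("The Café of A.B.C. mapping."))  # ['cafe', 'abc', 'mapping']
```

The same chain has to run over query text as well, which is one way to see why changing analysis settings requires recreating the index: terms already indexed were produced by the old chain.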

Testing Text Analysis

To test the analysis options currently set in the panel, click the “Test text analysis” button. The test panel opens.

QaIndexProfilePropertiesTestTextAnalysis.png

Type or copy and paste sample text, enter your query and click test button.

Document Passwords

Using Document Passwords panel you can specify password rules for protected documents. At this time passwords are supported only for PDF documents.

Password rule consists of a file name pattern and password.

Warning: Save your index profile in a safe place! Passwords are stored as plain text and could be read by anyone accessing your profile file.

QaIndexProfilePropertiesDocumentPasswords.png

To define a password rule, click the "Add" button and specify a file name pattern and the appropriate password. Notice that passwords are visible as plain text. As with file filtering rules, file name patterns allow wildcards. Valid patterns include "secret.pdf", "*.pdf", etc. Rule matching is performed sequentially: the password of the first pattern that matches the file name is used to read the document content. Use the "Up" and "Down" buttons to change the order in which password rules are processed.
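
The first-match lookup can be sketched in Python as follows. The function name and the sample passwords are hypothetical, Python's `fnmatch` wildcards stand in for QuestAgent's pattern syntax, and case-insensitive comparison is assumed by analogy with file filtering rules.

```python
from fnmatch import fnmatchcase

def find_password(file_name, rules):
    """Return the password of the first rule whose pattern matches, else None."""
    for pattern, password in rules:
        if fnmatchcase(file_name.lower(), pattern.lower()):
            return password
    return None

rules = [("secret.pdf", "s3cret"), ("*.pdf", "general")]
print(find_password("secret.pdf", rules))  # s3cret
print(find_password("report.pdf", rules))  # general
print(find_password("notes.txt", rules))   # None
```

If the two rules were swapped, "*.pdf" would match first and secret.pdf would be opened with the wrong password, which is why the order of the rules matters.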

Index Export Options

With allow use on web sites option you can allow or forbid use of search applet on web sites.

Expert option: The segment size parameter defines the maximum size of an index file segment. The QuestAgent search applet reads only the segments of the index database it needs, which makes searching and data retrieval faster. The default value is 64KB.
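
As a rough illustration of why segmenting helps, the Python sketch below computes which segments would need to be read to fetch a given byte range. It assumes fixed-size segments addressed by byte offset, which is a simplification made for this example and not a description of the actual index file layout.

```python
SEGMENT_SIZE = 64 * 1024  # default segment size: 64KB

def segments_for_range(offset, length, segment_size=SEGMENT_SIZE):
    """Return the indices of the segments that cover [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // segment_size
    last = (offset + length - 1) // segment_size
    return range(first, last + 1)

print(list(segments_for_range(60000, 10000)))  # [0, 1]
print(list(segments_for_range(131072, 100)))   # [2]
```

Under this model, smaller segments mean less data transferred per lookup but more separate reads, and larger segments mean the opposite.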

QaIndexProfilePropertiesIndexExportOptions.png

Note: QuestAgent enforces your license features as follows:

  • If you have an Unlimited Distribution License, you do not need to specify a server name, and the exported index can be used from any web server. That way your clients can install the documentation on their intranet web server and use search from there.
  • If you have any other license, use is allowed on just one web server, and its name must be set before the index is deployed/exported.