How FTS indexing works for attachments
Indexing for attachments is selected on a per-field basis, not by attachment pool. Therefore, it is possible to index only some of the attachment fields in a pool. The advantage of indexing only some of the attachment fields is that you can designate (and appropriately name) "buckets" for attachments that are likely to have value when indexed, as opposed to those that are not. The system attempts to index any attachment that is placed into an attachment field indexed for FTS. By guiding users to place attachments in the appropriate "buckets," you help the system avoid unnecessary processing.
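The per-field selection described above can be pictured with a small model. This is only an illustrative sketch, not product code: the `AttachmentField` class, the `fts_indexed` flag, and the field names are hypothetical and do not correspond to any actual API or form definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttachmentField:
    """Hypothetical model of one attachment field in a pool."""
    name: str
    fts_indexed: bool  # assumed flag: True if the field is indexed for FTS


def fields_to_index(pool):
    """Return the names of the fields in the pool that are indexed for FTS."""
    return [field.name for field in pool if field.fts_indexed]


# One pool with two "buckets": only the first is selected for indexing.
pool = [
    AttachmentField("Searchable Documents", fts_indexed=True),
    AttachmentField("Screenshots and Logs", fts_indexed=False),
]

print(fields_to_index(pool))
```

In this model, an attachment placed in "Screenshots and Logs" is never handed to the indexer, which is the processing saving that the "bucket" approach aims for.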
For HTML and XML file attachments, only the content (not the metadata) is indexed. That is, the markup elements and their attributes are not indexed; only the text they enclose is.
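To illustrate the distinction between content and markup, the following sketch uses Python's standard `html.parser` module to keep only the text between tags. It is a simplified illustration, not the parser that the system itself uses; the `ContentOnlyExtractor` class and the `indexable_text` function are names invented for this example.

```python
from html.parser import HTMLParser


class ContentOnlyExtractor(HTMLParser):
    """Collects text content and ignores element names and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags; tags and attributes never arrive here.
        text = data.strip()
        if text:
            self.chunks.append(text)


def indexable_text(markup: str) -> str:
    """Return the text content of the markup, joined with single spaces."""
    parser = ContentOnlyExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


print(indexable_text('<p class="alert">Disk <b>full</b> on server</p>'))
```

A search for "full" would match this attachment under this model, but a search for "alert" would not, because "alert" appears only as an attribute value.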
The following formats are supported for FTS of attachment files:
- Hypertext markup language (HTML) format
- XML and derived formats
- Microsoft Office document formats (Word 97 and later--see the note that follows)
- OpenDocument format (OpenOffice 1.0 and later--see the note that follows)
- Portable document format (PDF) (versions 1.0 through 9.0)
- Electronic publication format (digital books)
- Rich text format (RTF)
- Compression and packaging formats (.zip, .tar, .bzip2, .ar, .cpio)
- Text formats (most Unicode and ISO 8859 documents in plain text)
- Audio formats (extracts lyrics [if present] and any metadata from MP3, MIDI, and other simple audio formats)
- Image formats (extracts metadata from image formats supported by the Java platform)
- Video formats (supports only Flash video format using a simple parsing algorithm)
- Java class files and archives (extracts class names and method signatures from Java class files and the .jar files containing them)
- The mbox format (extracts email messages from the mbox format used by many email archives and UNIX mailboxes)
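
When deciding which "bucket" a file belongs in, a quick lookup against the list above can be useful. The sketch below is a hypothetical helper, not part of the product: the extension table is an assumption that covers only some common extensions for the listed categories, and the system itself might identify formats by file content rather than by file name.

```python
from pathlib import PurePath

# Assumed mapping from common file extensions to categories in the list above.
# Deliberately incomplete; the extensions are illustrative.
FORMAT_BY_EXTENSION = {
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
    ".pdf": "PDF",
    ".rtf": "RTF",
    ".zip": "Compression and packaging",
    ".tar": "Compression and packaging",
    ".bzip2": "Compression and packaging",
    ".ar": "Compression and packaging",
    ".cpio": "Compression and packaging",
    ".txt": "Text",
    ".mp3": "Audio",
    ".class": "Java class files and archives",
    ".jar": "Java class files and archives",
    ".mbox": "mbox",
}


def listed_format(filename: str):
    """Return the listed category for the file's extension, or None if unknown."""
    return FORMAT_BY_EXTENSION.get(PurePath(filename).suffix.lower())


print(listed_format("Report.PDF"))
print(listed_format("photo.xyz"))
```

A result of `None` means only that the extension is missing from this illustrative table, not that the system cannot index the file.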