Full-Text Searching with IFilters

Introduction

The nineties were all about creating and sharing information. Today’s challenge is finding the information you need when you need it. We all know the pain of not being able to find the piece of information that would help with the task at hand. As a result, we either spend a lot of time searching or, if the search fails, we spend a lot of time working through the task by trial and error until we figure out how to do it. Microsoft products such as Indexing Server, Exchange Server, SharePoint Server, SQL Server, and Windows Desktop Search provide powerful full-text search capabilities. All of these products share a common building block for their full-text searching: IFilters.

All Microsoft full-text search engines index the actual content and then let you search against these indexes. The indexing process determines the file type associated with the content and then invokes the associated IFilter. The COM object that implements the IFilter encapsulates the knowledge of the content structure and extracts the text from the content for indexing. If a third-party ISV has proprietary content that should be searchable by these Microsoft products, the ISV needs to create an appropriate IFilter COM object. As soon as this IFilter is registered, it can be used by all Microsoft full-text search engines. This greatly simplifies the process for ISVs to make their content searchable with all the different Microsoft products.

How IFilters Get Associated with the Different File/Content Types

Any content that is searched has a file extension associated with it. Indexing Server, SharePoint, and Windows Desktop Search index and search files on the file system. Exchange Server, SharePoint, and SQL Server can hold embedded files that again have a file extension. All other fields in SQL Server are assumed to be in text format and are therefore treated as having the “.txt” extension. Messages in Exchange Server are also treated as “.txt”. The Registry, therefore, is the natural place to associate IFilters with each file extension. The indexing process first determines the file extension of the content. Then, it performs the following steps:

  • Step 1: Determine whether there is a PersistentHandler associated with the file extension. This can be found in the Registry under HKEY_LOCAL_MACHINE\Software\Classes\FileExtension; for example, HKLM\Software\Classes\.htm. The default value of the sub key called PersistentHandler gives you the GUID of the PersistentHandler. If present, skip to Step 4; otherwise, continue with Step 2.
  • Step 2: Determine the CLSID associated with the file extension. Take the default value that is associated with the extension; for example, “htmlfile” for the key HKLM\Software\Classes\.htm. Next, search for that entry (for example, “htmlfile”) under HKLM\Software\Classes. The default value of its CLSID sub key contains the CLSID associated with that file extension.
  • Step 3: Next, search for that CLSID under HKLM\Software\Classes\CLSID. The default value of the sub key called PersistentHandler gives you the GUID of the PersistentHandler.
  • Step 4: Search for that GUID under HKLM\Software\Classes\CLSID. Under it, you find a PersistentAddinsRegistered sub key that always has a {89BCB740-6119-101A-BCB7-00DD010655AF} sub key (this is the GUID of the IFilter interface). The default value of this sub key holds the GUID of the IFilter implementation.
  • Step 5: Search for this GUID once more under HKLM\Software\Classes\CLSID. Under its key, you find the InProcServer32 sub key, whose default value contains the name of the DLL that provides the IFilter interface to use for this extension. For example, for the .htm and .html extensions, this is nlhtml.dll.
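
The five steps above can be sketched in Python. This is only an illustration of the lookup order, not how any of the Microsoft products implement it. The function names (find_persistent_handler, find_ifilter_dll, winreg_read_default) are hypothetical, and the Registry is abstracted behind a read_default callable so that the logic can run against a plain dictionary as well as against the real Registry through Python's standard winreg module on Windows.

```python
from typing import Callable, Optional

# GUID of the IFilter interface, as quoted in Step 4.
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"
CLASSES = "Software\\Classes"

# read_default(path) returns the default value of a key below HKLM,
# or None when the key or its default value does not exist.
ReadDefault = Callable[[str], Optional[str]]


def key(*parts: str) -> str:
    """Join registry key parts with backslashes."""
    return "\\".join(parts)


def find_persistent_handler(ext: str, read_default: ReadDefault) -> Optional[str]:
    """Steps 1 to 3: find the PersistentHandler GUID for an extension."""
    # Step 1: PersistentHandler directly under the extension key.
    handler = read_default(key(CLASSES, ext, "PersistentHandler"))
    if handler:
        return handler
    # Step 2: extension -> document type (e.g. "htmlfile") -> CLSID.
    doc_type = read_default(key(CLASSES, ext))
    if not doc_type:
        return None
    clsid = read_default(key(CLASSES, doc_type, "CLSID"))
    if not clsid:
        return None
    # Step 3: PersistentHandler under that CLSID.
    return read_default(key(CLASSES, "CLSID", clsid, "PersistentHandler"))


def find_ifilter_dll(ext: str, read_default: ReadDefault) -> Optional[str]:
    """Steps 1 to 5: return the IFilter DLL for an extension, or None."""
    handler = find_persistent_handler(ext, read_default)
    if not handler:
        return None
    # Step 4: the handler's PersistentAddinsRegistered\{IFilter IID} sub key.
    filter_clsid = read_default(
        key(CLASSES, "CLSID", handler, "PersistentAddinsRegistered", IFILTER_IID)
    )
    if not filter_clsid:
        return None
    # Step 5: InProcServer32 of the IFilter implementation names the DLL.
    return read_default(key(CLASSES, "CLSID", filter_clsid, "InProcServer32"))


def winreg_read_default(path: str) -> Optional[str]:
    """Read a default value below HKLM (Windows only)."""
    import winreg  # imported here so the module still loads elsewhere

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as handle:
            value, _value_type = winreg.QueryValueEx(handle, "")
            return value or None
    except OSError:
        return None
```

On a Windows machine, find_ifilter_dll(".htm", winreg_read_default) would perform the lookup against the live Registry. Passing the get method of a dictionary that maps key paths to default values exercises the same logic without touching the Registry.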

The following article provides a more detailed description, with examples, of how the IFilter DLL is found. For more information about the PersistentHandler, refer to this article.

How to Create Your Own IFilter Component

ISVs that register their own file extensions with proprietary content structures need to provide their own IFilter components so that these file types can be searched by Microsoft products. The Platform SDK describes the IFilter interface in detail. It also contains three sample IFilter implementations.

The Three IFilter Components Used in this Article

The “Channel9 Wiki” lists the IFilter components that are present out of the box. Please note that a number of software packages install additional IFilter components. The wiki also provides links to a number of additional IFilter components. Windows Desktop Search has its own site for additional IFilter components: http://addins.msn.com. The rest of this article explains how full-text search works in Indexing Server, SQL Server, Windows Desktop Search, and SharePoint. It also documents any additional settings you need to make for new IFilter components to work. The three IFilter components used are:

  • CHM file extension: The CHM extension is used by compiled Windows help files. Out of the box, CHM files have no PersistentHandler associated so they are not searchable. The installer places and registers one DLL: CHMIFilter.dll.
  • ZIP file extension: ZIP files are also not searchable out of the box because they, too, have no PersistentHandler associated. The installer places and registers one DLL: ztvArchFil.dll. The ZIP IFilter made available by Citeknet worked fine with Indexing Server and Windows Desktop Search, but I could not get it to work with SQL Server. Its installer also places and registers one DLL: ZIPIFilter.dll.
  • XML file extension: The XML file extension has a PersistentHandler associated by default that works fine with Indexing Server and Windows Desktop Search. However, the default IFilter did not work with SQL Server. This XML IFilter component works with all three. First extract the file, then copy XMLFilter.dll to the Windows\System32 folder, and then register it.

You can also download a Filter Explorer from Citeknet. This explorer walks the Registry and will list all the IFilter components available. It also can show all the file extensions that have no IFilter associated, meaning they are not searchable. This can be very useful in understanding what content is searchable or not. It also simulates the slightly different behavior of the different Microsoft products as to how each will read the Registry entries to find available IFilter components.

Full Text Search with Indexing Server

The following article describes how you can use Indexing Server to index and search files on the file system. Indexing Server can register filters automatically if they are added to the DLLsToRegister registry value (under HKLM\System\CurrentControlSet\Control\ContentIndex). When the Indexing service starts up, it calls DllRegisterServer for each DLL listed. In Windows 2003 and XP, this is a multi-string value, so you can edit it through the Windows Registry editor. In Windows 2000, this is a binary value. Some filters add themselves to this Registry value during registration; for example, the ZIP IFilter.
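
As a rough sketch, adding a filter DLL to DLLsToRegister could look like the Python below. It assumes the multi-string form of the value (Windows 2003 and XP) and administrator rights. The function names with_dll_added and add_to_dlls_to_register are hypothetical, and the case-insensitive duplicate check is an illustrative choice, not documented Indexing Server behavior.

```python
from typing import List

CONTENT_INDEX_KEY = "System\\CurrentControlSet\\Control\\ContentIndex"
VALUE_NAME = "DLLsToRegister"


def with_dll_added(current: List[str], dll_path: str) -> List[str]:
    """Return a copy of the list with dll_path appended unless it is
    already present (compared case-insensitively, like Windows paths)."""
    if any(entry.lower() == dll_path.lower() for entry in current):
        return list(current)
    return list(current) + [dll_path]


def add_to_dlls_to_register(dll_path: str) -> None:
    """Append dll_path to the DLLsToRegister multi-string (Windows only)."""
    import winreg  # imported here so the module still loads elsewhere

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, CONTENT_INDEX_KEY, 0, access
    ) as handle:
        try:
            current, value_type = winreg.QueryValueEx(handle, VALUE_NAME)
        except FileNotFoundError:
            current, value_type = [], winreg.REG_MULTI_SZ
        if value_type != winreg.REG_MULTI_SZ:
            # On Windows 2000 the value is binary; this sketch does not
            # attempt to rewrite that format.
            raise TypeError("DLLsToRegister is not a multi-string value")
        updated = with_dll_added(current or [], dll_path)
        winreg.SetValueEx(handle, VALUE_NAME, 0, winreg.REG_MULTI_SZ, updated)
```

Keeping the list manipulation in with_dll_added separate from the Registry access makes it easy to check the logic without modifying a real machine.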

A newly registered IFilter takes effect only after the Indexing service has been restarted or an individual Indexing catalog has been restarted (in which case it takes effect only for that catalog). Likewise, unregistering an IFilter takes effect only after the Indexing service or an individual Indexing catalog has been restarted. To remove already indexed content, you need to start a full rescan.
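
A restart of the whole Indexing service can be scripted. The sketch below assumes that the service is registered under the name cisvc and that the script runs with administrator rights. The function names are hypothetical, and the default is a dry run that only returns the commands it would execute.

```python
import subprocess
from typing import List

# Assumption: the Indexing service is registered under this service name.
SERVICE_NAME = "cisvc"


def restart_commands(service: str = SERVICE_NAME) -> List[List[str]]:
    """Build the stop and start commands for the given service."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_indexing_service(dry_run: bool = True) -> List[List[str]]:
    """Restart the service, or only report the commands when dry_run is set."""
    commands = restart_commands()
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(command, check=True)
    return commands
```

Calling restart_indexing_service(dry_run=False) on the indexing machine would stop and start the service so that newly registered or unregistered IFilters take effect. It does not trigger the full rescan needed to remove already indexed content.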
