On a site powered by Sitecore 6.2, I need to give the user the ability to selectively exclude items from search results.
To accomplish this, I have added a checkbox field entitled “Include in Search Results”, and I created a custom database crawler to check that field’s value:
~\App_Config\Include\Search Indexes\Website.config:
<search>
  <configuration type="Sitecore.Search.SearchConfiguration, Sitecore.Kernel" singleInstance="true">
    <indexes hint="list:AddIndex">
      <index id="website" singleInstance="true" type="Sitecore.Search.Index, Sitecore.Kernel">
        ...
        <locations hint="list:AddCrawler">
          <master type="MyProject.Lib.Search.Indexing.CustomCrawler, MyProject">
            ...
          </master>
          <!-- Similar entry for web database. -->
        </locations>
      </index>
    </indexes>
  </configuration>
</search>
~\Lib\Search\Indexing\CustomCrawler.cs:
using Lucene.Net.Documents;
using Sitecore.Search.Crawlers;
using Sitecore.Data.Items;

namespace MyProject.Lib.Search.Indexing
{
    public class CustomCrawler : DatabaseCrawler
    {
        /// <summary>
        /// Determines if the item should be included in the index.
        /// </summary>
        /// <param name="item"></param>
        /// <returns></returns>
        protected override bool IsMatch(Item item)
        {
            if (item["include in search results"] != "1")
            {
                return false;
            }
            return base.IsMatch(item);
        }
    }
}
What’s interesting is, if I rebuild the index using the Index Viewer application, everything behaves as normal. Items whose “Include in Search Results” checkbox is not checked will not be included in the search index.
However, when I use the search index rebuilder in the Sitecore Control Panel application or when the IndexingManager auto-updates the search index, all items are included, regardless of the state of their “Include in Search Results” checkbox.
I’ve also set numerous breakpoints in my custom crawler class, and the application never hits any of them when I rebuild the search index using the built-in indexer. When I use Index Viewer, it does hit all the breakpoints I’ve set.
How do I get Sitecore’s built-in indexing processes to respect my “Include in Search Results” checkbox?
I spoke with Alex Shyba yesterday, and we were able to figure out what was going on. There were a couple of problems with my configuration that were preventing everything from working correctly:
As Seth noted, there are two distinct search APIs in Sitecore. My configuration file was using both of them. To use the newer API, only the sitecore/search/configuration section needs to be set up. (In addition to what I posted in my OP, I was also adding indexes in sitecore/indexes and sitecore/databases/database/indexes, which is not correct.)
Instead of overriding IsMatch(), I should have been overriding AddItem(). Because of the way Lucene works, you can’t update a document in place; instead, you have to first delete it and then add the updated version.
When Sitecore.Search.Crawlers.DatabaseCrawler.UpdateItem() runs, it checks IsMatch() to see if it should delete and re-add the item. If IsMatch() returns false, the item won’t be removed from the index even if it shouldn’t be there in the first place.
By overriding AddItem(), I was able to instruct the crawler whether the item should be added to the index after its existing documents had already been removed. Here is what the updated class looks like:
~\Lib\Search\Indexing\CustomCrawler.cs:
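The override can be sketched as follows. This is a minimal sketch rather than a verified listing: it assumes DatabaseCrawler exposes AddItem(Item) as an overridable method, as the explanation above implies, and it reuses the field name from the original crawler.

```csharp
using Sitecore.Data.Items;
using Sitecore.Search.Crawlers;

namespace MyProject.Lib.Search.Indexing
{
    public class CustomCrawler : DatabaseCrawler
    {
        /// <summary>
        /// Adds the item to the index only if its
        /// "Include in Search Results" checkbox is checked.
        /// </summary>
        // Assumption: AddItem(Item) is a protected virtual method on DatabaseCrawler.
        protected override void AddItem(Item item)
        {
            // Sitecore stores a checked checkbox field as "1".
            if (item["include in search results"] == "1")
            {
                base.AddItem(item);
            }
            // Otherwise do nothing: any existing documents for this item
            // were already deleted before AddItem() was called.
        }
    }
}
```

Because IsMatch() is no longer overridden, UpdateItem() still deletes the item's existing documents, and the check in AddItem() only decides whether a fresh document is written back.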
Alex also pointed out that some of my scalability settings were incorrect. Specifically:
The InstanceName setting was empty, which can cause problems on ephemeral (cloud) instances where the machine name might change between executions. We changed this setting on each instance to have a constant and distinct value (e.g., CMS and CD).
The Indexing.ServerSpecificProperties setting needs to be true so that each instance maintains its own record of when it last updated its search index.
The EnableEventQueues setting needs to be true to prevent race conditions between the search indexing and cache flush processes.
When in development, the Indexing.UpdateInterval should be set to a relatively small value (e.g., 00:00:15). This is not great for production environments, but it cuts down on the amount of waiting you have to do when troubleshooting search indexing problems.
Make sure the history engine is turned on for each web database, including remote publishing targets:
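For reference, the settings above might look roughly like this in web.config (or an equivalent include file). This is a sketch under assumptions: the values are the examples mentioned above, and the history engine block is modeled on the definition that a stock Sitecore configuration ships for the master database, so check it against your own version before copying it.

```xml
<sitecore>
  <settings>
    <!-- Constant, distinct name per instance (e.g., CMS on authoring, CD on delivery). -->
    <setting name="InstanceName" value="CMS" />
    <!-- Each instance keeps its own record of its last index update. -->
    <setting name="Indexing.ServerSpecificProperties" value="true" />
    <!-- Avoids races between search indexing and cache flushing. -->
    <setting name="EnableEventQueues" value="true" />
    <!-- Development only: a short interval reduces waiting while troubleshooting. -->
    <setting name="Indexing.UpdateInterval" value="00:00:15" />
  </settings>
  <databases>
    <database id="web" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
      <!-- Existing child elements omitted. -->
      <!-- Assumed history engine definition, copied in shape from the master database entry. -->
      <Engines.HistoryEngine.Storage>
        <obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
          <param connectionStringName="$(id)" />
          <EntryLifeTime>30.00:00:00</EntryLifeTime>
        </obj>
      </Engines.HistoryEngine.Storage>
      <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
    </database>
  </databases>
</sitecore>
```

The same history engine block would be repeated under each additional web database element, such as a remote publishing target.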
Because CD instances have no access to the Sitecore backend, I also installed RebuildDatabaseCrawlers.aspx (from this article) so that I can rebuild their search indexes manually.