Welcome back to Getting to Know Sitecore Search.  In this post, we’ll focus on the advanced web crawler.  We’ll modify the source, document extractors, taggers, and attributes to customize our indexed documents.

Manage Sources

To manage sources, visit Administration/Sources.

Administration screen

From this screen you can see a list of your sources, including each source’s name, connector type, last run status, number of items indexed, publish status, last modified date, and last modified by.

Manage sources

Add a Source

When you add a source, you’ll give it a name and description and select a connector type.  Once you set your connector type, you will not be able to change it later.

API Crawler – Read site data from an API

Only allows JS and Request triggers
Only allows JS and JSONPath document extractors

Push API – Send new site URLs via API

You cannot define triggers, document extractors, scan frequency, etc.

Appflow Crawler – You can’t actually create this type of connector. You get an error.
Feed Crawler – You can’t actually create this type of connector. You get an error.
Web Crawler – A simple web crawler

Only allows request, sitemap, or sitemap index triggers
Only allows XPath document extractors

Web Crawler (Advanced) – The default and most configurable web crawler

Just like pages and widgets, new sources are in draft state until they are published.  Sources in a draft state cannot be seen or modified by other users.  You can create multiple sources with the same connector type.  For the rest of this post, we’ll use the Web Crawler (Advanced).

Web Crawler Settings

The web crawler settings allow you to set the allowed domains, excluded URL patterns, and whether the crawler renders JavaScript.  If the crawler encounters any URL that is not in the allowed domains list, it will not be added to the catalog.  You can exclude patterns via regular expressions or glob expressions.  Render JavaScript is disabled by default.  Enabling JavaScript rendering will cause the crawler to run slower.  Be sure to view the source of your pages.  Does the source match what is on the screen?  If not, you may need to enable JavaScript rendering so the crawler can find all of the content on your site.
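
For example, exclusion patterns might look something like this (the paths are hypothetical; use whichever pattern type you select in the UI and verify the exact syntax it expects):

Regular expression: .*/search-results/.*
Glob expression: /author/*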

Triggers

A trigger is the starting point for the crawler.  Each source can have multiple triggers.  You can have multiple triggers of the same type.

JS – Use a JavaScript function to return an array of URLs to crawl
Request – A single URL to crawl
Sitemap – Link to a single sitemap file
Sitemap Index – Link to a sitemap index file
RSS – Link to an RSS feed

Here is an example of a JavaScript trigger function.  The JS trigger essentially lets you define multiple request triggers manually.

function extract() {
  return [
    {
      // Each entry is a URL for the crawler to request
      "url": "https://www.perficient.com/success-stories/large-midwest-healthcare-payor"
    },
    {
      "url": "https://www.perficient.com/success-stories/large-rural-health-system"
    }
  ];
}

Document Extractors

A document extractor creates an index document for a url.  Each source can have multiple document extractors.  You can have multiple extractors of the same type.

CSS – Use CSS selectors to get attributes from the HTML
XPath – Use XPath syntax to get attributes from the HTML
JS – Use custom JavaScript to get attributes from the HTML

Here is an example of a JavaScript document extractor function.  Each key is the attribute to populate.  The value is one or more jQuery-like expressions that pull content from the document.  You can use an OR (two vertical bars) to fall back to alternate selectors.

function extract(request, response) {
  // response.body exposes a jQuery-like API over the crawled page's HTML
  $ = response.body;
  return [{
    'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
    'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
    'type': $('meta[property="og:type"]').attr('content') || 'website_content',
    'url': $('meta[property="og:url"]').attr('content')
  }];
}

We could recreate this same extractor in either XPath or CSS.

XPath selectors for document extractor
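
For reference, the XPath expressions that recreate the JS extractor above might look something like this (XPath has no equivalent of the two-vertical-bar fallback, so only the primary selector is shown for each attribute):

description: //meta[@name='description']/@content
name: //meta[@name='searchtitle']/@content
type: //meta[@property='og:type']/@content
url: //meta[@property='og:url']/@content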

I don’t use XPath very often.  I found this handy reference guide very helpful: https://devhints.io/xpath. You can test your XPath expressions in your browser’s dev tools very easily.

On the Elements tab, you can press Ctrl+F to use the search bar at the bottom of the window
On the Console tab, you can use the built-in $x(“xpath expression”) function
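
For example, assuming you want to test the description selector from the extractor above, you could run something like this in the console:

// $x returns an array of matching nodes; .value reads an attribute node's value
$x("//meta[@name='description']/@content")[0].value

// Text nodes work too
$x("//title/text()")[0].textContent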

Each document extractor defines a list of URLs to operate on and can tag the document differently.  For example, you might tag your news articles at /news/* differently than blog articles at /blog/*.  Remember to use a fixed field string for your content type.  We’ll use this field later as a way to facet our search results.  The order of document extractors matters.  A document won’t be indexed multiple times on the same source, so if an extractor that matches all URLs sits above a more specific extractor, the specific extractor will never execute.  I use https://regex101.com to test my regular expressions before I publish my source.

Document extractor with URLs to Match field set with a regular expression
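
As a hypothetical example, two extractors on the same source might use URLs to Match patterns like these (the paths are made up to mirror the /blog/ and /news/ example above):

https://www\.perficient\.com/blog/.* – tag type as ‘blog_article’
https://www\.perficient\.com/news/.* – tag type as ‘news_article’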

Scan Frequency

You can schedule your source to automatically scan your site for updates.  Choose your frequency and set your start/end dates.  You can choose the following options for frequency.

Hourly
Every 6 hours
Every 12 hours
Daily
Weekly

Publishing a Source

Once you’ve updated your source, you must publish it to rescan/reindex your site.  When you click the publish button, it checks your triggers, document extractors, request extractors, and locale extractors for errors.  It would be nice to get feedback sooner in the process (like when you save each item rather than having to publish).

Publish source with invalid XPath in the document extractor

If everything is correct, your source will be queued for a scan. You can monitor the main Administration/Sources page for status updates.

Sources listing page with the status of the last scan

Once your scan is complete, check the catalog for new pages and updates to your index documents. Note that it does take a few minutes for the catalog to update correctly.

Content Catalog

I found the content catalog a little lacking for debugging my document extractors.  You can only search specifically by name or ID; I would like to be able to search by URL as well.  You can filter by content type, which is helpful, but it would be nice to be able to filter by source as well.  Being able to sort by the columns would make it easier to find duplicate content items.

Unwanted Results

We won’t worry too much about unwanted documents since we can blacklist them in our widget settings.  You can tweak the sitemap.xml file on your site to exclude unwanted URLs, or you can exclude them by pattern in the web crawler settings.

Exclude URLs by regular expression in web crawler settings

Attributes

Only a user with the tech admin role can add new attributes.  Click Administration/Domain Settings/Attributes to view the list of available attributes, modify attribute properties, and add new attributes.

Attributes details list

Attributes can be any of the following data types

Integer
Double
Array of Integers
Float
Array of Float
Boolean
String
Array of Strings
Object
Array of Objects
Timestamp
Date Time

Attributes are grouped together by a field called entity.  I have not found a way to add additional entities.  I would have expected to be able to create a set of attributes for web content, PDFs, Word documents, and other types of content that could logically have different sets of attributes.  Remember to publish your domain settings so your attributes are available to your document extractors.
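
As a minimal sketch, assuming you’ve added (and published) a hypothetical ‘tags’ attribute with the Array of Strings data type, a JS document extractor could populate it from a keywords meta tag like this:

function extract(request, response) {
  $ = response.body;
  return [{
    // 'tags' is a hypothetical custom attribute defined under Administration/Domain Settings/Attributes
    'tags': ($('meta[name="keywords"]').attr('content') || '')
      .split(',')
      .map(function (t) { return t.trim(); })
      .filter(function (t) { return t.length > 0; })
  }];
}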

Delete a Source

You can only delete a source that is in draft state.  Once it has been published, your only option is to archive the source.  Archived sources are hidden from the UI by default, but you can click the “Show Archived” toggle to view them again.  You can restore an archived source, which returns it to draft state.  Archiving or deleting a source removes all of the associated documents from the catalog.

Add a Source or Modify a Source?

A source can cover multiple URLs and can have multiple document extractors.  So when does it make sense to add a new source vs. adding a document extractor to an existing source?  I think it makes sense to have a new source for separate websites or subdomains of your site.  That way you can’t have the same URL accidentally indexed multiple times.  For different sections of your site, it makes sense to create a document extractor and use the URLs to Match field to limit which pages use the extractor.

Up Next

In the next post, we’ll build a simple UI and connect to the API to get our first real results!

Follow me on LinkedIn to get notifications as new articles are posted.