In a previous post, I shared my journey with the Sitecore Data Exchange Framework.  I was excited to use it to create a reusable importer process.  In the end, I decided it was not the right tool for me and used PowerShell instead.  While researching site migration tools, I discovered “Merlin”.  Merlin uses YAML configuration files to crawl a site and create structured JSON representations of its pages.  A PowerShell script could then import this data, which makes for a fast way to create the basic structure of your site: items in the content tree, the correct hierarchy of pages, page titles and metadata, and basic page content.  I’ll use the blogs.perficient.com site to show some examples.

Setup

Merlin is a PHP-based tool.  Setup was super easy!  I used Chocolatey to install PHP and Composer.

choco install php
choco install composer

I downloaded the tool from GitHub.  Under the assets section, click “merlin-framework”.  This file is a PHP archive (phar) and runs from the command line or PowerShell with the php.exe command.

Crawl

Merlin includes a crawler that can be used to create a list of URLs for the generate feature.  The crawler can be configured (https://salsadigitalauorg.github.io/merlin-framework/docs/crawler) to cache results, include or exclude specific URLs by regex pattern, follow redirects, and ignore the robots.txt file.

To run the crawl command, navigate to the directory where you downloaded the tool.  Use the php executable to run the tool.  The -c option specifies your crawler config.  The -o option specifies where to put the output files.

PS> C:\tools\php81\php.exe merlin-framework crawl -c .\prft\crawler_blogs_merlin.yml -o .\output\prft

You can use multiple crawl configs to crawl different sections of your site.  The entity_type option sets the name of the output file: an entity_type of “blogs_site_structure” creates a URL list file called “crawled-urls-blogs_site_structure_default.yml”.  This makes it easier to break the site into groups of pages that have similar DOM layouts (e.g. blogs, news, products) for the generate command.

In this example, I limited the results to 50 URLs and restricted the crawl to specific directories.  You can use the crawler_include and include options to limit which URLs end up in the output file.


domain: https://blogs.perficient.com
entity_type: blogs_site_structure

options:
  cache_enabled: true
  delay: 500
  maximum_total: 50
  urls:
    - /2023/06/
    - /2023/05/
    - /2023/04/
    - /2023/03/
    - /2023/02/
    - /2023/01/
    - /2022/12/
    - /2022/11/
    - /2022/10/
    - /2022/09/
    - /2022/08/
    - /2022/07/
    - /2022/06/
  crawler_include:
    - '~/202[23]/\d{2}/\d{2}/.*~'
  include:
    - '~/202[23]/\d{2}/\d{2}/.*~'

Below is an excerpt from my output file.  The include attribute prevented https://blogs.perficient.com/2023/06/ and similar pages from being added to the output.


urls:
  - https://blogs.perficient.com/2023/06/11/install-docker-on-an-amazon-ec2-instance-using-the-yum-package-manager/
  - https://blogs.perficient.com/2023/06/10/introduction-to-terraform-day-1/
  - https://blogs.perficient.com/2023/06/09/unlocking-digital-accessibility-exploring-the-power-of-cognitive-assistive-technologies-2-2/
  - https://blogs.perficient.com/2023/06/09/unlocking-digital-accessibility-exploring-the-power-of-cognitive-assistive-technologies-2/
  - https://blogs.perficient.com/2023/06/09/oci-gen-2-refresh-token-setup-troubleshooting/
  - https://blogs.perficient.com/2023/06/09/what-if-college-was-just-a-pit-stop-an-interview-with-luca-ranzani/
  - https://blogs.perficient.com/2023/06/09/the-ev-leadership-of-luca-ranzini-proves-automotive-has-a-bright-future/
  - https://blogs.perficient.com/2023/06/09/unleashing-creativity-through-constraints/
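To see why the monthly archive pages drop out, the include pattern can be tried against a few paths outside of Merlin. This is a minimal Python sketch, not part of Merlin. It assumes the pattern is meant to read ~/202[23]/\d{2}/\d{2}/.*~ (where ~ is the PHP regex delimiter, which Python does not use) and that matching is unanchored, as with PHP's preg_match. The third path is made up for the example.

```python
import re

# Assumed form of the include pattern, without the PHP "~" delimiters.
INCLUDE = re.compile(r"/202[23]/\d{2}/\d{2}/.*")


def is_included(path: str) -> bool:
    """Return True when the path contains a /year/month/day/ prefix from 2022 or 2023."""
    return INCLUDE.search(path) is not None


candidates = [
    "/2023/06/",                                     # monthly archive page
    "/2023/06/10/introduction-to-terraform-day-1/",  # individual post
    "/2021/06/10/some-older-post/",                  # hypothetical path outside 2022-2023
]

for path in candidates:
    print(path, "->", "kept" if is_included(path) else "skipped")
```

Only the individual post is kept: the archive page has no day segment, and the hypothetical 2021 path falls outside the years the pattern allows.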

Generate

In order to generate your structured data, you need a list of URLs to process and a list of field mappings.  The generator can be configured (https://salsadigitalauorg.github.io/merlin-framework/docs/getting-started) to cache results, read from the cache to make future processing faster, use CSS selectors or XPath selectors to map fields to the DOM, and perform post-processing on the data.

To run the generate command, navigate to the directory where you downloaded the tool.  Use the php executable to run the tool.  The -c option specifies your generate config.  The -o option specifies where to put the output files.

PS> C:\tools\php81\php.exe merlin-framework generate -c .\prft\blogs_merlin.yml -o .\output\prft

Merlin defines several special data types to make mapping easy.

alias – The url of the page
link – Reads the href attribute and text of an anchor tag
long_text – Reads multiline text and rich text fields
meta – Reads the content attribute of a meta tag
media – Generates a separate file for media items grouped by the specified type
static_value – Outputs a string to the output as entered
taxonomy_term – Generates a separate file for taxonomy terms grouped by the specified type
text – Reads a single line text field
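The meta type is easiest to picture with a small example. The Python sketch below only illustrates the idea and is not Merlin's implementation: given an attribute name and a value to look for (the attr and value options used in the config later in this post), it returns the content attribute of the first matching meta tag. The HTML snippet and the read_meta helper are made up for the example.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Remember the content of the first <meta> tag whose attribute equals a value."""

    def __init__(self, attr: str, value: str):
        super().__init__()
        self.attr = attr
        self.value = value
        self.content = None

    def handle_starttag(self, tag, attrs):
        found = dict(attrs)
        if tag == "meta" and self.content is None and found.get(self.attr) == self.value:
            self.content = found.get("content")


def read_meta(html: str, attr: str, value: str):
    """Return the matching meta tag's content, or None when there is no match."""
    reader = MetaReader(attr, value)
    reader.feed(html)
    return reader.content


# Hypothetical page head for the example.
PAGE = """
<head>
  <meta name="description" content="A short summary of the page.">
  <meta property="og:site_name" content="Perficient Blogs" />
</head>
"""

print(read_meta(PAGE, "property", "og:site_name"))
print(read_meta(PAGE, "name", "description"))
```

A lookup that finds no tag (for example name = keywords on this snippet) returns None, so a page may simply lack a key that the config asked for.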

The text and long_text types have extra processors that can be applied to the result.

nl2br – Changes new lines to br tags
remove_empty_tags – Removes any empty tags (ie <p></p>)
replace – Regular expression based string replacement (NOTE: this uses PHP’s preg_replace function, which behaves differently from the regular expression engine in .NET)
strip_tags – Removes tags not in the allowed_tags list
whitespace – Removes extra whitespace characters
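To get a feel for what three of these processors do to a string, here is a rough Python approximation. Merlin's own processors are written in PHP and may differ in detail. The function bodies below are my assumptions based on the descriptions above (nl2br is modeled on PHP's nl2br function), and the sample input is made up.

```python
import re


def nl2br(text: str) -> str:
    """Insert a <br /> before each newline, modeled on PHP's nl2br."""
    return re.sub(r"(\r\n|\n|\r)", r"<br />\1", text)


def remove_empty_tags(html: str) -> str:
    """Drop elements with no content, such as <p></p>, repeating until nothing changes."""
    pattern = re.compile(r"<(\w+)\b[^>]*>\s*</\1>")
    previous = None
    while previous != html:
        previous = html
        html = pattern.sub("", html)
    return html


def whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Apply the processors in the order they would be listed on a field.
cleaned = "<p>Hello   world</p>\n<p></p>"
for step in (nl2br, remove_empty_tags, whitespace):
    cleaned = step(cleaned)

print(cleaned)
```

The order matters: running whitespace first would remove the newline before nl2br had a chance to turn it into a br tag.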

Compare the following config to the source of this page and see if you can match up the fields to the html source.


domain: https://blogs.perficient.com

urls:
  - /2023/05/30/perficient-included-in-two-commerce-focused-idc-market-glances/
  - /2023/05/16/getting-to-know-sitecore-search-part-4/

urls_file:
  - '..\output\prft\effective-urls-blogs_site_structure_default.yml' # relative to the location of this file

fetch_options:
  delay: 500
  ignore_ssl_errors: true

entity_type: blogs
mappings:
  - field: url
    type: alias

  - field: sitecore_root_path
    type: static_value
    options:
      value: "/sitecore/content/tenant/site/home/blogs"

  - field: sitecore_template_id
    type: static_value
    options:
      value: "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}"

  - field: meta_title
    type: text
    selector: //title

  - field: meta_keywords
    type: meta
    options:
      value: keywords
      attr: name

  - field: meta_description
    type: meta
    options:
      value: description
      attr: name

  - field: meta_og_title
    type: meta
    options:
      value: og:title
      attr: property

  - field: meta_og_description
    type: meta
    options:
      value: og:description
      attr: property

  - field: meta_og_sitename
    type: meta
    options:
      value: og:site_name
      attr: property

  - field: meta_og_type
    type: meta
    options:
      value: og:type
      attr: property

  - field: meta_og_image
    type: meta
    options:
      value: og:image
      attr: property

  - field: meta_published_time
    type: meta
    options:
      value: article:published_time
      attr: property

  - field: featured_image
    type: media
    selector: div.story-two-header-content-img img
    options:
      file: src
      alt: alt
      type: featured_images

  - field: title
    selector: h1:first-of-type
    type: text
    processors:
      - processor: nl2br

  - field: primary_category
    selector: p.eyebrow-header-eyebrow
    type: text

  - field: author
    selector: h4.byline span.author a
    type: text

  - field: date
    selector: h4.byline span.date
    type: text

  - field: content
    selector: div.entry
    type: long_text
    processors:
      - processor: nl2br
      - processor: remove_empty_tags
      - processor: whitespace

  - field: content_images
    type: media
    selector: div.entry img
    options:
      file: src
      alt: alt
      type: content_images

  - field: author_page
    type: link
    selector: div.author-avatar-and-name-avatar a
    options:
      link: href

  - field: author_image
    type: media
    selector: div.author-avatar-and-name-avatar img
    options:
      file: src
      alt: alt
      type: author_images

  - field: author_bio
    selector: div.author-avatar-and-name-description p:first-of-type
    type: text
    processors:
      - processor: replace
        pattern: "More from this Author"

  - field: categories
    selector: //div[@class="widget"]//ul/li # taxonomy_term only works with an XPath selector
    type: taxonomy_term
    vocab: category
    children:
      - field: uuid
        type: uuid
        selector: a
      - field: name
        type: text
        selector: a

  - field: tags
    selector: div.tags-author-info a
    type: text

The output file is in JSON format, with your field names as the keys and the content matched by your selectors as the values.  I included two fields that make it easier to import this data into Sitecore.

sitecore_root_path – A static_value that contains the path to use as the parent when creating the new page in Sitecore
sitecore_template_id – A static_value that contains the id of the template to use when creating the new page in Sitecore

[
  {
    "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
    "meta_title": "Getting to know Sitecore Search – Part 4 / Blogs / Perficient",
    "meta_description": "Dig into the weeds of managing your search sources. Learn about triggers, extractors, attributes, excluding urls, and scan frequency.",
    "meta_og_title": "Getting to know Sitecore Search – Part 4 / Blogs / Perficient",
    "meta_og_description": "Dig into the weeds of managing your search sources. Learn about triggers, extractors, attributes, excluding urls, and scan frequency.",
    "meta_og_sitename": "Perficient Blogs",
    "meta_og_type": "article",
    "meta_og_image": "https://blogs.perficient.com/files/forest-simon-TX0ufDSCV4-unsplash-scaled.jpg",
    "meta_published_time": "2023-05-16T13:30:04+00:00",
    "featured_image": [
      "55789139-9512-33f6-bf85-cd1897eb36fa"
    ],
    "title": "Getting to know Sitecore Search – Part 4",
    "primary_category": "Sitecore",
    "author": "Eric Sanner",
    "date": "May 16th, 2023",
    "content": {
      "format": "rich_text",
      "value": "<div id=\"bsf_rt_marker\"><p>Welcome back to getting to know Sitecore search</p>shortened for brevity<p>In the next post, we’ll build a simple UI and connect to the api to get our first real results!</p></div>"
    },
    "content_images": [
      "b1b335ec-95a7-3609-89d0-65805abd3c68",
      "2eeacd44-f2d4-337c-86eb-ee1249bcf10b",
      "974e0ccb-ce39-3fc2-bfd7-299dbfd1f9b1",
      "379645ef-b0d8-3fa9-8088-49974bc4312e"
    ],
    "author_page": [
      {
        "link": "https://blogs.perficient.com/author/esanner/",
        "text": ""
      }
    ],
    "author_image": [
      "43d4f615-da07-3a01-ad1f-5f53d4b11835"
    ],
    "author_bio": "",
    "categories": [
      "205dd6d7-887c-3501-b20a-3a2137437a47",
      "00ba851e-b649-30d3-902a-3a32d230110f",
      "8492492a-25c4-3c45-95a1-59c3d6b59620"
    ],
    "tags": [
      "Sitecore",
      "Sitecore.Search"
    ]
  },
  {
    "url": "/2023/05/23/the-dialogue-element-modals-made-simple/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
    "meta_title": "The Dialogue Element: Modals Made Simple / Blogs / Perficient",
    "meta_description": "The new dialogue element makes modals simple. Learn to create user-friendly modals with ease using the new HTML dialogue element.",
    "meta_og_title": "The Dialogue Element: Modals Made Simple / Blogs / Perficient",
    "meta_og_description": "The new dialogue element makes modals simple. Learn to create user-friendly modals with ease using the new HTML dialogue element.",
    "meta_og_sitename": "Perficient Blogs",
    "meta_og_type": "article",
    "meta_og_image": "https://blogs.perficient.com/files/Group-of-People-Holding-Speech-bubbles-scaled.jpg",
    "meta_published_time": "2023-05-23T13:17:20+00:00",
    "featured_image": [
      "f8a33d23-893e-3625-8665-76bf657c8e72"
    ],
    "title": "The Dialogue Element: Modals Made Simple",
    "primary_category": "Accessibility",
    "author": "Drew Taylor",
    "date": "May 23rd, 2023",
    "content": {
      "format": "rich_text",
      "value": "<div id=\"bsf_rt_marker\"><p>Any front-end developer has likely experienced the pain of covering an exhaustive list of accessibility and UI edge cases while implementing modals. Well guess what? Not any longer.</p>shortened for brevity<p>The dialog is just another HTML element. Style it with CSS just as any other HTML element.</p></div>"
    },
    "content_images": [
      "add188a4-2d87-32ae-9a63-77bbfbaef6fb"
    ],
    "author_page": [
      {
        "link": "https://blogs.perficient.com/author/drewtaylor/",
        "text": ""
      }
    ],
    "author_image": [
      "3cbd7fd0-687e-3856-b212-38848e781de0"
    ],
    "author_bio": "Drew is a technical consultant at Perficient. He enjoys writing code and books, talking AI, and advocating accessibility.",
    "categories": [
      "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304"
    ],
    "tags": [
      "accessibility",
      "AI",
      "modal",
      "UX"
    ]
  }
]
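As a rough illustration of how this output could feed an importer, the sketch below turns each record into an item name, a parent path, and a template id. It is written in Python for brevity and only plans the import. Actually creating the items in Sitecore (for example with a PowerShell script) is not shown. The helper names (item_name_from_url, parent_path, plan_import), the naming rule, and the choice to mirror the year/month/day folders are my assumptions, not something Merlin or Sitecore prescribes. The sample records are trimmed copies of the output above.

```python
import json
import re

# Trimmed sample of the generated output (only the keys this sketch uses).
SAMPLE = """
[
  {
    "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}"
  },
  {
    "url": "/2023/05/23/the-dialogue-element-modals-made-simple/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}"
  }
]
"""


def item_name_from_url(url: str) -> str:
    """Use the last path segment as the item name, replacing unexpected characters."""
    slug = url.strip("/").split("/")[-1]
    return re.sub(r"[^A-Za-z0-9-]", "-", slug)


def parent_path(record: dict) -> str:
    """Place the page under root/year/month/day, mirroring the source URL (an assumption)."""
    folders = record["url"].strip("/").split("/")[:-1]
    return "/".join([record["sitecore_root_path"], *folders])


def plan_import(records: list) -> list:
    """Build one planned item per record without touching Sitecore."""
    return [
        {
            "name": item_name_from_url(record["url"]),
            "parent": parent_path(record),
            "template": record["sitecore_template_id"],
        }
        for record in records
    ]


plan = plan_import(json.loads(SAMPLE))
for entry in plan:
    print(entry["parent"] + "/" + entry["name"], entry["template"])
```

Keeping the planning step separate from item creation makes it possible to review the resulting tree before anything is written to the content database.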

The taxonomy_term field type creates a guid value for the category and outputs the mapping in a separate file.  These items could be imported into Sitecore first so the content can be mapped to the correct category.

{
  "data": [
    {
      "uuid": "205dd6d7-887c-3501-b20a-3a2137437a47",
      "name": "Technical"
    },
    {
      "uuid": "00ba851e-b649-30d3-902a-3a32d230110f",
      "name": "Development"
    },
    {
      "uuid": "8492492a-25c4-3c45-95a1-59c3d6b59620",
      "name": "Sitecore"
    },
    {
      "uuid": "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304",
      "name": "Accessibility"
    }
  ]
}
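Mapping a page's category ids back to names is a simple lookup. The sketch below is a minimal Python illustration using the values shown above. The helper names (build_lookup, category_names) are hypothetical, and a real import would map each uuid to the Sitecore item created for that category rather than to its name.

```python
import json

# The taxonomy file shown above, inlined for the example.
TAXONOMY = json.loads("""
{
  "data": [
    {"uuid": "205dd6d7-887c-3501-b20a-3a2137437a47", "name": "Technical"},
    {"uuid": "00ba851e-b649-30d3-902a-3a32d230110f", "name": "Development"},
    {"uuid": "8492492a-25c4-3c45-95a1-59c3d6b59620", "name": "Sitecore"},
    {"uuid": "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304", "name": "Accessibility"}
  ]
}
""")

# The "categories" value of the first page in the generated output.
PAGE_CATEGORIES = [
    "205dd6d7-887c-3501-b20a-3a2137437a47",
    "00ba851e-b649-30d3-902a-3a32d230110f",
    "8492492a-25c4-3c45-95a1-59c3d6b59620",
]


def build_lookup(taxonomy: dict) -> dict:
    """Index the taxonomy terms by uuid."""
    return {term["uuid"]: term["name"] for term in taxonomy["data"]}


def category_names(uuids: list, lookup: dict) -> list:
    """Translate uuids to names, flagging any id missing from the taxonomy file."""
    return [lookup.get(uuid, "<unknown>") for uuid in uuids]


lookup = build_lookup(TAXONOMY)
print(category_names(PAGE_CATEGORIES, lookup))
```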

Conclusion

Merlin is a really neat tool!  I’m excited to use it next time I’m on a site migration project.  I believe it can save a lot of time migrating content.  How happy would content authors be if they did not have to create each new page and set its name, display name, title, and metadata manually?  The configuration files take some effort to tweak as you add more fields and adjust to get exactly the right output, but once you have an idea of how the field types work, it becomes easier to create new configurations.  The caching feature helps avoid overloading the source website with requests.