crawl(tree, [baseHref])

Crawls the indicated live site and returns the complete tree of reachable resources.

(If you already know the specific resources you want to fetch from a site, you can extract just those resources instead of crawling the whole site.)

Crawl an existing site

You can use crawl to traverse an existing website and copy the resulting tree for local inspection.

In this case, the tree parameter is typically a SiteTree. A convenient way to wrap an existing site is with the httpstree protocol (or httptree for non-secure HTTP sites) in a URL.

For example, you can copy the original Space Jam website to a local folder called spacejam via:

$ ori "copy crawl(httpstree://www.spacejam.com/1996/), files:spacejam"

On a machine that doesn’t have Origami installed, you can invoke ori via npm’s npx command:

$ npx ori "copy crawl(httpstree://www.spacejam.com/1996/), files:spacejam"

Crawling is a network-intensive operation, so crawling a site like the (surprisingly large!) one above can take several minutes to complete.

Starting points

A crawl begins by looking for any of:

/
/index.html
robots.txt
sitemap.xml

From these starting points, the crawler will follow links to additional resources.
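
As a rough illustration of the starting points above, the following Python sketch builds the candidate URLs for a site. It is not Origami's implementation: the function name `starting_urls` is hypothetical, and the sketch assumes that each starting point is resolved relative to the root of the tree being crawled.

```python
from urllib.parse import urljoin

# Starting points from the list above.
STARTING_POINTS = ["/", "/index.html", "robots.txt", "sitemap.xml"]


def starting_urls(base_href):
    """Resolve each starting point against the root of the crawled tree.

    Assumption: paths are treated as relative to base_href, so a leading
    slash is dropped before joining.
    """
    return [urljoin(base_href, path.lstrip("/")) for path in STARTING_POINTS]


for url in starting_urls("https://www.spacejam.com/1996/"):
    print(url)
```

A crawler would try each of these URLs and silently skip the ones that do not exist.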

Broken links

If the crawl finds internal links whose targets do not exist, it lists them in a crawl-errors.json entry at the top level of the returned tree.

If you just want to check a site for broken links, see the related dev:audit builtin.
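
To make the idea of a broken internal link concrete, here is a minimal Python sketch of the check. The data, the function name `find_broken_links`, and the shape of the report are all hypothetical; the actual layout of crawl-errors.json is not described here and may differ.

```python
import json


def find_broken_links(links_by_page, existing_paths):
    """Return {page: [missing targets]} for links whose target does not exist."""
    broken = {}
    for page, targets in links_by_page.items():
        missing = [target for target in targets if target not in existing_paths]
        if missing:
            broken[page] = missing
    return broken


# Hypothetical crawl results: each page and the internal paths it links to.
links_by_page = {
    "index.html": ["about.html", "team.html"],
    "about.html": ["index.html"],
}
# Paths that were actually found during the crawl.
existing_paths = {"index.html", "about.html"}

report = find_broken_links(links_by_page, existing_paths)
print(json.dumps(report, indent=2))
```

In this example only index.html appears in the report, because its link to team.html has no matching resource.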

Supported reference types

The crawler analyzes the following types of files:

HTML files
CSS files
JavaScript modules
Image maps
Sitemap files
robots.txt

In HTML, the crawler finds references to other pages and resources by examining:

href attribute in elements: <a>, <area>, <image>, <filter>, <link>, <mpath>, <pattern>, <use>. (The crawler currently cannot find paths in SVG elements with mixed-case names: <feImage>, <linearGradient>, <radialGradient>, or <textPath>.)
src attribute in elements: <audio>, <embed>, <frame>, <iframe>, <img>, <input>, <script>, <source>, <track>, <video>
srcset attribute in <img> and <source> elements
poster attribute in <video> elements
data attribute in <object> elements
deprecated background attribute in elements: <body>, <table>, <td>, <th>
deprecated longdesc attribute in <img> elements
content attribute in <meta> elements with a property attribute ending in :image (like og:image)
CSS in <style> elements or style attributes
JavaScript in <script> elements with a type attribute of "module"
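
The attribute rules above can be pictured with a small Python sketch built on the standard library's `html.parser`. This is only an illustration under stated assumptions, not how Origami's crawler is written: it covers a subset of the listed attributes, the names `ReferenceFinder` and `find_references` are hypothetical, and the srcset handling naively splits on commas.

```python
from html.parser import HTMLParser

# A subset of the attribute/element pairs listed above.
REFERENCE_ATTRIBUTES = {
    "href": {"a", "area", "link"},
    "src": {"audio", "embed", "iframe", "img", "script", "source", "video"},
    "poster": {"video"},
    "data": {"object"},
}
SRCSET_TAGS = {"img", "source"}


class ReferenceFinder(HTMLParser):
    """Collect paths referenced by the attributes above, in document order."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if tag in REFERENCE_ATTRIBUTES.get(name, ()):
                self.references.append(value)
            elif name == "srcset" and tag in SRCSET_TAGS:
                # Each comma-separated candidate is "URL [descriptor]".
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self.references.append(parts[0])


def find_references(html):
    finder = ReferenceFinder()
    finder.feed(html)
    return finder.references


sample = """
<a href="about.html">About</a>
<img src="logo.png" srcset="logo.png 1x, logo@2x.png 2x">
<video poster="still.jpg" src="clip.mp4"></video>
"""
print(find_references(sample))
```

A real crawler would also resolve each path against the page's URL and discard references that point outside the site.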

In JavaScript, the crawler finds references in:

import statements (not dynamic import calls)
export statements
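
The distinction between static import/export statements and dynamic import calls can be sketched with a regular expression in Python. This is a deliberately naive approximation with hypothetical names (`STATIC_REFERENCE`, `find_module_references`); a real implementation would need a proper JavaScript parser to cope with comments, strings, and code written without semicolons.

```python
import re

# Matches the module path in static statements that begin a line:
#   import ... from "x"   export ... from "x"   import "x"
# Dynamic import("x") calls are intentionally not matched.
STATIC_REFERENCE = re.compile(
    r"""^\s*(?:
        (?:import|export)\b[^;'"]*?\bfrom\s*   # import/export ... from
        |
        import\s*                              # bare side-effect import
    )['"]([^'"]+)['"]""",
    re.MULTILINE | re.VERBOSE,
)


def find_module_references(source):
    """Return the module paths named in static import/export statements."""
    return STATIC_REFERENCE.findall(source)


sample = """
import { a } from "./a.js";
import "./setup.js";
export * from './b.js';
export const x = 1;
const lazy = () => import("./lazy.js");
"""
print(find_module_references(sample))
```

In the sample, ./lazy.js is not reported because it is loaded through a dynamic import call, and the plain `export const` line is skipped because it names no module.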

In CSS, the crawler finds references in:

@font-face declarations
@import declarations
@namespace declarations
url() functions
image(), image-set(), and cross-fade() functions
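
As an illustration of the CSS rules above, this Python sketch pulls paths out of `@import` strings and `url()` functions. It is a simplified approximation with hypothetical names, not the crawler's actual logic: it ignores @namespace, image(), image-set(), and cross-fade(), and it does not skip CSS comments.

```python
import re

# url(...) with optional quotes around the path.
URL_FUNCTION = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")
# @import "path" (the @import url(...) form is caught by URL_FUNCTION).
IMPORT_STRING = re.compile(r"""@import\s+['"]([^'"]+)['"]""")


def find_css_references(css):
    """Return paths from @import strings first, then from url() functions."""
    return IMPORT_STRING.findall(css) + URL_FUNCTION.findall(css)


sample = """
@import "base.css";
@font-face { font-family: "Demo"; src: url("fonts/demo.woff2"); }
body { background: url(images/bg.png); }
"""
print(find_css_references(sample))
```

Because url() can appear inside an @font-face declaration, the font file in the sample is found by the same pattern that finds the background image.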