Crawls the indicated live site and returns the complete tree of reachable resources.
(If you already have a specific set of known resources you want to fetch from a site, you can extract specific resources from a site.)
Crawl an existing site #
You can use crawl to traverse an existing website and copy the resulting tree for local inspection.
In this case, the tree parameter is typically a SiteTree. A convenient way to wrap an existing site is with the httpstree protocol (or httptree for non-secure HTTP sites) in a URL.
For example, you can copy the original Space Jam website to a local folder called spacejam via:
$ ori "copy crawl(httpstree://www.spacejam.com/1996/), files:spacejam"
On a machine that doesn’t have Origami installed, you can invoke ori via npm’s npx command:
$ npx ori "copy crawl(httpstree://www.spacejam.com/1996/), files:spacejam"
Crawling is a network-intensive operation, so a command to crawl a site like the (surprisingly large!) site above can take a long time to complete – on the order of minutes.
Starting points #
A crawl begins by looking for any of:
- /
- /index.html
- robots.txt
- sitemap.xml
From these starting points, the crawler will follow links to additional resources.
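To illustrate the idea of following links outward from a set of starting points, here is a minimal Python sketch of a breadth-first walk. It is not how Origami implements crawl: the in-memory SITE dictionary, the START_POINTS list, and the crawl function are all hypothetical stand-ins for fetching pages over the network.

```python
from collections import deque

# Hypothetical in-memory "site": each path maps to the paths it links to.
SITE = {
    "/index.html": ["/about.html", "/img/logo.png"],
    "/about.html": ["/index.html", "/team.html"],
    "/team.html": [],
    "/img/logo.png": [],
    "/robots.txt": ["/sitemap.xml"],
    "/sitemap.xml": ["/index.html", "/orphan.html"],
    "/orphan.html": [],
    "/unlinked.html": [],  # nothing links here, so a crawl never reaches it
}

# Assumed starting points, modeled on the ones named in this section.
START_POINTS = ["/", "/index.html", "/robots.txt", "/sitemap.xml"]


def crawl(site, start_points=START_POINTS):
    """Breadth-first walk from the starting points; returns reachable paths."""
    queue = deque(path for path in start_points if path in site)
    seen = set(queue)
    while queue:
        path = queue.popleft()
        for link in site[path]:
            # Only follow links that exist and have not been visited yet.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


reachable = crawl(SITE)
print(reachable)
```

In this sketch, /orphan.html is reached only because the sitemap lists it, while /unlinked.html is never found because no starting point leads to it.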
Broken links #
If the crawl operation finds links to internal references that do not exist, it will return those in a crawl-errors.json entry at the top level of the returned tree.
If you just want to check a site for broken links, see the related dev:audit builtin.
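The notion of a broken internal link can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pages dictionary and the find_broken_links function are hypothetical, and the shape of the output is not taken from the actual crawl-errors.json format, which this page does not describe.

```python
import json

# Hypothetical crawl result: each path maps to the internal links found in it.
pages = {
    "/index.html": ["/about.html", "/missing.html"],
    "/about.html": ["/index.html", "/img/gone.png"],
}


def find_broken_links(pages):
    """Map each page to the internal links that do not resolve to a known page."""
    errors = {}
    for path, links in pages.items():
        missing = [link for link in links if link not in pages]
        if missing:
            errors[path] = missing
    return errors


errors = find_broken_links(pages)
print(json.dumps(errors, indent=2))
```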
Supported reference types #
The crawler analyzes the following types of files:
- HTML files
- CSS files
- JavaScript modules
- Image maps
- Sitemap files
- robots.txt
In HTML, the crawler finds references to other pages and resources by examining:
- href attribute in elements: <a>, <area>, <image>, <filter>, <link>, <mpath>, <pattern>, <use>. (The crawler currently cannot find paths in SVG elements with mixed-case names: <feImage>, <linearGradient>, <radialGradient>, or <textPath>.)
- src attribute in elements: <audio>, <embed>, <frame>, <iframe>, <img>, <input>, <script>, <source>, <track>, <video>
- srcset attribute in <img> and <source> elements
- poster attribute in <video> elements
- data attribute in <object> elements
- background deprecated attribute in elements: <body>, <table>, <td>, <th>
- longdesc deprecated attribute in <img> elements
- content attribute in <meta> elements with a property attribute ending in :image (like og:image)
- CSS in <style> elements or style attribute
- JavaScript in <script> elements with a type attribute of "module"
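As a rough illustration of attribute-based reference extraction, the following Python sketch uses the standard library's html.parser to collect href, src, srcset, and poster values from a small subset of the elements named above. The ReferenceCollector class and find_html_references function are hypothetical names, and this is not the crawler's actual implementation.

```python
from html.parser import HTMLParser

# A subset of the element/attribute pairs named above; not exhaustive.
HREF_TAGS = {"a", "area", "link"}
SRC_TAGS = {"audio", "embed", "iframe", "img", "script", "source", "video"}


class ReferenceCollector(HTMLParser):
    """Collects referenced paths in document order (duplicates are kept)."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in HREF_TAGS and attributes.get("href"):
            self.references.append(attributes["href"])
        if tag in SRC_TAGS and attributes.get("src"):
            self.references.append(attributes["src"])
        if tag in {"img", "source"} and attributes.get("srcset"):
            # Naive split: each candidate is "URL [descriptor]", comma-separated.
            for candidate in attributes["srcset"].split(","):
                parts = candidate.split()
                if parts:
                    self.references.append(parts[0])
        if tag == "video" and attributes.get("poster"):
            self.references.append(attributes["poster"])


def find_html_references(html):
    collector = ReferenceCollector()
    collector.feed(html)
    return collector.references


sample = """
<link rel="stylesheet" href="main.css">
<a href="about.html">About</a>
<img src="small.png" srcset="small.png 1x, large.png 2x">
<video src="clip.mp4" poster="poster.jpg"></video>
"""
refs = find_html_references(sample)
print(refs)
```

The comma split for srcset is a simplification; a URL that itself contains a comma would need a proper srcset parser.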
In JavaScript, the crawler finds references in:
- import statements (not dynamic import calls)
- export statements
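The distinction between static import/export statements and dynamic import calls can be sketched with a regular expression in Python. This is a deliberately naive illustration, not the crawler's parser: the STATIC_REFERENCE pattern and find_js_references function are hypothetical, and a regular expression can be fooled by comments or string contents that resemble module syntax.

```python
import re

# Matches the module specifier in:
#   import "x";   import ... from "x";   export ... from "x";
# Excluding "(" keeps dynamic import("x") calls from matching.
STATIC_REFERENCE = re.compile(
    r"""^\s*(?:import\s*['"]|(?:import|export)\b[^;'"(]*?\bfrom\s*['"])([^'"]+)['"]""",
    re.MULTILINE,
)


def find_js_references(source):
    """Return module specifiers from static import/export statements."""
    return STATIC_REFERENCE.findall(source)


sample = """
import { a } from "./a.js";
import "./side-effect.js";
export * from "./b.js";
export const label = "not-a-module";
const lazy = await import("./lazy.js");
"""
js_refs = find_js_references(sample)
print(js_refs)
```

In this sample, the dynamic import of ./lazy.js and the plain exported string are both skipped, which mirrors the limitation noted above.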
In CSS, the crawler finds references in:
- @font-face declarations
- @import declarations
- @namespace declarations
- url() functions
- image(), image-set(), and cross-fade() functions
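For CSS, a minimal Python sketch can pull paths out of url() functions and quoted @import declarations. It covers only those two cases and makes no attempt at @namespace or the image functions; the patterns and the find_css_references function are hypothetical and are not the crawler's actual logic.

```python
import re

# url(...) with an optional pair of quotes around the path.
URL_FUNCTION = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""")
# @import "x" with a quoted string; the @import url(...) form is caught above.
IMPORT_STRING = re.compile(r"""@import\s+['"]([^'"]+)['"]""")


def find_css_references(css):
    """Return paths found in url() functions, then in quoted @import rules."""
    return URL_FUNCTION.findall(css) + IMPORT_STRING.findall(css)


sample = """
@import "base.css";
@font-face { font-family: "Demo"; src: url("fonts/demo.woff2"); }
body { background: url(images/bg.png); }
"""
css_refs = find_css_references(sample)
print(css_refs)
```

Note that the font reference inside @font-face is found through its url() function, so no separate handling of that rule is needed in this sketch.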