Crawls the indicated live site and returns the complete tree of reachable resources.
(If you already have a specific set of known resources you want to fetch from a site, you can extract specific resources from a site.)
Crawl an existing site
You can use crawl to crawl an existing website and copy the resulting crawled tree for local inspection. In this case, the tree parameter is typically a SiteTree. A convenient way to wrap an existing site is with the httpstree protocol (or httptree for non-secure HTTP sites) in a URL.
For example, you can copy the original Space Jam website to a local folder called spacejam via:
$ ori "copy crawl(httpstree://www.spacejam.com/1996/), files:spacejam"
On a machine that doesn’t have Origami installed, you can invoke ori via npm’s npx command:
$ npx ori "copy crawl(httpstree://www.spacejam.com/1996/), files:spacejam"
Crawling is a network-intensive operation, so a command to crawl a site like the (surprisingly large!) site above can take a long time to complete – on the order of minutes.
Starting points
A crawl begins by looking for any of:
- /
- /index.html
- robots.txt
- sitemap.xml
From these starting points, the crawler will follow links to additional resources.
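The traversal described above can be pictured as a breadth-first walk seeded with the four starting points. The following is a minimal sketch of that idea, not Origami's implementation: the function name crawl_order and the links_by_url mapping (a stand-in for fetching a page and extracting its links) are hypothetical names introduced only for illustration.

```python
from collections import deque
from urllib.parse import urljoin

# Starting points named in the text above.
STARTING_POINTS = ["/", "/index.html", "robots.txt", "sitemap.xml"]


def crawl_order(base_url, links_by_url):
    """Return the URLs reached from the starting points, in breadth-first order.

    links_by_url is a hypothetical stand-in for the network: it maps each
    existing URL to the list of links found in that resource.
    """
    queue = deque(urljoin(base_url, path) for path in STARTING_POINTS)
    seen = set(queue)
    order = []
    while queue:
        url = queue.popleft()
        if url not in links_by_url:
            continue  # This resource does not exist; nothing to follow.
        order.append(url)
        for link in links_by_url[url]:
            target = urljoin(url, link)  # Resolve relative links against the page.
            # Follow only links that stay within the site being crawled.
            if target.startswith(base_url) and target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

In this sketch a starting point that does not exist is simply skipped, so a site with only an index page and no robots.txt or sitemap.xml is still crawled from its root.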
Broken links
If the crawl operation finds links to internal references that do not exist, it will return those in a crawl-errors.json entry at the top level of the returned tree.
If you just want to check a site for broken links, see the related dev:audit builtin.
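As a rough illustration of what collecting broken internal links involves, the sketch below compares the paths each page references against the set of paths that exist. The layout of the result (a mapping from page to missing paths) is an assumption made for this example; this page does not specify the actual structure of crawl-errors.json, and find_broken_links is a hypothetical name.

```python
def find_broken_links(links_by_path, existing_paths):
    """Map each page to the internal paths it references that do not exist.

    Both arguments are hypothetical inputs for illustration: links_by_path
    maps a page path to the paths it links to, and existing_paths is the
    set of paths the crawl actually found.
    """
    errors = {}
    for page, links in links_by_path.items():
        missing = sorted(set(links) - existing_paths)
        if missing:
            errors[page] = missing
    return errors
```

Pages with no dangling references are left out of the result, so an empty mapping means no broken internal links were found.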
Supported reference types
The crawler analyzes the following types of files:
- HTML files
- CSS files
- JavaScript modules
- Image maps
- Sitemap files
- robots.txt
In HTML, the crawler finds references to other pages and resources by examining:
- href attribute in elements: <a>, <area>, <image>, <filter>, <link>, <mpath>, <pattern>, <use>. (The crawler currently cannot find paths in SVG elements with mixed-case names: <feImage>, <linearGradient>, <radialGradient>, or <textPath>.)
- src attribute in elements: <audio>, <embed>, <frame>, <iframe>, <img>, <input>, <script>, <source>, <track>, <video>
- srcset attribute in <img> and <source> elements
- poster attribute in <video> elements
- data attribute in <object> elements
- background deprecated attribute in elements: <body>, <table>, <td>, <th>
- longdesc deprecated attribute in <img> elements
- content attribute in <meta> elements with a property attribute ending in :image (like og:image)
- CSS in <style> elements or style attribute
- JavaScript in <script> elements with a type attribute of "module"
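To make the list above concrete, here is a minimal sketch of attribute-based reference extraction using Python's standard html.parser module. It is an illustration of the general technique, not the crawler's own code. It covers only the href, src, srcset, poster, and data cases, and the names ReferenceFinder and find_html_references are hypothetical.

```python
from html.parser import HTMLParser

# Element names taken from the href and src lists above.
HREF_ELEMENTS = {"a", "area", "image", "filter", "link", "mpath", "pattern", "use"}
SRC_ELEMENTS = {
    "audio", "embed", "frame", "iframe", "img",
    "input", "script", "source", "track", "video",
}


class ReferenceFinder(HTMLParser):
    """Collect referenced paths from a subset of the attributes listed above."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in HREF_ELEMENTS and attributes.get("href"):
            self.references.append(attributes["href"])
        if tag in SRC_ELEMENTS and attributes.get("src"):
            self.references.append(attributes["src"])
        if tag in {"img", "source"} and attributes.get("srcset"):
            # Each srcset candidate is "URL descriptor"; keep only the URL.
            # Splitting on commas is a simplification that breaks on URLs
            # that themselves contain commas.
            for candidate in attributes["srcset"].split(","):
                parts = candidate.split()
                if parts:
                    self.references.append(parts[0])
        if tag == "video" and attributes.get("poster"):
            self.references.append(attributes["poster"])
        if tag == "object" and attributes.get("data"):
            self.references.append(attributes["data"])


def find_html_references(html):
    """Return referenced paths in document order (duplicates are kept)."""
    finder = ReferenceFinder()
    finder.feed(html)
    finder.close()
    return finder.references
```

The returned paths are still relative to the page they came from, so a crawler would resolve each one against the page's URL before fetching it.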
In JavaScript, the crawler finds references in:
- import statements (not dynamic import calls)
- export statements
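The distinction between static import statements and dynamic import calls can be sketched with a regular expression. This is a simplified illustration under stated assumptions, not how the crawler is known to work: a real implementation would more likely use a JavaScript parser, and this pattern ignores comments and string contents. The name find_js_references is hypothetical.

```python
import re

# A quoted module specifier such as "./a.js" or './a.js'.
QUOTED = r"""['"]([^'"]+)['"]"""

# Matches either a side-effect import (import "./x.js") or an import/export
# statement that contains a `from "..."` clause. Dynamic import("...") calls
# do not match, because the keyword is followed by a parenthesis.
STATIC_REFERENCE = re.compile(
    r"""^\s*(?:import\s*""" + QUOTED
    + r"""|(?:import|export)\b[^;'"]*?\bfrom\s*""" + QUOTED + r""")""",
    re.MULTILINE,
)


def find_js_references(source):
    """Return module specifiers from static import and export statements."""
    return [
        match.group(1) or match.group(2)
        for match in STATIC_REFERENCE.finditer(source)
    ]
```

An export such as `export const name = "value";` has no `from` clause, so the sketch does not mistake its string for a module path.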
In CSS, the crawler finds references in:
- @font-face declarations
- @import declarations
- @namespace declarations
- url() functions
- image(), image-set(), and cross-fade() functions
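As an illustration of the CSS cases, the sketch below pulls paths out of url() functions and out of @import rules that use a bare string. It is a simplified, regex-based approximation rather than the crawler's actual approach. It does not handle comments, escapes, or bare strings inside image-set(), and find_css_references is a hypothetical name.

```python
import re

# Either url(...) with a quoted or unquoted argument, or @import "...".
CSS_REFERENCE = re.compile(
    r"""url\(\s*(?:"([^"]*)"|'([^']*)'|([^)'"\s]+))\s*\)"""
    r"""|@import\s+(?:"([^"]*)"|'([^']*)')""",
    re.IGNORECASE,
)


def find_css_references(css):
    """Return paths from url() functions and string-form @import rules."""
    return [
        next(group for group in match.groups() if group is not None)
        for match in CSS_REFERENCE.finditer(css)
    ]
```

Because @font-face sources and most image functions express their paths through url(), this one pattern covers several of the cases listed above, though not all of them.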