Node API
ccht's Node.js API gives you more control over crawling and reporting. In fact, the ccht
CLI is just a wrapper around the Node.js API.
TL;DR
Basic Usage
Run the checkAndReport
function with a crawler and a JSON reporter.
This covers most use cases.
import { checkAndReport, NodeHttpCrawler, JsonReporter } from "ccht";

const results = await checkAndReport(
  new NodeHttpCrawler({
    concurrency: 1,
    timeout: 5000,
    useragent: "ccht/node",
  }),
  new JsonReporter({}),
  "https://example.com",
  {
    // [Required]
    // You need to set includeUrls manually
    includeUrls: ["https://example.com"],
    // [Required]
    excludeUrls: [
      // You can use RegExp and Function for match patterns!
      // (same for includeUrls)
      /\.htm$/,
      (url) => url.includes("auth"),
    ],
    // [Required]
    expectedStatus: /^[123]\d\d$/,
    reportTypes: ["error", "unexpected_status"],
    // [Required]
    reportSeverities: ["danger", "warning", "info", "debug"],
    // [Required]
    exitErrorSeverities: ["danger"],
  }
);

// Since the reporter always emits strings, you need to deserialize the result.
console.log(JSON.parse(results));
For more about the checker options, see CheckerOptions in types.ts.
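To give a rough idea of the shape involved, here is a hand-written sketch inferred from the example above. The field types are guesses, and CheckerOptionsSketch and UrlMatcher are illustrative names; the authoritative declaration is CheckerOptions in types.ts.

// A rough sketch of the options shape, inferred from the example above.
// The real declaration is CheckerOptions in types.ts.
type UrlMatcher = string | RegExp | ((url: string) => boolean);

interface CheckerOptionsSketch {
  includeUrls: UrlMatcher[]; // [Required]
  excludeUrls: UrlMatcher[]; // [Required]
  expectedStatus: RegExp; // [Required]
  reportTypes?: string[]; // e.g. "error", "unexpected_status"
  reportSeverities: string[]; // [Required] e.g. "danger", "warning", "info", "debug"
  exitErrorSeverities: string[]; // [Required]
}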
Using your own crawler
A crawler is a plain JavaScript object that implements a visit
method and a destroy
method.
The async visit
method takes a URL, checks it somehow, and returns URLs to be crawled later.
To crawl efficiently, you should set and check the second parameter, a Map of VisitedLink
objects. The caller checks for duplicates too, but you should also check inside the visit
method, especially if your crawler supports concurrency.
The async destroy
method is called when there are no more resources to crawl.
Free your resources in this method (or don't, but your crawler must still have it).
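For a concrete feel of that contract, here is a minimal sketch of a crawler that fetches a page and extracts links with a naive regex. It assumes the visited Map is keyed by URL and uses node-fetch; check VisitedLink in types.ts for how entries are actually shaped.

import fetch from "node-fetch";

const sketchCrawler = {
  async visit(url, visited) {
    // Skip URLs that have already been handled, assuming the Map is
    // keyed by URL (see VisitedLink in types.ts for the actual shape).
    if (visited.has(url)) return [];

    const response = await fetch(url);
    const body = await response.text();

    // Collect href values to be crawled later (naive extraction).
    const found = [];
    for (const match of body.matchAll(/href="([^"]+)"/g)) {
      found.push(new URL(match[1], url).toString());
    }
    return found;
  },
  async destroy() {
    // Nothing to free here; a real crawler might close browser
    // instances, sockets, or worker pools.
  },
};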
See the Crawler interface in types.ts for more.
import { checkAndReport } from "ccht";
import fetch from "node-fetch";

// Simple object with the required methods
const objectCrawler = {
  async visit(url, visited) {
    /*...*/
  },
  async destroy() {
    /*...*/
  },
};
await checkAndReport(objectCrawler /*...*/);
// A function that returns a crawler
const functionCrawler = (...params) => {
  return {
    async visit(url, visited) {
      /*...*/
    },
    async destroy() {
      /*...*/
    },
  };
};
await checkAndReport(functionCrawler(foo) /*...*/);
// Class style
class ClassCrawler {
  async visit(url, visited) {
    /*...*/
  }
  async destroy() {
    /*...*/
  }
}
await checkAndReport(new ClassCrawler() /*...*/);
// If you're using TypeScript, use the `Crawler` interface type.
// (Assuming the type is exported from the package entry point;
// it is defined in types.ts.)
import type { Crawler } from "ccht";

class ClassCrawler implements Crawler {
  /*...*/
}
const objectCrawler: Crawler = {
  /*...*/
};
const functionCrawler = (): Crawler => {
  /*...*/
};