HTCRAWL Htcrawl is nodejs module for the recursive crawling of single page applications (SPA) using javascript.

It uses headless chrome to load and analyze web applications and it's build on top of Puppetteer from wich it inherits all the functionalities.



With htcrawl you can roll your own DOM-XSS scanner with less than 60 lines of javascript!! (see below)



Some examples of what (else) you can do with htcrawl: Advanced scraping of single page applications (SPA) Intercept and log all requests made by a webpage Build tools to detect security vulnerabilities Automate testing of UI, javascript ecc You may also try htcap, a vulnerability scanner built on top of htcrawl.

Basic usage Very basic example for code to print all ajax requests. const htcrawl = require('htcrawl'); // Get instance of Crawler class const crawler = await htcrawl.launch("https://htcrawl.org"); // Print out the url of ajax calls crawler.on("xhr", e => { console.log("XHR to " + e.params.request.url); }); // Start crawling! crawler.start();

Demo Video Short video of the crawl engine in action.

API Reference The API manual can be found here

Download $ npm -i htcrawl $ # or $ git clone https://github.com/fcavallarin/htcrawl.git && cd htcrawl && npm i && cd .. or visit the github page here

Crawl flow The diagram below shows the recursive crawling process.

Examples 1. DOM XSS Scanner Simple DOM XSS scanner const targetUrl="https://htcap.org/scanme/domxss.php"; const options = {headlessChrome:1}; var pmap = {}; const payloads = [ ";window.___xssSink({0});", "<img src='a' onerror=window.___xssSink({0})>" ]; function getNewPayload(payload, element){ const k = "" + Math.floor(Math.random()*4000000000); const p = payload.replace("{0}", k); pmap[k] = {payload:payload, element:element}; return p; } async function crawlAndFuzz(payload){ var hashSet = false; // instantiate htcrawl const crawler = await htcrawl.launch(targetUrl, options); // set a sink on page scope crawler.page().exposeFunction("___xssSink", key => { const msg = `DOM XSS found:

payload: ${pmap[key].payload}

element: ${pmap[key].element}` console.log(msg); }); // fill all inputs with a payload crawler.on("fillinput", async function(e, crawler){ const p = getNewPayload(payload, e.params.element); try{ await crawler.page().$eval(e.params.element, (i, p) => i.value = p, p); }catch(e){} // return false to prevent element to be automatically filled with a random value return false; }); // change page hash before the triggering of the first event crawler.on("triggerevent", async function(e, crawler){ if(!hashSet){ const p = getNewPayload(payload, "hash"); await crawler.page().evaluate(p => document.location.hash = p, p); hashSet = true; } }); try{ await crawler.start(); } catch(e){ console.log(`Error ${e}`); } crawler.browser().close(); } (async () => { for(let payload of payloads){ /* Remove 'await' for parallel scan of all payloads */ await crawlAndFuzz(payload); } })(); 2. Advanced content scraper Crawls a single page application looking for emails. const targetUrl="https://htcap.org/scanme/ng/"; const options = {headlessChrome:1}; function printEmails(string){ const emails = string.match(/([a-z0-9._-]+@[a-z0-9._-]+\.[a-z]+)/gi); if(!emails) return; for(let e of emails) console.log(e); } htcrawl.launch(targetUrl, options).then(async crawler => { crawler.on("domcontentloaded", async function(e, crawler){ const selector = "body"; const html = await crawler.page().$eval(selector, body => body.innerText); printEmails(html); }); crawler.on("newdom", async function(e, crawler){ const selector = e.params.rootNode; const html = await crawler.page().$eval(selector, node => node.innerText); printEmails(html); }); crawler.start().then(crawler => { crawler.browser().close(); }).catch(err => { console.log(`Error: ${err}`); }); });