GitHub - Katana A Next-Generation Crawling and Spidering Framework
GitHub - Katana: A Next-Generation Crawling and Spidering Framework
đ Katana is a next-generation crawling and spidering framework designed for fast and fully configurable web crawling, supporting both standard and headless modes. Here are some of its key features:
- JavaScript Parsing/Crawling: Efficiently parse and crawl JavaScript content.
- Customizable Automatic Form Filling: Streamline form submissions with ease.
- Scope Control: Utilize preconfigured fields or Regex for precise crawling.
- Extensive Configuration Options: Set maximum crawl depth, enable endpoint parsing, and more.
Katana is particularly focused on execution in automation pipelines, allowing users to specify target URLs or lists to crawl using the -u or -list flags. It also supports excluding hosts with filters like âcdnâ, âprivate-ipsâ, and regex using the -e or -exclude flag.
Advanced Features
- Maximum Depth Control: Set the maximum depth to crawl with
-d, defaulting to 3. - JavaScript File Crawling: Use
-jcfor endpoint parsing in JavaScript files. - Automatic Form Filling: Enable with
-aff(experimental) and extract forms with-fx. - Custom Headers and Proxies: Define custom headers or cookies using
-Hand specify an HTTP/SOCKS5 proxy with-proxy.
Visit Strategy
Choose between âdepth-firstâ or âbreadth-firstâ crawling with the -s option. Other advanced options include ignoring crawling of the same path with different query parameters using -iqp and enabling experimental TLS randomization with -tlsi.
For more details, visit the complete article: Read full article