HawkScan Spider Configuration

The spider in HawkScan is a critical component for effectively navigating and scanning traditional and dynamic web applications. This page provides detailed guidance on configuring the Base Spider.

Overview

HawkScan’s spider is designed to discover your application’s routes by analyzing HTML content and following URLs. It is particularly suited for scanning server-side rendered and MVC-shaped web applications. The configuration is managed in the stackhawk.yml file under the hawk.spider section.

stackhawk.yml

hawk:
  spider:
    base: true # basic spider utility that looks at html source files and follows urls it finds. Enabled by default.
    ajax: false # more complex spider operation that follows dynamic links and buttons on application.
    maxDurationMinutes: 5 # maximum allowed time in minutes for any enabled spiders to crawl your web application.
    seedPaths: [] # list of paths to directly add to the site tree.
    har: {} # optional route discovery via har file (web recording)
    custom: {} # bring your own developer tools and use generated web traffic to discover your application.

These mechanisms are best suited for discovering running web applications that serve Content-Type: text/html;, including server-side rendered and MVC-shaped web applications. While HawkScan will try to deterministically and consistently scan a running website, the results of the Scan Discovery phase can be more variable for larger web applications with more links and changing content.

For more consistent, protocol-constrained REST API scanning, specify an API definition such as an OpenAPI specification instead of relying on Scan Discovery mechanisms. HawkScan also supports scanning GraphQL, gRPC, and SOAP APIs.

base Spider

The base spider is the basic web crawler for discovering your application’s routes, and it is appropriate for most traditional web applications. It reaches new pages by finding URLs in Content-Type: text/html; responses and following those paths breadth-first until it has reached all feasible pages.

Toggle its operation with true or false.

NOTE: This feature is enabled by default.

ajax Spider

Given that modern Single Page Applications (SPAs) are typically powered by one or more APIs, it is generally advisable not to depend solely on the Ajax Spider for scanning. Instead, we recommend directly identifying and testing against these underlying APIs for a more effective and comprehensive security assessment.

The ajax spider is a more complex web crawler designed to discover new pages in more dynamic websites or Single Page Applications. It uses Selenium to follow an unscripted process, clicking any buttons and links it encounters.

Toggle its operation with true or false. You can additionally configure which browser to use with the spider.ajaxBrowser setting. Options include:

  • FIREFOX_HEADLESS (default)
  • FIREFOX
  • CHROME_HEADLESS
  • CHROME

NOTE: To use the spider.ajax option with the CLI, you must have Firefox or Chrome installed and set spider.ajaxBrowser accordingly. This spider is not available in the arm64 HawkScan Docker image. For the Windows hawk CLI, you may need to install geckodriver for Firefox or chromedriver for Chrome.
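
As a minimal sketch, the fragment below enables the ajax spider alongside the base spider and selects headless Chrome. It uses only the spider.ajax and spider.ajaxBrowser settings described above. The browser choice is an assumption made for illustration, and Chrome would need to be installed for it to work.

```yaml
hawk:
  spider:
    base: true                    # keep the default HTML crawler enabled
    ajax: true                    # also run the Selenium-driven ajax spider
    ajaxBrowser: CHROME_HEADLESS  # illustrative choice; the default is FIREFOX_HEADLESS
```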

maxDurationMinutes

Multiple spiders can be enabled for a scan; however, the full navigation of your web application may take a long time if the app is sufficiently large. This setting limits the amount of time all configured spiders may take when operating. By default this is 5 minutes. Larger web applications may need more time to scan in pre-production, whereas a shorter feedback time is better when scanning in development.
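
For instance, a pre-production scan of a larger application might raise the limit. The value of 15 below is an arbitrary illustration rather than a recommendation.

```yaml
hawk:
  spider:
    maxDurationMinutes: 15 # illustrative value; the default is 5
```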

seedPaths

Explicitly adds routes to the site tree. HawkScan visits the host URL and any routes added here directly during the scan. These paths will be used as additional starting points for crawling your application. This parameter is useful for defining routes that are not readily crawlable from the root of your application host, such as a hidden page like /admin.

NOTE: This configuration is NOT a replacement for an API definition and provides no benefit to pure REST APIs.
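
A minimal sketch of seeding extra starting points follows. The /admin path comes from the example above, and /reports/monthly is a hypothetical route added only for illustration.

```yaml
hawk:
  spider:
    seedPaths:
      - /admin            # hidden page not linked from the application root
      - /reports/monthly  # hypothetical path; replace with your own routes
```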

har Discovery

HawkScan offers the capability to use HTTP Archive (HAR) files for discovering routes in your web application. HAR files, which record the network traffic between a web browser and server, can be an effective alternative for mapping out the structure of your application, especially in cases where traditional spider methods may not be sufficient.

To utilize HAR files in HawkScan, you need to specify the location of your HAR file in the stackhawk.yml configuration file. Here’s an example of how to configure HawkScan to use a set of HAR files in a local subdirectory:

stackhawk.yml

hawk:
  spider:
    base: false
    har:
      dir:
        path: ${HOME}/test/resources/har
        # optional (hostname in the har file if not the same as the configured app.host)
        replaceHost: https://mytesthost1.example.com

HawkScan offers two methods for generating HAR files to discover routes in your web application: manually through a browser session and automatically by capturing traffic from automated testing tools.

Manual HAR File Generation

For manual HAR file generation, use the hawk perch daemon mode with Chrome. This approach is suitable when you want to manually navigate your application to ensure specific user flows or functionalities are captured.

Start the `hawk perch` daemon with the `--with-chrome` option to begin manual navigation of your web application. This opens a Chrome browser instance that routes all web traffic through HawkScan, ensuring every interaction is captured.

$ hawk perch start --with-chrome

If HawkScan was installed as a Windows executable from the MSI, the commands differ slightly. Start perch in one PowerShell window:

hawk perch start

Then start the browser in a second window or tab:

hawk perch browser

Visit various parts of your web application to cover all functionality. HawkScan will record the paths you visit.

After completing your navigation, stop the recording to generate the HAR file.

$ hawk perch stop --har-file=myrecording.har

This method captures the web application’s user flows, including authentication details, facilitating thorough testing and overcoming complex authentication mechanisms.

Automated HAR File Generation

For automated HAR file generation, integrate HawkScan with your automated testing tools. This method efficiently captures a wide range of requests and responses generated by automated scripts, including those from popular testing frameworks like Selenium, Cypress, or through Postman Collections.

First, start HawkScan in daemon mode with the --with-proxy-info option, which prints the proxy's IP address and port. Automated testing tools can then route their traffic through HawkScan.

$ hawk perch start --with-proxy-info
127.0.0.1:20000

Next, configure your automated testing tools to use the proxy details provided by HawkScan and run your automated tests. All traffic generated by these tools will be intercepted by HawkScan. Typically this can be accomplished by setting the HTTP_PROXY environment variable to http://localhost:20000.

Once your automated tests are complete, stop the HawkScan daemon to finalize the HAR file.

$ hawk perch stop --har-file=automatedRecording.har

Both manual and automated HAR file generation methods provide flexibility in capturing web application traffic for route discovery, offering precise control over the scan coverage in HawkScan. This approach ensures effective and comprehensive security testing for various application complexities and testing scenarios.

  • Consult our HawkScan command line options using hawk perch start --help or hawk perch stop --help.
  • For more information, see the documentation on working with HAR files.

custom Scan Discovery

Developers who use HawkScan often rely on other application testing tools as well. These tools may generate web traffic and support proxying that traffic into other software. HawkScan can reuse these capabilities to check the tested application's web traffic for software vulnerabilities.

Toggle its operation by specifying a custom.command to be run.
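
As a rough sketch, a custom discovery configuration might look like the fragment below. Only the custom.command key is taken from this page. The command value is a hypothetical test script, and the sketch assumes that the tool it launches is set up to send its traffic through HawkScan.

```yaml
hawk:
  spider:
    custom:
      command: npm run test:e2e # hypothetical command; replace with your own test runner
```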

See the Custom Scan Discovery page for more details.