Overview: Web Application Crawling in Astra

Last updated: May 29, 2026

Summary

Astra’s Dynamic Application Security Testing (DAST) engine uses real browsers, such as Chromium and Firefox, to crawl and explore web applications. Unlike traditional crawlers that only parse HTML, Astra renders pages as a real user would, which is essential for mapping the attack surface of modern Single-Page Applications (SPAs) and JavaScript-heavy sites.

Crawling is the automated process of exploring your application to discover all reachable pages, API endpoints, and parameters before vulnerability testing begins. By using real browser automation, the crawler can execute JavaScript, interact with the live DOM, and handle client-side routing (React, Vue, Angular) that traditional tools often miss.

How the Crawler Works

Because the crawl runs in a real browser (Chromium or Firefox) rather than an HTML parser, it executes the page's JavaScript, interacts with the live DOM, and simulates real user actions such as clicks, scrolls, form submissions, and opening modals. It also retains full session context, including cookies, localStorage, and sessionStorage, across the crawl.

For SPAs, the crawler handles client-side routing where URLs may not change predictably, watches for DOM mutations and dynamically added elements, listens for network activity triggered by interactions, and tracks navigation events even without full-page reloads. For traditional multi-page applications, it handles JavaScript-based redirects, non-standard link mechanisms that use onclick handlers, content behind sliders, accordions, and dropdowns, CSRF tokens, dynamically generated hidden fields, custom UI frameworks and component libraries, and JavaScript-based login mechanisms.

As the crawler clicks, navigates, scrolls, and submits forms in your frontend, it automatically captures the API calls fired in the background, including XHR, fetch, and WebSocket requests, along with the full request and response lifecycle (headers, parameters, and payloads). It handles client-side logic that constructs or modifies API calls dynamically, and it discovers variations in API usage, such as optional parameters or different payload formats that depend on user interactions. Many modern applications rely on asynchronous APIs for nearly all data exchange, so browser-based crawling provides significantly better backend visibility than traditional scanners that parse static HTML or rely on predefined endpoint lists. As a result, APIs are not just detected but also evaluated for security issues as part of the scan.
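To make the idea of discovering API usage variations concrete, here is a minimal sketch in Python. It is not Astra's implementation: the `EndpointRecord` class, the `build_inventory` function, and the example URLs are all hypothetical, and the sketch assumes the browser has already recorded each background request as a (method, URL) pair. It groups the recorded calls by method and path, then reports which query parameters appeared in every call and which appeared only in some.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit, parse_qsl


@dataclass
class EndpointRecord:
    """Aggregated view of one endpoint across all observed calls (hypothetical)."""
    method: str
    path: str
    calls: int = 0
    param_counts: dict = field(default_factory=dict)

    def required_params(self):
        # Parameters present in every observed call.
        return sorted(p for p, n in self.param_counts.items() if n == self.calls)

    def optional_params(self):
        # Parameters seen in some calls but not all.
        return sorted(p for p, n in self.param_counts.items() if n < self.calls)


def build_inventory(captured):
    """Group captured (method, url) pairs by HTTP method and URL path."""
    inventory = {}
    for method, url in captured:
        parts = urlsplit(url)
        key = (method.upper(), parts.path)
        record = inventory.setdefault(key, EndpointRecord(key[0], key[1]))
        record.calls += 1
        # Count each parameter name once per call, even if it repeats.
        names = {k for k, _ in parse_qsl(parts.query, keep_blank_values=True)}
        for name in names:
            record.param_counts[name] = record.param_counts.get(name, 0) + 1
    return inventory


# Example requests, as if recorded while a user paged through an order list.
captured = [
    ("GET", "https://app.example.com/api/orders?page=1"),
    ("GET", "https://app.example.com/api/orders?page=2&status=open"),
    ("POST", "https://app.example.com/api/orders"),
]
inventory = build_inventory(captured)
orders = inventory[("GET", "/api/orders")]
print(orders.required_params())  # ['page']
print(orders.optional_params())  # ['status']
```

A real crawler would also compare request bodies and headers; query parameters are used here only because they keep the sketch short.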

For authenticated scans, the real browser session stays active throughout the crawl, allowing access to restricted and user-specific areas, and supporting complex login workflows including multi-step logins and third-party SSO. Authentication can be configured in the scan settings to ensure proper session handling.

The result is greater coverage of application features regardless of how they are loaded, improved detection of hidden or dynamically generated endpoints and parameters, effective exploration of modern frontend architectures including SPAs and component-based UIs, and accurate simulation of real user behavior, so the scanner tests what actual users and attackers would see.

The Step-by-Step Crawling Flow

Figure: High-level architecture of Astra’s browser-based crawling process for web applications, showing how pages, resources, and endpoints are discovered through real user-like interactions.

Scan Initiation: Configured parameters, such as the target URL and crawl scope, are loaded.
Real Browser Launch: A headless Chromium browser is launched to simulate a real user environment.
Authentication: If required, a recorded login flow is replayed to ensure the crawl starts from an authenticated session.
Initial Navigation: The crawler navigates to the target URL and waits for the page to fully render.
Page Exploration: All links are extracted, resources (scripts/stylesheets) are recorded, and API calls (XHR, fetch, WebSocket) are captured.
User Interaction Simulation: The crawler mimics a real user by clicking buttons, scrolling, and automatically filling and submitting forms to expose hidden content.
Recursive Crawling: Discovered pages are processed identically until the defined scope is exhausted.
Sitemap Construction: All discovered assets are added to an evolving Sitemap, which acts as the inventory for the testing phase.
Session Monitoring: The crawler continuously monitors session validity and will re-authenticate if the session expires.
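
The navigation, exploration, recursion, and sitemap steps above can be sketched as a breadth-first loop. This is an illustrative Python sketch under stated assumptions, not Astra's crawler: `crawl`, `fetch_links`, and the in-memory `SITE` map are hypothetical, `fetch_links` stands in for rendering a page in a real browser and collecting its links, and scope is simplified to "same host as the start URL". Authentication, interaction simulation, and session monitoring are left out.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlsplit


def crawl(start_url, fetch_links, max_depth=3):
    """Breadth-first crawl limited to the start URL's host and to max_depth.

    fetch_links(url) is a hypothetical callback that returns the links
    found on a page; a real crawler would render the page in a browser.
    """
    scope_host = urlsplit(start_url).netloc
    sitemap = {}                      # url -> depth at which it was first found
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        url, _fragment = urldefrag(url)   # fragments do not create new entries
        if url in sitemap or urlsplit(url).netloc != scope_host:
            continue                      # already visited or out of scope
        sitemap[url] = depth
        if depth >= max_depth:
            continue                      # recorded, but not explored further
        for link in fetch_links(url):
            queue.append((urljoin(url, link), depth + 1))
    return sitemap


# Toy in-memory site used in place of a real browser.
SITE = {
    "https://app.example.com/": [
        "/orders",
        "/profile#settings",
        "https://other.example.org/",
    ],
    "https://app.example.com/orders": ["/orders/1", "/"],
    "https://app.example.com/orders/1": ["/orders/1/items"],
}

sitemap = crawl("https://app.example.com/", lambda u: SITE.get(u, []), max_depth=2)
for url, depth in sitemap.items():
    print(depth, url)
```

With `max_depth=2`, the page `/orders/1` is recorded but its own links are not followed, and the out-of-scope host is skipped entirely.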

Key Functions & Special Handling

API Discovery: Because it uses a real browser, Astra automatically triggers and captures API requests that occur during normal app operation, providing better visibility into backend services.
URL Fragments (#): Astra crawls fragment URLs for API activity, but does not list them as separate entries in the sitemap, to avoid redundancy.
Dynamic Parameters: URLs differing only by unique IDs (e.g., /user?id=1 and /user?id=2) are grouped into a single route pattern to prevent duplicate testing.
Crawl Depth: To ensure performance and avoid infinite loops, the crawler limits how deeply it follows nested links.
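
As an illustration of how fragment handling and dynamic-parameter grouping might work, the Python sketch below reduces each URL to a route pattern. The rules are assumptions made for this example rather than Astra's actual logic: `route_pattern` and `group_routes` are hypothetical names, fragments are dropped, query values are replaced with a placeholder, and path segments that are purely numeric or UUID-shaped are treated as IDs.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Assumption: a path segment is an ID if it is all digits or looks like a UUID.
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
)


def route_pattern(url):
    """Reduce a URL to a route pattern that ignores concrete values."""
    parts = urlsplit(url)  # the fragment is parsed out and not used
    segments = [
        "{id}" if _ID_SEGMENT.fullmatch(segment) else segment
        for segment in parts.path.split("/")
    ]
    path = "/".join(segments)
    # Keep parameter names (sorted, deduplicated) and drop their values.
    names = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    query = "&".join(f"{name}={{}}" for name in names)
    return f"{path}?{query}" if query else path


def group_routes(urls):
    """Map each route pattern to the concrete URLs that produced it."""
    groups = {}
    for url in urls:
        groups.setdefault(route_pattern(url), []).append(url)
    return groups


groups = group_routes([
    "https://app.example.com/user?id=1",
    "https://app.example.com/user?id=2",
    "https://app.example.com/orders/42/items#top",
    "https://app.example.com/orders/43/items",
])
for pattern, members in groups.items():
    print(pattern, len(members))
# /user?id={} 2
# /orders/{id}/items 2
```

Each pattern would then be tested once instead of once per concrete URL.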

Best Practices

Update Inventory Regularly: Use the Automated Crawling (Web) scan type periodically to update your endpoint inventory without running a full security test.
Separate Crawl and Scan: For large apps, schedule an Automated Crawl (e.g., at 2:00 AM) followed by a Delta Scan (e.g., at 4:00 AM) to ensure the scan targets only the most recent changes.
Maintain Session Integrity: Ensure your login recording is accurate, as an incorrect configuration will limit the scanner to seeing only public pages.

Troubleshooting

Missing Endpoints: This may occur if the content is state-dependent (e.g., a "Track Delivery" page that appears only while an order is active) or if a firewall or WAF is blocking the crawler.
Scan Cancellations: Astra may cancel a crawl if it detects persistently high response times or repeated server errors to prevent incomplete or misleading results.
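
The cancellation behavior can be pictured with a small health monitor. This Python sketch is purely illustrative: the source does not state Astra's actual thresholds or logic, so the `CrawlHealthMonitor` class, the window size, the average-latency limit, and the consecutive-error limit are all assumed values chosen for the example.

```python
from collections import deque


class CrawlHealthMonitor:
    """Tracks recent responses and decides whether a crawl looks unhealthy.

    All thresholds are assumptions for illustration only.
    """

    def __init__(self, window=20, max_avg_seconds=10.0, max_consecutive_errors=5):
        self.samples = deque(maxlen=window)   # most recent response times
        self.max_avg_seconds = max_avg_seconds
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors = 0

    def record(self, status_code, elapsed_seconds):
        self.samples.append(elapsed_seconds)
        if status_code >= 500:
            self.consecutive_errors += 1
        else:
            self.consecutive_errors = 0       # any non-5xx response resets the streak

    def should_cancel(self):
        if self.consecutive_errors >= self.max_consecutive_errors:
            return True
        # Judge latency only once a full window has been observed, so a
        # single slow response does not cancel the crawl.
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples) if self.samples else 0.0
        return window_full and average > self.max_avg_seconds


monitor = CrawlHealthMonitor(window=3, max_avg_seconds=1.0, max_consecutive_errors=2)
monitor.record(200, 0.2)
print(monitor.should_cancel())  # False
monitor.record(503, 0.3)
monitor.record(503, 0.3)
print(monitor.should_cancel())  # True
```

If a crawl is cancelled this way, check server load and any rate limiting or WAF rules before rescanning.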