10 KiB
Crawl & Map API Reference
Table of Contents
- Crawl vs Map
- Key Parameters
- Instructions and Chunks
- Path and Domain Filtering
- Use Cases
- Map then Extract Pattern
- Performance Optimization
- Common Pitfalls
- Response Fields
- Summary
Crawl vs Map
| Feature | Crawl | Map |
|---|---|---|
| Returns | Full content | URLs only |
| Speed | Slower | Faster |
| Best for | RAG, deep analysis, documentation | Site structure discovery, URL collection |
Use Crawl when:
- Full content extraction needed
- Building RAG systems
- Processing paginated/nested content
- Integration with knowledge bases
Use Map when:
- Quick site structure discovery
- URL collection without content
- Planning before crawling
- Sitemap generation
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
url |
string | Required | Root URL to begin |
max_depth |
integer | 1 | Levels deep to crawl (1-5). Start with 1-2 |
max_breadth |
integer | 20 | Links per page. 50-100 for focused crawls |
limit |
integer | 50 | Total pages cap |
instructions |
string | null | Natural language guidance (2 credits/10 pages) |
chunks_per_source |
integer | 3 | Chunks per page (1-5). Only with instructions |
extract_depth |
enum | "basic" |
"basic" (1 credit/5 URLs) or "advanced" (2 credits/5 URLs) |
format |
enum | "markdown" |
"markdown" or "text" |
select_paths |
array | null | Regex patterns to include |
exclude_paths |
array | null | Regex patterns to exclude |
select_domains |
array | null | Regex for domains to include |
exclude_domains |
array | null | Regex for domains to exclude |
allow_external |
boolean | true (crawl) / false (map) | Include external domain links |
include_images |
boolean | false | Include images (crawl only) |
include_favicon |
boolean | false | Include favicon URL (crawl only) |
include_usage |
boolean | false | Include credit usage info |
timeout |
float | 150 | Max wait (10-150 seconds) |
Instructions and Chunks
Use instructions and chunks_per_source for semantic focus and token optimization:
response = client.crawl(
url="https://docs.example.com",
max_depth=2,
instructions="Find all documentation about authentication and security",
chunks_per_source=3 # Only top 3 relevant chunks per page
)
Key benefits:
instructionsguides crawler semantically, focusing on relevant contentchunks_per_sourcereturns only relevant snippets (max 500 chars each)- Prevents context window explosion in agentic use cases
- Chunks appear in
raw_contentas:<chunk 1> [...] <chunk 2> [...] <chunk 3>
Note: chunks_per_source only works when instructions is provided.
Path and Domain Filtering
Path patterns (regex)
# Target specific sections
response = client.crawl(
url="https://example.com",
select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
exclude_paths=["/blog/.*", "/changelog/.*", "/private/.*"]
)
# Paginated content
response = client.crawl(
url="https://example.com/blog",
max_depth=2,
select_paths=["/blog/.*", "/blog/page/.*"],
exclude_paths=["/blog/tag/.*"]
)
Domain control (regex)
# Stay within subdomain
response = client.crawl(
url="https://docs.example.com",
select_domains=["^docs.example.com$"],
max_depth=2
)
# Exclude specific domains
response = client.crawl(
url="https://example.com",
exclude_domains=["^ads.example.com$", "^tracking.example.com$"]
)
Use Cases
1. Deep/Unlinked Content
Deeply nested pages, paginated archives, internal search-only content.
response = client.crawl(
url="https://example.com",
max_depth=3,
max_breadth=50,
limit=200,
select_paths=["/blog/.*", "/changelog/.*"],
exclude_paths=["/private/.*", "/admin/.*"]
)
2. Documentation/Structured Content
Documentation, changelogs, FAQs with nonstandard markup.
response = client.crawl(
url="https://docs.example.com",
max_depth=2,
extract_depth="advanced",
select_paths=["/docs/.*"]
)
3. Multi-modal/Cross-referencing
Combining information from multiple sections.
response = client.crawl(
url="https://example.com",
max_depth=2,
instructions="Find all documentation pages that link to API reference docs",
extract_depth="advanced"
)
4. Rapidly Changing Content
API docs, product announcements, news sections.
response = client.crawl(
url="https://api.example.com",
max_depth=1,
max_breadth=100
)
5. RAG/Knowledge Base Integration
response = client.crawl(
url="https://docs.example.com",
max_depth=2,
extract_depth="advanced",
include_images=True,
instructions="Extract all technical documentation and code examples"
)
6. Compliance/Auditing
Comprehensive content analysis for legal checks.
response = client.crawl(
url="https://example.com",
max_depth=3,
max_breadth=100,
limit=1000,
extract_depth="advanced",
instructions="Find all mentions of GDPR and data protection policies"
)
7. Known URL Patterns
Sitemap-based crawling, section-specific extraction.
response = client.crawl(
url="https://example.com",
max_depth=1,
select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
exclude_paths=["/private/.*", "/admin/.*"]
)
Map then Extract Pattern
Consider using Map before Crawl/Extract to plan your strategy:
- Use Map to get site structure
- Analyze paths and patterns
- Configure Crawl or Extract with discovered paths
- Execute focused extraction
# Step 1: Map to discover structure
map_result = client.map(
url="https://docs.example.com",
max_depth=2,
instructions="Find all API docs and guides"
)
# Step 2: Filter discovered URLs
api_docs = [url for url in map_result["results"] if "/api/" in url]
guides = [url for url in map_result["results"] if "/guides/" in url]
print(f"Found {len(api_docs)} API docs, {len(guides)} guides")
# Step 3: Extract from filtered URLs
target_urls = api_docs + guides
response = client.extract(
urls=target_urls[:20], # Max 20 per extract call
extract_depth="advanced",
query="API endpoints and usage examples",
chunks_per_source=3
)
Benefits:
- Discover site structure before committing to full crawl
- Identify relevant path patterns
- Avoid unnecessary extraction
- More control over what gets extracted
Performance Optimization
Depth vs Performance
Each depth level increases crawl time exponentially:
| Depth | Typical Pages | Time |
|---|---|---|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |
Best practices:
- Start with
max_depth=1and increase only if needed - Use
max_breadthto control horizontal expansion - Set appropriate
limitto prevent excessive crawling - Process results incrementally rather than waiting for full crawl
Rate Limiting
- Respect site's robots.txt
- Monitor API usage and limits
- Use appropriate error handling for rate limits
- Consider delays between large crawl operations
Conservative vs Comprehensive
# Conservative (start here)
response = client.crawl(
url="https://example.com",
max_depth=1,
max_breadth=20,
limit=20
)
# Comprehensive (use carefully)
response = client.crawl(
url="https://example.com",
max_depth=3,
max_breadth=100,
limit=500
)
Common Pitfalls
| Problem | Impact | Solution |
|---|---|---|
Excessive depth (max_depth=4+) |
Exponential time, unnecessary pages | Start with 1-2, increase if needed |
| Unfocused crawling | Wasted resources, irrelevant content, context explosion | Use instructions to focus semantically |
| Missing limits | Runaway crawls, unexpected costs | Always set reasonable limit value |
Ignoring failed_results |
Incomplete data, missed content | Monitor and adjust parameters |
| Full content without chunks | Context window explosion | Use instructions + chunks_per_source |
Response Fields
Crawl Response
| Field | Description |
|---|---|
base_url |
The URL you started the crawl from |
results |
List of crawled pages |
results[].url |
Page URL |
results[].raw_content |
Extracted content (or chunks if instructions provided) |
results[].images |
Image URLs extracted from the page |
results[].favicon |
Favicon URL (if include_favicon=True) |
response_time |
Time in seconds |
request_id |
Unique identifier for support reference |
Map Response
| Field | Description |
|---|---|
base_url |
The URL you started the mapping from |
results |
List of discovered URLs |
response_time |
Time in seconds |
request_id |
Unique identifier for support reference |
Summary
- Use instructions and chunks_per_source for focused, relevant results in agentic use cases
- Start conservative (
max_depth=1,max_breadth=20) and scale up as needed - Use path patterns to focus crawling on relevant content
- Choose appropriate extract_depth based on content complexity
- Always set a limit to prevent runaway crawls and unexpected costs
- Monitor failed_results and adjust patterns accordingly
- Use Map first to understand site structure before committing to full crawl
- Implement error handling for rate limits and failures
- Respect robots.txt and site policies
Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.
For more details, see the full API reference