giteadmin/.dotfiles

Fork 0

Files

Jonathan Agmon c09d9151ca Add skills

2026-03-22 23:21:49 +02:00

10 KiB

Raw Blame History

Crawl & Map API Reference

Crawl vs Map
Key Parameters
Instructions and Chunks
Path and Domain Filtering
Use Cases
Map then Extract Pattern
Performance Optimization
Common Pitfalls
Response Fields
Summary

Crawl vs Map

Feature	Crawl	Map
Returns	Full content	URLs only
Speed	Slower	Faster
Best for	RAG, deep analysis, documentation	Site structure discovery, URL collection

Use Crawl when:

Full content extraction needed
Building RAG systems
Processing paginated/nested content
Integration with knowledge bases

Use Map when:

Quick site structure discovery
URL collection without content
Planning before crawling
Sitemap generation

Key Parameters

Parameter	Type	Default	Description
`url`	string	Required	Root URL to begin
`max_depth`	integer	1	Levels deep to crawl (1-5). Start with 1-2
`max_breadth`	integer	20	Links per page. 50-100 for focused crawls
`limit`	integer	50	Total pages cap
`instructions`	string	null	Natural language guidance (2 credits/10 pages)
`chunks_per_source`	integer	3	Chunks per page (1-5). Only with `instructions`
`extract_depth`	enum	`"basic"`	`"basic"` (1 credit/5 URLs) or `"advanced"` (2 credits/5 URLs)
`format`	enum	`"markdown"`	`"markdown"` or `"text"`
`select_paths`	array	null	Regex patterns to include
`exclude_paths`	array	null	Regex patterns to exclude
`select_domains`	array	null	Regex for domains to include
`exclude_domains`	array	null	Regex for domains to exclude
`allow_external`	boolean	true (crawl) / false (map)	Include external domain links
`include_images`	boolean	false	Include images (crawl only)
`include_favicon`	boolean	false	Include favicon URL (crawl only)
`include_usage`	boolean	false	Include credit usage info
`timeout`	float	150	Max wait (10-150 seconds)

Instructions and Chunks

Use instructions and chunks_per_source for semantic focus and token optimization:

response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    instructions="Find all documentation about authentication and security",
    chunks_per_source=3  # Only top 3 relevant chunks per page
)

Key benefits:

instructions guides crawler semantically, focusing on relevant content
chunks_per_source returns only relevant snippets (max 500 chars each)
Prevents context window explosion in agentic use cases
Chunks appear in raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>

Note: chunks_per_source only works when instructions is provided.

Path and Domain Filtering

Path patterns (regex)

# Target specific sections
response = client.crawl(
    url="https://example.com",
    select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
    exclude_paths=["/blog/.*", "/changelog/.*", "/private/.*"]
)

# Paginated content
response = client.crawl(
    url="https://example.com/blog",
    max_depth=2,
    select_paths=["/blog/.*", "/blog/page/.*"],
    exclude_paths=["/blog/tag/.*"]
)

Domain control (regex)

# Stay within subdomain
response = client.crawl(
    url="https://docs.example.com",
    select_domains=["^docs.example.com$"],
    max_depth=2
)

# Exclude specific domains
response = client.crawl(
    url="https://example.com",
    exclude_domains=["^ads.example.com$", "^tracking.example.com$"]
)

Use Cases

1. Deep/Unlinked Content

Deeply nested pages, paginated archives, internal search-only content.

response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=50,
    limit=200,
    select_paths=["/blog/.*", "/changelog/.*"],
    exclude_paths=["/private/.*", "/admin/.*"]
)

2. Documentation/Structured Content

Documentation, changelogs, FAQs with nonstandard markup.

response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    extract_depth="advanced",
    select_paths=["/docs/.*"]
)

3. Multi-modal/Cross-referencing

Combining information from multiple sections.

response = client.crawl(
    url="https://example.com",
    max_depth=2,
    instructions="Find all documentation pages that link to API reference docs",
    extract_depth="advanced"
)

4. Rapidly Changing Content

API docs, product announcements, news sections.

response = client.crawl(
    url="https://api.example.com",
    max_depth=1,
    max_breadth=100
)

5. RAG/Knowledge Base Integration

response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    extract_depth="advanced",
    include_images=True,
    instructions="Extract all technical documentation and code examples"
)

6. Compliance/Auditing

Comprehensive content analysis for legal checks.

response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=1000,
    extract_depth="advanced",
    instructions="Find all mentions of GDPR and data protection policies"
)

7. Known URL Patterns

Sitemap-based crawling, section-specific extraction.

response = client.crawl(
    url="https://example.com",
    max_depth=1,
    select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
    exclude_paths=["/private/.*", "/admin/.*"]
)

Map then Extract Pattern

Consider using Map before Crawl/Extract to plan your strategy:

Use Map to get site structure
Analyze paths and patterns
Configure Crawl or Extract with discovered paths
Execute focused extraction

# Step 1: Map to discover structure
map_result = client.map(
    url="https://docs.example.com",
    max_depth=2,
    instructions="Find all API docs and guides"
)

# Step 2: Filter discovered URLs
api_docs = [url for url in map_result["results"] if "/api/" in url]
guides = [url for url in map_result["results"] if "/guides/" in url]
print(f"Found {len(api_docs)} API docs, {len(guides)} guides")

# Step 3: Extract from filtered URLs
target_urls = api_docs + guides
response = client.extract(
    urls=target_urls[:20],  # Max 20 per extract call
    extract_depth="advanced",
    query="API endpoints and usage examples",
    chunks_per_source=3
)

Benefits:

Discover site structure before committing to full crawl
Identify relevant path patterns
Avoid unnecessary extraction
More control over what gets extracted

Performance Optimization

Depth vs Performance

Each depth level increases crawl time exponentially:

Depth	Typical Pages	Time
1	10-50	Seconds
2	50-500	Minutes
3	500-5000	Many minutes

Best practices:

Start with max_depth=1 and increase only if needed
Use max_breadth to control horizontal expansion
Set appropriate limit to prevent excessive crawling
Process results incrementally rather than waiting for full crawl

Rate Limiting

Respect site's robots.txt
Monitor API usage and limits
Use appropriate error handling for rate limits
Consider delays between large crawl operations

Conservative vs Comprehensive

# Conservative (start here)
response = client.crawl(
    url="https://example.com",
    max_depth=1,
    max_breadth=20,
    limit=20
)

# Comprehensive (use carefully)
response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=500
)

Common Pitfalls

Problem	Impact	Solution
Excessive depth (`max_depth=4+`)	Exponential time, unnecessary pages	Start with 1-2, increase if needed
Unfocused crawling	Wasted resources, irrelevant content, context explosion	Use `instructions` to focus semantically
Missing limits	Runaway crawls, unexpected costs	Always set reasonable `limit` value
Ignoring `failed_results`	Incomplete data, missed content	Monitor and adjust parameters
Full content without chunks	Context window explosion	Use `instructions` + `chunks_per_source`

Response Fields

Crawl Response

Field	Description
`base_url`	The URL you started the crawl from
`results`	List of crawled pages
`results[].url`	Page URL
`results[].raw_content`	Extracted content (or chunks if instructions provided)
`results[].images`	Image URLs extracted from the page
`results[].favicon`	Favicon URL (if `include_favicon=True`)
`response_time`	Time in seconds
`request_id`	Unique identifier for support reference

Map Response

Field	Description
`base_url`	The URL you started the mapping from
`results`	List of discovered URLs
`response_time`	Time in seconds
`request_id`	Unique identifier for support reference

Summary

Use instructions and chunks_per_source for focused, relevant results in agentic use cases
Start conservative (max_depth=1, max_breadth=20) and scale up as needed
Use path patterns to focus crawling on relevant content
Choose appropriate extract_depth based on content complexity
Always set a limit to prevent runaway crawls and unexpected costs
Monitor failed_results and adjust patterns accordingly
Use Map first to understand site structure before committing to full crawl
Implement error handling for rate limits and failures
Respect robots.txt and site policies

Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.

For more details, see the full API reference

10 KiB Raw Blame History