Files
.dotfiles/.config/opencode/skills/tavily-best-practices/references/crawl.md
2026-03-22 23:21:49 +02:00

10 KiB

Crawl & Map API Reference

Table of Contents


Crawl vs Map

Feature Crawl Map
Returns Full content URLs only
Speed Slower Faster
Best for RAG, deep analysis, documentation Site structure discovery, URL collection

Use Crawl when:

  • Full content extraction needed
  • Building RAG systems
  • Processing paginated/nested content
  • Integration with knowledge bases

Use Map when:

  • Quick site structure discovery
  • URL collection without content
  • Planning before crawling
  • Sitemap generation

Key Parameters

Parameter Type Default Description
url string Required Root URL to begin
max_depth integer 1 Levels deep to crawl (1-5). Start with 1-2
max_breadth integer 20 Links per page. 50-100 for focused crawls
limit integer 50 Total pages cap
instructions string null Natural language guidance (2 credits/10 pages)
chunks_per_source integer 3 Chunks per page (1-5). Only with instructions
extract_depth enum "basic" "basic" (1 credit/5 URLs) or "advanced" (2 credits/5 URLs)
format enum "markdown" "markdown" or "text"
select_paths array null Regex patterns to include
exclude_paths array null Regex patterns to exclude
select_domains array null Regex for domains to include
exclude_domains array null Regex for domains to exclude
allow_external boolean true (crawl) / false (map) Include external domain links
include_images boolean false Include images (crawl only)
include_favicon boolean false Include favicon URL (crawl only)
include_usage boolean false Include credit usage info
timeout float 150 Max wait (10-150 seconds)

Instructions and Chunks

Use instructions and chunks_per_source for semantic focus and token optimization:

response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    instructions="Find all documentation about authentication and security",
    chunks_per_source=3  # Only top 3 relevant chunks per page
)

Key benefits:

  • instructions guides crawler semantically, focusing on relevant content
  • chunks_per_source returns only relevant snippets (max 500 chars each)
  • Prevents context window explosion in agentic use cases
  • Chunks appear in raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>

Note: chunks_per_source only works when instructions is provided.


Path and Domain Filtering

Path patterns (regex)

# Target specific sections
response = client.crawl(
    url="https://example.com",
    select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
    exclude_paths=["/blog/.*", "/changelog/.*", "/private/.*"]
)

# Paginated content
response = client.crawl(
    url="https://example.com/blog",
    max_depth=2,
    select_paths=["/blog/.*", "/blog/page/.*"],
    exclude_paths=["/blog/tag/.*"]
)

Domain control (regex)

# Stay within subdomain
response = client.crawl(
    url="https://docs.example.com",
    select_domains=["^docs.example.com$"],
    max_depth=2
)

# Exclude specific domains
response = client.crawl(
    url="https://example.com",
    exclude_domains=["^ads.example.com$", "^tracking.example.com$"]
)

Use Cases

1. Deep/Unlinked Content

Deeply nested pages, paginated archives, internal search-only content.

response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=50,
    limit=200,
    select_paths=["/blog/.*", "/changelog/.*"],
    exclude_paths=["/private/.*", "/admin/.*"]
)

2. Documentation/Structured Content

Documentation, changelogs, FAQs with nonstandard markup.

response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    extract_depth="advanced",
    select_paths=["/docs/.*"]
)

3. Multi-modal/Cross-referencing

Combining information from multiple sections.

response = client.crawl(
    url="https://example.com",
    max_depth=2,
    instructions="Find all documentation pages that link to API reference docs",
    extract_depth="advanced"
)

4. Rapidly Changing Content

API docs, product announcements, news sections.

response = client.crawl(
    url="https://api.example.com",
    max_depth=1,
    max_breadth=100
)

5. RAG/Knowledge Base Integration

response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    extract_depth="advanced",
    include_images=True,
    instructions="Extract all technical documentation and code examples"
)

6. Compliance/Auditing

Comprehensive content analysis for legal checks.

response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=1000,
    extract_depth="advanced",
    instructions="Find all mentions of GDPR and data protection policies"
)

7. Known URL Patterns

Sitemap-based crawling, section-specific extraction.

response = client.crawl(
    url="https://example.com",
    max_depth=1,
    select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
    exclude_paths=["/private/.*", "/admin/.*"]
)

Map then Extract Pattern

Consider using Map before Crawl/Extract to plan your strategy:

  1. Use Map to get site structure
  2. Analyze paths and patterns
  3. Configure Crawl or Extract with discovered paths
  4. Execute focused extraction
# Step 1: Map to discover structure
map_result = client.map(
    url="https://docs.example.com",
    max_depth=2,
    instructions="Find all API docs and guides"
)

# Step 2: Filter discovered URLs
api_docs = [url for url in map_result["results"] if "/api/" in url]
guides = [url for url in map_result["results"] if "/guides/" in url]
print(f"Found {len(api_docs)} API docs, {len(guides)} guides")

# Step 3: Extract from filtered URLs
target_urls = api_docs + guides
response = client.extract(
    urls=target_urls[:20],  # Max 20 per extract call
    extract_depth="advanced",
    query="API endpoints and usage examples",
    chunks_per_source=3
)

Benefits:

  • Discover site structure before committing to full crawl
  • Identify relevant path patterns
  • Avoid unnecessary extraction
  • More control over what gets extracted

Performance Optimization

Depth vs Performance

Each depth level increases crawl time exponentially:

Depth Typical Pages Time
1 10-50 Seconds
2 50-500 Minutes
3 500-5000 Many minutes

Best practices:

  • Start with max_depth=1 and increase only if needed
  • Use max_breadth to control horizontal expansion
  • Set appropriate limit to prevent excessive crawling
  • Process results incrementally rather than waiting for full crawl

Rate Limiting

  • Respect site's robots.txt
  • Monitor API usage and limits
  • Use appropriate error handling for rate limits
  • Consider delays between large crawl operations

Conservative vs Comprehensive

# Conservative (start here)
response = client.crawl(
    url="https://example.com",
    max_depth=1,
    max_breadth=20,
    limit=20
)

# Comprehensive (use carefully)
response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=500
)

Common Pitfalls

Problem Impact Solution
Excessive depth (max_depth=4+) Exponential time, unnecessary pages Start with 1-2, increase if needed
Unfocused crawling Wasted resources, irrelevant content, context explosion Use instructions to focus semantically
Missing limits Runaway crawls, unexpected costs Always set reasonable limit value
Ignoring failed_results Incomplete data, missed content Monitor and adjust parameters
Full content without chunks Context window explosion Use instructions + chunks_per_source

Response Fields

Crawl Response

Field Description
base_url The URL you started the crawl from
results List of crawled pages
results[].url Page URL
results[].raw_content Extracted content (or chunks if instructions provided)
results[].images Image URLs extracted from the page
results[].favicon Favicon URL (if include_favicon=True)
response_time Time in seconds
request_id Unique identifier for support reference

Map Response

Field Description
base_url The URL you started the mapping from
results List of discovered URLs
response_time Time in seconds
request_id Unique identifier for support reference

Summary

  1. Use instructions and chunks_per_source for focused, relevant results in agentic use cases
  2. Start conservative (max_depth=1, max_breadth=20) and scale up as needed
  3. Use path patterns to focus crawling on relevant content
  4. Choose appropriate extract_depth based on content complexity
  5. Always set a limit to prevent runaway crawls and unexpected costs
  6. Monitor failed_results and adjust patterns accordingly
  7. Use Map first to understand site structure before committing to full crawl
  8. Implement error handling for rate limits and failures
  9. Respect robots.txt and site policies

Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.

For more details, see the full API reference