.dotfiles/.config/opencode/skills/tavily-best-practices/references/crawl.md

# Crawl & Map API Reference

## Table of Contents

- [Crawl vs Map](#crawl-vs-map)
- [Key Parameters](#key-parameters)
- [Instructions and Chunks](#instructions-and-chunks)
- [Path and Domain Filtering](#path-and-domain-filtering)
- [Use Cases](#use-cases)
- [Map then Extract Pattern](#map-then-extract-pattern)
- [Performance Optimization](#performance-optimization)
- [Common Pitfalls](#common-pitfalls)
- [Response Fields](#response-fields)
- [Summary](#summary)

---

## Crawl vs Map

| Feature | Crawl | Map |
|---------|-------|-----|
| **Returns** | Full content | URLs only |
| **Speed** | Slower | Faster |
| **Best for** | RAG, deep analysis, documentation | Site structure discovery, URL collection |

**Use Crawl when:**
- Full content extraction needed
- Building RAG systems
- Processing paginated/nested content
- Integration with knowledge bases

**Use Map when:**
- Quick site structure discovery
- URL collection without content
- Planning before crawling
- Sitemap generation

---

## Key Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | Required | Root URL to begin |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5). **Start with 1-2** |
| `max_breadth` | integer | 20 | Links per page. 50-100 for focused crawls |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance (2 credits/10 pages) |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5). Only with `instructions` |
| `extract_depth` | enum | `"basic"` | `"basic"` (1 credit/5 URLs) or `"advanced"` (2 credits/5 URLs) |
| `format` | enum | `"markdown"` | `"markdown"` or `"text"` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `select_domains` | array | null | Regex for domains to include |
| `exclude_domains` | array | null | Regex for domains to exclude |
| `allow_external` | boolean | true (crawl) / false (map) | Include external domain links |
| `include_images` | boolean | false | Include images (crawl only) |
| `include_favicon` | boolean | false | Include favicon URL (crawl only) |
| `include_usage` | boolean | false | Include credit usage info |
| `timeout` | float | 150 | Max wait (10-150 seconds) |

---

## Instructions and Chunks

Use `instructions` and `chunks_per_source` for semantic focus and token optimization:

```python
response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    instructions="Find all documentation about authentication and security",
    chunks_per_source=3  # Only top 3 relevant chunks per page
)
```

**Key benefits:**
- `instructions` guides crawler semantically, focusing on relevant content
- `chunks_per_source` returns only relevant snippets (max 500 chars each)
- Prevents context window explosion in agentic use cases
- Chunks appear in `raw_content` as: `<chunk 1> [...] <chunk 2> [...] <chunk 3>`

**Note:** `chunks_per_source` only works when `instructions` is provided.

---

## Path and Domain Filtering

### Path patterns (regex)

```python
# Target specific sections
response = client.crawl(
    url="https://example.com",
    select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
    exclude_paths=["/blog/.*", "/changelog/.*", "/private/.*"]
)

# Paginated content
response = client.crawl(
    url="https://example.com/blog",
    max_depth=2,
    select_paths=["/blog/.*", "/blog/page/.*"],
    exclude_paths=["/blog/tag/.*"]
)
```

### Domain control (regex)

```python
# Stay within subdomain
response = client.crawl(
    url="https://docs.example.com",
    select_domains=["^docs.example.com$"],
    max_depth=2
)

# Exclude specific domains
response = client.crawl(
    url="https://example.com",
    exclude_domains=["^ads.example.com$", "^tracking.example.com$"]
)
```

---

## Use Cases

### 1. Deep/Unlinked Content
Deeply nested pages, paginated archives, internal search-only content.

```python
response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=50,
    limit=200,
    select_paths=["/blog/.*", "/changelog/.*"],
    exclude_paths=["/private/.*", "/admin/.*"]
)
```

### 2. Documentation/Structured Content
Documentation, changelogs, FAQs with nonstandard markup.

```python
response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    extract_depth="advanced",
    select_paths=["/docs/.*"]
)
```

### 3. Multi-modal/Cross-referencing
Combining information from multiple sections.

```python
response = client.crawl(
    url="https://example.com",
    max_depth=2,
    instructions="Find all documentation pages that link to API reference docs",
    extract_depth="advanced"
)
```

### 4. Rapidly Changing Content
API docs, product announcements, news sections.

```python
response = client.crawl(
    url="https://api.example.com",
    max_depth=1,
    max_breadth=100
)
```

### 5. RAG/Knowledge Base Integration

```python
response = client.crawl(
    url="https://docs.example.com",
    max_depth=2,
    extract_depth="advanced",
    include_images=True,
    instructions="Extract all technical documentation and code examples"
)
```

### 6. Compliance/Auditing
Comprehensive content analysis for legal checks.

```python
response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=1000,
    extract_depth="advanced",
    instructions="Find all mentions of GDPR and data protection policies"
)
```

### 7. Known URL Patterns
Sitemap-based crawling, section-specific extraction.

```python
response = client.crawl(
    url="https://example.com",
    max_depth=1,
    select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
    exclude_paths=["/private/.*", "/admin/.*"]
)
```

---

## Map then Extract Pattern

Consider using Map before Crawl/Extract to plan your strategy:

1. **Use Map** to get site structure
2. **Analyze** paths and patterns
3. **Configure** Crawl or Extract with discovered paths
4. **Execute** focused extraction

```python
# Step 1: Map to discover structure
map_result = client.map(
    url="https://docs.example.com",
    max_depth=2,
    instructions="Find all API docs and guides"
)

# Step 2: Filter discovered URLs
api_docs = [url for url in map_result["results"] if "/api/" in url]
guides = [url for url in map_result["results"] if "/guides/" in url]
print(f"Found {len(api_docs)} API docs, {len(guides)} guides")

# Step 3: Extract from filtered URLs
target_urls = api_docs + guides
response = client.extract(
    urls=target_urls[:20],  # Max 20 per extract call
    extract_depth="advanced",
    query="API endpoints and usage examples",
    chunks_per_source=3
)
```

**Benefits:**
- Discover site structure before committing to full crawl
- Identify relevant path patterns
- Avoid unnecessary extraction
- More control over what gets extracted

---

## Performance Optimization

### Depth vs Performance

Each depth level increases crawl time exponentially:

| Depth | Typical Pages | Time |
|-------|---------------|------|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

**Best practices:**
- Start with `max_depth=1` and increase only if needed
- Use `max_breadth` to control horizontal expansion
- Set appropriate `limit` to prevent excessive crawling
- Process results incrementally rather than waiting for full crawl

### Rate Limiting

- Respect site's robots.txt
- Monitor API usage and limits
- Use appropriate error handling for rate limits
- Consider delays between large crawl operations

### Conservative vs Comprehensive

```python
# Conservative (start here)
response = client.crawl(
    url="https://example.com",
    max_depth=1,
    max_breadth=20,
    limit=20
)

# Comprehensive (use carefully)
response = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=100,
    limit=500
)
```

---

## Common Pitfalls

| Problem | Impact | Solution |
|---------|--------|----------|
| Excessive depth (`max_depth=4+`) | Exponential time, unnecessary pages | Start with 1-2, increase if needed |
| Unfocused crawling | Wasted resources, irrelevant content, context explosion | Use `instructions` to focus semantically |
| Missing limits | Runaway crawls, unexpected costs | Always set reasonable `limit` value |
| Ignoring `failed_results` | Incomplete data, missed content | Monitor and adjust parameters |
| Full content without chunks | Context window explosion | Use `instructions` + `chunks_per_source` |

---

## Response Fields

### Crawl Response

| Field | Description |
|-------|-------------|
| `base_url` | The URL you started the crawl from |
| `results` | List of crawled pages |
| `results[].url` | Page URL |
| `results[].raw_content` | Extracted content (or chunks if instructions provided) |
| `results[].images` | Image URLs extracted from the page |
| `results[].favicon` | Favicon URL (if `include_favicon=True`) |
| `response_time` | Time in seconds |
| `request_id` | Unique identifier for support reference |

### Map Response

| Field | Description |
|-------|-------------|
| `base_url` | The URL you started the mapping from |
| `results` | List of discovered URLs |
| `response_time` | Time in seconds |
| `request_id` | Unique identifier for support reference |

---

## Summary

1. **Use instructions and chunks_per_source** for focused, relevant results in agentic use cases
2. **Start conservative** (`max_depth=1`, `max_breadth=20`) and scale up as needed
3. **Use path patterns** to focus crawling on relevant content
4. **Choose appropriate extract_depth** based on content complexity
5. **Always set a limit** to prevent runaway crawls and unexpected costs
6. **Monitor failed_results** and adjust patterns accordingly
7. **Use Map first** to understand site structure before committing to full crawl
8. **Implement error handling** for rate limits and failures
9. **Respect robots.txt** and site policies

> Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.

For more details, see the [full API reference](https://docs.tavily.com/documentation/api-reference/endpoint/crawl)