358 lines
10 KiB
Markdown
358 lines
10 KiB
Markdown
# Crawl & Map API Reference
|
|
|
|
## Table of Contents
|
|
|
|
- [Crawl vs Map](#crawl-vs-map)
|
|
- [Key Parameters](#key-parameters)
|
|
- [Instructions and Chunks](#instructions-and-chunks)
|
|
- [Path and Domain Filtering](#path-and-domain-filtering)
|
|
- [Use Cases](#use-cases)
|
|
- [Map then Extract Pattern](#map-then-extract-pattern)
|
|
- [Performance Optimization](#performance-optimization)
|
|
- [Common Pitfalls](#common-pitfalls)
|
|
- [Response Fields](#response-fields)
|
|
- [Summary](#summary)
|
|
|
|
---
|
|
|
|
## Crawl vs Map
|
|
|
|
| Feature | Crawl | Map |
|
|
|---------|-------|-----|
|
|
| **Returns** | Full content | URLs only |
|
|
| **Speed** | Slower | Faster |
|
|
| **Best for** | RAG, deep analysis, documentation | Site structure discovery, URL collection |
|
|
|
|
**Use Crawl when:**
|
|
- Full content extraction needed
|
|
- Building RAG systems
|
|
- Processing paginated/nested content
|
|
- Integration with knowledge bases
|
|
|
|
**Use Map when:**
|
|
- Quick site structure discovery
|
|
- URL collection without content
|
|
- Planning before crawling
|
|
- Sitemap generation
|
|
|
|
---
|
|
|
|
## Key Parameters
|
|
|
|
| Parameter | Type | Default | Description |
|
|
|-----------|------|---------|-------------|
|
|
| `url` | string | Required | Root URL to begin |
|
|
| `max_depth` | integer | 1 | Levels deep to crawl (1-5). **Start with 1-2** |
|
|
| `max_breadth` | integer | 20 | Links per page. 50-100 for focused crawls |
|
|
| `limit` | integer | 50 | Total pages cap |
|
|
| `instructions` | string | null | Natural language guidance (2 credits/10 pages) |
|
|
| `chunks_per_source` | integer | 3 | Chunks per page (1-5). Only with `instructions` |
|
|
| `extract_depth` | enum | `"basic"` | `"basic"` (1 credit/5 URLs) or `"advanced"` (2 credits/5 URLs) |
|
|
| `format` | enum | `"markdown"` | `"markdown"` or `"text"` |
|
|
| `select_paths` | array | null | Regex patterns to include |
|
|
| `exclude_paths` | array | null | Regex patterns to exclude |
|
|
| `select_domains` | array | null | Regex for domains to include |
|
|
| `exclude_domains` | array | null | Regex for domains to exclude |
|
|
| `allow_external` | boolean | true (crawl) / false (map) | Include external domain links |
|
|
| `include_images` | boolean | false | Include images (crawl only) |
|
|
| `include_favicon` | boolean | false | Include favicon URL (crawl only) |
|
|
| `include_usage` | boolean | false | Include credit usage info |
|
|
| `timeout` | float | 150 | Max wait (10-150 seconds) |
|
|
|
|
---
|
|
|
|
## Instructions and Chunks
|
|
|
|
Use `instructions` and `chunks_per_source` for semantic focus and token optimization:
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://docs.example.com",
|
|
max_depth=2,
|
|
instructions="Find all documentation about authentication and security",
|
|
chunks_per_source=3 # Only top 3 relevant chunks per page
|
|
)
|
|
```
|
|
|
|
**Key benefits:**
|
|
- `instructions` guides crawler semantically, focusing on relevant content
|
|
- `chunks_per_source` returns only relevant snippets (max 500 chars each)
|
|
- Prevents context window explosion in agentic use cases
|
|
- Chunks appear in `raw_content` as: `<chunk 1> [...] <chunk 2> [...] <chunk 3>`
|
|
|
|
**Note:** `chunks_per_source` only works when `instructions` is provided.
|
|
|
|
---
|
|
|
|
## Path and Domain Filtering
|
|
|
|
### Path patterns (regex)
|
|
|
|
```python
|
|
# Target specific sections
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
|
|
exclude_paths=["/blog/.*", "/changelog/.*", "/private/.*"]
|
|
)
|
|
|
|
# Paginated content
|
|
response = client.crawl(
|
|
url="https://example.com/blog",
|
|
max_depth=2,
|
|
select_paths=["/blog/.*", "/blog/page/.*"],
|
|
exclude_paths=["/blog/tag/.*"]
|
|
)
|
|
```
|
|
|
|
### Domain control (regex)
|
|
|
|
```python
|
|
# Stay within subdomain
|
|
response = client.crawl(
|
|
url="https://docs.example.com",
|
|
select_domains=["^docs.example.com$"],
|
|
max_depth=2
|
|
)
|
|
|
|
# Exclude specific domains
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
exclude_domains=["^ads.example.com$", "^tracking.example.com$"]
|
|
)
|
|
```
|
|
|
|
---
|
|
|
|
## Use Cases
|
|
|
|
### 1. Deep/Unlinked Content
|
|
Deeply nested pages, paginated archives, internal search-only content.
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
max_depth=3,
|
|
max_breadth=50,
|
|
limit=200,
|
|
select_paths=["/blog/.*", "/changelog/.*"],
|
|
exclude_paths=["/private/.*", "/admin/.*"]
|
|
)
|
|
```
|
|
|
|
### 2. Documentation/Structured Content
|
|
Documentation, changelogs, FAQs with nonstandard markup.
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://docs.example.com",
|
|
max_depth=2,
|
|
extract_depth="advanced",
|
|
select_paths=["/docs/.*"]
|
|
)
|
|
```
|
|
|
|
### 3. Multi-modal/Cross-referencing
|
|
Combining information from multiple sections.
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
max_depth=2,
|
|
instructions="Find all documentation pages that link to API reference docs",
|
|
extract_depth="advanced"
|
|
)
|
|
```
|
|
|
|
### 4. Rapidly Changing Content
|
|
API docs, product announcements, news sections.
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://api.example.com",
|
|
max_depth=1,
|
|
max_breadth=100
|
|
)
|
|
```
|
|
|
|
### 5. RAG/Knowledge Base Integration
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://docs.example.com",
|
|
max_depth=2,
|
|
extract_depth="advanced",
|
|
include_images=True,
|
|
instructions="Extract all technical documentation and code examples"
|
|
)
|
|
```
|
|
|
|
### 6. Compliance/Auditing
|
|
Comprehensive content analysis for legal checks.
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
max_depth=3,
|
|
max_breadth=100,
|
|
limit=1000,
|
|
extract_depth="advanced",
|
|
instructions="Find all mentions of GDPR and data protection policies"
|
|
)
|
|
```
|
|
|
|
### 7. Known URL Patterns
|
|
Sitemap-based crawling, section-specific extraction.
|
|
|
|
```python
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
max_depth=1,
|
|
select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
|
|
exclude_paths=["/private/.*", "/admin/.*"]
|
|
)
|
|
```
|
|
|
|
---
|
|
|
|
## Map then Extract Pattern
|
|
|
|
Consider using Map before Crawl/Extract to plan your strategy:
|
|
|
|
1. **Use Map** to get site structure
|
|
2. **Analyze** paths and patterns
|
|
3. **Configure** Crawl or Extract with discovered paths
|
|
4. **Execute** focused extraction
|
|
|
|
```python
|
|
# Step 1: Map to discover structure
|
|
map_result = client.map(
|
|
url="https://docs.example.com",
|
|
max_depth=2,
|
|
instructions="Find all API docs and guides"
|
|
)
|
|
|
|
# Step 2: Filter discovered URLs
|
|
api_docs = [url for url in map_result["results"] if "/api/" in url]
|
|
guides = [url for url in map_result["results"] if "/guides/" in url]
|
|
print(f"Found {len(api_docs)} API docs, {len(guides)} guides")
|
|
|
|
# Step 3: Extract from filtered URLs
|
|
target_urls = api_docs + guides
|
|
response = client.extract(
|
|
urls=target_urls[:20], # Max 20 per extract call
|
|
extract_depth="advanced",
|
|
query="API endpoints and usage examples",
|
|
chunks_per_source=3
|
|
)
|
|
```
|
|
|
|
**Benefits:**
|
|
- Discover site structure before committing to full crawl
|
|
- Identify relevant path patterns
|
|
- Avoid unnecessary extraction
|
|
- More control over what gets extracted
|
|
|
|
---
|
|
|
|
## Performance Optimization
|
|
|
|
### Depth vs Performance
|
|
|
|
Each depth level increases crawl time exponentially:
|
|
|
|
| Depth | Typical Pages | Time |
|
|
|-------|---------------|------|
|
|
| 1 | 10-50 | Seconds |
|
|
| 2 | 50-500 | Minutes |
|
|
| 3 | 500-5000 | Many minutes |
|
|
|
|
**Best practices:**
|
|
- Start with `max_depth=1` and increase only if needed
|
|
- Use `max_breadth` to control horizontal expansion
|
|
- Set appropriate `limit` to prevent excessive crawling
|
|
- Process results incrementally rather than waiting for full crawl
|
|
|
|
### Rate Limiting
|
|
|
|
- Respect site's robots.txt
|
|
- Monitor API usage and limits
|
|
- Use appropriate error handling for rate limits
|
|
- Consider delays between large crawl operations
|
|
|
|
### Conservative vs Comprehensive
|
|
|
|
```python
|
|
# Conservative (start here)
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
max_depth=1,
|
|
max_breadth=20,
|
|
limit=20
|
|
)
|
|
|
|
# Comprehensive (use carefully)
|
|
response = client.crawl(
|
|
url="https://example.com",
|
|
max_depth=3,
|
|
max_breadth=100,
|
|
limit=500
|
|
)
|
|
```
|
|
|
|
---
|
|
|
|
## Common Pitfalls
|
|
|
|
| Problem | Impact | Solution |
|
|
|---------|--------|----------|
|
|
| Excessive depth (`max_depth=4+`) | Exponential time, unnecessary pages | Start with 1-2, increase if needed |
|
|
| Unfocused crawling | Wasted resources, irrelevant content, context explosion | Use `instructions` to focus semantically |
|
|
| Missing limits | Runaway crawls, unexpected costs | Always set reasonable `limit` value |
|
|
| Ignoring `failed_results` | Incomplete data, missed content | Monitor and adjust parameters |
|
|
| Full content without chunks | Context window explosion | Use `instructions` + `chunks_per_source` |
|
|
|
|
---
|
|
|
|
## Response Fields
|
|
|
|
### Crawl Response
|
|
|
|
| Field | Description |
|
|
|-------|-------------|
|
|
| `base_url` | The URL you started the crawl from |
|
|
| `results` | List of crawled pages |
|
|
| `results[].url` | Page URL |
|
|
| `results[].raw_content` | Extracted content (or chunks if instructions provided) |
|
|
| `results[].images` | Image URLs extracted from the page |
|
|
| `results[].favicon` | Favicon URL (if `include_favicon=True`) |
|
|
| `response_time` | Time in seconds |
|
|
| `request_id` | Unique identifier for support reference |
|
|
|
|
### Map Response
|
|
|
|
| Field | Description |
|
|
|-------|-------------|
|
|
| `base_url` | The URL you started the mapping from |
|
|
| `results` | List of discovered URLs |
|
|
| `response_time` | Time in seconds |
|
|
| `request_id` | Unique identifier for support reference |
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
1. **Use instructions and chunks_per_source** for focused, relevant results in agentic use cases
|
|
2. **Start conservative** (`max_depth=1`, `max_breadth=20`) and scale up as needed
|
|
3. **Use path patterns** to focus crawling on relevant content
|
|
4. **Choose appropriate extract_depth** based on content complexity
|
|
5. **Always set a limit** to prevent runaway crawls and unexpected costs
|
|
6. **Monitor failed_results** and adjust patterns accordingly
|
|
7. **Use Map first** to understand site structure before committing to full crawl
|
|
8. **Implement error handling** for rate limits and failures
|
|
9. **Respect robots.txt** and site policies
|
|
|
|
> Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.
|
|
|
|
For more details, see the [full API reference](https://docs.tavily.com/documentation/api-reference/endpoint/crawl)
|