Add skills
This commit is contained in:
@@ -0,0 +1,357 @@
|
||||
# Crawl & Map API Reference
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Crawl vs Map](#crawl-vs-map)
|
||||
- [Key Parameters](#key-parameters)
|
||||
- [Instructions and Chunks](#instructions-and-chunks)
|
||||
- [Path and Domain Filtering](#path-and-domain-filtering)
|
||||
- [Use Cases](#use-cases)
|
||||
- [Map then Extract Pattern](#map-then-extract-pattern)
|
||||
- [Performance Optimization](#performance-optimization)
|
||||
- [Common Pitfalls](#common-pitfalls)
|
||||
- [Response Fields](#response-fields)
|
||||
- [Summary](#summary)
|
||||
|
||||
---
|
||||
|
||||
## Crawl vs Map
|
||||
|
||||
| Feature | Crawl | Map |
|
||||
|---------|-------|-----|
|
||||
| **Returns** | Full content | URLs only |
|
||||
| **Speed** | Slower | Faster |
|
||||
| **Best for** | RAG, deep analysis, documentation | Site structure discovery, URL collection |
|
||||
|
||||
**Use Crawl when:**
|
||||
- Full content extraction needed
|
||||
- Building RAG systems
|
||||
- Processing paginated/nested content
|
||||
- Integration with knowledge bases
|
||||
|
||||
**Use Map when:**
|
||||
- Quick site structure discovery
|
||||
- URL collection without content
|
||||
- Planning before crawling
|
||||
- Sitemap generation
|
||||
|
||||
---
|
||||
|
||||
## Key Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `url` | string | Required | Root URL to begin |
|
||||
| `max_depth` | integer | 1 | Levels deep to crawl (1-5). **Start with 1-2** |
|
||||
| `max_breadth` | integer | 20 | Links per page. 50-100 for focused crawls |
|
||||
| `limit` | integer | 50 | Total pages cap |
|
||||
| `instructions` | string | null | Natural language guidance (2 credits/10 pages) |
|
||||
| `chunks_per_source` | integer | 3 | Chunks per page (1-5). Only with `instructions` |
|
||||
| `extract_depth` | enum | `"basic"` | `"basic"` (1 credit/5 URLs) or `"advanced"` (2 credits/5 URLs) |
|
||||
| `format` | enum | `"markdown"` | `"markdown"` or `"text"` |
|
||||
| `select_paths` | array | null | Regex patterns to include |
|
||||
| `exclude_paths` | array | null | Regex patterns to exclude |
|
||||
| `select_domains` | array | null | Regex for domains to include |
|
||||
| `exclude_domains` | array | null | Regex for domains to exclude |
|
||||
| `allow_external` | boolean | true (crawl) / false (map) | Include external domain links |
|
||||
| `include_images` | boolean | false | Include images (crawl only) |
|
||||
| `include_favicon` | boolean | false | Include favicon URL (crawl only) |
|
||||
| `include_usage` | boolean | false | Include credit usage info |
|
||||
| `timeout` | float | 150 | Max wait (10-150 seconds) |
|
||||
|
||||
---
|
||||
|
||||
## Instructions and Chunks
|
||||
|
||||
Use `instructions` and `chunks_per_source` for semantic focus and token optimization:
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://docs.example.com",
|
||||
max_depth=2,
|
||||
instructions="Find all documentation about authentication and security",
|
||||
chunks_per_source=3 # Only top 3 relevant chunks per page
|
||||
)
|
||||
```
|
||||
|
||||
**Key benefits:**
|
||||
- `instructions` guides crawler semantically, focusing on relevant content
|
||||
- `chunks_per_source` returns only relevant snippets (max 500 chars each)
|
||||
- Prevents context window explosion in agentic use cases
|
||||
- Chunks appear in `raw_content` as: `<chunk 1> [...] <chunk 2> [...] <chunk 3>`
|
||||
|
||||
**Note:** `chunks_per_source` only works when `instructions` is provided.
|
||||
|
||||
---
|
||||
|
||||
## Path and Domain Filtering
|
||||
|
||||
### Path patterns (regex)
|
||||
|
||||
```python
|
||||
# Target specific sections
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
|
||||
exclude_paths=["/blog/.*", "/changelog/.*", "/private/.*"]
|
||||
)
|
||||
|
||||
# Paginated content
|
||||
response = client.crawl(
|
||||
url="https://example.com/blog",
|
||||
max_depth=2,
|
||||
select_paths=["/blog/.*", "/blog/page/.*"],
|
||||
exclude_paths=["/blog/tag/.*"]
|
||||
)
|
||||
```
|
||||
|
||||
### Domain control (regex)
|
||||
|
||||
```python
|
||||
# Stay within subdomain
|
||||
response = client.crawl(
|
||||
url="https://docs.example.com",
|
||||
select_domains=["^docs.example.com$"],
|
||||
max_depth=2
|
||||
)
|
||||
|
||||
# Exclude specific domains
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
exclude_domains=["^ads.example.com$", "^tracking.example.com$"]
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Use Cases
|
||||
|
||||
### 1. Deep/Unlinked Content
|
||||
Deeply nested pages, paginated archives, internal search-only content.
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
max_depth=3,
|
||||
max_breadth=50,
|
||||
limit=200,
|
||||
select_paths=["/blog/.*", "/changelog/.*"],
|
||||
exclude_paths=["/private/.*", "/admin/.*"]
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Documentation/Structured Content
|
||||
Documentation, changelogs, FAQs with nonstandard markup.
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://docs.example.com",
|
||||
max_depth=2,
|
||||
extract_depth="advanced",
|
||||
select_paths=["/docs/.*"]
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Multi-modal/Cross-referencing
|
||||
Combining information from multiple sections.
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
max_depth=2,
|
||||
instructions="Find all documentation pages that link to API reference docs",
|
||||
extract_depth="advanced"
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Rapidly Changing Content
|
||||
API docs, product announcements, news sections.
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://api.example.com",
|
||||
max_depth=1,
|
||||
max_breadth=100
|
||||
)
|
||||
```
|
||||
|
||||
### 5. RAG/Knowledge Base Integration
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://docs.example.com",
|
||||
max_depth=2,
|
||||
extract_depth="advanced",
|
||||
include_images=True,
|
||||
instructions="Extract all technical documentation and code examples"
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Compliance/Auditing
|
||||
Comprehensive content analysis for legal checks.
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
max_depth=3,
|
||||
max_breadth=100,
|
||||
limit=1000,
|
||||
extract_depth="advanced",
|
||||
instructions="Find all mentions of GDPR and data protection policies"
|
||||
)
|
||||
```
|
||||
|
||||
### 7. Known URL Patterns
|
||||
Sitemap-based crawling, section-specific extraction.
|
||||
|
||||
```python
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
max_depth=1,
|
||||
select_paths=["/docs/.*", "/api/.*", "/guides/.*"],
|
||||
exclude_paths=["/private/.*", "/admin/.*"]
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Map then Extract Pattern
|
||||
|
||||
Consider using Map before Crawl/Extract to plan your strategy:
|
||||
|
||||
1. **Use Map** to get site structure
|
||||
2. **Analyze** paths and patterns
|
||||
3. **Configure** Crawl or Extract with discovered paths
|
||||
4. **Execute** focused extraction
|
||||
|
||||
```python
|
||||
# Step 1: Map to discover structure
|
||||
map_result = client.map(
|
||||
url="https://docs.example.com",
|
||||
max_depth=2,
|
||||
instructions="Find all API docs and guides"
|
||||
)
|
||||
|
||||
# Step 2: Filter discovered URLs
|
||||
api_docs = [url for url in map_result["results"] if "/api/" in url]
|
||||
guides = [url for url in map_result["results"] if "/guides/" in url]
|
||||
print(f"Found {len(api_docs)} API docs, {len(guides)} guides")
|
||||
|
||||
# Step 3: Extract from filtered URLs
|
||||
target_urls = api_docs + guides
|
||||
response = client.extract(
|
||||
urls=target_urls[:20], # Max 20 per extract call
|
||||
extract_depth="advanced",
|
||||
query="API endpoints and usage examples",
|
||||
chunks_per_source=3
|
||||
)
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- Discover site structure before committing to full crawl
|
||||
- Identify relevant path patterns
|
||||
- Avoid unnecessary extraction
|
||||
- More control over what gets extracted
|
||||
|
||||
---
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Depth vs Performance
|
||||
|
||||
Each depth level increases crawl time exponentially:
|
||||
|
||||
| Depth | Typical Pages | Time |
|
||||
|-------|---------------|------|
|
||||
| 1 | 10-50 | Seconds |
|
||||
| 2 | 50-500 | Minutes |
|
||||
| 3 | 500-5000 | Many minutes |
|
||||
|
||||
**Best practices:**
|
||||
- Start with `max_depth=1` and increase only if needed
|
||||
- Use `max_breadth` to control horizontal expansion
|
||||
- Set appropriate `limit` to prevent excessive crawling
|
||||
- Process results incrementally rather than waiting for full crawl
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
- Respect site's robots.txt
|
||||
- Monitor API usage and limits
|
||||
- Use appropriate error handling for rate limits
|
||||
- Consider delays between large crawl operations
|
||||
|
||||
### Conservative vs Comprehensive
|
||||
|
||||
```python
|
||||
# Conservative (start here)
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
max_depth=1,
|
||||
max_breadth=20,
|
||||
limit=20
|
||||
)
|
||||
|
||||
# Comprehensive (use carefully)
|
||||
response = client.crawl(
|
||||
url="https://example.com",
|
||||
max_depth=3,
|
||||
max_breadth=100,
|
||||
limit=500
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
| Problem | Impact | Solution |
|
||||
|---------|--------|----------|
|
||||
| Excessive depth (`max_depth=4+`) | Exponential time, unnecessary pages | Start with 1-2, increase if needed |
|
||||
| Unfocused crawling | Wasted resources, irrelevant content, context explosion | Use `instructions` to focus semantically |
|
||||
| Missing limits | Runaway crawls, unexpected costs | Always set reasonable `limit` value |
|
||||
| Ignoring `failed_results` | Incomplete data, missed content | Monitor and adjust parameters |
|
||||
| Full content without chunks | Context window explosion | Use `instructions` + `chunks_per_source` |
|
||||
|
||||
---
|
||||
|
||||
## Response Fields
|
||||
|
||||
### Crawl Response
|
||||
|
||||
| Field | Description |
|
||||
|-------|-------------|
|
||||
| `base_url` | The URL you started the crawl from |
|
||||
| `results` | List of crawled pages |
|
||||
| `results[].url` | Page URL |
|
||||
| `results[].raw_content` | Extracted content (or chunks if instructions provided) |
|
||||
| `results[].images` | Image URLs extracted from the page |
|
||||
| `results[].favicon` | Favicon URL (if `include_favicon=True`) |
|
||||
| `response_time` | Time in seconds |
|
||||
| `request_id` | Unique identifier for support reference |
|
||||
|
||||
### Map Response
|
||||
|
||||
| Field | Description |
|
||||
|-------|-------------|
|
||||
| `base_url` | The URL you started the mapping from |
|
||||
| `results` | List of discovered URLs |
|
||||
| `response_time` | Time in seconds |
|
||||
| `request_id` | Unique identifier for support reference |
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
1. **Use instructions and chunks_per_source** for focused, relevant results in agentic use cases
|
||||
2. **Start conservative** (`max_depth=1`, `max_breadth=20`) and scale up as needed
|
||||
3. **Use path patterns** to focus crawling on relevant content
|
||||
4. **Choose appropriate extract_depth** based on content complexity
|
||||
5. **Always set a limit** to prevent runaway crawls and unexpected costs
|
||||
6. **Monitor failed_results** and adjust patterns accordingly
|
||||
7. **Use Map first** to understand site structure before committing to full crawl
|
||||
8. **Implement error handling** for rate limits and failures
|
||||
9. **Respect robots.txt** and site policies
|
||||
|
||||
> Crawling is powerful but resource-intensive. Focus your crawls, start small, monitor results, and scale gradually based on actual needs.
|
||||
|
||||
For more details, see the [full API reference](https://docs.tavily.com/documentation/api-reference/endpoint/crawl)
|
||||
Reference in New Issue
Block a user