giteadmin/.dotfiles

Fork 0

Files

Jonathan Agmon c09d9151ca Add skills

2026-03-22 23:21:49 +02:00

7.2 KiB

Raw Permalink Blame History

Extract API Reference

Extraction Approaches
Key Parameters
Query and Chunks
Extract Depth
Advanced Filtering Strategies
Response Fields
Summary

Extraction Approaches

Search with include_raw_content

Get search results and content in one call:

response = client.search(
    query="AI healthcare applications",
    include_raw_content=True,
    max_results=5
)

When to use:

Quick prototyping
Simple queries where search results are likely relevant
Single API call convenience

Direct Extract API (Recommended)

Two-step pattern for more control:

# Step 1: Search
search_results = client.search(
    query="Python async best practices",
    max_results=10
)

# Step 2: Filter by relevance score
relevant_urls = [
    r["url"] for r in search_results["results"]
    if r["score"] > 0.5
]

# Step 3: Extract with targeting
extracted = client.extract(
    urls=relevant_urls[:20],
    query="async patterns and concurrency",  # Reranks chunks
    chunks_per_source=3  # Prevents context explosion
)

for item in extracted["results"]:
    print(f"URL: {item['url']}")
    print(f"Content: {item['raw_content'][:500]}...")

When to use:

You want control over which URLs to extract
You need to filter/curate URLs before extraction
You want targeted extraction with query and chunks_per_source

Key Parameters

Parameter	Type	Default	Description
`urls`	string/array	Required	Single URL or list (max 20)
`extract_depth`	enum	`"basic"`	`"basic"` or `"advanced"` (for complex/JS pages)
`query`	string	null	Reranks chunks by relevance to this query
`chunks_per_source`	integer	3	Chunks per source (1-5, max 500 chars each). Only with `query`
`format`	enum	`"markdown"`	Output: `"markdown"` or `"text"`
`include_images`	boolean	false	Include image URLs
`include_favicon`	boolean	false	Include favicon URL
`include_usage`	boolean	false	Include credit consumption data in response
`timeout`	float	varies	Max wait time (1.0-60.0 seconds)

Query and Chunks

Use query and chunks_per_source to get only relevant content and prevent context window explosion:

extracted = client.extract(
    urls=[
        "https://example.com/ml-healthcare",
        "https://example.com/ai-diagnostics",
        "https://example.com/medical-ai"
    ],
    query="AI diagnostic tools accuracy",
    chunks_per_source=2  # 2 most relevant chunks per URL
)

When to use query:

To extract only relevant portions of long documents
When you need focused content instead of full page extraction
For targeted information retrieval from specific URLs

Key benefits of chunks_per_source:

Returns only relevant snippets (max 500 chars each) instead of full page
Chunks appear in raw_content as: <chunk 1> [...] <chunk 2> [...] <chunk 3>
Prevents context window from exploding in agentic use cases

Note: chunks_per_source only works when query is provided.

Extract Depth

Depth	When to use
`basic` (default)	Simple text extraction, faster
`advanced`	Dynamic/JS-rendered pages, tables, structured data, embedded media

# For complex pages
extracted = client.extract(
    urls=["https://example.com/complex-page"],
    extract_depth="advanced"
)

Fallback strategy: If basic fails, retry with advanced:

result = client.extract(urls=[url], extract_depth="basic")
if url in [f["url"] for f in result.get("failed_results", [])]:
    result = client.extract(urls=[url], extract_depth="advanced")

Timeout tuning: If latency isn't critical, set timeout=60.0 for better success on slow pages.

Advanced Filtering Strategies

Beyond query-based filtering, consider these approaches before extraction:

Strategy	When to use
Score-based	Filter search results by relevance score
Domain-based	Filter by trusted domains
Re-ranking	Use dedicated re-ranking models for precision
LLM-based	Let an LLM assess relevance before extraction
Clustering	Group similar documents, extract from clusters

Optimal Workflow

Search to discover relevant URLs
Filter by relevance score, domain, or content snippet
Re-rank if needed using specialized models
Extract from top-ranked sources with query and chunks_per_source
Validate extracted content quality
Process for your AI application

Example: Complete Pipeline

import asyncio
from tavily import AsyncTavilyClient

client = AsyncTavilyClient()

async def content_pipeline(topic):
    # 1. Search with sub-queries for breadth
    queries = [
        f"{topic} overview",
        f"{topic} best practices",
        f"{topic} recent developments"
    ]
    responses = await asyncio.gather(
        *(client.search(q, search_depth="advanced", max_results=10) for q in queries)
    )

    # 2. Filter and aggregate by score
    urls = []
    for response in responses:
        urls.extend([
            r['url'] for r in response['results']
            if r['score'] > 0.5
        ])

    # 3. Deduplicate
    urls = list(set(urls))[:20]

    # 4. Extract with error handling
    extracted = await asyncio.gather(
        *(client.extract(urls=[url], query=topic, extract_depth="advanced")
          for url in urls),
        return_exceptions=True
    )

    # 5. Filter successful extractions
    return [e for e in extracted if not isinstance(e, Exception)]

asyncio.run(content_pipeline("machine learning in healthcare"))

Response Fields

Top-level response:

Field	Description
`results`	Array of successfully extracted content
`failed_results`	Array of URLs that failed extraction
`response_time`	Time in seconds
`request_id`	Unique identifier for support reference
`usage`	Credit usage info (if `include_usage=True`)

Each result object:

Field	Description
`url`	The URL extracted from
`raw_content`	Full content, or top-ranked chunks joined by `[...]` when `query` provided
`images`	Array of image URLs (if `include_images=true`)
`favicon`	Favicon URL (if `include_favicon=true`)

Each failed_results object:

Field	Description
`url`	The URL that failed
`error`	Error message

Summary

Use query and chunks_per_source for targeted, focused extraction
Choose Extract API when you need control over which URLs to extract from
Filter URLs before extraction using scores, re-ranking, or domain trust
Choose appropriate extract_depth based on content complexity
Process URLs concurrently with async operations for better performance
Implement error handling to manage failed extractions gracefully
Validate extracted content before downstream processing

For more details, see the full API reference

7.2 KiB Raw Permalink Blame History