Test Case Templates
Copy-paste templates for creating test cases based on skill type.
Table of Contents
- File Transform Skills
- Code Generation Skills
- Workflow Skills
- Tool Integration Skills
- Documentation Skills
- Tips for Customizing Templates
- Validation Checklist
File Transform Skills
For skills that convert, parse, or reformat files.
Example: CSV to JSON Converter
{
"skill_name": "csv-to-json",
"description": "Converts CSV files to JSON format with data type handling",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "Convert the CSV file at ./data/sales.csv to JSON and save it as ./output/sales.json. The CSV has headers: date, product, quantity, price. Make sure dates are in ISO 8601 format.",
"expected_output": "A JSON file at ./output/sales.json containing an array of objects, each representing a row from the CSV with properly formatted dates.",
"assertions": [
"JSON file exists at ./output/sales.json",
"File contains valid JSON array",
"All dates are in ISO 8601 format (YYYY-MM-DD)",
"Numeric fields (quantity, price) are numbers, not strings"
]
},
{
"id": 2,
"name": "edge-case-large-file",
"type": "edge",
"prompt": "Convert a CSV file with 50,000 rows at ./data/export.csv to JSON. The file contains some rows with missing values in the 'email' column and special characters (emojis) in the 'notes' column.",
"expected_output": "JSON file is created successfully, handling missing values as null or empty strings, and preserving special characters correctly.",
"assertions": [
"Large file is processed without memory errors",
"Missing email values are handled (null or empty string)",
"Special characters and emojis are preserved in output",
"All 50,000 rows are converted"
]
},
{
"id": 3,
"name": "varied-phrasing-casual",
"type": "variation",
"prompt": "Hey, I've got this spreadsheet data/sales.csv that I need to turn into JSON. Can you do that for me? Also, the dates are in MM/DD/YYYY format right now - can you make them proper ISO format?",
"expected_output": "Same as common case: JSON file with ISO 8601 dates",
"assertions": [
"Skill triggers with casual language ('turn into', 'proper ISO format')",
"Dates are converted from MM/DD/YYYY to ISO 8601 (YYYY-MM-DD)",
"Output matches common case results"
]
}
]
}
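The assertions in the common case are concrete enough to check with a script. The following is a minimal sketch, not part of any eval harness; the function name `check_sales_json` is hypothetical, and it assumes the output is an array of objects with `date`, `quantity`, and `price` keys as described in the prompt.

```python
import json
import re
from pathlib import Path

# Matches the YYYY-MM-DD shape only; it does not reject impossible dates like 2024-13-40.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_sales_json(path):
    """Return a list of failed assertion messages (empty means all passed)."""
    file = Path(path)
    if not file.is_file():
        return [f"JSON file does not exist at {path}"]
    try:
        rows = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return ["File does not contain valid JSON"]
    if not isinstance(rows, list):
        return ["Top-level JSON value is not an array"]

    failures = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            failures.append(f"row {i}: not an object")
            continue
        if not ISO_DATE.match(str(row.get("date", ""))):
            failures.append(f"row {i}: date is not YYYY-MM-DD")
        for field in ("quantity", "price"):
            value = row.get(field)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                failures.append(f"row {i}: {field} is not a number")
    return failures
```

Calling `check_sales_json("./output/sales.json")` after a run returns an empty list when all four assertions hold.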
Template: File Transform Skill
{
"skill_name": "YOUR_SKILL_NAME",
"description": "[Describe what file transformations this skill does]",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "[Realistic request with specific file paths and formats]",
"expected_output": "[What the transformed file should contain]",
"assertions": [
"Output file exists at expected location",
"File is valid [format: JSON, XML, etc.]",
"Data is correctly transformed",
"Specific requirements met (encoding, formatting, etc.)"
]
},
{
"id": 2,
"name": "edge-case",
"type": "edge",
"prompt": "[Large files, special characters, missing data, malformed input]",
"expected_output": "[Graceful handling or error with helpful message]",
"assertions": [
"Edge case is handled appropriately",
"No data loss or corruption",
"Error messages are helpful if transformation fails"
]
},
{
"id": 3,
"name": "varied-phrasing",
"type": "variation",
"prompt": "[Same request with casual language, different wording]",
"expected_output": "[Same as common case]",
"assertions": [
"Skill triggers with varied phrasing",
"Output matches common case",
"Implicit requirements are understood"
]
}
]
}
Code Generation Skills
For skills that generate code, scripts, or templates.
Example: Python Script Generator
{
"skill_name": "python-script-gen",
"description": "Generates Python scripts for file operations and data processing",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "Create a Python script that recursively finds all .log files in /var/log, compresses them with gzip if they're older than 30 days, and moves them to /archive. Handle permission errors gracefully and log all actions to cleanup.log.",
"expected_output": "A Python script that implements the log cleanup functionality with proper error handling, logging, and follows Python best practices.",
"assertions": [
"Script is syntactically valid Python",
"Implements recursive file search",
"Compresses files older than 30 days",
"Handles permission errors gracefully",
"Logs actions to cleanup.log"
]
},
{
"id": 2,
"name": "edge-case-empty-directory",
"type": "edge",
"prompt": "Create a Python script that processes all CSV files in ./data and generates a summary report. What should it do if the directory is empty or doesn't exist?",
"expected_output": "Script handles empty/non-existent directories gracefully with informative error messages and doesn't crash.",
"assertions": [
"Script checks if directory exists before processing",
"Handles empty directory case gracefully",
"Provides informative error message",
"Returns appropriate exit code"
]
},
{
"id": 3,
"name": "varied-phrasing-brief",
"type": "variation",
"prompt": "Write me a python script to backup my photos. It should copy everything from ~/Pictures to ~/Backups/photos with today's date in the folder name. Skip duplicates if possible.",
"expected_output": "Python backup script with timestamped folder and duplicate detection",
"assertions": [
"Skill works with brief, casual description",
"Generates complete script without requiring clarification",
"Handles date formatting for folder name",
"Implements duplicate detection"
]
}
]
}
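An assertion such as "Script is syntactically valid Python" can be checked without running the generated script. The sketch below uses the standard `ast` module; the helper names `is_valid_python` and `called_names` are hypothetical, and looking for a call such as `walk` or `rglob` is only a rough heuristic for "Implements recursive file search".

```python
import ast


def is_valid_python(source):
    """Return True if the source parses as Python, without executing it."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def called_names(source):
    """Collect the names of functions and methods that the source calls."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)  # e.g. "walk" in os.walk(...)
            elif isinstance(func, ast.Name):
                names.add(func.id)  # e.g. "print" in print(...)
    return names
```

For example, `called_names(script) & {"walk", "rglob"}` being non-empty suggests the script walks the directory tree, though a human or model grader should still confirm the behavior.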
Template: Code Generation Skill
{
"skill_name": "YOUR_SKILL_NAME",
"description": "[Describe what code this skill generates]",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "[Detailed request with specific requirements and constraints]",
"expected_output": "[Description of the generated code]",
"assertions": [
"Code is syntactically valid",
"Implements all requested features",
"Follows language best practices",
"Includes error handling where appropriate",
"Is well-structured and readable"
]
},
{
"id": 2,
"name": "edge-case",
"type": "edge",
"prompt": "[Ambiguous requirements, missing data, error conditions]",
"expected_output": "[Code handles edge cases or asks for clarification]",
"assertions": [
"Edge cases are handled appropriately",
"Error conditions are managed",
"Code doesn't crash on unexpected input"
]
},
{
"id": 3,
"name": "varied-phrasing",
"type": "variation",
"prompt": "[Same request with minimal details or casual language]",
"expected_output": "[Same quality as common case]",
"assertions": [
"Skill fills in reasonable defaults",
"Generates complete solution",
"Output quality matches detailed request"
]
}
]
}
Workflow Skills
For skills that guide multi-step processes.
Example: Release Workflow
{
"skill_name": "release-workflow",
"description": "Guides the complete software release process",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "I need to create a new release for my Node.js project. The repo is at ~/projects/myapp. We're currently at version 1.2.3 and this is a minor feature release (1.3.0). I need to update the version, create a changelog entry, commit, tag, and push to GitHub.",
"expected_output": "Step-by-step guide covering: version bump in package.json, CHANGELOG.md update, commit creation, annotated tag, and push commands. Should provide copy-pasteable commands.",
"assertions": [
"All release steps are covered",
"Commands are copy-pasteable",
"Version numbers are consistent",
"Validation steps are included",
"Provides rollback guidance"
]
},
{
"id": 2,
"name": "edge-case-dirty-worktree",
"type": "edge",
"prompt": "Help me release version 2.0.0 of my project. By the way, I have some uncommitted changes in my working directory that I'm not sure about.",
"expected_output": "Workflow detects dirty worktree, suggests stashing or committing changes before proceeding with release. Provides commands to handle the situation.",
"assertions": [
"Detects uncommitted changes",
"Warns about dirty worktree",
"Provides options: stash, commit, or abort",
"Doesn't proceed without addressing the issue"
]
},
{
"id": 3,
"name": "varied-phrasing-urgent",
"type": "variation",
"prompt": "Need to push out v2.1.0 ASAP. Hotfix for critical bug. What's the fastest way to get this released?",
"expected_output": "Accelerated workflow prioritizing speed while maintaining essential safety checks",
"assertions": [
"Recognizes urgency from language ('ASAP', 'fastest')",
"Still includes critical safety checks",
"Prioritizes speed without skipping validation",
"Provides streamlined command sequence"
]
}
]
}
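The dirty-worktree edge case relies on detecting uncommitted changes. One way to set up or verify that state is `git status --porcelain`, which prints one line per changed path and nothing for a clean tree. The sketch below is an illustration only; the helper names `parse_porcelain` and `worktree_is_dirty` are hypothetical.

```python
import subprocess


def parse_porcelain(output):
    """Split `git status --porcelain` output into (status, path) pairs."""
    entries = []
    for line in output.splitlines():
        if line.strip():
            # Porcelain v1 lines are "XY <path>": two status characters,
            # one space, then the path.
            entries.append((line[:2], line[3:]))
    return entries


def worktree_is_dirty(repo_path):
    """Return True if the repository at repo_path has uncommitted changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(parse_porcelain(result.stdout))
```

Running `worktree_is_dirty` against the fixture repository before the eval confirms that the test really starts from the unusual state the prompt describes.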
Template: Workflow Skill
{
"skill_name": "YOUR_SKILL_NAME",
"description": "[Describe what workflow this skill guides]",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "[Standard workflow request with clear requirements]",
"expected_output": "[Complete step-by-step guide with all necessary steps]",
"assertions": [
"All workflow steps are included",
"Steps are in logical order",
"Validation points are provided",
"Commands are copy-pasteable where applicable",
"Clear success criteria defined"
]
},
{
"id": 2,
"name": "edge-case",
"type": "edge",
"prompt": "[Workflow with complications, errors, or unusual state]",
"expected_output": "[Workflow detects issues and provides guidance]",
"assertions": [
"Detects unusual states or errors",
"Provides recovery options",
"Doesn't proceed blindly",
"Offers rollback or alternative paths"
]
},
{
"id": 3,
"name": "varied-phrasing",
"type": "variation",
"prompt": "[Workflow request with urgency, casual language, or minimal details]",
"expected_output": "[Same workflow adapted to context]",
"assertions": [
"Adapts to urgency level",
"Works with minimal context",
"Still provides complete guidance"
]
}
]
}
Tool Integration Skills
For skills that wrap command-line tools.
Example: Docker Helper
{
"skill_name": "docker-helper",
"description": "Streamlines Docker container and image management",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "I need to deploy my Node.js app using Docker. The Dockerfile is in ~/projects/myapp. Build an image tagged as myapp:v1.0, then run a container named 'myapp-prod' that maps port 3000 to the host. Make sure it restarts automatically if it crashes.",
"expected_output": "Docker commands to build the image and run the container with specified configuration, including restart policy.",
"assertions": [
"Provides correct build command with tag",
"Provides correct run command with port mapping",
"Includes restart policy (--restart unless-stopped or always)",
"Sets container name correctly",
"Commands are copy-pasteable"
]
},
{
"id": 2,
"name": "edge-case-port-conflict",
"type": "edge",
"prompt": "Run my Docker container on port 3000, but I think something might already be using that port on my machine. How do I check and handle this?",
"expected_output": "Commands to check port usage, offer solutions (kill process, use different port, or map to different host port), and proceed accordingly.",
"assertions": [
"Detects potential port conflict",
"Provides command to check port usage",
"Offers multiple solutions",
"Explains trade-offs of each option"
]
},
{
"id": 3,
"name": "varied-phrasing-cleanup",
"type": "variation",
"prompt": "Docker is taking up too much space. Clean up old stuff for me?",
"expected_output": "Commands to clean up stopped containers, unused images, and build cache",
"assertions": [
"Understands implicit request from context",
"Provides safe cleanup commands",
"Warns about data loss where applicable",
"Shows space savings after cleanup"
]
}
]
}
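To make the port-conflict edge case reproducible, the test environment needs a way to confirm that something is actually listening on the port. The sketch below is one possible check using only the standard library; `port_in_use` is a hypothetical helper, and it assumes the listener is reachable on the loopback address.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # which means a listener exists on that port.
        return sock.connect_ex((host, port)) == 0
```

For the example above, `port_in_use(3000)` should return True before the prompt is run, so the skill has a real conflict to detect.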
Template: Tool Integration Skill
{
"skill_name": "YOUR_SKILL_NAME",
"description": "[Describe what tool this skill wraps]",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "[Standard tool usage with specific options and requirements]",
"expected_output": "[Correct command(s) with proper flags]",
"assertions": [
"Command syntax is correct",
"All required flags are included",
"Best practices are followed",
"Commands are copy-pasteable",
"Explains what each part does"
]
},
{
"id": 2,
"name": "edge-case",
"type": "edge",
"prompt": "[Tool usage with errors, conflicts, or unusual requirements]",
"expected_output": "[Troubleshooting steps and solutions]",
"assertions": [
"Detects potential issues",
"Provides diagnostic commands",
"Offers multiple solutions",
"Explains risks of each approach"
]
},
{
"id": 3,
"name": "varied-phrasing",
"type": "variation",
"prompt": "[Casual request with vague requirements]",
"expected_output": "[Tool commands with reasonable defaults]",
"assertions": [
"Fills in reasonable defaults",
"Provides complete solution",
"Explains assumptions made"
]
}
]
}
Documentation Skills
For skills that help create or review documentation.
Example: README Generator
{
"skill_name": "readme-generator",
"description": "Creates comprehensive README files for projects",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "Create a README for my Python project. It's a CLI tool called 'file-organizer' that sorts files by type and date. The repo is at ~/projects/file-organizer. It supports Python 3.8+, uses click for CLI, and has features for: organizing by extension, organizing by date, dry-run mode, and config file support.",
"expected_output": "Complete README with: title, description, installation, usage examples, features list, configuration, and contributing sections. Formatted in Markdown.",
"assertions": [
"README is well-structured with clear headings",
"All requested sections are included",
"Installation instructions are clear",
"Usage examples show actual commands",
"Features are listed comprehensively"
]
},
{
"id": 2,
"name": "edge-case-minimal-info",
"type": "edge",
"prompt": "Write a README for my project. It's called 'utils' and it does some stuff with files.",
"expected_output": "README template with placeholder sections and prompts for missing information. Asks clarifying questions or provides generic placeholders.",
"assertions": [
"Creates template structure despite minimal info",
"Uses placeholders for missing details",
"Suggests what information to add",
"Doesn't invent features or functionality"
]
},
{
"id": 3,
"name": "varied-phrasing-casual",
"type": "variation",
"prompt": "Hey can you write a readme for my new js library? It's on npm as 'async-queue'. It helps manage async tasks with a queue so you don't overwhelm APIs. Pretty simple but useful.",
"expected_output": "README with npm installation, basic usage example, and API overview",
"assertions": [
"Understands the library's purpose from the package name and casual description",
"Includes npm install instructions",
"Provides JavaScript usage examples",
"Explains the problem it solves"
]
}
]
}
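An assertion such as "All requested sections are included" can be approximated by scanning the generated README for headings. The following is a rough sketch under the assumption that the README uses `#`-style Markdown headings; `markdown_headings` and `missing_sections` are hypothetical helper names.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker


def markdown_headings(text):
    """Return lowercase heading titles, ignoring lines inside fenced code."""
    headings = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = re.match(r"#{1,6}[ \t]+(.+)", line)
        if match and not in_fence:
            headings.append(match.group(1).strip().lower())
    return headings


def missing_sections(text, required):
    """Return the required section names that no heading mentions."""
    headings = markdown_headings(text)
    return [
        name for name in required
        if not any(name.lower() in heading for heading in headings)
    ]
```

A call like `missing_sections(readme, ["installation", "usage", "features"])` returns an empty list when every section is present. Matching is by substring, so "Usage Examples" satisfies "usage".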
Template: Documentation Skill
{
"skill_name": "YOUR_SKILL_NAME",
"description": "[Describe what documentation this skill helps with]",
"evals": [
{
"id": 1,
"name": "common-case",
"type": "common",
"prompt": "[Detailed request with project information and requirements]",
"expected_output": "[Complete, well-structured documentation]",
"assertions": [
"Documentation is well-organized",
"All required sections are included",
"Examples are clear and relevant",
"Formatting is consistent",
"Language is clear and concise"
]
},
{
"id": 2,
"name": "edge-case",
"type": "edge",
"prompt": "[Vague request with minimal information]",
"expected_output": "[Template with placeholders or clarifying questions]",
"assertions": [
"Creates structure despite limited info",
"Uses appropriate placeholders",
"Identifies missing information",
"Doesn't invent false details"
]
},
{
"id": 3,
"name": "varied-phrasing",
"type": "variation",
"prompt": "[Casual request with implied requirements]",
"expected_output": "[Documentation meeting implicit needs]",
"assertions": [
"Understands implicit requirements",
"Provides complete documentation",
"Matches tone of request"
]
}
]
}
Tips for Customizing Templates
Making Tests Realistic
Bad:
"Convert CSV to JSON"
Good:
"Convert the CSV file at ./data/sales.csv to JSON and save it as ./output/sales.json. The CSV has headers: date, product, quantity, price. Make sure dates are in ISO 8601 format."
Assertions Should Be Checkable
Vague:
"assertions": [
"Output looks good"
]
Specific:
"assertions": [
"JSON file exists at ./output/sales.json",
"All dates are in ISO 8601 format",
"Numeric fields are numbers, not strings"
]
Edge Cases to Consider
- Large data - Files with 10,000+ rows
- Special characters - Emojis, Unicode, unusual symbols
- Missing data - Empty cells, null values
- Wrong format - CSV with inconsistent columns
- Permissions - Read/write errors
- Conflicts - Port conflicts, file already exists
- Empty input - Zero rows, blank files
- Unexpected state - Dirty git worktree, missing dependencies
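Several of the edge cases above need input files that are tedious to create by hand. The sketch below generates a CSV fixture that combines large data, missing values, and special characters. The function name, the `id` column, and the sample notes are all hypothetical; adjust them to match the prompt in your eval.

```python
import csv
import random


def write_edge_case_csv(path, rows=50_000, missing_rate=0.1, seed=0):
    """Write a CSV with some empty 'email' cells and non-ASCII 'notes'."""
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    notes = ["ok", "rush order 🚀", "café, naïve", 'said "hi"', "日本語"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "email", "notes"])
        for i in range(rows):
            # Leave the email cell empty for roughly missing_rate of the rows.
            email = "" if rng.random() < missing_rate else f"user{i}@example.com"
            writer.writerow([i, email, rng.choice(notes)])
```

Writing the file as UTF-8 with `newline=""` lets the `csv` module handle quoting and line endings, so commas and quotes inside the notes stay in one cell.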
Variation Ideas
- Casual language - "Hey, can you...", "I need to..."
- Abbreviated - Minimal words, assumes context
- Urgent - "ASAP", "quickly", "fastest way"
- Uncertain - "I think", "maybe", "probably"
- Brief - Single sentence, minimal details
- Verbose - Extra context, backstory
Validation Checklist
Before using your evals.json:
- Contains exactly 3 test cases (common, edge, variation)
- Each test has a unique id (1, 2, 3)
- Each test has a descriptive name
- Prompts are realistic with specific details
- Expected outputs are clear
- Assertions are objectively checkable
- JSON is valid (run it through jq or a JSON linter)
- File is saved as evals/evals.json in the skill directory
Run this to check that the file parses as JSON:
jq . evals/evals.json > /dev/null && echo "Valid JSON" || echo "Invalid JSON"
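The jq command only confirms that the file parses. The structural items in the checklist (three cases, unique ids, one case per type, all fields present) can be checked with a short script. This is a minimal sketch based on the field names used in the templates above; `validate_evals` and `validate_file` are hypothetical names, not part of any official tooling.

```python
import json

REQUIRED_KEYS = {"id", "name", "type", "prompt", "expected_output", "assertions"}
EXPECTED_TYPES = {"common", "edge", "variation"}


def validate_evals(data):
    """Return a list of checklist violations (empty means the structure passes)."""
    evals = data.get("evals") if isinstance(data, dict) else None
    if not isinstance(evals, list) or len(evals) != 3:
        return ["'evals' must be a list of exactly 3 test cases"]
    if not all(isinstance(case, dict) for case in evals):
        return ["every test case must be a JSON object"]

    problems = []
    ids = [case.get("id") for case in evals]
    if len(set(ids)) != len(ids):
        problems.append("ids are not unique")
    if {case.get("type") for case in evals} != EXPECTED_TYPES:
        problems.append("types must be common, edge, and variation")
    for case in evals:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {case.get('id')}: missing {sorted(missing)}")
        elif not case["assertions"]:
            problems.append(f"case {case.get('id')}: assertions list is empty")
    return problems


def validate_file(path="evals/evals.json"):
    """Load the file at path and return its checklist violations."""
    with open(path, encoding="utf-8") as f:
        return validate_evals(json.load(f))
```

Calling `validate_file()` from the skill directory returns an empty list when the file matches the checklist. Whether prompts are realistic and assertions are checkable still needs a human read.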