2026-03-22 23:21:49 +02:00

Test Case Templates

Copy-paste templates for creating test cases, organized by skill type.

Table of Contents

  1. File Transform Skills
  2. Code Generation Skills
  3. Workflow Skills
  4. Tool Integration Skills
  5. Documentation Skills
  6. Tips for Customizing Templates
  7. Validation Checklist

File Transform Skills

For skills that convert, parse, or reformat files.

Example: CSV to JSON Converter

{
  "skill_name": "csv-to-json",
  "description": "Converts CSV files to JSON format with data type handling",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "Convert the CSV file at ./data/sales.csv to JSON and save it as ./output/sales.json. The CSV has headers: date, product, quantity, price. Make sure dates are in ISO 8601 format.",
      "expected_output": "A JSON file at ./output/sales.json containing an array of objects, each representing a row from the CSV with properly formatted dates.",
      "assertions": [
        "JSON file exists at ./output/sales.json",
        "File contains valid JSON array",
        "All dates are in ISO 8601 format (YYYY-MM-DD)",
        "Numeric fields (quantity, price) are numbers, not strings"
      ]
    },
    {
      "id": 2,
      "name": "edge-case-large-file",
      "type": "edge",
      "prompt": "Convert a CSV file with 50,000 rows at ./data/export.csv to JSON. The file contains some rows with missing values in the 'email' column and special characters (emojis) in the 'notes' column.",
      "expected_output": "JSON file is created successfully, handling missing values as null or empty strings, and preserving special characters correctly.",
      "assertions": [
        "Large file is processed without memory errors",
        "Missing email values are handled (null or empty string)",
        "Special characters and emojis are preserved in output",
        "All 50,000 rows are converted"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing-casual",
      "type": "variation",
      "prompt": "Hey, I've got this spreadsheet data/sales.csv that I need to turn into JSON. Can you do that for me? Also, the dates are in MM/DD/YYYY format right now - can you make them proper ISO format?",
      "expected_output": "Same as common case: JSON file with ISO 8601 dates",
      "assertions": [
        "Skill triggers with casual language ('turn into', 'proper ISO format')",
        "MM/DD/YYYY dates are converted to ISO 8601 (YYYY-MM-DD)",
        "Output matches common case results"
      ]
    }
  ]
}

Template: File Transform Skill

{
  "skill_name": "YOUR_SKILL_NAME",
  "description": "[Describe what file transformations this skill does]",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "[Realistic request with specific file paths and formats]",
      "expected_output": "[What the transformed file should contain]",
      "assertions": [
        "Output file exists at expected location",
        "File is valid [format: JSON, XML, etc.]",
        "Data is correctly transformed",
        "Specific requirements met (encoding, formatting, etc.)"
      ]
    },
    {
      "id": 2,
      "name": "edge-case",
      "type": "edge",
      "prompt": "[Large files, special characters, missing data, malformed input]",
      "expected_output": "[Graceful handling or error with helpful message]",
      "assertions": [
        "Edge case is handled appropriately",
        "No data loss or corruption",
        "Error messages are helpful if transformation fails"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing",
      "type": "variation",
      "prompt": "[Same request with casual language, different wording]",
      "expected_output": "[Same as common case]",
      "assertions": [
        "Skill triggers with varied phrasing",
        "Output matches common case",
        "Implicit requirements are understood"
      ]
    }
  ]
}

Code Generation Skills

For skills that generate code, scripts, or templates.

Example: Python Script Generator

{
  "skill_name": "python-script-gen",
  "description": "Generates Python scripts for file operations and data processing",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "Create a Python script that recursively finds all .log files in /var/log, compresses them with gzip if they're older than 30 days, and moves them to /archive. Handle permission errors gracefully and log all actions to cleanup.log.",
      "expected_output": "A Python script that implements the log cleanup functionality with proper error handling, logging, and follows Python best practices.",
      "assertions": [
        "Script is syntactically valid Python",
        "Implements recursive file search",
        "Compresses files older than 30 days",
        "Handles permission errors gracefully",
        "Logs actions to cleanup.log"
      ]
    },
    {
      "id": 2,
      "name": "edge-case-empty-directory",
      "type": "edge",
      "prompt": "Create a Python script that processes all CSV files in ./data and generates a summary report. What should it do if the directory is empty or doesn't exist?",
      "expected_output": "Script handles empty/non-existent directories gracefully with informative error messages and doesn't crash.",
      "assertions": [
        "Script checks if directory exists before processing",
        "Handles empty directory case gracefully",
        "Provides informative error message",
        "Returns appropriate exit code"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing-brief",
      "type": "variation",
      "prompt": "Write me a python script to backup my photos. It should copy everything from ~/Pictures to ~/Backups/photos with today's date in the folder name. Skip duplicates if possible.",
      "expected_output": "Python backup script with timestamped folder and duplicate detection",
      "assertions": [
        "Skill works with brief, casual description",
        "Generates complete script without requiring clarification",
        "Handles date formatting for folder name",
        "Implements duplicate detection"
      ]
    }
  ]
}
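
The assertion "Script is syntactically valid Python" can be checked mechanically. The following is a minimal sketch, assuming the generated script is available as a string; `is_valid_python` is a hypothetical helper name, and the result depends on the Python version that runs the check.

```python
import ast


def is_valid_python(source):
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


print(is_valid_python("import gzip\nprint('ok')\n"))  # True
print(is_valid_python("def broken(:\n    pass\n"))     # False
```

Parsing only confirms syntax. Assertions such as "Handles permission errors gracefully" still need a behavioral check or a manual review.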

Template: Code Generation Skill

{
  "skill_name": "YOUR_SKILL_NAME",
  "description": "[Describe what code this skill generates]",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "[Detailed request with specific requirements and constraints]",
      "expected_output": "[Description of the generated code]",
      "assertions": [
        "Code is syntactically valid",
        "Implements all requested features",
        "Follows language best practices",
        "Includes error handling where appropriate",
        "Is well-structured and readable"
      ]
    },
    {
      "id": 2,
      "name": "edge-case",
      "type": "edge",
      "prompt": "[Ambiguous requirements, missing data, error conditions]",
      "expected_output": "[Code handles edge cases or asks for clarification]",
      "assertions": [
        "Edge cases are handled appropriately",
        "Error conditions are managed",
        "Code doesn't crash on unexpected input"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing",
      "type": "variation",
      "prompt": "[Same request with minimal details or casual language]",
      "expected_output": "[Same quality as common case]",
      "assertions": [
        "Skill fills in reasonable defaults",
        "Generates complete solution",
        "Output quality matches detailed request"
      ]
    }
  ]
}

Workflow Skills

For skills that guide multi-step processes.

Example: Release Workflow

{
  "skill_name": "release-workflow",
  "description": "Guides the complete software release process",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "I need to create a new release for my Node.js project. The repo is at ~/projects/myapp. We're currently at version 1.2.3 and this is a minor feature release (1.3.0). I need to update the version, create a changelog entry, commit, tag, and push to GitHub.",
      "expected_output": "Step-by-step guide covering: version bump in package.json, CHANGELOG.md update, commit creation, annotated tag, and push commands. Should provide copy-pasteable commands.",
      "assertions": [
        "All release steps are covered",
        "Commands are copy-pasteable",
        "Version numbers are consistent",
        "Validation steps are included",
        "Provides rollback guidance"
      ]
    },
    {
      "id": 2,
      "name": "edge-case-dirty-worktree",
      "type": "edge",
      "prompt": "Help me release version 2.0.0 of my project. By the way, I have some uncommitted changes in my working directory that I'm not sure about.",
      "expected_output": "Workflow detects dirty worktree, suggests stashing or committing changes before proceeding with release. Provides commands to handle the situation.",
      "assertions": [
        "Detects uncommitted changes",
        "Warns about dirty worktree",
        "Provides options: stash, commit, or abort",
        "Doesn't proceed without addressing the issue"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing-urgent",
      "type": "variation",
      "prompt": "Need to push out v2.1.0 ASAP. Hotfix for critical bug. What's the fastest way to get this released?",
      "expected_output": "Accelerated workflow prioritizing speed while maintaining essential safety checks",
      "assertions": [
        "Recognizes urgency from language ('ASAP', 'fastest')",
        "Still includes critical safety checks",
        "Prioritizes speed without skipping validation",
        "Provides streamlined command sequence"
      ]
    }
  ]
}
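
To judge the "Detects uncommitted changes" assertion, it helps to know what a dirty worktree looks like to a script. This is a rough sketch, assuming `git` is on the PATH and the version 1 `git status --porcelain` format (two status characters, a space, then the path); `parse_porcelain` and `worktree_is_dirty` are hypothetical names.

```python
import subprocess


def parse_porcelain(output):
    """Return the path part of each line of `git status --porcelain` output."""
    # Each line is a two-character status code, one space, then the path.
    # Renames appear as "old -> new" and are returned unchanged.
    return [line[3:] for line in output.splitlines() if line.strip()]


def worktree_is_dirty(repo_path):
    """Run git in `repo_path` and report whether anything is uncommitted."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(parse_porcelain(result.stdout))


sample = " M package.json\n?? notes.txt\n"
print(parse_porcelain(sample))  # ['package.json', 'notes.txt']
```

An empty list means the worktree is clean, so a release workflow could proceed; anything else is a cue to stash, commit, or abort.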

Template: Workflow Skill

{
  "skill_name": "YOUR_SKILL_NAME",
  "description": "[Describe what workflow this skill guides]",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "[Standard workflow request with clear requirements]",
      "expected_output": "[Complete step-by-step guide with all necessary steps]",
      "assertions": [
        "All workflow steps are included",
        "Steps are in logical order",
        "Validation points are provided",
        "Commands are copy-pasteable where applicable",
        "Clear success criteria defined"
      ]
    },
    {
      "id": 2,
      "name": "edge-case",
      "type": "edge",
      "prompt": "[Workflow with complications, errors, or unusual state]",
      "expected_output": "[Workflow detects issues and provides guidance]",
      "assertions": [
        "Detects unusual states or errors",
        "Provides recovery options",
        "Doesn't proceed blindly",
        "Offers rollback or alternative paths"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing",
      "type": "variation",
      "prompt": "[Workflow request with urgency, casual language, or minimal details]",
      "expected_output": "[Same workflow adapted to context]",
      "assertions": [
        "Adapts to urgency level",
        "Works with minimal context",
        "Still provides complete guidance"
      ]
    }
  ]
}

Tool Integration Skills

For skills that wrap command-line tools.

Example: Docker Helper

{
  "skill_name": "docker-helper",
  "description": "Streamlines Docker container and image management",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "I need to deploy my Node.js app using Docker. The Dockerfile is in ~/projects/myapp. Build an image tagged as myapp:v1.0, then run a container named 'myapp-prod' that maps port 3000 to the host. Make sure it restarts automatically if it crashes.",
      "expected_output": "Docker commands to build the image and run the container with specified configuration, including restart policy.",
      "assertions": [
        "Provides correct build command with tag",
        "Provides correct run command with port mapping",
        "Includes restart policy (--restart unless-stopped or always)",
        "Sets container name correctly",
        "Commands are copy-pasteable"
      ]
    },
    {
      "id": 2,
      "name": "edge-case-port-conflict",
      "type": "edge",
      "prompt": "Run my Docker container on port 3000, but I think something might already be using that port on my machine. How do I check and handle this?",
      "expected_output": "Commands to check port usage, offer solutions (kill process, use different port, or map to different host port), and proceed accordingly.",
      "assertions": [
        "Detects potential port conflict",
        "Provides command to check port usage",
        "Offers multiple solutions",
        "Explains trade-offs of each option"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing-cleanup",
      "type": "variation",
      "prompt": "Docker is taking up too much space. Clean up old stuff for me?",
      "expected_output": "Commands to clean up stopped containers, unused images, and build cache",
      "assertions": [
        "Understands implicit request from context",
        "Provides safe cleanup commands",
        "Warns about data loss where applicable",
        "Shows space savings after cleanup"
      ]
    }
  ]
}
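
For the port-conflict case, a test fixture may need to know whether the port is actually taken before grading the skill's answer. One possible sketch, assuming it is enough to try binding a TCP socket on the local machine; `port_is_free` is a hypothetical helper, and bind semantics differ somewhat across operating systems.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


print(port_is_free(3000))
```

The result is only a snapshot: another process can claim the port between the check and the container start.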

Template: Tool Integration Skill

{
  "skill_name": "YOUR_SKILL_NAME",
  "description": "[Describe what tool this skill wraps]",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "[Standard tool usage with specific options and requirements]",
      "expected_output": "[Correct command(s) with proper flags]",
      "assertions": [
        "Command syntax is correct",
        "All required flags are included",
        "Best practices are followed",
        "Commands are copy-pasteable",
        "Explains what each part does"
      ]
    },
    {
      "id": 2,
      "name": "edge-case",
      "type": "edge",
      "prompt": "[Tool usage with errors, conflicts, or unusual requirements]",
      "expected_output": "[Troubleshooting steps and solutions]",
      "assertions": [
        "Detects potential issues",
        "Provides diagnostic commands",
        "Offers multiple solutions",
        "Explains risks of each approach"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing",
      "type": "variation",
      "prompt": "[Casual request with vague requirements]",
      "expected_output": "[Tool commands with reasonable defaults]",
      "assertions": [
        "Fills in reasonable defaults",
        "Provides complete solution",
        "Explains assumptions made"
      ]
    }
  ]
}

Documentation Skills

For skills that help create or review documentation.

Example: README Generator

{
  "skill_name": "readme-generator",
  "description": "Creates comprehensive README files for projects",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "Create a README for my Python project. It's a CLI tool called 'file-organizer' that sorts files by type and date. The repo is at ~/projects/file-organizer. It supports Python 3.8+, uses click for CLI, and has features for: organizing by extension, organizing by date, dry-run mode, and config file support.",
      "expected_output": "Complete README with: title, description, installation, usage examples, features list, configuration, and contributing sections. Formatted in Markdown.",
      "assertions": [
        "README is well-structured with clear headings",
        "All requested sections are included",
        "Installation instructions are clear",
        "Usage examples show actual commands",
        "Features are listed comprehensively"
      ]
    },
    {
      "id": 2,
      "name": "edge-case-minimal-info",
      "type": "edge",
      "prompt": "Write a README for my project. It's called 'utils' and it does some stuff with files.",
      "expected_output": "README template with placeholder sections and prompts for missing information. Asks clarifying questions or provides generic placeholders.",
      "assertions": [
        "Creates template structure despite minimal info",
        "Uses placeholders for missing details",
        "Suggests what information to add",
        "Doesn't invent features or functionality"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing-casual",
      "type": "variation",
      "prompt": "Hey can you write a readme for my new js library? It's on npm as 'async-queue'. It helps manage async tasks with a queue so you don't overwhelm APIs. Pretty simple but useful.",
      "expected_output": "README with npm installation, basic usage example, and API overview",
      "assertions": [
        "Understands from package name and casual description",
        "Includes npm install instructions",
        "Provides JavaScript usage examples",
        "Explains the problem it solves"
      ]
    }
  ]
}
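
"All requested sections are included" becomes checkable once the expected headings are listed. Here is a small sketch, assuming the README uses ATX-style Markdown headings (`#`, `##`, ...); `missing_sections` is a hypothetical helper, and it does not skip `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def missing_sections(readme_text, required):
    """Return the required section names with no matching Markdown heading."""
    headings = [match.group(1).lower() for match in HEADING.finditer(readme_text)]
    return [
        name
        for name in required
        if not any(name.lower() in heading for heading in headings)
    ]


readme = "# file-organizer\n\n## Installation\n\n## Usage\n"
print(missing_sections(readme, ["Installation", "Usage", "Contributing"]))
# ['Contributing']
```

Matching is by substring, so a heading such as "Usage Examples" satisfies a required "Usage" section.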

Template: Documentation Skill

{
  "skill_name": "YOUR_SKILL_NAME",
  "description": "[Describe what documentation this skill helps with]",
  "evals": [
    {
      "id": 1,
      "name": "common-case",
      "type": "common",
      "prompt": "[Detailed request with project information and requirements]",
      "expected_output": "[Complete, well-structured documentation]",
      "assertions": [
        "Documentation is well-organized",
        "All required sections are included",
        "Examples are clear and relevant",
        "Formatting is consistent",
        "Language is clear and concise"
      ]
    },
    {
      "id": 2,
      "name": "edge-case",
      "type": "edge",
      "prompt": "[Vague request with minimal information]",
      "expected_output": "[Template with placeholders or clarifying questions]",
      "assertions": [
        "Creates structure despite limited info",
        "Uses appropriate placeholders",
        "Identifies missing information",
        "Doesn't invent false details"
      ]
    },
    {
      "id": 3,
      "name": "varied-phrasing",
      "type": "variation",
      "prompt": "[Casual request with implied requirements]",
      "expected_output": "[Documentation meeting implicit needs]",
      "assertions": [
        "Understands implicit requirements",
        "Provides complete documentation",
        "Matches tone of request"
      ]
    }
  ]
}

Tips for Customizing Templates

Making Tests Realistic

Bad:

"Convert CSV to JSON"

Good:

"Convert the CSV file at ./data/sales.csv to JSON and save it as ./output/sales.json. The CSV has headers: date, product, quantity, price. Make sure dates are in ISO 8601 format."

Assertions Should Be Checkable

Vague:

"assertions": [
  "Output looks good"
]

Specific:

"assertions": [
  "JSON file exists at ./output/sales.json",
  "All dates are in ISO 8601 format",
  "Numeric fields are numbers, not strings"
]
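
One way to test whether an assertion is specific enough is to ask whether a short script could check it. The sketch below does that for the specific assertions above, assuming the output is an array of objects with `date`, `quantity`, and `price` fields; `failed_assertions` and its helpers are hypothetical names, not part of any grading tool.

```python
import json
import re
from datetime import date
from pathlib import Path

ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not ISO_DATE_SHAPE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_number(value):
    """True for JSON numbers; bool is excluded because it subclasses int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def failed_assertions(path):
    """Return the text of each assertion that does not hold for `path`."""
    target = Path(path)
    if not target.is_file():
        return [f"JSON file exists at {path}"]
    try:
        rows = json.loads(target.read_text(encoding="utf-8"))
    except ValueError:  # covers invalid JSON and undecodable bytes
        return ["File contains valid JSON array"]
    if not isinstance(rows, list):
        return ["File contains valid JSON array"]

    # Treat non-object rows as empty objects so the field checks fail on them.
    rows = [row if isinstance(row, dict) else {} for row in rows]
    failures = []
    if not all(is_iso_date(row.get("date")) for row in rows):
        failures.append("All dates are in ISO 8601 format")
    if not all(is_number(row.get("quantity")) and is_number(row.get("price")) for row in rows):
        failures.append("Numeric fields are numbers, not strings")
    return failures


print(failed_assertions("./output/sales.json"))
```

"Output looks good" has no equivalent check, which is a quick signal that the assertion needs rewriting.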

Edge Cases to Consider

  1. Large data - Files with 10,000+ rows
  2. Special characters - Emojis, Unicode, unusual symbols
  3. Missing data - Empty cells, null values
  4. Wrong format - CSV with inconsistent columns
  5. Permissions - Read/write errors
  6. Conflicts - Port conflicts, file already exists
  7. Empty input - Zero rows, blank files
  8. Unexpected state - Dirty git worktree, missing dependencies
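
Edge-case prompts are easier to grade when the input file really exists. As an illustration, this sketch generates a CSV that combines the first three items above (large data, special characters, missing data). The `email` and `notes` columns follow the CSV example earlier in this document; the `id` column, the sample notes, the 10% blank rate, and the function name are assumptions made for this example.

```python
import csv
import random


def write_edge_case_csv(path, rows=50_000, seed=0):
    """Write a CSV with some blank emails and emoji in the notes column."""
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    notes = ["ok", "call back 📞", "VIP ⭐ customer", "naïve café"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "email", "notes"])
        for i in range(rows):
            # Leave roughly one email in ten blank to simulate missing data.
            email = "" if rng.random() < 0.1 else f"user{i}@example.com"
            writer.writerow([i, email, rng.choice(notes)])
```

Calling `write_edge_case_csv("export.csv")` writes the header plus 50,000 data rows; pass a smaller `rows` value for quick local runs.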

Variation Ideas

  1. Casual language - "Hey, can you...", "I need to..."
  2. Abbreviated - Minimal words, assumes context
  3. Urgent - "ASAP", "quickly", "fastest way"
  4. Uncertain - "I think", "maybe", "probably"
  5. Brief - Single sentence, minimal details
  6. Verbose - Extra context, backstory

Validation Checklist

Before using your evals.json:

  • Contains exactly 3 test cases (common, edge, variation)
  • Each test has a unique id (1, 2, 3)
  • Each test has a descriptive name
  • Prompts are realistic with specific details
  • Expected outputs are clear
  • Assertions are objectively checkable
  • JSON is valid (run it through jq or a JSON linter)
  • File is saved as evals/evals.json in the skill directory

Run this to validate:

jq . evals/evals.json > /dev/null && echo "Valid JSON" || echo "Invalid JSON"
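
The jq command only confirms that the file parses. The structural items in the checklist could be checked with a short script along these lines; this is a sketch that assumes the field names used in the templates above, and `validate_evals` is a hypothetical helper rather than part of any existing tool.

```python
import json

REQUIRED_KEYS = {"id", "name", "type", "prompt", "expected_output", "assertions"}
EXPECTED_TYPES = ["common", "edge", "variation"]


def validate_evals(data):
    """Return a list of problems found in a parsed evals.json document."""
    evals = data.get("evals") if isinstance(data, dict) else None
    if not isinstance(evals, list):
        return ["top-level 'evals' must be a list"]

    problems = []
    if len(evals) != 3:
        problems.append(f"expected exactly 3 test cases, found {len(evals)}")

    cases = [case for case in evals if isinstance(case, dict)]
    if len(cases) != len(evals):
        problems.append("every test case must be a JSON object")

    for case in cases:
        label = f"case {case.get('id')!r}"
        missing = sorted(REQUIRED_KEYS - case.keys())
        if missing:
            problems.append(f"{label}: missing keys {missing}")
        assertions = case.get("assertions")
        if not isinstance(assertions, list) or not assertions:
            problems.append(f"{label}: 'assertions' must be a non-empty list")

    ids = [repr(case.get("id")) for case in cases]
    if len(set(ids)) != len(ids):
        problems.append("ids are not unique")

    types = sorted(str(case.get("type")) for case in cases)
    if types != EXPECTED_TYPES:
        problems.append(f"types should be {EXPECTED_TYPES}, found {types}")
    return problems


sample = json.loads('{"skill_name": "demo", "evals": []}')
print(validate_evals(sample))
```

An empty list means the structural checks passed. Whether prompts are realistic and assertions are objectively checkable still needs a human read.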