# Grading Guide

Comprehensive guide for evaluating skill outputs and identifying improvements.

## Table of Contents

1. [The Evaluation Framework](#the-evaluation-framework)
2. [Correctness](#correctness)
3. [Completeness](#completeness)
4. [Format](#format)
5. [Triggering](#triggering)
6. [Efficiency](#efficiency)
7. [Identifying Patterns](#identifying-patterns)
8. [Decision Framework](#decision-framework)
9. [Common Issues and Solutions](#common-issues-and-solutions)
10. [When to Stop Iterating](#when-to-stop-iterating)
11. [Grading Checklist](#grading-checklist)
12. [Recording Results](#recording-results)
13. [Next Steps After Grading](#next-steps-after-grading)

---

## The Evaluation Framework

Evaluate skill outputs across five dimensions:

1. **Correctness** - Is the output accurate and correct?
2. **Completeness** - Are all requirements met?
3. **Format** - Does it follow expected structure?
4. **Triggering** - Does it activate appropriately?
5. **Efficiency** - Is it concise yet complete?

Each dimension has specific criteria to check. Use the `grade-output.sh` script for systematic evaluation.

---

## Correctness

### What to Check

**Output matches expected result:**
- For file transforms: Output file contains correct data
- For code generation: Code works as described
- For workflows: All steps lead to desired outcome
- For documentation: Information is accurate

**No factual errors:**
- Commands use correct syntax
- File paths are accurate
- Versions are correct
- Technical details are right

**Logic is sound:**
- Reasoning makes sense
- Steps follow logically
- No contradictions
- Edge cases considered

**Edge cases handled appropriately:**
- Empty inputs don't crash
- Large inputs work
- Special characters preserved
- Errors handled gracefully

### Examples

**Good (Correct):**
```bash
# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0
```

**Bad (Incorrect):**
```bash
# Wrong flag syntax
docker run --port 3000:3000 myimage  # Should be -p or --publish
```

**Good (Handles Edge Cases):**
```python
import os

if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")
```

**Bad (No Edge Case Handling):**
```python
process_file(filename)  # Will crash if file doesn't exist
```

---

## Completeness

### What to Check

**All requested tasks completed:**
- Every requirement from prompt addressed
- No steps forgotten
- All files generated
- All transformations applied

**No steps skipped:**
- Validation steps included
- Cleanup steps not forgotten
- Confirmation steps present
- Rollback guidance provided

**Appropriate level of detail:**
- Not too brief (missing context)
- Not too verbose (information overload)
- Explains "why" not just "what"
- Provides examples where helpful

**Relevant context included:**
- Prerequisites mentioned
- Dependencies noted
- Error scenarios covered
- Alternative approaches offered

### Examples

**Good (Complete):**
```
To release version 1.3.0:

1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag

Prerequisites:
- All tests passing
- CHANGELOG.md updated
- You have push access to repository
```

**Bad (Incomplete):**
```
To release:
1. Update version
2. Create tag
3. Push
```

---

## Format

### What to Check

**Output follows specified format:**
- If skill defines a template, output matches it
- Markdown formatting is correct
- Code blocks use proper syntax highlighting
- Tables are properly formatted

**Consistent with examples in skill:**
- Format matches examples in SKILL.md
- Style is consistent
- Terminology is consistent
- Structure follows patterns

**Easy to read and understand:**
- Clear headings and sections
- Good use of whitespace
- Logical organization
- No wall of text

### Examples

**Good (Well-Formatted):**
````markdown
## Deployment Steps

### 1. Build Docker Image

```bash
docker build -t myapp:v1.0 .
```

### 2. Run Container

```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```

### 3. Verify Deployment

```bash
curl http://localhost:3000/health
```
````

**Bad (Poor Formatting):**
```
step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl
```

---

## Triggering

### What to Check

**Skill activates when appropriate:**
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords

**Doesn't activate when inappropriate:**
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected

### Testing Triggering

Test with 8 queries (4 should-trigger, 4 should-not-trigger):

**Example for Docker skill:**

Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"

Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)

### Examples

**Good (Appropriate Triggering):**
- User: "containerize my app"
- Skill: Docker helper activates ✓

**Bad (Missed Trigger):**
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗

**Bad (False Trigger):**
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗

---

## Efficiency

### What to Check

**No unnecessary steps:**
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined

**Reasonable response length:**
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value

**Not overly verbose:**
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise

### Examples

**Good (Efficient):**
```bash
# Clean up Docker resources
docker system prune -f
```

**Bad (Verbose):**
```bash
# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images
```

---

## Identifying Patterns

### Single Test Failure

**Characteristics:**
- Only one test case fails
- Issue is unique to that scenario
- Other tests pass

**Action:**
- Add targeted instruction
- Include example for that edge case
- Fix the specific issue
- Don't generalize prematurely

**Example:**
```
Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
          for files over 10,000 rows
```

### Multiple Test Failures (Same Issue)

**Characteristics:**
- Same problem across 2+ tests
- Pattern suggests root cause
- Symptom appears in different forms

**Action:**
- Fix the root cause
- Add general principle, not specific fix
- Consider extracting to script
- Explain the "why"

**Example:**
```
Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
          rather than fixing each case individually
```

### Multiple Test Failures (Different Issues)

**Characteristics:**
- Different problems in each test
- No clear pattern
- Skill may be too broad

**Action:**
- Clarify scope in description
- Add more specific instructions
- Consider splitting into multiple skills
- Focus on core functionality first

**Example:**
```
Issue: Test 1 fails on format, Test 2 fails on logic,
       Test 3 never triggers skill
Solution: Clarify what the skill does/doesn't do in description
          Add validation steps
```

---

## Decision Framework

### Fix Specific Case When:

- Unique edge case not covered
- One-time issue unlikely to recur
- Fix is simple and doesn't add complexity
- Doesn't indicate systemic problem

**Example:**
```
Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
     (Don't rewrite entire CSV handling)
```

### Generalize Solution When:

- Same issue in 2+ tests
- Pattern suggests broader applicability
- Fix would benefit similar requests
- Explains underlying principle

**Example:**
```
Tests: Both test 1 and 2 have incorrect error handling
Fix: Add general principle about validating inputs before processing
     with examples of common validation checks
```

### Extract to Script When:

- Same multi-step process repeated
- Deterministic operation (not creative)
- Would save time on every invocation
- Logic is complex or error-prone

**Example:**
```
Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
        Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
```

---

## Common Issues and Solutions

### Issue: Skill Doesn't Trigger

**Symptoms:**
- User says relevant phrase
- Skill doesn't appear in available skills
- Model handles request without skill

**Solutions:**
1. Add specific trigger phrases to description
2. Use "pushy" language: "Make sure to use this skill whenever..."
3. Include synonyms and variations
4. Test with 8 trigger queries

### Issue: Output Format Inconsistent

**Symptoms:**
- Sometimes follows template, sometimes doesn't
- Format varies between similar requests
- Missing sections or elements

**Solutions:**
1. Add explicit format template in SKILL.md
2. Provide multiple examples
3. Explain why format matters
4. Use imperative: "ALWAYS use this template"

### Issue: Edge Cases Not Handled

**Symptoms:**
- Large files cause errors
- Empty inputs crash
- Special characters corrupted
- Missing data causes failures

**Solutions:**
1. Add validation step instructions
2. Include error handling examples
3. Create helper script for complex validation
4. Document known limitations

### Issue: Too Verbose

**Symptoms:**
- Responses are walls of text
- Over-explains obvious steps
- Repeats information
- Takes too long to get to point

**Solutions:**
1. Remove redundant explanations
2. Trust user's expertise
3. Move details to references/
4. Use progressive disclosure

### Issue: Incomplete Responses

**Symptoms:**
- Misses steps in workflow
- Forgets validation
- Doesn't mention prerequisites
- No error handling

**Solutions:**
1. Add comprehensive checklist in SKILL.md
2. Include validation at each step
3. Document prerequisites upfront
4. Add error handling examples

### Issue: Incorrect Commands

**Symptoms:**
- Commands don't work when copied
- Syntax errors
- Wrong flags or options
- Outdated versions

**Solutions:**
1. Test all commands before including
2. Specify version requirements
3. Add validation steps
4. Include error messages to expect

---

## When to Stop Iterating

### Stop When:

1. **User is satisfied**
   - User says "this works for me"
   - User stops requesting changes
   - User starts using skill regularly

2. **Outputs meet expectations**
   - All test cases pass
   - Common scenarios work
   - Edge cases handled reasonably
   - No critical issues remain

3. **No meaningful progress**
   - Last 2-3 iterations didn't improve results
   - Changes are cosmetic only
   - Diminishing returns on effort
   - 90% solution is good enough

### Don't Stop When:

- **User wants to keep iterating** - Follow user's lead
- **New issues emerge** - Fix real problems as they're found

### The 90% Rule

A skill that works well for 90% of cases is sufficient.

- **Perfect is the enemy of good** - 90% working is better than endlessly tweaking
- **Edge cases are theoretical** - Don't over-engineer for unlikely scenarios

Users can handle:
- Rare edge cases manually
- Unique situations with custom guidance
- Complex scenarios with multiple steps

**Focus on:**
- Common use cases (80% of value)
- Clear failure messages (10% of value)
- Good documentation (10% of value)

**Don't focus on:**
- Every possible edge case (diminishing returns)
- Perfect formatting (good enough is fine)
- Handling every possible error (major errors are enough)

---

## Grading Checklist

Use this checklist for each test case:

### Correctness
- [ ] Output matches expected result
- [ ] No factual errors
- [ ] Logic is sound
- [ ] Edge cases handled appropriately

### Completeness
- [ ] All requested tasks completed
- [ ] No steps skipped
- [ ] Appropriate level of detail
- [ ] Relevant context included

### Format
- [ ] Output follows specified format
- [ ] Consistent with examples
- [ ] Easy to read and understand

### Triggering
- [ ] Skill activated when appropriate
- [ ] Did not activate when inappropriate

### Efficiency
- [ ] No unnecessary steps
- [ ] Reasonable response length
- [ ] Not overly verbose

### Overall
- [ ] Would use this skill again
- [ ] Would recommend to others
- [ ] Saves time vs manual approach
- [ ] Output quality meets needs

---

## Recording Results

After grading, record:

```json
{
  "test_id": 1,
  "result": "pass|fail|partial",
  "issues": [
    "Description of issue 1",
    "Description of issue 2"
  ],
  "suggested_fix": "Brief description of improvement",
  "extract_script": false,
  "priority": "high|medium|low"
}
```

Use the `grade-output.sh` script to generate this structure interactively.

---

## Next Steps After Grading

1. **Review all results** - Look for patterns
2. **Prioritize fixes** - High priority first
3. **Update SKILL.md** - Based on issues found
4. **Create scripts** - Extract repeated work
5. **Re-run tests** - Verify improvements
6. **Repeat** - Until satisfied or good enough