Grading Guide
Comprehensive guide for evaluating skill outputs and identifying improvements.
Table of Contents
- The Evaluation Framework
- Correctness
- Completeness
- Format
- Triggering
- Efficiency
- Identifying Patterns
- Decision Framework
- Common Issues and Solutions
- When to Stop Iterating
The Evaluation Framework
Evaluate skill outputs across five dimensions:
- Correctness - Is the output accurate?
- Completeness - Are all requirements met?
- Format - Does it follow expected structure?
- Triggering - Does it activate appropriately?
- Efficiency - Is it concise yet complete?
Each dimension has specific criteria to check. Use the grade-output.sh script for systematic evaluation.
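As a rough illustration of grading one test case across the five dimensions, the sketch below tracks a pass/fail mark per dimension. It is not the grade-output.sh script, whose behavior is not shown here; the `DimensionGrades` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

# The five dimensions named in this guide.
DIMENSIONS = ("correctness", "completeness", "format", "triggering", "efficiency")


@dataclass
class DimensionGrades:
    """Pass/fail marks for one test case (hypothetical structure)."""

    grades: dict = field(default_factory=dict)

    def record(self, dimension: str, passed: bool) -> None:
        # Reject typos so every mark maps to a known dimension.
        if dimension not in DIMENSIONS:
            raise ValueError(f"Unknown dimension: {dimension}")
        self.grades[dimension] = passed

    def ungraded(self) -> list:
        # Dimensions that have not been checked yet.
        return [d for d in DIMENSIONS if d not in self.grades]

    def failed(self) -> list:
        # Dimensions explicitly marked as failing.
        return [d for d in DIMENSIONS if self.grades.get(d) is False]


grades = DimensionGrades()
grades.record("correctness", True)
grades.record("format", False)
print(grades.failed())    # ['format']
print(grades.ungraded())  # ['completeness', 'triggering', 'efficiency']
```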
Correctness
What to Check
Output matches expected result:
- For file transforms: Output file contains correct data
- For code generation: Code works as described
- For workflows: All steps lead to desired outcome
- For documentation: Information is accurate
No factual errors:
- Commands use correct syntax
- File paths are accurate
- Versions are correct
- Technical details are right
Logic is sound:
- Reasoning makes sense
- Steps follow logically
- No contradictions
- Edge cases considered
Edge cases handled appropriately:
- Empty inputs don't crash
- Large inputs work
- Special characters preserved
- Errors handled gracefully
Examples
Good (Correct):
# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0
Bad (Incorrect):
# Wrong flag syntax
docker run --port 3000:3000 myimage # Should be -p or --publish
Good (Handles Edge Cases):
import os
if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")
Bad (No Edge Case Handling):
process_file(filename) # Will crash if file doesn't exist
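A slightly fuller sketch of the edge cases listed under "What to Check" (missing file, empty input, special characters) might look like the following. The `read_rows` helper is a hypothetical name used only for illustration.

```python
from pathlib import Path


def read_rows(path: str) -> list[str]:
    """Return the non-blank lines of a text file, or [] if it is missing or empty."""
    file = Path(path)
    if not file.is_file():
        # Handle the error gracefully instead of crashing.
        print(f"Error: {path} not found")
        return []
    # Decode explicitly as UTF-8 so special characters are preserved.
    text = file.read_text(encoding="utf-8")
    # An empty file simply yields an empty list.
    return [line for line in text.splitlines() if line.strip()]
```

For very large inputs, reading line by line rather than loading the whole file would be the safer choice. This sketch keeps the simpler form for brevity.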
Completeness
What to Check
All requested tasks completed:
- Every requirement from prompt addressed
- No steps forgotten
- All files generated
- All transformations applied
No steps skipped:
- Validation steps included
- Cleanup steps not forgotten
- Confirmation steps present
- Rollback guidance provided
Appropriate level of detail:
- Not too brief (missing context)
- Not too verbose (information overload)
- Explains "why" not just "what"
- Provides examples where helpful
Relevant context included:
- Prerequisites mentioned
- Dependencies noted
- Error scenarios covered
- Alternative approaches offered
Examples
Good (Complete):
To release version 1.3.0:
1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag
Prerequisites:
- All tests passing
- CHANGELOG.md updated
- You have push access to repository
Bad (Incomplete):
To release:
1. Update version
2. Create tag
3. Push
Format
What to Check
Output follows specified format:
- If skill defines a template, output matches it
- Markdown formatting is correct
- Code blocks use proper syntax highlighting
- Tables are properly formatted
Consistent with examples in skill:
- Format matches examples in SKILL.md
- Style is consistent
- Terminology is consistent
- Structure follows patterns
Easy to read and understand:
- Clear headings and sections
- Good use of whitespace
- Logical organization
- No wall of text
Examples
Good (Well-Formatted):
## Deployment Steps
### 1. Build Docker Image
```bash
docker build -t myapp:v1.0 .
```
### 2. Run Container
```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```
### 3. Verify Deployment
```bash
curl http://localhost:3000/health
```
Bad (Poor Formatting):
step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl
Triggering
What to Check
Skill activates when appropriate:
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords
Doesn't activate when inappropriate:
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected
Testing Triggering
Test with 8 queries (4 should-trigger, 4 should-not-trigger):
Example for Docker skill:
Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"
Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)
Examples
Good (Appropriate Triggering):
- User: "containerize my app"
- Skill: Docker helper activates ✓
Bad (Missed Trigger):
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗
Bad (False Trigger):
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗
Efficiency
What to Check
No unnecessary steps:
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined
Reasonable response length:
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value
Not overly verbose:
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise
Examples
Good (Efficient):
# Clean up Docker resources
docker system prune -f
Bad (Verbose):
# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images
Identifying Patterns
Single Test Failure
Characteristics:
- Only one test case fails
- Issue is unique to that scenario
- Other tests pass
Action:
- Add targeted instruction
- Include example for that edge case
- Fix the specific issue
- Don't generalize prematurely
Example:
Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
for files over 10,000 rows
Multiple Test Failures (Same Issue)
Characteristics:
- Same problem across 2+ tests
- Pattern suggests root cause
- Symptom appears in different forms
Action:
- Fix the root cause
- Add general principle, not specific fix
- Consider extracting to script
- Explain the "why"
Example:
Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
rather than fixing each case individually
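One way to spot the "same problem across 2+ tests" pattern is to count how many tests each issue appears in. The sketch below assumes each graded result carries a `test_id` and a list of short, normalized issue labels (a hypothetical convention, since free-text descriptions rarely match exactly).

```python
from collections import defaultdict


def recurring_issues(results, min_tests=2):
    """Map each issue label to the test ids it appears in, keeping labels seen in min_tests or more tests."""
    tests_by_issue = defaultdict(set)
    for result in results:
        for issue in result["issues"]:
            tests_by_issue[issue].add(result["test_id"])
    return {
        issue: sorted(ids)
        for issue, ids in tests_by_issue.items()
        if len(ids) >= min_tests
    }


# Illustrative data only: tests 1 and 3 share a date formatting problem.
results = [
    {"test_id": 1, "issues": ["date-format"]},
    {"test_id": 2, "issues": ["missing-prerequisites"]},
    {"test_id": 3, "issues": ["date-format", "too-verbose"]},
]
print(recurring_issues(results))  # {'date-format': [1, 3]}
```

Labels that show up here are candidates for a root-cause fix. Labels that appear only once are better handled as targeted instructions.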
Multiple Test Failures (Different Issues)
Characteristics:
- Different problems in each test
- No clear pattern
- Skill may be too broad
Action:
- Clarify scope in description
- Add more specific instructions
- Consider splitting into multiple skills
- Focus on core functionality first
Example:
Issue: Test 1 fails on format, Test 2 fails on logic, Test 3
never triggers skill
Solution: Clarify what the skill does/doesn't do in description
Add validation steps
Decision Framework
Fix Specific Case When:
- Unique edge case not covered
- One-time issue unlikely to recur
- Fix is simple and doesn't add complexity
- Doesn't indicate systemic problem
Example:
Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
(Don't rewrite entire CSV handling)
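For a note like this, a short example in the skill is often enough. The sketch below reads a CSV with an explicit encoding using Python's standard `csv` module. The `read_csv_utf8` name is hypothetical.

```python
import csv


def read_csv_utf8(path):
    """Read a CSV file into a list of dicts, decoding it as UTF-8."""
    # newline="" lets the csv module handle line endings itself.
    # "utf-8-sig" reads plain UTF-8 and also strips a byte order mark if present.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```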
Generalize Solution When:
- Same issue in 2+ tests
- Pattern suggests broader applicability
- Fix would benefit similar requests
- Explains underlying principle
Example:
Tests: Both test 1 and 2 have incorrect error handling
Fix: Add general principle about validating inputs before processing
with examples of common validation checks
Extract to Script When:
- Same multi-step process repeated
- Deterministic operation (not creative)
- Would save time on every invocation
- Logic is complex or error-prone
Example:
Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
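The contents of scripts/convert-dates.py are not given in this guide. A minimal sketch of what such a script could contain, assuming a small, fixed set of input formats (the `INPUT_FORMATS` tuple and `to_iso_date` name are hypothetical), is:

```python
from datetime import datetime

# Assumed input formats, tried in order. "%m/%d/%Y" treats slashed dates as
# month-first, and "%B" relies on English month names in the default locale.
INPUT_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")


def to_iso_date(text: str) -> str:
    """Convert a date string in one of INPUT_FORMATS to ISO 8601 (YYYY-MM-DD)."""
    cleaned = text.strip()
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next format.
    raise ValueError(f"Unrecognized date format: {text!r}")


print(to_iso_date("03/15/2024"))      # 2024-03-15
print(to_iso_date("March 15, 2024"))  # 2024-03-15
```

Because the conversion is deterministic, keeping it in a script avoids restating the rules in every response.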
Common Issues and Solutions
Issue: Skill Doesn't Trigger
Symptoms:
- User says relevant phrase
- Skill doesn't appear in available skills
- Model handles request without skill
Solutions:
- Add specific trigger phrases to description
- Use "pushy" language: "Make sure to use this skill whenever..."
- Include synonyms and variations
- Test with 8 trigger queries
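The 8-query check can be scored mechanically once you have some way to observe whether the skill activated. In the sketch below, `activates` is a hypothetical stand-in for that observation, and `keyword_stub` is a deliberately naive placeholder. It is not how skill triggering actually works.

```python
from typing import Callable

SHOULD_TRIGGER = [
    "Create a docker container for my Node.js app",
    "Build an image from this Dockerfile",
    "How do I compose up my services?",
    "I need to deploy this using containers",
]
SHOULD_NOT_TRIGGER = [
    "Install Docker on my machine",
    "What is containerization?",
    "Show me Kubernetes commands",
    "Write a fibonacci function",
]


def score_triggering(activates: Callable[[str], bool]) -> dict:
    """Report missed triggers and false triggers for the 8 test queries."""
    missed = [q for q in SHOULD_TRIGGER if not activates(q)]
    false_triggers = [q for q in SHOULD_NOT_TRIGGER if activates(q)]
    total = len(SHOULD_TRIGGER) + len(SHOULD_NOT_TRIGGER)
    return {
        "missed": missed,
        "false_triggers": false_triggers,
        "correct": total - len(missed) - len(false_triggers),
        "total": total,
    }


def keyword_stub(query: str) -> bool:
    # Naive placeholder: fires on any mention of docker or containers.
    lowered = query.lower()
    return "docker" in lowered or "container" in lowered


report = score_triggering(keyword_stub)
print(report["correct"], "of", report["total"])  # 5 of 8
```

The stub misses the "compose up" query and fires on both the installation and the "What is containerization?" queries. Exact keyword matching fails in the same way, and a well-written description should avoid it.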
Issue: Output Format Inconsistent
Symptoms:
- Sometimes follows template, sometimes doesn't
- Format varies between similar requests
- Missing sections or elements
Solutions:
- Add explicit format template in SKILL.md
- Provide multiple examples
- Explain why format matters
- Use imperative: "ALWAYS use this template"
Issue: Edge Cases Not Handled
Symptoms:
- Large files cause errors
- Empty inputs crash
- Special characters corrupted
- Missing data causes failures
Solutions:
- Add validation step instructions
- Include error handling examples
- Create helper script for complex validation
- Document known limitations
Issue: Too Verbose
Symptoms:
- Responses are walls of text
- Over-explains obvious steps
- Repeats information
- Takes too long to get to point
Solutions:
- Remove redundant explanations
- Trust user's expertise
- Move details to references/
- Use progressive disclosure
Issue: Incomplete Responses
Symptoms:
- Misses steps in workflow
- Forgets validation
- Doesn't mention prerequisites
- No error handling
Solutions:
- Add comprehensive checklist in SKILL.md
- Include validation at each step
- Document prerequisites upfront
- Add error handling examples
Issue: Incorrect Commands
Symptoms:
- Commands don't work when copied
- Syntax errors
- Wrong flags or options
- Outdated versions
Solutions:
- Test all commands before including
- Specify version requirements
- Add validation steps
- Include error messages to expect
When to Stop Iterating
Stop When:
- User is satisfied
  - User says "this works for me"
  - User stops requesting changes
  - User starts using skill regularly
- Outputs meet expectations
  - All test cases pass
  - Common scenarios work
  - Edge cases handled reasonably
  - No critical issues remain
- No meaningful progress
  - Last 2-3 iterations didn't improve results
  - Changes are cosmetic only
  - Diminishing returns on effort
  - 90% solution is good enough
Don't Stop When:
- User wants to keep iterating - Follow user's lead
- New issues emerge - Fix real problems as they're found
Don't Keep Iterating Just Because:
- The output isn't perfect - Perfect is the enemy of good; 90% working is better than endlessly tweaking
- Edge cases are theoretical - Don't over-engineer for unlikely scenarios
The 90% Rule
A skill that works well for 90% of cases is sufficient. Users can handle:
- Rare edge cases manually
- Unique situations with custom guidance
- Complex scenarios with multiple steps
Focus on:
- Common use cases (80% of value)
- Clear failure messages (10% of value)
- Good documentation (10% of value)
Don't focus on:
- Every possible edge case (diminishing returns)
- Perfect formatting (good enough is fine)
- Handling every possible error (major errors are enough)
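The measurable parts of the stopping criteria (roughly 90% of cases working, or no improvement over the last 2-3 iterations) can be sketched as a simple check over pass rates. The thresholds and the `should_stop` name are assumptions for illustration. User satisfaction still has to be judged by asking the user.

```python
def should_stop(pass_rates, target=0.9, window=3, min_gain=0.01):
    """Decide whether to stop, given the pass rate after each iteration (oldest first).

    Stops when the latest pass rate reaches `target`, or when the best of the
    last `window` iterations improved on the iteration before them by less
    than `min_gain`.
    """
    if not pass_rates:
        return False
    if pass_rates[-1] >= target:
        return True  # Good enough by the 90% rule.
    if len(pass_rates) > window:
        baseline = pass_rates[-window - 1]
        best_recent = max(pass_rates[-window:])
        return best_recent - baseline < min_gain  # Plateau: no meaningful progress.
    return False


print(should_stop([0.5, 0.7, 0.92]))            # True
print(should_stop([0.6, 0.7, 0.7, 0.7, 0.7]))   # True
print(should_stop([0.5, 0.6, 0.7, 0.8]))        # False
```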
Grading Checklist
Use this checklist for each test case:
Correctness
- Output matches expected result
- No factual errors
- Logic is sound
- Edge cases handled appropriately
Completeness
- All requested tasks completed
- No steps skipped
- Appropriate level of detail
- Relevant context included
Format
- Output follows specified format
- Consistent with examples
- Easy to read and understand
Triggering
- Skill activated when appropriate
- Did not activate when inappropriate
Efficiency
- No unnecessary steps
- Reasonable response length
- Not overly verbose
Overall
- Would use this skill again
- Would recommend to others
- Saves time vs manual approach
- Output quality meets needs
Recording Results
After grading, record:
{
"test_id": 1,
"result": "pass|fail|partial",
"issues": [
"Description of issue 1",
"Description of issue 2"
],
"suggested_fix": "Brief description of improvement",
"extract_script": false,
"priority": "high|medium|low"
}
Use the grade-output.sh script to generate this structure interactively.
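If you record results by hand or from your own tooling, a small validator helps keep the fields consistent. The sketch below mirrors the structure above. It is not the grade-output.sh script, and the `make_record` helper is hypothetical.

```python
import json

VALID_RESULTS = {"pass", "fail", "partial"}
VALID_PRIORITIES = {"high", "medium", "low"}


def make_record(test_id, result, issues, suggested_fix,
                extract_script=False, priority="medium"):
    """Build one grading record, rejecting values outside the allowed sets."""
    if result not in VALID_RESULTS:
        raise ValueError(f"result must be one of {sorted(VALID_RESULTS)}")
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(VALID_PRIORITIES)}")
    return {
        "test_id": test_id,
        "result": result,
        "issues": list(issues),
        "suggested_fix": suggested_fix,
        "extract_script": extract_script,
        "priority": priority,
    }


record = make_record(
    1,
    "partial",
    ["Dates not in ISO 8601 format"],
    "Add a general instruction about date formatting",
    priority="high",
)
print(json.dumps(record, indent=2))
```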
Next Steps After Grading
- Review all results - Look for patterns
- Prioritize fixes - High priority first
- Update SKILL.md - Based on issues found
- Create scripts - Extract repeated work
- Re-run tests - Verify improvements
- Repeat - Until satisfied or good enough
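The "high priority first" step can be sketched as a sort over the recorded results, assuming records in the shape shown under Recording Results. The `fixes_by_priority` name is hypothetical.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def fixes_by_priority(records):
    """Return non-passing records, highest priority first (ties keep input order)."""
    needs_work = [r for r in records if r["result"] != "pass"]
    return sorted(needs_work, key=lambda r: PRIORITY_ORDER[r["priority"]])


# Illustrative data only.
records = [
    {"test_id": 1, "result": "partial", "priority": "low"},
    {"test_id": 2, "result": "pass", "priority": "low"},
    {"test_id": 3, "result": "fail", "priority": "high"},
]
print([r["test_id"] for r in fixes_by_priority(records)])  # [3, 1]
```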