# Grading Guide

Comprehensive guide for evaluating skill outputs and identifying improvements.

## Table of Contents

1. [The Evaluation Framework](#the-evaluation-framework)
2. [Correctness](#correctness)
3. [Completeness](#completeness)
4. [Format](#format)
5. [Triggering](#triggering)
6. [Efficiency](#efficiency)
7. [Identifying Patterns](#identifying-patterns)
8. [Decision Framework](#decision-framework)
9. [Common Issues and Solutions](#common-issues-and-solutions)
10. [When to Stop Iterating](#when-to-stop-iterating)
11. [Grading Checklist](#grading-checklist)
12. [Recording Results](#recording-results)
13. [Next Steps After Grading](#next-steps-after-grading)

---

## The Evaluation Framework

Evaluate skill outputs across five dimensions:

1. **Correctness** - Is the output accurate and correct?
2. **Completeness** - Are all requirements met?
3. **Format** - Does it follow expected structure?
4. **Triggering** - Does it activate appropriately?
5. **Efficiency** - Is it concise yet complete?

Each dimension has specific criteria to check. Use the `grade-output.sh` script for systematic evaluation.

---

## Correctness

### What to Check

**Output matches expected result:**
- For file transforms: Output file contains correct data
- For code generation: Code works as described
- For workflows: All steps lead to desired outcome
- For documentation: Information is accurate

**No factual errors:**
- Commands use correct syntax
- File paths are accurate
- Versions are correct
- Technical details are right

**Logic is sound:**
- Reasoning makes sense
- Steps follow logically
- No contradictions
- Edge cases considered

**Edge cases handled appropriately:**
- Empty inputs don't crash
- Large inputs work
- Special characters preserved
- Errors handled gracefully

### Examples

**Good (Correct):**
```bash
# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0
```

**Bad (Incorrect):**
```bash
# Wrong flag syntax
docker run --port 3000:3000 myimage  # Should be -p or --publish
```

**Good (Handles Edge Cases):**
```python
import os

if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")
```

**Bad (No Edge Case Handling):**
```python
process_file(filename)  # Will crash if file doesn't exist
```

---

## Completeness

### What to Check

**All requested tasks completed:**
- Every requirement from prompt addressed
- No steps forgotten
- All files generated
- All transformations applied

**No steps skipped:**
- Validation steps included
- Cleanup steps not forgotten
- Confirmation steps present
- Rollback guidance provided

**Appropriate level of detail:**
- Not too brief (missing context)
- Not too verbose (information overload)
- Explains "why" not just "what"
- Provides examples where helpful

**Relevant context included:**
- Prerequisites mentioned
- Dependencies noted
- Error scenarios covered
- Alternative approaches offered

### Examples

**Good (Complete):**
```
To release version 1.3.0:

1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag

Prerequisites:
- All tests passing
- CHANGELOG.md updated
- You have push access to repository
```

**Bad (Incomplete):**
```
To release:
1. Update version
2. Create tag
3. Push
```

---

## Format

### What to Check

**Output follows specified format:**
- If skill defines a template, output matches it
- Markdown formatting is correct
- Code blocks use proper syntax highlighting
- Tables are properly formatted

**Consistent with examples in skill:**
- Format matches examples in SKILL.md
- Style is consistent
- Terminology is consistent
- Structure follows patterns

**Easy to read and understand:**
- Clear headings and sections
- Good use of whitespace
- Logical organization
- No wall of text

### Examples

**Good (Well-Formatted):**
````markdown
## Deployment Steps

### 1. Build Docker Image

```bash
docker build -t myapp:v1.0 .
```

### 2. Run Container

```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```

### 3. Verify Deployment

```bash
curl http://localhost:3000/health
```
````

**Bad (Poor Formatting):**
```
step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl
```
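
Some of the format checks above can be automated. The following is a minimal sketch, not part of the skill tooling: `check_fences` is a hypothetical helper that flags code fences without a language tag and fences that are never closed. It assumes plain three-backtick fences and does not handle nested or longer fences.

```python
FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block


def check_fences(markdown: str) -> list[str]:
    """Return a list of problems found in the fenced code blocks of `markdown`."""
    problems = []
    open_line = None  # line number of the fence currently open, if any
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if open_line is None:
            open_line = number
            if stripped == FENCE:
                problems.append(f"line {number}: code fence has no language tag")
        else:
            open_line = None  # this line closes the open block
    if open_line is not None:
        problems.append(f"line {open_line}: code fence is never closed")
    return problems


good = f"## Build\n\n{FENCE}bash\ndocker build -t myapp:v1.0 .\n{FENCE}\n"
bad = f"## Build\n\n{FENCE}\ndocker build -t myapp:v1.0 .\n"

print(check_fences(good))  # no problems expected
print(check_fences(bad))   # missing language tag and missing closing fence
```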

---

## Triggering

### What to Check

**Skill activates when appropriate:**
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords

**Doesn't activate when inappropriate:**
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected

### Testing Triggering

Test with 8 queries (4 should-trigger, 4 should-not-trigger):

**Example for Docker skill:**

Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"

Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)
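
The eight-query test can be tabulated with a small harness. This is a sketch under stated assumptions: `evaluate_triggering` and `keyword_matcher` are hypothetical names, and the keyword matcher is only a stand-in for however skill activation is actually observed (in practice the model decides, not a keyword list).

```python
from typing import Callable

# (query, should_trigger) pairs taken from the Docker example above.
TRIGGER_CASES = [
    ("Create a docker container for my Node.js app", True),
    ("Build an image from this Dockerfile", True),
    ("How do I compose up my services?", True),
    ("I need to deploy this using containers", True),
    ("Install Docker on my machine", False),
    ("What is containerization?", False),
    ("Show me Kubernetes commands", False),
    ("Write a fibonacci function", False),
]


def evaluate_triggering(activates: Callable[[str], bool], cases) -> dict:
    """Compare observed activation against expectations for each query."""
    missed = [q for q, expected in cases if expected and not activates(q)]
    false_triggers = [q for q, expected in cases if not expected and activates(q)]
    return {
        "missed": missed,
        "false_triggers": false_triggers,
        "passed": len(cases) - len(missed) - len(false_triggers),
    }


def keyword_matcher(query: str) -> bool:
    """Hypothetical stand-in for a real activation check: naive keyword match."""
    q = query.lower()
    return any(word in q for word in ("docker", "container", "image", "compose"))


report = evaluate_triggering(keyword_matcher, TRIGGER_CASES)
print(report["passed"], "of", len(TRIGGER_CASES), "queries behaved as expected")
for query in report["false_triggers"]:
    print("false trigger:", query)
```

With this naive matcher, the installation and education queries both fire, which is the kind of false trigger the should-not-trigger half of the test is meant to expose.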

### Examples

**Good (Appropriate Triggering):**
- User: "containerize my app"
- Skill: Docker helper activates ✓

**Bad (Missed Trigger):**
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗

**Bad (False Trigger):**
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗

---

## Efficiency

### What to Check

**No unnecessary steps:**
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined

**Reasonable response length:**
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value

**Not overly verbose:**
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise

### Examples

**Good (Efficient):**
```bash
# Clean up Docker resources
docker system prune -f
```

**Bad (Verbose):**
```bash
# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images
```

---

## Identifying Patterns

### Single Test Failure

**Characteristics:**
- Only one test case fails
- Issue is unique to that scenario
- Other tests pass

**Action:**
- Add targeted instruction
- Include example for that edge case
- Fix the specific issue
- Don't generalize prematurely

**Example:**
```
Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
for files over 10,000 rows
```

### Multiple Test Failures (Same Issue)

**Characteristics:**
- Same problem across 2+ tests
- Pattern suggests root cause
- Symptom appears in different forms

**Action:**
- Fix the root cause
- Add general principle, not specific fix
- Consider extracting to script
- Explain the "why"

**Example:**
```
Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
rather than fixing each case individually
```

### Multiple Test Failures (Different Issues)

**Characteristics:**
- Different problems in each test
- No clear pattern
- Skill may be too broad

**Action:**
- Clarify scope in description
- Add more specific instructions
- Consider splitting into multiple skills
- Focus on core functionality first

**Example:**
```
Issue: Test 1 fails on format, Test 2 fails on logic, Test 3
never triggers skill
Solution: Clarify what the skill does/doesn't do in description
Add validation steps
```
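
The three failure patterns above can be told apart mechanically once each failing test has a short issue label. A minimal sketch, assuming the grader assigns those labels by hand; `classify_failures` is a hypothetical helper, not an existing script.

```python
from collections import Counter


def classify_failures(failures: dict[str, str]) -> str:
    """Map {test id: issue label} to one of the three patterns described above."""
    if not failures:
        return "no failures"
    if len(failures) == 1:
        return "single failure: add a targeted instruction"
    # Find the most common issue label across the failing tests.
    label, count = Counter(failures.values()).most_common(1)[0]
    if count >= 2:
        return f"shared issue '{label}': fix the root cause"
    return "unrelated issues: clarify scope or split the skill"


print(classify_failures({"test-2": "large file"}))
print(classify_failures({"test-1": "date format", "test-3": "date format"}))
print(classify_failures({"test-1": "format", "test-2": "logic", "test-3": "no trigger"}))
```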

---

## Decision Framework

### Fix Specific Case When:

- Unique edge case not covered
- One-time issue unlikely to recur
- Fix is simple and doesn't add complexity
- Doesn't indicate systemic problem

**Example:**
```
Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
(Don't rewrite entire CSV handling)
```

### Generalize Solution When:

- Same issue in 2+ tests
- Pattern suggests broader applicability
- Fix would benefit similar requests
- Explains underlying principle

**Example:**
```
Tests: Both test 1 and 2 have incorrect error handling
Fix: Add general principle about validating inputs before processing
with examples of common validation checks
```

### Extract to Script When:

- Same multi-step process repeated
- Deterministic operation (not creative)
- Would save time on every invocation
- Logic is complex or error-prone

**Example:**
```
Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
```
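
As an illustration of what such an extracted script might contain, here is one possible shape for the date conversion. It is a sketch only: the function name `to_iso_date` and the list of accepted input formats are assumptions, and formats like `03/04/2024` are ambiguous between month-first and day-first, so the list should match the data the skill actually sees.

```python
from datetime import datetime

# Input formats this sketch accepts; an assumption, adjust to the data at hand.
# Note that "%m/%d/%Y" treats 03/04/2024 as March 4, not April 3.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def to_iso_date(text: str) -> str:
    """Convert a date string in one of KNOWN_FORMATS to ISO 8601 (YYYY-MM-DD)."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {text!r}")


print(to_iso_date("03/15/2024"))
print(to_iso_date("15.03.2024"))
```

Raising on unrecognized input, rather than guessing, keeps the operation deterministic, which is the main reason to extract it into a script.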

---

## Common Issues and Solutions

### Issue: Skill Doesn't Trigger

**Symptoms:**
- User says relevant phrase
- Skill doesn't appear in available skills
- Model handles request without skill

**Solutions:**
1. Add specific trigger phrases to description
2. Use "pushy" language: "Make sure to use this skill whenever..."
3. Include synonyms and variations
4. Test with 8 trigger queries

### Issue: Output Format Inconsistent

**Symptoms:**
- Sometimes follows template, sometimes doesn't
- Format varies between similar requests
- Missing sections or elements

**Solutions:**
1. Add explicit format template in SKILL.md
2. Provide multiple examples
3. Explain why format matters
4. Use imperative: "ALWAYS use this template"

### Issue: Edge Cases Not Handled

**Symptoms:**
- Large files cause errors
- Empty inputs crash
- Special characters corrupted
- Missing data causes failures

**Solutions:**
1. Add validation step instructions
2. Include error handling examples
3. Create helper script for complex validation
4. Document known limitations

### Issue: Too Verbose

**Symptoms:**
- Responses are walls of text
- Over-explains obvious steps
- Repeats information
- Takes too long to get to point

**Solutions:**
1. Remove redundant explanations
2. Trust user's expertise
3. Move details to references/
4. Use progressive disclosure

### Issue: Incomplete Responses

**Symptoms:**
- Misses steps in workflow
- Forgets validation
- Doesn't mention prerequisites
- No error handling

**Solutions:**
1. Add comprehensive checklist in SKILL.md
2. Include validation at each step
3. Document prerequisites upfront
4. Add error handling examples

### Issue: Incorrect Commands

**Symptoms:**
- Commands don't work when copied
- Syntax errors
- Wrong flags or options
- Outdated versions

**Solutions:**
1. Test all commands before including
2. Specify version requirements
3. Add validation steps
4. Include error messages to expect

---

## When to Stop Iterating

### Stop When:

1. **User is satisfied**
   - User says "this works for me"
   - User stops requesting changes
   - User starts using skill regularly

2. **Outputs meet expectations**
   - All test cases pass
   - Common scenarios work
   - Edge cases handled reasonably
   - No critical issues remain

3. **No meaningful progress**
   - Last 2-3 iterations didn't improve results
   - Changes are cosmetic only
   - Diminishing returns on effort
   - 90% solution is good enough
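
The "no meaningful progress" criterion can be checked against the pass rate recorded after each iteration. A minimal sketch, assuming pass rates are stored as fractions between 0 and 1; `has_plateaued`, the window of 3 iterations, and the 0.01 minimum gain are illustrative choices, not values taken from this guide.

```python
def has_plateaued(pass_rates: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """True when none of the last `window` iterations beat the earlier best by `min_gain`."""
    if len(pass_rates) <= window:
        return False  # not enough history to judge
    best_before = max(pass_rates[:-window])
    return max(pass_rates[-window:]) < best_before + min_gain


# Pass rate after each iteration of the skill (fraction of test cases passing).
print(has_plateaued([0.5, 0.7, 0.9, 0.9, 0.9, 0.9]))  # recent iterations add nothing
print(has_plateaued([0.5, 0.6, 0.7, 0.8]))            # still improving
```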

### Don't Stop When:

- **User wants to keep iterating** - Follow user's lead
- **New issues emerge** - Fix real problems as they're found

Neither is a reason to chase perfection:

- **Perfect is the enemy of good** - 90% working is better than endlessly tweaking
- **Edge cases are theoretical** - Don't over-engineer for unlikely scenarios

### The 90% Rule

A skill that works well for 90% of cases is sufficient. Users can handle:
- Rare edge cases manually
- Unique situations with custom guidance
- Complex scenarios with multiple steps

**Focus on:**
- Common use cases (80% of value)
- Clear failure messages (10% of value)
- Good documentation (10% of value)

**Don't focus on:**
- Every possible edge case (diminishing returns)
- Perfect formatting (good enough is fine)
- Handling every possible error (major errors are enough)

---

## Grading Checklist

Use this checklist for each test case:

### Correctness
- [ ] Output matches expected result
- [ ] No factual errors
- [ ] Logic is sound
- [ ] Edge cases handled appropriately

### Completeness
- [ ] All requested tasks completed
- [ ] No steps skipped
- [ ] Appropriate level of detail
- [ ] Relevant context included

### Format
- [ ] Output follows specified format
- [ ] Consistent with examples
- [ ] Easy to read and understand

### Triggering
- [ ] Skill activated when appropriate
- [ ] Did not activate when inappropriate

### Efficiency
- [ ] No unnecessary steps
- [ ] Reasonable response length
- [ ] Not overly verbose

### Overall
- [ ] Would use this skill again
- [ ] Would recommend to others
- [ ] Saves time vs manual approach
- [ ] Output quality meets needs

---

## Recording Results

After grading, record:

```json
{
  "test_id": 1,
  "result": "pass|fail|partial",
  "issues": [
    "Description of issue 1",
    "Description of issue 2"
  ],
  "suggested_fix": "Brief description of improvement",
  "extract_script": false,
  "priority": "high|medium|low"
}
```

Use the `grade-output.sh` script to generate this structure interactively.
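
When records are written by hand or by another tool, it can help to validate them against the structure above. The following sketch is not the `grade-output.sh` script; `make_record` is a hypothetical helper that rejects values outside the allowed `result` and `priority` choices.

```python
import json

ALLOWED_RESULTS = {"pass", "fail", "partial"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def make_record(test_id: int, result: str, issues: list[str], suggested_fix: str,
                extract_script: bool = False, priority: str = "medium") -> dict:
    """Build one grading record, rejecting values outside the allowed choices."""
    if result not in ALLOWED_RESULTS:
        raise ValueError(f"result must be one of {sorted(ALLOWED_RESULTS)}")
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    return {
        "test_id": test_id,
        "result": result,
        "issues": list(issues),
        "suggested_fix": suggested_fix,
        "extract_script": extract_script,
        "priority": priority,
    }


record = make_record(
    1,
    "partial",
    ["Dates not in ISO 8601 format"],
    "Add a general instruction about date formatting",
    priority="high",
)
print(json.dumps(record, indent=2))
```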

---

## Next Steps After Grading

1. **Review all results** - Look for patterns
2. **Prioritize fixes** - High priority first
3. **Update SKILL.md** - Based on issues found
4. **Create scripts** - Extract repeated work
5. **Re-run tests** - Verify improvements
6. **Repeat** - Until satisfied or good enough
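
The review and prioritization steps can be sketched over a list of grading records. This assumes records shaped like the JSON structure in Recording Results; `fixes_by_priority` and the sample data are hypothetical.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def fixes_by_priority(records: list[dict]) -> list[dict]:
    """Return the non-passing records, highest priority first."""
    pending = [r for r in records if r["result"] != "pass"]
    # sorted() is stable, so records with equal priority keep their original order.
    return sorted(pending, key=lambda r: PRIORITY_ORDER[r["priority"]])


# Illustrative records only; real ones come from grading each test case.
results = [
    {"test_id": 1, "result": "partial", "priority": "low"},
    {"test_id": 2, "result": "pass", "priority": "low"},
    {"test_id": 3, "result": "fail", "priority": "high"},
    {"test_id": 4, "result": "fail", "priority": "medium"},
]

for r in fixes_by_priority(results):
    print(r["test_id"], r["result"], r["priority"])
```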