# Grading Guide

Comprehensive guide for evaluating skill outputs and identifying improvements.

## Table of Contents

1. [The Evaluation Framework](#the-evaluation-framework)
2. [Correctness](#correctness)
3. [Completeness](#completeness)
4. [Format](#format)
5. [Triggering](#triggering)
6. [Efficiency](#efficiency)
7. [Identifying Patterns](#identifying-patterns)
8. [Decision Framework](#decision-framework)
9. [Common Issues and Solutions](#common-issues-and-solutions)
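10. [When to Stop Iterating](#when-to-stop-iterating)
11. [Grading Checklist](#grading-checklist)
12. [Recording Results](#recording-results)
13. [Next Steps After Grading](#next-steps-after-grading)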

---

## The Evaluation Framework

Evaluate skill outputs across five dimensions:

1. **Correctness** - Is the output accurate?
2. **Completeness** - Are all requirements met?
3. **Format** - Does it follow expected structure?
4. **Triggering** - Does it activate appropriately?
5. **Efficiency** - Is it concise yet complete?

Each dimension has specific criteria to check. Use the `grade-output.sh` script for systematic evaluation.

---

## Correctness

### What to Check

**Output matches expected result:**
- For file transforms: Output file contains correct data
- For code generation: Code works as described
- For workflows: All steps lead to desired outcome
- For documentation: Information is accurate

**No factual errors:**
- Commands use correct syntax
- File paths are accurate
- Versions are correct
- Technical details are right

**Logic is sound:**
- Reasoning makes sense
- Steps follow logically
- No contradictions
- Edge cases considered

**Edge cases handled appropriately:**
- Empty inputs don't crash
- Large inputs work
- Special characters preserved
- Errors handled gracefully
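
The crash-related checks above can be exercised with a small harness. This is a minimal sketch, not part of the skill tooling: `check_edge_cases` is a hypothetical helper, and it assumes the output under test is a callable that takes a string. It only detects exceptions; confirming that special characters are preserved still means comparing the output by hand or with a separate assertion.

```python
def check_edge_cases(transform):
    """Run `transform` (any callable taking a str) on awkward inputs.

    Returns a dict mapping each case name to "ok" or the error type raised.
    """
    cases = {
        "empty": "",
        "large": "row\n" * 100_000,
        "special": 'naïve café, 日本語, "quoted"',
    }
    results = {}
    for name, data in cases.items():
        try:
            transform(data)
            results[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing
            results[name] = f"error: {type(exc).__name__}"
    return results


# A transform that assumes at least one line of input fails on "empty".
print(check_edge_cases(lambda s: s.splitlines()[0]))
```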

### Examples

**Good (Correct):**
```bash
# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0
```

**Bad (Incorrect):**
```bash
# Wrong flag syntax
docker run --port 3000:3000 myimage  # Should be -p or --publish
```

**Good (Handles Edge Cases):**
```python
import os
if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")
```

**Bad (No Edge Case Handling):**
```python
process_file(filename)  # Will crash if file doesn't exist
```

---

## Completeness

### What to Check

**All requested tasks completed:**
- Every requirement from prompt addressed
- No steps forgotten
- All files generated
- All transformations applied

**No steps skipped:**
- Validation steps included
- Cleanup steps not forgotten
- Confirmation steps present
- Rollback guidance provided

**Appropriate level of detail:**
- Not too brief (missing context)
- Not too verbose (information overload)
- Explains "why" not just "what"
- Provides examples where helpful

**Relevant context included:**
- Prerequisites mentioned
- Dependencies noted
- Error scenarios covered
- Alternative approaches offered

### Examples

**Good (Complete):**
```
To release version 1.3.0:

1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag

Prerequisites:
- All tests passing
- You have push access to the repository
```

**Bad (Incomplete):**
```
To release:
1. Update version
2. Create tag
3. Push
```

---

## Format

### What to Check

**Output follows specified format:**
- If skill defines a template, output matches it
- Markdown formatting is correct
- Code blocks use proper syntax highlighting
- Tables are properly formatted

**Consistent with examples in skill:**
- Format matches examples in SKILL.md
- Style is consistent
- Terminology is consistent
- Structure follows patterns

**Easy to read and understand:**
- Clear headings and sections
- Good use of whitespace
- Logical organization
- No wall of text

### Examples

**Good (Well-Formatted):**
````markdown
## Deployment Steps

### 1. Build Docker Image

```bash
docker build -t myapp:v1.0 .
```

### 2. Run Container

```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```

### 3. Verify Deployment

```bash
curl http://localhost:3000/health
```
````

**Bad (Poor Formatting):**
```
step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl
```

---

## Triggering

### What to Check

**Skill activates when appropriate:**
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords

**Doesn't activate when inappropriate:**
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected

### Testing Triggering

Test with 8 queries (4 should-trigger, 4 should-not-trigger):

**Example for Docker skill:**

Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"

Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)
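
The results of these queries can be tallied with a short script. The following is a minimal sketch, assuming you have recorded by hand whether the skill activated for each query; `score_triggering` and the `activated` mapping are hypothetical names, and only a subset of the queries is shown.

```python
def score_triggering(should_trigger, should_not_trigger, activated):
    """Tally missed and false triggers.

    `activated` maps each query string to True if the skill activated
    (a hypothetical record filled in by hand after each test run).
    """
    missed = [q for q in should_trigger if not activated[q]]
    false_triggers = [q for q in should_not_trigger if activated[q]]
    total = len(should_trigger) + len(should_not_trigger)
    correct = total - len(missed) - len(false_triggers)
    return {
        "correct": correct,
        "total": total,
        "missed": missed,
        "false_triggers": false_triggers,
    }


should_trigger = [
    "Create a docker container for my Node.js app",
    "Build an image from this Dockerfile",
]
should_not_trigger = [
    "Install Docker on my machine",
    "Write a fibonacci function",
]
# Example observations: one missed trigger and one false trigger.
activated = {
    "Create a docker container for my Node.js app": True,
    "Build an image from this Dockerfile": False,
    "Install Docker on my machine": True,
    "Write a fibonacci function": False,
}
report = score_triggering(should_trigger, should_not_trigger, activated)
print(f"{report['correct']}/{report['total']} queries behaved as expected")
```

Missed triggers usually point to a description that is too narrow; false triggers point to one that is too broad.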

### Examples

**Good (Appropriate Triggering):**
- User: "containerize my app"
- Skill: Docker helper activates ✓

**Bad (Missed Trigger):**
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗

**Bad (False Trigger):**
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗

---

## Efficiency

### What to Check

**No unnecessary steps:**
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined

**Reasonable response length:**
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value

**Not overly verbose:**
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise

### Examples

**Good (Efficient):**
```bash
# Clean up Docker resources
docker system prune -f
```

**Bad (Verbose):**
```bash
# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images
```

---

## Identifying Patterns

### Single Test Failure

**Characteristics:**
- Only one test case fails
- Issue is unique to that scenario
- Other tests pass

**Action:**
- Add targeted instruction
- Include example for that edge case
- Fix the specific issue
- Don't generalize prematurely

**Example:**
```
Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
for files over 10,000 rows
```

### Multiple Test Failures (Same Issue)

**Characteristics:**
- Same problem across 2+ tests
- Pattern suggests root cause
- Symptom appears in different forms

**Action:**
- Fix the root cause
- Add general principle, not specific fix
- Consider extracting to script
- Explain the "why"

**Example:**
```
Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
rather than fixing each case individually
```

### Multiple Test Failures (Different Issues)

**Characteristics:**
- Different problems in each test
- No clear pattern
- Skill may be too broad

**Action:**
- Clarify scope in description
- Add more specific instructions
- Consider splitting into multiple skills
- Focus on core functionality first

**Example:**
```
Issue: Test 1 fails on format, Test 2 fails on logic, Test 3
never triggers the skill
Solution: Clarify what the skill does and doesn't do in the description,
then add validation steps
```
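
The three patterns above can be told apart mechanically once each failing test carries a short issue label. This is a rough sketch under that assumption; `classify_failures` and the labels are hypothetical, and the labels themselves still come from the grader's judgment.

```python
from collections import Counter


def classify_failures(failures):
    """Map failing tests to one of the three patterns described above.

    `failures` maps a test id to a short issue label assigned by the
    grader (the labels are hypothetical, e.g. "date-format").
    """
    if not failures:
        return "no failures"
    if len(failures) == 1:
        return "single test failure: add a targeted instruction"
    label, count = Counter(failures.values()).most_common(1)[0]
    if count >= 2:
        return f"shared issue '{label}': fix the root cause"
    return "different issues: clarify scope or split the skill"


print(classify_failures({1: "date-format", 3: "date-format"}))
print(classify_failures({1: "format", 2: "logic", 3: "never-triggers"}))
```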

---

## Decision Framework

### Fix Specific Case When:

- Unique edge case not covered
- One-time issue unlikely to recur
- Fix is simple and doesn't add complexity
- Doesn't indicate systemic problem

**Example:**
```
Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
(Don't rewrite entire CSV handling)
```

### Generalize Solution When:

- Same issue in 2+ tests
- Pattern suggests broader applicability
- Fix would benefit similar requests
- Explains underlying principle

**Example:**
```
Tests: Both test 1 and 2 have incorrect error handling
Fix: Add general principle about validating inputs before processing
with examples of common validation checks
```

### Extract to Script When:

- Same multi-step process repeated
- Deterministic operation (not creative)
- Would save time on every invocation
- Logic is complex or error-prone

**Example:**
```
Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
```
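
As an illustration, a script like `scripts/convert-dates.py` might look like the sketch below. The accepted input formats are assumptions chosen for the example (slashed dates are read day-first, and month names are assumed to be English under the default locale); adjust them to the data the skill actually sees.

```python
from datetime import datetime

# Input formats this sketch accepts; an assumption, adjust to your data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def to_iso_date(text):
    """Return `text` as an ISO 8601 date string (YYYY-MM-DD).

    Raises ValueError if none of KNOWN_FORMATS matches.
    """
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {text!r}")


if __name__ == "__main__":
    for raw in ("2024-03-05", "05/03/2024", "March 5, 2024"):
        print(raw, "->", to_iso_date(raw))
```

Raising on unknown formats, rather than guessing, keeps the operation deterministic, which is the main reason to extract it to a script.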

---

## Common Issues and Solutions

### Issue: Skill Doesn't Trigger

**Symptoms:**
- User says relevant phrase
- Skill doesn't appear in available skills
- Model handles request without skill

**Solutions:**
1. Add specific trigger phrases to description
2. Use "pushy" language: "Make sure to use this skill whenever..."
3. Include synonyms and variations
4. Test with 8 queries (4 should-trigger, 4 should-not-trigger)

### Issue: Output Format Inconsistent

**Symptoms:**
- Sometimes follows template, sometimes doesn't
- Format varies between similar requests
- Missing sections or elements

**Solutions:**
1. Add explicit format template in SKILL.md
2. Provide multiple examples
3. Explain why format matters
4. Use imperative: "ALWAYS use this template"

### Issue: Edge Cases Not Handled

**Symptoms:**
- Large files cause errors
- Empty inputs crash
- Special characters corrupted
- Missing data causes failures

**Solutions:**
1. Add validation step instructions
2. Include error handling examples
3. Create helper script for complex validation
4. Document known limitations

### Issue: Too Verbose

**Symptoms:**
- Responses are walls of text
- Over-explains obvious steps
- Repeats information
- Takes too long to get to point

**Solutions:**
1. Remove redundant explanations
2. Trust user's expertise
3. Move details to references/
4. Use progressive disclosure

### Issue: Incomplete Responses

**Symptoms:**
- Misses steps in workflow
- Forgets validation
- Doesn't mention prerequisites
- No error handling

**Solutions:**
1. Add comprehensive checklist in SKILL.md
2. Include validation at each step
3. Document prerequisites upfront
4. Add error handling examples

### Issue: Incorrect Commands

**Symptoms:**
- Commands don't work when copied
- Syntax errors
- Wrong flags or options
- Outdated versions

**Solutions:**
1. Test all commands before including
2. Specify version requirements
3. Add validation steps
4. Include error messages to expect

---

## When to Stop Iterating

### Stop When:

1. **User is satisfied**
   - User says "this works for me"
   - User stops requesting changes
   - User starts using skill regularly

2. **Outputs meet expectations**
   - All test cases pass
   - Common scenarios work
   - Edge cases handled reasonably
   - No critical issues remain

3. **No meaningful progress**
   - Last 2-3 iterations didn't improve results
   - Changes are cosmetic only
   - Diminishing returns on effort
   - 90% solution is good enough
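
The "no meaningful progress" criterion can be checked against a history of pass rates. This is a minimal sketch, assuming you log the fraction of passing tests after each iteration; `should_stop`, `window`, and `min_gain` are hypothetical names with illustrative defaults, not thresholds prescribed by this guide.

```python
def should_stop(pass_rates, window=3, min_gain=0.01):
    """Return True when the last `window` iterations brought no real gain.

    `pass_rates` lists the fraction of passing tests per iteration,
    oldest first. `window` and `min_gain` are illustrative defaults.
    """
    if len(pass_rates) <= window:
        return False  # not enough history to judge
    baseline = pass_rates[-window - 1]
    best_recent = max(pass_rates[-window:])
    return best_recent - baseline < min_gain


print(should_stop([0.50, 0.75, 0.90, 0.90, 0.90, 0.90]))  # True
print(should_stop([0.50, 0.60, 0.75, 0.90]))  # False
```

A check like this only covers the third criterion; user satisfaction and remaining critical issues still need a human call.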

### Don't Stop When:

- **User wants to keep iterating** - Follow user's lead
- **New issues emerge** - Fix real problems as they're found

Two reminders cut the other way and are reasons to stop rather than continue:

- **Perfect is the enemy of good** - 90% working is better than endlessly tweaking
- **Edge cases are theoretical** - Don't over-engineer for unlikely scenarios

### The 90% Rule

A skill that works well for 90% of cases is sufficient. Users can handle:
- Rare edge cases manually
- Unique situations with custom guidance
- Complex scenarios with multiple steps

**Focus on:**
- Common use cases (80% of value)
- Clear failure messages (10% of value)
- Good documentation (10% of value)

**Don't focus on:**
- Every possible edge case (diminishing returns)
- Perfect formatting (good enough is fine)
- Handling every possible error (major errors are enough)

---

## Grading Checklist

Use this checklist for each test case:

### Correctness
- [ ] Output matches expected result
- [ ] No factual errors
- [ ] Logic is sound
- [ ] Edge cases handled appropriately

### Completeness
- [ ] All requested tasks completed
- [ ] No steps skipped
- [ ] Appropriate level of detail
- [ ] Relevant context included

### Format
- [ ] Output follows specified format
- [ ] Consistent with examples
- [ ] Easy to read and understand

### Triggering
- [ ] Skill activated when appropriate
- [ ] Did not activate when inappropriate

### Efficiency
- [ ] No unnecessary steps
- [ ] Reasonable response length
- [ ] Not overly verbose

### Overall
- [ ] Would use this skill again
- [ ] Would recommend to others
- [ ] Saves time vs manual approach
- [ ] Output quality meets needs

---

## Recording Results

After grading, record the following for each test (for `result` and `priority`, pick one of the `|`-separated values):

```json
{
  "test_id": 1,
  "result": "pass|fail|partial",
  "issues": [
    "Description of issue 1",
    "Description of issue 2"
  ],
  "suggested_fix": "Brief description of improvement",
  "extract_script": false,
  "priority": "high|medium|low"
}
```

Use the `grade-output.sh` script to generate this structure interactively.
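
If records are written by hand, a quick check against the structure above catches typos before results are compared. This is a sketch only: `validate_record` is a hypothetical helper and does not describe what the grading script itself does.

```python
import json

ALLOWED_RESULTS = {"pass", "fail", "partial"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def validate_record(record):
    """Return a list of problems found in one grading record."""
    problems = []
    if not isinstance(record.get("test_id"), int):
        problems.append("test_id must be an integer")
    if record.get("result") not in ALLOWED_RESULTS:
        problems.append("result must be pass, fail, or partial")
    if not isinstance(record.get("issues"), list):
        problems.append("issues must be a list")
    if not isinstance(record.get("suggested_fix"), str):
        problems.append("suggested_fix must be a string")
    if not isinstance(record.get("extract_script"), bool):
        problems.append("extract_script must be true or false")
    if record.get("priority") not in ALLOWED_PRIORITIES:
        problems.append("priority must be high, medium, or low")
    return problems


raw = """{"test_id": 1, "result": "partial", "issues": ["Dates not ISO 8601"],
"suggested_fix": "State the date format", "extract_script": false,
"priority": "medium"}"""
print(validate_record(json.loads(raw)))  # prints []
```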

---

## Next Steps After Grading

1. **Review all results** - Look for patterns
2. **Prioritize fixes** - High priority first
3. **Update SKILL.md** - Based on issues found
4. **Create scripts** - Extract repeated work
5. **Re-run tests** - Verify improvements
6. **Repeat** - Until satisfied or good enough
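
Steps 1 and 2 can be sketched as a small sort over the recorded results. The example assumes records shaped like the JSON in Recording Results; `fix_queue` is a hypothetical helper name.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def fix_queue(records):
    """Return non-passing records, highest priority first."""
    open_items = [r for r in records if r["result"] != "pass"]
    return sorted(open_items, key=lambda r: PRIORITY_ORDER[r["priority"]])


records = [
    {"test_id": 1, "result": "partial", "priority": "low"},
    {"test_id": 2, "result": "pass", "priority": "low"},
    {"test_id": 3, "result": "fail", "priority": "high"},
]
for record in fix_queue(records):
    print(record["test_id"], record["priority"])
```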