Files
.dotfiles/.config/opencode/skills/.skill-builder.disabled/references/grading-guide.md
2026-03-22 23:21:49 +02:00

14 KiB

Grading Guide

Comprehensive guide for evaluating skill outputs and identifying improvements.

Table of Contents

  1. The Evaluation Framework
  2. Correctness
  3. Completeness
  4. Format
  5. Triggering
  6. Efficiency
  7. Identifying Patterns
  8. Decision Framework
  9. Common Issues and Solutions
  10. When to Stop Iterating

The Evaluation Framework

Evaluate skill outputs across five dimensions:

  1. Correctness - Is the output accurate and correct?
  2. Completeness - Are all requirements met?
  3. Format - Does it follow expected structure?
  4. Triggering - Does it activate appropriately?
  5. Efficiency - Is it concise yet complete?

Each dimension has specific criteria to check. Use the grade-output.sh script for systematic evaluation.


Correctness

What to Check

Output matches expected result:

  • For file transforms: Output file contains correct data
  • For code generation: Code works as described
  • For workflows: All steps lead to desired outcome
  • For documentation: Information is accurate

No factual errors:

  • Commands use correct syntax
  • File paths are accurate
  • Versions are correct
  • Technical details are right

Logic is sound:

  • Reasoning makes sense
  • Steps follow logically
  • No contradictions
  • Edge cases considered

Edge cases handled appropriately:

  • Empty inputs don't crash
  • Large inputs work
  • Special characters preserved
  • Errors handled gracefully

Examples

Good (Correct):

# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0

Bad (Incorrect):

# Wrong flag syntax
docker run --port 3000:3000 myimage  # Should be -p or --publish

Good (Handles Edge Cases):

import os
if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")

Bad (No Edge Case Handling):

process_file(filename)  # Will crash if file doesn't exist

Completeness

What to Check

All requested tasks completed:

  • Every requirement from prompt addressed
  • No steps forgotten
  • All files generated
  • All transformations applied

No steps skipped:

  • Validation steps included
  • Cleanup steps not forgotten
  • Confirmation steps present
  • Rollback guidance provided

Appropriate level of detail:

  • Not too brief (missing context)
  • Not too verbose (information overload)
  • Explains "why" not just "what"
  • Provides examples where helpful

Relevant context included:

  • Prerequisites mentioned
  • Dependencies noted
  • Error scenarios covered
  • Alternative approaches offered

Examples

Good (Complete):

To release version 1.3.0:

1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag

Prerequisites:
- All tests passing
- CHANGELOG.md updated
- You have push access to repository

Bad (Incomplete):

To release:
1. Update version
2. Create tag
3. Push

Format

What to Check

Output follows specified format:

  • If skill defines a template, output matches it
  • Markdown formatting is correct
  • Code blocks use proper syntax highlighting
  • Tables are properly formatted

Consistent with examples in skill:

  • Format matches examples in SKILL.md
  • Style is consistent
  • Terminology is consistent
  • Structure follows patterns

Easy to read and understand:

  • Clear headings and sections
  • Good use of whitespace
  • Logical organization
  • No wall of text

Examples

Good (Well-Formatted):

## Deployment Steps

### 1. Build Docker Image

```bash
docker build -t myapp:v1.0 .

2. Run Container

docker run -d -p 3000:3000 --name myapp myapp:v1.0

3. Verify Deployment

curl http://localhost:3000/health

**Bad (Poor Formatting):**

step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl


---

## Triggering

### What to Check

**Skill activates when appropriate:**
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords

**Doesn't activate when inappropriate:**
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected

### Testing Triggering

Test with 8 queries (4 should-trigger, 4 should-not-trigger):

**Example for Docker skill:**

Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"

Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)

### Examples

**Good (Appropriate Triggering):**
- User: "containerize my app"
- Skill: Docker helper activates ✓

**Bad (Missed Trigger):**
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗

**Bad (False Trigger):**
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗

---

## Efficiency

### What to Check

**No unnecessary steps:**
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined

**Reasonable response length:**
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value

**Not overly verbose:**
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise

### Examples

**Good (Efficient):**
```bash
# Clean up Docker resources
docker system prune -f

Bad (Verbose):

# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images

Identifying Patterns

Single Test Failure

Characteristics:

  • Only one test case fails
  • Issue is unique to that scenario
  • Other tests pass

Action:

  • Add targeted instruction
  • Include example for that edge case
  • Fix the specific issue
  • Don't generalize prematurely

Example:

Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
for files over 10,000 rows

Multiple Test Failures (Same Issue)

Characteristics:

  • Same problem across 2+ tests
  • Pattern suggests root cause
  • Symptom appears in different forms

Action:

  • Fix the root cause
  • Add general principle, not specific fix
  • Consider extracting to script
  • Explain the "why"

Example:

Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
rather than fixing each case individually

Multiple Test Failures (Different Issues)

Characteristics:

  • Different problems in each test
  • No clear pattern
  • Skill may be too broad

Action:

  • Clarify scope in description
  • Add more specific instructions
  • Consider splitting into multiple skills
  • Focus on core functionality first

Example:

Issue: Test 1 fails on format, Test 2 fails on logic, Test 3 
never triggers skill
Solution: Clarify what the skill does/doesn't do in description
Add validation steps

Decision Framework

Fix Specific Case When:

  • Unique edge case not covered
  • One-time issue unlikely to recur
  • Fix is simple and doesn't add complexity
  • Doesn't indicate systemic problem

Example:

Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
(Don't rewrite entire CSV handling)

Generalize Solution When:

  • Same issue in 2+ tests
  • Pattern suggests broader applicability
  • Fix would benefit similar requests
  • Explains underlying principle

Example:

Tests: Both test 1 and 2 have incorrect error handling
Fix: Add general principle about validating inputs before processing
with examples of common validation checks

Extract to Script When:

  • Same multi-step process repeated
  • Deterministic operation (not creative)
  • Would save time on every invocation
  • Logic is complex or error-prone

Example:

Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"

Common Issues and Solutions

Issue: Skill Doesn't Trigger

Symptoms:

  • User says relevant phrase
  • Skill doesn't appear in available skills
  • Model handles request without skill

Solutions:

  1. Add specific trigger phrases to description
  2. Use "pushy" language: "Make sure to use this skill whenever..."
  3. Include synonyms and variations
  4. Test with 8 trigger queries

Issue: Output Format Inconsistent

Symptoms:

  • Sometimes follows template, sometimes doesn't
  • Format varies between similar requests
  • Missing sections or elements

Solutions:

  1. Add explicit format template in SKILL.md
  2. Provide multiple examples
  3. Explain why format matters
  4. Use imperative: "ALWAYS use this template"

Issue: Edge Cases Not Handled

Symptoms:

  • Large files cause errors
  • Empty inputs crash
  • Special characters corrupted
  • Missing data causes failures

Solutions:

  1. Add validation step instructions
  2. Include error handling examples
  3. Create helper script for complex validation
  4. Document known limitations

Issue: Too Verbose

Symptoms:

  • Responses are walls of text
  • Over-explains obvious steps
  • Repeats information
  • Takes too long to get to point

Solutions:

  1. Remove redundant explanations
  2. Trust user's expertise
  3. Move details to references/
  4. Use progressive disclosure

Issue: Incomplete Responses

Symptoms:

  • Misses steps in workflow
  • Forgets validation
  • Doesn't mention prerequisites
  • No error handling

Solutions:

  1. Add comprehensive checklist in SKILL.md
  2. Include validation at each step
  3. Document prerequisites upfront
  4. Add error handling examples

Issue: Incorrect Commands

Symptoms:

  • Commands don't work when copied
  • Syntax errors
  • Wrong flags or options
  • Outdated versions

Solutions:

  1. Test all commands before including
  2. Specify version requirements
  3. Add validation steps
  4. Include error messages to expect

When to Stop Iterating

Stop When:

  1. User is satisfied

    • User says "this works for me"
    • User stops requesting changes
    • User starts using skill regularly
  2. Outputs meet expectations

    • All test cases pass
    • Common scenarios work
    • Edge cases handled reasonably
    • No critical issues remain
  3. No meaningful progress

    • Last 2-3 iterations didn't improve results
    • Changes are cosmetic only
    • Diminishing returns on effort
    • 90% solution is good enough

Don't Stop When:

  • Perfect is the enemy of good - 90% working is better than endlessly tweaking
  • Edge cases are theoretical - Don't over-engineer for unlikely scenarios
  • User wants to keep iterating - Follow user's lead
  • New issues emerge - Fix real problems as they're found

The 90% Rule

A skill that works well for 90% of cases is sufficient. Users can handle:

  • Rare edge cases manually
  • Unique situations with custom guidance
  • Complex scenarios with multiple steps

Focus on:

  • Common use cases (80% of value)
  • Clear failure messages (10% of value)
  • Good documentation (10% of value)

Don't focus on:

  • Every possible edge case (diminishing returns)
  • Perfect formatting (good enough is fine)
  • Handling every possible error (major errors are enough)

Grading Checklist

Use this checklist for each test case:

Correctness

  • Output matches expected result
  • No factual errors
  • Logic is sound
  • Edge cases handled appropriately

Completeness

  • All requested tasks completed
  • No steps skipped
  • Appropriate level of detail
  • Relevant context included

Format

  • Output follows specified format
  • Consistent with examples
  • Easy to read and understand

Triggering

  • Skill activated when appropriate
  • Did not activate when inappropriate

Efficiency

  • No unnecessary steps
  • Reasonable response length
  • Not overly verbose

Overall

  • Would use this skill again
  • Would recommend to others
  • Saves time vs manual approach
  • Output quality meets needs

Recording Results

After grading, record:

{
  "test_id": 1,
  "result": "pass|fail|partial",
  "issues": [
    "Description of issue 1",
    "Description of issue 2"
  ],
  "suggested_fix": "Brief description of improvement",
  "extract_script": false,
  "priority": "high|medium|low"
}

Use the grade-output.sh script to generate this structure interactively.


Next Steps After Grading

  1. Review all results - Look for patterns
  2. Prioritize fixes - High priority first
  3. Update SKILL.md - Based on issues found
  4. Create scripts - Extract repeated work
  5. Re-run tests - Verify improvements
  6. Repeat - Until satisfied or good enough