giteadmin/.dotfiles


Jonathan Agmon c09d9151ca Add skills

2026-03-22 23:21:49 +02:00


Grading Guide

Comprehensive guide for evaluating skill outputs and identifying improvements.

The Evaluation Framework
Correctness
Completeness
Format
Triggering
Efficiency
Identifying Patterns
Decision Framework
Common Issues and Solutions
When to Stop Iterating
Grading Checklist
Recording Results
Next Steps After Grading

The Evaluation Framework

Evaluate skill outputs across five dimensions:

Correctness - Is the output accurate and correct?
Completeness - Are all requirements met?
Format - Does it follow expected structure?
Triggering - Does it activate appropriately?
Efficiency - Is it concise yet complete?

Each dimension has specific criteria to check. Use the grade-output.sh script for systematic evaluation.

Correctness

What to Check

Output matches expected result:

For file transforms: Output file contains correct data
For code generation: Code works as described
For workflows: All steps lead to desired outcome
For documentation: Information is accurate

No factual errors:

Commands use correct syntax
File paths are accurate
Versions are correct
Technical details are right

Logic is sound:

Reasoning makes sense
Steps follow logically
No contradictions
Edge cases considered

Edge cases handled appropriately:

Empty inputs don't crash
Large inputs work
Special characters preserved
Errors handled gracefully

Examples

Good (Correct):

# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0

Bad (Incorrect):

# Wrong flag syntax
docker run --port 3000:3000 myimage  # Should be -p or --publish

Good (Handles Edge Cases):

import os

# isfile() also rejects directories, which exists() would let through
if os.path.isfile(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")

Bad (No Edge Case Handling):

process_file(filename)  # Will crash if file doesn't exist

Completeness

What to Check

All requested tasks completed:

Every requirement from prompt addressed
No steps forgotten
All files generated
All transformations applied

No steps skipped:

Validation steps included
Cleanup steps not forgotten
Confirmation steps present
Rollback guidance provided

Appropriate level of detail:

Not too brief (missing context)
Not too verbose (information overload)
Explains "why" not just "what"
Provides examples where helpful

Relevant context included:

Prerequisites mentioned
Dependencies noted
Error scenarios covered
Alternative approaches offered

Examples

Good (Complete):

To release version 1.3.0:

1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag

Prerequisites:
- All tests passing
- Release notes for 1.3.0 drafted (for the CHANGELOG.md entry in step 2)
- You have push access to repository

Bad (Incomplete):

To release:
1. Update version
2. Create tag
3. Push

Format

What to Check

Output follows specified format:

If skill defines a template, output matches it
Markdown formatting is correct
Code blocks use proper syntax highlighting
Tables are properly formatted

Consistent with examples in skill:

Format matches examples in SKILL.md
Style is consistent
Terminology is consistent
Structure follows patterns

Easy to read and understand:

Clear headings and sections
Good use of whitespace
Logical organization
No wall of text

Examples

Good (Well-Formatted):

## Deployment Steps

### 1. Build Docker Image

```bash
docker build -t myapp:v1.0 .
```

### 2. Run Container

```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```

### 3. Verify Deployment

```bash
curl http://localhost:3000/health
```

**Bad (Poor Formatting):**

step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl


---

## Triggering

### What to Check

**Skill activates when appropriate:**
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords

**Doesn't activate when inappropriate:**
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected

### Testing Triggering

Test with 8 queries (4 should-trigger, 4 should-not-trigger):

**Example for Docker skill:**

Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"

Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)

### Examples

**Good (Appropriate Triggering):**
- User: "containerize my app"
- Skill: Docker helper activates ✓

**Bad (Missed Trigger):**
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗

**Bad (False Trigger):**
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗

---

## Efficiency

### What to Check

**No unnecessary steps:**
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined

**Reasonable response length:**
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value

**Not overly verbose:**
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise

### Examples

**Good (Efficient):**
```bash
# Clean up Docker resources
docker system prune -f
```

Bad (Verbose):

# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images

Identifying Patterns

Single Test Failure

Characteristics:

Only one test case fails
Issue is unique to that scenario
Other tests pass

Action:

Add targeted instruction
Include example for that edge case
Fix the specific issue
Don't generalize prematurely

Example:

Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
for files over 10,000 rows
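
One way to make "memory-efficient processing" concrete is to stream rows instead of loading the whole file. This is a minimal sketch, assuming CSV input; `iter_rows` and `count_rows` are hypothetical names and not part of any existing skill.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_rows(handle: TextIO) -> Iterator[dict]:
    """Yield one row at a time so memory use stays flat for large files."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row


def count_rows(handle: TextIO) -> int:
    """Consume the stream without ever holding all rows in a list."""
    return sum(1 for _ in iter_rows(handle))


# Stand-in for a large file: 10,001 data rows built in memory for the demo only.
sample = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10_001)))
ROW_COUNT = count_rows(sample)
```

In a real skill the handle would come from `open(path, newline="", encoding="utf-8")` rather than an in-memory string.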

Multiple Test Failures (Same Issue)

Characteristics:

Same problem across 2+ tests
Pattern suggests root cause
Symptom appears in different forms

Action:

Fix the root cause
Add general principle, not specific fix
Consider extracting to script
Explain the "why"

Example:

Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
rather than fixing each case individually

Multiple Test Failures (Different Issues)

Characteristics:

Different problems in each test
No clear pattern
Skill may be too broad

Action:

Clarify scope in description
Add more specific instructions
Consider splitting into multiple skills
Focus on core functionality first

Example:

Issue: Test 1 fails on format, Test 2 fails on logic, Test 3 
never triggers skill
Solution: Clarify what the skill does/doesn't do in description
Add validation steps
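
Once each failed test has a short issue label, the three patterns above can be told apart mechanically. The sketch below assumes the grader assigns the labels by hand; `classify_failures` and its return values are hypothetical names chosen for illustration.

```python
from collections import Counter


def classify_failures(failures: dict[int, str]) -> str:
    """Map {test_id: issue_label} to one of the three failure patterns.

    The issue labels come from the grader; this sketch only counts them.
    """
    if not failures:
        return "no_failures"
    if len(failures) == 1:
        return "single_failure"
    counts = Counter(failures.values())
    # Any label shared by 2+ tests points at a common root cause.
    if max(counts.values()) >= 2:
        return "same_issue"
    return "different_issues"
```

For example, `classify_failures({1: "date format", 3: "date format"})` reports a shared issue, which suggests fixing the root cause rather than each case.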

Decision Framework

Fix Specific Case When:

Unique edge case not covered
One-time issue unlikely to recur
Fix is simple and doesn't add complexity
Doesn't indicate systemic problem

Example:

Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
(Don't rewrite entire CSV handling)
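
A note like this can be backed by a very small example in the skill. The following is a sketch only, assuming the CSV arrives as raw bytes; `read_csv_utf8` is a hypothetical helper name.

```python
import csv
import io


def read_csv_utf8(raw: bytes) -> list[list[str]]:
    """Decode CSV bytes as UTF-8 and parse them into rows.

    "utf-8-sig" behaves like UTF-8 but also drops a leading byte order mark,
    which some spreadsheet exports add.
    """
    text = raw.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text, newline="")))
```

The targeted fix stays small: only the decoding step changes, and the rest of the CSV handling is left alone.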

Generalize Solution When:

Same issue in 2+ tests
Pattern suggests broader applicability
Fix would benefit similar requests
Explains underlying principle

Example:

Tests: Both test 1 and 2 have incorrect error handling
Fix: Add general principle about validating inputs before processing
with examples of common validation checks

Extract to Script When:

Same multi-step process repeated
Deterministic operation (not creative)
Would save time on every invocation
Logic is complex or error-prone

Example:

Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
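
What such a script might contain is sketched below. The list of accepted input formats is an assumption made for illustration; a real `scripts/convert-dates.py` would list whatever formats the test data actually uses.

```python
from datetime import datetime

# Input formats to try, in order. This list is an assumption; adjust it to the data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")
```

Because the conversion is deterministic, moving it into a script removes one source of variation between runs.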

Common Issues and Solutions

Issue: Skill Doesn't Trigger

Symptoms:

User says relevant phrase
Skill doesn't appear in available skills
Model handles request without skill

Solutions:

Add specific trigger phrases to description
Use "pushy" language: "Make sure to use this skill whenever..."
Include synonyms and variations
Test with 8 trigger queries
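
The eight-query test can be scored with a few lines of code. This sketch reuses the Docker queries from the Testing Triggering section; `score_triggering` is a hypothetical helper, and `keyword_stub` is only a placeholder for the real activation decision, which this guide does not specify.

```python
from typing import Callable

# The eight queries from the Docker example: (query, should_trigger).
TRIGGER_CASES = [
    ("Create a docker container for my Node.js app", True),
    ("Build an image from this Dockerfile", True),
    ("How do I compose up my services?", True),
    ("I need to deploy this using containers", True),
    ("Install Docker on my machine", False),
    ("What is containerization?", False),
    ("Show me Kubernetes commands", False),
    ("Write a fibonacci function", False),
]


def score_triggering(cases: list[tuple[str, bool]],
                     did_trigger: Callable[[str], bool]) -> dict:
    """Collect missed triggers and false triggers for a set of labelled queries."""
    missed = [q for q, expected in cases if expected and not did_trigger(q)]
    false = [q for q, expected in cases if not expected and did_trigger(q)]
    return {"total": len(cases), "missed": missed, "false": false}


def keyword_stub(query: str) -> bool:
    """Placeholder for the real activation decision; NOT how skills are matched."""
    return "docker" in query.lower()


REPORT = score_triggering(TRIGGER_CASES, keyword_stub)
```

With the keyword placeholder, two should-trigger queries are missed and one should-not-trigger query fires, which illustrates why a description that depends on exact keywords is too narrow.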

Issue: Output Format Inconsistent

Symptoms:

Sometimes follows template, sometimes doesn't
Format varies between similar requests
Missing sections or elements

Solutions:

Add explicit format template in SKILL.md
Provide multiple examples
Explain why format matters
Use imperative: "ALWAYS use this template"
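
Template adherence can also be checked automatically during grading. The sketch below assumes the template is expressed as a list of required Markdown headings; `missing_sections` is a hypothetical name.

```python
import re


def missing_sections(output: str, required: list[str]) -> list[str]:
    """Return required headings that do not appear as Markdown headings in output."""
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", output, flags=re.MULTILINE)
    }
    return [name for name in required if name.lower() not in found]
```

An empty result means every required section is present; it says nothing about whether the content of each section is correct.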

Issue: Edge Cases Not Handled

Symptoms:

Large files cause errors
Empty inputs crash
Special characters corrupted
Missing data causes failures

Solutions:

Add validation step instructions
Include error handling examples
Create helper script for complex validation
Document known limitations
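
A validation step might look like the following sketch. The checks mirror the symptoms listed above; the 50 MiB default limit and the name `validate_input` are assumptions made for illustration.

```python
import os


def validate_input(path: str, max_bytes: int = 50 * 1024 * 1024) -> list[str]:
    """Return a list of problems; an empty list means the file is safe to process."""
    if not os.path.isfile(path):
        return [f"{path}: not found or not a regular file"]
    problems = []
    size = os.path.getsize(path)
    if size == 0:
        problems.append(f"{path}: file is empty")
    elif size > max_bytes:
        problems.append(f"{path}: {size} bytes exceeds limit of {max_bytes}")
    try:
        with open(path, encoding="utf-8") as handle:
            for _ in handle:  # Decode line by line to catch corrupt characters.
                pass
    except UnicodeDecodeError as exc:
        problems.append(f"{path}: not valid UTF-8 ({exc.reason})")
    return problems
```

Returning a list of problems instead of raising on the first one lets the skill report everything wrong with an input in a single message.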

Issue: Too Verbose

Symptoms:

Responses are walls of text
Over-explains obvious steps
Repeats information
Takes too long to get to point

Solutions:

Remove redundant explanations
Trust user's expertise
Move details to references/
Use progressive disclosure

Issue: Incomplete Responses

Symptoms:

Misses steps in workflow
Forgets validation
Doesn't mention prerequisites
No error handling

Solutions:

Add comprehensive checklist in SKILL.md
Include validation at each step
Document prerequisites upfront
Add error handling examples

Issue: Incorrect Commands

Symptoms:

Commands don't work when copied
Syntax errors
Wrong flags or options
Outdated versions

Solutions:

Test all commands before including
Specify version requirements
Add validation steps
Include error messages to expect
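
A cheap first check before running anything is to confirm that each command's executable exists on the grading machine. This is a partial sketch: it does not validate flags or versions, and `missing_executables` is a hypothetical name.

```python
import shlex
import shutil


def missing_executables(commands: list[str]) -> list[str]:
    """Return executables named in the commands that are not on PATH.

    This only checks that the program exists; it does not validate flags.
    """
    missing = []
    for command in commands:
        # comments=True drops trailing "# ..." notes copied along with a command.
        parts = shlex.split(command, comments=True)
        if parts and shutil.which(parts[0]) is None and parts[0] not in missing:
            missing.append(parts[0])
    return missing
```

Flag errors such as the `--port` example in the Correctness section still need a real run or the tool's own help output to catch.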

When to Stop Iterating

Stop When:

User is satisfied:

User says "this works for me"
User stops requesting changes
User starts using skill regularly

Outputs meet expectations:

All test cases pass
Common scenarios work
Edge cases handled reasonably
No critical issues remain

No meaningful progress:

Last 2-3 iterations didn't improve results
Changes are cosmetic only
Diminishing returns on effort
90% solution is good enough

Don't Stop When:

User wants to keep iterating - Follow user's lead
New issues emerge - Fix real problems as they're found

Don't Keep Iterating Just Because:

The skill isn't perfect - Perfect is the enemy of good; 90% working is better than endlessly tweaking
Edge cases are theoretical - Don't over-engineer for unlikely scenarios

The 90% Rule

A skill that works well for 90% of cases is sufficient. Users can handle:

Rare edge cases manually
Unique situations with custom guidance
Complex scenarios with multiple steps

Focus on:

Common use cases (80% of value)
Clear failure messages (10% of value)
Good documentation (10% of value)

Don't focus on:

Every possible edge case (diminishing returns)
Perfect formatting (good enough is fine)
Handling every possible error (major errors are enough)
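
The stopping criteria can be approximated from the pass rate of each iteration. The sketch below is one possible reading, not a fixed rule: it assumes "90% of cases" maps to 90% of test cases passing, and the parameter names and the 0.01 minimum gain are hypothetical.

```python
def should_stop(pass_rates: list[float], target: float = 0.9, window: int = 3,
                min_gain: float = 0.01) -> bool:
    """Decide whether another iteration is worthwhile.

    pass_rates holds the fraction of passing tests after each iteration, oldest first.
    """
    if not pass_rates:
        return False
    if pass_rates[-1] >= target:
        return True  # The 90% rule: good enough.
    if len(pass_rates) > window:
        # Plateau: the last `window` iterations did not beat the earlier best.
        earlier_best = max(pass_rates[:-window])
        recent_best = max(pass_rates[-window:])
        return recent_best - earlier_best < min_gain
    return False
```

The user's wishes still take priority over this check: if the user wants to keep iterating or a new real issue appears, continue regardless of the numbers.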

Grading Checklist

Use this checklist for each test case:

Correctness

Output matches expected result
No factual errors
Logic is sound
Edge cases handled appropriately

Completeness

All requested tasks completed
No steps skipped
Appropriate level of detail
Relevant context included

Format

Output follows specified format
Consistent with examples
Easy to read and understand

Triggering

Skill activated when appropriate
Did not activate when inappropriate

Efficiency

No unnecessary steps
Reasonable response length
Not overly verbose

Overall

Would use this skill again
Would recommend to others
Saves time vs manual approach
Output quality meets needs

Recording Results

After grading, record:

{
  "test_id": 1,
  "result": "pass|fail|partial",
  "issues": [
    "Description of issue 1",
    "Description of issue 2"
  ],
  "suggested_fix": "Brief description of improvement",
  "extract_script": false,
  "priority": "high|medium|low"
}

Use the grade-output.sh script to generate this structure interactively.
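
Records written by hand are easy to get wrong, for example by leaving the `pass|fail|partial` placeholder in place. The following sketch checks one record against the structure above; the field names come from that structure, while `check_record` and its messages are hypothetical and are not part of `grade-output.sh`.

```python
import json

ALLOWED_RESULTS = ("pass", "fail", "partial")
ALLOWED_PRIORITIES = ("high", "medium", "low")


def check_record(record: dict) -> list[str]:
    """Return problems found in one grading record; empty means well-formed."""
    problems = []
    if not isinstance(record.get("test_id"), int):
        problems.append("test_id must be an integer")
    if record.get("result") not in ALLOWED_RESULTS:
        problems.append("result must be pass, fail, or partial")
    issues = record.get("issues")
    if not isinstance(issues, list) or not all(isinstance(i, str) for i in issues):
        problems.append("issues must be a list of strings")
    if not isinstance(record.get("suggested_fix"), str):
        problems.append("suggested_fix must be a string")
    if not isinstance(record.get("extract_script"), bool):
        problems.append("extract_script must be true or false")
    if record.get("priority") not in ALLOWED_PRIORITIES:
        problems.append("priority must be high, medium, or low")
    return problems


# A filled-in record; the values are made up for illustration.
EXAMPLE = json.loads("""
{
  "test_id": 1,
  "result": "partial",
  "issues": ["Dates not in ISO 8601 format"],
  "suggested_fix": "Add general instruction about ISO 8601 date format",
  "extract_script": false,
  "priority": "high"
}
""")
```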

Next Steps After Grading

Review all results - Look for patterns
Prioritize fixes - High priority first
Update SKILL.md - Based on issues found
Create scripts - Extract repeated work
Re-run tests - Verify improvements
Repeat - Until satisfied or good enough
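
The first two steps can be sketched as a small sort over the recorded results. This assumes records shaped like the JSON structure in Recording Results; `fix_queue` is a hypothetical name.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def fix_queue(records: list[dict]) -> list[dict]:
    """Return non-passing records, highest priority first.

    sorted() is stable, so records keep their original order within a priority.
    """
    pending = [r for r in records if r["result"] != "pass"]
    return sorted(pending, key=lambda r: PRIORITY_ORDER[r["priority"]])
```

Reading the queue top to bottom also makes repeated issues easier to spot, which feeds back into the pattern analysis.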
