Add skills

2026-03-22 23:21:49 +02:00
parent 4cbbbae1ef
commit c09d9151ca
104 changed files with 23879 additions and 0 deletions

@@ -0,0 +1,607 @@
# Grading Guide
Comprehensive guide for evaluating skill outputs and identifying improvements.
## Table of Contents
1. [The Evaluation Framework](#the-evaluation-framework)
2. [Correctness](#correctness)
3. [Completeness](#completeness)
4. [Format](#format)
5. [Triggering](#triggering)
6. [Efficiency](#efficiency)
7. [Identifying Patterns](#identifying-patterns)
8. [Decision Framework](#decision-framework)
9. [Common Issues and Solutions](#common-issues-and-solutions)
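10. [When to Stop Iterating](#when-to-stop-iterating)
11. [Grading Checklist](#grading-checklist)
12. [Recording Results](#recording-results)
13. [Next Steps After Grading](#next-steps-after-grading)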
---
## The Evaluation Framework
Evaluate skill outputs across five dimensions:
1. **Correctness** - Is the output factually accurate and logically sound?
2. **Completeness** - Are all requirements met?
3. **Format** - Does it follow expected structure?
4. **Triggering** - Does it activate appropriately?
5. **Efficiency** - Is it concise yet complete?
Each dimension has specific criteria to check. Use the grade-output.sh script for systematic evaluation.
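As a rough illustration of grading against the five dimensions, a per-test verdict could be tracked as below. This is a minimal sketch and not the grade-output.sh script; the `Grade` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

# The five dimensions named in this guide.
DIMENSIONS = ("correctness", "completeness", "format", "triggering", "efficiency")


@dataclass
class Grade:
    """Pass/fail verdict per dimension for one test case (hypothetical structure)."""
    test_id: int
    verdicts: dict = field(default_factory=dict)  # dimension name -> bool

    def mark(self, dimension: str, passed: bool) -> None:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        self.verdicts[dimension] = passed

    def failed_dimensions(self) -> list:
        # Dimensions that were never marked count as failures,
        # so nothing is skipped silently.
        return [d for d in DIMENSIONS if not self.verdicts.get(d, False)]


grade = Grade(test_id=1)
for name in DIMENSIONS:
    grade.mark(name, True)
grade.mark("format", False)
print(grade.failed_dimensions())
```

Treating unmarked dimensions as failures is a design choice that forces every dimension to be checked explicitly.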
---
## Correctness
### What to Check
**Output matches expected result:**
- For file transforms: Output file contains correct data
- For code generation: Code works as described
- For workflows: All steps lead to desired outcome
- For documentation: Information is accurate
**No factual errors:**
- Commands use correct syntax
- File paths are accurate
- Versions are correct
- Technical details are right
**Logic is sound:**
- Reasoning makes sense
- Steps follow logically
- No contradictions
- Edge cases considered
**Edge cases handled appropriately:**
- Empty inputs don't crash
- Large inputs work
- Special characters preserved
- Errors handled gracefully
### Examples
**Good (Correct):**
```bash
# Command actually works
docker run -p 3000:3000 --name myapp myimage:v1.0
```
**Bad (Incorrect):**
```bash
# Wrong flag syntax
docker run --port 3000:3000 myimage # Should be -p or --publish
```
**Good (Handles Edge Cases):**
```python
import os
if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")
```
**Bad (No Edge Case Handling):**
```python
process_file(filename) # Will crash if file doesn't exist
```
---
## Completeness
### What to Check
**All requested tasks completed:**
- Every requirement from prompt addressed
- No steps forgotten
- All files generated
- All transformations applied
**No steps skipped:**
- Validation steps included
- Cleanup steps not forgotten
- Confirmation steps present
- Rollback guidance provided
**Appropriate level of detail:**
- Not too brief (missing context)
- Not too verbose (information overload)
- Explains "why" not just "what"
- Provides examples where helpful
**Relevant context included:**
- Prerequisites mentioned
- Dependencies noted
- Error scenarios covered
- Alternative approaches offered
### Examples
**Good (Complete):**
```
To release version 1.3.0:
1. Update version in package.json: "version": "1.3.0"
2. Add entry to CHANGELOG.md with today's date
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
5. Push: git push origin main && git push origin v1.3.0
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag
Prerequisites:
- All tests passing
- CHANGELOG.md updated
- You have push access to repository
```
**Bad (Incomplete):**
```
To release:
1. Update version
2. Create tag
3. Push
```
---
## Format
### What to Check
**Output follows specified format:**
- If skill defines a template, output matches it
- Markdown formatting is correct
- Code blocks use proper syntax highlighting
- Tables are properly formatted
**Consistent with examples in skill:**
- Format matches examples in SKILL.md
- Style is consistent
- Terminology is consistent
- Structure follows patterns
**Easy to read and understand:**
- Clear headings and sections
- Good use of whitespace
- Logical organization
- No wall of text
### Examples
**Good (Well-Formatted):**
````markdown
## Deployment Steps
### 1. Build Docker Image
```bash
docker build -t myapp:v1.0 .
```
### 2. Run Container
```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```
### 3. Verify Deployment
```bash
curl http://localhost:3000/health
```
````
**Bad (Poor Formatting):**
```
step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl
```
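One part of the format check, whether the output contains the sections a template expects, can be automated roughly as follows. This is a sketch under simple assumptions: `missing_sections` is a hypothetical helper, headings are ATX-style (`#` prefixes), and lines starting with `#` inside fenced code blocks (such as shell comments) would be miscounted as headings.

```python
import re

# Matches Markdown ATX headings such as "## Deployment Steps".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(output: str, required: list) -> list:
    """Return the required headings (case-insensitive) that the output lacks."""
    found = {m.group(1).lower() for m in HEADING.finditer(output)}
    return [h for h in required if h.lower() not in found]


template = [
    "Deployment Steps",
    "1. Build Docker Image",
    "2. Run Container",
    "3. Verify Deployment",
]
output = "## Deployment Steps\n### 1. Build Docker Image\n### 2. Run Container\n"
print(missing_sections(output, template))
```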
---
## Triggering
### What to Check
**Skill activates when appropriate:**
- Relevant triggers work
- Similar phrasings activate it
- Context clues are recognized
- Doesn't require exact keywords
**Doesn't activate when inappropriate:**
- Adjacent domains don't trigger it
- Ambiguous queries don't wrongly trigger
- Different tools aren't confused
- Scope is respected
### Testing Triggering
Test with 8 queries (4 should-trigger, 4 should-not-trigger):
**Example for Docker skill:**
Should trigger:
1. "Create a docker container for my Node.js app"
2. "Build an image from this Dockerfile"
3. "How do I compose up my services?"
4. "I need to deploy this using containers"
Should not trigger:
1. "Install Docker on my machine" (installation vs usage)
2. "What is containerization?" (education vs hands-on)
3. "Show me Kubernetes commands" (different tool)
4. "Write a fibonacci function" (completely unrelated)
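The eight queries above can be run through a small harness that reports missed triggers and false triggers. In this sketch, `would_trigger` is a hypothetical stand-in for however activation is actually decided; it is a deliberately naive keyword match, which shows why keyword matching alone produces false triggers.

```python
def would_trigger(query: str) -> bool:
    """Hypothetical activation check: naive keyword matching (an assumption)."""
    keywords = ("docker", "container", "image", "compose")
    text = query.lower()
    return any(k in text for k in keywords)


SHOULD_TRIGGER = [
    "Create a docker container for my Node.js app",
    "Build an image from this Dockerfile",
    "How do I compose up my services?",
    "I need to deploy this using containers",
]
SHOULD_NOT_TRIGGER = [
    "Install Docker on my machine",
    "What is containerization?",
    "Show me Kubernetes commands",
    "Write a fibonacci function",
]


def run_trigger_tests(check, should_trigger, should_not_trigger):
    """Return (missed, false_triggers) for the given activation check."""
    missed = [q for q in should_trigger if not check(q)]
    false_triggers = [q for q in should_not_trigger if check(q)]
    return missed, false_triggers


missed, false_triggers = run_trigger_tests(
    would_trigger, SHOULD_TRIGGER, SHOULD_NOT_TRIGGER
)
print("missed:", missed)
print("false triggers:", false_triggers)
```

With this naive check nothing is missed, but the installation and education queries are wrongly matched, which is the false-trigger case described in this section.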
### Examples
**Good (Appropriate Triggering):**
- User: "containerize my app"
- Skill: Docker helper activates ✓
**Bad (Missed Trigger):**
- User: "I need to run my app in containers"
- Skill: Doesn't activate (too narrow description) ✗
**Bad (False Trigger):**
- User: "Install Docker on Ubuntu"
- Skill: Activates but only handles usage, not installation ✗
---
## Efficiency
### What to Check
**No unnecessary steps:**
- Doesn't do redundant work
- Doesn't suggest unnecessary commands
- Shortcuts used where safe
- Process is streamlined
**Reasonable response length:**
- Not excessively verbose
- Doesn't repeat information
- No filler content
- Every sentence adds value
**Not overly verbose:**
- Commands not over-explained
- Doesn't explain obvious steps
- Concise but complete
- Respects user's expertise
### Examples
**Good (Efficient):**
```bash
# Clean up Docker resources
docker system prune -f
```
**Bad (Verbose):**
```bash
# Clean up Docker resources
# First, we need to run the docker command
# Then we use the system subcommand
# Then we use the prune subcommand
# The -f flag means force
docker system prune -f
# This will remove stopped containers, unused networks, and dangling images
```
---
## Identifying Patterns
### Single Test Failure
**Characteristics:**
- Only one test case fails
- Issue is unique to that scenario
- Other tests pass
**Action:**
- Add targeted instruction
- Include example for that edge case
- Fix the specific issue
- Don't generalize prematurely
**Example:**
```
Issue: Large file processing fails
Solution: Add instruction about memory-efficient processing
for files over 10,000 rows
```
### Multiple Test Failures (Same Issue)
**Characteristics:**
- Same problem across 2+ tests
- Pattern suggests root cause
- Symptom appears in different forms
**Action:**
- Fix the root cause
- Add general principle, not specific fix
- Consider extracting to script
- Explain the "why"
**Example:**
```
Issue: Tests 1 and 3 both have incorrect date formatting
Solution: Add general instruction about ISO 8601 date format
rather than fixing each case individually
```
### Multiple Test Failures (Different Issues)
**Characteristics:**
- Different problems in each test
- No clear pattern
- Skill may be too broad
**Action:**
- Clarify scope in description
- Add more specific instructions
- Consider splitting into multiple skills
- Focus on core functionality first
**Example:**
```
Issue: Test 1 fails on format, Test 2 fails on logic, Test 3
never triggers skill
Solution: Clarify what the skill does/doesn't do in description
Add validation steps
```
---
## Decision Framework
### Fix Specific Case When:
- Unique edge case not covered
- One-time issue unlikely to recur
- Fix is simple and doesn't add complexity
- Doesn't indicate systemic problem
**Example:**
```
Test: CSV with Unicode characters fails
Fix: Add note about UTF-8 encoding for international characters
(Don't rewrite entire CSV handling)
```
### Generalize Solution When:
- Same issue in 2+ tests
- Pattern suggests broader applicability
- Fix would benefit similar requests
- Explains underlying principle
**Example:**
```
Tests: Tests 1 and 2 both have incorrect error handling
Fix: Add general principle about validating inputs before processing
with examples of common validation checks
```
### Extract to Script When:
- Same multi-step process repeated
- Deterministic operation (not creative)
- Would save time on every invocation
- Logic is complex or error-prone
**Example:**
```
Pattern: All three tests require converting dates to ISO format
Action: Create scripts/convert-dates.py
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
```
---
## Common Issues and Solutions
### Issue: Skill Doesn't Trigger
**Symptoms:**
- User says relevant phrase
- Skill doesn't appear in available skills
- Model handles request without skill
**Solutions:**
1. Add specific trigger phrases to description
2. Use "pushy" language: "Make sure to use this skill whenever..."
3. Include synonyms and variations
4. Test with 8 trigger queries
### Issue: Output Format Inconsistent
**Symptoms:**
- Sometimes follows template, sometimes doesn't
- Format varies between similar requests
- Missing sections or elements
**Solutions:**
1. Add explicit format template in SKILL.md
2. Provide multiple examples
3. Explain why format matters
4. Use imperative: "ALWAYS use this template"
### Issue: Edge Cases Not Handled
**Symptoms:**
- Large files cause errors
- Empty inputs crash
- Special characters corrupted
- Missing data causes failures
**Solutions:**
1. Add validation step instructions
2. Include error handling examples
3. Create helper script for complex validation
4. Document known limitations
### Issue: Too Verbose
**Symptoms:**
- Responses are walls of text
- Over-explains obvious steps
- Repeats information
- Takes too long to get to point
**Solutions:**
1. Remove redundant explanations
2. Trust user's expertise
3. Move details to references/
4. Use progressive disclosure
### Issue: Incomplete Responses
**Symptoms:**
- Misses steps in workflow
- Forgets validation
- Doesn't mention prerequisites
- No error handling
**Solutions:**
1. Add comprehensive checklist in SKILL.md
2. Include validation at each step
3. Document prerequisites upfront
4. Add error handling examples
### Issue: Incorrect Commands
**Symptoms:**
- Commands don't work when copied
- Syntax errors
- Wrong flags or options
- Outdated versions
**Solutions:**
1. Test all commands before including
2. Specify version requirements
3. Add validation steps
4. Include error messages to expect
---
## When to Stop Iterating
### Stop When:
1. **User is satisfied**
- User says "this works for me"
- User stops requesting changes
- User starts using skill regularly
2. **Outputs meet expectations**
- All test cases pass
- Common scenarios work
- Edge cases handled reasonably
- No critical issues remain
3. **No meaningful progress**
- Last 2-3 iterations didn't improve results
- Changes are cosmetic only
- Diminishing returns on effort
- 90% solution is good enough
### Don't Stop When:
- **User wants to keep iterating** - Follow user's lead
- **New issues emerge** - Fix real problems as they're found
### Don't Keep Iterating Just Because:
- **The skill isn't perfect** - Perfect is the enemy of good; 90% working is better than endlessly tweaking
- **Edge cases are theoretical** - Don't over-engineer for unlikely scenarios
### The 90% Rule
A skill that works well for 90% of cases is sufficient. Users can handle:
- Rare edge cases manually
- Unique situations with custom guidance
- Complex scenarios with multiple steps
**Focus on:**
- Common use cases (80% of value)
- Clear failure messages (10% of value)
- Good documentation (10% of value)
**Don't focus on:**
- Every possible edge case (diminishing returns)
- Perfect formatting (good enough is fine)
- Handling every possible error (major errors are enough)
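The measurable stopping criteria (the 90% rule and a plateau over the last few iterations) can be expressed as a small check. This is a sketch only: `should_stop`, its thresholds, and the use of test pass rate as the metric are assumptions, and the user's wishes take precedence over whatever it returns.

```python
def should_stop(pass_rates: list, target: float = 0.9,
                window: int = 3, min_gain: float = 0.01) -> bool:
    """pass_rates: fraction of passing tests after each iteration, oldest first."""
    if not pass_rates:
        return False
    if pass_rates[-1] >= target:
        return True  # the 90% rule
    if len(pass_rates) > window:
        # No meaningful progress across the last `window` iterations.
        return pass_rates[-1] - pass_rates[-1 - window] < min_gain
    return False


print(should_stop([0.5, 0.7, 0.92]))
print(should_stop([0.6, 0.7, 0.7, 0.7, 0.7]))
print(should_stop([0.5, 0.6, 0.7, 0.8]))
```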
---
## Grading Checklist
Use this checklist for each test case:
### Correctness
- [ ] Output matches expected result
- [ ] No factual errors
- [ ] Logic is sound
- [ ] Edge cases handled appropriately
### Completeness
- [ ] All requested tasks completed
- [ ] No steps skipped
- [ ] Appropriate level of detail
- [ ] Relevant context included
### Format
- [ ] Output follows specified format
- [ ] Consistent with examples
- [ ] Easy to read and understand
### Triggering
- [ ] Skill activated when appropriate
- [ ] Did not activate when inappropriate
### Efficiency
- [ ] No unnecessary steps
- [ ] Reasonable response length
- [ ] Not overly verbose
### Overall
- [ ] Would use this skill again
- [ ] Would recommend to others
- [ ] Saves time vs manual approach
- [ ] Output quality meets needs
---
## Recording Results
After grading, record:
```json
{
"test_id": 1,
"result": "pass|fail|partial",
"issues": [
"Description of issue 1",
"Description of issue 2"
],
"suggested_fix": "Brief description of improvement",
"extract_script": false,
"priority": "high|medium|low"
}
```
Use the grade-output.sh script to generate this structure interactively.
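If the record is built programmatically, it helps to validate the enumerated fields. The sketch below is independent of grade-output.sh, whose interface is not shown here; `make_record` is a hypothetical helper that mirrors the JSON structure above.

```python
import json

ALLOWED_RESULTS = {"pass", "fail", "partial"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def make_record(test_id, result, issues, suggested_fix,
                extract_script=False, priority="medium"):
    """Build one grading record, rejecting values outside the allowed sets."""
    if result not in ALLOWED_RESULTS:
        raise ValueError(f"result must be one of {sorted(ALLOWED_RESULTS)}")
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    return {
        "test_id": test_id,
        "result": result,
        "issues": list(issues),
        "suggested_fix": suggested_fix,
        "extract_script": extract_script,
        "priority": priority,
    }


record = make_record(
    1, "partial", ["Dates not in ISO 8601"],
    "Add a general date-format instruction", priority="high",
)
print(json.dumps(record, indent=2))
```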
---
## Next Steps After Grading
1. **Review all results** - Look for patterns
2. **Prioritize fixes** - High priority first
3. **Update SKILL.md** - Based on issues found
4. **Create scripts** - Extract repeated work
5. **Re-run tests** - Verify improvements
6. **Repeat** - Until satisfied or good enough
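Steps 1, 2, and 4 can be sketched over a list of recorded results. The field names follow the record structure in Recording Results; `plan_fixes` and the sample data are hypothetical.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def plan_fixes(records: list) -> list:
    """Return non-passing records, highest priority first (stable within a level)."""
    failing = [r for r in records if r["result"] != "pass"]
    return sorted(failing, key=lambda r: PRIORITY_ORDER[r["priority"]])


records = [
    {"test_id": 1, "result": "fail", "priority": "low", "extract_script": False},
    {"test_id": 2, "result": "pass", "priority": "high", "extract_script": False},
    {"test_id": 3, "result": "partial", "priority": "high", "extract_script": True},
    {"test_id": 4, "result": "fail", "priority": "medium", "extract_script": False},
]

plan = plan_fixes(records)
script_candidates = [r["test_id"] for r in plan if r["extract_script"]]
print([r["test_id"] for r in plan])
print(script_candidates)
```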