Add skills
@@ -0,0 +1,607 @@
|
||||
# Grading Guide
|
||||
|
||||
Comprehensive guide for evaluating skill outputs and identifying improvements.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [The Evaluation Framework](#the-evaluation-framework)
|
||||
2. [Correctness](#correctness)
|
||||
3. [Completeness](#completeness)
|
||||
4. [Format](#format)
|
||||
5. [Triggering](#triggering)
|
||||
6. [Efficiency](#efficiency)
|
||||
7. [Identifying Patterns](#identifying-patterns)
|
||||
8. [Decision Framework](#decision-framework)
|
||||
9. [Common Issues and Solutions](#common-issues-and-solutions)
|
||||
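10. [When to Stop Iterating](#when-to-stop-iterating)
11. [Grading Checklist](#grading-checklist)
12. [Recording Results](#recording-results)
13. [Next Steps After Grading](#next-steps-after-grading)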
|
||||
|
||||
---
|
||||
|
||||
## The Evaluation Framework
|
||||
|
||||
Evaluate skill outputs across five dimensions:
|
||||
|
||||
1. **Correctness** - Is the output accurate and correct?
|
||||
2. **Completeness** - Are all requirements met?
|
||||
3. **Format** - Does it follow expected structure?
|
||||
4. **Triggering** - Does it activate appropriately?
|
||||
5. **Efficiency** - Is it concise yet complete?
|
||||
|
||||
Each dimension has specific criteria to check. Use the grade-output.sh script for systematic evaluation.
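
As a rough illustration of how the five dimensions could roll up into one result per test case, here is a minimal Python sketch. The `Grade` class and the roll-up rule (all dimensions pass gives "pass", some gives "partial", none gives "fail") are assumptions made for this example; they are not taken from grade-output.sh.

```python
from dataclasses import dataclass, field

# The five dimensions named in this guide.
DIMENSIONS = ("correctness", "completeness", "format", "triggering", "efficiency")


@dataclass
class Grade:
    """Pass/fail per dimension for one test case (hypothetical structure)."""

    test_id: int
    passed: dict = field(default_factory=dict)  # dimension name -> bool

    def overall(self) -> str:
        # A dimension that was never graded counts as not passed.
        checked = [self.passed.get(d, False) for d in DIMENSIONS]
        if all(checked):
            return "pass"
        if any(checked):
            return "partial"
        return "fail"


grade = Grade(test_id=1, passed={"correctness": True, "format": True})
print(grade.overall())
```

Treating an ungraded dimension as failed is a deliberately strict choice; a real grader might prefer to report it as "not checked" instead.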
|
||||
|
||||
---
|
||||
|
||||
## Correctness
|
||||
|
||||
### What to Check
|
||||
|
||||
**Output matches expected result:**
|
||||
- For file transforms: Output file contains correct data
|
||||
- For code generation: Code works as described
|
||||
- For workflows: All steps lead to desired outcome
|
||||
- For documentation: Information is accurate
|
||||
|
||||
**No factual errors:**
|
||||
- Commands use correct syntax
|
||||
- File paths are accurate
|
||||
- Versions are correct
|
||||
- Technical details are right
|
||||
|
||||
**Logic is sound:**
|
||||
- Reasoning makes sense
|
||||
- Steps follow logically
|
||||
- No contradictions
|
||||
- Edge cases considered
|
||||
|
||||
**Edge cases handled appropriately:**
|
||||
- Empty inputs don't crash
|
||||
- Large inputs work
|
||||
- Special characters preserved
|
||||
- Errors handled gracefully
|
||||
|
||||
### Examples
|
||||
|
||||
**Good (Correct):**
|
||||
```bash
|
||||
# Command actually works
|
||||
docker run -p 3000:3000 --name myapp myimage:v1.0
|
||||
```
|
||||
|
||||
**Bad (Incorrect):**
|
||||
```bash
|
||||
# Wrong flag syntax
|
||||
docker run --port 3000:3000 myimage # Should be -p or --publish
|
||||
```
|
||||
|
||||
**Good (Handles Edge Cases):**
|
||||
```python
import os

if os.path.exists(filename):
    process_file(filename)
else:
    print(f"Error: {filename} not found")
```
|
||||
|
||||
**Bad (No Edge Case Handling):**
|
||||
```python
|
||||
process_file(filename) # Will crash if file doesn't exist
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Completeness
|
||||
|
||||
### What to Check
|
||||
|
||||
**All requested tasks completed:**
|
||||
- Every requirement from prompt addressed
|
||||
- No steps forgotten
|
||||
- All files generated
|
||||
- All transformations applied
|
||||
|
||||
**No steps skipped:**
|
||||
- Validation steps included
|
||||
- Cleanup steps not forgotten
|
||||
- Confirmation steps present
|
||||
- Rollback guidance provided
|
||||
|
||||
**Appropriate level of detail:**
|
||||
- Not too brief (missing context)
|
||||
- Not too verbose (information overload)
|
||||
- Explains "why" not just "what"
|
||||
- Provides examples where helpful
|
||||
|
||||
**Relevant context included:**
|
||||
- Prerequisites mentioned
|
||||
- Dependencies noted
|
||||
- Error scenarios covered
|
||||
- Alternative approaches offered
|
||||
|
||||
### Examples
|
||||
|
||||
**Good (Complete):**
|
||||
```
|
||||
To release version 1.3.0:
|
||||
|
||||
1. Update version in package.json: "version": "1.3.0"
|
||||
2. Add entry to CHANGELOG.md with today's date
|
||||
3. Commit: git commit -am "chore(release): bump version to 1.3.0"
|
||||
4. Create tag: git tag -a v1.3.0 -m "Release 1.3.0"
|
||||
5. Push: git push origin main && git push origin v1.3.0
|
||||
6. Create GitHub release: gh release create v1.3.0 --notes-from-tag
|
||||
|
||||
Prerequisites:
|
||||
- All tests passing
|
||||
- CHANGELOG.md updated
|
||||
- You have push access to repository
|
||||
```
|
||||
|
||||
**Bad (Incomplete):**
|
||||
```
|
||||
To release:
|
||||
1. Update version
|
||||
2. Create tag
|
||||
3. Push
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Format
|
||||
|
||||
### What to Check
|
||||
|
||||
**Output follows specified format:**
|
||||
- If skill defines a template, output matches it
|
||||
- Markdown formatting is correct
|
||||
- Code blocks use proper syntax highlighting
|
||||
- Tables are properly formatted
|
||||
|
||||
**Consistent with examples in skill:**
|
||||
- Format matches examples in SKILL.md
|
||||
- Style is consistent
|
||||
- Terminology is consistent
|
||||
- Structure follows patterns
|
||||
|
||||
**Easy to read and understand:**
|
||||
- Clear headings and sections
|
||||
- Good use of whitespace
|
||||
- Logical organization
|
||||
- No wall of text
|
||||
|
||||
### Examples
|
||||
|
||||
**Good (Well-Formatted):**
|
||||
````markdown
## Deployment Steps

### 1. Build Docker Image

```bash
docker build -t myapp:v1.0 .
```

### 2. Run Container

```bash
docker run -d -p 3000:3000 --name myapp myapp:v1.0
```

### 3. Verify Deployment

```bash
curl http://localhost:3000/health
```
````
|
||||
|
||||
**Bad (Poor Formatting):**
|
||||
```
|
||||
step 1: build the image with docker build -t myapp:v1.0 . then step 2 run the container with docker run -d -p 3000:3000 --name myapp myapp:v1.0 and step 3 verify with curl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Triggering
|
||||
|
||||
### What to Check
|
||||
|
||||
**Skill activates when appropriate:**
|
||||
- Relevant triggers work
|
||||
- Similar phrasings activate it
|
||||
- Context clues are recognized
|
||||
- Doesn't require exact keywords
|
||||
|
||||
**Doesn't activate when inappropriate:**
|
||||
- Adjacent domains don't trigger it
|
||||
- Ambiguous queries don't wrongly trigger
|
||||
- Different tools aren't confused
|
||||
- Scope is respected
|
||||
|
||||
### Testing Triggering
|
||||
|
||||
Test with 8 queries (4 should-trigger, 4 should-not-trigger):
|
||||
|
||||
**Example for Docker skill:**
|
||||
|
||||
Should trigger:
|
||||
1. "Create a docker container for my Node.js app"
|
||||
2. "Build an image from this Dockerfile"
|
||||
3. "How do I compose up my services?"
|
||||
4. "I need to deploy this using containers"
|
||||
|
||||
Should not trigger:
|
||||
1. "Install Docker on my machine" (installation vs usage)
|
||||
2. "What is containerization?" (education vs hands-on)
|
||||
3. "Show me Kubernetes commands" (different tool)
|
||||
4. "Write a fibonacci function" (completely unrelated)
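
The eight-query check above can be sketched as a small Python harness. The `activates` callable is a placeholder for however you observe whether the skill fired; `keyword_stub` is a naive stand-in written only for this example and does not reflect how skill triggering actually works.

```python
from typing import Callable

# Queries copied from the Docker example above.
SHOULD_TRIGGER = [
    "Create a docker container for my Node.js app",
    "Build an image from this Dockerfile",
    "How do I compose up my services?",
    "I need to deploy this using containers",
]
SHOULD_NOT_TRIGGER = [
    "Install Docker on my machine",
    "What is containerization?",
    "Show me Kubernetes commands",
    "Write a fibonacci function",
]


def trigger_report(activates: Callable[[str], bool]) -> dict:
    """Compare observed activation against the expected outcome for each query."""
    missed = [q for q in SHOULD_TRIGGER if not activates(q)]
    false_triggers = [q for q in SHOULD_NOT_TRIGGER if activates(q)]
    total = len(SHOULD_TRIGGER) + len(SHOULD_NOT_TRIGGER)
    correct = total - len(missed) - len(false_triggers)
    return {
        "missed": missed,
        "false_triggers": false_triggers,
        "accuracy": correct / total,
    }


def keyword_stub(query: str) -> bool:
    """Hypothetical stand-in: fires on any Docker-related keyword."""
    q = query.lower()
    return any(word in q for word in ("docker", "container", "image", "compose"))


print(trigger_report(keyword_stub))
```

With this stub, all four should-trigger queries activate, but so do the installation and "What is containerization?" queries, which is the kind of false trigger the second list is meant to catch.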
|
||||
|
||||
### Examples
|
||||
|
||||
**Good (Appropriate Triggering):**
|
||||
- User: "containerize my app"
|
||||
- Skill: Docker helper activates ✓
|
||||
|
||||
**Bad (Missed Trigger):**
|
||||
- User: "I need to run my app in containers"
|
||||
- Skill: Doesn't activate (too narrow description) ✗
|
||||
|
||||
**Bad (False Trigger):**
|
||||
- User: "Install Docker on Ubuntu"
|
||||
- Skill: Activates but only handles usage, not installation ✗
|
||||
|
||||
---
|
||||
|
||||
## Efficiency
|
||||
|
||||
### What to Check
|
||||
|
||||
**No unnecessary steps:**
|
||||
- Doesn't do redundant work
|
||||
- Doesn't suggest unnecessary commands
|
||||
- Shortcuts used where safe
|
||||
- Process is streamlined
|
||||
|
||||
**Reasonable response length:**
|
||||
- Not excessively verbose
|
||||
- Doesn't repeat information
|
||||
- No filler content
|
||||
- Every sentence adds value
|
||||
|
||||
**Not overly verbose:**
|
||||
- Commands not over-explained
|
||||
- Doesn't explain obvious steps
|
||||
- Concise but complete
|
||||
- Respects user's expertise
|
||||
|
||||
### Examples
|
||||
|
||||
**Good (Efficient):**
|
||||
```bash
|
||||
# Clean up Docker resources
|
||||
docker system prune -f
|
||||
```
|
||||
|
||||
**Bad (Verbose):**
|
||||
```bash
|
||||
# Clean up Docker resources
|
||||
# First, we need to run the docker command
|
||||
# Then we use the system subcommand
|
||||
# Then we use the prune subcommand
|
||||
# The -f flag means force
|
||||
docker system prune -f
|
||||
# This will remove stopped containers, unused networks, and dangling images
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Identifying Patterns
|
||||
|
||||
### Single Test Failure
|
||||
|
||||
**Characteristics:**
|
||||
- Only one test case fails
|
||||
- Issue is unique to that scenario
|
||||
- Other tests pass
|
||||
|
||||
**Action:**
|
||||
- Add targeted instruction
|
||||
- Include example for that edge case
|
||||
- Fix the specific issue
|
||||
- Don't generalize prematurely
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Issue: Large file processing fails
|
||||
Solution: Add instruction about memory-efficient processing
|
||||
for files over 10,000 rows
|
||||
```
|
||||
|
||||
### Multiple Test Failures (Same Issue)
|
||||
|
||||
**Characteristics:**
|
||||
- Same problem across 2+ tests
|
||||
- Pattern suggests root cause
|
||||
- Symptom appears in different forms
|
||||
|
||||
**Action:**
|
||||
- Fix the root cause
|
||||
- Add general principle, not specific fix
|
||||
- Consider extracting to script
|
||||
- Explain the "why"
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Issue: Tests 1 and 3 both have incorrect date formatting
|
||||
Solution: Add general instruction about ISO 8601 date format
|
||||
rather than fixing each case individually
|
||||
```
|
||||
|
||||
### Multiple Test Failures (Different Issues)
|
||||
|
||||
**Characteristics:**
|
||||
- Different problems in each test
|
||||
- No clear pattern
|
||||
- Skill may be too broad
|
||||
|
||||
**Action:**
|
||||
- Clarify scope in description
|
||||
- Add more specific instructions
|
||||
- Consider splitting into multiple skills
|
||||
- Focus on core functionality first
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Issue: Test 1 fails on format, Test 2 fails on logic, Test 3
|
||||
never triggers skill
|
||||
Solution: Clarify what the skill does/doesn't do in description
|
||||
Add validation steps
|
||||
```
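
The three patterns above can be told apart mechanically once each failing test is tagged with issue labels. The following Python sketch assumes you label issues by hand (for example `"date-format"`); the function name and label strings are hypothetical.

```python
from collections import Counter


def classify_failures(failures: dict) -> str:
    """Classify a test run.

    `failures` maps a test id to the list of issue labels found in that test;
    an empty list means the test passed.
    """
    failing = {test: issues for test, issues in failures.items() if issues}
    if not failing:
        return "no failures"
    if len(failing) == 1:
        return "single test failure"
    # Count how many distinct tests each issue label appears in.
    counts = Counter(label for issues in failing.values() for label in set(issues))
    if any(n >= 2 for n in counts.values()):
        return "same issue across tests"
    return "different issues"


print(classify_failures({1: ["date-format"], 2: [], 3: ["date-format"]}))
print(classify_failures({1: ["format"], 2: ["logic"], 3: ["no-trigger"]}))
```

The sketch only matches identical labels, so it will miss a shared root cause that shows up under different names; that judgment still belongs to the person grading.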
|
||||
|
||||
---
|
||||
|
||||
## Decision Framework
|
||||
|
||||
### Fix Specific Case When:
|
||||
|
||||
- Unique edge case not covered
|
||||
- One-time issue unlikely to recur
|
||||
- Fix is simple and doesn't add complexity
|
||||
- Doesn't indicate systemic problem
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Test: CSV with Unicode characters fails
|
||||
Fix: Add note about UTF-8 encoding for international characters
|
||||
(Don't rewrite entire CSV handling)
|
||||
```
|
||||
|
||||
### Generalize Solution When:
|
||||
|
||||
- Same issue in 2+ tests
|
||||
- Pattern suggests broader applicability
|
||||
- Fix would benefit similar requests
|
||||
- Explains underlying principle
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Tests: Both test 1 and 2 have incorrect error handling
|
||||
Fix: Add general principle about validating inputs before processing
|
||||
with examples of common validation checks
|
||||
```
|
||||
|
||||
### Extract to Script When:
|
||||
|
||||
- Same multi-step process repeated
|
||||
- Deterministic operation (not creative)
|
||||
- Would save time on every invocation
|
||||
- Logic is complex or error-prone
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Pattern: All three tests require converting dates to ISO format
|
||||
Action: Create scripts/convert-dates.py
|
||||
Include in SKILL.md: "Use scripts/convert-dates.py for date formatting"
|
||||
```
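
For a sense of what such an extracted script might contain, here is a minimal Python sketch of a date normalizer. The contents of `scripts/convert-dates.py` are not given in this guide, so the function name and the list of accepted input formats are assumptions.

```python
from datetime import datetime

# Input formats to try, in order. This list is an assumption; adjust it to the data.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")


def to_iso_date(text: str) -> str:
    """Return the date in ISO 8601 form (YYYY-MM-DD) or raise ValueError."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Not this format; try the next one.
    raise ValueError(f"Unrecognized date format: {text!r}")


print(to_iso_date("03/15/2024"))
```

Because `%m/%d/%Y` is tried before any day-first format, an ambiguous input such as `04/05/2024` is read as April 5. A deterministic rule like this is the reason to move the logic into a script rather than restating it in prose.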
|
||||
|
||||
---
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Issue: Skill Doesn't Trigger
|
||||
|
||||
**Symptoms:**
|
||||
- User says relevant phrase
|
||||
- Skill doesn't appear in available skills
|
||||
- Model handles request without skill
|
||||
|
||||
**Solutions:**
|
||||
1. Add specific trigger phrases to description
|
||||
2. Use "pushy" language: "Make sure to use this skill whenever..."
|
||||
3. Include synonyms and variations
|
||||
4. Test with 8 trigger queries
|
||||
|
||||
### Issue: Output Format Inconsistent
|
||||
|
||||
**Symptoms:**
|
||||
- Sometimes follows template, sometimes doesn't
|
||||
- Format varies between similar requests
|
||||
- Missing sections or elements
|
||||
|
||||
**Solutions:**
|
||||
1. Add explicit format template in SKILL.md
|
||||
2. Provide multiple examples
|
||||
3. Explain why format matters
|
||||
4. Use imperative: "ALWAYS use this template"
|
||||
|
||||
### Issue: Edge Cases Not Handled
|
||||
|
||||
**Symptoms:**
|
||||
- Large files cause errors
|
||||
- Empty inputs crash
|
||||
- Special characters corrupted
|
||||
- Missing data causes failures
|
||||
|
||||
**Solutions:**
|
||||
1. Add validation step instructions
|
||||
2. Include error handling examples
|
||||
3. Create helper script for complex validation
|
||||
4. Document known limitations
|
||||
|
||||
### Issue: Too Verbose
|
||||
|
||||
**Symptoms:**
|
||||
- Responses are walls of text
|
||||
- Over-explains obvious steps
|
||||
- Repeats information
|
||||
- Takes too long to get to point
|
||||
|
||||
**Solutions:**
|
||||
1. Remove redundant explanations
|
||||
2. Trust user's expertise
|
||||
3. Move details to references/
|
||||
4. Use progressive disclosure
|
||||
|
||||
### Issue: Incomplete Responses
|
||||
|
||||
**Symptoms:**
|
||||
- Misses steps in workflow
|
||||
- Forgets validation
|
||||
- Doesn't mention prerequisites
|
||||
- No error handling
|
||||
|
||||
**Solutions:**
|
||||
1. Add comprehensive checklist in SKILL.md
|
||||
2. Include validation at each step
|
||||
3. Document prerequisites upfront
|
||||
4. Add error handling examples
|
||||
|
||||
### Issue: Incorrect Commands
|
||||
|
||||
**Symptoms:**
|
||||
- Commands don't work when copied
|
||||
- Syntax errors
|
||||
- Wrong flags or options
|
||||
- Outdated versions
|
||||
|
||||
**Solutions:**
|
||||
1. Test all commands before including
|
||||
2. Specify version requirements
|
||||
3. Add validation steps
|
||||
4. Include error messages to expect
|
||||
|
||||
---
|
||||
|
||||
## When to Stop Iterating
|
||||
|
||||
### Stop When:
|
||||
|
||||
1. **User is satisfied**
|
||||
- User says "this works for me"
|
||||
- User stops requesting changes
|
||||
- User starts using skill regularly
|
||||
|
||||
2. **Outputs meet expectations**
|
||||
- All test cases pass
|
||||
- Common scenarios work
|
||||
- Edge cases handled reasonably
|
||||
- No critical issues remain
|
||||
|
||||
3. **No meaningful progress**
|
||||
- Last 2-3 iterations didn't improve results
|
||||
- Changes are cosmetic only
|
||||
- Diminishing returns on effort
|
||||
- 90% solution is good enough
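
The stop conditions above can be expressed as a simple check over the pass rate of each iteration. This Python sketch is only one possible reading: the `window` of 3 iterations and the `target` of 1.0 are assumed defaults, and the function name is hypothetical.

```python
def should_stop(pass_rates, user_satisfied=False, window=3, target=1.0):
    """Decide whether to stop iterating on a skill.

    `pass_rates` holds the fraction of test cases passed after each iteration,
    oldest first.
    """
    if user_satisfied:
        return True  # Condition 1: the user is satisfied.
    if not pass_rates:
        return False  # Nothing has been measured yet.
    if pass_rates[-1] >= target:
        return True  # Condition 2: outputs meet expectations.
    if len(pass_rates) > window:
        # Condition 3: the last `window` iterations did not beat the result before them.
        baseline = pass_rates[-(window + 1)]
        return max(pass_rates[-window:]) <= baseline
    return False


print(should_stop([0.6, 0.8, 0.8, 0.8, 0.8]))
print(should_stop([0.5, 0.6, 0.7, 0.8]))
```

The sketch does not model a user who wants to keep iterating; in that case the user's preference overrides whatever this function returns.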
|
||||
|
||||
### Don't Stop When:

- **User wants to keep iterating** - Follow user's lead
- **New issues emerge** - Fix real problems as they're found

Remaining imperfection is not, by itself, a reason to keep going:

- **Perfect is the enemy of good** - 90% working is better than endlessly tweaking
- **Edge cases are theoretical** - Don't over-engineer for unlikely scenarios
|
||||
|
||||
### The 90% Rule
|
||||
|
||||
A skill that works well for 90% of cases is sufficient. Users can handle the remaining cases themselves:
|
||||
- Rare edge cases manually
|
||||
- Unique situations with custom guidance
|
||||
- Complex scenarios with multiple steps
|
||||
|
||||
**Focus on:**
|
||||
- Common use cases (80% of value)
|
||||
- Clear failure messages (10% of value)
|
||||
- Good documentation (10% of value)
|
||||
|
||||
**Don't focus on:**
|
||||
- Every possible edge case (diminishing returns)
|
||||
- Perfect formatting (good enough is fine)
|
||||
- Handling every possible error (major errors are enough)
|
||||
|
||||
---
|
||||
|
||||
## Grading Checklist
|
||||
|
||||
Use this checklist for each test case:
|
||||
|
||||
### Correctness
|
||||
- [ ] Output matches expected result
|
||||
- [ ] No factual errors
|
||||
- [ ] Logic is sound
|
||||
- [ ] Edge cases handled appropriately
|
||||
|
||||
### Completeness
|
||||
- [ ] All requested tasks completed
|
||||
- [ ] No steps skipped
|
||||
- [ ] Appropriate level of detail
|
||||
- [ ] Relevant context included
|
||||
|
||||
### Format
|
||||
- [ ] Output follows specified format
|
||||
- [ ] Consistent with examples
|
||||
- [ ] Easy to read and understand
|
||||
|
||||
### Triggering
|
||||
- [ ] Skill activated when appropriate
|
||||
- [ ] Did not activate when inappropriate
|
||||
|
||||
### Efficiency
|
||||
- [ ] No unnecessary steps
|
||||
- [ ] Reasonable response length
|
||||
- [ ] Not overly verbose
|
||||
|
||||
### Overall
|
||||
- [ ] Would use this skill again
|
||||
- [ ] Would recommend to others
|
||||
- [ ] Saves time vs manual approach
|
||||
- [ ] Output quality meets needs
|
||||
|
||||
---
|
||||
|
||||
## Recording Results
|
||||
|
||||
After grading, record:
|
||||
|
||||
```json
{
  "test_id": 1,
  "result": "pass|fail|partial",
  "issues": [
    "Description of issue 1",
    "Description of issue 2"
  ],
  "suggested_fix": "Brief description of improvement",
  "extract_script": false,
  "priority": "high|medium|low"
}
```
|
||||
|
||||
Use the grade-output.sh script to generate this structure interactively.
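
If records are written by hand rather than through the script, a small check can catch malformed entries early. This Python sketch validates the fields shown in the JSON above; it assumes that `"pass|fail|partial"` and `"high|medium|low"` mean "exactly one of these values", and the function name is hypothetical.

```python
ALLOWED_RESULTS = {"pass", "fail", "partial"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one grading record (empty if valid)."""
    problems = []
    test_id = record.get("test_id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(test_id, int) or isinstance(test_id, bool):
        problems.append("test_id must be an integer")
    if record.get("result") not in ALLOWED_RESULTS:
        problems.append("result must be one of pass, fail, partial")
    if not isinstance(record.get("issues"), list):
        problems.append("issues must be a list")
    if not isinstance(record.get("suggested_fix"), str):
        problems.append("suggested_fix must be a string")
    if not isinstance(record.get("extract_script"), bool):
        problems.append("extract_script must be true or false")
    if record.get("priority") not in ALLOWED_PRIORITIES:
        problems.append("priority must be one of high, medium, low")
    return problems


record = {
    "test_id": 1,
    "result": "partial",
    "issues": ["Dates not in ISO 8601 format"],
    "suggested_fix": "Add a general instruction about date format",
    "extract_script": True,
    "priority": "high",
}
print(validate_record(record))
```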
|
||||
|
||||
---
|
||||
|
||||
## Next Steps After Grading
|
||||
|
||||
1. **Review all results** - Look for patterns
|
||||
2. **Prioritize fixes** - High priority first
|
||||
3. **Update SKILL.md** - Based on issues found
|
||||
4. **Create scripts** - Extract repeated work
|
||||
5. **Re-run tests** - Verify improvements
|
||||
6. **Repeat** - Until satisfied or good enough
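
Steps 1 and 2 can be partly automated once results are recorded in the JSON structure from Recording Results. The Python sketch below drops passing tests and orders the rest by priority; the function name is hypothetical and the records are made-up examples.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def fixes_to_apply(records):
    """Return non-passing records, highest priority first."""
    failing = [r for r in records if r["result"] != "pass"]
    # sorted() is stable, so records with equal priority keep their input order.
    return sorted(failing, key=lambda r: PRIORITY_ORDER[r["priority"]])


records = [
    {"test_id": 1, "result": "partial", "priority": "low"},
    {"test_id": 2, "result": "pass", "priority": "low"},
    {"test_id": 3, "result": "fail", "priority": "high"},
]
for record in fixes_to_apply(records):
    print(record["test_id"], record["priority"])
```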
|
||||