Skill Tester

Quality assurance testing for AI agent skills.

Real Conversations

See how the Skill Tester skill responds to actual requests. These are representative examples of what your AI agent will produce.

You

I just finished building a new skill called log-analyzer — it has SKILL.md, README.md, two Python scripts in scripts/, and a references/ directory. I want to make sure it meets POWERFUL tier requirements before I open a PR. How do I run the full quality check?

Agent

This skill includes executable scripts — I'll need your permission before running them.

Here's the full three-tool validation sequence for POWERFUL tier:

# Step 1 — Structural validation
python3 scripts/skill_validator.py skills/log-analyzer --tier POWERFUL

# Step 2 — Script testing (syntax, imports, runtime, output format)
python3 scripts/script_tester.py skills/log-analyzer --timeout 30 --sample-data

# Step 3 — Quality scoring
python3 scripts/quality_scorer.py skills/log-analyzer --detailed --recommendations

What POWERFUL tier requires (checklist before running):

SKILL.md: minimum 300 lines
2-3 Python scripts, each 500-800 lines
--json and human-readable output modes on all scripts
Comprehensive argparse with multiple modes/subcommands
references/ directory with documentation
expected_outputs/ directory for regression testing
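Before running the three tools, the file-level items in this checklist can be checked locally. The sketch below is a minimal illustration, not part of the Skill Tester scripts: the thresholds are copied from the checklist, while the function names and messages are assumptions. It does not inspect argparse or --json support, which the script tester covers.

```python
from pathlib import Path

# Thresholds copied from the POWERFUL checklist; names here are illustrative.
MIN_SKILL_MD_LINES = 300
SCRIPT_COUNT_RANGE = (2, 3)
SCRIPT_LINE_RANGE = (500, 800)


def count_lines(path: Path) -> int:
    """Count the lines in a text file."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def preflight(skill_dir: str) -> list[str]:
    """Return human-readable problems; an empty list means the file checks pass."""
    root = Path(skill_dir)
    problems = []

    skill_md = root / "SKILL.md"
    if not skill_md.is_file():
        problems.append("SKILL.md is missing")
    elif count_lines(skill_md) < MIN_SKILL_MD_LINES:
        problems.append(f"SKILL.md has fewer than {MIN_SKILL_MD_LINES} lines")

    # A missing scripts/ directory simply yields zero scripts.
    scripts = sorted((root / "scripts").glob("*.py"))
    low, high = SCRIPT_COUNT_RANGE
    if not low <= len(scripts) <= high:
        problems.append(f"expected {low}-{high} Python scripts, found {len(scripts)}")
    for script in scripts:
        lines = count_lines(script)
        if not SCRIPT_LINE_RANGE[0] <= lines <= SCRIPT_LINE_RANGE[1]:
            problems.append(f"{script.name} has {lines} lines (expected 500-800)")

    for required in ("references", "expected_outputs"):
        if not (root / required).is_dir():
            problems.append(f"{required}/ directory is missing")

    return problems


if __name__ == "__main__":
    import sys

    issues = preflight(sys.argv[1] if len(sys.argv) > 1 else ".")
    for issue in issues:
        print(f"- {issue}")
    sys.exit(1 if issues else 0)
```

Saved under a hypothetical name such as preflight.py, it would be run as `python3 preflight.py skills/log-analyzer`; a non-zero exit means at least one checklist item is unmet.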

Interpreting results:

=== SKILL VALIDATION REPORT ===
Skill: skills/log-analyzer
Tier: POWERFUL
Overall Score: 82/100 (B)

Structure Validation: ✓ PASS
├─ SKILL.md: ✓ EXISTS (312 lines)
├─ scripts/: ✓ EXISTS (2 files)
├─ references/: ✓ EXISTS
└─ expected_outputs/: ⚠ MISSING  ← add this

Script Testing:
├─ log_scanner.py: ✓ PASS (syntax, imports, --help, --json)
└─ log_summarizer.py: ⚠ WARN — no --json output mode

Recommendations:
• Add expected_outputs/ directory with sample runs
• Add --json flag to log_summarizer.py
• Add --format text/json to argument parser

A B score (80+) is acceptable for POWERFUL tier, but fix the warnings before opening the PR. CI flags anything below 75.

# Run with --json for CI-parseable output
python3 scripts/quality_scorer.py skills/log-analyzer --json | jq '.overall_score'
# Must be >= 75

You

We have 4 contractors submitting skills to our repository via PRs. Quality varies a lot — some have no expected outputs, some scripts crash on --help. I want to block merges automatically when a skill doesn't meet the minimum bar. How do I set this up?

Agent

Here's a complete GitHub Actions quality gate that runs on every PR touching the skills/ directory:

# .github/workflows/skill-quality-gate.yml
name: Skill Quality Gate

on:
  pull_request:
    paths:
      - 'skills/**'

jobs:
  validate-skills:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main exists for the diff below

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Find changed skills
        id: changed
        run: |
          # Join paths with spaces: a multi-line value breaks the name=value output format
          changed=$(git diff --name-only origin/main...HEAD | \
            grep '^skills/' | cut -d'/' -f1-2 | sort -u | tr '\n' ' ')
          echo "skills=$changed" >> "$GITHUB_OUTPUT"

      - name: Validate structure
        run: |
          for skill in ${{ steps.changed.outputs.skills }}; do
            echo "--- Validating $skill ---"
            python3 engineering/skill-tester/scripts/skill_validator.py \
              "$skill" --json | tee /tmp/validation.json
            # jq -e exits non-zero when the value is false or null
            jq -e '.structure_compliance.checks.skill_md_exists' \
              /tmp/validation.json > /dev/null
          done

      - name: Test scripts
        run: |
          for skill in ${{ steps.changed.outputs.skills }}; do
            python3 engineering/skill-tester/scripts/script_tester.py \
              "$skill" --timeout 30
          done

      - name: Score quality (minimum 75)
        run: |
          for skill in ${{ steps.changed.outputs.skills }}; do
            score=$(python3 engineering/skill-tester/scripts/quality_scorer.py \
              "$skill" --json | jq '.overall_score')
            echo "$skill score: $score"
            if (( $(echo "$score < 75" | bc) )); then
              echo "FAIL: $skill scored $score (minimum: 75)"
              exit 1
            fi
          done

What this gates on:

Missing SKILL.md → immediate block
Script syntax errors or import failures → block
--help crashes or no --json support → block
Quality score < 75 → block with score in the error message
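The score threshold in this list can also be checked in Python instead of shell arithmetic. The sketch below is a minimal illustration. It assumes only what the workflow already relies on, namely that the scorer's --json output is an object with a numeric overall_score field. The gate function, its messages, and the file name score_gate.py are hypothetical. Unlike a bare shell comparison, it treats a missing or non-numeric score as a failure.

```python
import json
import sys

MINIMUM_SCORE = 75  # threshold stated in the workflow


def gate(report_text: str, minimum: float = MINIMUM_SCORE) -> tuple[bool, str]:
    """Parse scorer JSON and return (passed, message); unreadable input fails."""
    try:
        report = json.loads(report_text)
        score = float(report["overall_score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return False, f"FAIL: could not read overall_score ({exc.__class__.__name__})"
    if score < minimum:
        return False, f"FAIL: scored {score:g} (minimum: {minimum:g})"
    return True, f"PASS: scored {score:g}"


if __name__ == "__main__":
    # Reads the scorer's JSON from stdin so it can sit at the end of a pipe.
    passed, message = gate(sys.stdin.read())
    print(message)
    sys.exit(0 if passed else 1)
```

In the scoring step it would be used as `python3 engineering/skill-tester/scripts/quality_scorer.py "$skill" --json | python3 score_gate.py`.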

Feedback to contractors in the PR:

Action failed: log-analyzer scored 62/100 (minimum: 75)
Failing checks:
  - scripts/log_summarizer.py: no --json output mode
  - expected_outputs/ directory missing
Run quality_scorer.py --detailed --recommendations to see all suggestions.

This eliminates back-and-forth review cycles on structural issues — contractors get immediate automated feedback.

You

We migrated 167 skills from a legacy repo last month. I suspect many are below quality standards — wrong tier classifications, missing expected outputs, scripts that don't support --json. How do I get a quality report across all 167?

Agent

This skill includes executable scripts — I'll need your permission before running them.

# Batch quality report across all 167 skills
python3 scripts/quality_scorer.py skills/ \
  --batch \
  --json > reports/quality_report_$(date +%Y%m%d).json

Then extract the actionable summary:

# Distribution by grade
jq -c '[.[] | .letter_grade] | group_by(.) | 
  map({grade: .[0], count: length}) | sort_by(.grade)[]' \
  reports/quality_report_*.json

# Output:
# { "grade": "A", "count": 12 }
# { "grade": "B", "count": 43 }
# { "grade": "C", "count": 67 }   ← most of them
# { "grade": "D", "count": 38 }
# { "grade": "F", "count": 7 }

# Skills scoring < 60 (need urgent attention)
jq -r '.[] | select(.overall_score < 60) | 
  "\(.skill_path): \(.overall_score)/100 — \(.improvement_suggestions[:2] | join(", "))"' \
  reports/quality_report_*.json
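If jq isn't available, the same two summaries can be produced in Python. This is a sketch under one assumption: the batch report is a JSON array of objects with the fields used in the jq filters above (skill_path, overall_score, letter_grade, improvement_suggestions). The function name and return shape are illustrative.

```python
import json
import sys
from collections import Counter


def summarize(report: list[dict], urgent_below: float = 60) -> dict:
    """Grade distribution plus the skills scoring below `urgent_below`."""
    grades = Counter(entry["letter_grade"] for entry in report)
    urgent = sorted(
        (entry for entry in report if entry["overall_score"] < urgent_below),
        key=lambda entry: entry["overall_score"],  # lowest score first
    )
    return {
        "grades": dict(sorted(grades.items())),
        "urgent": [
            (
                entry["skill_path"],
                entry["overall_score"],
                # Keep the first two suggestions, as the jq filter does.
                entry.get("improvement_suggestions", [])[:2],
            )
            for entry in urgent
        ],
    }


if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as handle:
        summary = summarize(json.load(handle))
    for grade, count in summary["grades"].items():
        print(f"{grade}: {count}")
    for path, score, suggestions in summary["urgent"]:
        print(f"{path}: {score}/100 ({', '.join(suggestions)})")
```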

Most common gaps in legacy migrations:

Issue	Prevalence	Fix
No `--json` output mode	~60% of scripts	Add `--format {json,text}` argparse arg
Missing `expected_outputs/`	~70% of skills	Run script with sample data, save output
SKILL.md < tier minimum lines	~30%	Either expand docs or reclassify tier
External imports (not stdlib)	~15%	Replace with stdlib equivalents
Missing `references/` directory	~45%	Create dir, add relevant docs
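The first fix in the table, adding a `--format {json,text}` argument, looks the same in most scripts. The sketch below shows one way to wire it, with `--json` kept as a shorthand for `--format json`. The summarize function is a stand-in for a script's real logic, and all names are illustrative.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Summarize a log file (illustrative).")
    parser.add_argument("path", help="log file to summarize")
    parser.add_argument("--format", choices=["json", "text"], default="text",
                        help="output format (default: text)")
    # --json writes to the same destination, so it acts as a shorthand.
    parser.add_argument("--json", dest="format", action="store_const", const="json",
                        help="shorthand for --format json")
    return parser


def summarize(path: str) -> dict:
    """Stand-in for the script's real work: count lines and ERROR lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = handle.readlines()
    return {"lines": len(lines), "errors": sum("ERROR" in line for line in lines)}


def render(summary: dict, output_format: str) -> str:
    """Render the same data as JSON or as key: value text."""
    if output_format == "json":
        return json.dumps(summary, indent=2, sort_keys=True)
    return "\n".join(f"{key}: {value}" for key, value in sorted(summary.items()))


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    print(render(summarize(args.path), args.format))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

Building the result as a dict first and rendering it last means both output modes stay in sync, and the change to an existing script is limited to its argument parser and final print.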

Prioritization approach:

Fix the 7 failing skills first (structural issues, scripts crash)
Batch-fix the --json issue (same pattern in all scripts — one change per script)
Add expected_outputs/ by running each script once with sample data
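The last step, populating expected_outputs/, can double as a regression check on later runs. This is a minimal sketch: it records a script's stdout the first time and compares against it afterwards. The layout of files inside expected_outputs/ and the function names are assumptions, not something the Skill Tester tools prescribe.

```python
import subprocess
import sys
from pathlib import Path


def run_script(script: Path, args: list[str], timeout: int = 30) -> str:
    """Run a Python script and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


def check_or_record(script: Path, args: list[str], expected_file: Path) -> bool:
    """Record output on the first run; afterwards return True if it is unchanged."""
    actual = run_script(script, args)
    if not expected_file.exists():
        expected_file.parent.mkdir(parents=True, exist_ok=True)
        expected_file.write_text(actual, encoding="utf-8")
        return True
    return expected_file.read_text(encoding="utf-8") == actual
```

Output that embeds timestamps or absolute paths will differ on every run, so scripts checked this way need deterministic output for the sample data.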

Track progress over time:

# Run weekly, compare to baseline
jq '[.[] | .overall_score] | add/length' reports/quality_report_*.json
# Week 1: 61.2 average → Week 4: 74.8 average
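The same trend can be computed in Python, one row per dated report. The sketch assumes the reports/quality_report_YYYYMMDD.json naming from the batch command and a JSON array of objects with overall_score. The function names are illustrative.

```python
import json
from pathlib import Path
from statistics import mean


def average_score(report: list[dict]) -> float:
    """Mean overall_score, rounded to one decimal; raises on an empty report."""
    return round(mean(entry["overall_score"] for entry in report), 1)


def trend(report_dir: str) -> list[tuple[str, float]]:
    """One (filename, average) row per report, oldest first."""
    rows = []
    # YYYYMMDD file names sort chronologically when sorted as text.
    for path in sorted(Path(report_dir).glob("quality_report_*.json")):
        report = json.loads(path.read_text(encoding="utf-8"))
        rows.append((path.name, average_score(report)))
    return rows


if __name__ == "__main__":
    for name, average in trend("reports"):
        print(f"{name}: {average}")
```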

Get Started

How to install and use this skill in your preferred environment.

Skills are designed for AI coding agents (Claude Code, Cursor, Windsurf) and IDE-based workflows where the agent can read files, run scripts, and act on your codebase.

Models & Context

Which AI models and context windows work best with this skill.

Recommended Models

Best

Claude Opus 4.8, Claude Sonnet 4, GPT-4.1, Gemini 2.5 Pro, Grok 3, Kimi K2

Good

Claude Haiku 4.5, GPT-4.1 mini, Gemini 2.5 Flash, Grok 3 mini

Larger models produce more detailed, production-ready outputs.

Context Window

This skill's SKILL.md is typically 3–10 KB — fits in any modern context window.

8K: Skill only
32K+: Skill + conversation
100K+: Skill + references + codebase

All current frontier models support 100K+ context.

Pro tips for best results

1. Be specific: Include numbers (users, budget, RPS) so the skill can size the architecture.
2. Share constraints: Compliance needs, team size, and existing stack all improve the output.
3. Iterate: Start with a high-level design, then ask follow-ups for IaC, cost analysis, or security review.
4. Combine skills: Pair with companion skills below for end-to-end coverage.

Ready to try Skill Tester?

Install the skill and start getting expert-level guidance in your workflow — any agent, any IDE.
