judge

Evaluate completed work using LLM-as-Judge with structured rubrics, context isolation, and evidence-based scoring.

  • Purpose - Assess the quality of work produced earlier in the conversation, using an isolated judge context

  • Pattern - Context Extraction → Judge Sub-Agent → Validation → Report

  • Output - Evaluation report with weighted scores, evidence citations, and actionable improvements

  • Quality - Enhanced with Chain-of-Thought scoring, self-verification, and bias mitigation

  • Efficiency - Single focused judge for fast evaluation without multi-agent overhead

Pattern: LLM-as-Judge with Context Isolation

This command implements a three-phase evaluation pattern (a code sketch follows the diagram):

Phase 1: Context Extraction
         Review conversation history
         Identify work to evaluate
         Extract: Original task, output, files, constraints

Phase 2: Judge Sub-Agent (Fresh Context)
         ┌─────────────────────────────────────────┐
         │ Judge receives ONLY extracted context   │
         │ (prevents confirmation bias)            │
         │                                         │
         │ For each criterion:                     │
         │   1. Review evidence                    │
         │   2. Write justification                │
         │   3. Assign score (1-5)                 │
         │   4. Self-verify with questions         │
         │   5. Adjust if needed                   │
         └─────────────────────────────────────────┘

Phase 3: Validation & Report
         Verify scores in valid range (1-5)
         Check justification has evidence
         Confirm weighted total calculation
         Present verdict with recommendations
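
A minimal Python sketch of the three phases, assuming hypothetical names (`ExtractedContext`, `build_judge_prompt`, `validate_and_total`) and leaving the actual sub-agent call abstract; this illustrates the pattern, not the command's internals:

```python
from dataclasses import dataclass

@dataclass
class ExtractedContext:
    # Phase 1 output: the ONLY material the judge is allowed to see.
    original_task: str
    output: str
    files: list[str]
    constraints: list[str]

def build_judge_prompt(ctx: ExtractedContext) -> str:
    # Phase 2 setup: the prompt carries nothing but the extracted
    # context, so prior session state cannot bias the judge.
    return (
        f"Task: {ctx.original_task}\n"
        f"Output under evaluation:\n{ctx.output}\n"
        f"Files: {', '.join(ctx.files)}\n"
        f"Constraints: {'; '.join(ctx.constraints)}\n"
        "For each criterion: review the evidence, write a justification, "
        "assign a 1-5 score, self-verify with questions, adjust if needed."
    )

def validate_and_total(scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    # Phase 3: range-check every score, then recompute the weighted
    # total rather than trusting the judge's own arithmetic.
    for criterion, score in scores.items():
        if not 1.0 <= score <= 5.0:
            raise ValueError(f"{criterion}: score {score} is outside 1-5")
    return sum(scores[c] * w for c, w in weights.items())
```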

Usage

When to Use

Use the single judge when:

  • Quick quality check needed

  • Work is straightforward with clear criteria

  • Speed/cost matters more than multi-perspective analysis

  • Evaluation is formative (guiding improvements), not summative

  • Low-to-medium stakes decisions

Use judge-with-debate instead when:

  • High-stakes decisions requiring rigorous evaluation

  • Subjective criteria where perspectives differ legitimately

  • Complex solutions with many evaluation dimensions

  • You need defensible, consensus-based evaluation

Default Evaluation Criteria

Criterion              Weight   What It Measures
─────────────────────  ──────   ─────────────────────────────────────────────────────────────────
Instruction Following  0.30     Does output fulfill original request? All requirements addressed?
Output Completeness    0.25     All components covered? Appropriate depth? No gaps?
Solution Quality       0.25     Sound approach? Best practices? No correctness issues?
Reasoning Quality      0.10     Clear decision-making? Appropriate methods used?
Response Coherence     0.10     Well-structured? Easy to understand? Professional?
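
The weighted total is the sum of each criterion's score times its weight. A quick worked example, with hypothetical per-criterion scores:

```python
weights = {
    "instruction_following": 0.30,
    "output_completeness":   0.25,
    "solution_quality":      0.25,
    "reasoning_quality":     0.10,
    "response_coherence":    0.10,
}
scores = {  # hypothetical judge output
    "instruction_following": 5,
    "output_completeness":   4,
    "solution_quality":      4,
    "reasoning_quality":     5,
    "response_coherence":    4,
}
weighted_total = sum(scores[c] * w for c, w in weights.items())
print(round(weighted_total, 2))  # 4.4 -> GOOD (see Scoring Interpretation)
```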

Scoring Interpretation

Score Range   Verdict             Recommendation
───────────   ─────────────────   ───────────────────────────
4.50 - 5.00   EXCELLENT           Ready as-is
4.00 - 4.49   GOOD                Minor improvements optional
3.50 - 3.99   ACCEPTABLE          Improvements recommended
3.00 - 3.49   NEEDS IMPROVEMENT   Address issues before use
1.00 - 2.99   INSUFFICIENT        Significant rework needed
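
These ranges amount to a simple threshold lookup; a sketch (the function name is illustrative):

```python
def verdict(weighted_total: float) -> tuple[str, str]:
    # Thresholds mirror the score ranges in the table above.
    if weighted_total >= 4.50:
        return "EXCELLENT", "Ready as-is"
    if weighted_total >= 4.00:
        return "GOOD", "Minor improvements optional"
    if weighted_total >= 3.50:
        return "ACCEPTABLE", "Improvements recommended"
    if weighted_total >= 3.00:
        return "NEEDS IMPROVEMENT", "Address issues before use"
    return "INSUFFICIENT", "Significant rework needed"

print(verdict(4.4))  # ('GOOD', 'Minor improvements optional')
```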

Quality Enhancement Techniques

Technique                 Benefit
────────────────────────  ──────────────────────────────────────────────────────────────────────
Context Isolation         Judge receives only extracted context, preventing confirmation bias from session state
Chain-of-Thought Scoring  Justification BEFORE score improves reliability by 15-25%
Evidence Requirement      Every score requires specific citations (file paths, line numbers, quotes)
Self-Verification         Judge generates verification questions and documents adjustments
Bias Mitigation           Explicit warnings against length bias, verbosity bias, and authority bias
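
To show how these techniques combine in practice, here is a hedged reconstruction of judge instructions; the command's actual prompt wording may differ:

```python
# Illustrative judge instructions; the wording is an assumption, not
# the command's verbatim prompt.
JUDGE_INSTRUCTIONS = """\
For each criterion, in this order:
1. Quote the specific evidence (file path, line number, or excerpt).
2. Write your justification BEFORE deciding on a score.
3. Assign a score from 1 to 5.
4. Ask yourself two verification questions about the score; if the
   answers change your view, adjust the score and note the adjustment.

Bias warnings:
- Do not reward output for being longer or more verbose.
- Do not defer to a confident or authoritative tone over evidence.
"""
```

Ordering the justification before the score is what makes the Chain-of-Thought technique effective: the judge commits to its reasoning first, then derives the number from it.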
