judge

Evaluate completed work using LLM-as-Judge with structured rubrics, context isolation, and evidence-based scoring.

  • Purpose - Assess the quality of work produced earlier in the conversation, using an isolated judge context

  • Pattern - Context Extraction → Judge Sub-Agent → Validation → Report

  • Output - Evaluation report with weighted scores, evidence citations, and actionable improvements

  • Quality - Enhanced with Chain-of-Thought scoring, self-verification, and bias mitigation

  • Efficiency - Single focused judge for fast evaluation without multi-agent overhead

Pattern: LLM-as-Judge with Context Isolation

This command implements a three-phase evaluation pattern (a code sketch follows the diagram):

Phase 1: Context Extraction
         Review conversation history
         Identify work to evaluate
         Extract: Original task, output, files, constraints

Phase 2: Judge Sub-Agent (Fresh Context)
         ┌─────────────────────────────────────────┐
         │ Judge receives ONLY extracted context   │
         │ (prevents confirmation bias)            │
         │                                         │
         │ For each criterion:                     │
         │   1. Review evidence                    │
         │   2. Write justification                │
         │   3. Assign score (1-5)                 │
         │   4. Self-verify with questions         │
         │   5. Adjust if needed                   │
         └─────────────────────────────────────────┘

Phase 3: Validation & Report
         Verify scores in valid range (1-5)
         Check justification has evidence
         Confirm weighted total calculation
         Present verdict with recommendations
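
A minimal Python sketch of the three phases, assuming hypothetical names (`ExtractedContext`, `build_judge_prompt`, `validate_and_total`) and leaving the actual sub-agent call abstract; this illustrates the pattern, not the command's internals:

```python
from dataclasses import dataclass

@dataclass
class ExtractedContext:
    # Phase 1 output: the ONLY material the judge is allowed to see.
    original_task: str
    output: str
    files: list[str]
    constraints: list[str]

def build_judge_prompt(ctx: ExtractedContext) -> str:
    # Phase 2 setup: the prompt carries nothing but the extracted
    # context, so prior session state cannot bias the judge.
    return (
        f"Task: {ctx.original_task}\n"
        f"Output under evaluation:\n{ctx.output}\n"
        f"Files: {', '.join(ctx.files)}\n"
        f"Constraints: {'; '.join(ctx.constraints)}\n"
        "For each criterion: review the evidence, write a justification, "
        "assign a 1-5 score, self-verify with questions, adjust if needed."
    )

def validate_and_total(scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    # Phase 3: range-check every score, then recompute the weighted
    # total rather than trusting the judge's own arithmetic.
    for criterion, score in scores.items():
        if not 1.0 <= score <= 5.0:
            raise ValueError(f"{criterion}: score {score} is outside 1-5")
    return sum(scores[c] * w for c, w in weights.items())
```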

Usage

When to Use

Use the single judge when:

  • Quick quality check needed

  • Work is straightforward with clear criteria

  • Speed/cost matters more than multi-perspective analysis

  • Evaluation is formative (guiding improvements), not summative

  • Low-to-medium stakes decisions

Use judge-with-debate instead when:

  • High-stakes decisions requiring rigorous evaluation

  • Subjective criteria where perspectives differ legitimately

  • Complex solutions with many evaluation dimensions

  • You need defensible, consensus-based evaluation

Default Evaluation Criteria

Criterion              Weight   What It Measures
─────────────────────  ──────   ─────────────────────────────────────────────────────────────────
Instruction Following  0.30     Does output fulfill original request? All requirements addressed?
Output Completeness    0.25     All components covered? Appropriate depth? No gaps?
Solution Quality       0.25     Sound approach? Best practices? No correctness issues?
Reasoning Quality      0.10     Clear decision-making? Appropriate methods used?
Response Coherence     0.10     Well-structured? Easy to understand? Professional?
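
The weighted total is the sum of each criterion's score times its weight. A quick worked example, with hypothetical per-criterion scores:

```python
weights = {
    "instruction_following": 0.30,
    "output_completeness":   0.25,
    "solution_quality":      0.25,
    "reasoning_quality":     0.10,
    "response_coherence":    0.10,
}
scores = {  # hypothetical judge output
    "instruction_following": 5,
    "output_completeness":   4,
    "solution_quality":      4,
    "reasoning_quality":     5,
    "response_coherence":    4,
}
weighted_total = sum(scores[c] * w for c, w in weights.items())
print(round(weighted_total, 2))  # 4.4 -> GOOD (see Scoring Interpretation)
```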

Scoring Interpretation

Score Range   Verdict             Recommendation
───────────   ─────────────────   ───────────────────────────
4.50 - 5.00   EXCELLENT           Ready as-is
4.00 - 4.49   GOOD                Minor improvements optional
3.50 - 3.99   ACCEPTABLE          Improvements recommended
3.00 - 3.49   NEEDS IMPROVEMENT   Address issues before use
1.00 - 2.99   INSUFFICIENT        Significant rework needed
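
These ranges amount to a simple threshold lookup; a sketch (the function name is illustrative):

```python
def verdict(weighted_total: float) -> tuple[str, str]:
    # Thresholds mirror the score ranges in the table above.
    if weighted_total >= 4.50:
        return "EXCELLENT", "Ready as-is"
    if weighted_total >= 4.00:
        return "GOOD", "Minor improvements optional"
    if weighted_total >= 3.50:
        return "ACCEPTABLE", "Improvements recommended"
    if weighted_total >= 3.00:
        return "NEEDS IMPROVEMENT", "Address issues before use"
    return "INSUFFICIENT", "Significant rework needed"

print(verdict(4.4))  # ('GOOD', 'Minor improvements optional')
```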

Quality Enhancement Techniques

Technique                 Benefit
────────────────────────  ──────────────────────────────────────────────────────────────────────
Context Isolation         Judge receives only extracted context, preventing confirmation bias from session state
Chain-of-Thought Scoring  Justification BEFORE score improves reliability by 15-25%
Evidence Requirement      Every score requires specific citations (file paths, line numbers, quotes)
Self-Verification         Judge generates verification questions and documents adjustments
Bias Mitigation           Explicit warnings against length bias, verbosity bias, and authority bias
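
To show how these techniques combine in practice, here is a hedged reconstruction of judge instructions; the command's actual prompt wording may differ:

```python
# Illustrative judge instructions; the wording is an assumption, not
# the command's verbatim prompt.
JUDGE_INSTRUCTIONS = """\
For each criterion, in this order:
1. Quote the specific evidence (file path, line number, or excerpt).
2. Write your justification BEFORE deciding on a score.
3. Assign a score from 1 to 5.
4. Ask yourself two verification questions about the score; if the
   answers change your view, adjust the score and note the adjustment.

Bias warnings:
- Do not reward output for being longer or more verbose.
- Do not defer to a confident or authoritative tone over evidence.
"""
```

Ordering the justification before the score is what makes the Chain-of-Thought technique effective: the judge commits to its reasoning first, then derives the number from it.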
