Reflexion
A self-refinement framework that adds feedback and refinement loops to improve output quality through iteration, complexity triage, and verification.
Focused on:
Self-refinement - Agents review and improve their own outputs
Multi-agent review - Specialized agents critique from different perspectives
Iterative improvement - Systematic loops that converge on higher quality
Memory integration - Lessons learned persist across interactions
Plugin Target
Decrease hallucinations - reflection usually removes hallucinations by verifying the output
Make output quality more predictable - the same model usually produces more consistent output after reflection than after a one-shot prompt
Improve output quality - reflection usually improves the output by identifying areas that were missed or misunderstood in the one-shot prompt
Overview
The Reflexion plugin implements several research-backed techniques for improving LLM outputs through self-reflection, critique, and memory updates. It enables Claude to evaluate its own work, identify weaknesses, and generate improved versions.
The plugin is based on papers such as Self-Refine and Reflexion. These techniques improve the output of large language models by introducing feedback and refinement loops.
They are reported to increase output quality by 8–21% compared to standard one-step model outputs, based on both automatic metrics and human preferences across seven diverse tasks, including dialogue generation, coding, and mathematical reasoning.
The plugin also builds on the Agentic Context Engineering paper, which applies memory updates after reflection and reports consistently outperforming strong baselines by 10.6% on agent tasks.
Quick Start
Run /reflexion:reflect after Claude completes a task. Alternatively, include the word "reflect" in your initial prompt to trigger reflection automatically.
The automatic hook requires bun to be installed. The commands themselves do not require it.
Automatic Reflection with Hooks
The plugin includes optional hooks that automatically trigger reflection when you include the word "reflect" in your prompt. This removes the need to manually run /reflexion:reflect after each task.
How It Works
Include the word "reflect" anywhere in your prompt
Claude completes your task
The hook automatically triggers /reflexion:reflect
Claude reviews and improves its work
Important: Only the exact word "reflect" triggers automatic reflection. Words like "reflection", "reflective", or "reflects" do not trigger it.
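The exact-word rule above can be illustrated with a whole-word match. This is a minimal Python sketch, not the plugin's actual hook code; the function name should_trigger_reflection is hypothetical, and case-insensitive matching is an assumption the text above does not confirm.

```python
import re

# Assumption: "reflect" is matched as a standalone word, ignoring case.
# \b word boundaries keep "reflection", "reflective", and "reflects" from matching.
_REFLECT_WORD = re.compile(r"\breflect\b", re.IGNORECASE)


def should_trigger_reflection(prompt: str) -> bool:
    """Return True only if the prompt contains the standalone word 'reflect'."""
    return _REFLECT_WORD.search(prompt) is not None


if __name__ == "__main__":
    for prompt in (
        "Fix the login bug, then reflect",
        "Add a reflection section to the docs",
    ):
        print(f"{prompt!r}: {should_trigger_reflection(prompt)}")
```

A word-boundary match, rather than a substring search, is what prevents the related words from triggering the hook.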
Commands Overview
/reflexion:reflect - Self-Refinement
Reflects on the previous response and output, using the self-refinement framework for iterative improvement with complexity triage and verification
Purpose - Review and improve previous response
Output - Refined output with improvements
Arguments
Optional focus areas or a confidence threshold, for example "security" or "deep reflect if less than 90% confidence"
How It Works
Complexity Triage: Automatically determines appropriate reflection depth
Quick Path (5s): Simple tasks get fast verification
Standard Path: Multi-file changes get full reflection
Deep Path: Critical systems get comprehensive analysis
Self-Assessment: Evaluates output against quality criteria
Completeness check
Quality assessment
Correctness verification
Fact-checking
Refinement Planning: If improvements are needed, generates a specific plan
Identifies issues
Proposes solutions
Prioritizes fixes
Implementation: Produces refined output addressing identified issues
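The complexity triage step can be pictured as a small routing function. The sketch below is illustrative only: the Task fields and the rule that more than one changed file means "standard" are assumptions derived from the path descriptions above, not the command's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical summary of the work being reflected on."""
    files_changed: int
    touches_critical_system: bool = False


def choose_reflection_path(task: Task) -> str:
    """Pick a reflection depth: 'quick', 'standard', or 'deep'."""
    if task.touches_critical_system:
        return "deep"      # Critical systems get comprehensive analysis
    if task.files_changed > 1:
        return "standard"  # Multi-file changes get full reflection
    return "quick"         # Simple tasks get fast verification


if __name__ == "__main__":
    print(choose_reflection_path(Task(files_changed=1)))
    print(choose_reflection_path(Task(files_changed=4)))
    print(choose_reflection_path(Task(files_changed=1, touches_critical_system=True)))
```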
Confidence Thresholds
The command uses confidence levels to determine if further iteration is needed:
Quick Path: No specific threshold (fast verification only)
Standard Path: Requires >70% confidence
Deep Reflection: Requires >90% confidence
If the confidence threshold isn't met, the command iterates automatically.
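A rough sketch of how the thresholds could drive iteration is shown below. The refine callable and the max_iterations cap are hypothetical; the documentation does not say how confidence is computed or whether iteration is bounded.

```python
from typing import Callable, Optional

# Thresholds from the list above; None means no threshold (fast verification only).
THRESHOLDS: dict[str, Optional[float]] = {
    "quick": None,
    "standard": 0.70,
    "deep": 0.90,
}


def meets_threshold(path: str, confidence: float) -> bool:
    """True when confidence is strictly above the path's threshold, if it has one."""
    threshold = THRESHOLDS[path]
    return threshold is None or confidence > threshold


def refine_until_confident(
    path: str,
    draft: str,
    refine: Callable[[str], tuple[str, float]],
    max_iterations: int = 3,  # assumption: a cap so the loop always terminates
) -> tuple[str, float]:
    """Call refine(output) -> (new_output, confidence) until the threshold is met."""
    output, confidence = draft, 0.0
    for _ in range(max_iterations):
        output, confidence = refine(output)
        if meets_threshold(path, confidence):
            break
    return output, confidence
```

The cap is a design choice for the sketch: without it, a reflection that never reaches the required confidence would loop forever.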
Usage Examples
Best practices
Reflect after significant work - Don't reflect on trivial tasks
Be specific - Provide context about what to focus on
Iterate when needed - Sometimes multiple reflection cycles are valuable
Capture learnings - Use /reflexion:memorize to preserve insights
/reflexion:critique - Multi-Perspective Critique
Comprehensive multi-perspective review using specialized judges with debate and consensus building
Purpose - Multi-perspective comprehensive review
Output - Structured feedback from multiple judges
Arguments
Optional file paths, commits, or context to review (defaults to recent changes)
How It Works
Context Gathering: Identifies scope of work to review
Parallel Review: Spawns three specialized judge agents
Requirements Validator: Checks alignment with original requirements
Solution Architect: Evaluates technical approach and design
Code Quality Reviewer: Assesses implementation quality
Cross-Review & Debate: Judges review each other's findings and debate disagreements
Consensus Report: Generates comprehensive report with actionable recommendations
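The parallel review step can be sketched as three independent judge calls run concurrently. This is only an illustration of the fan-out pattern: ask is a hypothetical callable standing in for whatever sends a prompt to a model, and the prompt wording is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Judge names and focus areas are taken from the list above.
JUDGES = {
    "Requirements Validator": "alignment with the original requirements",
    "Solution Architect": "technical approach and design",
    "Code Quality Reviewer": "implementation quality",
}


def parallel_review(work: str, ask: Callable[[str], str]) -> dict[str, str]:
    """Run every judge concurrently and return {judge name: feedback}."""
    prompts = {
        name: f"As the {name}, review the following work for {focus}:\n{work}"
        for name, focus in JUDGES.items()
    }
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        futures = {name: pool.submit(ask, prompt) for name, prompt in prompts.items()}
        # result() blocks until that judge has finished.
        return {name: future.result() for name, future in futures.items()}
```

The cross-review and consensus steps would take this dictionary as input; they are omitted here because the documentation does not describe how debate is resolved.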
Judge Scoring
Each judge provides a score out of 10:
9-10: Exceptional quality, minimal improvements needed
7-8: Good quality, minor improvements suggested
5-6: Acceptable quality, several improvements recommended
3-4: Below standards, significant rework needed
1-2: Major issues, substantial rework required
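The scoring bands above translate directly into a lookup. The sketch below assumes integer scores from 1 to 10; the function name score_band is hypothetical.

```python
def score_band(score: int) -> str:
    """Map a judge's 1-10 score to the band description from the list above."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score >= 9:
        return "Exceptional quality, minimal improvements needed"
    if score >= 7:
        return "Good quality, minor improvements suggested"
    if score >= 5:
        return "Acceptable quality, several improvements recommended"
    if score >= 3:
        return "Below standards, significant rework needed"
    return "Major issues, substantial rework required"


if __name__ == "__main__":
    for score in (10, 7, 2):
        print(score, score_band(score))
```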
Usage Examples
Best practices
For important decisions - Use critique for architectural or design choices
Before major commits - Get multi-perspective review before committing
Learn from debates - Pay attention to different perspectives in the critique
Address all concerns - Don't cherry-pick feedback
/reflexion:memorize - Memory Updates
Curates insights from reflections and critiques and saves them to the CLAUDE.md file, using Agentic Context Engineering
Purpose - Save insights to project memory
Output - Updated CLAUDE.md with learnings
Arguments
Optional source specification (last, selection, chat:) or --dry-run for preview
How It Works
Context Harvesting: Gathers insights from recent work
Reflection outputs
Critique findings
Problem-solving patterns
Failed approaches and lessons
Curation Process: Transforms raw insights into structured knowledge
Extracts key insights
Categorizes by impact
Applies curation rules (relevance, non-redundancy, actionability)
Prevents context collapse
CLAUDE.md Updates: Adds curated insights to appropriate sections
Project Context
Code Quality Standards
Architecture Decisions
Testing Strategies
Development Guidelines
Strategies and Hard Rules
Memory Validation: Ensures quality of updates
Coherence check
Actionability test
Consolidation review
Evidence verification
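The curation rules named in the Curation Process step (relevance, non-redundancy, actionability) can be sketched as a filter over candidate insights. Everything below is an assumption for illustration: the Insight fields, the idea that relevance and actionability arrive as precomputed flags, and the use of normalized text equality to detect redundancy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insight:
    """Hypothetical candidate insight harvested from recent work."""
    text: str
    relevant: bool
    actionable: bool


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies compare equal.
    return " ".join(text.lower().split())


def curate(candidates: list[Insight], existing_memory: list[str]) -> list[Insight]:
    """Keep insights that are relevant, actionable, and not already recorded."""
    seen = {_normalize(line) for line in existing_memory}
    kept = []
    for insight in candidates:
        key = _normalize(insight.text)
        if insight.relevant and insight.actionable and key not in seen:
            kept.append(insight)
            seen.add(key)  # also drops duplicates within the same batch
    return kept
```

Filtering before writing, rather than appending everything, is one way to keep the memory file from growing with repeated or unusable entries.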
Usage Examples
Best practices
Regular memorization - Periodically save insights to CLAUDE.md
Review memory - Occasionally review CLAUDE.md to ensure it stays relevant
Curate carefully - Only memorize significant, reusable insights
Organize by topic - Keep CLAUDE.md well-structured
Theoretical Foundation
The Reflexion plugin is based on published research that reports an 8-21% improvement in output quality across diverse tasks:
Core Papers
Self-Refine - Iterative refinement where the model reviews and improves its own output
Reflexion - Self-reflection for autonomous agents with memory
Constitutional AI (CAI) - Critique based on principles and guidelines
LLM-as-a-Judge - Using LLMs to evaluate other LLM outputs
Multi-Agent Debate - Multiple models proposing and critiquing solutions
Agentic Context Engineering - Memory updates after reflection (10.6% improvement)
Additional Techniques
Chain-of-Verification (CoVe) - Generate, verify, revise cycle
Tree of Thoughts (ToT) - Multiple reasoning path exploration
Process Reward Models - Step-by-step evaluation