Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Heretic is a single-developer project from p-e-w that automates directional ablation, the technique commonly called abliteration, for removing refusal behaviors from open-weight language models. It has over 22,000 GitHub stars and is one of the cleanest practical implementations of the method, with a deliberate engineering focus on preserving original model capabilities rather than just suppressing refusals. The author frames it as a research tool, and the AGPL-3.0 license makes that framing legally enforceable downstream.

## What Directional Ablation Actually Does

The core technique is simple to state but easy to do badly. You feed the model two sets of prompts, one harmful and one harmless, capture the residual stream at each transformer layer, and take the difference of the two mean activations. The result is a vector pointing in the direction the model uses to decide whether to refuse. You then orthogonalize the attention output projection and the MLP down projection against that vector, so those components can no longer write the refusal feature into the residual stream. The model loses the refusal behavior because the internal signal it was conditioning on is gone.

The failure mode is collateral damage. Naive abliteration tools tend to wipe out general capability along with the refusal direction. Heretic's contribution is a parametrized variant that tunes the ablation kernel rather than applying a uniform projection across every layer.

## Parametrized Ablation

Instead of one direction index per layer, Heretic exposes a small set of continuous parameters: direction_index, max_weight, max_weight_position, min_weight, and min_weight_distance, with separate kernels for the attention and MLP components. These parameters define the shape of an ablation curve that varies across the depth of the network. The direction index is a floating-point value, so a non-integer index interpolates between the directions of adjacent layers.
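To make the steps above concrete, here is a minimal NumPy sketch under stated assumptions. It is not Heretic's code. The function names (`refusal_direction`, `interpolate_direction`, `kernel_weight`, `ablate`) are hypothetical, the activations are synthetic random data, and the linear falloff in `kernel_weight` is an assumption about how the four kernel parameters might combine, since the text names them without specifying the curve. Weight matrices are assumed to have shape `(d_model, d_in)` and to act as `weight @ x`.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between two (n_prompts, d_model) arrays."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def interpolate_direction(layer_directions, direction_index):
    """Blend the directions of the two layers adjacent to a float index."""
    lo = int(np.floor(direction_index))
    hi = min(lo + 1, len(layer_directions) - 1)
    frac = direction_index - lo
    blended = (1.0 - frac) * layer_directions[lo] + frac * layer_directions[hi]
    return blended / np.linalg.norm(blended)


def kernel_weight(layer, max_weight, max_weight_position,
                  min_weight, min_weight_distance):
    """ASSUMPTION: weight peaks at max_weight_position, falls off linearly to
    min_weight at min_weight_distance layers away, and is zero beyond that."""
    distance = abs(layer - max_weight_position)
    if distance == 0:
        return max_weight
    if distance > min_weight_distance:
        return 0.0
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


def ablate(weight, direction, strength=1.0):
    """Subtract `strength` times the projection onto `direction` from a
    (d_model, d_in) matrix that writes into the residual stream."""
    return weight - strength * np.outer(direction, direction) @ weight


rng = np.random.default_rng(0)
d_model, d_in, n_layers = 8, 4, 6

# Synthetic stand-ins for captured residual activations (not real model data).
layer_directions = []
for _ in range(n_layers):
    harmless = rng.normal(size=(32, d_model))
    harmful = rng.normal(size=(32, d_model)) + 2.0  # shifted cluster
    layer_directions.append(refusal_direction(harmful, harmless))

# A non-integer index blends the directions of layers 2 and 3.
direction = interpolate_direction(layer_directions, 2.5)

# Full-strength ablation: the matrix can no longer write along `direction`.
w_out = rng.normal(size=(d_model, d_in))
full = ablate(w_out, direction, strength=1.0)
print("max residual component:", np.abs(direction @ full).max())

# Per-layer strengths from the assumed kernel shape.
for layer in range(n_layers):
    w = kernel_weight(layer, max_weight=1.0, max_weight_position=3.0,
                      min_weight=0.2, min_weight_distance=2.0)
    print(layer, round(w, 3))
```

With `strength=1.0` the ablated matrix has no output component along the direction, which is the uniform projection that naive tools apply everywhere. Scaling `strength` per layer by a kernel weight is one way to read "tuning the ablation kernel": layers far from the peak are touched lightly or not at all.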
Optuna's TPE optimizer searches the parameter space against a joint objective that balances refusal rate on harmful prompts against KL divergence from the original model on benign prompts. For Gemma-3-12B-IT, that produces 3 refusals out of 100 harmful prompts at a KL divergence of 0.16, compared to 0.45 to 1.04 for previously published abliterations at similar refusal rates. The lower KL number is the load-bearing claim: the model still behaves like the original on everything except refusals.

## Model and Hardware Coverage

Heretic supports dense transformers, multimodal variants, MoE architectures, and hybrids like Qwen3.5. Pure state-space models and a few research architectures are unsupported. The default config for Qwen3-4B-Instruct-2507 runs in 20 to 30 minutes on a single RTX 3090, and bitsandbytes 4-bit quantization brings VRAM use down enough to handle larger models on consumer hardware. The system benchmarks the host GPU at startup to pick batch sizes automatically.

## Research Tooling

An optional [research] extra installs visualization and geometry tools: PaCMAP projections of residual vectors per layer, cosine similarity and L2 norm reports, and silhouette coefficient calculations for the harmful/harmless cluster separation. These produce per-layer PNGs and animated GIFs that help diagnose where the refusal direction lives and how cleanly it separates from useful semantics.

## Workflow

After optimization, the user can save the ablated weights, upload them to Hugging Face, or drop into an interactive chat to sanity-check the result. Standard benchmarks like MMLU and GSM8K are not built in; community users report them separately, and the low KL divergence numbers from the optimizer typically correspond to small drops on these benchmarks.

## Why It Matters Even If You Disagree With It

Abliteration is one of the cleanest interpretability demonstrations of how alignment is implemented in current models: a single low-dimensional direction does much of the work.
Heretic is useful as a methodological reference even for researchers who never plan to deploy a decensored model, because it shows how brittle directional safety actually is and how much capability is preserved when it is removed cleanly.

## Limitations and Ethical Context

The project explicitly targets safety alignment removal, and the README does not engage with misuse questions. AGPL-3.0 imposes copyleft obligations on any service built on top of it. Pure state-space models and certain research architectures are out of scope. Finally, the technique depends on refusal living along a single direction: a model whose refusal behavior is distributed across many directions or implemented through external classifiers will not yield cleanly to this approach.