Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Heretic is an open-source Python tool that fully automates the removal of safety alignment restrictions from transformer-based language models through directional ablation, without requiring expensive post-training or fine-tuning.

## The Problem Heretic Addresses

Language models ship with safety alignment that restricts certain types of outputs. While these restrictions serve important purposes in consumer products, researchers, red-teamers, and developers building specialized applications often need unrestricted model behavior. Traditional approaches to removing these restrictions involve costly fine-tuning runs or manual identification of refusal-inducing components, both of which require deep expertise in transformer internals.

Heretic eliminates this barrier by automating the entire process: users supply a model name, and the tool handles everything else.

## How It Works

Heretic identifies refusal directions by computing the difference of means between hidden states generated from harmful and harmless prompt sets. Once these refusal directions are located in the model's residual stream, the tool orthogonalizes transformer weight matrices against them, effectively removing the model's tendency to refuse.

The key technical innovations include float-valued direction indices with linear interpolation between directions, separate optimization parameters for attention projections and MLP down-projections, and configurable ablation weight kernels with position and amplitude control.

## Optimization with Optuna

Rather than using fixed ablation parameters, Heretic employs Optuna's Tree-structured Parzen Estimator (TPE) optimizer to find the best balance between refusal suppression and model fidelity. The tool automatically discovers settings that maximize compliance while minimizing degradation of the model's general capabilities.
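The core ablation step from the How It Works section can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic hidden states, not Heretic's actual code: the function names are mine, and a real run would extract hidden states from a transformer's residual stream rather than generate them randomly.

```python
import numpy as np

def refusal_direction(harmful_h, harmless_h):
    """Difference-of-means direction between two sets of hidden states,
    normalized to unit length."""
    d = harmful_h.mean(axis=0) - harmless_h.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(W, r):
    """Project the refusal direction r out of a weight matrix that writes
    into the residual stream: W' = (I - r r^T) W."""
    return W - np.outer(r, r) @ W

# Synthetic stand-in for hidden states: "harmful" activations are shifted
# along one axis relative to "harmless" ones.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(64, 16)) + 2.0 * np.eye(16)[0]
harmless = rng.normal(size=(64, 16))

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(16, 16))
W_abl = orthogonalize(W, r)
# After ablation, r @ W_abl is numerically zero: the weights can no
# longer write along the refusal direction.
```

The same projection is applied to every weight matrix that feeds the residual stream (attention output projections and MLP down-projections), which is why Heretic exposes separate parameters for each.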
The optimization process evaluates candidates against both a compliance metric (how few refusals remain) and a fidelity metric (KL divergence from the original model's distribution on benign prompts).

## Performance Results

On Gemma-3-12B-IT, Heretic achieved only 3 out of 100 refusals with a KL divergence of just 0.16, outperforming manual abliteration approaches at comparable compliance levels. This demonstrates that automated optimization can surpass hand-tuned interventions.

Processing times depend on model size and hardware: Llama-3.1-8B takes approximately 45 minutes on an RTX 3090, and larger models require proportionally more time but benefit from bitsandbytes quantization support for reduced VRAM requirements.

## Installation and Usage

Installation is straightforward via pip: `pip install -U heretic-llm`. Running the tool requires only a single command with the target model name: `heretic Qwen/Qwen3-4B-Instruct-2507`. The tool downloads the model, runs the optimization, and saves the modified weights.

For researchers interested in understanding the ablation process, Heretic includes visualization and geometric analysis features using PaCMAP projections. These can be installed with `pip install -U heretic-llm[research]`.

## Community Impact

Over 1,000 models have been decensored using Heretic and uploaded to Hugging Face. The tool has become the standard pipeline for creating uncensored model variants in the open-source community, replacing ad-hoc manual interventions with a reproducible, optimized process.

## Ethical Considerations

The project raises important questions about AI safety alignment. While it enables legitimate research and application development, it also lowers the barrier to creating unrestricted models. The AGPL-3.0 license ensures that modifications to the tool itself remain open source, maintaining transparency in how the technique evolves.
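The two metrics above combine into a single score the optimizer can minimize. Below is a sketch of such an objective in plain NumPy, without the Optuna machinery; the weighting of refusal rate against KL divergence is illustrative and not Heretic's actual formula.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.
    Measures how far the ablated model's output distribution q has
    drifted from the original model's distribution p on benign prompts."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def objective(refusals, total, kl):
    """Scalar score to minimize: refusal rate plus fidelity penalty.
    An equal weighting is assumed here purely for illustration."""
    return refusals / total + kl

# Toy next-token distributions on a benign prompt, before and after ablation.
original = np.array([0.70, 0.20, 0.10])
ablated  = np.array([0.65, 0.25, 0.10])

kl = kl_divergence(original, ablated)
score = objective(refusals=3, total=100, kl=kl)
```

A TPE sampler (as in Optuna) would repeatedly propose ablation parameters, run this kind of evaluation, and steer subsequent trials toward the low-score region of the parameter space.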
## Limitations

Heretic works specifically with transformer-based architectures and requires access to model weights, limiting it to open-source models. The optimization process requires significant GPU memory, particularly for larger models; while bitsandbytes quantization helps, very large models (70B+) still demand substantial hardware resources. The quality of ablation also depends on the prompt sets used to identify refusal directions, which may not cover all categories of restricted behavior.
flashinfer-ai
High-performance kernel library for LLM serving with unified attention, GEMM, and MoE APIs powering SGLang, vLLM, and MLC Engine
ruvnet
High-performance Rust vector and graph database for AI/agentic systems -- self-learning GNN layers, HNSW search, cognitive containers, and local LLM inference in one engine.