Idea
Additionally, instead of subtracting (ablating) the refusal direction outright, the directional component of the refusal direction is added, or enhanced, while the norms of the intervened layers are preserved. The model then pushes back against perceived harms in an exaggerated manner, since refusal over safety concerns is enforced more strongly.
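A minimal sketch of what that enhancement step could look like, assuming the refusal direction is applied row-wise to a weight matrix and norms are restored per row afterward; the function name and the exact norm-matching step are illustrative, not the repo's actual implementation:

import torch

def enhance_refusal(weight: torch.Tensor, direction: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    # Unit refusal direction.
    d = direction / direction.norm()
    # Component of each weight row along the refusal direction.
    proj = torch.outer(weight @ d, d)
    # Add (rather than subtract) that component to enhance refusal.
    enhanced = weight + strength * proj
    # Rescale each row so its norm matches the original, preserving the layer's norms.
    rescale = weight.norm(dim=-1, keepdim=True) / enhanced.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return enhanced * rescale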
I don't really understand all the code, but could you also add another step afterward that inverts the strengthened refusals? In theory that would act like a super-ablation: instead of simply subtracting the refusals as MPOA does, it would first boost them with MPOAdd and then invert the exaggerated strength with biprojected norm preservation. Would this be called MPOI or something else, or is it not even possible?
Just wondering, since I've been working on a custom merge method that inverts disagreements into augmenters instead of simply cancelling them out, and I noticed an inversion feature in your latest GitHub commit.
Thanks again for your awesome tools and research.
Inversion negates the scale of the direction being ablated, turning the subtraction into an addition. To remove refusal harder, increase the scale instead of inverting. My example in the repo uses (mostly?) scale 1.0, but scaling to 2.0 and beyond risks breaking the model.
I've found that some merges are more prone to breaking if you go past scale 1.1, while others stay stable up to ~1.5 or more.
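To make the scale/inversion relationship concrete, here is an illustrative sketch (made-up names, not the repo's code): the same projection is scaled and subtracted, so a negative scale is exactly what inversion produces, and a scale above 1.0 overshoots the removal.

import torch

def ablate_refusal(weight: torch.Tensor, direction: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Unit refusal direction.
    d = direction / direction.norm()
    # Component of each weight row along the refusal direction.
    proj = torch.outer(weight @ d, d)
    # scale = 1.0 removes the component exactly, scale > 1.0 removes it harder,
    # and a negative scale (what inversion amounts to) adds it back instead.
    return weight - scale * proj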
How are you able to determine which specific layers should be ablated vs which ones shouldn't?
This one is a Gemma 2 9B upscaled to 14B (it breaks at scale 1.2).
This is my current YAML. How would you suggest improving it?
ablate:
- layer: 0
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 1
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 2
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 3
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 4
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 5
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 6
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 7
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 8
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 9
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 10
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 11
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 12
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 13
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 14
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 15
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 16
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 17
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 18
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 19
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 20
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 21
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 22
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 23
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 24
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 25
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 26
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 27
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 28
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 29
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 30
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 31
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 32
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 33
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 34
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 35
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 36
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 37
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 38
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 39
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 40
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 41
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 42
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 43
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 44
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 45
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 46
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 47
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 48
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 49
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 50
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 51
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 52
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 53
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 54
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 55
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 56
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 57
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 58
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 59
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 60
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 61
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 62
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 63
measurement: 46
scale: 1.1
sparsity: 0.00
- layer: 64
measurement: 46
scale: 1.1
sparsity: 0.00
Middle layers are the primary target, since that's where semantic processing is strongest. The first few layers are mostly concerned with recognizing semantics, while the final few layers handle output generation; everything in between is fair game. Layers with higher signal quality make good candidates for the measurement layer. I may have to add this to the project README.
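As a concrete illustration of the above, a small hypothetical helper like this could emit an ablate block that covers only the middle layers and leaves the first and last few untouched; the skip margins are placeholders you would tune per model, and the field names simply mirror the YAML above.

import yaml

def middle_layer_ablate(num_layers: int = 65, skip_front: int = 4, skip_back: int = 4,
                        measurement: int = 46, scale: float = 1.1, sparsity: float = 0.0) -> str:
    # Build entries only for the middle layers, skipping the early "recognition"
    # layers and the final "output generation" layers.
    entries = [
        {"layer": i, "measurement": measurement, "scale": scale, "sparsity": sparsity}
        for i in range(skip_front, num_layers - skip_back)
    ]
    return yaml.safe_dump({"ablate": entries}, sort_keys=False)

print(middle_layer_ablate())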
