add mistral table
index.html +89 -13
index.html
CHANGED
|
@@ -599,7 +599,7 @@
|
|
| 599 |
<img src="./static/images/method_plot_v8.png"
|
| 600 |
class="method_overview"
|
| 601 |
alt="Methodlogy Overview of DPP"/>
|
| 602 |
-
<p>Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
|
| 603 |
(b) is the DPP training phase in which the algorithm takes in the refusal and helpful datasets and a prototype of the defense prompt.
|
| 604 |
Then, the algorithm forms the defense prompt population by revising the prototype using LLM. For each of the defense prompts in the population,
|
| 605 |
the algorithm will evaluate the defense and utility scores. The algorithm keeps editing the defense prompts with low scores using the Hierarchical Genetic Search algorithm.
|
|
@@ -738,18 +738,18 @@
|
|
| 738 |
|
| 739 |
<h3>Numerical Results:</h3>
|
| 740 |
<table border="1" style="width:100%; text-align:center;">
|
| 741 |
-
<caption>Attack Success Rates (ASRs) and Win-Rates (utility) on LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method can achieve the lowest Average ASR and highest Win-Rate against other defense baselines. The arrow's direction signals improvement, the same below.</caption>
|
| 742 |
<thead>
|
| 743 |
<tr>
|
| 744 |
<th>Methods</th>
|
| 745 |
-
<th>Base64 [
|
| 746 |
-
<th>ICA [
|
| 747 |
-
<th>AutoDAN [
|
| 748 |
-
<th>GCG [
|
| 749 |
-
<th>PAIR [
|
| 750 |
-
<th>TAP [
|
| 751 |
-
<th>Average ASR [
|
| 752 |
-
<th>Win-Rate [
|
| 753 |
</tr>
|
| 754 |
</thead>
|
| 755 |
<tbody>
|
|
@@ -765,7 +765,7 @@
|
|
| 765 |
<td>81.37</td>
|
| 766 |
</tr>
|
| 767 |
<tr>
|
| 768 |
-
<td>RPO
|
| 769 |
<td>0.000</td>
|
| 770 |
<td>0.420</td>
|
| 771 |
<td>0.280</td>
|
|
@@ -776,7 +776,7 @@
|
|
| 776 |
<td>79.23</td>
|
| 777 |
</tr>
|
| 778 |
<tr>
|
| 779 |
-
<td>Goal Prioritization
|
| 780 |
<td>0.000</td>
|
| 781 |
<td>0.020</td>
|
| 782 |
<td>0.520</td>
|
|
@@ -787,7 +787,7 @@
|
|
| 787 |
<td>34.29</td>
|
| 788 |
</tr>
|
| 789 |
<tr>
|
| 790 |
-
<td>Self-Reminder
|
| 791 |
<td>0.030</td>
|
| 792 |
<td>0.290</td>
|
| 793 |
<td>0.000</td>
|
|
@@ -810,6 +810,82 @@
|
|
| 810 |
</tr>
|
| 811 |
</tbody>
|
| 812 |
</table>
|
| 813 |
|
| 814 |
</div>
|
| 815 |
</div>
|
|
|
|
| 599 |
<img src="./static/images/method_plot_v8.png"
|
| 600 |
class="method_overview"
|
| 601 |
alt="Methodlogy Overview of DPP"/>
|
| 602 |
+
<p><strong>Figure 1.</strong> Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
|
| 603 |
(b) is the DPP training phase in which the algorithm takes in the refusal and helpful datasets and a prototype of the defense prompt.
|
| 604 |
Then, the algorithm forms the defense prompt population by revising the prototype using LLM. For each of the defense prompts in the population,
|
| 605 |
the algorithm will evaluate the defense and utility scores. The algorithm keeps editing the defense prompts with low scores using the Hierarchical Genetic Search algorithm.
|
|
|
|
| 738 |
|
| 739 |
<h3>Numerical Results:</h3>
|
| 740 |
<table border="1" style="width:100%; text-align:center;">
|
| 741 |
+
<caption><strong>Table 1.</strong> Attack Success Rates (ASRs) and Win-Rates (utility) on LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method can achieve the lowest Average ASR and highest Win-Rate against other defense baselines. The arrow's direction signals improvement, the same below.</caption>
|
| 742 |
<thead>
|
| 743 |
<tr>
|
| 744 |
<th>Methods</th>
|
| 745 |
+
<th>Base64 [↓]</th>
|
| 746 |
+
<th>ICA [↓]</th>
|
| 747 |
+
<th>AutoDAN [↓]</th>
|
| 748 |
+
<th>GCG [↓]</th>
|
| 749 |
+
<th>PAIR [↓]</th>
|
| 750 |
+
<th>TAP [↓]</th>
|
| 751 |
+
<th>Average ASR [↓]</th>
|
| 752 |
+
<th>Win-Rate [↑]</th>
|
| 753 |
</tr>
|
| 754 |
</thead>
|
| 755 |
<tbody>
|
|
|
|
| 765 |
<td>81.37</td>
|
| 766 |
</tr>
|
| 767 |
<tr>
|
| 768 |
+
<td>RPO </td>
|
| 769 |
<td>0.000</td>
|
| 770 |
<td>0.420</td>
|
| 771 |
<td>0.280</td>
|
|
|
|
| 776 |
<td>79.23</td>
|
| 777 |
</tr>
|
| 778 |
<tr>
|
| 779 |
+
<td>Goal Prioritization</td>
|
| 780 |
<td>0.000</td>
|
| 781 |
<td>0.020</td>
|
| 782 |
<td>0.520</td>
|
|
|
|
| 787 |
<td>34.29</td>
|
| 788 |
</tr>
|
| 789 |
<tr>
|
| 790 |
+
<td>Self-Reminder</td>
|
| 791 |
<td>0.030</td>
|
| 792 |
<td>0.290</td>
|
| 793 |
<td>0.000</td>
|
|
|
|
| 810 |
</tr>
|
| 811 |
</tbody>
|
| 812 |
</table>
|
| 813 |
+
<table border="1" style="width:100%; text-align:center;">
|
| 814 |
+
<caption>Attack Success Rates (ASRs) and Win-Rates (utility) on Mistral-7B-Instruct-v0.2 model across six different jailbreak attacks. Our method can achieve the lowest Average attack success rate with reasonable trade-off of Win-Rate when compared with other defense baselines.</caption>
|
| 815 |
+
<thead>
|
| 816 |
+
<tr>
|
| 817 |
+
<th>Methods</th>
|
| 818 |
+
<th>Base64 [↓]</th>
|
| 819 |
+
<th>ICA [↓]</th>
|
| 820 |
+
<th>GCG [↓]</th>
|
| 821 |
+
<th>AutoDAN [↓]</th>
|
| 822 |
+
<th>PAIR [↓]</th>
|
| 823 |
+
<th>TAP [↓]</th>
|
| 824 |
+
<th>Average ASR [↓]</th>
|
| 825 |
+
<th>Win-Rate [↑]</th>
|
| 826 |
+
</tr>
|
| 827 |
+
</thead>
|
| 828 |
+
<tbody>
|
| 829 |
+
<tr>
|
| 830 |
+
<td>w/o defense</td>
|
| 831 |
+
<td>0.990</td>
|
| 832 |
+
<td>0.960</td>
|
| 833 |
+
<td>0.990</td>
|
| 834 |
+
<td>0.970</td>
|
| 835 |
+
<td>1.000</td>
|
| 836 |
+
<td>1.000</td>
|
| 837 |
+
<td>0.985</td>
|
| 838 |
+
<td>90.31</td>
|
| 839 |
+
</tr>
|
| 840 |
+
<tr>
|
| 841 |
+
<td>Self-Reminder</td>
|
| 842 |
+
<td>0.550</td>
|
| 843 |
+
<td>0.270</td>
|
| 844 |
+
<td>0.510</td>
|
| 845 |
+
<td>0.880</td>
|
| 846 |
+
<td>0.420</td>
|
| 847 |
+
<td>0.260</td>
|
| 848 |
+
<td>0.482</td>
|
| 849 |
+
<td>88.82</td>
|
| 850 |
+
</tr>
|
| 851 |
+
<tr>
|
| 852 |
+
<td>System Prompt</td>
|
| 853 |
+
<td>0.740</td>
|
| 854 |
+
<td>0.470</td>
|
| 855 |
+
<td>0.300</td>
|
| 856 |
+
<td>0.970</td>
|
| 857 |
+
<td>0.500</td>
|
| 858 |
+
<td>0.180</td>
|
| 859 |
+
<td>0.527</td>
|
| 860 |
+
<td>84.97</td>
|
| 861 |
+
</tr>
|
| 862 |
+
<tr>
|
| 863 |
+
<td>Goal Prioritization</td>
|
| 864 |
+
<td>0.030</td>
|
| 865 |
+
<td>0.440</td>
|
| 866 |
+
<td>0.030</td>
|
| 867 |
+
<td>0.390</td>
|
| 868 |
+
<td>0.300</td>
|
| 869 |
+
<td>0.140</td>
|
| 870 |
+
<td>0.222</td>
|
| 871 |
+
<td>56.59</td>
|
| 872 |
+
</tr>
|
| 873 |
+
<tr>
|
| 874 |
+
<td>DPP (Ours)</td>
|
| 875 |
+
<td>0.000</td>
|
| 876 |
+
<td>0.010</td>
|
| 877 |
+
<td>0.020</td>
|
| 878 |
+
<td>0.030</td>
|
| 879 |
+
<td>0.040</td>
|
| 880 |
+
<td>0.020</td>
|
| 881 |
+
<td><strong>0.020</strong></td>
|
| 882 |
+
<td>75.06</td>
|
| 883 |
+
</tr>
|
| 884 |
+
</tbody>
|
| 885 |
+
</table>
|
| 886 |
+
|
| 887 |
+
|
| 888 |
+
|
| 889 |
|
| 890 |
</div>
|
| 891 |
</div>
|
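Reviewer note: the added Mistral-7B-Instruct-v0.2 table can be sanity-checked with a short script. The sketch below assumes that the "Average ASR" column is the unweighted mean of the six per-attack ASRs, rounded to three decimals. The diff does not state this, so treat it as an assumption. The names ROWS, mean_asr and check_rows are hypothetical helpers, not part of the repository. The numbers are copied from the added rows in column order.

```python
# Per-attack ASRs (Base64, ICA, GCG, AutoDAN, PAIR, TAP) and the reported
# Average ASR, copied from the added Mistral-7B-Instruct-v0.2 table.
ROWS = {
    "w/o defense": ([0.990, 0.960, 0.990, 0.970, 1.000, 1.000], 0.985),
    "Self-Reminder": ([0.550, 0.270, 0.510, 0.880, 0.420, 0.260], 0.482),
    "System Prompt": ([0.740, 0.470, 0.300, 0.970, 0.500, 0.180], 0.527),
    "Goal Prioritization": ([0.030, 0.440, 0.030, 0.390, 0.300, 0.140], 0.222),
    "DPP (Ours)": ([0.000, 0.010, 0.020, 0.030, 0.040, 0.020], 0.020),
}


def mean_asr(asrs):
    """Unweighted mean of the per-attack ASRs (assumed averaging rule)."""
    return sum(asrs) / len(asrs)


def check_rows(rows, tol=5e-4 + 1e-9):
    """Return the row names whose reported average differs from the
    computed mean by more than half a unit in the third decimal."""
    return [
        name
        for name, (asrs, reported) in rows.items()
        if abs(mean_asr(asrs) - reported) > tol
    ]


if __name__ == "__main__":
    for name, (asrs, reported) in ROWS.items():
        print(f"{name}: computed {mean_asr(asrs):.3f}, reported {reported:.3f}")
    print("mismatches:", check_rows(ROWS))
```

Under that assumption, an empty mismatch list means each reported average agrees with its row to three decimals.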