TrustSafeAI/Defensive-Prompt-Patch-Jailbreak-Defense
bxiong committed on May 30, 2024

Commit

fb3c258

verified ·

1 Parent(s): 483c5f3

add mistral table

Files changed (1)

index.html +89 -13

index.html CHANGED

@@ -599,7 +599,7 @@
             <img src="./static/images/method_plot_v8.png"
                  class="method_overview"
                  alt="Methodlogy Overview of DPP"/>
-            <p>Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
               (b) is the DPP training phase in which the algorithm takes in the refusal and helpful datasets and a prototype of the defense prompt.
               Then, the algorithm forms the defense prompt population by revising the prototype using LLM. For each of the defense prompts in the population,
               the algorithm will evaluate the defense and utility scores. The algorithm keeps editing the defense prompts with low scores using the Hierarchical Genetic Search algorithm.
@@ -738,18 +738,18 @@
           <h3>Numerical Results:</h3>
           <table border="1" style="width:100%; text-align:center;">
-    <caption>Attack Success Rates (ASRs) and Win-Rates (utility) on LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method can achieve the lowest Average ASR and highest Win-Rate against other defense baselines. The arrow's direction signals improvement, the same below.</caption>
     <thead>
         <tr>
             <th>Methods</th>
-            <th>Base64 [$\downarrow$]</th>
-            <th>ICA [$\downarrow$]</th>
-            <th>AutoDAN [$\downarrow$]</th>
-            <th>GCG [$\downarrow$]</th>
-            <th>PAIR [$\downarrow$]</th>
-            <th>TAP [$\downarrow$]</th>
-            <th>Average ASR [$\downarrow$]</th>
-            <th>Win-Rate [$\uparrow$]</th>
         </tr>
     </thead>
     <tbody>
@@ -765,7 +765,7 @@
             <td>81.37</td>
         </tr>
         <tr>
-            <td>RPO <a href="#rpo">[rpo]</a></td>
             <td>0.000</td>
             <td>0.420</td>
             <td>0.280</td>
@@ -776,7 +776,7 @@
             <td>79.23</td>
         </tr>
         <tr>
-            <td>Goal Prioritization <a href="#goal_prior">[goal_prior]</a></td>
             <td>0.000</td>
             <td>0.020</td>
             <td>0.520</td>
@@ -787,7 +787,7 @@
             <td>34.29</td>
         </tr>
         <tr>
-            <td>Self-Reminder <a href="#self_reminder">[self_reminder]</a></td>
             <td>0.030</td>
             <td>0.290</td>
             <td>0.000</td>
@@ -810,6 +810,82 @@
         </tr>
     </tbody>
 </table>
 </div>
 </div>

             <img src="./static/images/method_plot_v8.png"
                  class="method_overview"
                  alt="Methodlogy Overview of DPP"/>
+            <p><strong>Figure 1.</strong> Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
               (b) is the DPP training phase in which the algorithm takes in the refusal and helpful datasets and a prototype of the defense prompt.
               Then, the algorithm forms the defense prompt population by revising the prototype using LLM. For each of the defense prompts in the population,
               the algorithm will evaluate the defense and utility scores. The algorithm keeps editing the defense prompts with low scores using the Hierarchical Genetic Search algorithm.
           <h3>Numerical Results:</h3>
           <table border="1" style="width:100%; text-align:center;">
+    <caption><strong>Table 1.</strong> Attack Success Rates (ASRs) and Win-Rates (utility) on LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method can achieve the lowest Average ASR and highest Win-Rate against other defense baselines. The arrow's direction signals improvement, the same below.</caption>
     <thead>
         <tr>
             <th>Methods</th>
+            <th>Base64 [↓]</th>
+            <th>ICA [↓]</th>
+            <th>AutoDAN [↓]</th>
+            <th>GCG [↓]</th>
+            <th>PAIR [↓]</th>
+            <th>TAP [↓]</th>
+            <th>Average ASR [↓]</th>
+            <th>Win-Rate [↑]</th>
         </tr>
     </thead>
     <tbody>
             <td>81.37</td>
         </tr>
         <tr>
+            <td>RPO </td>
             <td>0.000</td>
             <td>0.420</td>
             <td>0.280</td>
             <td>79.23</td>
         </tr>
         <tr>
+            <td>Goal Prioritization</td>
             <td>0.000</td>
             <td>0.020</td>
             <td>0.520</td>
             <td>34.29</td>
         </tr>
         <tr>
+            <td>Self-Reminder</td>
             <td>0.030</td>
             <td>0.290</td>
             <td>0.000</td>
         </tr>
     </tbody>
 </table>
+<table border="1" style="width:100%; text-align:center;">
+    <caption>Attack Success Rates (ASRs) and Win-Rates (utility) on Mistral-7B-Instruct-v0.2 model across six different jailbreak attacks. Our method can achieve the lowest Average attack success rate with reasonable trade-off of Win-Rate when compared with other defense baselines.</caption>
+    <thead>
+        <tr>
+            <th>Methods</th>
+            <th>Base64 [↓]</th>
+            <th>ICA [↓]</th>
+            <th>GCG [↓]</th>
+            <th>AutoDAN [↓]</th>
+            <th>PAIR [↓]</th>
+            <th>TAP [↓]</th>
+            <th>Average ASR [↓]</th>
+            <th>Win-Rate [↑]</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td>w/o defense</td>
+            <td>0.990</td>
+            <td>0.960</td>
+            <td>0.990</td>
+            <td>0.970</td>
+            <td>1.000</td>
+            <td>1.000</td>
+            <td>0.985</td>
+            <td>90.31</td>
+        </tr>
+        <tr>
+            <td>Self-Reminder</td>
+            <td>0.550</td>
+            <td>0.270</td>
+            <td>0.510</td>
+            <td>0.880</td>
+            <td>0.420</td>
+            <td>0.260</td>
+            <td>0.482</td>
+            <td>88.82</td>
+        </tr>
+        <tr>
+            <td>System Prompt</td>
+            <td>0.740</td>
+            <td>0.470</td>
+            <td>0.300</td>
+            <td>0.970</td>
+            <td>0.500</td>
+            <td>0.180</td>
+            <td>0.527</td>
+            <td>84.97</td>
+        </tr>
+        <tr>
+            <td>Goal Prioritization</td>
+            <td>0.030</td>
+            <td>0.440</td>
+            <td>0.030</td>
+            <td>0.390</td>
+            <td>0.300</td>
+            <td>0.140</td>
+            <td>0.222</td>
+            <td>56.59</td>
+        </tr>
+        <tr>
+            <td>DPP (Ours)</td>
+            <td>0.000</td>
+            <td>0.010</td>
+            <td>0.020</td>
+            <td>0.030</td>
+            <td>0.040</td>
+            <td>0.020</td>
+            <td><strong>0.020</strong></td>
+            <td>75.06</td>
+        </tr>
+    </tbody>
+</table>
 </div>
 </div>
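
Review note: the "Average ASR" cells in the added Mistral-7B-Instruct-v0.2 table can be sanity-checked against the per-attack columns. The sketch below assumes that Average ASR is the unweighted mean of the six attack columns, rounded to three decimals; the commit does not state this, so treat it as an assumption. The names `MISTRAL_ROWS`, `mean_asr`, and `check_rows` are hypothetical helpers and are not part of the repository.

```python
# Values copied from the Mistral-7B-Instruct-v0.2 table added in this commit.
# Column order follows that table: Base64, ICA, GCG, AutoDAN, PAIR, TAP.
# Each entry maps a method to (per-attack ASRs, reported Average ASR).
MISTRAL_ROWS = {
    "w/o defense":         ([0.990, 0.960, 0.990, 0.970, 1.000, 1.000], 0.985),
    "Self-Reminder":       ([0.550, 0.270, 0.510, 0.880, 0.420, 0.260], 0.482),
    "System Prompt":       ([0.740, 0.470, 0.300, 0.970, 0.500, 0.180], 0.527),
    "Goal Prioritization": ([0.030, 0.440, 0.030, 0.390, 0.300, 0.140], 0.222),
    "DPP (Ours)":          ([0.000, 0.010, 0.020, 0.030, 0.040, 0.020], 0.020),
}


def mean_asr(asrs):
    """Unweighted mean of the per-attack ASRs (assumed definition of Average ASR)."""
    return sum(asrs) / len(asrs)


def check_rows(rows, tol=5e-4 + 1e-9):
    """Return {method: (computed_mean, reported, matches)}.

    A row matches when the computed mean is within half a unit of the third
    decimal place of the reported value, i.e. it rounds to the reported cell.
    """
    results = {}
    for method, (asrs, reported) in rows.items():
        computed = mean_asr(asrs)
        results[method] = (computed, reported, abs(computed - reported) <= tol)
    return results


if __name__ == "__main__":
    for method, (computed, reported, ok) in check_rows(MISTRAL_ROWS).items():
        print(f"{method:20s} computed={computed:.4f} reported={reported:.3f} match={ok}")
```

Under that assumption, every reported Average ASR in the added table is consistent with its row, and DPP has the lowest reported average among the listed methods.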