Report for textattack/bert-base-uncased-SST-2
Hey Team!🤗✨
We’re thrilled to share some amazing evaluation results that’ll make your day!🎉📊
We have identified 12 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
👉Robustness issues (1)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Robustness | major 🔴 | — | Fail rate = 0.111 | Add typos | 90/812 tested samples (11.08%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 11.08% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 3 | the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . | the acting , costmes , mjsic , cinematography and sound are all asotunding given the production 's austere locales . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.99) |
| 38 | as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming . | as surreal as a eeam and as detailed as a photograph , as visually dexterlus as it is at tmes imafginatively overwhelming . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.78) |
| 41 | this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath . | this ipluminating documentary ffranscends kour preconceived visuon of the holy land and its ibhabitxants , reealing the human complexities beneath . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.83) |
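
The fail rate in the robustness table can be checked by hand. The sketch below is only an illustration of the arithmetic, not the scanner's own code; `fail_rate` is a hypothetical helper, and the counts 90 and 812 are the ones reported above.

```python
def fail_rate(original_preds, perturbed_preds):
    """Fraction of samples whose predicted label changes after perturbation."""
    if len(original_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    changed = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return changed / len(original_preds)


# Counts reported in the table: 90 of 812 tested samples changed prediction.
reported = 90 / 812
print(f"{reported:.3f} ({reported:.2%})")  # 0.111 (11.08%)

# Toy usage with made-up predictions: one of two labels flips.
print(fail_rate(["LABEL_1", "LABEL_1"], ["LABEL_1", "LABEL_0"]))  # 0.5
```
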
👉Performance issues (11)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | avg_word_length(text) < 4.618 AND avg_word_length(text) >= 4.483 | Precision = 0.788 | — | -14.19% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` < 4.618 AND `avg_word_length(text)` >= 4.483, the Precision is 14.19% lower than the global Precision.

| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 22 | holden caulfield did it better . | 4.5 | LABEL_0 | LABEL_1 (p = 0.99) |
| 95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 4.61538 | LABEL_0 | LABEL_1 (p = 1.00) |
| 115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 4.5 | LABEL_0 | LABEL_1 (p = 0.98) |
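
The `avg_word_length` values in the examples are consistent with the mean length of whitespace-separated tokens, with punctuation tokens counted. The sketch below reproduces the values for rows 22 and 115 under that assumption. It is not necessarily how the scanner computes the feature, and `in_slice` is a hypothetical helper.

```python
def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation tokens count)."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def in_slice(value: float, low: float, high: float) -> bool:
    """Half-open interval check matching the 'low <= x < high' slice rule."""
    return low <= value < high


rows = {
    22: "holden caulfield did it better .",
    115: "sam mendes has become valedictorian at the school "
         "for soft landings and easy ways out .",
}
for idx, text in rows.items():
    value = avg_word_length(text)
    print(idx, value, in_slice(value, 4.483, 4.618))  # both rows: 4.5 True
```
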
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | avg_whitespace(text) >= 0.178 AND avg_whitespace(text) < 0.182 | Precision = 0.788 | — | -14.19% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` >= 0.178 AND `avg_whitespace(text)` < 0.182, the Precision is 14.19% lower than the global Precision.

| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 22 | holden caulfield did it better . | 0.181818 | LABEL_0 | LABEL_1 (p = 0.99) |
| 95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 0.178082 | LABEL_0 | LABEL_1 (p = 1.00) |
| 115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 0.181818 | LABEL_0 | LABEL_1 (p = 0.98) |
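
The `avg_whitespace` values match the share of whitespace characters in the text if each raw sentence ends with one trailing space that the tables above do not show. That trailing space is an assumption inferred from the numbers. Under it, `avg_whitespace` equals `1 / (1 + avg_word_length)`, which may explain why this slice reports the same Precision as the `avg_word_length` slice. A minimal sketch:

```python
def avg_whitespace(text: str) -> float:
    """Fraction of characters in the text that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


# Assumption: one trailing space per raw sentence (not visible in the report).
raw_rows = {
    22: "holden caulfield did it better . ",
    115: "sam mendes has become valedictorian at the school "
         "for soft landings and easy ways out . ",
}
for idx, raw in raw_rows.items():
    direct = avg_whitespace(raw)
    via_word_length = 1 / (1 + avg_word_length(raw))
    print(idx, round(direct, 6), round(via_word_length, 6))  # 0.181818 twice
```
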
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | avg_word_length(text) < 3.867 AND avg_word_length(text) >= 3.691 | Recall = 0.840 | — | -10.13% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` < 3.867 AND `avg_word_length(text)` >= 3.691, the Recall is 10.13% lower than the global Recall.

| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 92 | you wo n't like roger , but you will quickly recognize him . | 3.69231 | LABEL_0 | LABEL_1 (p = 1.00) |
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 3.75 | LABEL_1 | LABEL_0 (p = 0.59) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 3.83333 | LABEL_0 | LABEL_1 (p = 1.00) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | avg_whitespace(text) >= 0.205 AND avg_whitespace(text) < 0.213 | Recall = 0.840 | — | -10.13% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` >= 0.205 AND `avg_whitespace(text)` < 0.213, the Recall is 10.13% lower than the global Recall.

| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 92 | you wo n't like roger , but you will quickly recognize him . | 0.213115 | LABEL_0 | LABEL_1 (p = 1.00) |
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 0.210526 | LABEL_1 | LABEL_0 (p = 0.59) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 0.206897 | LABEL_0 | LABEL_1 (p = 1.00) |
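
The report does not state the global metrics, but the Deviation column appears to be relative to them, so they can be estimated from the slice values. The sketch below assumes deviation = (slice - global) / global; both function names are hypothetical. Applied to the Recall rows in this report, it gives a consistent global Recall of about 0.935.

```python
def relative_deviation(slice_metric: float, global_metric: float) -> float:
    """Relative change of a slice metric with respect to the global metric."""
    return (slice_metric - global_metric) / global_metric


def implied_global(slice_metric: float, deviation_pct: float) -> float:
    """Invert the formula above: global = slice / (1 + deviation)."""
    return slice_metric / (1 + deviation_pct / 100)


# (Recall on slice, reported deviation in percent), taken from this report.
recall_rows = [(0.840, -10.13), (0.870, -6.97), (0.872, -6.73), (0.885, -5.36)]
for metric, deviation_pct in recall_rows:
    print(round(implied_global(metric, deviation_pct), 3))  # 0.935 each time
```
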
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | text contains "movie" | Precision = 0.837 | — | -8.81% than global |
🔍✨Examples
For records in the dataset where `text` contains "movie", the Precision is 8.81% lower than the global Precision.

| | text | label | Predicted label |
|---|---|---|---|
| 69 | this one is definitely one to skip , even for horror movie fanatics . | LABEL_0 | LABEL_1 (p = 0.95) |
| 172 | it seems like i have been waiting my whole life for this movie and now i ca n't wait for the sequel . | LABEL_1 | LABEL_0 (p = 0.72) |
| 509 | a movie that successfully crushes a best selling novel into a timeframe that mandates that you avoid the godzilla sized soda . | LABEL_1 | LABEL_0 (p = 0.91) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | idx >= 826.500 | Precision = 0.846 | — | -7.84% than global |
🔍✨Examples
For records in the dataset where `idx` >= 826.500, the Precision is 7.84% lower than the global Precision.

| | idx | label | Predicted label |
|---|---|---|---|
| 827 | 827 | LABEL_0 | LABEL_1 (p = 0.91) |
| 829 | 829 | LABEL_0 | LABEL_1 (p = 0.98) |
| 832 | 832 | LABEL_0 | LABEL_1 (p = 0.80) |
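
Since `idx` is a row index, it helps to know how many rows each `idx` slice covers before reading much into it. The sketch below counts them, assuming the validation split has 872 rows indexed 0 to 871 (the size of the GLUE SST-2 validation split; the report itself does not state it). `slice_size` is a hypothetical helper.

```python
N_ROWS = 872  # assumption: rows in the SST-2 validation split, idx 0..871


def slice_size(low: float, high: float = float("inf"), n_rows: int = N_ROWS) -> int:
    """Number of integer row indices idx with low <= idx < high."""
    return sum(1 for idx in range(n_rows) if low <= idx < high)


# The three idx slices flagged in this report.
print(slice_size(826.5))         # 45
print(slice_size(500.5, 546.5))  # 46
print(slice_size(121.5, 182.5))  # 61
```

With only a few dozen rows per slice, a handful of misclassifications is enough to move the metric by several percent.
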
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | text_length(text) < 82.500 AND text_length(text) >= 73.500 | Recall = 0.870 | — | -6.97% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 82.500 AND `text_length(text)` >= 73.500, the Recall is 6.97% lower than the global Recall.

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 76 | LABEL_1 | LABEL_0 (p = 0.59) |
| 142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | LABEL_1 | LABEL_0 (p = 0.98) |
| 411 | i do n't mind having my heartstrings pulled , but do n't treat me like a fool . | 80 | LABEL_0 | LABEL_1 (p = 0.95) |
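
The `text_length` values match the character count of each sentence plus one. The sketch below assumes the extra character is a trailing space in the raw dataset text, which the table does not show; this is an inference from the numbers, not something the report states.

```python
def text_length(text: str) -> int:
    """Number of characters in the text, whitespace included."""
    return len(text)


# Assumption: each raw sentence ends with one trailing space.
raw_rows = {
    93: "if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . ",
    142: "what better message than ` love thyself ' could young women of any size receive ? ",
    411: "i do n't mind having my heartstrings pulled , but do n't treat me like a fool . ",
}
for idx, raw in raw_rows.items():
    length = text_length(raw)
    # Slice rule from the table: 73.5 <= text_length < 82.5
    print(idx, length, 73.5 <= length < 82.5)  # 76, 82, 80; all True
```
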
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | text_length(text) >= 165.500 AND text_length(text) < 183.500 | Recall = 0.872 | — | -6.73% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` >= 165.500 AND `text_length(text)` < 183.500, the Recall is 6.73% lower than the global Recall.

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 179 | LABEL_1 | LABEL_0 (p = 0.85) |
| 282 | while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer | 166 | LABEL_1 | LABEL_0 (p = 1.00) |
| 292 | the story and the friendship proceeds in such a way that you 're watching a soap opera rather than a chronicle of the ups and downs that accompany lifelong friendships . | 170 | LABEL_0 | LABEL_1 (p = 0.88) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | text_length(text) < 98.500 AND text_length(text) >= 86.500 | Precision = 0.861 | — | -6.21% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 98.500 AND `text_length(text)` >= 86.500, the Precision is 6.21% lower than the global Precision.

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 88 | LABEL_0 | LABEL_1 (p = 0.98) |
| 230 | reign of fire looks as if it was made without much thought -- and is best watched that way . | 93 | LABEL_1 | LABEL_0 (p = 1.00) |
| 519 | moretti 's compelling anatomy of grief and the difficult process of adapting to loss . | 87 | LABEL_0 | LABEL_1 (p = 1.00) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | idx >= 500.500 AND idx < 546.500 | Accuracy = 0.870 | — | -5.92% than global |
🔍✨Examples
For records in the dataset where `idx` >= 500.500 AND `idx` < 546.500, the Accuracy is 5.92% lower than the global Accuracy.

| | idx | label | Predicted label |
|---|---|---|---|
| 501 | 501 | LABEL_1 | LABEL_0 (p = 1.00) |
| 509 | 509 | LABEL_1 | LABEL_0 (p = 0.91) |
| 519 | 519 | LABEL_0 | LABEL_1 (p = 1.00) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | idx >= 121.500 AND idx < 182.500 | Recall = 0.885 | — | -5.36% than global |
🔍✨Examples
For records in the dataset where `idx` >= 121.500 AND `idx` < 182.500, the Recall is 5.36% lower than the global Recall.

| | idx | label | Predicted label |
|---|---|---|---|
| 142 | 142 | LABEL_1 | LABEL_0 (p = 0.98) |
| 143 | 143 | LABEL_1 | LABEL_0 (p = 0.89) |
| 171 | 171 | LABEL_0 | LABEL_1 (p = 0.67) |
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
💡 What's Next?
- Check out the Giskard Space and improve your model.
- The Giskard community is always buzzing with ideas. 🐢🤔 What do you want to see next? Your feedback is our favorite fuel, so drop your thoughts in the community forum! 🗣️💬 Together, we're building something extraordinary.
🙌 Big Thanks!
We're grateful to have you on this adventure with us. 🚀🌟 Here's to more breakthroughs, laughter, and code magic! 🥂✨ Keep hugging that code and spreading the love! 💻 #Giskard #Huggingface #AISafety 🌈👏 Your enthusiasm, feedback, and contributions are what we seek. 🌟 Keep being awesome!