Report for AdamCodd/distilbert-base-uncased-finetuned-sentiment-amazon
#94, opened by giskard-bot
Hi Team,
This is a report from Giskard Bot Scan 🐢.
We have identified 7 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
👉Overconfidence issues (2)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Overconfidence | major 🔴 | avg_word_length(text) >= 4.481 | Overconfidence rate = 0.804 | — | +28.70% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.481, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 4.61538 | negative | positive (p = 1.00), negative (p = 0.00) |
| 643 | the jabs it employs are short , carefully placed and dead-center . | 4.58333 | positive | negative (p = 1.00), positive (p = 0.00) |
| 218 | all that 's missing is the spontaneity , originality and delight . | 4.58333 | negative | positive (p = 0.99), negative (p = 0.01) |
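
The report does not spell out how the overconfidence rate is computed. The sketch below shows one plausible reading: among the wrong predictions, count those where the probability of the predicted label exceeds the probability of the true label by more than a cut-off. The function name `overconfidence_rate` and the `threshold` value are assumptions made for illustration and are not taken from the scan. The figures quoted above (37 samples, 80.43%) imply 46 wrong predictions in the slice, since 37 / 46 ≈ 0.8043.

```python
import numpy as np


def overconfidence_rate(probs, labels, threshold=0.5):
    """Share of wrong predictions whose confidence gap exceeds `threshold`.

    probs:     (n_samples, n_classes) predicted probabilities
    labels:    (n_samples,) integer true labels
    threshold: hypothetical cut-off, NOT taken from the report
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)
    wrong = predicted != labels
    if not wrong.any():
        return 0.0
    rows = np.arange(len(labels))
    # Gap between the probability of the predicted class and of the true class.
    gap = probs[rows, predicted] - probs[rows, labels]
    overconfident = wrong & (gap > threshold)
    return float(overconfident.sum() / wrong.sum())


# Toy data (made up): three wrong predictions, two of them with a large gap.
toy_probs = [[0.00, 1.00], [1.00, 0.00], [0.45, 0.55], [0.90, 0.10]]
toy_labels = [0, 1, 0, 0]
toy_rate = overconfidence_rate(toy_probs, toy_labels)

# Arithmetic implied by the report: 37 overconfident out of 46 wrong predictions.
reported_rate = 37 / 46
```

With this definition, a prediction such as `positive (p = 1.00)` on a `negative` record has a gap of 1.00 and counts as overconfident for any cut-off below 1.
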
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Overconfidence | major 🔴 | avg_whitespace(text) < 0.182 | Overconfidence rate = 0.804 | — | +28.70% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.182, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 0.178082 | negative | positive (p = 1.00), negative (p = 0.00) |
| 643 | the jabs it employs are short , carefully placed and dead-center . | 0.179104 | positive | negative (p = 1.00), positive (p = 0.00) |
| 218 | all that 's missing is the spontaneity , originality and delight . | 0.179104 | negative | positive (p = 0.99), negative (p = 0.01) |
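
The slice features `avg_word_length(text)`, `avg_whitespace(text)` and `text_length(text)` are not defined in the report. The sketch below gives definitions that reproduce the values shown in the example rows, assuming the raw sst2 sentences carry one trailing space (without it, the whitespace ratio and the lengths come out slightly lower than reported). These are inferred definitions, not the scanner's actual implementation.

```python
def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation counts as a token)."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def avg_whitespace(text: str) -> float:
    """Fraction of characters in the text that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


def text_length(text: str) -> int:
    """Number of characters, whitespace included."""
    return len(text)


# Record 643 from the tables above, with the assumed trailing space.
sample = "the jabs it employs are short , carefully placed and dead-center . "
sample_word_length = round(avg_word_length(sample), 5)   # 55 chars / 12 tokens
sample_whitespace = round(avg_whitespace(sample), 6)     # 12 spaces / 67 chars

# Record 113 from the performance section, same assumption.
short_sample_length = text_length("this movie is maddening . ")
```

Because both features are driven by the same token counts, the two overconfidence slices select the same records, which is why they report identical rates.
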
👉Robustness issues (1)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Robustness | major 🔴 | — | Fail rate = 0.105 | Add typos | 84/800 tested samples (10.5%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 10.5% of the cases. We expected the predictions not to be affected by this transformation.

| | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 13 | we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . | we root for ( clara and paul ) , even like them , htough perhaps it 's an emotiom closer to pity . | positive (p = 0.75) | negative (p = 0.82) |
| 21 | the iditarod lasts for days - this just felt like it did . | the irditarod lasts for days - this just felt ike it did . | negative (p = 0.50) | positive (p = 0.53) |
| 33 | if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self-conscious to draw you deeply into its world . | if the mofvie succeeds in instilling a wary sense of ` gthere but got the grace f god , ' it is far topo self-conscious to draw ou deeply intk its world | negative (p = 0.99) | positive (p = 0.54) |
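
A fail rate of this kind can be reproduced in outline as follows. This is a minimal sketch, not the scanner's implementation: `add_typo` only swaps adjacent characters (the examples above also show insertions, deletions and substitutions), and `toy_predict` is a hypothetical keyword classifier standing in for the real model.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position (one simple kind of typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def fail_rate(predict, texts, perturb) -> float:
    """Fraction of texts whose predicted label changes after perturbation."""
    changed = sum(predict(t) != predict(perturb(t)) for t in texts)
    return changed / len(texts)


def toy_predict(text: str) -> str:
    """Hypothetical stand-in for the model: a brittle keyword rule."""
    return "positive" if "good" in text else "negative"


rng = random.Random(0)
texts = ["a good film", "a bad film", "good fun", "not for me"]
rate = fail_rate(toy_predict, texts, lambda t: add_typo(t, rng))

# The figure in the report: 84 changed predictions out of 800 tested samples.
reported_fail_rate = 84 / 800
```

To test the actual model, `toy_predict` would be replaced by a function that returns the model's predicted label for a string.
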
👉Performance issues (4)
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | text_length(text) < 37.500 | Recall = 0.800 | — | -12.08% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 37.500, the Recall is 12.08% lower than the global Recall.

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 1 | unflinchingly bleak and desperate | 34 | negative | positive (p = 0.86) |
| 112 | hilariously inept and ridiculous . | 35 | positive | negative (p = 0.99) |
| 113 | this movie is maddening . | 26 | negative | positive (p = 0.96) |
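
The report gives slice metrics and deviations but not the global values. The sketch below assumes the deviation is relative, i.e. (slice - global) / global. Under that assumption the two Recall findings in this report (0.800 at -12.08% and 0.828 at -9.05%) both point to a global Recall of roughly 0.91, which supports the relative reading but does not confirm it. All function names are illustrative.

```python
def recall(y_true, y_pred, positive=1) -> float:
    """Recall for the `positive` class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def relative_deviation(slice_metric: float, global_metric: float) -> float:
    """Deviation of a slice metric from the global metric, as a fraction of the global."""
    return (slice_metric - global_metric) / global_metric


def implied_global(slice_metric: float, deviation: float) -> float:
    """Invert relative_deviation to recover the global metric."""
    return slice_metric / (1 + deviation)


# Global Recall implied by the two Recall findings, assuming relative deviation.
global_from_short_texts = implied_global(0.800, -0.1208)
global_from_word_length = implied_global(0.828, -0.0905)

# Toy labels (made up) to show the slice computation itself.
toy_recall = recall([1, 1, 0, 1], [1, 0, 0, 1])
```
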
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | major 🔴 | text_length(text) < 65.500 AND text_length(text) >= 56.500 | Precision = 0.769 | — | -10.89% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 65.500 AND `text_length(text)` >= 56.500, the Precision is 10.89% lower than the global Precision.

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 92 | you wo n't like roger , but you will quickly recognize him . | 61 | negative | positive (p = 0.75) |
| 183 | the lower your expectations , the more you 'll enjoy it . | 58 | negative | positive (p = 0.97) |
| 312 | i 'll bet the video game is a lot more fun than the film . | 59 | negative | positive (p = 0.60) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | avg_word_length(text) >= 4.635 AND avg_word_length(text) < 4.743 | Recall = 0.828 | — | -9.05% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.635 AND `avg_word_length(text)` < 4.743, the Recall is 9.05% lower than the global Recall.

| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 4.6875 | negative | positive (p = 0.99) |
| 223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 4.70588 | positive | negative (p = 0.99) |
| 248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 4.72727 | positive | negative (p = 0.54) |
| Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
|---|---|---|---|---|---|
| Performance | medium 🟡 | avg_whitespace(text) < 0.177 AND avg_whitespace(text) >= 0.174 | Recall = 0.828 | — | -9.05% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.177 AND `avg_whitespace(text)` >= 0.174, the Recall is 9.05% lower than the global Recall.

| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 0.175824 | negative | positive (p = 0.99) |
| 223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 0.175258 | positive | negative (p = 0.99) |
| 248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 0.174603 | positive | negative (p = 0.54) |
Check out the Giskard Space and test your model.
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.