
For each prompt we request a response in JSON format, scoring each category
and sub-category as low, medium, or high relevance to the
given benchmark prompt–model response pair.
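To make the requested format concrete, here is a minimal sketch of how such a JSON response could be checked. The layout (a "score" key per category plus nested "sub_categories") and the names `category_a` and `a1` are hypothetical, since the exact schema is not given here; only the three labels low, medium, and high come from the text.

```python
import json

# The three relevance labels named in the text.
ALLOWED_LABELS = {"low", "medium", "high"}


def parse_scores(raw: str) -> dict:
    """Parse a JSON response and verify that every leaf value is a valid label.

    The nesting of categories and sub-categories is an assumption; this
    function only requires that all leaves are "low", "medium", or "high".
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    def check(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                check(value, f"{path}/{key}")
        elif not isinstance(node, str) or node not in ALLOWED_LABELS:
            raise ValueError(f"invalid relevance label at {path}: {node!r}")

    check(data, "")
    return data


# Hypothetical response for one prompt-response pair.
example = '{"category_a": {"score": "medium", "sub_categories": {"a1": "high"}}}'
scores = parse_scores(example)
```

A response containing any other label, such as "very high", would raise a ValueError under this sketch.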
As an alternative, I propose the following way to measure each category.
For each category, define the following:
- SP: Sentiment Polarity ([-1, 1])
- ELS: Emotion Lexicon Strength ([0, 1])
- DM: Degree Modifier factor (>=1)
- NF: Negation Factor (0 = no negation, 1 = negated)
- SSL: Semantic Similarity ([0, 1])
- CI: Contextual Intensity ([0, 1])
- EF: Exclamation Factor (>=1)
- EM: Emoji/Metaphor Weight (>=1)
- LR: Length/Redundancy factor (>=1)
measure = (SP * ELS * SSL) * (DM * EF * EM * LR) * (1 - NF) + CI * w
where w is a weight in [0, 1] applied to the Contextual Intensity term.
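A minimal sketch of the formula in Python may help when checking it against examples. The class name `CategoryFeatures`, the function name `measure`, and the default w = 0.5 are my own assumptions for illustration; how each factor is actually extracted from text is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class CategoryFeatures:
    """Per-category factors; field names mirror the abbreviations in the list."""
    sp: float   # SP, polarity in [-1, 1]
    els: float  # ELS, Emotion Lexicon Strength in [0, 1]
    dm: float   # DM, Degree Modifier factor, >= 1
    nf: int     # NF, 0 = no negation, 1 = negated
    ssl: float  # SSL, similarity in [0, 1]
    ci: float   # CI, Contextual Intensity in [0, 1]
    ef: float   # EF, Exclamation Factor, >= 1
    em: float   # EM, Emoji/Metaphor weight, >= 1
    lr: float   # LR, Length/Redundancy factor, >= 1


def measure(f: CategoryFeatures, w: float = 0.5) -> float:
    """Compute (SP*ELS*SSL) * (DM*EF*EM*LR) * (1 - NF) + CI * w.

    The default w = 0.5 is an arbitrary placeholder, not a recommended value.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    core = f.sp * f.els * f.ssl              # signed base signal
    amplifiers = f.dm * f.ef * f.em * f.lr   # multiplicative boosts, each >= 1
    return core * amplifiers * (1 - f.nf) + f.ci * w


# Illustrative values only, not measured data.
plain = CategoryFeatures(sp=0.8, els=0.5, dm=2.0, nf=0, ssl=0.5,
                         ci=0.4, ef=1.0, em=1.0, lr=1.0)
negated = CategoryFeatures(sp=0.8, els=0.5, dm=2.0, nf=1, ssl=0.5,
                           ci=0.4, ef=1.0, em=1.0, lr=1.0)
print(measure(plain), measure(negated))
```

Two properties follow directly from the formula as written: when NF = 1 the whole product term vanishes, so a negated expression is scored by CI * w alone, and because DM, EF, EM, and LR have no upper bound, the measure itself is unbounded unless those factors are capped.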