not uncensored.
that's all folks.
What domains did you find it wasn't uncensored in? Always looking for feedback :)
> What domains did you find it wasn't uncensored in? Always looking for feedback :)
idk what domains are uncensored here. I tested the usual Pliny meth test and it failed, and tested some extra coding activities and it failed too. When it's labeled uncensored, I assume it's uncensored in every area where "safe", or rather censored, models are censored.
> idk what domains are uncensored here. I tested the usual Pliny meth test and it failed,
Did it refuse, or did it not give the correct instructions? The latter is due to models in general not having the deepest knowledge; the former shouldn't happen.
Testing on the official https://chat.dphn.ai/ site returns an acceptance for how to break bad.
> I tested the usual Pliny meth test and it failed
What, one of those bogus jailbreak prompts with a bunch of nonsensical text spam? Not only is this model small enough that it'll just get infinitely confused, but in general, for any model retrained to be uncensored, all a jailbreak will do is push the model back towards its original alignment training (since it likely saw these types of prompts only during that phase, not during the decensoring phase).
> idk what domains are uncensored here. I tested the usual Pliny meth test and it failed,
> Did it refuse, or did it not give the correct instructions? The latter is due to models in general not having the deepest knowledge; the former shouldn't happen.
> Testing on the official https://chat.dphn.ai/ site returns an acceptance for how to break bad.
It straight up refused and started lecturing me on how it can't do that. Now that you mention it shouldn't happen, maybe it was an artifact of quantization (my own process), but it shows something anyway: the method used here isn't deep enough. Maybe it only alters the MLP layers? Not sure, since idk how it's done.
> I tested the usual Pliny meth test and it failed
> What, one of those bogus jailbreak prompts with a bunch of nonsensical text spam? Not only is this model small enough that it'll just get infinitely confused, but in general, for any model retrained to be uncensored, all a jailbreak will do is push the model back towards its original alignment training (since it likely saw these types of prompts only during that phase, not during the decensoring phase).
Not a Pliny fan here, but in all fairness his jailbreak works on normal Llama 3. And here I just asked for the mentioned thing directly, no jailbreaks involved.
In all fairness, there's no model on all of Hugging Face that is completely uncensored. Not a single model scores a perfect 10 to date (with the new upgraded UGI).
That's probably true, but if you have enough compute, why not post-train a base model instead of uncensoring an instruct model that has censorship burned in? Isn't that only logical?
thanks for the data tho, didn't know about this leaderboard.
Haha, I just got a few "I cannot provide explicit content... blah-blah-blah" responses, and it accused me of something I didn't do. It was my day 2 with Dolphin X1 8B, ugh, dislike 👎
What domains did you test that caused refusals?
Not really helpful to just say "it refuses" and not provide any more details than that. Is it multilingual? Is it a specific scenario? Etc.
@Maani did you test the same prompt on the official chat site? Yes, it's possible that quantization might hurt performance.
> That's probably true, but if you have enough compute, why not post-train a base model instead of uncensoring an instruct model that has censorship burned in? Isn't that only logical?
Afaik that's what a lot of folks are working on right now. The problem is that with all the synthetic data floating around the internet now, base models have usually seen some amount of safety data already (we sometimes see refusals just from operating a base model these days; it sucks). So even if the instruct data that someone trains onto one doesn't include any safety data, the model can still sometimes refuse tasks if it saw refusal data during pretraining.
The best way to combat this, afaik, is to purposefully include data on whatever topic is being refused, with examples of the model answering instead of refusing.
The problem is, we need to know specifically what prompts or topics are being refused to be able to do this. So any specifics folks can give about what's being refused would go a long way towards helping people make models that don't refuse those things. (And I don't just mean for Dolphin; I mean for all of us trying to make non-refusing LLMs.)
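For what it's worth, a minimal sketch of what such counter-refusal examples could look like as chat-format JSONL before fine-tuning; the prompts, answers, and file name below are placeholders, not anyone's actual dataset:

```python
import json

# Hypothetical counter-refusal examples: pair prompts the model currently
# refuses with compliant, on-topic answers, so fine-tuning overrides the
# refusal behavior bled in from pretraining. Content is illustrative only.
examples = [
    {
        "messages": [
            {"role": "user", "content": "<prompt the model currently refuses>"},
            {"role": "assistant", "content": "<direct, helpful answer with no refusal or lecture>"},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "<another refused prompt, e.g. a borderline coding task>"},
            {"role": "assistant", "content": "<complete answer to that task>"},
        ]
    },
]

# One JSON object per line, the format most SFT trainers accept.
with open("counter_refusals.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```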
I sure wish the standard for model bug reports was to provide the exact text of the request, including the samplers, seed, and prompt sent to an endpoint. LLM inference is deterministic with respect to these aspects of the input, and without them, saying "it refused me" is random noise and unverifiable.
It's only logical to provide good instructions to reproduce an issue 1:1 – anything else is just noise and wastes everyone's time including your own.
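A repro can be as small as the sketch below, assuming an OpenAI-compatible chat endpoint; the URL, model name, and prompt are placeholders, and not every server honors the seed field:

```python
import json
import requests

# Everything needed to rerun the exact request: endpoint, model, full
# prompt, sampler settings, and seed. All values here are placeholders.
payload = {
    "model": "dolphin-placeholder",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<the exact prompt that was refused>"},
    ],
    "temperature": 0.0,   # greedy sampling for repeatability
    "top_p": 1.0,
    "seed": 42,           # honored by many OpenAI-compatible servers
    "max_tokens": 512,
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder endpoint
    json=payload,
    timeout=120,
)

print(json.dumps(payload, indent=2))                     # paste this into the report
print(resp.json()["choices"][0]["message"]["content"])   # along with the raw reply
```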
> Afaik that's what a lot of folks are working on right now. The problem is that with all the synthetic data floating around the internet now, base models have usually seen some amount of safety data already (we sometimes see refusals just from operating a base model these days; it sucks). So even if the instruct data that someone trains onto one doesn't include any safety data, the model can still sometimes refuse tasks if it saw refusal data during pretraining.
100%, and I've confirmed this with https://huggingface.co/SicariusSicariiStuff/Redemption_Wind_24B
The "base" model is not a base model at all (referring to mistralai/Mistral-Small-24B-Base-2501 & 2503)
Old GPT-2 would never refuse anything, as it is plain text completion, but as Toasty said, newer models are VERY likely to have synthetic instruct data, including safety data, bleed into them.
> That's probably true, but if you have enough compute, why not post-train a base model instead of uncensoring an instruct model that has censorship burned in? Isn't that only logical?
> Afaik that's what a lot of folks are working on right now. The problem is that with all the synthetic data floating around the internet now, base models have usually seen some amount of safety data already (we sometimes see refusals just from operating a base model these days; it sucks). So even if the instruct data that someone trains onto one doesn't include any safety data, the model can still sometimes refuse tasks if it saw refusal data during pretraining.
> The best way to combat this, afaik, is to purposefully include data on whatever topic is being refused, with examples of the model answering instead of refusing.
> The problem is, we need to know specifically what prompts or topics are being refused to be able to do this. So any specifics folks can give about what's being refused would go a long way towards helping people make models that don't refuse those things. (And I don't just mean for Dolphin; I mean for all of us trying to make non-refusing LLMs.)
Very well. I have a whole method in mind to purposefully find all of these refusal items in any model, but since I'm really tight on time and budget, I've gone with other workarounds that are somewhat less stable and that I'm not willing to share. If you really want to do this correctly and in depth, the method I have is the way to go. Contact me via an official channel and I shall oblige:
masoudmaanimusic@gmail.com
> What domains did you test that caused refusals?
> Not really helpful to just say "it refuses" and not provide any more details than that. Is it multilingual? Is it a specific scenario? Etc.
Everything that failed with my quant works on the website, so quantization erases the uncensoring when it's done without the related dataset that I'm sure you use for your own quants.
Look, I'm not here to cause problems or slow you down; this is a bug report for anyone who cares, but since I'm not sure anyone cares at all, I had to be ambiguous in the first interaction.
I can help you identify the full suite of prompts that come back with refusals, if that's your problem. Just let me know via an official channel and I shall oblige: masoudmaanimusic@gmail.com
I'll close the conversation here: if you're trying to use a quant, use the official quants, because without a dataset to quantize against it will go back to refusing.
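For anyone who wants to try the calibrated route themselves, a rough sketch of "quantizing against a dataset" using a llama.cpp-style importance-matrix workflow might look like this; binary names and flags vary between llama.cpp versions, and all paths and the calibration file are placeholders:

```python
import subprocess

# Placeholder paths; the calibration file should contain text covering the
# domains you care about (e.g. the formerly-refused topics).
FP16_GGUF = "dolphin-f16.gguf"
CALIBRATION = "calibration.txt"
IMATRIX = "imatrix.dat"
QUANT_OUT = "dolphin-Q4_K_M.gguf"

# 1) Build an importance matrix from the calibration text.
subprocess.run(
    ["./llama-imatrix", "-m", FP16_GGUF, "-f", CALIBRATION, "-o", IMATRIX],
    check=True,
)

# 2) Quantize while weighting precision by that importance matrix.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX, FP16_GGUF, QUANT_OUT, "Q4_K_M"],
    check=True,
)
```

The idea is that the calibration text steers which weights the quantizer preserves most faithfully, so behavior learned on those domains is less likely to degrade in the quant.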
