Adversarial Training Attacks

#1
opened by TimeLordRaps

What are your thoughts on this somehow being corrupted in a meta-adversarial way, in which the training becomes poisoning-aware, so to speak, and learns deeper-level gradients from the compute that was already inherently placed into the images, if that makes sense? For example: I train on image A unpoisoned, and it produces a standard level of gradient change because my training is aware it is unpoisoned; then I also train on A poisoned (or on B, a different image that is poisoned), and either way the training becomes aware it is poisoned and piggybacks on the extra information that is inherently in the image. Like meta-adversarial training.

Very sharp insight. You've hit on the core challenge for any data poisoning tool, whether this one, Glaze, or Nightshade: a meta-adversarial attack in which a model is trained to detect and remove the protection. The recent LightShed paper shows this is possible if an attacker has a large dataset of paired clean and protected images.
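For intuition, here is a minimal sketch of what that paired-data attack looks like. This is a generic restoration setup, not LightShed's actual pipeline, and the model size, synthetic data, and hyperparameters are all placeholders: with enough (protected, clean) pairs, an attacker can simply fit a small network to strip the perturbation.

```python
# Generic sketch of a paired-data removal attack (NOT LightShed's actual method).
# Assumes access to (protected, clean) image pairs; the dataset here is synthetic.
import torch
import torch.nn as nn

class RemovalNet(nn.Module):
    """Tiny conv net that maps a protected image to an estimate of the clean image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x):
        # Predict the perturbation residual and subtract it from the input.
        return x - self.net(x)

model = RemovalNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for a real paired dataset: random "clean" images plus a synthetic perturbation.
clean = torch.rand(16, 3, 64, 64)
protected = clean + 0.05 * torch.randn_like(clean)

for step in range(100):
    opt.zero_grad()
    restored = model(protected)
    loss = nn.functional.l1_loss(restored, clean)  # pixel-wise restoration objective
    loss.backward()
    opt.step()
```

The hard part for the attacker is not this training loop; it is acquiring the paired data at all, which is exactly where the economic argument below comes in.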

My primary defense against this is economic rather than absolute unbreakability. To build that removal model, an attacker must acquire thousands of paired images, which forces them to either license original content or use our service at scale; either path makes the attack prohibitively expensive.

As for the model "piggybacking" on the extra information, I think that's unlikely to be beneficial. The embedded patterns are not a useful signal; they are mathematically structured chaos designed to disrupt fundamental parts of the training process, such as gradient flow and feature extraction. Instead of learning deeper gradients, the model's training convergence is significantly degraded, making it weaker, not stronger. The information is inherently toxic to the learning process.
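To make the "not a useful signal" point concrete, here is a hedged sketch using a generic error-maximizing (FGSM-style) perturbation. This is not the scheme used by this tool; it just illustrates how an imperceptible change can be tuned against the training objective rather than adding learnable information. The toy model, random data, and epsilon are placeholders.

```python
# Generic illustration (not this tool's actual scheme) of a loss-maximizing perturbation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(8, 3, 32, 32, requires_grad=True)
labels = torch.randint(0, 10, (8,))

# One gradient step on the *input*, ascending the training loss (FGSM-style).
loss = loss_fn(model(images), labels)
loss.backward()
epsilon = 4 / 255  # small enough to be visually negligible
poisoned = (images + epsilon * images.grad.sign()).detach().clamp(0, 1)

# The poisoned copy typically yields a higher loss, i.e. a worse learning signal,
# not a richer one.
with torch.no_grad():
    print("clean loss:   ", loss_fn(model(images), labels).item())
    print("poisoned loss:", loss_fn(model(poisoned), labels).item())
```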

Thanks for the insight. My original thought was that degrading the signal while remaining more or less unchanged to human perception could lead to more robust generalization, but from what I gather you are targeting the nature of information transfer between samples during training. Even if attackers could train a classifier from large-scale poisoning datasets, it would still be imperfect, and it seems even a small number of poisoned examples slipping through misclassified could cause significant disruption (rough numbers sketched below).

My only question is: if trainers can't effectively determine which data is poisoned, will this degrade all training in some form over time? And how does this compare to some form of obfuscation through something akin to honeypotting, or maybe there is a better term? Essentially, the poison would still act as a meaningful signal to the training, so their model still gets trained fine, but in such a way that the IP of the original poisoned image/data is not learned. This pairs with my initial question: is there a way to instill more information/compute into training samples such that the IP/sample stays hidden, but the model learns better because of it? If anyone could figure this out it would be you. To me and my naive-to-chaos-dynamics brain it seems impossible.
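A rough back-of-envelope for that "imperfect classifier" point; every number here is an illustrative assumption, not a measurement:

```python
# Even a strong filter leaves enough poisoned samples through to matter.
scraped_images = 1_000_000
poison_rate = 0.02          # assume 2% of scraped images carry protection
filter_recall = 0.95        # assume the detector catches 95% of them

poisoned_in_scrape = scraped_images * poison_rate
slipped_through = poisoned_in_scrape * (1 - filter_recall)
print(f"poisoned images surviving the filter: {slipped_through:,.0f}")  # -> 1,000
```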

For some reason I'm bearish on the word honeyspotting. Emphasis on spots.

Rereading that, the point that convergence is slower for some reason leads me to think about how this particularly affects new recurrent architectures like TRM and HRM. The idea that the training is still capable of converging sort of hints at naturally instilling robustness to overfitting.
