The Stallone pic is generated by SD, unless I'm misunderstanding something. There are false positives, but they shouldn't be "rotated 90 degrees" as you say. The dups mostly match the duplicates you'd get from raw CLIP features.
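For anyone wondering what "matching raw feature duplicates" could look like in practice, here's a rough sketch, not the code from the repo. It assumes each image already has a feature vector (e.g. from CLIP) and flags pairs whose cosine similarity clears a threshold. The function names, the toy vectors, and the `0.98` threshold are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)

def near_duplicate_pairs(features, threshold=0.98):
    """Return index pairs (i, j), i < j, with similarity >= threshold.

    The threshold is a hypothetical value, not one taken from the thread.
    """
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine_similarity(features[i], features[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy stand-ins for image features: vector 3 is vector 0 plus a tiny
# perturbation, mimicking a recompressed copy of the same image.
features = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.3, 0.0],
    [0.1, 0.2, 1.0, 0.0],
    [1.01, 0.005, -0.004, 0.198],
]
pairs = near_duplicate_pairs(features)
print(pairs)  # [(0, 3)]
```

This brute-force comparison is quadratic in the number of images, so it's only useful for spot-checking a small sample; at LAION scale you'd need some kind of index instead.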
I would, but I don't have the CLIP features. I'll release some training
code so that it's possible for others to train their own indices. The method
should scale to 5B, even on a single node; you'll just need more RAM.
I think the first version of SD was trained with duplicates, and they made some effort to remove duplicates for training v2 (people on Discord are saying pHash or something similar). I suppose it'd be interesting to see whether the same prompts still produce verbatim copies.
von-hust OP t1_jb5l3r0 wrote
Reply to comment by [deleted] in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Well, just to be clear, these are actually near duplicates (i.e., the images should differ only by compression, small artifacts, or even imperceptible differences). I'll try to be more explicit about what I mean by "duplicate" on the GitHub.