This article will give an overview of how perceptual image hashes can be used to detect duplicate and near duplicate images.
Install dependencies
Install txtai
and all dependencies.
pip install txtai[pipeline] textdistance
!wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz
tar -xvzf tests.tar.gz
Generate hashes
The example below generates perceptual image hashes for a list of images.
import glob
from PIL import Image
from txtai.pipeline import ImageHash
def show(image):
width, height = image.size
return image.resize((int(width / 2.25), int((width / 2.25) * height / width)))
# Get and scale images
images = [Image.open(image) for image in glob.glob('txtai/*jpg')]
# Create image pipeline
ihash = ImageHash()
# Generate hashes
hashes = ihash(images)
hashes
['000000c0feffff00',
'0859dd04ffbfbf00',
'78f8f8d8f8f8f8f0',
'0000446c6f2f2724',
'ffffdf0700010100',
'00000006fefcfc30',
'ff9d8140c070ffff',
'ff9f010909010101',
'63263c183ce66742',
'60607072fe78cc00']
Hash search
Next we'll generate a search hash to use to find similar near-duplicate images. This logic takes a section of an image and generates a hash for that
# Select portion of image
width, height = images[0].size
# Get dimensions for middle of image
left = (width - width/3)/2
top = (height - height/1.35)/2
right = (width + width/3)/2
bottom = (height + height/1.35)/2
# Crop image
search = images[0].crop((left, top, right, bottom))
show(search)
Now let's compare the hash to all the image hashes using Levenshtein distance. We'll use the textdistance library for that.
import textdistance
# Find closest image hash using textdistance
shash = ihash(search)
# Calculate distances for search hash
distances = [int(textdistance.levenshtein.distance(h, shash)) for h in hashes]
# Show closest image hash
low = min(distances)
show(images[distances.index(low)])
And as expected, the closest match is the original full image!
Generate hashes with Embeddings indexes
Next we'll add a custom field with a perceptual image hash and a custom SQL function to calculate Levenshtein distance. An index of images is built and then a search query run using the distance from the same search hash.
from txtai.embeddings import Embeddings
def distance(a, b):
if a and not b:
return len(a)
if not a and b:
return len(b)
if not a and not b:
return 0
return int(textdistance.levenshtein.distance(a, b))
# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": "image", "functions": [distance]})
# Create an index for the list of text
embeddings.index([(uid, {"object": image, "text": ihash(image)}, None) for uid, image in enumerate(images)])
# Find image that is closest to hash
show(embeddings.search(f"select object from txtai order by distance(text, '{shash}')")[0]["object"])
And just like above, the best match is the original full image.
Wrapping up
This article introduced perceptual image hashing. These hashes can be used to detect near-duplicate images. This method is not backed by machine learning models and not intended to find conceptually similar images. But for tasks looking to find similar/near-duplicate images this method is fast and does the job!