Near duplicate image detection

Identify duplicate and near-duplicate images

This article gives an overview of how perceptual image hashes can be used to detect duplicate and near-duplicate images.

Install dependencies

Install txtai and all dependencies, then download the test images used in this article.

pip install "txtai[pipeline]" textdistance

wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz
tar -xvzf tests.tar.gz

Generate hashes

The example below generates perceptual image hashes for a list of images.

import glob

from PIL import Image

from txtai.pipeline import ImageHash

def show(image):
  width, height = image.size
  return image.resize((int(width / 2.25), int((width / 2.25) * height / width)))

# Load images
images = [Image.open(image) for image in glob.glob('txtai/*jpg')]

# Create image pipeline
ihash = ImageHash()

# Generate hashes
hashes = ihash(images)
hashes
['000000c0feffff00',
 '0859dd04ffbfbf00',
 '78f8f8d8f8f8f8f0',
 '0000446c6f2f2724',
 'ffffdf0700010100',
 '00000006fefcfc30',
 'ff9d8140c070ffff',
 'ff9f010909010101',
 '63263c183ce66742',
 '60607072fe78cc00']
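
To give a feel for what a 16-character hex hash like the ones above encodes, here is a minimal sketch of one common perceptual hashing approach, the average hash. It assumes the image has already been converted to grayscale and reduced to an 8x8 grid. The `average_hash` function and the sample pixel values are illustrative assumptions, and the algorithm txtai's ImageHash pipeline actually uses may differ.

```python
def average_hash(pixels):
    """Compute a hex hash for a grid of grayscale values (0-255)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)

    # One bit per pixel: 1 when the pixel is brighter than the mean
    bits = "".join("1" if value > mean else "0" for value in flat)

    # Pack the bit string into hex, 4 bits per character
    width = len(bits) // 4
    return f"{int(bits, 2):0{width}x}"

# Hypothetical 8x8 image: dark top half, bright bottom half
image = [[10] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]

# Same image with a slight brightness change, as a near duplicate might have
brighter = [[value + 5 for value in row] for row in image]

print(average_hash(image))     # 00000000ffffffff
print(average_hash(brighter))  # 00000000ffffffff
```

Because each bit only records whether a pixel is above or below the mean, a uniform brightness shift leaves the hash unchanged, which is the property that makes these hashes useful for near-duplicate detection.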

Hash search

Next we'll generate a search hash to use to find near-duplicate images. This logic takes a section of an image and generates a hash for that section.

# Select portion of image
width, height = images[0].size

# Get dimensions for middle of image
left = (width - width/3)/2
top = (height - height/1.35)/2
right = (width + width/3)/2
bottom = (height + height/1.35)/2

# Crop image
search = images[0].crop((left, top, right, bottom))
show(search)

[Image: search]

Now let's compare the hash to all the image hashes using Levenshtein distance. We'll use the textdistance library for that.

import textdistance

# Generate hash for the search image
shash = ihash(search)

# Calculate distances for search hash
distances = [int(textdistance.levenshtein.distance(h, shash)) for h in hashes]

# Show closest image hash
low = min(distances)
show(images[distances.index(low)])

[Image: original]

And as expected, the closest match is the original full image!
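
The comparison above relies on Levenshtein distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that calculation, included only to show what the metric measures. The `levenshtein` function is an illustrative stand-in, not the textdistance implementation.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    # previous holds the prior row of the edit matrix
    previous = list(range(len(b) + 1))

    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current

    return previous[-1]

# Compare two of the hashes generated earlier
print(levenshtein("000000c0feffff00", "00000006fefcfc30"))
```

A distance of 0 means the hashes are identical; smaller values indicate hashes that share more of their characters in order.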

Generate hashes with Embeddings indexes

Next we'll add a custom field with a perceptual image hash and a custom SQL function to calculate Levenshtein distance. An index of images is built and then a search query is run that orders results by distance from the same search hash.

from txtai.embeddings import Embeddings

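def distance(a, b):
  # When one value is missing, the distance is the length of the other
  if a and not b:
    return len(a)
  if not a and b:
    return len(b)

  # Two missing values are treated as identical
  if not a and not b:
    return 0

  return int(textdistance.levenshtein.distance(a, b))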

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": "image", "functions": [distance]})

# Index each image, storing its perceptual hash in the text field
embeddings.index([(uid, {"object": image, "text": ihash(image)}, None) for uid, image in enumerate(images)])

# Find image that is closest to hash
show(embeddings.search(f"select object from txtai order by distance(text, '{shash}')")[0]["object"])

[Image: original]

And just like above, the best match is the original full image.
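
To illustrate the general idea behind calling a Python function from SQL, here is a minimal sketch using Python's standard sqlite3 module. It assumes a SQLite-backed store. The `images` table, its columns and the search hash are hypothetical, and the `distance` function here is a simple stand-in that counts differing character positions rather than computing Levenshtein distance. It is not a description of txtai's internals.

```python
import sqlite3

def distance(a, b):
    """Stand-in metric: number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))

# In-memory database with a hypothetical table of image hashes
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, hash TEXT)")
connection.executemany(
    "INSERT INTO images (id, hash) VALUES (?, ?)",
    enumerate(["000000c0feffff00", "0859dd04ffbfbf00", "78f8f8d8f8f8f8f0"]),
)

# Register the Python function so SQL statements can call distance(a, b)
connection.create_function("distance", 2, distance)

# Order rows by distance to a search hash, closest first
query = "SELECT id, hash FROM images ORDER BY distance(hash, ?) LIMIT 1"
closest = connection.execute(query, ("000000c0feffef00",)).fetchone()
print(closest)  # (0, '000000c0feffff00')
```

Passing the search hash as a bound parameter, rather than formatting it into the SQL string, avoids quoting problems when the value comes from outside the program.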

Wrapping up

This article introduced perceptual image hashing. These hashes can be used to detect near-duplicate images. This method does not rely on machine learning models and is not intended to find conceptually similar images. But for tasks that need to find near-duplicate images, it is fast and does the job!