DupCheck: Open-Source Image Duplication & Tampering Detection (Python)

October 21 · Published in Image Tools

DupCheck is a Python-based utility designed for general-purpose image duplication and tampering detection. It is built to support use cases such as insurance claim reviews, content moderation, e-commerce verification, and copyright protection.

The tool integrates several feature extraction techniques: multi-pose pHash, multi-scale block hashing, ORB keypoints, and deep learning embeddings (ResNet-18 or CLIP). By building a comprehensive index, it can identify near-duplicates, including exact copies, cropped versions, rotated or flipped images, and those with minor modifications.

The DupCheck pipeline consists of four distinct stages:

1. Index Building

Each image in the library is processed to generate multi-pose pHashes (covering original, rotated, and flipped versions) and multi-scale block hashes. The system also caches ORB keypoints and can optionally create ResNet-18 or CLIP embeddings. Combining these layers helps the index match images that have been rotated, flipped, cropped, or lightly modified, rather than only exact copies.
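The multi-pose idea can be sketched without any dependencies. The example below is illustrative only: it uses a simple average hash on a tiny grayscale grid as a stand-in for pHash, and the helper names (average_hash, multi_pose_hashes) are hypothetical rather than DupCheck's own functions.

```python
from typing import Dict, List

Grid = List[List[int]]  # grayscale pixels, row-major


def average_hash(img: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def rotate90(img: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in img]


def multi_pose_hashes(img: Grid) -> Dict[str, int]:
    """Hash the image in 4 rotations, each with and without a flip."""
    poses: Dict[str, int] = {}
    current = img
    for angle in (0, 90, 180, 270):
        poses[f"rot{angle}"] = average_hash(current)
        poses[f"rot{angle}_flip"] = average_hash(flip_horizontal(current))
        current = rotate90(current)
    return poses
```

Storing all eight pose hashes at index time means a rotated or mirrored query only needs a single hash of its own to find the original.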

2. Candidate Recall

When a new image is submitted, DupCheck retrieves potential matches through three methods. First, it uses pHash bucket matching and block hash voting. Second, it performs a FAISS vector search using ResNet-18 or CLIP embeddings. Third, it employs multi-pose ORB matching to ensure that rotated or flipped near-duplicates are pulled into the candidate set.
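As a rough illustration of hash-based recall, the sketch below keeps every library image whose closest stored pose hash lies within a Hamming-distance budget, then merges the result with other recall channels. The function names and the dictionary-based index are assumptions made for this example and do not reflect DupCheck's internal data structures.

```python
from typing import Dict, Iterable, List, Set


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def recall_by_hash(
    query_hash: int, index: Dict[str, List[int]], max_distance: int
) -> Set[str]:
    """Return ids whose closest stored pose hash is within max_distance bits."""
    return {
        image_id
        for image_id, pose_hashes in index.items()
        if min(
            (hamming(query_hash, h) for h in pose_hashes),
            default=max_distance + 1,  # images with no hashes never match
        )
        <= max_distance
    }


def merge_candidates(*channels: Iterable[str]) -> List[str]:
    """Union several recall channels, keeping first-seen order."""
    seen: Dict[str, None] = {}
    for channel in channels:
        for image_id in channel:
            seen.setdefault(image_id, None)
    return list(seen)
```

Taking the union keeps recall high: an image only has to be found by one channel to reach the verification stage, where false candidates are filtered out.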

3. Verification & Alignment

For the best-matching pose in the candidate set, DupCheck executes an ORB + RANSAC routine. It aligns the database image to the query image’s coordinate system using an estimated homography. The tool then crops a Region of Interest (ROI) around the convex hull of the inliers and calculates the Zero-mean Normalized Cross-Correlation (ZNCC) across the aligned area. If the correlation score exceeds the threshold, the match is classified as an exact_patch.
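For readers unfamiliar with ZNCC, here is a minimal pure-Python sketch of the score on two equally sized, flattened patches. It illustrates the formula only; treating flat patches as a non-match (score 0.0) is an assumption of this example, not documented DupCheck behavior.

```python
import math
from typing import Sequence


def zncc(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-mean normalized cross-correlation of two equal-length patches."""
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Subtract the mean so a uniform brightness offset has no effect.
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    # Normalize by the energy of both patches so a contrast gain cancels out.
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a flat patch has no structure to correlate (assumption)
    return sum(x * y for x, y in zip(da, db)) / denom
```

Because the mean is removed and the result is normalized, the score ranges from -1 to 1 and is unaffected by a uniform brightness shift or positive contrast scaling, which is why it suits comparing an aligned patch against its possibly re-encoded source.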

4. Output Results

All identified matches are logged in dup_report.csv. The command-line interface can also generate side-by-side comparison images to facilitate manual review.

Scalability & Performance

Scaling options

  • You can change the FAISS index type by setting the environment variable DUPC_VECTOR_INDEX to ivf_pq or hnsw.
  • For enterprise-scale libraries or cluster deployments, the built-in FAISS index can be replaced with external vector databases like Milvus, Qdrant, or Pinecone. To implement this, modify build_index and load_index_from_db within duplicate_check/indexer.py to write to the external database, then update matcher.recall_candidates to query that service before ORB re-ranking.
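One plausible way to turn DUPC_VECTOR_INDEX into a concrete FAISS index is to map each value to a FAISS factory string. The sketch below is hypothetical: the mapping, the specific parameters (256 lists, 16 sub-quantizers, 32 graph neighbors), and the "flat" default are assumptions chosen for illustration, and the source does not show how DupCheck performs this step.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping from DUPC_VECTOR_INDEX values to FAISS factory strings.
FACTORY_SPECS = {
    "flat": "Flat",           # assumed default: exact brute-force search
    "ivf_pq": "IVF256,PQ16",  # inverted lists plus product quantization
    "hnsw": "HNSW32",         # graph-based approximate search
}


def vector_index_spec(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the factory string for the configured index type."""
    env = os.environ if env is None else env
    choice = env.get("DUPC_VECTOR_INDEX", "flat").strip().lower()
    if choice not in FACTORY_SPECS:
        raise ValueError(f"unsupported DUPC_VECTOR_INDEX value: {choice!r}")
    return FACTORY_SPECS[choice]


# With FAISS installed, the result could be passed to the factory, e.g.:
#   index = faiss.index_factory(embedding_dim, vector_index_spec())
```

Broadly, IVF-PQ trades some accuracy for a much smaller memory footprint, while HNSW favors query speed at the cost of extra memory, so the right choice depends on library size and hardware.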

Performance tuning

To optimize processing for large-scale image libraries, adjust the DUPC_TILE_SCALES (e.g., "1.0,0.6") and DUPC_TILE_GRID parameters. This allows you to balance detection sensitivity against execution speed.
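To get a feel for the trade-off, the sketch below parses a DUPC_TILE_SCALES string and estimates how many tiles would be hashed per image. It assumes, purely for illustration, that DUPC_TILE_GRID is a single integer N describing an N x N grid applied at every scale; the actual format and semantics are not specified here, so check the project before relying on this.

```python
from typing import List, Sequence


def parse_tile_scales(raw: str) -> List[float]:
    """Parse a comma-separated scale list such as "1.0,0.6"."""
    scales = [float(part) for part in raw.split(",") if part.strip()]
    if not scales or any(s <= 0 for s in scales):
        raise ValueError(f"invalid tile scales: {raw!r}")
    return scales


def tiles_per_image(scales: Sequence[float], grid: int) -> int:
    """Estimate hashed tiles per image, assuming an N x N grid per scale."""
    if grid <= 0:
        raise ValueError("grid must be a positive integer")
    return len(scales) * grid * grid
```

Under that assumption, two scales with an 8 x 8 grid give 128 tiles per image, while one scale with a 4 x 4 grid gives 16, which hints at how quickly indexing cost grows with finer settings.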

Core Modules

  • duplicate_check/ – The primary module containing subdirectories for features, indexer, matcher, and report.
  • dupcheck_cli.py – The main CLI tool, supporting both in-memory and SQLite-based indexes for routine detection tasks.
  • duplicate_check.py – A legacy entry point maintained for backward compatibility; not recommended for new features.
  • tools/ – Utility scripts for generating synthetic datasets and fine-tuning thresholds.
  • tests/ – Scripts for validating core functionality.
  • data/ – Contains a sample synthetic dataset for initial testing.

Local Installation

DupCheck requires Python 3.9 or newer.

Install dependencies

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# Windows: .venv\Scripts\activate

pip install -r requirements.txt

Core dependencies include Pillow, OpenCV, imagehash, torch, and torchvision. If optional dependencies are missing, DupCheck will continue to function, though detection accuracy may be reduced.

Optional dependencies

  • For vector-based recall: pip install faiss-cpu
  • For CLIP-ViT embeddings: pip install open-clip-torch (or pip install clip)

Generate sample dataset

python tools/generate_synthetic.py --out_dir data --count 5

Rebuild index and run detection

python dupcheck_cli.py \
--db_dir data/synth_db \
--input_dir data/synth_new \
--out_dir reports \
--index_db ./index.db \
--rebuild_index

Omit the --rebuild_index flag if your image library has not changed since the last run; the existing index at --index_db is then reused.
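If the library changes only occasionally, a small wrapper can decide when to pass --rebuild_index. The sketch below is a hypothetical helper, not part of DupCheck: it fingerprints the library directory (relative paths, sizes, and modification times) and compares the result with a stamp file saved after the last successful run.

```python
import hashlib
from pathlib import Path
from typing import List


def library_fingerprint(db_dir: str) -> str:
    """Hash the relative path, size, and mtime of every file under db_dir."""
    digest = hashlib.sha256()
    root = Path(db_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        rel = path.relative_to(root).as_posix()
        digest.update(f"{rel}|{stat.st_size}|{stat.st_mtime_ns}\n".encode())
    return digest.hexdigest()


def needs_rebuild(db_dir: str, stamp_file: str) -> bool:
    """True when no stamp exists or the library changed since it was written."""
    stamp = Path(stamp_file)
    previous = stamp.read_text().strip() if stamp.exists() else ""
    return library_fingerprint(db_dir) != previous


def mark_indexed(db_dir: str, stamp_file: str) -> None:
    """Record the current fingerprint; call this after a successful run."""
    Path(stamp_file).write_text(library_fingerprint(db_dir))


def build_command(
    db_dir: str, input_dir: str, out_dir: str, index_db: str, rebuild: bool
) -> List[str]:
    """Assemble the dupcheck_cli.py invocation shown above."""
    cmd = [
        "python", "dupcheck_cli.py",
        "--db_dir", db_dir,
        "--input_dir", input_dir,
        "--out_dir", out_dir,
        "--index_db", index_db,
    ]
    if rebuild:
        cmd.append("--rebuild_index")
    return cmd
```

A typical flow would be to call build_command(..., rebuild=needs_rebuild(db_dir, stamp)), run it with subprocess.run(cmd, check=True), and call mark_indexed only once the run succeeds. Keep the stamp file outside the library directory so it does not alter the fingerprint.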

Check results

The reports directory will contain:

  • dup_report.csv – A tabular summary of all detected matches.
  • Side-by-side visual evidence for manual verification.

Evaluation & Threshold Tuning

Evaluate detection accuracy using labeled synthetic data:

python tools/verify_synthetic.py \
--db_dir data/synth_db \
--input_dir data/synth_new \
--labels data/synth_labels.csv \
--phash_thresh 16 \
--orb_inliers_thresh 6 \
--ncc_thresh 0.88 \
--roi_margin_ratio 0.12 \
--max_roi_matches 60

Perform a grid search for optimal thresholds:

python tools/tune_thresholds.py \
--labels data/synth_labels.csv \
--db_dir data/synth_db \
--input_dir data/synth_new \
--out_dir reports/tune_out

This script generates tune_results.csv, detailing True Positive (TP), False Positive (FP), and False Negative (FN) counts for various parameter configurations.
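The TP, FP, and FN counts can be condensed into precision, recall, and F1 to compare configurations. The sketch below assumes each row is a dictionary with keys named "tp", "fp", and "fn"; those column names are hypothetical, so adjust them to match the actual header of tune_results.csv.

```python
from typing import Dict, Iterable, Mapping


def score(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def best_config(rows: Iterable[Mapping[str, str]]) -> Mapping[str, str]:
    """Return the row with the highest F1 (column names are assumptions)."""
    return max(
        rows,
        key=lambda r: score(int(r["tp"]), int(r["fp"]), int(r["fn"]))["f1"],
    )
```

Rows read with csv.DictReader can be passed straight to best_config. F1 weighs false positives and false negatives equally; if missed duplicates are costlier than extra manual reviews, ranking by recall at a minimum precision may be a better fit.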

Quick Command Reference

Rebuild index + detect

python dupcheck_cli.py --db_dir data/synth_db --input_dir data/synth_new --out_dir reports --index_db ./index.db --rebuild_index

Custom thresholds

python dupcheck_cli.py --db_dir data/synth_db --input_dir data/synth_new --out_dir reports --phash_thresh 12 --orb_inliers_thresh 30 --ncc_thresh 0.94

Reuse existing index

python dupcheck_cli.py --db_dir data/synth_db --input_dir data/synth_new --out_dir reports --index_db ./index.db