DupCheck is a Python-based utility designed for general-purpose image duplication and tampering detection. It is built to support use cases such as insurance claim reviews, content moderation, e-commerce verification, and copyright protection.
The tool integrates several feature extraction techniques: multi-pose pHash, multi-scale block hashing, ORB keypoints, and deep learning embeddings (ResNet-18 or CLIP). By building a comprehensive index, it can identify near-duplicates, including exact copies, cropped versions, rotated or flipped images, and those with minor modifications.
The DupCheck pipeline consists of four distinct stages:
During indexing, each image in the library is processed to generate multi-pose pHashes (covering original, rotated, and flipped versions) and multi-scale block hashes. The system also caches ORB keypoints and can optionally create ResNet-18 or CLIP embeddings. Combining these features helps DupCheck recognize images even after geometric tweaks or semantic changes.
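The multi-pose idea can be sketched in a few lines. This is a minimal illustration, not DupCheck's implementation: it uses a simple average hash as a stand-in for pHash, treats the image as a small pre-resized grayscale grid, and assumes the pose set is the four right-angle rotations plus a horizontal flip of each. All function names are hypothetical.

```python
from typing import Dict, List

Image = List[List[int]]  # grayscale pixels, row-major (assumed already downscaled)


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def average_hash(img: Image) -> int:
    """Stand-in for pHash: one bit per pixel, set when the pixel is above the mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def multi_pose_hashes(img: Image) -> Dict[str, int]:
    """Hash every pose: 0/90/180/270 degree rotations, each with and without a flip."""
    hashes: Dict[str, int] = {}
    pose = img
    for angle in (0, 90, 180, 270):
        hashes[f"rot{angle}"] = average_hash(pose)
        hashes[f"rot{angle}_flip"] = average_hash(flip_horizontal(pose))
        pose = rotate90(pose)
    return hashes


library_image = [
    [10, 200, 30, 40],
    [50, 60, 250, 80],
    [90, 100, 110, 220],
    [130, 140, 150, 160],
]
index_entry = multi_pose_hashes(library_image)

# A rotated copy hashes to one of the stored poses, so it can still be found.
rotated_copy = rotate90(library_image)
found = average_hash(rotated_copy) in index_entry.values()
```

Storing every pose at index time means a query only needs to be hashed once; the trade-off is a larger index.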
During recall, DupCheck retrieves candidate matches for each newly submitted image through three methods. First, it uses pHash bucket matching and block hash voting. Second, it performs a FAISS vector search using ResNet-18 or CLIP embeddings. Third, it employs multi-pose ORB matching to ensure that rotated or flipped near-duplicates are pulled into the candidate set.
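The hash-based part of recall comes down to comparing bit strings by Hamming distance. The sketch below shows that comparison under stated assumptions: hashes are plain integers, the index is a dictionary, and a candidate is kept when its distance is at or below a threshold. The names `recall_by_hash` and `max_distance` are hypothetical, and the linear scan stands in for the bucket lookup the tool uses.

```python
from typing import Dict, Iterable, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def recall_by_hash(
    query_hashes: Iterable[int],
    index: Dict[str, int],
    max_distance: int,
) -> List[Tuple[str, int]]:
    """Return (image_id, best_distance) pairs within max_distance of any query pose."""
    poses = list(query_hashes)
    if not poses:
        return []
    hits = []
    for image_id, db_hash in index.items():
        # Keep the closest pose: a rotated duplicate matches on one pose only.
        best = min(hamming(q, db_hash) for q in poses)
        if best <= max_distance:
            hits.append((image_id, best))
    return sorted(hits, key=lambda hit: hit[1])


index = {"a.jpg": 0b11110000, "b.jpg": 0b00001111, "c.jpg": 0b11110001}
candidates = recall_by_hash([0b11110000], index, max_distance=1)
```

Candidates from this step would then be merged with those from the vector search and ORB matching before verification.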
During verification, DupCheck runs an ORB + RANSAC routine on the best-matching pose in the candidate set. It aligns the database image to the query image’s coordinate system using an estimated homography. The tool then crops a Region of Interest (ROI) around the convex hull of the inliers and calculates the Zero-mean Normalized Cross-Correlation (ZNCC) across the aligned area. If the correlation score exceeds the threshold, the match is classified as an exact_patch.
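The ZNCC score itself is easy to state in code. The following is a minimal pure-Python sketch that assumes the two patches are already aligned and cropped to the same size; it does not cover the ORB, RANSAC, or homography steps. The helper `is_exact_patch` and its default of 0.88 (taken from the example command's `--ncc_thresh` value) are illustrative assumptions.

```python
import math
from typing import Sequence


def zncc(patch_a: Sequence[Sequence[float]], patch_b: Sequence[Sequence[float]]) -> float:
    """Zero-mean normalized cross-correlation of two equally sized grayscale patches."""
    a = [float(p) for row in patch_a for p in row]
    b = [float(p) for row in patch_b for p in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]  # subtracting the mean removes brightness offsets
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a flat patch has no texture to correlate
    # Dividing by the norms removes contrast (gain) differences.
    return sum(x * y for x, y in zip(da, db)) / denom


def is_exact_patch(score: float, ncc_thresh: float = 0.88) -> bool:
    """Hypothetical helper: a match counts only when the score exceeds the threshold."""
    return score > ncc_thresh


patch = [[10, 20], [30, 40]]
brighter = [[2 * p + 5 for p in row] for row in patch]  # same content, new gain/offset
inverted = [[255 - p for p in row] for row in patch]

same_score = zncc(patch, brighter)
inverted_score = zncc(patch, inverted)
```

Because the score ignores uniform brightness and contrast changes, a re-exposed copy of the same region still scores close to 1, while unrelated or inverted content does not.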
Finally, all identified matches are logged in dup_report.csv. The command-line interface can also generate side-by-side comparison images to facilitate manual review.
Scaling options
For larger libraries, set DUPC_VECTOR_INDEX=ivf_pq or DUPC_VECTOR_INDEX=hnsw to select an approximate FAISS index type.
To use an external vector database, modify build_index and load_index_from_db within duplicate_check/indexer.py to write to the external database, then update matcher.recall_candidates to query that service before ORB re-ranking.
Performance tuning
To optimize processing for large-scale image libraries, adjust the DUPC_TILE_SCALES (e.g., "1.0,0.6") and DUPC_TILE_GRID parameters. This allows you to balance detection sensitivity against execution speed.
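As an illustration of how a comma-separated setting such as DUPC_TILE_SCALES could be read, here is a small sketch. It is not DupCheck's own parser: the function name, the fallback value of "1.0", and the validation rules are assumptions, and DUPC_TILE_GRID is left out because its format is not described here.

```python
from typing import List, Mapping


def parse_tile_scales(env: Mapping[str, str], default: str = "1.0") -> List[float]:
    """Parse a comma-separated scale list such as "1.0,0.6" into floats."""
    raw = env.get("DUPC_TILE_SCALES", default)
    scales = [float(part) for part in raw.split(",") if part.strip()]
    if not scales or any(s <= 0 for s in scales):
        raise ValueError(f"invalid DUPC_TILE_SCALES: {raw!r}")
    return scales


# In practice the mapping would be os.environ; a plain dict keeps the demo self-contained.
scales = parse_tile_scales({"DUPC_TILE_SCALES": "1.0,0.6"})
fallback = parse_tile_scales({})
```

Each extra scale adds another pass of block hashing per image, so a shorter list is the faster choice and a longer one the more sensitive.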
The core modules are features, indexer, matcher, and report.
DupCheck requires Python 3.9 or newer.
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# Windows: .venv\Scripts\activate
pip install -r requirements.txt
Core dependencies include Pillow, OpenCV, imagehash, torch, and torchvision. If optional dependencies are missing, DupCheck will continue to function, though detection accuracy may be reduced.
Optional dependencies:
pip install faiss-cpu
pip install open-clip-torch (or pip install clip)
Generate synthetic test data, then run detection against it:
python tools/generate_synthetic.py --out_dir data --count 5
python dupcheck_cli.py \
--db_dir data/synth_db \
--input_dir data/synth_new \
--out_dir reports \
--index_db ./index.db \
--rebuild_index
Omit the --rebuild_index flag if your image library has not changed since the last run; the existing index at --index_db is then reused.
The reports directory will contain:
dup_report.csv – A tabular summary of all detected matches.
Evaluate detection accuracy using labeled synthetic data:
python tools/verify_synthetic.py \
--db_dir data/synth_db \
--input_dir data/synth_new \
--labels data/synth_labels.csv \
--phash_thresh 16 \
--orb_inliers_thresh 6 \
--ncc_thresh 0.88 \
--roi_margin_ratio 0.12 \
--max_roi_matches 60
Perform a grid search for optimal thresholds:
python tools/tune_thresholds.py \
--labels data/synth_labels.csv \
--db_dir data/synth_db \
--input_dir data/synth_new \
--out_dir reports/tune_out
This script generates tune_results.csv, detailing True Positive (TP), False Positive (FP), and False Negative (FN) counts for various parameter configurations.
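TP, FP, and FN counts are usually compared through precision, recall, and F1. The sketch below shows that arithmetic and how it could rank configurations. The rows are made-up illustrative values rather than real output, and the dictionary keys are assumptions, not the actual columns of tune_results.csv.

```python
from typing import Dict


def score_counts(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Derive precision, recall, and F1 from raw match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical rows for illustration only; not produced by the tool.
rows = [
    {"phash_thresh": 12, "tp": 6, "fp": 0, "fn": 4},
    {"phash_thresh": 16, "tp": 8, "fp": 2, "fn": 2},
    {"phash_thresh": 20, "tp": 9, "fp": 9, "fn": 1},
]

# Pick the configuration with the highest F1.
best = max(rows, key=lambda r: score_counts(r["tp"], r["fp"], r["fn"])["f1"])
```

F1 is only one possible ranking criterion; a review workflow that cannot tolerate missed duplicates might rank by recall instead.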
| Task | Command |
|---|---|
| Rebuild index + detect | python dupcheck_cli.py --db_dir data/synth_db --input_dir data/synth_new --out_dir reports --index_db ./index.db --rebuild_index |
| Custom thresholds | python dupcheck_cli.py --db_dir data/synth_db --input_dir data/synth_new --out_dir reports --phash_thresh 12 --orb_inliers_thresh 30 --ncc_thresh 0.94 |
| Reuse existing index | python dupcheck_cli.py --db_dir data/synth_db --input_dir data/synth_new --out_dir reports --index_db ./index.db |