Open English Dictionary is an open-source English–Chinese lexicon generated entirely by large language models. It offers more than basic translations: each entry clarifies nuanced shades of meaning that standard bilingual dictionaries often overlook. The project currently contains over 25,000 entries, generated automatically with models such as Qwen3-Next-80B-A3B-Instruct.
Entries are stored as structured JSON. Each record includes pronunciation, a concise core definition, inflected forms, and bilingual explanations with example sentences. It also provides direct comparisons between near-synonyms to show exactly how similar English words diverge within a Chinese context. The focus is on precision.
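As a rough illustration of the record layout described above, the sketch below builds one entry and checks that the expected sections are present. Every field name (word, pronunciation, core_definition, forms, explanations, comparisons) and the sample content are assumptions made for illustration; the project's actual schema may differ.

```python
import json

# Hypothetical entry: field names and content are illustrative only,
# not the project's real schema.
entry = {
    "word": "glance",
    "pronunciation": "/ɡlæns/",
    "core_definition": "to look at something quickly",
    "forms": ["glances", "glanced", "glancing"],
    "explanations": [
        {
            "en": "To take a brief, often casual look.",
            "zh": "快速地、随意地看一眼。",
            "example_en": "She glanced at her watch.",
            "example_zh": "她瞥了一眼手表。",
        }
    ],
    "comparisons": [
        {"with": "glimpse", "note_zh": "glimpse 强调偶然看到,glance 强调主动快速地看。"}
    ],
}

# Assumed list of sections every record should carry.
REQUIRED = ["word", "pronunciation", "core_definition", "forms",
            "explanations", "comparisons"]


def missing_fields(record, required):
    """Return the required field names that are absent from the record."""
    return [name for name in required if name not in record]


# ensure_ascii=False keeps the Chinese text readable in the JSON file.
serialized = json.dumps(entry, ensure_ascii=False, indent=2)
```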
The toolchain ensures data quality rather than relying on raw LLM output. A batch generation script (main.py) manages the primary workload, supported by several maintenance utilities: check_json_structure.py validates the schema, clean_json_entries.py removes empty fields, and generate_json_template.py maintains structural uniformity as the dictionary expands. The resulting data is clean and consistent. You can access the full dictionary by cloning the repository.
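To make the cleanup step concrete, here is a minimal sketch of what removing empty fields from a nested entry could look like. It is not the code in clean_json_entries.py; the function names and the definition of "empty" (None, empty string, empty list, empty dict) are assumptions.

```python
def is_empty(value):
    """Treat None and zero-length strings, lists, and dicts as empty."""
    return value is None or (
        isinstance(value, (str, list, dict)) and len(value) == 0
    )


def remove_empty_fields(value):
    """Recursively drop empty fields from dicts and empty items from lists."""
    if isinstance(value, dict):
        cleaned = {key: remove_empty_fields(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not is_empty(item)}
    if isinstance(value, list):
        cleaned = [remove_empty_fields(item) for item in value]
        return [item for item in cleaned if not is_empty(item)]
    return value
```

Cleaning children before testing the parent means a field that holds only empty values is removed as well. Numeric zero and False are kept, since they are values rather than missing data.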
Setup
Install dependencies:
uv sync
Add your DATABASE_URL to the .env file.
Ensure the URL points to an active PostgreSQL instance.
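Before running the later steps, it can help to sanity-check the connection string. The sketch below only parses the URL with the standard library; it does not contact the server, so it cannot confirm the instance is actually running. The function name and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Parse a PostgreSQL URL and return its host, port, and database name."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"expected a PostgreSQL URL, got scheme {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is PostgreSQL's default port
        "database": parts.path.lstrip("/"),
    }
```

For example, describe_database_url(os.environ["DATABASE_URL"]) would surface a mistyped scheme or a missing database name early.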
Download the compressed Wiktextract dump:
uv run open-dictionary download --output data/raw-wiktextract-data.jsonl.gz
Extract the JSONL file:
uv run open-dictionary extract --input data/raw-wiktextract-data.jsonl.gz --output data/raw-wiktextract-data.jsonl
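For readers curious about what the extraction step amounts to, the following sketch decompresses a .gz file in fixed-size chunks so the whole dump never has to fit in memory. It is an illustration using Python's standard library, not the CLI's actual implementation; extract_gzip is a hypothetical name.

```python
import gzip
import shutil


def extract_gzip(src, dst, chunk_size=1024 * 1024):
    """Stream-decompress src (.gz) into dst, one chunk at a time."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain, chunk_size)
```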
Stream the data into PostgreSQL:
uv run open-dictionary load data/raw-wiktextract-data.jsonl --table dictionary_all --column data --truncate
(The dictionary_all.data column uses the JSONB type.)
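Streaming a large JSONL file into a JSONB column generally means reading one line at a time, validating it, and writing rows in batches. The sketch below covers only the reading and batching half and leaves the database write out, since that depends on the driver in use. The helper names are hypothetical and this is not the project's loader.

```python
import json
from itertools import islice


def iter_json_lines(lines):
    """Yield each non-blank line after confirming it parses as JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # raises json.JSONDecodeError on malformed input
        yield line


def in_batches(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch could then be handed to a bulk insert, with every string stored as one JSONB value in the data column.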
Alternatively, run the download, extract, and load steps end to end with a single command:
uv run open-dictionary pipeline --workdir data --table dictionary_all --column data --truncate
Split rows by language code:
uv run open-dictionary partition --table dictionary_all --column data --lang-field lang_code
Filter for specific languages:
• Isolate English and Chinese into dedicated tables with a custom prefix:
uv run open-dictionary filter en zh --table dictionary_all --column data --table-prefix dictionary_filtered
• Create a separate table for every language found:
uv run open-dictionary filter all --table dictionary_all --column data
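Conceptually, partitioning and filtering both route each row to a table chosen by its language code. The in-memory sketch below shows that routing under the assumption that target tables are named prefix_code (as dictionary_filtered_en suggests); the real commands operate inside PostgreSQL, and the function and parameter names here are hypothetical.

```python
from collections import defaultdict


def partition_by_language(rows, lang_field="lang_code",
                          prefix="dictionary_filtered", only=None):
    """Group rows by language code under table names like '<prefix>_<code>'.

    `only` restricts the output to the given codes; None keeps every language.
    """
    tables = defaultdict(list)
    for row in rows:
        code = row.get(lang_field)
        if code is None:
            continue  # rows without a language code cannot be routed
        if only is not None and code not in only:
            continue
        tables[f"{prefix}_{code}"].append(row)
    return dict(tables)
```

Passing only={"en", "zh"} mirrors the first filter command, while leaving only unset mirrors filter all.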
Add word frequency scores:
uv run open-dictionary db-commonness --table dictionary_filtered_en
(Use --recompute-existing to refresh existing scores.)
Purge low-quality entries:
uv run open-dictionary db-clean --table dictionary_filtered_en
This removes rows with a commonness score of zero, entries that contain digits, and entries carrying outdated tags.
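The three removal rules can be expressed as a single predicate. The sketch below is an illustration only: the set of tags treated as outdated and the way word, score, and tags are passed in are assumptions, not the criteria db-clean actually applies.

```python
# Assumed set of tags counted as outdated; the real list may differ.
OUTDATED_TAGS = frozenset({"obsolete", "archaic", "dated"})


def should_remove(word, commonness, tags):
    """Return True when an entry matches any of the three removal rules."""
    if commonness == 0:
        return True  # rule 1: zero commonness score
    if any(char.isdigit() for char in word):
        return True  # rule 2: the word contains a digit
    return any(tag in OUTDATED_TAGS for tag in tags)  # rule 3: outdated tag
```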
Generate learner-focused definitions via LLM:
uv run open-dictionary llm-define --table dictionary_filtered_en --source-column data --target-column new_speak
This triggers the define workflow. Results are stored in the new_speak column as JSONB. The process dispatches up to 50 concurrent LLM calls, retries failed calls with exponential backoff, and resumes automatically if interrupted.
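Bounded concurrency with exponential backoff is commonly built from a semaphore and a retry loop. The asyncio sketch below shows one such arrangement; it is not the project's code, and define_one, the retry count, and the delay values are hypothetical. Resuming after an interruption is not covered here.

```python
import asyncio
import random

MAX_CONCURRENCY = 50  # matches the limit described above


async def call_with_backoff(task, *, retries=5, base_delay=1.0, max_delay=60.0):
    """Await task(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return await task()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps retries from firing in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def define_all(rows, define_one, *, limit=MAX_CONCURRENCY, **backoff):
    """Run define_one(row) for every row with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(row):
        async with semaphore:
            return await call_with_backoff(lambda: define_one(row), **backoff)

    # gather returns results in the same order as the input rows.
    return await asyncio.gather(*(worker(row) for row in rows))
```

The semaphore caps the number of in-flight requests, while the doubling delay gives a rate-limited or briefly unavailable API time to recover.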
Before running any LLM-based commands, set your LLM_MODEL, LLM_KEY, and LLM_API environment variables. The pipeline is built to handle datasets exceeding 10 million rows without stalling.
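A quick pre-flight check can catch a missing setting before a long run starts. This sketch reads the three variable names given above from the environment; the helper itself is hypothetical and not part of the CLI.

```python
import os

REQUIRED_VARS = ("LLM_MODEL", "LLM_KEY", "LLM_API")


def missing_llm_settings(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_llm_settings() with no arguments inspects the current process environment; an empty list means all three values are set.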