Open English Dictionary: 25,000+ LLM-Refined Word Entries for Deeper Chinese Understanding

October 30. Published in Dictionaries & Lexicons

Open English Dictionary is an open-source English–Chinese lexicon generated entirely by large language models. It offers more than basic translations; each entry clarifies nuanced shades of meaning that standard bilingual dictionaries often overlook. The project currently contains over 25,000 entries, generated automatically with models such as Qwen3-Next-80B-A3B-Instruct.

Entries are stored as structured JSON. Each record includes pronunciation, a concise core definition, inflected forms, and bilingual explanations with example sentences. It also provides direct comparisons between near-synonyms to show exactly how similar English words diverge within a Chinese context. The focus is on precision.
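To make the record layout concrete, here is a minimal Python sketch of what one entry might look like. The source lists the kinds of fields but not their names, so every key below (word, pronunciation, core_definition, forms, explanations, comparisons) is an assumption for illustration, not the project's actual schema.

```python
import json

# Hypothetical record: every key name here is an illustrative assumption,
# not the project's real schema.
entry = {
    "word": "run",
    "pronunciation": "/rʌn/",
    "core_definition": "to move quickly on foot",
    "forms": ["runs", "running", "ran"],
    "explanations": [
        {
            "en": "To move faster than walking.",
            "zh": "跑；奔跑",
            "examples": [
                {"en": "She runs every morning.", "zh": "她每天早上跑步。"}
            ],
        }
    ],
    "comparisons": [
        {"word": "jog", "note": "jog 指慢跑，速度较慢；run 泛指跑。"}
    ],
}

# Assumed set of required keys, mirroring the fields named in the text.
REQUIRED_KEYS = (
    "word",
    "pronunciation",
    "core_definition",
    "forms",
    "explanations",
    "comparisons",
)


def missing_keys(record: dict) -> list:
    """Return the required keys that are absent from a record."""
    return [key for key in REQUIRED_KEYS if key not in record]


# ensure_ascii=False keeps the Chinese text readable in the JSON output.
serialized = json.dumps(entry, ensure_ascii=False, indent=2)
```

Calling missing_keys(entry) returns an empty list for the record above, and serialized holds the JSON text as it could appear on disk.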

The toolchain checks and cleans the data instead of shipping raw LLM output. A batch generation script (main.py) manages the primary workload, supported by several maintenance utilities: check_json_structure.py validates the schema, clean_json_entries.py removes empty fields, and generate_json_template.py keeps the structure uniform as the dictionary expands. You can access the full dictionary by cloning the repository.
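The empty-field cleanup can be sketched in a few lines of Python. This is not the code in clean_json_entries.py. It is a minimal illustration that assumes "empty" means None, an empty string, an empty list, or an empty object, and the function names are hypothetical.

```python
from typing import Any


def _is_empty(value: Any) -> bool:
    """Treat None and zero-length strings, lists, and dicts as empty."""
    if value is None:
        return True
    return isinstance(value, (str, list, dict)) and len(value) == 0


def remove_empty_fields(value: Any) -> Any:
    """Recursively drop empty fields from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: remove_empty_fields(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not _is_empty(item)}
    if isinstance(value, list):
        cleaned_items = [remove_empty_fields(item) for item in value]
        return [item for item in cleaned_items if not _is_empty(item)]
    # Scalars such as 0 or False are real values and are kept as-is.
    return value
```

Children are cleaned before the parent is checked, so a container that becomes empty once its contents are removed is dropped too.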

Getting Started

Setup

  1. Install dependencies: uv sync

  2. Add your DATABASE_URL to the .env file.

  3. Ensure the URL points to an active PostgreSQL instance.
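Before running anything against the database, it can help to confirm that DATABASE_URL at least has the shape of a PostgreSQL connection URL. The sketch below uses only the Python standard library. The helper name parse_database_url is hypothetical, and the check covers the URL's format only, not whether the server is actually reachable.

```python
from urllib.parse import urlparse


def parse_database_url(url: str) -> dict:
    """Split a PostgreSQL URL into its parts, raising ValueError if malformed."""
    parsed = urlparse(url)
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    database = parsed.path.lstrip("/")
    if not parsed.hostname or not database:
        raise ValueError("URL must include a host and a database name")
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,  # 5432 is PostgreSQL's default port
        "user": parsed.username,
        "database": database,
    }


if __name__ == "__main__":
    import os

    # Reads the same variable that the setup steps place in .env.
    print(parse_database_url(os.environ["DATABASE_URL"]))
```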

The Wiktionary Pipeline

  1. Download the compressed dump: uv run open-dictionary download --output data/raw-wiktextract-data.jsonl.gz

  2. Extract the JSONL file: uv run open-dictionary extract --input data/raw-wiktextract-data.jsonl.gz --output data/raw-wiktextract-data.jsonl

  3. Stream the data into PostgreSQL: uv run open-dictionary load data/raw-wiktextract-data.jsonl --table dictionary_all --column data --truncate (The dictionary_all.data column uses the JSONB type.)

  4. Run the end-to-end pipeline: uv run open-dictionary pipeline --workdir data --table dictionary_all --column data --truncate

  5. Split rows by language code: uv run open-dictionary partition --table dictionary_all --column data --lang-field lang_code

  6. Filter for specific languages:
     • Isolate English and Chinese into dedicated tables with a custom prefix: uv run open-dictionary filter en zh --table dictionary_all --column data --table-prefix dictionary_filtered
     • Create a separate table for every language found: uv run open-dictionary filter all --table dictionary_all --column data

  7. Add word frequency scores: uv run open-dictionary db-commonness --table dictionary_filtered_en (Use --recompute-existing to refresh existing scores.)

  8. Purge low-quality entries: uv run open-dictionary db-clean --table dictionary_filtered_en (This removes rows with zero commonness scores, entries containing digits, and entries with outdated tags.)

  9. Generate learner-focused definitions via LLM: uv run open-dictionary llm-define --table dictionary_filtered_en --source-column data --target-column new_speak (This triggers the define workflow and stores results in the new_speak column as JSONB. The process supports 50 concurrent LLM calls, uses exponential backoff for errors, and resumes automatically if interrupted.)
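The concurrency and retry behaviour described in step 9 can be illustrated with asyncio. This is a minimal sketch of the general technique (a semaphore that caps in-flight calls, plus exponential backoff with a little jitter), not the project's implementation. The function names, default delays, and attempt limit are all assumptions.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Iterable, List

Task = Callable[[], Awaitable[Any]]


async def call_with_backoff(
    task: Task,
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> Any:
    """Retry a coroutine factory, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so retries do not all fire at once.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def run_bounded(
    tasks: Iterable[Task],
    *,
    concurrency: int = 50,
    base_delay: float = 0.5,
) -> List[Any]:
    """Run tasks with at most `concurrency` of them in flight at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(task: Task) -> Any:
        async with semaphore:
            return await call_with_backoff(task, base_delay=base_delay)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(task) for task in tasks))
```

Each task is a zero-argument callable that returns a fresh coroutine, so a failed call can be retried. The source does not say how resuming works; one plausible approach is to skip rows whose target column is already filled.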

Before running any LLM-based commands, set your LLM_MODEL, LLM_KEY, and LLM_API environment variables. The pipeline is built to handle datasets exceeding 10 million rows without stalling.
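For files of that size, a loader usually reads the JSONL in fixed-size batches rather than holding everything in memory. The following sketch shows that batching pattern in plain Python. It is an illustration of the idea, not the load command's actual code, and the name iter_json_batches and the default batch size are assumptions.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def iter_json_batches(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of parsed JSON records, at most batch_size per list."""
    # Lazy generator: each line is parsed only when a batch needs it.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # A file object is itself an iterable of lines, so it streams naturally.
    with open("data/raw-wiktextract-data.jsonl", encoding="utf-8") as handle:
        for batch in iter_json_batches(handle, batch_size=5000):
            print(len(batch))  # a real loader would insert the batch here
```

Memory use stays bounded by the batch size, regardless of how many rows the file contains.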