QSV: Slice, Query, and Clean Massive CSV Files with High Performance

6月9日 Published inData Processing Tools

QSV is a powerful command-line toolkit designed for data processing. It efficiently handles CSV files, spreadsheets, databases, and Parquet formats, allowing users to query, slice, index, filter, transform, validate, and join datasets directly from the terminal.

Core Commands

Command Functionality
apply Executes string, date, mathematical, or currency operations on a column. It also supports basic NLP tasks like similarity scoring and sentiment analysis.
count Counts rows and generates column width statistics using Polars-driven multithreading.
diff Identifies differences between two CSV files. It can process one million rows in approximately 600ms.
geocode Converts addresses into geographic coordinates using local Geonames and MaxMind databases. It processes up to 360,000 records per second and caches results.
joinp Joins datasets using the Polars engine, supporting non-equi joins and "asof" joins for time-series data.
sqlp Executes Polars SQL queries on CSV files that exceed available RAM, with support for multiple output formats.
stats Generates comprehensive statistics, including percentiles, standard deviation, sparsity, and automated type inference.
validate Validates a CSV against a JSON Schema, checking custom formats and ensuring composite key uniqueness.

Installation

Several installation methods are available to suit different environments.

Pre-built Binaries

Binaries are available for Linux, macOS, and Windows. CPU-optimized builds are provided for x86_64 and Apple Silicon, utilizing SSE4.2, AVX2, or ARM64 NEON instructions where supported. Windows users can use an MSI installer that automatically adds QSV to the system path. On macOS, the quarantine flag must be cleared manually using: xattr -d com.apple.quarantine qsv.

The tool can be kept up to date with the qsv --update command, which automatically verifies package signatures.

Package Managers

  • Arch Linux: yay -S qsv
  • Homebrew (macOS/Linux): brew install qsv
  • Windows Scoop: scoop install qsv
  • Ubuntu/Debian: Import the GPG key and add the official repository.

Building From Source

Building from source requires the Rust toolchain. Run the following command:

cargo install qsv --locked --features all_features

Specific features can be toggled using the --features flag. Slimmer variants are also available:

  • qsvlite: A stripped-down version, roughly 13% the size of the full binary.
  • qsvdp: Specialized for DataPusher+, containing only essential commands and a lightweight applydp function.

Capabilities and Performance

Performance

The index command creates a dedicated index file for a CSV. Once indexed, operations like slicing and counting run in constant time. For example, a 15GB dataset containing 28 million rows can be indexed in just 14 seconds.

Commands marked with 🚀 utilize multiple threads even without an index, while those marked with 🏎️ switch to multithreaded mode once an index is detected. Most commands stream data, ensuring they never load an entire file into memory. Giant files can be deduplicated using extdedup, which employs a memory-mapped hash table.

Format Support

QSV provides native support for CSV, TSV, SSV, JSON, JSONL, Excel, and ODS. When Polars features are enabled, it also supports Parquet and Arrow IPC. Additionally, Snappy-compressed files (.sz) are decompressed automatically on the fly.

Output options are equally flexible, including Parquet, PostgreSQL, SQLite, Excel, and Data Packages. The sqlp command can also output data in JSON or Avro formats. QSV expects UTF-8 encoding by default; on Windows, setting the QSV_OUTPUT_BOM environment variable adds a Byte Order Mark to prevent Excel from mangling text.

Scripting and Integration

  • luau: An embedded DSL for building complex data pipelines within QSV. It includes helper functions for queries and aggregation, performing faster than Python without requiring an external runtime.
  • py: Allows the use of Python expressions for filtering rows or creating new columns (requires Python 3.8+).
  • fetch / fetchpost: Interacts with web services. This can be combined with Jaq JSON queries and MiniJinja templates, with response caching available via Redis or local disk.
  • geoconvert: Facilitates data conversion between GeoJSON, SHP, CSV, and SVG formats.

QSV Pro

A premium tier offers a graphical user interface, an API, CKAN integration, and support for natural language queries. This version is verified against Windows Defender and is available via the official website or the Microsoft Store.

Security

Built in Rust, QSV is audited for vulnerabilities. Pre-built binaries are signed using zipsign, and data integrity is maintained through JSON Schema validation and composite key checks.

Real-World Use Cases

Big Data Analysis

Using the NYC 311 dataset (15GB, 28 million rows) as a benchmark, QSV completes indexing in 14 seconds. With the index active, the stats command runs in just 7.3 seconds. Commands can be chained using standard pipes; for example: qsv slice -i 5 data.csv | qsv flatten.

Data Cleaning

The dedup command performs in-memory deduplication, while extdedup is designed for files that exceed available RAM. Sorting options include alphabetical, numeric, or random ordering, and sortcheck can verify if a file is already ordered. The input command standardizes malformed CSVs, and fmt allows for the adjustment of delimiters and quoting rules.

Geocoding

The geocode command resolves addresses against a local database, supporting place names in 254 languages. Because it runs locally, it is not subject to API rate limits.

Interactive Exploration

The lens command opens an interactive viewer powered by the csvlens engine, allowing users to filter and scroll through data in real time.

Practical Tips

  • When processing files larger than the system's memory, prioritize commands with the ext prefix (e.g., extsort, extdedup) to avoid memory exhaustion.
  • To reduce the binary size, compile with specific --features and omit the python feature if it is not required.
  • Use luau for complex logic; it runs natively within QSV, has no external dependencies, and starts up faster than Python.