Semlib: Build LLM Pipelines With Map, Filter, and Sort in Python

September 19, published in Database Tools

Semlib is a Python library designed to build data pipelines powered by Large Language Models (LLMs). Rather than writing manual prompting and parsing logic, you describe your requirements using natural language. Semlib manages the heavy lifting—prompting, output parsing, concurrency, caching, and cost tracking—under the hood. It brings familiar functional programming primitives like map, reduce, sort, and filter to the world of AI.

Why adopt this approach? Deconstructing complex tasks into semantic steps offers several practical advantages:

  • Higher Precision – Breaking tasks into smaller, focused steps allows the model to perform more accurately on each specific segment.
  • Bypass Context Constraints – Each call sees only a slice of the data, so the dataset as a whole never has to fit inside the LLM's context window.
  • Improved Performance – map and reduce operations run concurrently, significantly reducing total execution time.
  • Cost Efficiency – You can assign the most appropriate model to each sub-task, utilizing smaller, cheaper models for simpler operations.
  • Enhanced Privacy – The library supports self-hosted open-source models, ensuring sensitive data never leaves your local infrastructure.
  • Hybrid Flexibility – You can easily mix LLM calls with standard Python code, using each where it is most effective.
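To make the concurrency point concrete, here is a minimal sketch of a concurrent map written with plain asyncio. It is not Semlib's implementation: `fake_llm` and `concurrent_map` are hypothetical names invented for this illustration, and the "model" just uppercases its prompt so the example runs offline.

```python
import asyncio


async def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it only uppercases the prompt."""
    await asyncio.sleep(0.01)  # simulate network latency
    return prompt.upper()


async def concurrent_map(items, template, max_concurrency=4):
    """Apply one prompt template to every item, with a cap on in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:  # at most max_concurrency calls run at once
            return await fake_llm(template.format(item))

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in items))


results = asyncio.run(concurrent_map(["a", "b", "c"], "summarize {}"))
print(results)  # ['SUMMARIZE A', 'SUMMARIZE B', 'SUMMARIZE C']
```

The semaphore is the design choice worth noting: it lets many calls overlap while keeping the request rate within whatever limit the model provider or local server can handle.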

Installation and Quick Start

pip install semlib

Here is a basic implementation:

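# Assumption: these names are importable from the top-level semlib package.
# Note that this `map` shadows Python's built-in map. The snippet uses
# top-level await, so run it in an async REPL (python -m asyncio) or a notebook.
from semlib import Bare, find, map, prompt, sort

# Retrieve a list of U.S. presidents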
presidents = await prompt(
    "Who were the 39th through 42nd presidents of the United States?",
    return_type=Bare(list[str])
)

# Sort them based on political leaning
await sort(presidents, by="right-leaning", reverse=True)
# -> ['Ronald Reagan', 'George H. W. Bush', 'Bill Clinton', 'Jimmy Carter']

# Locate a specific entry
await find(presidents, by="former actor")
# -> 'Ronald Reagan'

# Calculate their age at inauguration
await map(
    presidents,
    "How old was {} when he took office?",
    return_type=Bare(int),
)
# -> [52, 69, 64, 46]

Feeding a massive dataset into a single LLM prompt rarely yields optimal results. Semlib provides a more reliable path: decompose the workload, process it step-by-step, and maintain granular control over the pipeline.
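The decomposition idea can be sketched as a hierarchical reduce: combine a few items per call, then combine the partial results, until one remains. This is an illustration under stated assumptions, not Semlib's actual algorithm. `fake_summarize`, `hierarchical_reduce`, and `MAX_ITEMS_PER_CALL` are hypothetical, and the stand-in "model" simply keeps the longest text in each batch.

```python
import asyncio

MAX_ITEMS_PER_CALL = 3  # hypothetical stand-in for a context-window limit


async def fake_summarize(texts: list[str]) -> str:
    """Hypothetical stand-in for a model call: returns the longest text."""
    assert len(texts) <= MAX_ITEMS_PER_CALL, "batch would not fit in one call"
    await asyncio.sleep(0)  # yield control, as a real network call would
    return max(texts, key=len)


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def hierarchical_reduce(items: list[str]) -> str:
    """Reduce a list of any length to one result, a few items per call."""
    if not items:
        raise ValueError("nothing to reduce")
    while len(items) > 1:
        batches = chunk(items, MAX_ITEMS_PER_CALL)
        # Batches within one round are independent, so they run concurrently.
        items = list(await asyncio.gather(*(fake_summarize(b) for b in batches)))
    return items[0]


reviews = ["ok", "great product", "bad", "works fine",
           "excellent value overall", "meh", "slow"]
result = asyncio.run(hierarchical_reduce(reviews))
print(result)  # excellent value overall
```

No single call ever receives more than `MAX_ITEMS_PER_CALL` items, which is why the total input size stops mattering: seven reviews here take two rounds (three batches, then one).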

Real-World Use Cases

Customer Support – Analyze thousands of support tickets to automatically classify issues and extract key information.

Academic Research – Semantically sort through large collections of abstracts to find and recommend relevant papers.

Sentiment Analysis – Aggregate and synthesize feedback from product reviews at scale.

Content Processing – Filter and extract structured data from resumes, reports, or diverse document sets.