SpikingBrain: 100x Faster LLM Inference via Spike Sparsity

September 9. Published in MoE Models

SpikingBrain redesigns large language models to more closely mirror the functional efficiency of the human brain. By combining hybrid efficient attention, a Mixture of Experts (MoE) module, and spike encoding, the model performs on par with mainstream open-source models while requiring less than 2% of the typical data volume for continual pretraining.

The most immediate advantage is raw speed. When processing sequences as long as 4 million tokens, SpikingBrain reduces the Time to First Token (TTFT) by more than 100x. This is driven by two distinct layers of sparsity. At the micro level, spike-driven computation bypasses more than 69% of standard operations. At the macro level, MoE sparsity provides a second tier of efficiency. For developers and engineers designing next-generation neuromorphic hardware, this architecture serves as a functional blueprint.

SpikingBrain is available in several configurations. Users can access the standard HuggingFace checkpoints, the vLLM-optimized inference version, or the W8ASpike quantized variant. It also runs on more than one hardware platform: the team optimized frameworks, operators, parallel strategies, and communication primitives for MetaX clusters, while the vLLM-HyMeta plugin adds support for NVIDIA GPUs. Meanwhile, W8ASpike explores the limits of low-precision inference through a pseudo-spike technique, a practical bridge toward advanced spiking neural network (SNN) research.

The model underwent continual pretraining on approximately 150 billion tokens, demonstrating a strong capacity for long-context handling and a balance between general knowledge and task-specific performance. Multi-scale sparsity governs the model’s processing rhythm, adjusting spike activity based on the incoming event stream and capping firing rates to maintain a strict balance between performance and efficiency. This approach results in a lower memory footprint and simplified computation, supporting both integer and spike data types.
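As a rough illustration of how activations can turn into sparse spike activity with a capped firing rate, the sketch below maps real-valued activations to signed integer spike counts. It is not code from the SpikingBrain repository: the function names, the threshold rule (mean absolute activation divided by `k`), and the `max_count` cap are all assumptions chosen for illustration.

```python
def spike_counts(activations, k=2.0, max_count=4):
    """Map activations to capped, signed integer spike counts.

    Assumption: the firing threshold adapts to the input as
    mean(|x|) / k, so small activations round to zero spikes.
    """
    if not activations:
        return [], 1.0
    mean_abs = sum(abs(a) for a in activations) / len(activations)
    threshold = mean_abs / k if mean_abs > 0 else 1.0
    counts = []
    for a in activations:
        n = int(round(a / threshold))              # number of spikes for this element
        n = max(-max_count, min(max_count, n))     # cap the firing rate
        counts.append(n)
    return counts, threshold


def sparsity(counts):
    """Fraction of elements that emit no spike at all."""
    return sum(1 for c in counts if c == 0) / len(counts)


acts = [0.0, 0.1, -0.1, 1.0, -2.0, 8.0, 0.0, 0.2]
counts, threshold = spike_counts(acts)
print(counts, round(threshold, 4), sparsity(counts))
```

Elements with a zero count need no downstream work, which is where the savings from spike-driven computation would come from in this simplified picture. The cap bounds the cost of outliers such as the `8.0` above.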

Core Components

vLLM-HyMeta

This plugin integrates HyMeta models into the vLLM inference framework. It enables high-efficiency execution on NVIDIA GPUs while keeping the hardware backend distinct from the vLLM core.

Key advantages include:

  • Clean separation: The backend code remains isolated, ensuring the vLLM core stays organized and readable.
  • Reduced maintenance: vLLM maintainers can update the main framework without worrying about specific hardware backend quirks.
  • Modular integration: New hardware backends can be developed and updated on their own independent schedules.
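The separation listed above relies on vLLM discovering out-of-tree packages through Python entry points rather than through changes to its own source tree. The sketch below uses only the standard library to list whatever is registered under a given entry-point group. `vllm.general_plugins` is the group name documented for vLLM's plugin system; whether vLLM-HyMeta registers under that group, and under which name, is an assumption to verify against its packaging metadata. The helper name `list_plugins` is hypothetical.

```python
from importlib.metadata import entry_points


def list_plugins(group="vllm.general_plugins"):
    """Return the sorted names of installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry-point collection
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict keyed by group name
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# An empty list means no plugin is installed in this environment.
print(list_plugins())
```

Running this inside the container after installation is a quick way to confirm that the plugin is visible to Python before starting a vLLM server.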

W8ASpike

W8ASpike is the quantized version of SpikingBrain-7B. Its primary goal is to minimize inference costs in low-precision environments while exploring the practical applications of spiking neural networks.

The current iteration employs a pseudo-spike mechanism, where activations are compressed into spike-like signals at the tensor level. This is not a true asynchronous, event-driven spike as found on native neuromorphic silicon; true spike hardware requires specialized asynchronous operators and event-based chips, which are outside the scope of this repository. The pseudo-spike approach is nonetheless an effective prototyping tool that provides a fast approximation. The activation encoding is inspired by the BICLab/Int2Spike interface; those seeking additional PyTorch spike utilities may find that library a helpful resource.
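To make the "spike-like signals at the tensor level" idea concrete, here is a minimal sketch that expands signed integer counts into a ternary spike train over a fixed number of time steps and then sums it back. It does not reproduce the W8ASpike code or the Int2Spike API; the function names and the ternary {-1, 0, 1} layout are assumptions for illustration. Everything is computed synchronously on dense lists, which is exactly what makes it "pseudo" rather than event-driven.

```python
def to_spike_train(counts, num_steps):
    """Expand signed integer counts into `num_steps` ternary frames.

    Element i fires sign(counts[i]) during the first |counts[i]| steps
    and stays silent (0) afterwards.
    """
    if any(abs(c) > num_steps for c in counts):
        raise ValueError("num_steps is too small to represent every count")
    train = []
    for t in range(num_steps):
        frame = [(1 if c > 0 else -1) if abs(c) > t else 0 for c in counts]
        train.append(frame)
    return train


def from_spike_train(train):
    """Recover the integer counts by summing spikes over time."""
    return [sum(column) for column in zip(*train)]


counts = [0, 2, -3, 1]
train = to_spike_train(counts, num_steps=3)
for frame in train:
    print(frame)
print(from_spike_train(train) == counts)
```

Because the round trip is lossless, a prototype can compute on integer counts for speed and still reason about the equivalent spike train that neuromorphic hardware would process.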

Deployment and Installation

NVIDIA Container Setup

You can quickly deploy the environment using the following Docker command:

sudo docker run -itd \
    --entrypoint /bin/bash \
    --network host \
    --name hymeta-bench \
    --shm-size 160g \
    --gpus all \
    --privileged \
    -v /host_path:/container_path \
    --env "HF_ENDPOINT=https://hf-mirror.com" \
    docker.1ms.run/vllm/vllm-openai:v0.10.0

Plugin Installation

Clone the repository and install the vLLM plugin directly:

git clone https://github.com/BICLab/SpikingBrain-7B.git
cd SpikingBrain-7B/vllm-hymeta
pip install .

For optimal installation on NVIDIA GPUs, ensure the following dependencies are met:

  • decorator
  • pyyaml
  • scipy
  • setuptools
  • setuptools-scm
  • flash_attn==2.7.3
  • flash-linear-attention==0.1
  • vllm==0.10.0
  • torch==2.7.1
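One way to confirm that an environment matches the pinned versions above is to compare them with the installed distribution metadata. The sketch below uses only the standard library; the `PINNED` dictionary and the `check_pins` helper are hypothetical names, and the comparison is a plain string match after dropping any local build suffix (for example `+cu126`), so a version reported as `0.1.0` would be flagged against a pin of `0.1`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list; unpinned packages are omitted.
PINNED = {
    "flash_attn": "2.7.3",
    "flash-linear-attention": "0.1",
    "vllm": "0.10.0",
    "torch": "2.7.1",
}


def check_pins(pins):
    """Return {name: "ok" | "missing" | "mismatch (...)"} for each pin."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        base = installed.split("+", 1)[0]  # drop local build tag
        if base == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (installed {installed})"
    return report


for name, status in check_pins(PINNED).items():
    print(f"{name}: {status}")
```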

Model Resources

Model weights are hosted on ModelScope. Select the version that best matches your specific workload requirements: