Date: 2023-12-25

Researchers have developed a new inference system called PowerInfer, aimed at improving the performance of large language models (LLMs) on local deployments using consumer-grade GPUs. PowerInfer addresses the challenge of high memory requirements associated with LLMs while prioritizing low latency and reducing inference costs.

The key idea behind PowerInfer is to exploit the power-law distribution of neuron activations observed during LLM inference. Under this distribution, a small fraction of “hot” neurons activate consistently across different inputs, while the majority of “cold” neurons activate only for specific inputs. PowerInfer uses this insight by preloading hot-activated neurons onto the GPU for fast access and computing cold-activated neurons on the CPU. This split significantly reduces GPU memory requirements and minimizes data transfers between the CPU and GPU.
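The hot/cold split can be illustrated with a minimal sketch. It is not PowerInfer's actual placement policy: the function names, the `gpu_slots` budget, and the activation counts are all hypothetical. The sketch assumes a per-neuron activation count has already been collected on some calibration inputs.

```python
def partition_neurons(activation_counts, gpu_slots):
    """Split neuron indices into hot (GPU) and cold (CPU) sets.

    activation_counts: how often each neuron fired during profiling
    gpu_slots: hypothetical number of neurons that fit in GPU memory
    """
    # Rank neuron indices from most to least frequently activated.
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = sorted(ranked[:gpu_slots])
    cold = sorted(ranked[gpu_slots:])
    return hot, cold


def coverage(activation_counts, hot):
    """Fraction of all observed activations handled by the hot set."""
    total = sum(activation_counts)
    if total == 0:
        return 0.0
    return sum(activation_counts[i] for i in hot) / total


# Made-up profile with a power-law-like shape: a few neurons dominate.
counts = [900, 5, 3, 750, 2, 1, 8, 600, 4, 2]
hot, cold = partition_neurons(counts, gpu_slots=3)
print("GPU (hot):", hot)
print("CPU (cold):", cold)
print("share of activations served from GPU:", round(coverage(counts, hot), 3))
```

In this toy profile, three of the ten neurons account for almost all activations, which is why keeping only those on the GPU can cover most of the work while using a fraction of the memory.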

In addition to splitting the workload between the CPU and GPU, PowerInfer integrates adaptive predictors and neuron-aware sparse operators to further optimize performance. Adaptive predictors forecast at runtime which neurons will be active for the current input. Neuron-aware sparse operators then compute on those individual neurons directly, eliminating the need to operate on entire matrices, so work on inactive neurons is skipped.
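A rough sketch of the neuron-level idea follows, assuming a ReLU feed-forward layer in which each row of the weight matrix corresponds to one neuron. The `active` list stands in for a predictor's output and is hard-coded here; the function names are illustrative and do not come from PowerInfer's code.

```python
def dense_relu_layer(W, x):
    """Reference: compute every neuron, then apply ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]


def sparse_relu_layer(W, x, active):
    """Compute only the neurons predicted to be active.

    Neurons outside `active` are left at 0.0 without touching their
    weights, which matches ReLU's output when the prediction is right.
    """
    out = [0.0] * len(W)
    for i in active:
        out[i] = max(0.0, sum(w * v for w, v in zip(W[i], x)))
    return out


# Toy layer with four neurons (rows) and two inputs.
W = [
    [1.0, 0.0],
    [-1.0, -1.0],
    [0.5, 0.5],
    [-2.0, 0.0],
]
x = [1.0, 2.0]

# Hypothetical predictor output: neurons 0 and 2 are expected to fire.
active = [0, 2]

dense = dense_relu_layer(W, x)
sparse = sparse_relu_layer(W, x, active)
print(dense)
print(sparse)
```

Here the sparse version touches only two of the four weight rows yet produces the same output as the dense version. If the predictor misses a neuron that would have fired, the output would differ, which is why predictor accuracy matters for preserving model quality.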

In evaluation, PowerInfer achieved an average token generation rate of 13.20 tokens per second and a peak of 29.08 tokens per second on a single consumer-grade GPU (an NVIDIA RTX 4090). This is only 18% lower than the rate achieved with a top-tier server-grade A100 GPU, showcasing the effectiveness of PowerInfer on mainstream hardware.

Compared to existing systems such as llama.cpp, PowerInfer runs up to 11.69 times faster while retaining model accuracy. This improvement in LLM inference speed makes PowerInfer a promising solution for running advanced language models on desktop PCs with limited GPU capabilities.

The development of PowerInfer offers hope for researchers and practitioners working with LLMs. By improving inference performance on local systems, PowerInfer opens up opportunities for enhanced data privacy, customizable models, and reduced inference costs. As more applications rely on language models, advancements like PowerInfer are crucial for ensuring efficient and accessible model execution.