Large language models have advanced natural language processing significantly, but auto-regressive generation limits their speed. Each generated token requires its own forward pass, and the repeated memory access this entails slows generation. Speculative sampling has emerged as a solution, but it requires two models that share the same tokenizer, which introduces bottlenecks of its own. In collaboration with Apple, EPFL researchers have introduced a new approach called Parallel Speculative Sampling (PaSS), which eliminates the need for a second model.

PaSS drafts multiple tokens simultaneously with a single model, combining the benefits of auto-regressive generation and speculative sampling. Generation proceeds in two phases: drafting and validation. During the drafting phase, the model produces several tokens at once through parallel decoding. The first of these tokens is excluded from the draft so that the output distribution still matches that of the model in case of rejection. During the validation phase, the same model checks the drafted tokens and keeps only those it accepts. Because no second model is needed, PaSS gains speed while maintaining overall model quality.
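The two phases can be sketched with a toy example. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: `model_next` is a hypothetical deterministic rule standing in for a model's greedy next token, the draft is a naive guess rather than the output of real parallel decoding, and the validation loop stands in for what would be a single forward pass. All function names are invented for this sketch.

```python
from typing import List, Tuple

Token = int


def model_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the model's greedy next token."""
    step = 3 if len(context) % 4 == 0 else 1
    return (context[-1] + step) % 10


def draft_pass(context: List[Token], lookahead: int) -> Tuple[Token, List[Token]]:
    """Drafting phase (toy): one exact token plus `lookahead` guessed tokens."""
    first = model_next(context)  # exact token, kept out of the draft
    draft, prev = [], first
    for _ in range(lookahead):
        prev = (prev + 1) % 10   # naive guess: the sequence keeps counting up
        draft.append(prev)
    return first, draft


def validate_pass(context: List[Token], first: Token, draft: List[Token]) -> List[Token]:
    """Validation phase (toy): keep draft tokens until the first mismatch.

    A real model would score every draft position in one forward pass;
    this loop only simulates that result.
    """
    accepted = [first]
    for guess in draft:
        truth = model_next(context + accepted)
        if guess != truth:
            accepted.append(truth)  # replace the rejected guess and stop
            break
        accepted.append(guess)
    return accepted


def draft_and_validate(prompt: List[Token], num_tokens: int, lookahead: int) -> Tuple[List[Token], int]:
    """Generate `num_tokens` tokens, counting two passes per iteration."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        first, draft = draft_pass(out, lookahead)
        out.extend(validate_pass(out, first, draft))
        passes += 2  # one drafting pass + one validation pass
    return out[: len(prompt) + num_tokens], passes


def autoregressive(prompt: List[Token], num_tokens: int) -> Tuple[List[Token], int]:
    """Baseline: one pass per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(model_next(out))
    return out, num_tokens


if __name__ == "__main__":
    print(autoregressive([0], 12))
    print(draft_and_validate([0], 12, lookahead=3))
```

In this toy setting both functions return the same sequence, but the draft-and-validate loop counts fewer passes because most guesses are accepted. The pass counts are a property of the made-up rule and say nothing about real speed-ups.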

Comparative evaluations against auto-regressive generation and a baseline method demonstrate PaSS’s superior speed and performance. In tests on text and code completion tasks, PaSS shows promising results without compromising overall model quality. Notably, PaSS generates tokens with lower variance and higher predictability than baselines that use different sampling schemes.

The number of look-ahead steps also plays a crucial role in PaSS performance. The study found that running time decreases as look-ahead steps are added, up to 6 steps. The researchers therefore recommend further investigation into how the number of look-ahead steps affects PaSS, since a larger number of steps might negate the approach’s benefits.
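One way to see why more look-ahead steps can stop paying off is a back-of-the-envelope cost model. The sketch below is purely illustrative and is not taken from the study: it assumes each draft token is accepted independently with a hypothetical probability `accept_prob`, and that each extra look-ahead step makes a forward pass more expensive by a hypothetical fraction `overhead_per_step`. Neither value is a measurement.

```python
def expected_tokens(accept_prob: float, lookahead: int) -> float:
    """Expected tokens kept per iteration: the first token plus accepted drafts.

    Draft token i is kept only if all i drafts up to and including it are
    accepted, which happens with probability accept_prob ** i (assumption).
    """
    return sum(accept_prob ** i for i in range(lookahead + 1))


def iteration_cost(lookahead: int, overhead_per_step: float) -> float:
    """Cost of one draft + validate iteration, in units of a plain forward pass."""
    return 2.0 * (1.0 + overhead_per_step * lookahead)


def tokens_per_cost(lookahead: int, accept_prob: float, overhead_per_step: float) -> float:
    """Throughput proxy: expected tokens per unit of compute cost."""
    return expected_tokens(accept_prob, lookahead) / iteration_cost(lookahead, overhead_per_step)


def best_lookahead(accept_prob: float, overhead_per_step: float, max_lookahead: int) -> int:
    """Look-ahead count that maximizes the throughput proxy under this toy model."""
    return max(
        range(max_lookahead + 1),
        key=lambda n: tokens_per_cost(n, accept_prob, overhead_per_step),
    )


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the shape of the curve.
    for n in range(0, 9):
        print(n, round(tokens_per_cost(n, accept_prob=0.6, overhead_per_step=0.1), 3))
```

Under these assumptions the throughput proxy rises for the first few look-ahead steps and then falls, because later draft tokens are rarely accepted while every step adds cost. Where the peak sits depends entirely on the assumed parameters.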

PaSS is a language model generation technique that eliminates the need for a second model, resulting in a significant speed-up compared to auto-regressive generation. It generates tokens with low variance and high predictability, making it effective for text and code completion tasks. Ongoing research aims to improve the quality of parallel generation with look-ahead tokens and to explore further gains in PaSS performance.

