vLLM: A faster inference pipeline for LLMs paper explained

Astarag Mohapatra
13 min read · Oct 26, 2023

In this article, we will go over the vLLM paper, titled "Efficient Memory Management for Large Language Model Serving with PagedAttention". We will explain the paper in detail and occasionally go off on tangents to cover some background concepts.

Before jumping into the paper, we need some prerequisites on the KV cache. I would suggest you go over videos (I) and (II) for a better illustration of the KV cache. If you want to understand transformers better, I would suggest these videos (I), (II), and (III). I will assume that you know the basics of transformers and how multi-head attention works.

During inference from a transformer, the query vectors arrive incrementally and sequentially. The new token's query is multiplied by the key matrix to get the attention scores of that token against the previously generated tokens and itself. Then, after taking a softmax, we multiply by the value matrix to get the self-attention output. Finally, an output projection matrix transforms this output for the next set of multi-head attention layers. This computation repeats across the stacked layers, and at the end we get a probability distribution over all the words in our vocabulary.

Taken from this source
  • In the above diagram, you can see inference from a transformer. The tokens TOKEN 1 to TOKEN 4 arrive sequentially, as the attention computation for TOKEN 4 depends on all previous tokens.
  • In the purple-colored matrix, you can see that the Q and K matrix multiplication, and hence the attention matrix, grows with each new token, while the K and V rows for all previous tokens stay the same. Also, as depicted in the picture, we don’t need the already-computed attention scores (the caveat is that we may need them with beam search, but here we are only considering greedy sampling), so we can throw them away. The deep purple part of the matrix is actually zero because attention is causal: the first token never attends to the fourth token, so those entries are masked.
  • So we can cache the K and V matrices, as they are not going to change, but we cannot cache the Q matrix, because it changes with each new token. In the usual terminology, the query is what the current token is looking for, the key is what each token contains, and the value is what a token can offer once another token attends to it.
  • Another way to understand queries, keys, and values is with a movie database. Suppose you want to see a movie that makes you laugh and has a whodunnit at the end (the query). First we query the database for a movie that makes us laugh, i.e. a comedy (the key), and we get back a bunch of comedy movie recommendations (the value). After that, the database gets the extra information that the movie should also be a whodunnit, or of the thriller genre. It then looks for the comedy and thriller genres (the updated key), and with the cached comedy recommendations it can search for movies that are also thrillers (the value).
  • Hence, we can cache the comedy genre and all the comedy movie recommendations, so that when new information comes in (the thriller genre) we can narrow down our search and be more efficient.

Hence, the KV cache is essential for efficient inference, as we incrementally store the key and value matrices and cache them so that future attention scores can be computed faster.
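To make this concrete, here is a minimal single-head sketch of greedy-style decoding with a KV cache in NumPy. The dimension d, the random weight matrices, and the decode_step helper are all made up for illustration; a real transformer has many heads, many layers, and learned weights.

import numpy as np

d = 8  # hypothetical head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # toy projection weights

K_cache, V_cache = [], []  # grows by one row per generated token

def decode_step(x):
    """One decoding step: x is the embedding of the newest token (shape (d,))."""
    q = x @ Wq                     # query for the new token only
    K_cache.append(x @ Wk)         # cache the key ...
    V_cache.append(x @ Wv)         # ... and the value for this token
    K = np.stack(K_cache)          # (t, d): keys of all tokens so far
    V = np.stack(V_cache)          # (t, d): values of all tokens so far
    scores = q @ K.T / np.sqrt(d)  # attention of the new token over the past
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()       # softmax
    return weights @ V             # attention output for the new token

for t in range(4):                 # TOKEN 1 ... TOKEN 4 arrive one by one
    x = rng.normal(size=d)         # stand-in for the token's embedding
    out = decode_step(x)
    print(f"step {t + 1}: attended over {len(K_cache)} cached tokens")

Note how each step only computes one new query but reuses every cached key and value; without the cache, the keys and values of all previous tokens would have to be recomputed at every step.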

SOME ISSUES WITH KV CACHE

  • These KV caches are generally stored in contiguous memory. If we have multiple parallel requests, their caches need to be stored separately, which wastes memory and can lead to an OOM (Out of Memory) error. Also, the prompts for these requests are often nearly the same (particularly the system prompt like “You are a helpful assistant…”), so storing them again and again in contiguous memory is inefficient.
From the paper vLLM
  • You can see that the static model weights consume almost 65% of the VRAM and the KV cache consumes almost 30%, as it grows with the number of requests and with inefficient use of memory. And if we store the KV cache in contiguous memory, it needs to be de-allocated after serving a request to make room for the KV cache of newer requests.
  • If we want to generate multiple responses in parallel that share the same initial prompt, we need to store a separate contiguous copy for each response, which wastes a lot of space. Also, with advanced techniques like beam search, we pick the most likely tokens based on the cumulative probabilities of the generated sequences. Here we need to backtrack and kill some pathways, so if we assign new contiguous memory for every direction that beam search explores, we consume a lot of memory inefficiently.
  • GPUs have gotten very good at matrix multiplication, but their memory is still limited, so inference is memory-bound. The KV cache helps because it lets us fetch the key and value matrices faster for the computation, but with limited memory we need better memory management.

Excerpt from the paper

  • Due to the sequential nature of inference, we have to reserve memory (pre-allocation) for the KV cache, as it needs to be contiguous. We can reserve it according to the maximum output sequence length, such as 2048 tokens. This creates a bottleneck and is an inefficient use of memory space: if we generate a short response, we waste the rest of the reserved space.
Taken from the paper vLLM
  • Here there are two types of fragmentation: (i) internal and (ii) external. Internal fragmentation comes from unused memory within a reservation, because we allocate more memory than is required. External fragmentation comes from the buddy allocation algorithm (used in Orca), where pre-allocated chunks come in powers of 2 (4, 8, 16, 32, etc.), so small gaps between reservations can never be used. In the buddy allocation algorithm, the scheduler assigns a request the smallest power-of-2 chunk that fits it (if a response needs 5 tokens, it gets a chunk of 8). But since we don’t know the output length beforehand, this is inefficient; see the short calculation below.

Excerpt from the paper
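To put rough numbers on the two reservation strategies, here is a tiny illustrative calculation (the 2048 maximum length and the sample output lengths are hypothetical):

# A rough, hypothetical illustration of internal fragmentation.
MAX_LEN = 2048                      # pre-allocating for the maximum output length

def next_pow2(n):
    """Smallest power of two >= n (buddy-style allocation)."""
    p = 1
    while p < n:
        p *= 2
    return p

for actual_tokens in (13, 100, 600):
    reserved_max = MAX_LEN
    reserved_buddy = next_pow2(actual_tokens)
    print(f"output of {actual_tokens:4d} tokens: "
          f"max-length reservation wastes {reserved_max - actual_tokens} slots, "
          f"power-of-two reservation wastes {reserved_buddy - actual_tokens} slots")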

PAGED ATTENTION

This paper solves the following issues:

  1. As the transformer is auto-regressive, we need non-contiguous caching of the key and value matrices, and we should allocate it efficiently in GPU VRAM
  2. As the KV cache grows with the inference batch size, we need non-contiguous memory allocation and intelligent allocation and de-allocation of cache memory

BEFORE JUMPING INTO PAGED ATTENTION, LET’S TAKE A TANGENT

VIRTUAL MEMORY AND PAGING

Taken from this video
  • You can see that without virtual memory, the program address space (where our code allocates space for functions and variables) can run out of room in the RAM address space. Virtual memory is more flexible because it DOES NOT need to map addresses in a contiguous manner (the problem we mentioned earlier), and we have the flexibility to map pages back to disk (in our inference pipeline, KV blocks can analogously be swapped to the CPU if the GPU VRAM runs out of memory).
  • In the above picture we have a one-to-one mapping, where each program address maps to one address in RAM. This becomes intractable as the RAM space grows. Hence, we use a page table that maps fixed-size pages: a chunk of addresses (e.g. 4 KB) in the program address space maps to an equally sized frame in the RAM address space, which keeps the table tractable.
  • Also, the program space is contiguous, going from 0 → 1 → 2 → 3, but the mapping to physical memory is non-contiguous, as a page can be mapped anywhere. The toy example below shows this translation.
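A toy dictionary-based page table makes the analogy concrete: contiguous virtual page numbers map to arbitrary, non-contiguous physical frames. The 4 KB page size and the frame numbers below are just illustrative.

PAGE_SIZE = 4 * 1024                      # 4 KB pages, as in the example above

# Contiguous virtual pages 0, 1, 2, 3 mapped to scattered physical frames.
page_table = {0: 7, 1: 3, 2: 11, 3: 0}

def translate(virtual_address):
    """Translate a virtual address into a physical one via the page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]              # the lookup the MMU performs
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))             # virtual page 1 -> physical frame 3 -> 0x3abc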

Now, coming back to Paged Attention

  • Paged attention allows storing continuous keys and values in non-contiguous memory space. Instead of storing the KV matrices contiguously, it stores blocks of KV matrices, where each block contains the key and value vectors for a fixed number of tokens. In a footnote, the authors mention that for easier implementation they store the key and value matrices of the different heads and layers of a token block together in one block.

Excerpt from the paper

  • Hence, we can store these KV matrices in blocks and, taking inspiration from page tables, map the blocks to non-contiguous physical memory: contiguous logical KV blocks are mapped to non-contiguous physical memory. As we generate tokens, the PagedAttention kernel fetches the KV blocks (via the block table, which lists the blocks of a sequence contiguously) even though they are stored non-contiguously in physical memory.
  • One caveat is that the scheduler needs to reserve memory on the device (GPU or CPU) beforehand so that it can dynamically map the KV cache into this reserved space. Hence, you sometimes see errors like “PyTorch tried to allocate an additional ___ GB/MB of memory but couldn’t”. In vLLM, this is controlled by the following parameter:
gpu_memory_utilization: "The ratio (between 0 and 1) of GPU memory to
reserve for the model weights, activations, and KV cache. Higher
values will increase the KV cache size and thus improve the model's
throughput. However, if the value is too high, it may cause
out-of-memory (OOM) errors."
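For instance, the reservation can be set when constructing the engine. The sketch below uses the small facebook/opt-125m model purely as a stand-in, and 0.9 happens to be vLLM's default value.

from vllm import LLM

# Reserve 90% of the GPU's VRAM for the model weights, activations,
# and the paged KV cache (0.9 is also vLLM's default).
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)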

SO IN SUMMARY

  • The vLLM scheduler maps logical memory (the sequentially generated, contiguous KV cache) via a block table to non-contiguous memory on the GPU, and to CPU memory for swapping (similar to virtual memory, where the page table maps to disk if RAM is not available).
  • So when we are inferencing, we ask the contiguous logical memory for the KV blocks, it fetches the cached KV matrices from GPU or CPU memory, and then we use them to compute attention.

A WALKTHROUGH OF THE DECODING PROCESS

Taken from the paper vLLM
  • This diagram looks similar to our virtual memory diagram. The logical KV blocks (like logical memory) map via the block table (like the page table) to physical KV blocks in GPU memory (like physical memory). You can also see that the physical blocks are stored non-contiguously, which solves the problem we mentioned earlier: if requests have dynamic sizes, we don’t need to reserve memory space up front, we can allocate it on the fly. Also, note that the block size here is 4.
  • Once the sequence is generated and we reach <EOS>, we can de-allocate its logical and physical memory. The same scheme also handles multiple sequences. We can see in the diagram below that when the next token after “brought” is generated, it will be placed in the next free slot of block 3. A small sketch of this bookkeeping follows the figure.
Taken from the paper vLLM
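Here is a minimal sketch of that bookkeeping: a toy block manager (the TinyBlockManager name and structure are hypothetical, not vLLM's actual classes) that maps a sequence to physical blocks of 4 tokens, allocates a new block only when the last one fills up, and returns everything to the free pool at <EOS>.

BLOCK_SIZE = 4                                 # tokens per KV block, as in the figure

class TinyBlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # pool of free physical blocks
        self.block_table = {}                          # seq_id -> [physical block ids]
        self.lengths = {}                              # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                        # last logical block is full
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):                   # called when <EOS> is generated
        self.free.extend(self.block_table.pop(seq_id))
        del self.lengths[seq_id]

mgr = TinyBlockManager(num_physical_blocks=8)
for _ in range(7):                                     # "Four score and seven years ago our"
    mgr.append_token(seq_id=0)
print(mgr.block_table[0])                              # two physical blocks, e.g. [7, 6]
mgr.free_sequence(0)                                   # <EOS>: both blocks return to the pool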

vLLM FOR OTHER DECODING STRATEGIES

  • Remember that we discussed different decoding strategies, like parallel sampling, where the request asks for multiple responses to the same prompt, and beam search? Now we will discuss how to do them with vLLM.

PARALLEL DECODING

Taken from the vLLM paper

Let’s go step by step with this

  • The sequence generated till now (common sequence for both samples A1 and A2) is “Four score and seven years ago our”
  • “Four score and seven” is mapped from block 0 in logical memory to block 7 in physical memory, and “years ago our” from block 1 in logical memory to block 1 in physical memory
  • As it is parallel decoding and one physical block can be mapped by multiple logical blocks (an N (logical) : 1 (physical) mapping), another variable called the reference count is maintained for blocks 1 and 7 in physical memory. Currently the reference count is 2, as we are generating two samples
  • Now, when the two samples generate their next tokens, sample A1 needs to write into physical block 1, but its reference count is 2, so vLLM forks it: it copies the contents of block 1 into a new physical block 3 (copy-on-write), decreases the reference count of block 1 from 2 to 1, and writes A1’s new token (“fathers”) into block 3. Sample A2 then writes its token (“mothers”) directly into block 1, since the reference count is now 1 and no further copy is needed

SO A SHARED BLOCK (HERE BLOCK 1) IS ONLY WRITTEN IN PLACE ONCE ITS REFERENCE COUNT DROPS TO 1. UNTIL THEN, EVERY WRITE TO IT TRIGGERS A COPY-ON-WRITE THAT ALLOCATES A NEW PHYSICAL BLOCK.

  • Also, this applies to shared prefixes, where we have a common system prompt for all queries. We can map it to the same physical blocks, and as new sequences are generated, the copy-on-write mechanism takes over. A small sketch of the reference-counting logic is shown below.
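A sketch of the copy-on-write rule, with hypothetical names rather than vLLM's implementation: a shared block is copied on write while its reference count is greater than 1, and the last owner writes in place.

ref_count = {7: 2, 1: 2}            # physical blocks shared by samples A1 and A2
free_blocks = [3, 4, 5]             # some free physical blocks

def write_last_block(sample, block):
    """Write the newly generated token of `sample` into its last KV block."""
    if ref_count[block] > 1:        # block is still shared -> copy-on-write
        new_block = free_blocks.pop(0)
        ref_count[block] -= 1       # the writer gives up its reference ...
        ref_count[new_block] = 1    # ... and owns a fresh copy instead
        print(f"{sample}: copied block {block} -> {new_block}, wrote the token there")
        return new_block
    print(f"{sample}: sole owner, wrote the token directly into block {block}")
    return block

a1_block = write_last_block("A1", 1)   # ref count 2 -> copy-on-write into block 3 ("fathers")
a2_block = write_last_block("A2", 1)   # ref count now 1 -> in-place write ("mothers")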

BEAM SEARCH

Taken from the vLLM paper
  • The beam search algorithm builds on the same machinery. For a beam width of k = 4, after each step we take the cumulative probabilities (the product of the token probabilities along each candidate sequence) and keep the k best candidates. In the figure, blocks 2, 4, 5, and 8 get de-allocated and their physical memory is freed; since memory is mapped non-contiguously, these freed blocks can be reused immediately.
  • Furthermore, the multiple candidate sequences, for example those in blocks 9, 10, 11, and 12, share a lot of memory (blocks 0, 1, 3, 6), so they are grouped and scheduled together. This is called a sequence group, and the group is gang-scheduled.

EVICTION AND SWAPPING

  • As the tokens are stored in blocks, eviction is all-or-nothing: either all the blocks of a sequence are evicted and swapped to CPU memory, or none are. In swapping, the block table maps the evicted blocks to CPU memory.
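A sketch of the all-or-nothing rule, with made-up structures rather than vLLM's code: when GPU blocks run out, every block of a preempted sequence is moved to CPU memory together, so the block table only ever points at one device per sequence.

gpu_blocks = {"seq_0": [7, 1, 3], "seq_1": [2, 6]}   # physical block ids on the GPU
cpu_blocks = {}                                       # swap space in host memory

def swap_out(seq_id):
    """Evict *all* blocks of a sequence to CPU memory (all-or-nothing)."""
    cpu_blocks[seq_id] = gpu_blocks.pop(seq_id)

def swap_in(seq_id):
    """Bring the whole sequence back onto the GPU."""
    gpu_blocks[seq_id] = cpu_blocks.pop(seq_id)

swap_out("seq_1")        # frees blocks 2 and 6 on the GPU for other sequences
swap_in("seq_1")         # brought back later, when GPU memory is available again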

DISTRIBUTED EXECUTION

  • Another point to note is that we can distribute execution across multiple GPU workers
Taken from the paper vLLM
  • Suppose we have N attention heads in a layer; the heads can be partitioned across GPU workers in an SPMD (Single Program Multiple Data) fashion, each worker computing the attention for its own heads before the results are combined to produce the next token. Note that there are stacked transformer layers too, so this process runs sequentially layer by layer, and we store the KV cache for each layer. By stacked layers I mean the Nx in the diagram below.
Transformer architecture from the Vaswani et al. paper
  • The scheduler broadcasts the block table and the input token IDs for each request in the batch to the GPU workers → the GPU workers run the computation for one multi-head attention decoder block → they exchange their partial results with an all-reduce communication, without involving the scheduler → then they compute the next layer, and this continues through all the intermediate layers up to Nx → finally the sampled next tokens are returned to the scheduler, which updates the block table and memory configuration.
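The reason a single all-reduce per attention layer is enough is that concatenating the heads and applying the output projection equals summing per-worker partial projections. The NumPy sketch below simulates the workers in one process just to check that identity; the shapes are arbitrary, and a real deployment would use collective operations such as torch.distributed.all_reduce.

import numpy as np

rng = np.random.default_rng(0)
num_heads, d_head, d_model, seq = 4, 8, 32, 5
heads = [rng.normal(size=(seq, d_head)) for _ in range(num_heads)]   # per-head attention outputs
W_O = rng.normal(size=(num_heads * d_head, d_model))                 # output projection

# Single-GPU reference: concatenate the heads, then project.
reference = np.concatenate(heads, axis=1) @ W_O

# "Distributed" version: each worker holds some heads and the matching slice of W_O,
# computes a partial projection, and the results are summed (the all-reduce step).
partials = [heads[i] @ W_O[i * d_head:(i + 1) * d_head] for i in range(num_heads)]
all_reduced = sum(partials)

print(np.allclose(reference, all_reduced))   # True: summing partials == concatenate-then-project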

THREE KEY METHODS

There are three key methods that abstract all the decoding algorithms

  1. Fork: creates a new sequence from an existing one, like the copy-on-write operation
  2. Append: appends a new token to a sequence
  3. Free: deletes a sequence, like the discarded candidates in beam search, by freeing its memory.
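A minimal sketch of how such an interface could look (hypothetical names, and a plain token list instead of KV blocks; vLLM's real classes differ):

class SequenceStore:
    """Toy illustration of the fork / append / free interface (not vLLM's API)."""
    def __init__(self):
        self.sequences = {}            # seq_id -> list of token ids

    def append(self, seq_id, token):   # add one generated token to a sequence
        self.sequences.setdefault(seq_id, []).append(token)

    def fork(self, src_id, new_id):    # new sequence sharing the existing prefix
        # (a real paged implementation would share the blocks copy-on-write
        # instead of copying the tokens)
        self.sequences[new_id] = list(self.sequences[src_id])

    def free(self, seq_id):            # drop a pruned beam / finished sequence
        del self.sequences[seq_id]

store = SequenceStore()
for tok in (101, 7, 42):
    store.append("beam_0", tok)
store.fork("beam_0", "beam_1")         # beam_1 starts from beam_0's prefix
store.free("beam_1")                   # later pruned by beam search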

EVALUATION RESULTS

  • The authors test on two kinds of datasets: (i) ShareGPT, which has longer sequences, and (ii) Alpaca, which has comparatively shorter sequences
Taken from the vLLM paper
  • As discussed earlier, Orca follows a buddy allocation algorithm, reserving memory in powers of two. So the authors compare three variants: one that always reserves the maximum sequence length, one that reserves the next power of 2, and an oracle-based one that knows exactly how many tokens WILL BE GENERATED (the best-case scenario). Even the oracle variant comes in behind vLLM.

Also, you can notice that Orca’s performance on ShareGPT is pretty bad. That is because the dataset has longer sequences, nearly 1024 tokens for every example, so all three variants reserve space for 1024 tokens per request, which is an ineffective use of memory.

EFFECT OF BLOCK SIZE

  • Block size is defined as the number of tokens that can be stored in each KV block. If the block size is small, the block table needs more entries to retrieve the KV cache from physical memory: with a block size of 4, retrieving 16 tokens takes 4 block-table lookups.
  • Also, if we have a large block size and the responses are short, we waste memory inside each block (internal fragmentation). vLLM picks a block size of 16.
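The trade-off can be put into numbers with a tiny illustrative calculation (the 100 generated tokens below is an arbitrary choice):

def blocks_and_waste(num_tokens, block_size):
    """How many block-table entries a sequence needs, and how many slots sit unused."""
    num_blocks = -(-num_tokens // block_size)        # ceiling division
    wasted_slots = num_blocks * block_size - num_tokens
    return num_blocks, wasted_slots

for block_size in (4, 16, 64):
    n_blocks, wasted = blocks_and_waste(num_tokens=100, block_size=block_size)
    print(f"block size {block_size:2d}: {n_blocks:2d} table entries, {wasted:2d} unused slots")
# block size  4: 25 table entries,  0 unused slots
# block size 16:  7 table entries, 12 unused slots
# block size 64:  2 table entries, 28 unused slots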

An example chatbot with vLLM

from huggingface_hub import login
from vllm import LLM, SamplingParams

# Log in to Hugging Face so the gated Llama-2 chat weights can be downloaded
login()

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
          tokenizer="hf-internal-testing/llama-tokenizer")

i = 0
while True:
    if i == 0:
        prompt = input("Ask me anything: ")
    else:
        prompt = input(f"Input {i}: ")
    output = llm.generate([prompt], sampling_params)
    print(output[0].outputs[0].text)
    i += 1

This brings us to the end of this paper summary. If you have any queries, then I am one comment away. Thanks
