Using vLLM on AMD GPUs

Getting vLLM to run on a machine with two AMD Radeon RX 9060 XT cards took a fair amount of effort, so here are my notes. The vLLM version is v0.11.2 and the ROCm version is 7.1.1.

Summary

There are two options.

Use Docker. Pick an image from rocm/vllm-dev on Docker Hub whose tag contains navi.
Build it yourself. Some parts do not work as written in the documentation (see below).

Using Docker

According to the vLLM installation documentation (GPU), “Docker is the recommended way to use vLLM on ROCm.” However, the rocm/vllm-dev:nightly image that it goes on to mention is built for Instinct, so it does not work on Radeon.

The images that work on consumer GPUs are the ones with navi in the tag. Pick a suitable one from the rocm/vllm-dev page on Docker Hub. I confirmed that the rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1 image works.
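
The tag is long, but it is just a handful of version numbers joined with underscores. As a rough sketch, it can be split apart to compare the ROCm version in the image with the one on the host. The pattern is inferred only from the single tag quoted above, and parse_navi_tag is a hypothetical helper name, so other tags on Docker Hub may not fit it.

```python
import re

# Layout inferred from one tag only; treat it as an assumption.
TAG_RE = re.compile(
    r"^rocm(?P<rocm>[\d.]+)"
    r"_(?P<target>[a-z0-9]+)"
    r"_ubuntu(?P<ubuntu>[\d.]+)"
    r"_py(?P<python>[\d.]+)"
    r"_pytorch_(?P<pytorch>[^_]+)"
    r"_vllm_(?P<vllm>[^_]+)$"
)


def parse_navi_tag(tag):
    """Return the version fields of an image tag as a dict, or None if it does not fit."""
    match = TAG_RE.match(tag)
    return match.groupdict() if match else None


info = parse_navi_tag(
    "rocm7.1.1_navi_ubuntu24.04_py3.12_pytorch_2.8_vllm_0.10.2rc1"
)
print(info)
```

A tag such as nightly simply returns None, which is a reasonable hint that it is not one of the navi images.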

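Building it yourself

The vLLM installation documentation (GPU) is somewhat out of date, so some steps need to be adapted.

Prerequisites

Install ROCm and the other libraries first. I am on Arch Linux, so I installed the packages below. If something is missing, an error shows up during configure in a later step, so add packages as you go.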

hipblas
hipfft
hipsparse
hipcub
hipsparselt
hipsolver
hsa-rocr
rccl
rocblas
rocfft
rocm-cmake
rocm-core
rocm-device-libs
rocm-llvm
rocm-smi-lib
rocminfo
rocrand
rocsolver
rocsparse
rocthrust
roctracer

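You also need clang++ to compile the C++ libraries and uv to set up the Python environment.

Step 1: Create the Python environment

The documentation does not say so, but every step is run from inside the vLLM directory.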
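
Before starting, it can save a failed build to check that the command-line tools are actually on PATH. This is a minimal sketch using only the standard library. The default tool list (clang++, uv, rocminfo, git) is my own pick of the commands used in this post, and missing_tools is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("clang++", "uv", "rocminfo", "git")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all tools found")
```

This only checks executables. Missing ROCm libraries still show up as configure errors later, as described above.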

# (not in the documentation) get the vLLM source
git clone https://github.com/vllm-project/vllm.git
cd vllm
git switch --detach v0.11.2

# create the Python environment and activate it
uv venv --python 3.12 --seed
source ./.venv/bin/activate

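Note that when I tried this on the main branch I got stuck on something about hipify-perl being required, which is one reason I am working from the release tag v0.11.2.

Step 2: Install PyTorch

Install it from the PyTorch repository. Choose the index URL that matches your ROCm version from https://download.pytorch.org/whl. That level only has rocm3.7 through rocm6.4, but newer versions are available under nightly.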

uv pip install torch torchvision --index https://download.pytorch.org/whl/nightly/rocm7.1
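
The rule for picking the index URL can be written down as a small function. This is only a sketch of the rule described above. The cutoff of 6.4 for the stable directory is what I saw at the time of writing and will change, pytorch_index_url is a hypothetical helper name, and not every major.minor combination necessarily has a directory on the server.

```python
BASE_URL = "https://download.pytorch.org/whl"


def pytorch_index_url(rocm_version, stable_max=(6, 4)):
    """Build the index URL for a ROCm version given as "major.minor[.patch]".

    Versions newer than `stable_max` are assumed to live under nightly/.
    """
    major, minor = (int(part) for part in rocm_version.split(".")[:2])
    channel = "" if (major, minor) <= stable_max else "nightly/"
    return f"{BASE_URL}/{channel}rocm{major}.{minor}"


print(pytorch_index_url("7.1.1"))
```

For ROCm 7.1.1 this produces the same URL as the command above. The patch version is dropped because the directories are named by major.minor only.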

Step 3: Install Triton

The documentation tells you to build Triton yourself, but installing it from the PyTorch repository is faster.

uv pip install triton --index https://download.pytorch.org/whl/nightly/rocm7.1

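Step 4: Skip the optional steps

Skip installing flash attention and AITER. AITER in particular assumes Instinct hardware, so it is pointless on the consumer gfx1200.

Step 5: Install amd_smi

For permission reasons you cannot (or at least should not) install directly from /opt/rocm/share/amd_smi, so copy it first and install from the copy.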

cp -r /opt/rocm/share/amd_smi .
uv pip install ./amd_smi

Step 6: Install the build dependencies

The documentation does not mention it, but install these from the PyTorch repository as well.

uv pip install --upgrade \
        numba \
        scipy \
        'huggingface-hub[cli,hf_transfer]' \
        setuptools_scm \
        --index https://download.pytorch.org/whl/nightly/rocm7.1

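Step 7: Install from the requirements file

This also points at the PyTorch repository. However, some libraries there are not sufficient (for example, compressed-tensors is older than the required version), so add --index-strategy unsafe-first-match to allow those packages to be installed from PyPI instead.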

uv pip install -r requirements/rocm.txt --index https://download.pytorch.org/whl/nightly/rocm7.1 --index-strategy unsafe-first-match
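
Why the extra flag helps is easier to see with a toy model. The sketch below is my simplified understanding of how uv walks the indexes, not uv's actual code: with the default first-index strategy the search stops at the first index that has the package at all, while unsafe-first-match moves on to the next index when no version there satisfies the requirement. The version numbers and the resolve function are made up for illustration.

```python
def resolve(package, minimum, indexes, strategy="first-index"):
    """Toy resolver.

    indexes: ordered list of (index_name, {package: [version tuples]}).
    Returns (index_name, version) or None when nothing acceptable is found.
    """
    for name, contents in indexes:
        versions = contents.get(package)
        if versions is None:
            continue  # package not on this index: try the next one
        acceptable = [v for v in versions if v >= minimum]
        if acceptable:
            return name, max(acceptable)
        if strategy == "first-index":
            return None  # the package exists here, so later indexes are ignored
    return None


# Hypothetical contents: the PyTorch index carries only an old build.
indexes = [
    ("pytorch", {"compressed-tensors": [(0, 9)]}),
    ("pypi", {"compressed-tensors": [(0, 9), (0, 12)]}),
]

print(resolve("compressed-tensors", (0, 12), indexes))
print(resolve("compressed-tensors", (0, 12), indexes, "unsafe-first-match"))
```

The "unsafe" in the name is a reminder that a package with the same name could come from an index you did not intend, so it is worth using only where it is needed.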

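Step 8: Build vLLM

Now the build can finally start. To shorten the build time, enable only the GPU architecture you actually use. The architecture can be checked with the rocminfo command. With the output below, it is gfx1200.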

rocminfo

# (omitted)
*******
Agent 2
*******
  Name:                    gfx1200
  Uuid:                    *****************
  Marketing Name:          AMD Radeon RX 9060 XT
  Vendor Name:             AMD
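
Reading the architecture off by eye is fine for one machine, but it can also be pulled out of the output with a few lines. This is a sketch that assumes the output looks like the excerpt above, with the architecture on a line of the form "Name: gfx...". gfx_targets is a hypothetical helper name, and joining several architectures with a semicolon is my understanding of what PYTORCH_ROCM_ARCH accepts.

```python
import re

# Matches "  Name:   gfx1200" but not "Marketing Name:" or "Vendor Name:".
NAME_RE = re.compile(r"^[ \t]*Name:[ \t]+(gfx\w+)[ \t]*$", re.MULTILINE)


def gfx_targets(rocminfo_output):
    """Return the distinct gfx architecture names found in rocminfo output."""
    return sorted(set(NAME_RE.findall(rocminfo_output)))


SAMPLE = """\
*******
Agent 2
*******
  Name:                    gfx1200
  Marketing Name:          AMD Radeon RX 9060 XT
  Vendor Name:             AMD
"""

# Two identical cards report the same name twice; the set removes the repeat.
targets = gfx_targets(SAMPLE + SAMPLE)
print(";".join(targets))
```

To run it against the real machine, pass in the text returned by subprocess.run(["rocminfo"], capture_output=True, text=True).stdout.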

Set the architecture in the PYTORCH_ROCM_ARCH environment variable and run the build. If a dependency is missing, an error should show up here, so add it and try again.

export PYTORCH_ROCM_ARCH="gfx1200"
python3 setup.py develop
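
Once the build finishes, a quick way to see whether the pieces from the earlier steps ended up in the environment is to ask Python whether it can locate them. This is a minimal sketch. The module names torch, triton, amdsmi and vllm are what I expect the packages to install as (amdsmi in particular is an assumption), and missing_modules is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names=("torch", "triton", "amdsmi", "vllm")):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", missing if missing else "none")
```

One caveat: when run from inside the checkout, vllm may be found through the source directory itself even if the build failed, so this mostly tells you about the other three. It also only locates the modules and does not import them, so a broken native extension would not be caught.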

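Example run

After the build finishes, the executable is placed at .venv/bin/vllm. You can run it by that path, or, if the venv is activated, just run the vllm command because it is already on PATH.

Since there are multiple GPUs, I pass the number of GPUs to --tensor-parallel-size. This splits each layer across the GPUs, so it can apparently use all of the available compute.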

vllm serve Qwen/Qwen3-4B --tensor-parallel-size 2
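
To make "splitting each layer across the GPUs" a little more concrete, here is a toy version of the idea in plain Python. It is a conceptual illustration only and not how vLLM implements it: the weight matrix of one layer is cut into column shards, each shard is multiplied separately (as each GPU would do), and the partial outputs are concatenated. All function names are made up, and the column count is assumed to divide evenly by the number of parts.

```python
def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Cut w into `parts` contiguous column shards of equal width."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matvec(x, w, parts=2):
    """Compute x @ w shard by shard and stitch the partial outputs together."""
    shards = split_columns(w, parts)
    partial = [matvec(x, shard) for shard in shards]  # one shard per "GPU"
    return [value for out in partial for value in out]  # gather the pieces


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(matvec(x, w))
print(tensor_parallel_matvec(x, w, parts=2))
```

Both calls give the same vector, but in the sharded version each device only holds half of the weights, which is also why a model that does not fit on one card can fit on two.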

vLLM listens on port 8000 by default, so let's send it a request.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is lorem ipsum?"}
        ]
    }' | jq
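
The same request can be sent from Python with only the standard library, which is handy when there is no curl or jq around. This is a sketch that mirrors the curl call above. The function names build_chat_request and send and the 120 second timeout are my own choices.

```python
import json
import urllib.request


def build_chat_request(model, user_text, base_url="http://localhost:8000"):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("Qwen/Qwen3-4B", "What is lorem ipsum?")
print(request.full_url)
```

With the server running, send(request)["choices"][0]["message"]["content"] returns the generated text.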

Here is the vLLM log.

(APIServer pid=62226) INFO 11-30 11:59:42 [loggers.py:236] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 3.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=62226) INFO:     127.0.0.1:39598 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=62226) INFO 11-30 11:59:52 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
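
The throughput numbers are buried in long lines, so for comparing runs it can help to pull them out. This sketch assumes the log format is exactly what is shown above, which may well differ between vLLM versions, and parse_throughput is a hypothetical helper name.

```python
import re

# Pattern copied from the log lines above; treat the format as an assumption.
THROUGHPUT_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<generation>[\d.]+) tokens/s"
)


def parse_throughput(line):
    """Return prompt and generation tokens/s from a stats line, or None."""
    match = THROUGHPUT_RE.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


line = (
    "(APIServer pid=62226) INFO 11-30 11:59:52 [loggers.py:236] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.7 tokens/s, "
    "Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, "
    "Prefix cache hit rate: 0.0%"
)
print(parse_throughput(line))
```

Lines that are not stats lines, such as the HTTP access log entry, return None and can be skipped.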

Here is the response.

{
  "id": "chatcmpl-f24ed88dd5b741d2bf82216428f9188b",
  "object": "chat.completion",
  "created": 1764471580,
  "model": "Qwen/Qwen3-4B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user is asking what lorem ipsum is. I need to explain it clearly. First, I should mention that it's a placeholder text used in design. Then, explain its origins—probably from a Latin text. Maybe mention that it's used to show layout and typography without using real content. Also, note that it's commonly used in web design and publishing. Should I include the example text? Like \"Lorem ipsum dolor sit amet...\" Yes, that would help. Also, clarify that it's not real text and is just for demonstration. Maybe add that it's from a 15th-century text. Oh, and mention that it's used to test how text looks in different formats. Keep it simple and concise. Avoid jargon. Make sure the user understands its purpose and use cases.\n</think>\n\n**Lorem ipsum** is a placeholder text used in design, publishing, and web development to demonstrate the visual structure of a layout, typography, or formatting without relying on meaningful content. It’s not real text but a fabricated passage in Latin, often attributed to a 15th-century text by Cicero. \n\n### Key Points:\n- **Purpose**: Helps designers and developers visualize how text will look in a space before real content is added.\n- **Origins**: Derived from a section of Cicero’s *De Finibus Bonorum et Malorum* (1544), which was later misattributed to the 15th century.\n- **Use Cases**: Commonly used in web design, print media, and software development to test layouts, fonts, spacing, and alignment.\n- **Example**:  \n  *\"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\"*\n\n### Why It’s Used:\n- Avoids distractions from real content.\n- Ensures consistency in testing and prototyping.\n- Helps evaluate how text interacts with design elements (e.g., headings, margins, spacing).\n\nIt’s a versatile tool for showcasing how a design will appear with actual text, without the need for real data or content.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 24,
    "total_tokens": 451,
    "completion_tokens": 427,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
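
In this response reasoning_content is null and the model's thinking sits inline in content between <think> and </think>. If only the final answer is wanted, the two parts can be separated on the client side. This is a minimal sketch that assumes at most one such block, as in the response above. split_reasoning is a hypothetical helper name.

```python
def split_reasoning(content):
    """Split message content into (thinking, answer).

    Assumes at most one <think>...</think> block; without one, the whole
    content is treated as the answer.
    """
    start, end = "<think>", "</think>"
    if start in content and end in content:
        head, _, rest = content.partition(start)
        thinking, _, answer = rest.partition(end)
        return thinking.strip(), (head + answer).strip()
    return "", content.strip()


thinking, answer = split_reasoning(
    "<think>\nOkay, the user is asking what lorem ipsum is.\n</think>\n\n"
    "**Lorem ipsum** is a placeholder text."
)
print(answer)
```

The usage block is also worth a glance: 24 prompt tokens plus 427 completion tokens add up to the 451 total, and the thinking part counts toward the completion tokens.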

Not bad, I would say.
