A Hands-On Guide to Deploying OpenELM on a Server
1. Solution Overview and Selection
2. Prerequisites and Dependencies
3. Deployment Steps
Method A: Python virtual environment
1) Create a virtual environment and install the dependencies (commands sketched below)
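A minimal sketch of the setup, assuming a recent Python 3; the exact package set (accelerate for device_map, sentencepiece for the Llama tokenizer) is an assumption rather than a pinned requirements list:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# torch + transformers for the model, accelerate for device_map="auto",
# sentencepiece for the Llama 2 tokenizer that OpenELM reuses
pip install torch transformers accelerate sentencepiece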
2) Run a minimal inference script (example)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "apple/OpenELM-3B-Instruct"

# The OpenELM repos ship no tokenizer; per the model card, use the Llama 2
# tokenizer (gated on Hugging Face, so accept the Llama 2 license first).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # OpenELM's modeling code is loaded from the repo
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",  # requires accelerate
)

prompt = "Once upon a time there was"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,  # temperature/top_p only take effect when sampling
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
3) Optional: 8-bit quantization (requires bitsandbytes; see the sketch below)
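A minimal sketch using transformers' BitsAndBytesConfig; generation is unchanged from the script above. 8-bit loading requires a CUDA GPU, and actual memory savings depend on your setup:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize linear layers to 8-bit via bitsandbytes at load time
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-3B-Instruct",
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",
)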
Method B: Docker container
1) Pull the base image and start a container (the --gpus all flag assumes the NVIDIA Container Toolkit is installed on the host):
docker run --gpus all -it \
  -v $(pwd):/workspace -p 7860:7860 \
  --name openelm-deploy \
  nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 /bin/bash
2) Install the dependencies inside the container (same set as Method A; see the note below)
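The CUDA runtime image is a minimal Ubuntu and likely ships without Python, so a setup sketch (package names are the standard Ubuntu ones):

apt-get update && apt-get install -y python3 python3-pip
pip3 install torch transformers accelerate sentencepiece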
3) Run the inference script inside the container, from the mounted code directory (an example invocation follows)
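Assuming the Method A script was saved as infer.py in the mounted directory (the filename is hypothetical):

docker exec -it openelm-deploy bash -c "cd /workspace && python3 infer.py"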
4) Optional: add Open WebUI as a chat front end. Note that Open WebUI talks to Ollama or OpenAI-compatible backends, so the minimal /generate endpoint from section 4 below would need an OpenAI-compatible wrapper before the UI can use it:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
4. Serving and Exposing an API
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# As in the minimal script: OpenELM reuses the Llama 2 tokenizer per its model card.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-3B-Instruct", trust_remote_code=True,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

app = FastAPI()

class Req(BaseModel):
    prompt: str
    max_new_tokens: int = 128
    temperature: float = 0.7
    top_p: float = 0.9

@app.post("/generate")
def generate(req: Req):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    outs = model.generate(**inputs, max_new_tokens=req.max_new_tokens,
                          do_sample=True,  # needed for temperature/top_p to apply
                          temperature=req.temperature, top_p=req.top_p,
                          repetition_penalty=1.15)
    return {"text": tokenizer.decode(outs[0], skip_special_tokens=True)}
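To launch and smoke-test the service, assuming the code above is saved as server.py (the filename and port are assumptions):

pip install fastapi uvicorn
uvicorn server:app --host 0.0.0.0 --port 8000

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time there was", "max_new_tokens": 64}'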
5. Performance Tuning and Common Issues