# Create a conda virtual environment (Python 3.9)
conda create -n deepseek python=3.9
conda activate deepseek

# Install PyTorch 2.0+ (built for CUDA 11.8, from the official PyTorch index)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install Transformers, Accelerate, and related libraries
pip install "transformers>=4.33" accelerate sentencepiece
To run 13B or larger models, or to reduce VRAM usage, install the bitsandbytes library (it supports 4/8-bit quantization):

pip install bitsandbytes
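Before downloading the model, it is worth confirming that the GPU build of PyTorch installed correctly; a minimal check (the sample output in the comment is just an example):

import torch

print(torch.cuda.is_available())      # should print True on a working setup
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3080"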
Next, log in to Hugging Face and clone the model repository:

huggingface-cli login   # enter your account credentials
git lfs install         # initialize Git LFS (large file storage)
git clone https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat   # clone the model repository

Make sure the downloaded weight files (.bin/.safetensors) and the configuration file (config.json) end up in the same directory.
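If cloning over Git LFS is slow or fails, the huggingface_hub client can fetch the same files; a minimal sketch (the local_dir value is just an example and should match model_path below):

from huggingface_hub import snapshot_download

# Download the full repository (weights + config) into a local directory
snapshot_download(
    repo_id="deepseek-ai/deepseek-llm-7b-chat",
    local_dir="./deepseek-llm-7b-chat",  # example path
)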
With the model files in place, load them with Transformers:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (point model_path at the local model directory)
model_path = "./deepseek-llm-7b-chat" # 替换为实际路径
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="auto", # 自动分配GPU/CPU
torch_dtype=torch.float16, # 使用半精度减少显存占用
# load_in_4bit=True # 可选:4-bit量化(需bitsandbytes)
)
# Encode a prompt and generate a response
input_text = "Please describe the performance characteristics of the RTX 3080 graphics card"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
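Because deepseek-llm-7b-chat is a chat-tuned model, wrapping the prompt in the model's chat template usually yields better answers. A sketch, assuming a transformers version that supports apply_chat_template (4.34+):

messages = [{"role": "user", "content": input_text}]
# Render the conversation with the model's built-in chat template
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))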
For faster, batched inference, the model can also be run with vLLM:

pip install vllm

from vllm import LLM, SamplingParams

llm = LLM(model=model_path, tensor_parallel_size=2)  # tensor parallelism across 2 GPUs
sampling_params = SamplingParams(temperature=0.8, max_tokens=200)
outputs = llm.generate(["RTX 3080 graphics card review", "RTX 3080 gaming performance"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)  # first completion for each prompt
If VRAM is still tight, load the model with the load_in_4bit or load_in_8bit option to reduce memory usage (requires bitsandbytes):

from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)
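To verify how much memory the quantized model actually uses, Transformers models expose get_memory_footprint(); a quick check:

# Report the model's approximate memory footprint (bytes -> GiB)
print(f"Model memory: {model.get_memory_footprint() / 1024**3:.2f} GiB")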
If compilation fails with nvcc not found, add CUDA to the environment variables:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

If generation runs out of GPU memory, lower max_new_tokens (the length of the generated text).

Check the installed driver version with nvidia-smi and, if needed, download a driver matching your CUDA version from the NVIDIA website (for example, CUDA 11.8 requires driver ≥ 450.80.02).
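One way to confirm that the driver, the CUDA toolkit, and the PyTorch build agree is to run a tiny tensor operation on the GPU; a sketch:

import torch

print("PyTorch:", torch.__version__, "| built against CUDA:", torch.version.cuda)
x = torch.ones(1, device="cuda")  # raises a clear error if driver and CUDA mismatch
print("GPU compute OK:", x.item() == 1.0)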