Quantize: Quantization Concepts and Technical Details
An aside: seven or eight years ago, some work on representation learning looked at compressing representations, e.g. binary embeddings. That was done very simply: pick a threshold and map each float to 0 or 1. Today's quantization is much the same idea warmed over, but in the current LLM setting it is clearly more worthwhile than it was back then.
The main toolkits and methods:

- HuggingFace's bitsandbytes package.
- GPTQ (arXiv:2210.17323): data compression targeting the GPU. GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance. Applied to the weights of transformer-based models, it first applies scalar quantization to the weights, followed by vector quantization of the residuals. The idea behind the method is that it tries to compress all weights to 4-bit by minimizing the mean squared error to each weight. During inference, it dynamically dequantizes its weights to float16 for improved performance whilst keeping memory low. (A hedged sketch of quantizing your own model this way is shown right after this list.)
- GGUF (ggml): the CPU-oriented counterpart of GPTQ, optimizing inference on the CPU; its predecessor GGML is now outdated. Written in C; see llama.cpp: https://github.com/ggerganov/llama.cpp
- AWQ (activation-aware weight quantization, arXiv:2306.00978): claims to improve on GPTQ, with higher speed and only a small loss in accuracy (they all say that).
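To make the GPTQ description concrete, here is a minimal sketch of quantizing a model yourself through transformers' `GPTQConfig` (this is not done in this post; the small model id and the `c4` calibration set below are placeholders, and `optimum`/`auto-gptq` must be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"                       # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ, calibrated on the c4 dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs while the model is being loaded
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
model.save_pretrained("opt-125m-gptq")               # save the 4-bit checkpoint
```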
Installation (installing from source is more likely to succeed)
```python
# Latest HF transformers version for Mistral-like models
!pip install git+https://github.com/huggingface/transformers.git
!pip install accelerate bitsandbytes xformers

# GPTQ Dependencies
!pip install optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
# (I installed auto-gptq from source here.)

# GGUF Dependencies
!pip install ctransformers[cuda]
```

Testing on llama3-8b
```python
from torch import bfloat16
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load in your LLM without any compression tricks
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "HuggingFaceH4/zephyr-7b-beta"   # alternative model
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=bfloat16,
    device_map="auto"
)

pipe.model   # print the model structure
```
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
```

One detail: look at the distribution of the weight values in any single layer (checking the first 10,000 values); they are essentially zero-mean and normally distributed. This is exactly the premise that the NormalFloat (NF4) quantization discussed later builds on.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the distribution of layer 0's q_proj weights
q_proj = pipe.model.model.layers[0].self_attn.q_proj.weight.detach().to(torch.float16).cpu().numpy().flatten()
plt.figure(figsize=(10, 6))
sns.histplot(q_proj[:10000], bins=50, kde=True)
```

Chat templates:
llama3:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>...<|eot_id|><|start_header_id|>user<|end_header_id|>...<|eot_id|><|start_header_id|>assistant<|end_header_id|>...
```
zephyr:
```
<|system|> ... </s> <|user|> ... </s> <|assistant|> ... </s>
```
Applying the template in practice:

```python
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

T = AutoTokenizer.from_pretrained(model_id)
T
T.encode("<|system|>")
```

The printed prompt:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a friendly chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me a funny joke about Large Language Models.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Generate with the pipeline:
```python
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)

(torch.cuda.max_memory_allocated(device="cuda:0") + torch.cuda.max_memory_allocated(device="cuda:1")) / (1024*1024*1024)
# 15.021286964416504, i.e. roughly 15 GB

print(outputs[0]["generated_text"])
```

Output:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a friendly chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me a funny joke about Large Language Models.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here's one: Why did the Large Language Model go to therapy? Because it was struggling to process its emotions and was feeling a little disconnected from its users! But in the end, it just needed to retrain its thoughts and update its perspective! Hope that made you LOL!
```

Sharding with accelerate:
```python
from accelerate import Accelerator

# Shard our model into pieces of at most 4GB
accelerator = Accelerator()
accelerator.save_model(
    model=pipe.model,
    save_directory="./content/model",
    max_shard_size="4GB"
)
```
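As a follow-up, a hedged sketch of loading those shards back and dispatching them across the available devices with accelerate (the checkpoint path is the `save_directory` used above; in practice extra arguments such as `no_split_module_classes` may be needed for a clean split):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    # Build the model skeleton without allocating real weight memory
    empty_model = AutoModelForCausalLM.from_config(config)

# Fill the skeleton from the sharded checkpoint and spread it over the devices
model = load_checkpoint_and_dispatch(
    empty_model,
    checkpoint="./content/model",
    device_map="auto",
)
```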
Quantization overview

4-bit NormalFloat (NF4, from QLoRA: LoRA on a quantized LLM, arXiv:2305.14314) consists of three steps (a minimal sketch follows after this list):

- Normalization: The weights of the model are normalized so that we expect them to fall within a certain range. This allows a more efficient representation of the more common values: allocate more discrete levels where the density is high and fewer where it is low, which relies precisely on the normal distribution observed above. Concretely, the weights are first normalized to have zero mean and unit variance, so that they are distributed around zero within a known range.
- Quantization: The weights are quantized to 4-bit. In NF4, the quantization levels are spaced with respect to the normalized weights, thereby efficiently representing the original 32-bit weights. (A so-called int4 model represents each weight with one of 16 discrete values; int8 uses 256, and so on.) As an aside on number formats: bf16, float32, and float16 all spend 1 bit on the sign; bf16 differs from float32 by having fewer mantissa bits, while float16 shrinks both exponent and mantissa. The layouts are 1/8/7 (bf16), 1/8/23 (float32), and 1/5/10 (float16). For example, 0.1234 becomes 0.1235351… in bf16, 0.1234000… in float32, and 0.1234130… in float16; 75505 becomes 75264 in bf16, 75505 in float32, and inf in float16. In other words, bf16 makes a trade-off: it can represent very large numbers, but with less precision. Returning to NF4: the normalized high-precision weights are mapped to a small set of low-precision values, with the quantization levels chosen over the range of the normalized weights.
- Dequantization: Although the weights are stored in 4-bit, they are dequantized during computation, which gives a performance boost during inference. During the forward pass and backpropagation, the quantized weights are dequantized back to full precision by mapping the 4-bit values back to their original range. The dequantized weights are used in the computations, but in memory they remain in their 4-bit quantized form.

For how bitsandbytes computes the quantiles (more levels where the density is high, fewer where it is low), see:
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py#L267
https://zhuanlan.zhihu.com/p/647378373
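To make the normalize → quantize → dequantize loop concrete, here is a minimal sketch of the idea (this is not the bitsandbytes implementation; the 16-level codebook below is a simplified stand-in built from normal quantiles):

```python
import torch

# Hypothetical 16-level codebook: normal quantiles rescaled to [-1, 1].
# bitsandbytes constructs its NF4 table more carefully (see functional.py).
probs = torch.linspace(0.02, 0.98, 16)
levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
levels = levels / levels.abs().max()

def quantize_block(w):
    absmax = w.abs().max()                                        # per-block scaling constant
    w_norm = w / absmax                                           # normalize into [-1, 1]
    idx = (w_norm.unsqueeze(-1) - levels).abs().argmin(dim=-1)    # nearest codebook level
    return idx.to(torch.uint8), absmax                            # 4-bit index (stored here in uint8)

def dequantize_block(idx, absmax):
    return levels[idx.long()] * absmax                            # map indices back to floats

w = torch.randn(64)                # a fake block of weights
idx, absmax = quantize_block(w)
w_hat = dequantize_block(idx, absmax)
print((w - w_hat).abs().max())     # quantization error of the block
```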
Verifying the differences between bf16, float32, and float16 described above:

```python
torch.set_printoptions(sci_mode=False)
X = torch.tensor([0.1234, 75535])
print(X, X.dtype)            # tensor([ 0.1234, 75535.0000]) torch.float32
print(X.to(torch.float16))   # tensor([0.1234, inf], dtype=torch.float16)
print(X.to(torch.bfloat16))  # tensor([ 0.1235, 75776.0000], dtype=torch.bfloat16)
```
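To also see the bit layouts themselves (1/8/23 for float32, 1/5/10 for float16, 1/8/7 for bf16), here is a small sketch that reinterprets a value's bytes as an unsigned integer and prints the bit pattern; the helper below is not from the original post:

```python
import torch

def bit_pattern(x, dtype):
    t = torch.tensor([x], dtype=dtype)
    n_bits = t.element_size() * 8
    int_dtype = torch.int32 if n_bits == 32 else torch.int16
    raw = t.view(int_dtype).item() & ((1 << n_bits) - 1)   # reinterpret bytes as an unsigned int
    return format(raw, f"0{n_bits}b")

print(bit_pattern(0.1234, torch.float32))   # 1 sign bit | 8 exponent bits | 23 mantissa bits
print(bit_pattern(0.1234, torch.float16))   # 1 sign bit | 5 exponent bits | 10 mantissa bits
print(bit_pattern(0.1234, torch.bfloat16))  # 1 sign bit | 8 exponent bits | 7 mantissa bits
```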
Next, quantize manually with BitsAndBytes:

```python
# Delete any models previously created
del pipe, accelerator

# Empty VRAM cache
import gc
gc.collect()
torch.cuda.empty_cache()

from transformers import BitsAndBytesConfig
from torch import bfloat16

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # 4-bit quantization
    bnb_4bit_quant_type="nf4",          # Normalized float 4
    bnb_4bit_use_double_quant=True,     # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16     # Computation type
)

# Llama 3 with the BitsAndBytes configuration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task="text-generation")

(torch.cuda.max_memory_allocated("cuda:0") + torch.cuda.max_memory_allocated("cuda:1")) / (1024*1024*1024)
# 5.5174360275268555
```

Memory usage drops markedly compared with the ~15 GB above. The meaning of each parameter is described in the paper. We can print the prompt in exactly the same way (it is unchanged), but the generated output changes:
```python
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])
```

Output:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a friendly chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me a funny joke about Large Language Models.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Why did the Large Language Model go to therapy? Because it was struggling to process its emotions and was worried it would overfit to its own biases!
```

Note, however, that this quantization is not complete: it is mixed-precision, with both int8 and float16 present. With load_in_8bit, embed_tokens stays torch.float16; inside each layer, the self-attention and MLP parts are int8, while each layer's layer norms stay float16 (if torch_dtype=torch.bfloat16 is passed at load time, these parts become bfloat16 instead). The same pattern applies to load_in_4bit.

```
model.embed_tokens.weight                       torch.float16  cuda:0
model.layers.0.self_attn.q_proj.weight          torch.int8     cuda:0
model.layers.0.self_attn.k_proj.weight          torch.int8     cuda:0
model.layers.0.self_attn.v_proj.weight          torch.int8     cuda:0
model.layers.0.self_attn.o_proj.weight          torch.int8     cuda:0
model.layers.0.mlp.gate_proj.weight             torch.int8     cuda:0
model.layers.0.mlp.up_proj.weight               torch.int8     cuda:0
model.layers.0.mlp.down_proj.weight             torch.int8     cuda:0
model.layers.0.input_layernorm.weight           torch.float16  cuda:0
model.layers.0.post_attention_layernorm.weight  torch.float16  cuda:0
```

Printing the parameters in detail, plus a small inference/training example:

```python
import torch
from torch import nn
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.optimization import AdamW
```
```python
del model

import gc  # garbage collection
gc.collect()
torch.cuda.empty_cache()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        # load_in_4bit=True
    ),
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

for name, para in model.named_parameters():
    print(name, para.dtype, para.shape, para.device)
```
```python
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

# Example training data
texts = [
    "Hello, how are you?",
    "The quick brown fox jumps over the lazy dog."
]

# Tokenize the data
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]

# Move the inputs to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
# model.to(device)  # not needed: device_map="auto" already placed the model, and .to() is unsupported for 8-bit models

# Set up the optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()

# A single training step
model.train()
outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
loss = outputs.loss

# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

GPTQ
```python
# Delete any models previously created
del tokenizer, model, pipe

# Empty VRAM cache
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
```

A pre-quantized GPTQ checkpoint: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ
Install: https://github.com/AutoGPTQ/AutoGPTQ (installing from source works fine).
```python
# GPTQ Dependencies
!pip install optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load LLM and Tokenizer
model_id = "MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task="text-generation")

# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])

(torch.cuda.max_memory_allocated("cuda:0") + torch.cuda.max_memory_allocated("cuda:1")) / (1024*1024*1024)
# 5.626893043518066
```

Peak memory (about 5.6 GB) is roughly the same as with bitsandbytes above.

GGUF
Hugging Face's QuantFactory account hosts many quantized models, e.g. for llama3-8b: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-instruct-GGUF

GGUF (GPT-Generated Unified Format) is a model file format defined and released by Georgi Gerganov, the creator of the well-known open-source project llama.cpp; its predecessor is GGML (GPT-Generated Model Language). Although GPTQ does compression well, its focus on GPU can be a disadvantage if you do not have the hardware to run it. GGUF, previously GGML, is a quantization method that allows users to use the CPU to run an LLM but also offload some of its layers to the GPU for a speed up (the -ngl option in llama.cpp). Although using the CPU is generally slower than using a GPU for inference, it is an incredible format for those running models on CPU or Apple devices. Especially since we are seeing smaller and more capable models appearing, like Mistral 7B, the GGUF format might just be here to stay!

Naming, e.g. Q4_K_M:
- Q stands for Quantization.
- 4 indicates the number of bits used in the quantization process.
- K refers to the use of k-means clustering in the quantization.
- M represents the size of the model after quantization (S = Small, M = Medium, L = Large).
The naming says GGUF uses k-means clustering for its quantization. Below is a generic illustration of that idea, not a claim that GGUF is implemented exactly this way; it is essentially clustering over scalar values, and quite straightforward. Code:
```python
import numpy as np
from sklearn.cluster import KMeans

# Original weight matrix
weights = np.array([
    [ 2.09, -0.98,  1.48,  0.09],
    [ 0.05, -0.14, -1.08,  2.12],
    [-0.91,  1.92,  0.00, -1.03],
    [ 1.87,  0.00,  1.53,  1.49]
])

# K-means clustering over the scalar weight values
kmeans = KMeans(n_clusters=4)
kmeans.fit(weights.reshape(-1, 1))
cluster_indices = kmeans.predict(weights.reshape(-1, 1)).reshape(weights.shape)
centroids = kmeans.cluster_centers_.flatten()

# Sort the centroids by value
sorted_indices = np.argsort(centroids)
sorted_centroids = centroids[sorted_indices]

# Build the old-index -> new-index mapping
index_map = {old_idx: new_idx for new_idx, old_idx in enumerate(sorted_indices)}

# Remap the quantization index matrix
new_cluster_indices = np.vectorize(index_map.get)(cluster_indices)

print("Reordered quantization index matrix:\n", new_cluster_indices)
print("Reordered centroid values:\n", sorted_centroids)
```

Output:

```
Reordered quantization index matrix:
 [[3 0 2 1]
  [1 1 0 3]
  [0 3 1 0]
  [3 1 2 2]]
Reordered centroid values:
 [-1.   0.   1.5  2. ]
```

Using GGUF for inference (llama.cpp is recommended; other routes fail easily).
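Since llama.cpp is the recommended route, here is a hedged sketch using the llama-cpp-python bindings (the `llama_cpp` package); the local GGUF file path below is a placeholder for a downloaded Q4_K_M file, and `n_gpu_layers` plays the role of llama.cpp's -ngl:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path to a downloaded GGUF file
    n_gpu_layers=20,   # offload 20 layers to the GPU (like -ngl)
    n_ctx=4096,        # context length
)
out = llm("Tell me a funny joke about Large Language Models.", max_tokens=128)
print(out["choices"][0]["text"])
```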
The ctransformers route used here:

```python
# Delete any models previously created
del tokenizer, model, pipe

# Empty VRAM cache
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
    model_file="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    # model_type="llama",
    gpu_layers=20,
    hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "QuantFactory/Meta-Llama-3-8B-Instruct-GGUF", use_fast=True
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task="text-generation")
```

AWQ

A new format on the block is AWQ (Activation-aware Weight Quantization), a quantization method similar to GPTQ. There are several differences between AWQ and GPTQ as methods, but the most important one is that AWQ assumes that not all weights are equally important for an LLM's performance. In other words, a small fraction of weights is skipped during quantization, which helps with the quantization loss. As a result, their paper mentions a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance. Below we deploy with the vLLM framework:

```python
from vllm import LLM, SamplingParams

# Load the LLM
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
llm = LLM(
    model="casperhansen/llama-3-8b-instruct-awq",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=.95,
    max_model_len=4096
)
tokenizer = AutoTokenizer.from_pretrained("casperhansen/llama-3-8b-instruct-awq")
```
```python
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

# Generate output based on the input prompt and sampling parameters
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)
```
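For completeness, a hedged sketch of producing an AWQ checkpoint yourself with the AutoAWQ package (https://github.com/casper-hansen/AutoAWQ); this is not done in the post, and the output path plus quant_config values below follow the project's README-style usage rather than anything verified here:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_path = "llama-3-8b-instruct-awq"   # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)   # AWQ calibration + 4-bit quantization
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```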