

Author: 五速梦信息网
Date: May 16, 2026, 17:45


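Preprocess: Data Preprocessing

Text

A tokenizer converts text into a sequence of tokens, creates a numerical representation of those tokens, and assembles them into tensors. The tokenizer is the main tool for preprocessing text data: it splits text into tokens according to a set of rules, the tokens are converted to numbers and then to tensors, and those tensors become the model inputs. The tokenizer also adds any additional inputs the model requires.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
    encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
    print(encoded_input)

Output:

    {'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

input_ids are the indices corresponding to each token in the sentence.
attention_mask indicates whether a token should be attended to.
token_type_ids identifies which sequence a token belongs to when there is more than one sequence.

Decoding the input_ids returns the original input:

    tokenizer.decode(encoded_input["input_ids"])

Output:

    '[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

Padding

Sentences are not always the same length, but tensors, which are the model inputs, need a uniform shape. Padding is a strategy for keeping tensors rectangular by adding a special padding token to shorter sentences. Set the padding parameter to True to pad the shorter sequences in a batch to match the longest one:

    batch_sentences = [
        "But what about second breakfast?",
        "Don't think he knows about second breakfast, Pip.",
        "What about elevensies?",
    ]
    encoded_input = tokenizer(batch_sentences, padding=True)
    print(encoded_input)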

Output:

    {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                   [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                   [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
     'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

The first and third sentences are padded with 0s because they are shorter than the second, and their attention_mask is 0 at the padded positions.

Truncation

Conversely, a sequence may sometimes be too long for the model to handle. In that case, you need to truncate the sequence to a shorter length. Set the truncation parameter to True to truncate a sequence to the maximum length the model accepts:

    batch_sentences = [
        "But what about second breakfast?",
        "Don't think he knows about second breakfast, Pip.",
        "What about elevensies?",
    ]
    encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
    print(encoded_input)

Output:

    {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                   [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                   [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
     'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Building tensors

With PyTorch:

    batch_sentences = [
        "But what about second breakfast?",
        "Don't think he knows about second breakfast, Pip.",
        "What about elevensies?",
    ]
    encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
    print(encoded_input)

Output:

    {'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                          [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                          [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
     'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                               [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

With TensorFlow:

    batch_sentences = [
        "But what about second breakfast?",
        "Don't think he knows about second breakfast, Pip.",
        "What about elevensies?",
    ]
    encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
    print(encoded_input)

Output:

    {'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
    array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
           [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
           [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
     'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
    array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
     'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
    array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}

Different pipelines support tokenizer arguments in their __call__() differently. text-2-text-generation pipelines support (that is, pass on) only truncation. text-generation pipelines support max_length, truncation, padding and add_special_tokens. In fill-mask pipelines, tokenizer arguments can be passed in the tokenizer_kwargs argument, which is a dictionary.

Audio

For audio tasks, you need a feature extractor to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data and convert them into tensors.

    from datasets import load_dataset, Audio

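    # Download the data from a public dataset
    dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

    # Accessing the audio column automatically loads and resamples the audio file
    dataset[0]["audio"]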

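Output:

    {'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
             0.        ,  0.        ], dtype=float32),
     'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
     'sampling_rate': 8000}

array is the speech signal, loaded (and possibly resampled) as a 1D array.
path points to the location of the audio file.
sampling_rate is the number of data points in the speech signal measured per second.

The sampling rate can be specified in two places.

When loading the data, which resamples the audio the next time it is accessed:

    dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

When passing the audio to the feature extractor, which tells the extractor what rate the arrays already have:

    audio_input = [dataset[0]["audio"]["array"]]
    feature_extractor(audio_input, sampling_rate=16000)

Here feature_extractor is assumed to be a feature extractor already loaded for your model, for example with AutoFeatureExtractor.from_pretrained().

If the samples differ in length, padding or truncation can be applied to handle the variable-length sequences.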
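To make padding and truncation of audio samples concrete, here is a minimal, library-free sketch. The helper name pad_or_truncate, the pad value of 0.0 and the toy clips are assumptions made for illustration only; the sketch fixes every clip to one target length and does not reproduce the exact behavior of any feature extractor.

```python
from typing import List, Tuple


def pad_or_truncate(
    samples: List[float], max_length: int, pad_value: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Return a fixed-length copy of `samples` and a matching attention mask.

    Hypothetical helper for illustration; it is not part of any library.
    """
    kept = samples[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(kept)         # number of padding values needed (0 if none)
    padded = kept + [pad_value] * n_pad    # padding: extend short clips
    mask = [1] * len(kept) + [0] * n_pad   # 1 = real sample, 0 = padding
    return padded, mask


short_clip = [0.1, -0.2, 0.3]
long_clip = [0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1]

short_fixed, short_mask = pad_or_truncate(short_clip, max_length=5)
long_fixed, long_mask = pad_or_truncate(long_clip, max_length=5)

print(short_fixed, short_mask)  # [0.1, -0.2, 0.3, 0.0, 0.0] [1, 1, 1, 0, 0]
print(long_fixed, long_mask)    # [0.5, 0.4, 0.3, 0.2, 0.1] [1, 1, 1, 1, 1]
```

Both clips end up with the same length, so they can be stacked into one rectangular batch, and the mask records which positions hold real audio.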

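Create a function to preprocess the dataset so that the audio samples have the same length. Specify a maximum sample length, and the feature extractor will pad or truncate the sequences to match it:

    def preprocess_function(examples):
        audio_arrays = [x["array"] for x in examples["audio"]]
        inputs = feature_extractor(
            audio_arrays,
            sampling_rate=16000,
            padding=True,
            max_length=100000,
            truncation=True,
        )
        return inputs

Computer vision

For computer vision tasks, you need an image processor to prepare your dataset for the model. Image preprocessing consists of several steps that convert images into the input the model expects. These steps include, but are not limited to, resizing, normalizing, color channel correction, and converting images to tensors.

Image preprocessing often follows some form of image augmentation. Both transform image data, but they serve different purposes:

Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data: adjust brightness and colors, crop, rotate, resize, zoom, and so on. Be careful, however, not to change the meaning of the images with your augmentations.

Image preprocessing guarantees that the images match the model's expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as they were when the model was initially trained.

You can use any library for image augmentation. For image preprocessing, use the ImageProcessor associated with the model.

    from transformers import AutoImageProcessor

    image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

    # Some image augmentation transforms
    from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose

    size = (
        image_processor.size["shortest_edge"]
        if "shortest_edge" in image_processor.size
        else (image_processor.size["height"], image_processor.size["width"])
    )

    _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])

Multimodal data

For tasks involving multimodal inputs, you need a processor to prepare your dataset for the model. A processor couples two processing objects, such as a tokenizer and a feature extractor. Load a processor with AutoProcessor.from_pretrained():

    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")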
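The way a processor couples a tokenizer with a feature extractor can be sketched with plain Python objects. Everything below (ToyTokenizer, ToyFeatureExtractor, ToyProcessor, the character-code "vocabulary", the peak normalization and the "labels" key) is hypothetical and only illustrates the composition pattern; it is not how any Transformers processor is implemented.

```python
from typing import Dict, List, Optional


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: maps characters to integer ids."""

    def __call__(self, text: str) -> Dict[str, List[int]]:
        return {"input_ids": [ord(ch) for ch in text]}


class ToyFeatureExtractor:
    """Hypothetical stand-in for a feature extractor: peak-normalizes audio."""

    def __call__(self, audio: List[float]) -> Dict[str, List[float]]:
        # Fall back to 1.0 for empty or all-zero audio to avoid dividing by zero.
        peak = max((abs(x) for x in audio), default=0.0) or 1.0
        return {"input_values": [x / peak for x in audio]}


class ToyProcessor:
    """Wraps both objects so that a single call prepares every modality."""

    def __init__(self, tokenizer: ToyTokenizer, feature_extractor: ToyFeatureExtractor):
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def __call__(
        self, text: Optional[str] = None, audio: Optional[List[float]] = None
    ) -> Dict[str, list]:
        batch: Dict[str, list] = {}
        if audio is not None:
            batch.update(self.feature_extractor(audio))  # audio -> model inputs
        if text is not None:
            batch["labels"] = self.tokenizer(text)["input_ids"]  # text -> target ids
        return batch


processor_sketch = ToyProcessor(ToyTokenizer(), ToyFeatureExtractor())
batch = processor_sketch(text="HI", audio=[0.5, -2.0, 1.0])
print(batch)  # {'input_values': [0.25, -1.0, 0.5], 'labels': [72, 73]}
```

The point of the design is that calling code deals with one object and one call per example, while each modality is still handled by the component built for it.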
