T5 PEGASUS: Open-Sourcing a Chinese Generative Pre-trained Model
By 苏剑林 | 2021-03-03 | 190,045 readers

Last year, in 《那个屠榜的T5模型,现在可以在中文上玩玩了》, we introduced Google's multilingual T5 model (mT5) and gave examples of using it for Chinese text generation tasks. mT5 is indeed a workable option for Chinese generation, but it always felt a little awkward to have no model trained entirely on Chinese corpora, so we resolved to build one.
After repeated deliberation and testing, we decided to take mT5 as the base architecture and initial weights, first refine the Tokenizer for the characteristics of Chinese, and then construct a PEGASUS-style pre-training task, thereby training a new T5 model: the T5 PEGASUS open-sourced in this post.
Tokenizer #
First, our improvements to the Tokenizer. The Tokenizer used by mT5 is sentencepiece, a tokenization library written in C++ that is efficient and lightweight. Unfortunately, it is not particularly friendly to Chinese, mainly in that:
1. sentencepiece forcibly converts certain full-width symbols to half-width ones, which is unacceptable in some scenarios and may also affect task evaluation results;
2. although sentencepiece's built-in algorithm is capable of segmenting out Chinese words, it is still not smart enough as a Chinese word segmenter;
3. sentencepiece is written in C++; although it is open source, C++ is effectively a black box for those used to Python, so the source is hard to read and harder to modify.
These issues led us to switch back to BERT's Tokenizer. But directly reusing the original Chinese BERT Tokenizer is not enough: first, our earlier work 《提速不掉点:基于词颗粒度的中文WoBERT》 showed that generation models built at word granularity achieve better results; second, even at the character level, Chinese BERT's vocab.txt is quite incomplete, missing some common punctuation (such as double quotation marks) and some Chinese characters (such as 琊). We therefore chose to add word segmentation to BERT's tokenizer and to further improve vocab.txt.
Specifically, we added the first 200,000 words from jieba's dictionary to the original Chinese BERT token_dict and modified the Tokenizer's logic so that it can emit whole words; these changes are already built into bert4keras and can be called directly. We then ran this modified Tokenizer over our prepared pre-training corpus, counted the frequency of each token, and kept only the 50,000 most frequent ones, yielding a 50k-entry vocab.txt for our final Tokenizer.
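As an illustration, such a tokenizer takes only a few lines with bert4keras (a minimal sketch with a placeholder vocab path; this mirrors the pattern used in the released example scripts):

```python
import jieba
from bert4keras.tokenizers import Tokenizer

# Word-granularity tokenizer: pre-segment with jieba, then fall back to
# BERT-style subword splitting for anything not in the 50k vocabulary.
tokenizer = Tokenizer(
    'chinese_t5_pegasus_base/vocab.txt',  # placeholder path to the 50k vocab
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False),
)

print(tokenizer.tokenize(u'科学空间的生成式预训练模型'))
```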
Besides training T5 PEGASUS with this new Tokenizer, we also used it to retrain a version of the WoBERT model (WoBERT+); readers are welcome to try that as well (link).
Pre-training Task #
For the pre-training task, we wanted something closer to natural language generation (rather than only predicting blanked-out spans, as T5 does), and as practically useful as possible. That led us to PEGASUS, from the paper 《PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization》. The paper presents PEGASUS as a pre-trained model tailored to summarization, but in our view it also serves as a general generative pre-training task. The broad idea of PEGASUS is to construct summary-like data pairs from plain text; T5 PEGASUS does not reproduce PEGASUS exactly, but borrows this idea, building its corpus via longest common subsequences.
Specifically, suppose a document has $n$ sentences. We select roughly $n/4$ of them (not necessarily contiguous) such that the text formed by concatenating those $n/4$ sentences has the longest possible common subsequence with the text formed by concatenating the remaining $3n/4$ sentences. We then treat the $3n/4$-sentence text as the source and the $n/4$-sentence text as the summary, giving a pseudo "(source, summary)" pair, and simply train the Seq2Seq model on such pairs. Note that if the document contains no repeated sentences, the source and the summary share no sentences at all, so this generation task is not mere copying of the source and retains genuine difficulty.
The selection is carried out by the following greedy search, step by step, until the length requirement is met (a code sketch follows the list):
1. First find one sentence whose longest common subsequence with the remaining $n-1$ sentences is maximal;
2. Given $k$ sentences already selected, find a $(k+1)$-th sentence such that the concatenation of these $k+1$ sentences has the maximal longest common subsequence with the concatenation of the remaining $n-k-1$ sentences.
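Here is a minimal sketch of this greedy construction (illustrative only; the actual preprocessing code may differ in details such as sentence splitting and efficiency tricks):

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (O(|a|*|b|) DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def build_pseudo_pair(sentences, ratio=0.25):
    """Greedily pick ~ratio*n sentences whose concatenation has the longest
    common subsequence with the rest; returns a (source, summary) pair."""
    n = len(sentences)
    if n < 2:
        raise ValueError('need at least two sentences')
    picked = set()
    while len(picked) < max(1, round(n * ratio)):
        best_i, best_score = None, -1
        for i in range(n):
            if i in picked:
                continue
            cand = picked | {i}
            summary = ''.join(sentences[j] for j in sorted(cand))
            source = ''.join(sentences[j] for j in range(n) if j not in cand)
            score = lcs_len(summary, source)
            if score > best_score:
                best_i, best_score = i, score
        picked.add(best_i)
    summary = ''.join(sentences[j] for j in sorted(picked))
    source = ''.join(sentences[j] for j in range(n) if j not in picked)
    return source, summary
```

Each document then yields one (source, summary) pair for Seq2Seq training. The brute-force LCS here is quadratic per candidate, so a production pipeline would cache or approximate it.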
Parameters and Configuration #
The currently open-sourced T5 PEGASUS is the base version, with 275 million parameters in total. Training used a maximum length of 512, batch_size 96, and a learning rate of $10^{-4}$, running 1 million steps on 6 RTX 3090 cards for about 13 days, over 30+ GB of carefully processed general-domain corpus; training accuracy reached about 47% and training loss about 2.97. The model was written, trained, and tested with bert4keras.
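For reference, loading the released weights with bert4keras looks roughly like this (a sketch with placeholder paths; `model='mt5.1.1'` reflects that T5 PEGASUS keeps mT5's T5.1.1 architecture):

```python
from bert4keras.models import build_transformer_model

config_path = 'chinese_t5_pegasus_base/config.json'    # placeholder paths
checkpoint_path = 'chinese_t5_pegasus_base/model.ckpt'

t5 = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    model='mt5.1.1',          # T5.1.1-style blocks, inherited from mT5
    return_keras_model=False,
    name='T5',
)
encoder = t5.encoder  # encodes the source text
decoder = t5.decoder  # autoregressive decoder over the target
model = t5.model      # full seq2seq model used for training
```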
Experiments and Evaluation #
On the CSL and LCSTS text generation tasks, T5 PEGASUS is the SOTA among all models we are aware of (a decoding sketch follows the tables):
\begin{array}{c}
\text{CSL summarization results}\\
{\begin{array}{c|c|cccc}
\hline
& \text{beam size} & \text{Rouge-L} & \text{Rouge-1} & \text{Rouge-2} & \text{BLEU} \\
\hline
\text{BERT} & 1 & 63.81 & 65.45 & 54.91 & 45.52 \\
\text{WoBERT} & 1 & 66.38 & 68.22 & 57.83 & 47.76 \\
\text{mT5} & 1 & 66.96 & 69.00 & 58.74 & \textbf{49.79} \\
\text{T5 PEGASUS} & 1 & \textbf{67.68} & \textbf{69.87} & \textbf{59.80} & 49.37 \\
\hline
\text{BERT} & 2 & 64.44 & 66.09 & 55.75 & 46.39 \\
\text{WoBERT} & 2 & 66.65 & 68.68 & 58.50 & 48.40 \\
\text{mT5} & 2 & 67.25 & 69.19 & 59.10 & \textbf{50.17} \\
\text{T5 PEGASUS} & 2 & \textbf{68.26} & \textbf{70.45} & \textbf{60.57} & 50.06 \\
\hline
\text{BERT} & 3 & 64.75 & 66.34 & 56.06 & 46.70 \\
\text{WoBERT} & 3 & 66.83 & 68.81 & 58.67 & 48.60 \\
\text{mT5} & 3 & 67.17 & 69.11 & 59.05 & 50.13 \\
\text{T5 PEGASUS} & 3 & \textbf{68.39} & \textbf{70.54} & \textbf{60.69} & \textbf{50.19} \\
\hline
\end{array}}\\
\\
\text{LCSTS summarization results}\\
{\begin{array}{c|c|cccc}
\hline
& \text{beam size} & \text{Rouge-L} & \text{Rouge-1} & \text{Rouge-2} & \text{BLEU} \\
\hline
\text{BERT} & 1 & 27.99 & 29.57 & 18.04 & 11.72 \\
\text{WoBERT} & 1 & \textbf{31.51} & 32.90 & 21.13 & 13.74 \\
\text{mT5} & 1 & 28.92 & 30.75 & 19.54 & 13.21 \\
\text{T5 PEGASUS} & 1 & 31.21 & \textbf{33.53} & \textbf{21.54} & \textbf{14.47} \\
\hline
\text{BERT} & 2 & 29.20 & 30.70 & 19.17 & 12.64 \\
\text{WoBERT} & 2 & \textbf{31.91} & 33.35 & 21.55 & 14.13 \\
\text{mT5} & 2 & 29.96 & 31.67 & 20.40 & 13.84 \\
\text{T5 PEGASUS} & 2 & 31.47 & \textbf{34.00} & \textbf{21.98} & \textbf{14.75} \\
\hline
\text{BERT} & 3 & 29.45 & 30.95 & 19.50 & 12.93 \\
\text{WoBERT} & 3 & \textbf{32.19} & 33.72 & 21.81 & 14.29 \\
\text{mT5} & 3 & 30.15 & 31.97 & 20.72 & 14.05 \\
\text{T5 PEGASUS} & 3 & 31.78 & \textbf{34.12} & \textbf{22.23} & \textbf{14.96} \\
\hline
\end{array}}
\end{array}
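All the numbers above come from beam-search decoding. A minimal decoder in bert4keras's style might look like this (a sketch assuming the `tokenizer`, `encoder`, and `decoder` from the earlier snippets; the released fine-tuning script follows the same pattern but may differ in details):

```python
import numpy as np
from bert4keras.snippets import AutoRegressiveDecoder

class AutoTitle(AutoRegressiveDecoder):
    """Encode the source once, then beam-search the target token by token."""

    @AutoRegressiveDecoder.wraps(default_rtype='probas')
    def predict(self, inputs, output_ids, states):
        c_encoded = inputs[0]
        # Probabilities for the next token, given encoder output and prefix.
        return decoder.predict([c_encoded, output_ids])[:, -1]

    def generate(self, text, topk=1):  # topk is the beam size
        c_token_ids, _ = tokenizer.encode(text, maxlen=512)
        c_encoded = encoder.predict(np.array([c_token_ids]))[0]
        output_ids = self.beam_search([c_encoded], topk=topk)
        return tokenizer.decode(output_ids)

autotitle = AutoTitle(
    start_id=tokenizer._token_start_id,
    end_id=tokenizer._token_end_id,
    maxlen=64,
)
print(autotitle.generate(u'这里是一段待生成摘要的原文', topk=3))
```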
More importantly, T5 PEGASUS shows outstanding few-shot learning ability:
\begin{array}{c}
\text{CSL summarization results (few-shot, beam size = 1)}\\
{\begin{array}{c|c|cccc}
\hline
& \text{\# samples} & \text{Rouge-L} & \text{Rouge-1} & \text{Rouge-2} & \text{BLEU} \\
\hline
\text{WoBERT} & 10000 & 66.38 & 68.22 & 57.83 & 47.76 \\
\text{mT5} & 10000 & 66.96 & 69.00 & 58.74 & \textbf{49.79} \\
\text{T5 PEGASUS} & 10000 & \textbf{67.68} & \textbf{69.87} & \textbf{59.80} & 49.37 \\
\hline
\text{WoBERT} & 1000 & 59.34 & 60.42 & 49.07 & 37.87 \\
\text{mT5} & 1000 & 59.91 & 61.52 & 50.38 & 40.87 \\
\text{T5 PEGASUS} & 1000 & \textbf{63.12} & \textbf{65.28} & \textbf{54.54} & \textbf{43.55} \\
\hline
\text{WoBERT} & 100 & 55.68 & 55.33 & 43.10 & 31.55 \\
\text{mT5} & 100 & 55.33 & 54.62 & 42.78 & 32.50 \\
\text{T5 PEGASUS} & 100 & \textbf{60.87} & \textbf{62.78} & \textbf{52.30} & \textbf{41.40} \\
\hline
\text{WoBERT} & 10 & 26.32 & 20.99 & 12.29 & 5.76 \\
\text{mT5} & 10 & 26.62 & 27.00 & 17.95 & 13.11 \\
\text{T5 PEGASUS} & 10 & \textbf{55.85} & \textbf{57.66} & \textbf{47.52} & \textbf{35.97} \\
\hline
\end{array}}
\end{array}
Even with the number of labeled samples reduced to 10, T5 PEGASUS can still be fine-tuned into a summary (title) generation model that clearly outperforms the other models. On LCSTS, T5 PEGASUS shows a similar few-shot effect; the non-T5-PEGASUS models simply performed too poorly there, so we have not included that table.
Few-Shot Demonstration #
Below are generation results from the model trained with only 10 labeled samples:
Input: To precisely measure the reliability and fault tolerance of multiprocessor systems modeled on hypercube networks, and noting that such systems often suffer structural failures when attacked by computer viruses, the structure connectivity and substructure connectivity of the n-dimensional hypercube network are studied. First, an upper bound on the 3-path structure connectivity is obtained by constructing 3-path structure cuts of the n-dimensional hypercube network; then, a lower bound on the 3-path substructure connectivity is obtained by constructing 3-path substructure sets through equivalent or reduction transformations; finally, using the property that the 3-path structure connectivity of any network is no less than its 3-path substructure connectivity, it is confirmed that the 3-path structure connectivity and substructure connectivity of the hypercube network both equal the network's dimension

Title: 3-path structure connectivity and substructure connectivity of hypercube networks

Prediction: Research on evaluating the structure connectivity and substructure connectivity of n-dimensional hypercube networks

Input: To address the low prediction accuracy, heavy computation, and high energy consumption of traditional wireless body area network (WBAN) prediction models on sensed data, an adaptive cubic exponential smoothing algorithm based on a penalty error matrix is proposed. First, a lightweight prediction model is built between sensing nodes and routing nodes; next, a carpet search is used to optimize the model parameters; finally, a penalty error matrix further refines the parameters. Experimental results show that, compared with the ZigBee protocol, the proposed method saves about 12% of energy within 1000 time slots; and compared with carpet search, the penalty error matrix improves prediction accuracy by 3.306%. The method effectively reduces computational complexity while further lowering WBAN energy consumption

Title: Energy-saving synchronous-prediction method for wireless body area networks based on a penalty error matrix

Prediction: Adaptive cubic exponential smoothing algorithm based on a penalty error matrix

Input: To address inefficient message transmission and heavy network resource overhead in the Internet of Vehicles (IoV), a routing algorithm based on cognitive interaction between vehicle nodes is proposed for urban traffic scenarios. First, drawing on trust theory, the concept of node cognitive-interaction degree is proposed; vehicle nodes in the IoV are classified on this basis and assigned different initial values. Influence factors such as interaction time, interaction frequency, physical distance between nodes, hop count, and message lifetime are also introduced to build a cognitive-interaction evaluation model for vehicle nodes. Based on this model, the cognitive-interaction degree is computed and updated, and neighbor nodes with relatively higher values are chosen as relay nodes for message forwarding

Title: Routing algorithm based on node cognitive interaction in an Internet of Vehicles environment

Prediction: Routing algorithm based on cognitive interaction of vehicle nodes

Input: For the joint estimation of direction of arrival (DOA) and range of near-field sources, a near-field iterative adaptive approach (NF-IAA) is proposed. First, a 2D grid over the near-field region represents all possible source locations, each treated as a potential source impinging on the array, from which the array output data model is written; then, each iteration uses the previous spectrum estimate to build the signal covariance matrix, whose inverse serves as a weighting matrix for estimating the potential source power at every location; finally a 3D power spectrum is drawn, and since only genuinely present sources have nonzero power, the peaks indicate the true source locations. Simulations show that with 10 snapshots the DOA resolution probability of NF-IAA reaches 9

Title: Joint two-dimensional parameter estimation of near-field sources based on an iterative adaptive approach

Prediction: Near-field iterative adaptive algorithm based on nf-iaa

Input: To address the insufficient consideration of collaboration among workers in existing software-crowdsourcing worker-selection mechanisms, a worker-selection mechanism based on active-time grouping is proposed on top of the bidding model. First, crowd workers are divided into collaborative development groups by active time; then each group's weight is computed from its members' development ability and collaboration factor; finally, the group with the largest weight is selected as the optimal working group, and the most suitable worker in that group is chosen for each task module according to module complexity. Experimental results show that, compared with ability-first selection, the mechanism loses only 0.57% in average worker ability, while the guaranteed collaboration reduces average project risk by 32%, effectively guiding crowdsourced software tasks that require multi-worker collaboration

Title: Worker-selection mechanism for software crowdsourcing based on active-time grouping

Prediction: Worker-selection mechanism for software crowdsourcing based on active-time grouping
As you can see, even with very few labeled samples the model still produces quite readable output, because the PEGASUS-style pseudo-summary pre-training is very close to the downstream task.
A Brief Summary #
This post presented our Chinese generative pre-trained model, T5 PEGASUS. Built on mT5 and pre-trained on Chinese corpora with PEGASUS-style pseudo-summarization, it achieves solid text generation performance and, in particular, excellent few-shot learning. Readers with text generation needs are welcome to use it.
When reposting, please include this article's address: https://kexue.fm/archives/8209
For detailed reposting policies, please refer to 《科学空间FAQ》.
If you still have questions or suggestions, feel free to continue the discussion in the comments below.
To cite this article, please use:
苏剑林. (Mar. 03, 2021). 《T5 PEGASUS:开源一个中文生成式预训练模型》[Blog post]. Retrieved from https://kexue.fm/archives/8209
@online{kexuefm-8209,
title={T5 PEGASUS:开源一个中文生成式预训练模型},
author={苏剑林},
year={2021},
month={Mar},
url={\url{https://kexue.fm/archives/8209}},
}
May 19th, 2021
Hello, I have a couple of questions while reproducing the blog's results:
1. Are the LCSTS results in the post ROUGE scores computed after word segmentation? From my earlier experiments and from papers, the current SOTA character-level Rouge-1 should be around 44.0 (using the sample selection of the LCSTS paper, test = 725 examples).
2. Running the latest t5-pegasus finetune.py with bert4keras 0.10.0 on LCSTS (bs=8, max_len=256, other settings default), the loss drops quickly within the first 100 steps and then plateaus around 3.35; the test-set results behave the same way (Rouge-1 about 0.35 after 100 steps, barely changing afterwards).
An update on question 2: fine-tuning needs a lower lr (on the order of 1e-5); with that, the loss curve and test results look basically normal.
1. They are character-based. Different ROUGE implementations seem to give rather different numbers;
2. Good to know.
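For readers comparing numbers: character-based ROUGE usually means splitting the Chinese strings into characters before scoring. One common recipe uses the `rouge` package (a sketch of that recipe, not necessarily the exact evaluation code behind the tables above):

```python
from rouge import Rouge  # pip install rouge

rouge = Rouge()

def char_rouge(pred: str, ref: str):
    # Insert spaces between characters so the word-level scorer works on chars.
    return rouge.get_scores(' '.join(pred), ' '.join(ref))

print(char_rouge(u'超立方体网络的连通度', u'超立方体网络连通度研究'))
```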
June 10th, 2021
A beginner's question~ 苏神, if summaries need to be generated according to article type, can I turn the category into an id and prepend it to the tokenizer output (e.g. [start_id] + [class_id] + token_ids + [end_id]), and then, when computing the loss, offset y_true by 2 positions and y_pred by 1?
August 30th, 2021
Using T5 PEGASUS raises an error: tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 592068290 vs. calculated on the restored bytes 517592881
The Google mT5 weights load fine; could the T5 PEGASUS ckpt be corrupted?
Any advice?
With tensorflow 1.13.1 the problem went away.
I suggest re-downloading and unpacking the weights.
September 13th, 2021
苏神, how do you initialize the parameters for Chinese T5 pre-training? And how large a corpus does pre-training generally need?
I initialize from mT5. My corpus here is 30+ GB, but there is no hard requirement.
September 16th, 2021
@苏剑林|comment-17340
Your vocabulary is 50k, different from mT5's, so how can one pre-train from it? I tried the multilingual mT5 and got mismatched parameters:
Traceback (most recent call last):
File "D:/codes/summary/t5-pegasus/demo.py", line 49, in
state_dict=torch.load(os.path.join(restore_path, model_name), map_location='cpu'))
File "D:\tools\Anaconda3\envs\chat\lib\site-packages\transformers\modeling_utils.py", line 1159, in from_pretrained
model.__class__.__name__, "\n\t".join(error_msgs)
RuntimeError: Error(s) in loading state_dict for MT5ForConditionalGeneration:
size mismatch for shared.weight: copying a param with shape torch.Size([50000, 512]) from checkpoint, the shape in current model is torch.Size([32128, 512]).
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50000, 512]) from checkpoint, the shape in current model is torch.Size([32128, 512]).
size mismatch for encoder.block.0.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.0.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.0.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.0.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight: copying a param with shape torch.Size([32, 6]) from checkpoint, the shape in current model is torch.Size([32, 8]).
size mismatch for encoder.block.0.layer.1.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.0.layer.1.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.0.layer.1.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for encoder.block.1.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.1.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.1.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.1.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.1.layer.1.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.1.layer.1.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.1.layer.1.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for encoder.block.2.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.2.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.2.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.2.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.2.layer.1.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.2.layer.1.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.2.layer.1.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for encoder.block.3.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.3.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.3.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.3.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.3.layer.1.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.3.layer.1.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.3.layer.1.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for encoder.block.4.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.4.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.4.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.4.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.4.layer.1.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.4.layer.1.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.4.layer.1.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for encoder.block.5.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.5.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.5.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.5.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for encoder.block.5.layer.1.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.5.layer.1.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for encoder.block.5.layer.1.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50000, 512]) from checkpoint, the shape in current model is torch.Size([32128, 512]).
size mismatch for decoder.block.0.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight: copying a param with shape torch.Size([32, 6]) from checkpoint, the shape in current model is torch.Size([32, 8]).
size mismatch for decoder.block.0.layer.1.EncDecAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.1.EncDecAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.1.EncDecAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.1.EncDecAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.0.layer.2.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.0.layer.2.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.0.layer.2.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for decoder.block.1.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.1.EncDecAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.1.EncDecAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.1.EncDecAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.1.EncDecAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.1.layer.2.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.1.layer.2.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.1.layer.2.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for decoder.block.2.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.1.EncDecAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.1.EncDecAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.1.EncDecAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.1.EncDecAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.2.layer.2.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.2.layer.2.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.2.layer.2.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for decoder.block.3.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.1.EncDecAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.1.EncDecAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.1.EncDecAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.1.EncDecAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.3.layer.2.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.3.layer.2.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.3.layer.2.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for decoder.block.4.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.1.EncDecAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.1.EncDecAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.1.EncDecAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.1.EncDecAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.4.layer.2.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.4.layer.2.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.4.layer.2.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for decoder.block.5.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.1.EncDecAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.1.EncDecAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.1.EncDecAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.1.EncDecAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for decoder.block.5.layer.2.DenseReluDense.wi_0.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.5.layer.2.DenseReluDense.wi_1.weight: copying a param with shape torch.Size([1024, 512]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
size mismatch for decoder.block.5.layer.2.DenseReluDense.wo.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([50000, 512]) from checkpoint, the shape in current model is torch.Size([32128, 512]).
Can't you just pick out the corresponding 50k embeddings to initialize with?
October 9th, 2021
Does encoder.predict inside generate support batching? After sequence_padding, the batched results differ a lot from generating examples one at a time.
The latest version should be fine.
I suspect the decoder's multi-head attention is not picking up the attention mask from the input...
Also, I tried truncating the padded encoder outputs back to the actual lengths before decoding, and the results still differ from decoding a single unpadded example (even though the encoder's predicted output tensors are identical). Why?
Updating to the latest version fixed it.
October 12th, 2021
Hello 苏神, based on your open-sourced model I trained mT5-pegasus on my own data for concept extraction, and it works very well. One question though: the generated results are missing some symbols and rare characters from the original sentences, presumably because your vocabulary dropped them. If I need those symbols and characters in the output, do I have to pre-train again with a full vocabulary, or is there another way? Thanks!
You can append the new characters to vocab.txt, then pass compound_tokens or keep_tokens to build_transformer_model to append embeddings for the new characters.
Thanks for the reply, I'll give it a try!
Hello, may I ask how you do this concept extraction? I'd like to discuss it with you.
I misspoke earlier: I fine-tune the pre-trained chinese_t5_pegasus_base, but my data contains characters and symbols outside the 50k vocabulary. If I append embeddings for them from the mT5_base model, won't they be inconsistent with the 50k word embeddings that went through the PEGASUS pre-training task?
You don't need to involve mT5 at all. What I meant is: append the new characters to T5 PEGASUS's vocab.txt, then pass compound_tokens or keep_tokens to build_transformer_model to append embeddings for them, initializing each new character's embedding from some existing token among the 50k (e.g. unk). The model will learn them anyway, so the initialization barely matters. This process has nothing to do with mT5.
Just saw this. OK, I understand now; if the initialization doesn't matter then there's no problem. Thanks for the explanation!
苏神, new words can be initialized arbitrarily, but since they were never pre-trained, is it best to fine-tune for a few more epochs?
It's still better to decide when to stop based on validation performance.
Actually the correct order is: a word appears often enough, and that is why you decide to add it as a new word; so in theory, newly added words will be trained at a relatively high frequency.
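To make the suggestion in this thread concrete, here is a hypothetical sketch (the character list, paths, and the choice of initializing from [UNK] are illustrative, not from the thread):

```python
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import load_vocab

# Append missing characters to T5 PEGASUS's released vocabulary.
token_dict = load_vocab('chinese_t5_pegasus_base/vocab.txt')  # placeholder path
new_chars = ['㙟', '嘢']  # hypothetical characters absent from the 50k vocab
compound_tokens = []
for c in new_chars:
    if c not in token_dict:
        token_dict[c] = len(token_dict)
        # Each entry lists existing token ids whose embeddings initialize the
        # new row; reusing [UNK]'s embedding is enough, fine-tuning adjusts it.
        compound_tokens.append([token_dict['[UNK]']])

t5 = build_transformer_model(
    config_path='chinese_t5_pegasus_base/config.json',
    checkpoint_path='chinese_t5_pegasus_base/model.ckpt',
    model='mt5.1.1',
    compound_tokens=compound_tokens,  # appends rows to the embedding matrix
    return_keras_model=False,
    name='T5',
)
# The Tokenizer should then be built from the extended token_dict.
```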
November 1st, 2021
How can BERT do summarization?
https://kexue.fm/archives/6933
November 26th, 2021
Is this suitable for keyword-based question generation? For example, input "姚明" (Yao Ming) and generate "姚明是哪里人" (Where is Yao Ming from). Judging from your pre-training task, it doesn't seem well suited to this kind of problem.
Used as-is it certainly isn't. You can construct your own corpus and fine-tune on it.
December 23rd, 2021
Hello, how can a trained model be converted to a pb model? I tried calling model.save('saved_model') after training and it failed. Is it that the model just cannot be converted to run on TensorFlow Serving?