EasyNLP Chinese Text-to-Image Generation Model: Become an Artist in Seconds

2024-11-02 Technology

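...generates a high-resolution large image, as shown in the figure below:

We do not go into these details in this article. Interested readers can consult the references for more.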

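The EasyNLP Text-to-Image Generation Model

The models described above often have billions or tens of billions of parameters. Such very large models can generate high-quality images, but their demands on compute resources and pre-training data make them hard to use widely in the open-source community, especially where domain-specific adaptation is needed. In this section we describe in detail the Chinese text-to-image generation model provided by EasyNLP, which still delivers good text-to-image generation quality with a comparatively small parameter count.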

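Model Architecture

The model architecture is shown in the figure below:

Because the complexity of the Transformer grows quadratically with sequence length, text-to-image generation models are generally trained in two stages: image vector quantization and autoregressive training.

Image vector quantization discretizes an image into tokens. For example, a 256×256 RGB image is downsampled by a factor of 16 to give a 16×16 discrete sequence, in which each image token corresponds to an entry in the codebook. Common image vector quantization methods include VQVAE, VQVAE-2, and VQGAN. We use the weights of a VQGAN trained on ImageNet, f16_16384 (16× downsampling, codebook size 16384), to generate the discrete image sequence.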
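To make the token arithmetic concrete, the sketch below computes the token grid for a 256×256 image at 16× downsampling and runs a toy nearest-neighbour codebook lookup. The helper names `token_grid` and `quantize` and the tiny 2-D codebook are illustrative assumptions; they are not part of EasyNLP or VQGAN, whose codebook has 16384 learned entries.

```python
def token_grid(image_size, downsample):
    """Return (side, num_tokens) of the discrete token grid."""
    side = image_size // downsample
    return side, side * side

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist(v, codebook[i]))
            for v in vectors]

side, num_tokens = token_grid(256, 16)            # 16 x 16 grid, 256 tokens
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy codebook with 3 entries
indices = quantize([(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)], codebook)
```

Each index plays the role of one image token; a real encoder produces one such index per cell of the 16×16 grid.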

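Autoregressive training takes the text sequence and the image sequence as input. Within the image part, each image token can attend only to the tokens of the text sequence and to the image tokens that precede it. We use GPT as the backbone, which can be adapted to generation tasks at different model scales. At inference time, the text sequence is fed in, the model autoregressively generates a fixed-length image sequence step by step, and the VQGAN decoder reconstructs that sequence into an image.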
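The attention rule can be pictured as a lower-triangular mask over the concatenated text and image positions. The following is a minimal sketch, assuming a standard GPT-style causal mask that applies to the whole sequence; `causal_mask` is a hypothetical helper, not an EasyNLP function.

```python
def causal_mask(text_len, img_len):
    """mask[i][j] is True when position i may attend to position j."""
    total = text_len + img_len
    return [[j <= i for j in range(total)] for i in range(total)]

# Toy sizes: 2 text tokens followed by 3 image tokens.
mask = causal_mask(text_len=2, img_len=3)
first_img_row = mask[2]   # first image token: sees both text tokens and itself
last_img_row = mask[4]    # last image token: sees everything before it
```

With the sizes used in this article (32 text tokens, 256 image tokens) the same function would produce a 288×288 mask.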

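Open-Source Model Parameter Settings

EasyNLP provides two versions of the Chinese text-to-image generation model. Their configurations are listed in the table below: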

| Model configuration | pai-painter-base-zh | pai-painter-large-zh |
| Parameters | 202M | 433M |
| Number of Layers | | |
| Attention Heads | | |
| Hidden Size | 768 | 1024 |
| Text Length | | |
| Image Length | 16 x 16 | 16 x 16 |
| Image Size | 256 x 256 | 256 x 256 |
| VQGAN Codebook Size | 16384 | 16384 |

Model Implementation

In the EasyNLP framework, we build the model at the model layer on a minGPT-based backbone. The core parts are as follows:

self.first_stage_model = VQModel(ckpt_path=vqgan_ckpt_path).eval()

self.transformer = GPT(self.config)

The encoding stage of VQModel is:

# in easynlp/appzoo/text2image_generation/model.py
@torch.no_grad()
def encode_to_z(self, x):
    quant_z, _, info = self.first_stage_model.encode(x)
    indices = info[2].view(quant_z.shape[0], -1)
    return quant_z, indices

x = inputs['image']
x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
# one step to produce the logits
_, z_indices = self.encode_to_z(x)  # z_indices: torch.Size([batch_size, 256])

The decoding stage of VQModel is:

# in easynlp/appzoo/text2image_generation/model.py
@torch.no_grad()
def decode_to_img(self, index, zshape):
    bhwc = (zshape[0], zshape[2], zshape[3], zshape[1])
    quant_z = self.first_stage_model.quantize.get_codebook_entry(
        index.reshape(-1), shape=bhwc)
    x = self.first_stage_model.decode(quant_z)
    return x

# sample generates results during training; it is similar to generate,
# which is used at inference time (see generate below for details)
index_sample = self.sample(z_start_indices, c_indices,
                           steps=z_indices.shape[1],
                           )  # further keyword arguments are cut off in the source
x_sample = self.decode_to_img(index_sample, quant_z.shape)
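The `bhwc` tuple in `decode_to_img` simply reorders the latent shape from (batch, channels, height, width) to (batch, height, width, channels) before the codebook lookup. A small sketch of that bookkeeping follows; the helper names and the example shape (a batch of 4 with 256-dimensional codes on a 16×16 grid) are assumptions chosen for illustration.

```python
def bchw_to_bhwc(zshape):
    """Reorder a (batch, channels, height, width) shape to (batch, height, width, channels)."""
    return (zshape[0], zshape[2], zshape[3], zshape[1])

def num_indices(zshape):
    """Number of codebook indices expected after index.reshape(-1)."""
    return zshape[0] * zshape[2] * zshape[3]

zshape = (4, 256, 16, 16)   # hypothetical quant_z.shape
bhwc = bchw_to_bhwc(zshape)
count = num_indices(zshape)
```

For this shape the flattened index tensor would hold 4 × 16 × 16 = 1024 entries, one per grid cell per image.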

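The Transformer is built on minGPT. It takes the text tokens concatenated with the discrete image codes as input and predicts the image tokens. The forward pass is: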

# in easynlp/appzoo/text2image_generation/model.py
def forward(self, inputs):
    x = inputs['image']
    c = inputs['text']
    x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
    # one step to produce the logits
    _, z_indices = self.encode_to_z(x)  # z_indices: torch.Size([batch_size, 256])
    c_indices = c

    # during training, randomly replace a fraction (1 - pkeep) of image tokens
    if self.training and self.pkeep < 1.0:
        mask = torch.bernoulli(self.pkeep*torch.ones(z_indices.shape,
                                                     device=z_indices.device))
        mask = mask.round().to(dtype=torch.int64)
        r_indices = torch.randint_like(z_indices, self.transformer.config.vocab_size)
        a_indices = mask*z_indices+(1-mask)*r_indices
    else:
        a_indices = z_indices

    cz_indices = torch.cat((c_indices, a_indices), dim=1)

    # target includes all sequence elements (no need to handle first one
    # differently because we are conditioning)
    target = z_indices
    # make the prediction
    logits, _ = self.transformer(cz_indices[:, :-1])
    # cut off conditioning outputs - output i corresponds to p(z_i | z_{<i}, c)
    logits = logits[:, c_indices.shape[1]-1:]

    return logits, target
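The two slices at the end of `forward` are what line up each prediction with its target. The sketch below replays the same indexing on labelled placeholders instead of tensors, assuming 32 text tokens and 256 image tokens; `aligned_positions` is a hypothetical helper written only for this illustration.

```python
def aligned_positions(text_len, img_len):
    """Replay the slicing in forward() on labelled positions."""
    seq = ["c%d" % i for i in range(text_len)] + ["z%d" % i for i in range(img_len)]
    inputs = seq[:-1]                 # mirrors cz_indices[:, :-1]
    # The output at position p is a prediction for the token at position p + 1.
    predicted = seq[1:]
    kept = predicted[text_len - 1:]   # mirrors logits[:, c_indices.shape[1]-1:]
    return inputs, kept

inputs, kept = aligned_positions(text_len=32, img_len=256)
```

The model sees 287 input positions, and the 256 outputs that are kept predict exactly `z0` through `z255`, which is why `target = z_indices` needs no shifting.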

At inference time, the input is a sequence of text tokens and the output is a 256×256 image. First, the input text is preprocessed into a token sequence:

# in easynlp/appzoo/text2image_generation/predictor.py
def preprocess(self, in_data):
    if not in_data:
        raise RuntimeError("Input data should not be None.")

    if not isinstance(in_data, list):
        in_data = [in_data]
    rst = {"idx": [], "input_ids": []}
    max_seq_length = -1
    for record in in_data:
        if "sequence_length" not in record:
            break
        max_seq_length = max(max_seq_length, record["sequence_length"])
    max_seq_length = self.sequence_length if (max_seq_length == -1) else max_seq_length

    for record in in_data:
        text = record[self.first_sequence]
        try:
            self.MUTEX.acquire()
            text_ids = self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text))
            # truncate or pad to exactly text_len tokens
            text_ids = text_ids[: self.text_len]
            n_pad = self.text_len - len(text_ids)
            text_ids += [self.pad_id] * n_pad
            # shift text ids past the image vocabulary
            text_ids = np.array(text_ids) + self.img_vocab_size
        finally:
            self.MUTEX.release()
        rst["idx"].append(record["idx"])
        rst["input_ids"].append(text_ids)
    return rst
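The core of `preprocess` is three small steps: truncate, pad, and shift the text ids past the image vocabulary so that text and image tokens can share one id space. Here is a minimal sketch of those steps on plain lists; `encode_text_ids` and the pad id of 0 are assumptions made for the example, not EasyNLP values.

```python
def encode_text_ids(token_ids, text_len, pad_id, img_vocab_size):
    """Truncate or pad to text_len, then offset ids by the image vocabulary size."""
    ids = list(token_ids[:text_len])
    ids += [pad_id] * (text_len - len(ids))
    return [i + img_vocab_size for i in ids]

# A short input is padded, a long one is truncated (text_len=4 keeps the demo small).
short = encode_text_ids([5, 9], text_len=4, pad_id=0, img_vocab_size=16384)
long = encode_text_ids([1, 2, 3, 4, 5, 6], text_len=4, pad_id=0, img_vocab_size=16384)
```

Because every text id ends up at 16384 or above, it can never collide with an image code, which occupies the range 0 to 16383.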

Next, a discrete image token sequence of length 16×16 is generated step by step:

# in easynlp/appzoo/text2image_generation/model.py
def generate(self, inputs, top_k=100, temperature=1.0):
    cidx = inputs
    sample = True
    steps = 256
    for k in range(steps):
        x_cond = cidx
        logits, _ = self.transformer(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = self.top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        cidx = torch.cat((cidx, ix), dim=1)
    # drop the 32 text tokens and keep only the generated image tokens
    img_idx = cidx[:, 32:]
    return img_idx
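A single step of the loop above (temperature scaling, top-k filtering, softmax, sampling) can be reproduced without PyTorch. The following is a sketch using only the standard library; the function names are hypothetical, and ties at the top-k threshold are all kept, which may differ from the behaviour of `top_k_logits` in the real code.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    """Numerically stable softmax; -inf logits get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, top_k=None, temperature=1.0, rng=random):
    """Return (sampled index, probabilities) for one generation step."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        scaled = top_k_filter(scaled, top_k)
    probs = softmax(scaled)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0], probs

token, probs = sample_next([2.0, 1.0, 0.5, -1.0], top_k=2,
                           temperature=1.0, rng=random.Random(0))
```

With `top_k=2` only the two highest-scoring candidates can ever be drawn. Lowering `temperature` sharpens the distribution toward the top candidate, and raising it flattens the distribution.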

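Finally, we call the VQModel decoding procedure to convert these discrete image token sequences into images.

Model Performance

We evaluated the EasyNLP text-to-image generation model on four public Chinese datasets: COCO-CN, MUGE, Flickr8k-CN, and Flickr30k-CN. We also compared it with CogView and DALL-E, as shown below: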

| Paradigm | Model | #Param. | COCO-CN FID↓ | COCO-CN IS↑ | MUGE FID↓ | MUGE IS↑ | Flickr8k-CN FID↓ | Flickr8k-CN IS↑ | Flickr30k-CN FID↓ | Flickr30k-CN IS↑ |
| Zero-shot | CogView | | 102.30 | 11.81±0.84 | 29.08 | 10.71±0.40 | 102.01 | 11.58±0.66 | 103.34 | 10.50±0.35 |
| Zero-shot | DALL-E | 209M | 89.73 | 10.32±0.64 | 40.28 | 9.90±0.48 | 77.84 | 10.57±0.37 | 77.08 | 10.03±0.60 |
| Fine-tuning | DALL-E | 209M | 84.73 | 11.08±0.89 | 22.42 | 10.28±0.44 | 72.17 | 9.89±0.41 | 68.75 | 9.86±0.40 |
| Fine-tuning | pai-painter-base-zh | 202M | 76.89 | 11.65±0.89 | 13.31 | 11.91±0.36 | 55.56 | 12.54±0.48 | 55.66 | 10.19±0.30 |

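Notes:

1) MUGE is a large-scale Chinese multimodal evaluation benchmark for e-commerce scenarios, released on the Tianchi platform. To make the metrics easier to compute, we report results on the valid set for MUGE and on the test set for the other datasets.

2) CogView source:

3) There is no publicly available official code for the DALL-E model. The part that has been released contains only the VQVAE code, not the Transformer. We reproduced DALL-E from a widely followed open-source version of the code and the checkpoints recommended by that version. The checkpoints have 209 million parameters, roughly 1/100 of the parameter count of OpenAI's DALL-E (OpenAI's DALL-E has 12 billion parameters, of which CLIP accounts for 400 million).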

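Showcase

We fine-tuned the base and large models on the natural-scenery dataset COCO-CN. The examples below show the results:

Example 1: 一只俏皮的狗正跑过草地 (A playful dog is running across the grass)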

pai-painter-base-zh

pai-painter-large-zh

Example 2: 一片水域的夜景以黎明为背景 (A night view over water with dawn in the background)

pai-painter-base-zh

pai-painter-large-zh

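We also collected a large volume of e-commerce product data from Alibaba Group and fine-tuned a text-to-image generation model for e-commerce products. The results are as follows:

Example 3: 女童套头毛衣打底衫深秋针织衫童装儿童内搭衬衫 (girls' pullover sweater, base-layer top, late-autumn knitwear, children's clothing, kids' inner shirt)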

pai-painter-base-zh

pai-painter-large-zh

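Example 4: 春夏真皮工作鞋女深色软皮久站舒适上班面试拳击手皮革 (women's spring/summer genuine-leather work shoes, dark soft leather, comfortable for long standing, for office and interviews)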

pai-painter-base-zh

pai-painter-large-zh

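Beyond supporting applications in specific domains, text-to-image generation also greatly assists human artistic creation. With the trained model, we can instantly become a "Chinese painting artist". Examples are shown below: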

静夜沉沉，浮光霭霭

眺望山下为谁好，忽闻楚些最让人右腿

风阁水帘今在斑，且来先看早梅红

却说说春风偏有康，露花千朵照庭闱

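For more examples, see:

Tutorial

Having enjoyed the works generated by the model, what should we do if we want to DIY and train our own text-to-image generation model? Below we briefly describe how to fine-tune and run inference with the pre-trained text-to-image generation model in the EasyNLP framework.

Installing EasyNLP

Users can install the EasyNLP algorithm framework directly by following the instructions at the linked page.

Data Preparation

First prepare the training and validation data as tsv files. Each file contains three tab-separated columns: the first is the index, the second is the text, and the third is the base64 encoding of the image. The input file used for testing has two columns and contains only the index and the text.

For developers' convenience, we also provide sample code for converting an image to base64: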

import base64
from io import BytesIO
from PIL import Image

img = Image.open(fn)  # fn is the path to the image file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes
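Putting the encoded image into the three-column layout described above could look like the following sketch. It uses only the standard library and a stand-in byte string rather than a real image; `make_tsv_row` and `parse_tsv_row` are hypothetical helpers, not part of EasyNLP.

```python
import base64

def make_tsv_row(idx, text, image_bytes):
    """Build one training row: index, text, base64-encoded image."""
    img_b64 = base64.b64encode(image_bytes).decode("ascii")
    return "\t".join([str(idx), text, img_b64])

def parse_tsv_row(row):
    """Split a row back into (index, text, raw image bytes)."""
    idx, text, img_b64 = row.rstrip("\n").split("\t")
    return idx, text, base64.b64decode(img_b64)

fake_image = b"\x89PNG\r\n\x1a\n" + bytes(range(16))   # stand-in bytes, not a real image
row = make_tsv_row(0, "一只俏皮的狗正跑过草地", fake_image)
idx, text, decoded = parse_tsv_row(row)
```

Base64 output never contains a tab or newline, so the encoded image is safe to store as a single tsv field.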

The following files have already been preprocessed and can be used for testing:

# train

_text2image/MUGE_train_text_img.tsv

# valid

_text2image/MUGE_val_text_img.tsv

# test

_text2image/MUGE_test.text.tsv

Model Training

We fine-tune the model with the following command:

easynlp \
    --mode=train \
    --worker_gpu=1 \
    --tables=MUGE_val_text_img.tsv,MUGE_val_text_img.tsv \
    --input_schema=idx:str:1,text:str:1,img:str:1 \
    --first_sequence=text \
    --second_sequence=img \
    --checkpoint_dir=./finetuned_model/ \
    --learning_rate=4e-5 \
    --epoch_num=1 \
    --random_seed=42 \
    --logging_steps=100 \
    --save_checkpoint_steps=1000 \
    --sequence_length=288 \
    --micro_batch_size=16 \
    --app_name=text2image_generation \
    --user_defined_parameters='
        pretrain_model_name_or_path=alibaba-pai/pai-painter-large-zh
        size=256
        text_len=32
        img_len=256
        img_vocab_size=16384
    '

We provide pre-trained models in two versions, base and large. Set pretrain_model_name_or_path to alibaba-pai/pai-painter-base-zh or alibaba-pai/pai-painter-large-zh respectively.

After training, the model is saved to ./finetuned_model/.
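The numeric settings in the command appear to fit together: a 256-pixel image at 16× downsampling gives a 16×16 grid of 256 image tokens, and 32 text tokens plus 256 image tokens equals the sequence length of 288. The article does not state this relationship explicitly, so the check below is an inference from the values, and `check_config` is a hypothetical helper.

```python
def check_config(size, downsample, text_len, img_len, sequence_length):
    """Return True when the image grid and total sequence length are consistent."""
    grid = size // downsample
    return img_len == grid * grid and sequence_length == text_len + img_len

ok = check_config(size=256, downsample=16, text_len=32,
                  img_len=256, sequence_length=288)
mismatch = check_config(size=256, downsample=16, text_len=32,
                        img_len=256, sequence_length=256)
```

A check like this can be handy before launching a long run when text_len or size has been changed.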

Model Batch Inference

Once training is finished, the model can be used to generate images. An example follows:

easynlp \
    --mode=predict \
    --worker_gpu=1 \
    --tables=MUGE_test.text.tsv \
    --input_schema=idx:str:1,text:str:1 \
    --first_sequence=text \
    --outputs=./T2I_outputs.tsv \
    --output_schema=idx,text,gen_img \
    --checkpoint_dir=./finetuned_model/ \
    --sequence_length=288 \
    --micro_batch_size=8 \
    --app_name=text2image_generation \
    --user_defined_parameters='
        size=256
        text_len=32
        img_len=256
        img_vocab_size=16384
    '

The results are stored in a tsv file. Each line corresponds to one text in the input, and the output image is base64-encoded.
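Reading the result file back might look like the sketch below, which follows the idx, text, gen_img column order of the output schema. It assumes the image column uses URL-safe base64, mirroring the decoding call in the pipeline example in this article; if the file uses standard base64, `base64.b64decode` would be the call to use instead. `read_generated` is a hypothetical helper, and the sample row is made up.

```python
import base64
import csv
import io

def read_generated(tsv_text):
    """Yield (idx, text, image_bytes) from tab-separated idx, text, gen_img rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for idx, text, gen_img in reader:
        yield idx, text, base64.urlsafe_b64decode(gen_img)

# Made-up row standing in for one line of T2I_outputs.tsv.
payload = base64.urlsafe_b64encode(b"fake-image-bytes").decode("ascii")
rows = list(read_generated("7\t保守T恤\t" + payload + "\n"))
```

For a real file, the decoded bytes can be wrapped in `BytesIO` and opened with an image library.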

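Quickly Trying Text-to-Image Generation with the Pipeline Interface

To make things even easier for developers, we also implemented an Inference Pipeline in the EasyNLP framework. Users can call the text-to-image generation model fine-tuned for e-commerce scenarios with the following code: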

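import base64
from io import BytesIO
from PIL import Image
from easynlp.pipelines import pipeline  # import path assumed; not shown in the original snippet

# Build the pipeline directly
default_ecommercial_pipeline = pipeline("pai-painter-commercial-base-zh")

# Run prediction
data = ["保守T恤"]
results = default_ecommercial_pipeline(data)  # each item in results holds the base64 encoding of a generated image

# Convert base64 to an image
def base64_to_image(img_str):
    image = Image.open(BytesIO(base64.urlsafe_b64decode(img_str)))
    return image

# Save each image under a file name taken from its text
for text, result in zip(data, results):
    imgpath = '{}.png'.format(text)
    img_str = result['gen_img']
    image = base64_to_image(img_str)
    image.save(imgpath)
    print('text: {}, save generated image: {}'.format(text, imgpath))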

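Besides the e-commerce scenario, we also provide models for the following scenarios:

Natural scenery: “pai-painter-scenery-base-zh”
Chinese landscape painting: “pai-painter-painting-base-zh”

Replace “pai-painter-commercial-base-zh” in the code above with either of these names to try them directly. You are welcome to give them a try.

For text-to-image generation models that users have fine-tuned themselves, we also provide a Pipeline interface that loads a custom model: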

# Load the model and build the pipeline
local_model_path = ...
text_to_image_pipeline = pipeline("text2image_generation", local_model_path)

# Run prediction
data = ["xxxx"]
results = text_to_image_pipeline(data)  # each item in results holds the base64 encoding of a generated image

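Future Outlook

In this phase of the work, we integrated Chinese text-to-image generation into the EasyNLP framework and released the model checkpoints, so that open-source community users with limited resources can do a small amount of domain-specific fine-tuning and pursue all kinds of artistic creation. In the future, we plan to release more related models in the EasyNLP framework, so stay tuned. We will also integrate more SOTA models (especially Chinese models) into EasyNLP to support a variety of NLP and multimodal tasks. In addition, the Alibaba Cloud Machine Learning PAI team is continuing to develop its own Chinese multimodal models. We welcome users to keep following our work and to join our open-source community to build a Chinese NLP and multimodal algorithm library together!

GitHub: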

Reference

Chengyu Wang, Minghui Qiu, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin. EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing. arXiv
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever. Zero-Shot Text-to-Image Generation. ICML 2021: 8821-8831
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. NeurIPS 2021: 19822-19835
Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation. arXiv
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. ICML 2022
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv
Van Den Oord A, Vinyals O. Neural discrete representation learning. NIPS 2017
Esser P, Rombach R, Ommer B. Taming transformers for high-resolution image synthesis. CVPR 2021: 12873-12883
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv


This article is original content from Alibaba Cloud and may not be reproduced without permission.
