Chinese Song Ci (Iambics) Generation: From Overview to VAE

Presented by: Xinyu Liu Partner: Xinwei Chen Supervised by: Yongyi Mao 2017/03/17

## Outline

- Introduction
- Related Work
- Model
- Apply VAE
- Future Work

Introduction —

##### What is Song Ci?
• Ci is a form of Shi (poem) that developed widely in the Song dynasty, hence the name Song Ci.
• Ci is a type of lyric and is usually set to instrumental music, as in 'The Adagio of Resonance' (声声慢).

##### Why choose Song Ci, and its potential difficulties
• Structure constraints: compared to traditional 5/7-character poems, Song Ci:
  • is written with uneven lengths of lines and verses
  • differs in format from the ancient poetic style
  • has lyrics in a regular format, with set tunes (rhymes)
• Tone constraints:
  • once the Ci's name (tune, 词牌名) is given, the number of characters is fixed and a tone constraint is set for each character position
  • every character has its own tone(s), e.g. nī ní nǐ nì → 妮 倪 你 腻
  • the level tones ($\bar{}\ $, $\acute{}\ $) are treated as ping (平), written $+$
  • the oblique tones ($\check{}\ $, $\grave{}\ $) are treated as ze (仄), written $-$
• Rhyming constraints:
  • usually the last characters of the current and next sentence must share the same or a similar vowel sound
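The tone constraints above can be sketched as a simple pattern check. This is a toy illustration, not part of the model: `TONE_OF` is a tiny hypothetical lookup table (a real system would need a full rhyme dictionary), and the tones follow the slide's own pattern for 声声慢.

```python
# Hypothetical mini tone dictionary: '+' = ping (平), '-' = ze (仄).
TONE_OF = {
    "寻": "+", "觅": "-", "冷": "-", "清": "+",
    "凄": "+", "惨": "-", "戚": "-",
}

def matches_pattern(line, pattern):
    """Return True if every character's tone matches the pattern.

    Characters missing from TONE_OF fail the check, as does a
    length mismatch (the tune fixes the number of characters).
    """
    if len(line) != len(pattern):
        return False
    return all(TONE_OF.get(ch) == want for ch, want in zip(line, pattern))

print(matches_pattern("寻寻觅觅", "++--"))  # True
print(matches_pattern("寻寻觅觅", "--++"))  # False
```

A generator would apply such a check as a hard constraint at every position of the decoded line.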

++--，--++，++----。
寻寻觅觅，冷冷清清，凄凄惨惨戚戚。
I look for what I miss; I know not what it is. I feel so sad, so drear, so lonely, without cheer.

--+++-，-++-。
乍暖还寒时候，最难将息。
How hard is it to keep me fit in this lingering cold!

++----，--+、-++-？
三杯两盏淡酒，怎敌他、晚来风急？
Hardly warmed up by cup on cup of wine so dry, oh, how could I endure at dusk the drift of wind so swift?

---，-++、---++-。
雁过也，正伤心，却是旧时相识。
It breaks my heart, alas, to see the wild geese pass, for they are my acquaintances of old.

--+++-，+--、++-++-。
满地黄花堆积，憔悴损，如今有谁堪摘？
The ground is covered with yellow flowers, faded and fallen in showers. Who will pick them up now?

--++，---+--。
守着窗儿，独自怎生得黑？
Sitting alone at the window, how could I but quicken the pace of darkness which won't thicken?

++-+--，-++、----。
梧桐更兼细雨，到黄昏、点点滴滴。
On parasol-tree leaves a fine rain drizzles as twilight grizzles.

---，---+---。
这次第，怎一个愁字了得？
Oh! What can I do with a grief beyond belief?

• Given the Ci's name (The Adagio of Resonance, 声声慢):
  • every position has its corresponding tone, and the lengths of the lines are uneven.

Based on Traditional Generation Methods — Work on machine poem generation started in the 1970s; the major approaches are:
• Word Salad: the earliest approach, based only on the permutation of phrases/words (ignores grammar and semantics)
• Template Model: like a cloze test of imputing missing words, it removes some words from existing poems and fills them back in (lacks flexibility)
• Genetic Algorithm: treats poem generation as a search problem over a state space, using a pre-defined evaluation function to iterate over each sentence (lacks relations between sentences)
• Summarization: treats poem generation as abstractive summarization based on user intent
• Machine Translation: treats the last sentence as a source sentence and generates the next sentence as the target sentence (topic-shift/off-topic problems)

Based on Deep Learning Generation Methods —

$$P_{\theta}(w_1,\dots,w_N)=\prod_{n=1}^{N}P_{\theta}(w_n\mid w_{<n})$$

• RNNLM: given a sequence of words as encoder inputs, a temporal model first produces a sentence compression $C$; then, given $C$ and the last token, it generates the next token.
• SEQ2SEQ: uses ground-truth inputs on the decoder side during training (teacher forcing).
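The factorization above can be made concrete with a toy stand-in for the model: `cond_prob` below is a hypothetical uniform distribution over a 4-word vocabulary, not an actual RNNLM, but the chain-rule bookkeeping is the same.

```python
import math

def cond_prob(prefix, word):
    """Toy stand-in for P(w_n | w_<n): uniform over a 4-word vocabulary,
    ignoring the prefix. A real RNNLM would condition on the prefix."""
    vocab = ["spring", "wind", "rain", "night"]
    return 1.0 / len(vocab) if word in vocab else 0.0

def sequence_log_prob(words):
    """Sum of log P(w_n | w_<n), i.e. log of the product in the slide."""
    total = 0.0
    for n, w in enumerate(words):
        total += math.log(cond_prob(words[:n], w))
    return total

print(sequence_log_prob(["spring", "rain"]))  # log(1/4) + log(1/4) ≈ -2.7726
```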

Based on Deep Learning Generation Methods (Cont.) —

$$P_{\theta}(U_1,\dots,U_M)=\prod_{m=1}^{M}P_{\theta}(U_m\mid U_{<m})=\prod_{m=1}^{M}\prod_{n=1}^{N_m}P_{\theta}(w_{m,n}\mid w_{m,<n},U_{<m})$$

• Dialogue modeling is more complicated: it has a group of word-level encoders/decoders and a sequence of context-level representations built on top of them.
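The hierarchical factorization can be sketched as two nested loops: an outer product over sentences $U_m$ and an inner product over the words of each sentence. Both model components below are toy stand-ins (a uniform word distribution and a list-valued "context"), assumed only for illustration.

```python
import math

def word_log_prob(word, prefix, context):
    """Toy stand-in for log P(w_{m,n} | w_{m,<n}, U_<m): uniform over an
    assumed vocabulary of 100 words, ignoring prefix and context."""
    vocab_size = 100
    return -math.log(vocab_size)

def update_context(context, sentence):
    """Toy context update; a real model would run a context-RNN step."""
    return context + [sentence]

def ci_log_prob(sentences):
    context, total = [], 0.0
    for sent in sentences:              # product over m
        for n, w in enumerate(sent):    # product over n within sentence m
            total += word_log_prob(w, sent[:n], context)
        context = update_context(context, sent)
    return total

print(ci_log_prob([["寻", "寻"], ["冷", "冷", "清"]]))  # 5 * -log(100)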

Our Model: Phases & Formulation —

• Phases:
  1. Intention Representation
  2. Ci Generation
• Notations:
  • $W_v=\{w_1,w_2,\dots,w_n\}$: a set of keywords from the user's intention
  • $c\in C$: a Ci name/tune (词牌名) from the collection $C$
  • $D=\{w_1,w_2,\dots,w_n\}$: a sentence, i.e. a word sequence
  • $P=\{D_1,D_2,\dots,D_n\}$: a Song Ci, i.e. a sentence sequence
• Formulation:
  • given the user's specified Ci name/tune and keywords as inputs,
  • generate a Ci as the output:
  • $P=f(c,W_v)$
• Summary:
  • hierarchical variational auto-encoder
    • word level
    • context/sentence level
  • latent representations (vectors)
  • core idea behind the VAE
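The formulation $P=f(c,W_v)$ can be read as a two-phase pipeline. The sketch below only fixes the interface; `represent_intent` and `generate_ci` are hypothetical placeholders for the trained intention-representation and Ci-generation components.

```python
def represent_intent(keywords):
    """Phase 1: map user keywords W_v to an intent representation.
    Toy placeholder: a sorted tuple instead of a learned vector."""
    return tuple(sorted(keywords))

def generate_ci(ci_name, intent):
    """Phase 2: decode a Ci under the tone/length template of `ci_name`.
    Toy placeholder: one pseudo-line per keyword, tagged with the tune."""
    return [f"{ci_name}:{kw}" for kw in intent]

def f(c, W_v):
    """P = f(c, W_v): compose the two phases."""
    return generate_ci(c, represent_intent(W_v))

print(f("声声慢", {"秋", "愁"}))  # ['声声慢:愁', '声声慢:秋']
```

The point of the split is that the intent representation can later be tied to the VAE's latent vector $\vec{z}$.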

Training

Reconstruct/Generation

Apply VAE: Purpose —

• What is the variational auto-encoder (VAE)?
  • a generative auto-encoder framework
  • learns a simple and meaningful feature representation ($\vec{z}$) via encoding and inference
  • generates a new output ($\hat{x}$) given $\vec{z}$
• Why use the VAE?
  • it not only captures a more compact latent representation ($\vec{z}$) of the data,
  • but $\vec{z}$ is also interpretable
  • once we have a good and strong enough $\vec{z}$, we can:
    • build a connection between the user's intent and the latent representation
    • create/reconstruct vivid data without the source $x$
    • or, for more fun, play with and manipulate it

Apply VAE: Framework — We force an approximate posterior $q(z|x)$ to be as close as possible to the prior $p(z)$; we can then sample from the prior $p(z)$ to obtain an $\hat x$ that is close to, but not identical with, the training data.
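The slide does not spell out how sampling stays trainable; the standard choice is the reparameterization trick, sketched below with NumPy. The encoder is assumed to output $(\mu, \log\sigma^2)$ of $q(z|x)$, and $z$ is written as a deterministic function of those outputs plus external noise, so gradients can flow through $\mu$ and $\log\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).

    Sampling noise is isolated in eps, so z is differentiable
    w.r.t. the encoder outputs (mu, log_var)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

# Unit-Gaussian posterior over a 4-dimensional latent space.
z = sample_z(np.zeros(4), np.zeros(4))
print(z.shape)  # (4,)
```

At generation time one skips the encoder entirely and draws $z$ directly from the prior $p(z)=\mathcal N(0,I)$.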

Apply VAE: Theory —

• Notation: $D_{KL}$: Kullback–Leibler divergence (a measure of how close two distributions are)
• Maximize the log likelihood → marginalize the joint distribution over $z$; but this marginal is intractable → variational inference with $q(z|x)$
• Objective: maximize the lower bound of the marginal log likelihood
$$\text{Maximize}\quad {\cal L}(x,\theta,\phi)=\mathbb E_{z\sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big)$$

$$\begin{aligned}
\log p_{\theta}(x) &= \log \int_{z} p_{\theta}(x,z) \\
&= \log \int_{z} q_\phi(z|x)\,\frac{p_{\theta}(x,z)}{q_\phi(z|x)} \\
&\ge \int_{z} q_\phi(z|x)\,\log \frac{p_\theta(x,z)}{q_\phi(z|x)} \qquad \text{(Jensen's inequality)} \\
&= \mathbb E_{z\sim q_\phi(z|x)}\big[\log p_\theta(x,z)-\log q_\phi(z|x)\big].
\end{aligned}$$

Expanding $\log p(x,z)$ in two ways gives two views of the same bound:

• With $\log p(x,z)=\log p(x)+\log p(z|x)$:
$$\mathbb E_{z\sim q(z|x)}\big[\log p(x)+\log p(z|x)-\log q(z|x)\big]
=\log p_\theta(x)-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big),$$
so the bound is tight up to the KL divergence between $q$ and the true posterior.

• With $\log p(x,z)=\log p(x|z)+\log p(z)$:
$$\mathbb E_{z\sim q(z|x)}\big[\log p(x|z)+\log p(z)-\log q(z|x)\big]
=\mathbb E_{z\sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big)
={\cal L}(x,\theta,\phi).$$
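For Gaussian $q$ and a standard-normal prior, the KL term in the ELBO has a closed form, $D_{KL}=\tfrac12\sum_i(\mu_i^2+\sigma_i^2-\log\sigma_i^2-1)$, which can be checked numerically against a Monte Carlo estimate of $\mathbb E_q[\log q(z|x)-\log p(z)]$. This is a verification sketch, not part of the model code.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

mu = np.array([0.5, -0.3])
log_var = np.array([0.0, np.log(0.25)])

# Monte Carlo estimate of E_q[log q(z|x) - log p(z)] for comparison.
rng = np.random.default_rng(1)
sigma = np.exp(0.5 * log_var)
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * (((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + log_var).sum(axis=1)
log_p = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=1)

print(kl_diag_gaussian(mu, log_var), (log_q - log_p).mean())  # nearly equal
```

The closed form is what makes the $D_{KL}$ term of the ELBO cheap to optimize: no sampling is needed for it, only for the reconstruction term.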

Apply VAE: Limitations —

• it makes assumptions on the data distribution
• it is hard to train so as to obtain a meaningful latent representation

Future Work —

• CVAE: a VAE conditioned on topics, phrases, or keywords