Chinese Song Ci (Iambics) Generation: From Overview to VAE
Presented by: Xinyu Liu
Partner: Xinwei Chen
Supervised by: Yongyi Mao
2017/03/17
Outline
Introduction
Related Work
Model
Apply VAE
Future Work
Introduction
—
What is Song Ci
 Ci is a poetic form derived from Shi (poem) that flourished in the Song dynasty, hence the name Song Ci.

Ci is a type of lyric, usually set to tunes and coordinated with instruments; an example is 'The Adagio of Resonance' (声声慢).
Why choose Song Ci and its potential difficulty
 structure constraints: compared to traditional 5/7-character poems, Song Ci:
 is written with uneven lengths of lines and verses
 has a different format from the ancient poetic style
 has lyrics in a regular format with set tunes (vowels)
 tone constraints (checked programmatically in the sketch after this list):
 once the Ci's name (tune) is given, the number of words (characters) is fixed and a tone constraint is set for each character position
 every character has its own tone(s)
 $+$ (tone marks $\bar{a}$, $\acute{a}$) denotes ping (平)
 $-$ (tone marks $\check{a}$, $\grave{a}$) denotes ze (仄)
 rhyming constraints:
 usually the last characters of the current line and the next line need to carry the same or a similar vowel sound
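To make the tone constraint concrete, here is a minimal Python sketch (an illustration, not the presenters' code): `TONE` is a hypothetical lookup table from character to '+' (ping) or '-' (ze); a real system would consult a rhyme book such as Pingshui Yun (平水韵).

```python
# Hypothetical tone dictionary with a few toy entries.
TONE = {"寻": "+", "觅": "-", "冷": "-", "清": "+"}

def matches_pattern(line: str, pattern: str) -> bool:
    """Return True if every character's tone agrees with the pattern."""
    if len(line) != len(pattern):
        return False  # a Ci tune fixes the number of characters per line
    return all(TONE.get(ch) == p for ch, p in zip(line, pattern))

print(matches_pattern("寻寻觅觅", "++--"))  # True
print(matches_pattern("冷冷清清", "++--"))  # False: the tones are --++
```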
The Adagio of Resonance (声声慢)
—
++--，--++，++----。
寻寻觅觅，冷冷清清，凄凄惨惨戚戚。I look for what I miss: I know not what it is. I feel so sad, so drear, So lonely, without cheer.
--+++-，-++-。
乍暖还寒时候，最难将息。How hard is it; To keep me fit; In this lingering cold!
++----，--+、-++-？
三杯两盏淡酒，怎敌他、晚来风急？Hardly warmed up; By cup on cup; Of wine so dry, Oh, how could I; Endure at dusk the drift; Of wind so swift?
---，-++、---++-。
雁过也，正伤心，却是旧时相识。It breaks my heart, alas! To see the wild geese pass, For they are my acquaintances of old.
--+++-，+--、++-++-。
满地黄花堆积，憔悴损，如今有谁堪摘？The ground is covered with yellow flowers Faded and fallen in showers. Who will pick them up now?
--++，---+--。
守着窗儿，独自怎生得黑？Sitting alone at the window, how; Could I but quicken; The pace of darkness which won’t thicken?
++-+--，-++、----。
梧桐更兼细雨，到黄昏、点点滴滴。On parasol-trees' leaves a fine rain drizzles, As twilight grizzles.
---，---+---。
这次第，怎一个愁字了得？Oh! what can I do with a grief; Beyond belief?
 Given the Ci's name (The Adagio of Resonance, 声声慢):
 every position has its corresponding tone, and the length of each line is uneven…
Based on Traditional Generation Method

The work on using machines to generate poems started in the 1970s; the major approaches are:
 Word Salad: the earliest approach, based only on permutations of phrases/words (ignores grammar and semantics)
 Template model: like a cloze test of imputing missing words, it removes some words from existing poems and fills the slots (lacks flexibility)
 Genetic algorithm: treats poem generation as a search problem over a state space, using a predefined evaluation function to iterate on each sentence (misses the relations between sentences)
 Abstract generation: treats poem generation as abstract (summary) generation based on user intents
 Machine translation: treats the last sentence as the source sentence and generates the next sentence as the target sentence (topic-shift / off-topic problems)
Based on Deep (Learning) Generation Method

$P_{\theta}(w_1,\dots,w_N)=\displaystyle\prod_{n=1}^{N}P_{\theta}(w_n \mid w_{<n})$
RNNLM: given a sequence of words as encoder inputs, a temporal (recurrent) model first produces a sentence compression $C$; then, given $C$ and the last token, it generates the next token.
Seq2Seq: uses ground-truth tokens as decoder inputs during training (teacher forcing); a minimal sketch follows.
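Below is a minimal PyTorch sketch of this next-token factorization and of teacher forcing (my illustration, not the presenters' implementation; `RNNLM`, `vocab_size=5000`, and the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

# Next-token language model: implements P(w_1..w_N) = prod_n P(w_n | w_{<n}).
class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)  # temporal model
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))  # hidden state summarizes w_{<n}
        return self.out(h)                   # per-step next-token logits

model = RNNLM(vocab_size=5000)
tokens = torch.randint(0, 5000, (2, 10))     # toy batch of token ids
# Teacher forcing: feed the ground-truth prefix, score the shifted targets.
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000),
                                   tokens[:, 1:].reshape(-1))
```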
Based on Deep (Learning) Generation Method (Cont.)

$
\begin{equation}
\begin{split}
P_{\theta}(U_1,\dots,U_M)&=\displaystyle\prod_{m=1}^{M}P_{\theta}(U_m \mid U_{<m})\\
&=\displaystyle\prod_{m=1}^{M}\prod_{n=1}^{N_m}P_{\theta}(w_{m,n} \mid w_{m,<n},U_{<m})
\end{split}
\end{equation}
$
Dialogue generation is more complicated: it has a group of word-level encoders/decoders and a sequence of context-level representations built on top of them, as sketched below.
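A minimal sketch of the two-level structure implied by this factorization (my illustration; names and dimensions are assumptions): a word-level GRU encodes each utterance $U_m$ into a vector, and a context-level GRU runs over those vectors to give the decoder its $U_{<m}$ conditioning.

```python
import torch
import torch.nn as nn

# Hierarchical encoder: word-level RNN per sentence, context-level RNN on top.
class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=128, word_hid=256, ctx_hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, word_hid, batch_first=True)
        self.ctx_rnn = nn.GRU(word_hid, ctx_hid, batch_first=True)

    def forward(self, sents):                 # sents: (batch, M, N) token ids
        b, m, n = sents.shape
        _, h = self.word_rnn(self.embed(sents.reshape(b * m, n)))
        sent_vecs = h[-1].reshape(b, m, -1)   # one vector per sentence U_m
        ctx, _ = self.ctx_rnn(sent_vecs)      # ctx[:, i] encodes U_1..U_{i+1}
        return ctx                            # conditions the decoder on earlier sentences
```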
Our Model: Phases & Formulation
—
 Phases:
 Intention Representation
 Ci Generation
 Notations:
 $W_v=\{w_1,w_2,\dots,w_n\}$: a set of keywords from the user's intention
 $c\in C$: $C$ is the collection of Ci names/tunes (词牌名)
 $D=\{w_1,w_2,\dots,w_n\}$: a sentence, i.e., a word sequence
 $P=\{D_1,D_2,\dots,D_n\}$: a Song Ci, i.e., a sentence sequence
 Formulation:
 Given the user's specified Ci name/tune and keywords as inputs,
 generate a Ci as output:
 $P=f(c,W_v)$
 Summary:
 hierarchical variational autoencoder
 word level
 context/sentence level
 latent representations (vectors)
(Model diagrams: training; reconstruction/generation)
Apply VAE: Purpose
—
 what is a variational autoencoder (VAE)?
 A generative autoencoder framework
 learns simple and meaningful feature representations ($\vec{z}$) via encoding and inference
 generates a new output ($\hat{x}$) given $\vec{z}$
 why use a VAE?
 it not only captures a more compact latent representation ($\vec{z}$) of the data,
 but $\vec{z}$ is also interpretable
 once we have a good and strong enough $\vec{z}$, we can:
 build a connection between the user's intent and the latent representation
 create/reconstruct vivid data without the source $x$
 or, for more fun, play with and manipulate it
Apply VAE: Framework

We force the approximate posterior $q(z|x)$ to be as close as possible to the ground-truth prior $p(z)$; we then sample from the prior $p(z)$ to obtain a close, but not identical, $\hat{x}$. A minimal sketch follows.
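A minimal VAE sketch in PyTorch (my illustration on flat vectors, not the presenters' Ci model; `x_dim=784` and the other sizes are assumptions): the encoder outputs $\mu$ and $\log\sigma^2$ of $q(z|x)$, $z$ is drawn via the reparameterization trick, and the loss is reconstruction plus KL to the prior $p(z)=\mathcal{N}(0,I)$.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(8, 784)                  # toy batch in [0, 1]
x_hat, mu, logvar = VAE()(x)
print(loss_fn(x, x_hat, mu, logvar))
```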
Apply VAE: Theory
—
 Notation:
$D_{KL}$: Kullback–Leibler divergence (a measure of how close two distributions are)

Maximize the log likelihood → marginalize the joint distribution over $z$; but this marginalization is intractable → use variational inference with $q(z|x)$.
 Objective: Maximize the lower bound of marginal log likelihood
 $
\begin{aligned}
\text{Maximize}\quad {\cal L}(x,\theta,\phi)=\mathbb{E}_{z\sim q(z|x)}\big[\log p(x|z)\big]-D_{KL}\big(q(z|x)\,\|\,p(z)\big)
\end{aligned}
$
$
\begin{equation}
\begin{split}
\log p_{\theta}(x)&=\log \int_{z} p_{\theta}(x,z)\,dz\\
&=\log \int_{z} q_\phi(z|x)\,\frac{p_{\theta}(x,z)}{q_\phi(z|x)}\,dz\\
&\ge \int_{z} q_\phi(z|x)\log\frac{p_{\theta}(x,z)}{q_\phi(z|x)}\,dz \quad \text{(Jensen's inequality)}\\
&=\mathbb{E}_{z\sim q_\phi(z|x)}\big[\log p_{\theta}(x,z)-\log q_\phi(z|x)\big]
\end{split}
\end{equation}
$
If we expand $\log p(x,z)=\log p(x)+\log p(z|x)$:
$
\begin{equation}
\begin{split}
&=\mathbb{E}_{z\sim q(z|x)}\big[\log p(x)+\log p(z|x)-\log q(z|x)\big]\\
&=\log p_{\theta}(x)-D_{KL}\big(q_\phi(z|x)\,\|\,p_{\theta}(z|x)\big)
\end{split}
\end{equation}
$
If we instead expand $\log p(x,z)=\log p(x|z)+\log p(z)$:
$
\begin{equation}
\begin{split}
&=\mathbb{E}_{z\sim q(z|x)}\big[\log p(x|z)+\log p(z)-\log q(z|x)\big]\\
&=\mathbb{E}_{z\sim q_\phi(z|x)}\big[\log p_{\theta}(x|z)\big]-D_{KL}\big(q_\phi(z|x)\,\|\,p_{\theta}(z)\big)\\
&={\cal L}(x,\theta,\phi)
\end{split}
\end{equation}
$
Equating the two forms: maximizing ${\cal L}$ both raises $\log p_{\theta}(x)$ and drives $q_\phi(z|x)$ toward the true posterior $p_{\theta}(z|x)$.
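For the common choice $q(z|x)=\mathcal{N}(\mu,\mathrm{diag}(\sigma^2))$ and $p(z)=\mathcal{N}(0,I)$ (an assumption here; the slides do not fix the distributions), the KL term has a closed form, and $z$ is sampled with the reparameterization trick so gradients flow through $\mu$ and $\sigma$:
$
D_{KL}\big(\mathcal{N}(\mu,\mathrm{diag}(\sigma^2))\,\|\,\mathcal{N}(0,I)\big)=\frac{1}{2}\sum_{j}\big(\mu_j^2+\sigma_j^2-\log\sigma_j^2-1\big),
\qquad z=\mu+\sigma\odot\epsilon,\quad \epsilon\sim\mathcal{N}(0,I)
$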
Apply VAE: Limitation
—
Requires an assumption on the data distribution
Hard to train well enough to obtain a meaningful latent representation
Future Work
—
 CVAE: conditioned on topics, phrases, or keywords
 GAN (generative adversarial nets):
 no explicit assumption on the data distribution
 Representation disentanglement (decomposing semantics): latent-vector arithmetic
Questions and Thanks!
