Partner: Xinwei Chen

Supervised by: Yongyi Mao

2017/03/17

Introduction

Related Work

Model

Apply VAE

Future Work

Ci is a form derived from Shi (**poem**) that developed widely in the Song dynasty, hence the name **Song Ci**. Ci is a type of lyric and is usually set to instrumental music; an example is 'The Adagio of Resonance' (声声慢).

**structure constraints:** compared to traditional 5/7-character poems, a **Song Ci** is written with

**uneven lengths** of lines and verses, a format different from the ancient poetic style.

Its

**lyrics** follow a **regular format** with set **tunes (vowels)**.

**tone constraints:** once a Ci's name (tune pattern) is given, the number of words (characters) is fixed and a tone constraint is set for each word position.

Every character has its own tone(s):

nī ní nǐ nì

妮 倪 你 腻

$+$ (the level tones ˉ and ˊ) denotes ping (平)

$-$ (the oblique tones ˇ and ˋ) denotes ze (仄)

**rhyming constraints:** usually the last characters of the current and next sentence must share the same or a similar vowel sound.

++--，--++，++----。

寻寻觅觅，冷冷清清，凄凄惨惨戚戚。I look for what I miss:I know not what it is. I feel so sad, so drear,So lonely, without cheer.

--+++-，-++-。

乍暖还寒时候，最难将息。How hard is it; To keep me fit; In this lingering cold!

++----，--+、-++-？

三杯两盏淡酒，怎敌他、晚来风急？Hardly warmed up; By cup on cup; Of wine so dry, Oh, how could I; Endure at dusk the drift;Of wind so swift?

---，-++、---++-。

雁过也，正伤心，却是旧时相识。It breaks my heart, alas! To see the wild geese pass, For they are my acquaintances of old.

--+++-，+--、++-++-。

满地黄花堆积，憔悴损，如今有谁堪摘？The ground is covered with yellow flowers Faded and fallen in showers. Who will pick them up now?

--++，---+--。

守着窗儿，独自怎生得黑？Sitting alone at the window, how; Could I but quicken; The pace of darkness which won't thicken?

++-+--，-++、----。

梧桐更兼细雨，到黄昏、点点滴滴。On parasol-trees leaves a fine rain drizzles As twilight grizzles.

---，---+---。

这次第，怎一个愁字了得？Oh! what can I do with a grief; Beyond belief?

Given a Ci's name (here, The Adagio of Resonance, 声声慢):

every position has its corresponding tone, and the lengths of the lines are uneven.
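The tone constraint above can be checked mechanically. The sketch below is a minimal illustration: the tone dictionary is a tiny hypothetical sample I made up for these few characters, whereas a real system would consult a full pronunciation dictionary.

```python
# '+' = ping (level tones), '-' = ze (oblique tones).
# This dictionary is a toy sample covering only the characters used below.
TONE = {'寻': '+', '觅': '-', '冷': '-', '清': '+', '凄': '+', '惨': '-', '戚': '-'}

def matches_template(line, template):
    """Return True if the line's length and every character's tone match the template."""
    if len(line) != len(template):
        return False
    return all(TONE.get(ch) == sym for ch, sym in zip(line, template))

print(matches_template('寻寻觅觅', '++--'))  # True
print(matches_template('冷冷清清', '++--'))  # False: its tones are --++
```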

Work on machine poem generation started in the 1970s; the major approaches are:

Word Salad: the earliest approach, based only on permutations of phrases/words (**ignores grammar and semantics**).

Template Model: like a cloze test or missing-word imputation, it removes some words from existing poems and fills them back in (**lacks flexibility**).

Genetic Algorithm: treats poem generation as a search problem over a state space, using a pre-defined evaluation function to iterate over each sentence (**lacks relative relations between sentences**).

Abstract: treats poem generation as abstract (summary) generation based on user intents.

Machine Translation: treats the last sentence as the source sentence and generates the next sentence as the target sentence (**topic-shift/off-topic problems**).

$P_{\theta}(w_1,\dots,w_N)=\displaystyle\prod_{n=1}^{N}P_{\theta}(w_n\mid w_{<n})$
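The factorization above says a sequence's probability is the product of per-token conditionals. The sketch below computes it in log space for a made-up conditional table (the probabilities are purely for demonstration, not from any trained model).

```python
import math

# Made-up conditional probabilities P(w_n | w_<n), keyed by the history tuple.
cond_prob = {
    ('<s>',): {'I': 0.5, 'You': 0.5},
    ('<s>', 'I'): {'look': 0.4, 'feel': 0.6},
    ('<s>', 'I', 'look'): {'</s>': 1.0},
}

def sequence_log_prob(tokens):
    """Sum log P(w_n | w_<n) over the sequence (log space avoids underflow)."""
    history = ('<s>',)
    logp = 0.0
    for tok in tokens:
        logp += math.log(cond_prob[history][tok])
        history = history + (tok,)
    return logp

print(math.exp(sequence_log_prob(['I', 'look', '</s>'])))  # ≈ 0.2 (= 0.5 * 0.4 * 1.0)
```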

**RNNLM:** given a sequence of words as encoder inputs, a temporal model first produces a sentence compression $C$; then, given $C$ and the last token, it generates the next token.

**SEQ2SEQ:** uses ground-truth inputs on the decoder side during training (teacher forcing).

$\begin{equation}\begin{split}P_{\theta}(U_1,\dots,U_M)& =\displaystyle\prod_{m=1}^{M}P_{\theta}(U_m\mid U_{<m}) \\ &=\displaystyle\prod_{m=1}^{M}\prod_{n=1}^{N_m}P_{\theta}(w_{m,n}\mid w_{m,<n},U_{<m})\end{split}\end{equation}$

**Dialogue:** more complicated; it has a group of word-level encoders/decoders and a sequence of context-level representations built on top of them.
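The hierarchical structure can be sketched in a few lines: a word-level encoder turns each sentence into a vector, then a context-level recurrence runs over the sequence of sentence vectors, so each state summarizes $U_{<m}$. The embeddings and the recurrence below are toy stand-ins (random vectors, an averaging update), not a trained model.

```python
import random

DIM = 8
random.seed(0)
embed = {}  # lazily assigned random word embeddings (toy stand-in)

def word_vec(w):
    if w not in embed:
        embed[w] = [random.uniform(-1, 1) for _ in range(DIM)]
    return embed[w]

def encode_sentence(words):
    """Word-level encoder: here just the mean of word embeddings."""
    vecs = [word_vec(w) for w in words]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def encode_context(sentences):
    """Context-level recurrence: h_m = 0.5*h_{m-1} + 0.5*enc(U_m)."""
    h = [0.0] * DIM
    states = []
    for sent in sentences:
        s = encode_sentence(sent)
        h = [0.5 * h[i] + 0.5 * s[i] for i in range(DIM)]
        states.append(h)
    return states  # one context state per sentence, summarizing U_<m

states = encode_context([['寻', '寻', '觅', '觅'], ['冷', '冷', '清', '清']])
print(len(states), len(states[0]))  # 2 8
```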

Phases:

Intention Representation

Ci Generation

Notations:

$W_v=\{w_1,w_2,\dots,w_n\}$: a set of keywords from the user's intention

$c\in C$: a collection of Ci names/tune patterns (词牌名)

$D=\{w_1,w_2,\dots,w_n\}$: a sentence, i.e., a word sequence

$P=\{D_1,D_2,\dots,D_n\}$: a Song Ci, i.e., a sentence sequence

Formulation:

Given the user's specified Ci name (tune pattern) and keywords

**as inputs**, generate a Ci

**as output**: $P=f(c,W_v)$

Summary:

a hierarchical

**variational auto-encoder**: word level

context/sentence level

latent representations (vectors)

core idea behind VAE

Training

Reconstruct/Generation

What is the variational auto-encoder (VAE)?

A generative auto-encoder framework that

learns a simple and meaningful feature representation ($\vec{z}$) via encoding and inference, and

generates a new output ($\hat{x}$) given $\vec{z}$
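The encode/sample/decode path can be sketched numerically. The "networks" below are fixed toy maps, not learned parameters; the point is the reparameterization step $z=\mu+\sigma\epsilon$, which keeps the sample differentiable with respect to the encoder outputs.

```python
import math
import random

random.seed(0)

def encode(x):
    """Toy encoder: map scalar x to (mu, log_var) of q(z|x)."""
    return 0.5 * x, -1.0

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0,1): sampling stays differentiable."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def decode(z):
    """Toy decoder: map the latent z back to a reconstruction x-hat."""
    return 2.0 * z

x = 1.0
mu, log_var = encode(x)
z = reparameterize(mu, log_var)
x_hat = decode(z)
print(mu, math.exp(0.5 * log_var))  # posterior mean and std used for sampling
```

Because $z$ is sampled rather than copied, $\hat{x}$ is close to $x$ but not identical, which is exactly the behavior described above.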

Why use VAE?

It not only captures a more compact latent representation ($\vec{z}$) of the data,

but $\vec{z}$ is also

**interpretable**. Once we have a good and strong enough $\vec{z}$, we can:

build a connection between the user's intent and the latent representation

create/reconstruct vivid data without the source $x$

or, for more fun, play with and manipulate it

We force the approximate posterior $q(z|x)$ to be as close as possible to the prior $p(z)$; then we sample from the prior $p(z)$ to get a close-enough but not identical $\hat x$.

Notation:

$D_{KL}$: Kullback–Leibler divergence (a measure of how close two distributions are).

Maximizing the log-likelihood requires marginalizing the joint distribution over $z$; but this integral is intractable, hence variational inference with $q(z|x)$.

Objective:

**Maximize the lower bound of the marginal log-likelihood**

$\text{Maximize}\quad {\cal L}(x,\theta,\phi)=\mathbb E_{z\sim q(z|x)}\big[\log p(x|z)\big]-D_{KL}\big(q(z|x)\,\|\,p(z)\big)$
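When $q(z|x)$ is a diagonal Gaussian and the prior $p(z)$ is $N(0, I)$, the KL term in the objective has a closed form, $D_{KL} = \tfrac{1}{2}\sum_i\big(\mu_i^2 + e^{\log\sigma_i^2} - 1 - \log\sigma_i^2\big)$. The sketch below evaluates it for toy posteriors; the reconstruction term would come from the decoder's log-likelihood.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# q(z|x) = N(0, I) matches the prior exactly, so the KL is zero.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Shifting the posterior mean away from zero makes the KL strictly positive.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```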

$\begin{aligned}\log p_{\theta}(x)&= \log \int_{z} p_{\theta}(x,z)\\&= \log \int_{z} q_\phi(z|x)\,\frac{p_{\theta}(x,z)}{q_\phi(z|x)}\\&\ge \int_{z} q_\phi(z|x)\log \frac{p_{\theta}(x,z)}{q_\phi(z|x)} \quad\text{(Jensen's inequality)}\\&= \mathbb E_{z\sim q_\phi(z|x)}\big[\log p_{\theta}(x,z)-\log q_\phi(z|x)\big]\end{aligned}$

Expanding $\log p(x,z)=\log p(z|x)+\log p(x)$ shows the exact gap of the bound:

$\mathbb E_{z\sim q_\phi(z|x)}\Big[\log p_\theta(x)-\log\tfrac{q_\phi(z|x)}{p_\theta(z|x)}\Big]=\log p_\theta(x)-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$

Expanding instead $\log p(x,z)=\log p(x|z)+\log p(z)$ gives the computable objective:

$\mathbb E_{z\sim q_\phi(z|x)}\Big[\log p_\theta(x|z)-\log\tfrac{q_\phi(z|x)}{p_\theta(z)}\Big]=\mathbb E_{z\sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big)={\cal L}(x,\theta,\phi)$

Assumptions on the data distribution

hard to train well enough to get a meaningful latent representation

CVAE: conditioned on topics, phrases, or keywords

GAN(generative adversarial nets):

no explicit assumption on the data distribution

Representation decoupling (decomposing semantics): latent vector arithmetic