3-D Convolutional Network with Adversarial Training for Video Generation

1. Abstract

After building an image translator powered by CycleGAN, I found that, in video generation and translation, the traditional model could not handle the temporal dimension, causing generated videos to collapse. In this work, drawing on 3D convolutional networks for spatio-temporal data representation, I conduct experiments to find appropriate and powerful neural network structures for generating coherent video of a dynamic object. So far, I have found that a three-dimensional deep residual network performs best at extracting, filling, and transforming features that carry temporal information. Based on residual connections, I designed the ResNet-V5-A/B/C networks with adversarial training, which can generate realistic video clips even under unbalanced translation.

2. Introduction

2.1 Inspiration

Recently, I have read many tutorials and papers on video problems in computer vision and machine learning. Famous works such as 3D convolution, 3D-ResNet, and the two-stream network inspired me a lot. Meanwhile, I wondered whether the GANs from my previous work could translate videos. One day, I found a clue in CycleGAN's introduction: a generated animation of horse-to-zebra translation. The author split a horse recording into frames and translated them one by one with a trained CycleGAN. But something was strange in the result: the stripe patterns on the zebra kept shifting between frames (on a real zebra they are fixed), which meant the network could not capture the relation between different frames and instead focused only on spatial information. In other words, the network could not handle the sequential information in the video. That led me to explore computer vision problems along the temporal dimension.

2D-CycleGAN: Horse-to-zebra (https://arxiv.org/pdf/1703.10593.pdf)

2.2 Three-Dimensional Convolution

In one family of methods, the network contains full 3D convolutions covering two spatial dimensions and one temporal dimension. With the extra axis, the convolutional kernel can directly extract relations between spatial features and temporal information. I use this simple but strong network structure as the backbone in this section. I will test the two-stream network and other structures in the future, but compared with the 3D-convolution method, a two-stream design cannot keep the full coupling between spatial and temporal features when it processes the dimensions separately. So I mainly focus on 3D convolutional networks to build a productive generative model. Due to dataset limits, the experiments focus only on dynamic-object generation; I will test dynamic scenes with background masks in the future.
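To make the idea concrete, here is a minimal sketch (channel counts and clip length are illustrative, not taken from the trials below) showing how a single 3D convolution mixes information across neighbouring frames as well as across pixels:

```python
import torch
import torch.nn as nn

# A 3D convolution: the kernel spans (time, height, width), so each output
# voxel combines spatial neighbourhoods from several adjacent frames.
conv = nn.Conv3d(in_channels=3, out_channels=16,
                 kernel_size=(3, 3, 3), padding=1)

# A clip of 9 RGB frames at 64x64, laid out as (batch, channels, time, H, W).
clip = torch.randn(1, 3, 9, 64, 64)
features = conv(clip)   # shape preserved: (1, 16, 9, 64, 64)
```

With padding 1 and kernel 3, the temporal and spatial extents are preserved, so such layers can be stacked freely before any down-sampling.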

3D-GAN: http://carlvondrick.com/tinyvideo/

3. Basic Model

In this part, I train only a translation network, mainly to test the network's capability of gathering, mapping, and reshaping features with temporal information. The results help to design the generator of a complete generative adversarial model. In these tests, I use the small dataset to adjust the network structure and the full dataset to evaluate performance.

3.1 Methodology

I used the Sketch Simplify 2D convolutional translator as a reference and adjusted its structure and parameters to fit the 3D setting. In the following experiments, I tried different structures and different training methods to find the best model. For the sake of training speed, I initially used only small-scale datasets and limited the maximum number of filters to 32.

Sketch Simplify: http://hi.cs.waseda.ac.jp/~esimo/en/research/sketch/

The diagram above shows the structure of a fully-convolutional bottleneck-like network for sketch simplification. In this section, I test several similar structures: residual blocks in the middle, an encoder-decoder without flat convolution, and a 'fully-residual' network. All networks have 16 channels for 64-by-64 frames in the input layer and likewise 16 channels in the output. The number of channels doubles at each layer in the down-sampling stage, halves at each layer in the up-sampling stage, and stays constant in the middle stage.
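The channel schedule described above can be sketched as a small helper (the base width of 16 and depth of 4 match the bottleneck trials below; the function name is my own):

```python
# Channel schedule for the bottleneck-like translator: 16 channels at the
# input, doubling per down-sampling layer, then mirrored (halving) on the
# way back up, constant through the middle stage.
def channel_schedule(base=16, depth=4):
    down = [base * 2 ** i for i in range(depth + 1)]   # encoder widths
    up = list(reversed(down))[1:]                      # decoder mirrors it
    return down, up

down, up = channel_schedule()
# down -> [16, 32, 64, 128, 256]; up -> [128, 64, 32, 16]
```

With depth 4 and base 16, the middle stage runs at 256 filters, matching the "Middle Filters" column of the first table.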

3.2 Loss Function

To compare generated video frames with target frames, the networks in this section were all trained with a gradient-descent optimizer using MSE loss. Compared with L1 loss, the squared loss lets the network filter out more unrelated data, which yields less noise in the result and faster training. In Trial A-1, I trained two networks to test both loss functions. The result showed that frames generated with MSE loss had fewer stray points in the first 100 epochs of training, and the loss descended faster.

Mean Squared Error
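A quick numerical illustration of the two losses compared above (the example values are arbitrary): MSE squares each error, so large errors dominate the loss and its gradient, while L1 weights all errors linearly.

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.0, 0.5, 1.0])
target = torch.tensor([0.0, 1.0, 0.0])

mse = nn.MSELoss()(pred, target)   # mean of squared errors: (0 + 0.25 + 1) / 3
l1 = nn.L1Loss()(pred, target)     # mean of absolute errors: (0 + 0.5 + 1) / 3
```

The squared term makes the gradient proportional to the error size, which is one plausible reason the MSE-trained network cleaned up large stray points faster in Trial A-1.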

3.3 Bottleneck Translator

Applying the basic method described in 3.1, I designed a model with three stages: down-sampling, residual blocks as flat convolution, and up-sampling. These trials probe the capability of the bottleneck-like model on this task. In the actual tests, I tried structures with several convolutional layers as the down-sampling stage, transposed-convolutional layers as the up-sampling stage, and a final convolutional layer that generates the frames (RGB filters).

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-1 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 6 ResNet Blocks | 256 | 9x4x4 | Dynamic features with intact colors
A-2 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 9 ResNet Blocks | 256 | 9x4x4 | Chaotic features with chaotic colors
A-3 | Residual Bottleneck | 3 Convolutional Layers | 3 Transposed-Conv Layers | 6 ResNet Blocks | 128 | 9x8x8 | Model collapses (overfit)
A-4 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 3 ResNet Blocks | 256 | 9x4x4 | Dynamic features with poor colors
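An A-1-style bottleneck can be sketched as follows. This is my own reconstruction from the table, not the author's code: I assume RGB input lifted to 16 channels by a head convolution, stride-2 spatial down-sampling that leaves the 9-frame temporal axis untouched (consistent with the 9x4x4 middle features), and a final RGB output convolution.

```python
import torch
import torch.nn as nn

class ResBlock3d(nn.Module):
    # Flat residual block: two 3D convolutions plus an identity skip.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

def bottleneck_translator(base=16, n_blocks=6):
    chs = [base * 2 ** i for i in range(5)]            # 16, 32, 64, 128, 256
    layers = [nn.Conv3d(3, base, 3, padding=1), nn.ReLU()]
    for c_in, c_out in zip(chs, chs[1:]):              # 4 down-sampling convs
        layers += [nn.Conv3d(c_in, c_out, 3, stride=(1, 2, 2), padding=1),
                   nn.ReLU()]
    layers += [ResBlock3d(chs[-1]) for _ in range(n_blocks)]   # flat stage at 256
    for c_out, c_in in zip(reversed(chs[:-1]), reversed(chs[1:])):  # 4 up convs
        layers += [nn.ConvTranspose3d(c_in, c_out, (1, 4, 4),
                                      stride=(1, 2, 2), padding=(0, 1, 1)),
                   nn.ReLU()]
    layers += [nn.Conv3d(base, 3, 3, padding=1)]       # RGB output filters
    return nn.Sequential(*layers)

net = bottleneck_translator()
clip = torch.randn(1, 3, 9, 64, 64)    # 9 RGB frames at 64x64
out = net(clip)                        # middle features run at 256 x 9 x 4 x 4
```

Four stride-2 layers reduce 64x64 to 4x4 while the temporal stride of 1 keeps all 9 frame positions, reproducing the 256-filter, 9x4x4 middle stage of Trial A-1.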

The results show that, in all configurations, the detailed features in the generated videos were largely dynamic and chaotic. The character's colors are vague, making the output very easy for a human to distinguish from real video. In some complex actions, the network confused the positions of arms and legs, showing that it could not use temporal information well enough to make the generated video smooth.

3.4 U-Net Translator

Inspired by U-Net for image segmentation, I tried this kind of structure as a video frame translator. I ran two trials: a 4-level U-Net with 3 shortcuts, as the picture below shows, and a 5-level U-Net. Both networks were trained on the Stick-to-Miku dataset with MSE loss, comparing the generated and target video splits. U-Net: https://arxiv.org/pdf/1505.04597.pdf

U-Net Structure
Trials | Network Type | Depth | Middle Filters | Middle Features | Results
A-5 | U-Net | 4 | 128 | 9x8x8 | Broken features with poor colors
A-6 | U-Net | 5 | 256 | 9x4x4 | Broken features with intact colors
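The defining feature of the U-Net translator is the encoder-to-decoder shortcut. A minimal two-level 3D sketch (my own simplification, far shallower than Trials A-5/A-6) shows how encoder features are concatenated onto the decoder at matching resolutions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet3d(nn.Module):
    # Two-level 3D U-Net sketch: each decoder level receives the encoder
    # features of the same resolution via channel-wise concatenation.
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv3d(3, 16, 3, stride=(1, 2, 2), padding=1)
        self.enc2 = nn.Conv3d(16, 32, 3, stride=(1, 2, 2), padding=1)
        self.dec2 = nn.ConvTranspose3d(32, 16, (1, 4, 4),
                                       stride=(1, 2, 2), padding=(0, 1, 1))
        # dec1 takes 16 decoder + 16 skip channels after concatenation.
        self.dec1 = nn.ConvTranspose3d(32, 16, (1, 4, 4),
                                       stride=(1, 2, 2), padding=(0, 1, 1))
        self.out = nn.Conv3d(16, 3, 3, padding=1)

    def forward(self, x):
        e1 = F.relu(self.enc1(x))     # 16 x T x 32 x 32
        e2 = F.relu(self.enc2(e1))    # 32 x T x 16 x 16
        d2 = F.relu(self.dec2(e2))    # 16 x T x 32 x 32
        d1 = F.relu(self.dec1(torch.cat([d2, e1], dim=1)))  # skip connection
        return self.out(d1)           # back to 3 x T x 64 x 64

net = TinyUNet3d()
restored = net(torch.randn(1, 3, 9, 64, 64))
```

The shortcut hands low-level spatial detail directly to the decoder, which is why the U-Net frames can be locally accurate even when, as observed above, the bottleneck path loses temporal information.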

The results of both trials are worse than Trial A-1 with the Residual Bottleneck. Although the frames generated by the U-Net are accurate in some cases, they lose too much temporal data, so many features are damaged. I guessed this was caused by overfitting, and I will explore the U-Net further in adversarial training.

3.5 Deep ResNet Translator V1, V2

To evaluate the influence of residual blocks, I ran several trials with more residual blocks and fewer up-sampling or down-sampling layers. In this case the network can go deeper, benefiting from the outstanding ability of residual connections to keep gradients stable. I tested several trials on the Stick-to-Miku dataset to simulate conditional generation using the movement data provided by the stickman animation.
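The gradient-stability claim can be demonstrated with a toy experiment (fully-connected layers for simplicity, not the actual 3D blocks): with identity skips, each layer's Jacobian gains an identity term, so the gradient reaching the input of a 30-layer stack stays usable instead of vanishing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Linear(8, 8) for _ in range(30)]   # one shared deep stack

def input_grad_norm(residual):
    # Push a value through 30 layers and measure the gradient at the input.
    x = torch.randn(1, 8, requires_grad=True)
    h = x
    for layer in layers:
        step = torch.tanh(layer(h))
        h = h + step if residual else step      # skip connection on/off
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(residual=False)   # gradient shrinks layer by layer
skip = input_grad_norm(residual=True)     # identity path keeps it well-scaled
```

On this toy stack the residual variant delivers a much larger input gradient, which is the mechanism that lets the deep ResNet translators below train at all.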

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-7 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 14 ResNet Blocks | 32 | 9x32x32 | Vague features
A-8 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 18 ResNet Blocks | 32 | 9x32x32 | Damaged features with vague colors
A-9 | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x16x16 | Smooth but vague
A-10 (128x128) | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x32x32 | Smooth but vague

After training A-7 to A-9, I found many vague features in the networks' outputs. I guessed this was caused by insufficient training data, so I doubled the frame size and re-trained A-9 to get a higher-resolution result (A-10). However, the result was poor as well. I then replaced some layers with max-pooling to simplify the down-sampling procedure.

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-11 | Deep ResNet V2 | Conv Layer + MaxPool3D | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but vague
A-12 | Deep ResNet V2 | MaxPool3D + Conv Layer | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but unstable
A-13 (128x128) | Deep ResNet V2 | MaxPool3D + 2 Conv Layers | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Incomplete translation
A-14 (128x128) | Deep ResNet V2 | 3x(MaxPool3D + Flat-Conv) | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Smooth features with poor colors

Due to limited computing resources, I trained Trials A-13 and A-14 for only 30 epochs. Training on 128×128 frames is much slower than on 64×64 frames, so I cannot evaluate these models yet. The 128×128 experiments will be conducted in the GAN tests.

3.6 Deep ResNet Translator V3, V4, V5-A/B/C

To make down-sampling more efficient, I combined the down-sampling layers and flat convolutional layers directly. Referring to the design of ResNet-29, I designed expansion blocks to replace the function of the old down-sampling layers. In the V3 design, I also explored convolution along the temporal dimension: in some trials the temporal convolution has stride 2 in order to compress the temporal feature maps.
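Temporal compression by strided convolution can be sketched as follows (channel counts are illustrative): stride 2 on the time axis halves the number of frame positions, and applying it twice gives roughly the 1/4 compression listed in the table.

```python
import torch
import torch.nn as nn

# Stride 2 on the temporal axis only: spatial resolution is preserved while
# the time dimension shrinks, 9 -> 5 -> 3 frame positions with padding 1.
tconv = nn.Conv3d(32, 32, kernel_size=3, stride=(2, 1, 1), padding=1)

clip = torch.randn(1, 32, 9, 16, 16)
half = tconv(clip)        # time: (9 + 2 - 3) // 2 + 1 = 5
quarter = tconv(half)     # time: (5 + 2 - 3) // 2 + 1 = 3
```

Each temporal position in the compressed map now summarizes several input frames, which is what lets the middle stage reason over longer motions at lower cost.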

Trials | Network Type | Downsample | Upsample | Normalization | Temporal Compress | Results
A-15 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors
A-16 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Large batch size | No | Damaged video
A-17 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | 1/4 in Middle | Incomplete translation
A-18 | Deep ResNet V3 | (2+3+4+6+2) ResNet Blocks | 4 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors

In these results, the dynamic-feature phenomenon is very similar to Trials A-1 to A-4. I supposed that, with more up-sampling layers, feature generation goes out of control because of barriers to transferring gradients between layers, losing the relation between figures and their locations established in the down-sampling stage. That is also why U-Net's shortcuts between down-sampling and up-sampling layers work so well for translating related features. To test this idea, I designed another type of residual translator, named ResNet-V4 and ResNet-V5-A/B. With residual connections and U-Net-style shortcuts through the whole network, it seems capable of mapping every level of features between the input and output videos. To demonstrate the performance of this architecture, I ran several trials on ResNet-V4 and ResNet-V5.

Trials | Network Type | Downsample | Upsample | Depth | Temporal Compress | Results
A-19 | Deep ResNet V4 | (1+2+3+4+2) ResNet Blocks | ResNet Blocks | 50 Layers | No | Dynamic features
A-20 | Deep ResNet V4 | (1+2+2+2+1) ResNet Blocks | ResNet Blocks | 34 Layers | No | Slightly dynamic features

The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.

In ResNet-V5-B, the number of convolutional layers can reach 89 for 128×128 video translation using my newly designed translation structure. So, in real applications, it can handle more difficult translations and more complex situations.

3.7 Squeeze and Excitation Pipeline

Referring to the Squeeze-and-Excitation structure (https://arxiv.org/pdf/1709.01507.pdf), I tried adding channel-wise gates to every 3D-ResNet block to increase the network's capacity. The diagram below shows the Squeeze-and-Excitation pipeline in a residual block.

Squeeze and Excitation Pipeline
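A minimal 3D Squeeze-and-Excitation gate can be sketched like this (the class name and reduction factor are my own choices): global average pooling squeezes each channel to a scalar, a two-layer bottleneck produces per-channel weights, and the feature map is rescaled by them.

```python
import torch
import torch.nn as nn

class SEGate3d(nn.Module):
    # Channel-wise gate for a 3D feature map, as used inside each residual
    # block: squeeze over (T, H, W), excite with a small bottleneck MLP.
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(x.mean(dim=(2, 3, 4)))   # squeeze: (b, c) channel summary
        return x * w.view(b, c, 1, 1, 1)     # excite: rescale each channel

gate = SEGate3d(32)
y = gate(torch.randn(2, 32, 9, 16, 16))      # same shape, channel-reweighted
```

Because the sigmoid weights lie in (0, 1), the gate can only attenuate channels, letting the block emphasize the feature channels most useful for the current clip.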


4. Conditional Adversarial Training

4.1 Methodology

In this section, I use the stickman input as the network's condition and several translation models as generators in generative adversarial training to obtain higher-quality generated video clips. In previous experiments, I tested many generative and translation models for video clips. Here, I ran many tests on the design of different discriminators to maximize the performance of the generative network.

Conditional GAN (https://arxiv.org/pdf/1411.1784.pdf)

In the original design, the generator has two types of input: noise data (z) and a condition (y). But in my experiment, which is designed to translate or generate video from an instructional video clip, the noise data seemed useless to the whole structure, since it is originally intended to map features and diversify the generator's output. So, in this case, the noise input can be dropped when I aim to generate only a single character.

Illustration of Minimax Game in C-GAN

4.2 Discriminator and Training

Following the conditional GAN, the discriminator generally contains a matrix concatenation step that combines the object to be judged with the conditions as the discriminator's input. After concatenation, as in a normal GAN, I used several convolutional layers ending with a sigmoid function, and trained the network with the binary cross-entropy loss, using zeros to label fake video and ones to label real video.

Binary Cross Entropy
Trials | Discriminator | Concatenation | Temporal Compress | Generator | Normalization | Balance | Result
B-1 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic
B-2 | Concatenate + 5 Conv Layers | Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Pending | Pending
B-3 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V4-50 (Trial A-19) | Batch Normalization | Discriminator is stronger | Damaged features
B-4 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V5-A-40 | Instance Normalization | Discriminator is slightly stronger | Smooth but slightly dynamic
B-5 | Concatenate + ResNet-11 | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Blocks | ResNet-V5-B-71 | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic


4.3 Different Generators

In this section, I ran several trials using the translator models from the previous tests and evaluated their performance in adversarial training.

Trials | Generator | Generator Type | Discriminator | Results
B-6 | Trial A-1 | Bottleneck-22 | 5 Conv Layers with Direct Concatenation | Dynamic features
B-7 | Trial A-6 | U-Net-22 | 5 Conv Layers with Direct Concatenation | Pending
B-8 | Trial A-9 | ResNet-V1-32 | 5 Conv Layers with Direct Concatenation | Smooth but with slightly-dynamic features


4.4 Deep Networks and Mixed Training

The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.

Trials | Generator | Depth | Discriminator | Loss | Results
B-9 | ResNet-V5-A | 40 Layers | 6 Conv Layers | C-GAN Loss | Slightly damaged features
B-10 | ResNet-V5-B | 71 Layers | 6 Conv Layers | C-GAN Loss | -
B-11 | ResNet-V5-B | 89 Layers | ResNet-11 | C-GAN Loss | -
B-12 | ResNet-V5-C | 145 Layers | 6 Conv Layers | C-GAN Loss | -
B-13 | ResNet-V5-C | 254 Layers | ResNet-11 | Mixed Loss | -
B-14 | ResNet-V5-C | 360 Layers | 6 Conv Layers | Mixed Loss | -
B-15 | ResNet-V5-C | 468 Layers | 6 Conv Layers | Mixed Loss | -


5. Environment

All the networks were trained on a laptop with a GTX 960M and on Google Compute Engine with a Tesla K80 or Tesla P100, depending on the required training time. All the code and sampling jobs were run on macOS 10.13 and Ubuntu 16.04 LTS.

5.1 Miku Datasets

I prepared three datasets for dynamic-object translation: a small Miku-to-Miku dataset, a small Stick-to-Miku dataset, and a full Stick-to-Miku dataset. The first two datasets have approximately 3,000 image pairs each, and the full dataset has 10,000 image pairs. The Miku-to-Miku datasets contain an anime character dancing in two dress styles, to test the model's translation performance. The Stick-to-Miku datasets contain the same dances performed separately by a stickman and by Miku, to simulate conditional generation of a dynamic object from movement data. All the video frames were generated with the MikuMikuDance software, and the character models' authors are listed in the references.

Data Sample Preview: Stickman (Left), Electric Miku (Middle), TDA Miku (Right)

5.2 Horse-to-Zebra Dataset


5.3 Video Experiment Platform

To conduct these experiments and implement the different models, I built a video experiment platform that can split videos, generate datasets, train different networks, and render a video with a side-by-side comparison against the original data. I designed a frame traverse slider that can generate an arbitrary sequence of frames as the network's input. I will release it for research use after I publish my formal paper.
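The core idea of the frame traverse slider can be sketched in a few lines (the function name is my own, and clip_len=9 matches the 9-frame inputs used in the trials above): slide a fixed-length window over the frame indices so one long recording yields many overlapping training clips.

```python
# Minimal "frame traverse slider": yield every length-clip_len window of
# frame indices; each window becomes one training clip for the network.
def frame_windows(num_frames, clip_len=9, stride=1):
    for start in range(0, num_frames - clip_len + 1, stride):
        yield list(range(start, start + clip_len))

windows = list(frame_windows(num_frames=12, clip_len=9))
# 12 frames with a 9-frame window and stride 1 give 4 overlapping clips,
# starting at frames 0, 1, 2, and 3.
```

In practice the real bottleneck is turning these index windows into tensors fast enough to keep the GPU busy, which is what the optimization below addresses.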

During the experiments, due to the slow speed of Python object code, I optimized the allocation of GPU and CPU work by reducing algorithmic complexity and increasing the batch size, so that the network trainer can occupy the full GPU with minimal stalls caused by slow "for" loops running on the CPU. In my tests, the old version of the frame slider reached 96% GPU utilization, while the new slider can use the GPU fully. On average, training speed on the GTX 960M laptop rose by 30%.

6. Limitations

Although this new architecture works better than the traditional U-Net, the generated videos are still very easy for a human to distinguish from real ones, mainly because of vague textures. I believe that, as the frame size and the network's scale increase, this architecture will apply to more scenarios.


7. Results





[Figure: evaluation scores (lower is better) comparing the 5-level U-Net baseline, the 4-level U-Net, and the Residual U-Net.]










8. Conclusions

To sum up, with the use of 3D convolution, the generative video model performs well across these translation tasks. The residual learning framework significantly enhances the performance of the traditional U-Net.

9. References

