3-D Convolutional Network with Adversarial Training for Video Generation

1. Abstract

After building an image translator powered by CycleGAN, I found that, in video generation and translation, the traditional model could not handle the temporal dimension, causing generated videos to collapse. In this work, drawing on 3D convolutional networks for spatio-temporal data representation, I conduct experiments to find appropriate and powerful neural network structures for generating coherent video of a dynamic object. So far, I have found that a three-dimensional deep residual network performs best at extracting, filling, and transforming features that carry temporal information. Based on residual connections, I designed the ResNet-V5-A/B/C networks with adversarial training, which can generate realistic video clips even under unbalanced translation.

2. Introduction

2.1 Inspiration

Recently, I have read many tutorials and papers on video problems in computer vision and machine learning. Famous works such as 3D convolution, 3D-ResNet, and the two-stream network inspired me a lot. Meanwhile, I wondered whether the GANs from my previous work could translate videos. One day, I found a clue in CycleGAN's introduction: a generated animation of horse-to-zebra translation. The author split a horse recording into frames and translated them one by one with a trained CycleGAN. But something was strange in the result: the stripe patterns on the zebra kept shifting between frames (on a real zebra they are fixed), which meant the network could not capture the relation between different frames and instead focused only on spatial information. In other words, the network could not handle the sequential information in the video. That led me to explore computer vision problems along the temporal dimension.

2D-CycleGAN: Horse-to-zebra (https://arxiv.org/pdf/1703.10593.pdf)

2.2 Three-Dimensional Convolution

In one family of methods, the network contains full 3D convolutions covering two spatial dimensions and one temporal dimension. With the extra axis, the convolutional kernel can directly extract relations between spatial features and temporal information. I use this simple but strong network structure as the backbone in this section. I will test the two-stream network and other structures in the future, but compared with the 3D-convolution method, a two-stream design cannot keep the full coupling between spatial and temporal features when it processes the dimensions separately. So I mainly focus on 3D convolutional networks to build a productive generative model. Due to dataset limits, the experiments focus only on dynamic-object generation; I will test dynamic scenes with background masks in the future.
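To make the idea concrete, here is a minimal sketch (channel counts and clip length are illustrative, not taken from the trials below) showing how a single 3D convolution mixes information across neighbouring frames as well as across pixels:

```python
import torch
import torch.nn as nn

# A 3D convolution: the kernel spans (time, height, width), so each output
# voxel combines spatial neighbourhoods from several adjacent frames.
conv = nn.Conv3d(in_channels=3, out_channels=16,
                 kernel_size=(3, 3, 3), padding=1)

# A clip of 9 RGB frames at 64x64, laid out as (batch, channels, time, H, W).
clip = torch.randn(1, 3, 9, 64, 64)
features = conv(clip)   # shape preserved: (1, 16, 9, 64, 64)
```

With padding 1 and kernel 3, the temporal and spatial extents are preserved, so such layers can be stacked freely before any down-sampling.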

3D-GAN: http://carlvondrick.com/tinyvideo/

3. Basic Model

In this part, I train only a translation network, mainly to test the network's capability of gathering, mapping, and reshaping features with temporal information. The results help to design the generator of a complete generative adversarial model. In these tests, I use the small dataset to adjust the network structure and the full dataset to evaluate performance.

3.1 Methodology

I used the Sketch Simplify 2D convolutional translator as a reference and adjusted its structure and parameters to fit the 3D setting. In the following experiments, I tried different structures and different training methods to find the best model. For the sake of training speed, I initially used only small-scale datasets and limited the maximum number of filters to 32.

Sketch Simplify: http://hi.cs.waseda.ac.jp/~esimo/en/research/sketch/

The diagram above shows the structure of a fully-convolutional bottleneck-like network for sketch simplification. In this section, I test several similar structures: residual blocks in the middle, an encoder-decoder without flat convolution, and a 'fully-residual' network. All networks have 16 channels for 64-by-64 frames in the input layer and likewise 16 channels in the output. The number of channels doubles at each layer in the down-sampling stage, halves at each layer in the up-sampling stage, and stays constant in the middle stage.
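The channel schedule described above can be sketched as a small helper (the base width of 16 and depth of 4 match the bottleneck trials below; the function name is my own):

```python
# Channel schedule for the bottleneck-like translator: 16 channels at the
# input, doubling per down-sampling layer, then mirrored (halving) on the
# way back up, constant through the middle stage.
def channel_schedule(base=16, depth=4):
    down = [base * 2 ** i for i in range(depth + 1)]   # encoder widths
    up = list(reversed(down))[1:]                      # decoder mirrors it
    return down, up

down, up = channel_schedule()
# down -> [16, 32, 64, 128, 256]; up -> [128, 64, 32, 16]
```

With depth 4 and base 16, the middle stage runs at 256 filters, matching the "Middle Filters" column of the first table.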

3.2 Loss Function

To compare generated video frames with target frames, the networks in this section were all trained with a gradient-descent optimizer using MSE loss. Compared with L1 loss, the squared loss lets the network filter out more unrelated data, which yields less noise in the result and faster training. In Trial A-1, I trained two networks to test both loss functions. The result showed that frames generated with MSE loss had fewer stray points in the first 100 epochs of training, and the loss descended faster.

Mean Squared Error
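A quick numerical illustration of the two losses compared above (the example values are arbitrary): MSE squares each error, so large errors dominate the loss and its gradient, while L1 weights all errors linearly.

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.0, 0.5, 1.0])
target = torch.tensor([0.0, 1.0, 0.0])

mse = nn.MSELoss()(pred, target)   # mean of squared errors: (0 + 0.25 + 1) / 3
l1 = nn.L1Loss()(pred, target)     # mean of absolute errors: (0 + 0.5 + 1) / 3
```

The squared term makes the gradient proportional to the error size, which is one plausible reason the MSE-trained network cleaned up large stray points faster in Trial A-1.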

3.3 Bottleneck Translator

Applying the basic method described in 3.1, I designed a model with three stages: down-sampling, residual blocks as flat convolution, and up-sampling. These trials probe the capability of the bottleneck-like model on this task. In the actual tests, I tried structures with several convolutional layers as the down-sampling stage, transposed-convolutional layers as the up-sampling stage, and a final convolutional layer that generates the frames (RGB filters).

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-1 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 6 ResNet Blocks | 256 | 9x4x4 | Dynamic features with intact colors
A-2 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 9 ResNet Blocks | 256 | 9x4x4 | Chaotic features with chaotic colors
A-3 | Residual Bottleneck | 3 Convolutional Layers | 3 Transposed-Conv Layers | 6 ResNet Blocks | 128 | 9x8x8 | Model collapses (overfit)
A-4 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 3 ResNet Blocks | 256 | 9x4x4 | Dynamic features with poor colors
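An A-1-style bottleneck can be sketched as follows. This is my own reconstruction from the table, not the author's code: I assume RGB input lifted to 16 channels by a head convolution, stride-2 spatial down-sampling that leaves the 9-frame temporal axis untouched (consistent with the 9x4x4 middle features), and a final RGB output convolution.

```python
import torch
import torch.nn as nn

class ResBlock3d(nn.Module):
    # Flat residual block: two 3D convolutions plus an identity skip.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

def bottleneck_translator(base=16, n_blocks=6):
    chs = [base * 2 ** i for i in range(5)]            # 16, 32, 64, 128, 256
    layers = [nn.Conv3d(3, base, 3, padding=1), nn.ReLU()]
    for c_in, c_out in zip(chs, chs[1:]):              # 4 down-sampling convs
        layers += [nn.Conv3d(c_in, c_out, 3, stride=(1, 2, 2), padding=1),
                   nn.ReLU()]
    layers += [ResBlock3d(chs[-1]) for _ in range(n_blocks)]   # flat stage at 256
    for c_out, c_in in zip(reversed(chs[:-1]), reversed(chs[1:])):  # 4 up convs
        layers += [nn.ConvTranspose3d(c_in, c_out, (1, 4, 4),
                                      stride=(1, 2, 2), padding=(0, 1, 1)),
                   nn.ReLU()]
    layers += [nn.Conv3d(base, 3, 3, padding=1)]       # RGB output filters
    return nn.Sequential(*layers)

net = bottleneck_translator()
clip = torch.randn(1, 3, 9, 64, 64)    # 9 RGB frames at 64x64
out = net(clip)                        # middle features run at 256 x 9 x 4 x 4
```

Four stride-2 layers reduce 64x64 to 4x4 while the temporal stride of 1 keeps all 9 frame positions, reproducing the 256-filter, 9x4x4 middle stage of Trial A-1.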

The results show that, in all configurations, the detailed features in the generated videos were largely dynamic and chaotic. The character's colors are vague, making the output very easy for a human to distinguish from real video. In some complex actions, the network confused the positions of arms and legs, showing that it could not use temporal information well enough to make the generated video smooth.

3.4 U-Net Translator

Inspired by U-Net for image segmentation, I tried this kind of structure as a video frame translator. I ran two trials: a 4-level U-Net with 3 shortcuts, as the picture below shows, and a 5-level U-Net. Both networks were trained on the Stick-to-Miku dataset with MSE loss, comparing the generated and target video splits. U-Net: https://arxiv.org/pdf/1505.04597.pdf

U-Net Structure
Trials | Network Type | Depth | Middle Filters | Middle Features | Results
A-5 | U-Net | 4 | 128 | 9x8x8 | Broken features with poor colors
A-6 | U-Net | 5 | 256 | 9x4x4 | Broken features with intact colors
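The defining feature of the U-Net translator is the encoder-to-decoder shortcut. A minimal two-level 3D sketch (my own simplification, far shallower than Trials A-5/A-6) shows how encoder features are concatenated onto the decoder at matching resolutions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet3d(nn.Module):
    # Two-level 3D U-Net sketch: each decoder level receives the encoder
    # features of the same resolution via channel-wise concatenation.
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv3d(3, 16, 3, stride=(1, 2, 2), padding=1)
        self.enc2 = nn.Conv3d(16, 32, 3, stride=(1, 2, 2), padding=1)
        self.dec2 = nn.ConvTranspose3d(32, 16, (1, 4, 4),
                                       stride=(1, 2, 2), padding=(0, 1, 1))
        # dec1 takes 16 decoder + 16 skip channels after concatenation.
        self.dec1 = nn.ConvTranspose3d(32, 16, (1, 4, 4),
                                       stride=(1, 2, 2), padding=(0, 1, 1))
        self.out = nn.Conv3d(16, 3, 3, padding=1)

    def forward(self, x):
        e1 = F.relu(self.enc1(x))     # 16 x T x 32 x 32
        e2 = F.relu(self.enc2(e1))    # 32 x T x 16 x 16
        d2 = F.relu(self.dec2(e2))    # 16 x T x 32 x 32
        d1 = F.relu(self.dec1(torch.cat([d2, e1], dim=1)))  # skip connection
        return self.out(d1)           # back to 3 x T x 64 x 64

net = TinyUNet3d()
restored = net(torch.randn(1, 3, 9, 64, 64))
```

The shortcut hands low-level spatial detail directly to the decoder, which is why the U-Net frames can be locally accurate even when, as observed above, the bottleneck path loses temporal information.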

The results of both trials are worse than Trial A-1 with the Residual Bottleneck. Although the frames generated by the U-Net are accurate in some cases, they lose too much temporal data, so many features are damaged. I guessed this was caused by overfitting, and I will explore the U-Net further in adversarial training.

3.5 Deep ResNet Translator V1, V2

To evaluate the influence of residual blocks, I ran several trials with more residual blocks and fewer up-sampling or down-sampling layers. In this case the network can go deeper, benefiting from the outstanding ability of residual connections to keep gradients stable. I tested several trials on the Stick-to-Miku dataset to simulate conditional generation using the movement data provided by the stickman animation.
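The gradient-stability claim can be demonstrated with a toy experiment (fully-connected layers for simplicity, not the actual 3D blocks): with identity skips, each layer's Jacobian gains an identity term, so the gradient reaching the input of a 30-layer stack stays usable instead of vanishing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Linear(8, 8) for _ in range(30)]   # one shared deep stack

def input_grad_norm(residual):
    # Push a value through 30 layers and measure the gradient at the input.
    x = torch.randn(1, 8, requires_grad=True)
    h = x
    for layer in layers:
        step = torch.tanh(layer(h))
        h = h + step if residual else step      # skip connection on/off
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(residual=False)   # gradient shrinks layer by layer
skip = input_grad_norm(residual=True)     # identity path keeps it well-scaled
```

On this toy stack the residual variant delivers a much larger input gradient, which is the mechanism that lets the deep ResNet translators below train at all.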

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-7 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 14 ResNet Blocks | 32 | 9x32x32 | Vague features
A-8 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 18 ResNet Blocks | 32 | 9x32x32 | Damaged features with vague colors
A-9 | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x16x16 | Smooth but vague
A-10 (128x128) | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x32x32 | Smooth but vague

After training A-7 to A-9, I found many vague features in the networks' outputs. I guessed this was caused by insufficient training data, so I doubled the frame size and re-trained A-9 to get a higher-resolution result (A-10). However, the result was poor as well. I then replaced some layers with max-pooling to simplify the down-sampling procedure.

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-11 | Deep ResNet V2 | Conv Layer + MaxPool3D | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but vague
A-12 | Deep ResNet V2 | MaxPool3D + Conv Layer | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but unstable
A-13 (128x128) | Deep ResNet V2 | MaxPool3D + 2 Conv Layers | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Incomplete translation
A-14 (128x128) | Deep ResNet V2 | 3x(MaxPool3D + Flat-Conv) | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Smooth features with poor colors

Due to limited computing resources, I trained Trials A-13 and A-14 for only 30 epochs. Training on 128×128 frames is much slower than on 64×64 frames, so I cannot evaluate these models yet. The 128×128 experiments will be conducted in the GAN tests.

3.6 Deep ResNet Translator V3, V4, V5-A/B/C

To make down-sampling more efficient, I combined the down-sampling layers and flat convolutional layers directly. Referring to the design of ResNet-29, I designed expansion blocks to replace the function of the old down-sampling layers. In the V3 design, I also explored convolution along the temporal dimension: in some trials the temporal convolution has stride 2 in order to compress the temporal feature maps.
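Temporal compression by strided convolution can be sketched as follows (channel counts are illustrative): stride 2 on the time axis halves the number of frame positions, and applying it twice gives roughly the 1/4 compression listed in the table.

```python
import torch
import torch.nn as nn

# Stride 2 on the temporal axis only: spatial resolution is preserved while
# the time dimension shrinks, 9 -> 5 -> 3 frame positions with padding 1.
tconv = nn.Conv3d(32, 32, kernel_size=3, stride=(2, 1, 1), padding=1)

clip = torch.randn(1, 32, 9, 16, 16)
half = tconv(clip)        # time: (9 + 2 - 3) // 2 + 1 = 5
quarter = tconv(half)     # time: (5 + 2 - 3) // 2 + 1 = 3
```

Each temporal position in the compressed map now summarizes several input frames, which is what lets the middle stage reason over longer motions at lower cost.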

Trials | Network Type | Downsample | Upsample | Normalization | Temporal Compress | Results
A-15 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors
A-16 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Large batch size | No | Damaged video
A-17 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | 1/4 in Middle | Incomplete translation
A-18 | Deep ResNet V3 | (2+3+4+6+2) ResNet Blocks | 4 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors

In these results, the dynamic-feature phenomenon is very similar to Trials A-1 to A-4. I supposed that, with more up-sampling layers, feature generation goes out of control because of barriers to transferring gradients between layers, losing the relation between figures and their locations established in the down-sampling stage. That is also why U-Net's shortcuts between down-sampling and up-sampling layers work so well for translating related features. To test this idea, I designed another type of residual translator, named ResNet-V4 and ResNet-V5-A/B. With residual connections and U-Net-style shortcuts through the whole network, it seems capable of mapping every level of features between the input and output videos. To demonstrate the performance of this architecture, I ran several trials on ResNet-V4 and ResNet-V5.

Trials | Network Type | Downsample | Upsample | Depth | Temporal Compress | Results
A-19 | Deep ResNet V4 | (1+2+3+4+2) ResNet Blocks | ResNet Blocks | 50 Layers | No | Dynamic features
A-20 | Deep ResNet V4 | (1+2+2+2+1) ResNet Blocks | ResNet Blocks | 34 Layers | No | Slightly dynamic features

The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.

In ResNet-V5-B, the number of convolutional layers can reach 89 for 128×128 video translation using my newly designed translation structure. So, in real applications, it can handle more difficult translations and more complex situations.

3.7 Squeeze and Excitation Pipeline

Referring to the Squeeze-and-Excitation structure (https://arxiv.org/pdf/1709.01507.pdf), I tried adding channel-wise gates to every 3D-ResNet block to increase the network's capacity. The diagram below shows the Squeeze-and-Excitation pipeline in a residual block.

Squeeze and Excitation Pipeline
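A minimal 3D Squeeze-and-Excitation gate can be sketched like this (the class name and reduction factor are my own choices): global average pooling squeezes each channel to a scalar, a two-layer bottleneck produces per-channel weights, and the feature map is rescaled by them.

```python
import torch
import torch.nn as nn

class SEGate3d(nn.Module):
    # Channel-wise gate for a 3D feature map, as used inside each residual
    # block: squeeze over (T, H, W), excite with a small bottleneck MLP.
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(x.mean(dim=(2, 3, 4)))   # squeeze: (b, c) channel summary
        return x * w.view(b, c, 1, 1, 1)     # excite: rescale each channel

gate = SEGate3d(32)
y = gate(torch.randn(2, 32, 9, 16, 16))      # same shape, channel-reweighted
```

Because the sigmoid weights lie in (0, 1), the gate can only attenuate channels, letting the block emphasize the feature channels most useful for the current clip.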


4. Conditional Adversarial Training

4.1 Methodology

In this section, I use the stickman input as the network's condition and several translation models as generators in generative adversarial training to obtain higher-quality generated video clips. In previous experiments, I tested many generative and translation models for video clips. Here, I ran many tests on the design of different discriminators to maximize the performance of the generative network.

Conditional GAN (https://arxiv.org/pdf/1411.1784.pdf)

In the original design, the generator has two types of input: noise data (z) and a condition (y). But in my experiment, which is designed to translate or generate video from an instructional video clip, the noise data seemed useless to the whole structure, since it is originally intended to map features and diversify the generator's output. So, in this case, the noise input can be dropped when I aim to generate only a single character.

Illustration of Minimax Game in C-GAN

4.2 Discriminator and Training

Following the conditional GAN, the discriminator generally contains a matrix concatenation step that combines the object to be judged with the conditions as the discriminator's input. After concatenation, as in a normal GAN, I used several convolutional layers ending with a sigmoid function, and trained the network with the binary cross-entropy loss, using zeros to label fake video and ones to label real video.

Binary Cross Entropy
Trials | Discriminator | Concatenation | Temporal Compress | Generator | Normalization | Balance | Result
B-1 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic
B-2 | Concatenate + 5 Conv Layers | Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Pending | Pending
B-3 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V4-50 (Trial A-19) | Batch Normalization | Discriminator is stronger | Damaged features
B-4 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V5-A-40 | Instance Normalization | Discriminator is slightly stronger | Smooth but slightly dynamic
B-5 | Concatenate + ResNet-11 | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Blocks | ResNet-V5-B-71 | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic


4.3 Different Generators

In this section, I ran several trials using the translator models from the previous tests and evaluated their performance in adversarial training.

Trials | Generator | Generator Type | Discriminator | Results
B-6 | Trial A-1 | Bottleneck-22 | 5 Conv Layers with Direct Concatenation | Dynamic features
B-7 | Trial A-6 | U-Net-22 | 5 Conv Layers with Direct Concatenation | Pending
B-8 | Trial A-9 | ResNet-V1-32 | 5 Conv Layers with Direct Concatenation | Smooth but with slightly-dynamic features


4.4 Deep Networks and Mixed Training

The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.

Trials | Generator | Depth | Discriminator | Loss | Results
B-9 | ResNet-V5-A | 40 Layers | 6 Conv Layers | C-GAN Loss | Slightly damaged features
B-10 | ResNet-V5-B | 71 Layers | 6 Conv Layers | C-GAN Loss | -
B-11 | ResNet-V5-B | 89 Layers | ResNet-11 | C-GAN Loss | -
B-12 | ResNet-V5-C | 145 Layers | 6 Conv Layers | C-GAN Loss | -
B-13 | ResNet-V5-C | 254 Layers | ResNet-11 | Mixed Loss | -
B-14 | ResNet-V5-C | 360 Layers | 6 Conv Layers | Mixed Loss | -
B-15 | ResNet-V5-C | 468 Layers | 6 Conv Layers | Mixed Loss | -


5. Environment

All the networks were trained on a laptop with a GTX 960M and on Google Compute Engine with a Tesla K80 or Tesla P100, depending on the required training time. All the code and sampling jobs were run on macOS 10.13 and Ubuntu 16.04 LTS.

5.1 Miku Datasets

I prepared three datasets for dynamic-object translation: a small Miku-to-Miku dataset, a small Stick-to-Miku dataset, and a full Stick-to-Miku dataset. The first two datasets have approximately 3,000 image pairs each, and the full dataset has 10,000 image pairs. The Miku-to-Miku datasets contain an anime character dancing in two dress styles, to test the model's translation performance. The Stick-to-Miku datasets contain the same dances performed separately by a stickman and by Miku, to simulate conditional generation of a dynamic object from movement data. All the video frames were generated with the MikuMikuDance software, and the character models' authors are listed in the references.

Data Sample Preview: Stickman (Left), Electric Miku (Middle), TDA Miku (Right)

5.2 Horse-to-Zebra Dataset


5.3 Video Experiment Platform

To conduct these experiments and implement the different models, I built a video experiment platform that can split videos, generate datasets, train different networks, and render a video with a side-by-side comparison against the original data. I designed a frame traverse slider that can generate an arbitrary sequence of frames as the network's input. I will release it for research use after I publish my formal paper.
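The core idea of the frame traverse slider can be sketched in a few lines (the function name is my own, and clip_len=9 matches the 9-frame inputs used in the trials above): slide a fixed-length window over the frame indices so one long recording yields many overlapping training clips.

```python
# Minimal "frame traverse slider": yield every length-clip_len window of
# frame indices; each window becomes one training clip for the network.
def frame_windows(num_frames, clip_len=9, stride=1):
    for start in range(0, num_frames - clip_len + 1, stride):
        yield list(range(start, start + clip_len))

windows = list(frame_windows(num_frames=12, clip_len=9))
# 12 frames with a 9-frame window and stride 1 give 4 overlapping clips,
# starting at frames 0, 1, 2, and 3.
```

In practice the real bottleneck is turning these index windows into tensors fast enough to keep the GPU busy, which is what the optimization below addresses.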

During the experiments, due to the slow speed of Python object code, I optimized the allocation of GPU and CPU work by reducing algorithmic complexity and increasing the batch size, so that the network trainer can occupy the full GPU with minimal stalls caused by slow "for" loops running on the CPU. In my tests, the old version of the frame slider reached 96% GPU utilization, while the new slider can use the GPU fully. On average, training speed on the GTX 960M laptop rose by 30%.

6. Limitations

Although this new architecture works better than the traditional U-Net, the generated videos are still very easy for a human to distinguish from real ones, mainly because of vague textures. I believe that, as the frame size and the network's scale increase, this architecture will apply to more scenarios.


7. Results





[Figure: evaluation scores (lower is better) comparing the 5-level U-Net baseline, the 4-level U-Net, and the Residual U-Net.]










8. Conclusions

To sum up, with the use of 3D convolution, the generative video model performs well across these translation tasks. The residual learning framework significantly enhances the performance of the traditional U-Net.

9. References

