Among generative adversarial networks, CycleGAN stands out for its ability to translate images without paired labels. As an unsupervised learner, it can handle large-scale visual or musical translation tasks. Unlike simpler unsupervised methods, it works so well that its fake images sometimes fool the human eye.
Given how difficult it is to collect a large amount of labeled Anime data, I planned to use unsupervised learning to demonstrate what Machine Learning can do in Anime generation. Judging from CycleGAN's results in several application scenarios in the original paper, such as zebra-to-horse and satellite-to-map translation, I believed it would work well for Anime.
I conducted several experiments with CycleGAN. My aim was to explore how this translator works, how to improve it, and, most importantly, how to generate beautiful images. The amount of computation was many times larger than in my previous tests, which made these fake images genuinely expensive to produce.
Anime Scene Translation
For convenience, I re-implemented CycleGAN with SE-ResNet blocks and deconvolution blocks; the link to the GitHub repository is at the bottom of this page. I used scenes from Sword Art Online and To Aru Kagaku No Railgun that contain their respective protagonists. These scenes were cropped by the Anime Face Recognizer from my previous test.
My first CycleGAN configuration contained two discriminators with 5 convolutional layers each, and two generators consisting of 2 convolutional layers and 2 deconvolutional layers with 6 ResNet blocks in between. I set the base channel number of both the generator and the discriminator to 32. Given the GPU's speed, I resized all training images from 256x256 to 128x128 to get results faster and to find suitable hyperparameters for the upcoming large-scale test. Fortunately, this first test produced some nice results.
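The generator described above can be sketched roughly as follows. This is a minimal illustration, not my exact implementation: layer widths, normalization (instance norm, as in the original CycleGAN), and padding choices are assumptions.

```python
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    """A residual block: two 3x3 convs with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """2 downsampling convs -> 6 ResNet blocks -> 2 deconvs, base width 32."""
    def __init__(self, ch=32, n_blocks=6):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(3, ch, 7),
                  nn.InstanceNorm2d(ch), nn.ReLU(inplace=True)]
        # two strided convolutions halve the resolution twice
        layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True),
                   nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 4), nn.ReLU(inplace=True)]
        layers += [ResnetBlock(ch * 4) for _ in range(n_blocks)]
        # two deconvolutions restore the original resolution
        layers += [nn.ConvTranspose2d(ch * 4, ch * 2, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True),
                   nn.ConvTranspose2d(ch * 2, ch, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
                   nn.ReflectionPad2d(3), nn.Conv2d(ch, 3, 7), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

x = torch.randn(1, 3, 128, 128)
print(Generator()(x).shape)  # torch.Size([1, 3, 128, 128])
```

The generator is fully convolutional, so the same weights accept both the 128x128 images used here and larger inputs later.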
The pictures below show the original image, the translated (fake) image, and the reconstructed image, respectively. The three on the left are the translation from Asuna to Misaka; those on the right go from Misaka to Asuna.
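The reconstructed image in each triplet comes from CycleGAN's cycle-consistency idea: translate with one generator, translate back with the other, and penalize the difference from the original. A minimal sketch (the 1x1-conv "generators" are placeholders for the real networks):

```python
import torch
import torch.nn as nn

# Placeholders standing in for the two full CycleGAN generators.
G = nn.Conv2d(3, 3, 1)  # e.g. Asuna -> Misaka direction
F = nn.Conv2d(3, 3, 1)  # e.g. Misaka -> Asuna direction

x = torch.randn(1, 3, 128, 128)              # original image
fake = G(x)                                  # translated ("fake") image
rec = F(fake)                                # reconstructed image
cycle_loss = nn.functional.l1_loss(rec, x)   # cycle-consistency term
print(cycle_loss.item())
```

This L1 penalty (applied in both directions) is what lets CycleGAN train without paired examples.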
After the first 10 epochs of training, the generator learned to adjust brightness and color style, since the two Anime have very different drawing styles: Sword Art Online has more fighting scenes and its story takes place in a relatively dark world, while most of To Aru Kagaku No Railgun happens in a sunny city with few dark scenes. At this stage, the generator could render these simple differences clearly in translation.
After 20 epochs of training, the generator had learned the difference between skin and hair color. Because the discriminators were weak, they could only distinguish the most conspicuous features of the characters.
After longer training, I noticed something interesting: the generator learned the characters' hair length in most scenes and "cut" the hair directly during translation. On the right side, however, the results did not seem to improve with training.
The results after 60 and 80 epochs showed that the network worked really well for the Asuna-to-Misaka translation but improved only slightly in the other direction. The generator also learned to swap the two characters' clothes. I trained for only 80 epochs because online GPUs are expensive and large images require an extremely large amount of computation; I will update this part when I get more GPU resources.
Anime Face Translation
In this part of the experiment, I decided to focus on translating the characters' faces to test CycleGAN's ability to translate details. To reduce the computation per run and fit in more epochs and configurations, I set the image size to 128x128.
I used the dataset from my DCGAN test and doubled the channel number in the discriminator. In the last experiment, the discriminator seemed too weak for the strong generator to establish a stable adversarial relationship: the generator's loss sometimes dropped to nearly 0.01 within a short period while the discriminator tended to misconverge. I also turned on the SE layer in the ResNet blocks; the rest of the configuration stayed the same.
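For reference, a Squeeze-and-Excitation layer inside a ResNet block looks roughly like this. This is a generic sketch, not my exact code; the reduction ratio of 16 is the common default and an assumption here.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: global pool, bottleneck MLP, channel-wise gate."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # squeeze: average over spatial dims; excite: per-channel weights in (0, 1)
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * w

class SEResnetBlock(nn.Module):
    """ResNet block with an SE gate applied before the residual addition."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.InstanceNorm2d(ch),
        )
        self.se = SELayer(ch)

    def forward(self, x):
        return x + self.se(self.conv(x))

x = torch.randn(2, 128, 32, 32)
print(SEResnetBlock(128)(x).shape)  # torch.Size([2, 128, 32, 32])
```

The SE gate lets each block re-weight its channels by global context, which is cheap to add and can help the generator pick out character-specific features.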
The results look very good with this configuration. The network learned very quickly, indicating that the larger discriminator was well matched with the generator. However, between epochs 100 and 120, all the images became very fuzzy and the training losses fluctuated abnormally. After that, the generator learned to cut or add hair during translation, and by epoch 180 it could even imitate facial expressions.
The test was a real success. I will use this network as a pre-trained model for my upcoming video translation test and for more complex Anime applications.