
ViStoryNet: Neural Networks with Successive Event Order Embedding and BiLSTMs for Video Story Regeneration

ViStoryNet: 비디오 스토리 재현을 위한 연속 이벤트 임베딩 및 BiLSTM 기반 신경망

  • Min-Oh Heo (Dept. of Computer Science and Engineering, Seoul National University) ;
  • Kyung-Min Kim (Dept. of Computer Science and Engineering, Seoul National University) ;
  • Byoung-Tak Zhang (Dept. of Computer Science and Engineering, Seoul National University)
  • Received : 2017.11.03
  • Accepted : 2018.01.16
  • Published : 2018.03.15

Abstract

A video is a vivid medium similar to human visual-linguistic experience, since it can contain a sequence of situations, actions, or dialogues that can be told as a story. In this study, we propose a framework for learning and regenerating stories from videos, using successive event order as supervision for contextual coherence. This supervision induces each episode to form a trajectory in the latent space, which constitutes a composite representation of ordering and semantics. We used kids' videos as training data; their advantages include an omnibus style, short and simple/explicit storylines, chronological narrative order, and a relatively limited number of characters and spatial environments. We build an encoder-decoder structure with successive event order embedding (SEOE) and train bidirectional LSTMs as sequence models that account for multi-step sequence prediction. Using approximately 200 episodes of the kids' video series 'Pororo the Little Penguin', we report empirical results for the story regeneration task and for SEOE. In addition, each episode forms a trajectory-like shape in the latent space of the model, which provides geometric information for the sequence models.

In this paper, we propose a story learning/regeneration framework that can learn a coherent story from videos and regenerate the video story. To this end, successive event order is used as supervision, inducing each episode to form a trajectory in the latent space and thereby constructing a composite representation space that handles both order and semantic information. We use a kids' video series as training data, which has several advantages in terms of story composition, narrative order, and complexity. We build an encoder-decoder structure that incorporates successive event embedding, and train bidirectional LSTMs to model sequences in the latent space while taking multi-step sequence generation into account. We report experimental results on approximately 200 episodes extracted from the 'Pororo the Little Penguin' video series. The experiments show that episodes form trajectory-like shapes in the latent space and that the model can be applied to the task of regenerating a story when partial cues are given.
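The abstract describes an encoder-decoder with a successive event order embedding and a bidirectional LSTM over the latent sequence. The following is a minimal, hypothetical sketch of that kind of architecture, not the authors' implementation: it assumes PyTorch, toy feature/latent dimensions, an additive learned order embedding, and a simple reconstruction objective in place of the paper's multi-step sequence prediction training.

```python
# Minimal, hypothetical sketch (assumes PyTorch; dimensions and names are illustrative,
# not the authors' implementation). An encoder maps each event feature into a latent
# space, a learned successive-event-order embedding is added so the events of an
# episode line up as a trajectory, a bidirectional LSTM models the latent sequence,
# and a decoder reconstructs the event features for story regeneration.
import torch
import torch.nn as nn


class ViStoryNetSketch(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=128, max_events=50):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.Tanh())
        # One learned vector per event position within an episode (order information).
        self.order_embedding = nn.Embedding(max_events, latent_dim)
        # Bidirectional LSTM over the ordered latent sequence.
        self.bilstm = nn.LSTM(latent_dim, latent_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * latent_dim, feat_dim)

    def forward(self, event_feats):
        # event_feats: (batch, num_events, feat_dim) features of successive events.
        _, num_events, _ = event_feats.shape
        z = self.encoder(event_feats)                         # (batch, T, latent_dim)
        positions = torch.arange(num_events, device=event_feats.device)
        z = z + self.order_embedding(positions)               # inject event order
        h, _ = self.bilstm(z)                                 # (batch, T, 2 * latent_dim)
        return self.decoder(h)                                # reconstructed event features


# Toy usage: reconstruct 10 successive event features for a batch of 2 episodes.
model = ViStoryNetSketch()
events = torch.randn(2, 10, 512)
recon = model(events)
loss = nn.functional.mse_loss(recon, events)  # illustrative reconstruction objective
loss.backward()
print(recon.shape)  # torch.Size([2, 10, 512])
```

This sketch omits the feature extraction from video frames and dialogue as well as the multi-step prediction training of the sequence model described in the abstract; it is intended only to make the overall encoder/order-embedding/BiLSTM/decoder structure concrete.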


Acknowledgement

Supported by : Ministry of Science, ICT and Future Planning
