Abstract: This study tackles video question answering (VideoQA), a task that requires spatiotemporal reasoning over video. Given a textual question that refers to the frames of a video, VideoQA aims to return an appropriate answer. Based on the observation that multiple entities and their movements in a video can be important clues for deriving the correct answer, we propose a two-stream spatiotemporal compositional attention network that achieves sophisticated multi-step spatiotemporal reasoning by using both motion and detailed appearance features. In contrast to existing video reasoning approaches that use frame-level or clip-level appearance and motion features, our method simultaneously attends to the detailed appearance features of multiple entities as well as to motion features, guided by the attended words in the textual question. Furthermore, it progressively refines its internal representation and infers the answer over multiple reasoning steps. We evaluate our method on short- and long-form VideoQA benchmarks: MSVD-QA, MSRVTT-QA, and ActivityNet-QA, and achieve state-of-the-art accuracy on these datasets.
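The abstract only outlines the architecture, but the core idea (a question-guided query attending jointly over entity-level appearance features and motion features, with the result refined over several reasoning steps) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the module name `TwoStreamReasoningStep`, the feature dimensions, the GRU-based memory update, and the concatenation-based fusion are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamReasoningStep(nn.Module):
    """One hypothetical reasoning step: a question-guided query attends to
    entity-level appearance features and clip-level motion features,
    then refines an internal memory. Purely illustrative."""
    def __init__(self, dim):
        super().__init__()
        self.word_attn = nn.Linear(dim, 1)      # scores question words
        self.app_proj = nn.Linear(dim, dim)     # appearance key projection
        self.mot_proj = nn.Linear(dim, dim)     # motion key projection
        self.update = nn.GRUCell(2 * dim, dim)  # memory refinement

    def forward(self, memory, words, app_feats, mot_feats):
        # memory:    (B, D)     current internal representation
        # words:     (B, L, D)  question word features
        # app_feats: (B, N, D)  entity-level appearance features
        # mot_feats: (B, T, D)  clip-level motion features

        # Attend over question words, conditioned on the current memory.
        w_scores = self.word_attn(torch.tanh(words + memory.unsqueeze(1)))
        query = (F.softmax(w_scores, dim=1) * words).sum(dim=1)  # (B, D)

        # Question-guided attention over each stream.
        a_scores = torch.einsum('bnd,bd->bn', self.app_proj(app_feats), query)
        a_ctx = (F.softmax(a_scores, dim=1).unsqueeze(-1) * app_feats).sum(1)
        m_scores = torch.einsum('btd,bd->bt', self.mot_proj(mot_feats), query)
        m_ctx = (F.softmax(m_scores, dim=1).unsqueeze(-1) * mot_feats).sum(1)

        # Refine the internal representation with both stream contexts.
        return self.update(torch.cat([a_ctx, m_ctx], dim=-1), memory)
```

Under these assumptions, multi-step reasoning would amount to iterating the step, e.g. `for _ in range(K): memory = step(memory, words, app_feats, mot_feats)`, and feeding the final memory to an answer classifier.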