This paper presents a specific problem from a broader PhD project on children’s reception of music and characters in animated features, a project which, alongside textual analyses, aims to gather and analyze an empirical ‘child perspective’ on selected films. To this end, video-recorded screenings and interviews have been carried out with small groups of children aged 7 and 11 years. To fully understand children’s reactions and responses to film, a multimodal approach must be taken when transcribing and analyzing such video-recorded data: one which, at a minimum, accounts for the children’s facial expressions and body language as well as their use of the verbal mode. This focus on gesture seems particularly important when working with relatively young children and with non-verbal modes such as music, which tends to appeal to people in a highly embodied, intuitive way that often escapes clear verbal description. At this preliminary stage of the project, the data suggest that the children depend heavily on semiotic resources outside the verbal mode, e.g. singing, humming, tapping rhythms, dancing, and imitating the playing of instruments, in order to articulate experiences of music for which a suitable vocabulary may be out of reach. In this paper, I present a model for the multimodal transcription of interviews based on the possibilities afforded by the software program Multimodal Analysis Video, developed by Kay O’Halloran and her team. In doing so, I open up a discussion of how children’s multimodal meaning-making practices can be captured, analyzed, and understood in academic research.