Prosodic prominence is a multimodal phenomenon involving both acoustic and kinematic dimensions. To study the multimodal nature of prominence, we need to collect prominence ratings of audio-visual speech material from large groups of raters. This is feasible by means of a web-based crowdsourcing set-up that allows volunteers to participate using a private computer or mobile phone. However, this freedom also entails a certain loss of experimental control due to variation in the hardware used by the raters.
In this pilot study we explore potential effects of two hardware features – screen size and audio device (headphones vs. loudspeakers) – on multimodal prominence ratings. To this end, 16 brief clips from Swedish television news (218 words in total) were rated by 31 native Swedish volunteers using a web-based set-up. In our GUI, an orthographic representation of the text was displayed below the video player. Each word was to be rated as non-prominent, moderately prominent, or strongly prominent by clicking on the word until the desired prominence level was encoded by a specific color (yellow: moderate; red: strong). Participants were free to use a mobile phone, a tablet, or a computer, with headphones or loudspeakers, and we collected information about their hardware via a questionnaire. In addition, we automatically logged the screen size of each participant's device.
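For illustration, the following minimal Python sketch models the click-to-cycle rating logic described above. The actual interface is web-based; the level names and the wrap-around behavior on a further click are assumptions for this sketch, not details reported in the study.

```python
# Minimal sketch of the GUI's click-to-cycle rating logic (illustrative only:
# the actual interface is web-based, and wrapping back to non-prominent on a
# further click is an assumption, not a detail from the abstract).

LEVELS = ["non-prominent", "moderate", "strong"]  # rendered as plain/yellow/red
N_LEVELS = len(LEVELS)

def next_level(current: int) -> int:
    """Advance a word's rating by one click, wrapping back to non-prominent."""
    return (current + 1) % N_LEVELS

# Example: every word starts non-prominent; three clicks cycle through all levels.
rating = 0
for _ in range(3):
    rating = next_level(rating)
    print(LEVELS[rating])  # moderate, strong, non-prominent
```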
We applied two different approaches to analyze participants' rating behavior as a function of the hardware features under discussion. First, we calculated five variables from the raw prominence ratings: (i) the sum of all ratings (over all 218 words), (ii) the percentage of words rated as (moderately or strongly) prominent, (iii) among prominent words, the proportion of words rated as strongly prominent, and (iv-v) the relative prominence ratings of two selected words. Effects of screen size and audio device on these variables were analyzed using linear regression models. Second, we calculated inter-rater reliability for multiple raters using Fleiss' kappa, both for all raters as a reference and for subgroups defined by audio device and screen size.
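As a hedged illustration of these two analysis steps, the sketch below computes variables (i)-(iii) from a raters-by-words matrix, fits a linear regression with dummy-coded hardware predictors, and computes Fleiss' kappa via statsmodels. The ratings matrix and the predictor codings are random placeholders, not the study's data.

```python
# Sketch of both analysis steps, assuming a raters-by-words matrix coded
# 0 = non-prominent, 1 = moderately prominent, 2 = strongly prominent.
# All data below are random placeholders standing in for the real ratings.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(31, 218))             # placeholder data

# Per-rater summary variables (i)-(iii) from the abstract:
sum_ratings = ratings.sum(axis=1)                        # (i) sum over all 218 words
pct_prominent = (ratings > 0).mean(axis=1) * 100         # (ii) % words rated prominent
n_prominent = (ratings > 0).sum(axis=1)
prop_strong = (ratings == 2).sum(axis=1) / np.maximum(n_prominent, 1)  # (iii)

# Linear regression of variable (iii) on dummy-coded hardware predictors.
headphones = rng.integers(0, 2, size=31)                 # 1 = headphones
screen_medium = rng.integers(0, 2, size=31)              # 1 = medium-sized screen
X = sm.add_constant(np.column_stack([headphones, screen_medium]))
print(sm.OLS(prop_strong, X).fit().summary())

# Fleiss' kappa over all raters: aggregate_raters() turns the transposed
# words-by-raters matrix into per-word category counts.
table, _ = aggregate_raters(ratings.T)                   # rows = words (subjects)
print(fleiss_kappa(table))
```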
The results reveal a significant model fit for variable (iii) defined above (proportion of strong ratings; F(5,21) = 5.332; p=.0022**), suggesting a significantly higher proportion of strongly prominent ratings with loudspeakers (on average, 34.0% of prominent words rated as strongly prominent) than with headphones (18.3%; t=2.944; p=.0073**), as well as with medium-sized screens (34.2%) compared to small screens (24.4%; t=2.433; p=.0232*); however, the proportion of strong ratings tended to be lowest with large screens (14.2% on average). Effects of screen size were also reflected in inter-rater reliability, with the highest kappa for raters with medium-sized screens (kappa=.566, when ratings are recoded to a binary prominent/non-prominent decision) compared to large (kappa=.485) and small screens (mobile phones, kappa=.437). By contrast, inter-rater reliability was less affected by the listening condition (headphones vs. loudspeakers).
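The binary recoding and subgroup comparison mentioned above could be computed as in the following sketch; the ratings matrix and the group assignments are random placeholders, assuming the same 0/1/2 coding as before.

```python
# Sketch of the subgroup reliability comparison: recode ratings to a binary
# prominent/non-prominent decision and compute Fleiss' kappa per screen-size
# group. Data and group assignments are random placeholders.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(1)
ratings = rng.integers(0, 3, size=(31, 218))             # placeholder data
binary = (ratings > 0).astype(int)                       # 1 = moderately/strongly prominent
screen_group = rng.choice(["small", "medium", "large"], size=31)

for group in ("small", "medium", "large"):
    subset = binary[screen_group == group]               # raters in this group
    table, _ = aggregate_raters(subset.T)                # per-word counts over {0, 1}
    print(group, round(fleiss_kappa(table), 3))
```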
To conclude, the choice of hardware may affect multimodal prominence ratings, which must be taken into account in crowdsourcing approaches. More detailed results will be presented at the conference.
Keywords: audio-visual perception, crowdsourcing, web-based, inter-rater reliability, headphones
MMSYM 2019, 6th European and 9th Nordic Symposium on Multimodal Communication, Research group MIDI (Multimodality, Interaction & Discourse), University of Leuven, September 9-10, 2019