Designers characterize naturalness in voice user interfaces : their goals, practices, and challenges Kim, Yelim 2020

Designers Characterize Naturalness in Voice User Interfaces: Their Goals, Practices, and Challenges

by

Yelim Kim

BSc., The University of Toronto, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

March 2020

© Yelim Kim, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Designers Characterize Naturalness in Voice User Interfaces: Their Goals, Practices, and Challenges

submitted by Yelim Kim in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Examining Committee:

Dongwook Yoon, Computer Science
Supervisor

Joanna McGrenere, Computer Science
Supervisor

Karon MacLean, Computer Science
Examining committee member

Abstract

With substantial industrial interests, conversational voice user interfaces (VUIs) are becoming ubiquitous through devices that feature voice assistants such as Apple’s Siri and Amazon Alexa. Naturalness is often considered to be central to conversational VUI designs as it is associated with numerous benefits such as reducing cognitive load and increasing accessibility. The literature offers several definitions for naturalness, and existing conversational VUI design guidelines provide different suggestions for delivering a natural experience to users. However, these suggestions are hardly comprehensive and often fragmented. A precise characterization of naturalness is necessary for identifying VUI designers’ needs and supporting their design practices. To this end, we interviewed 20 VUI designers, asking what naturalness means to them, how they incorporate the concept in their design practice, and what challenges they face in doing so.
Through inductive and deductive thematic analysis, we identify 12 characteristics describing naturalness in VUIs and classify these characteristics into three groups, which are ‘Fundamental’, ‘Transactional’ and ‘Social’ depending on the purpose each characteristic serves. Then we describe how designers pursue these characteristics under different categories in their practices depending on the contexts of their VUIs (e.g., target users, application purpose). We identify 10 challenges that designers are currently encountering in designing natural VUIs. Our designers reported experiencing the most challenges when creating naturally sounding dialogues, and they required better tools and guidelines. We conclude with implications for developing better tools and guidelines for designing natural VUIs.

Lay Summary

Providing natural conversation experience is often considered central to designing conversational Voice User Interfaces (VUIs), as it is expected to bring out numerous benefits such as lower cognitive load, lower learning curve, and higher accessibility. Despite its noted importance, naturalness is ill-defined. There are also no comprehensive standard resources for helping designers to pursue naturalness. In order to provide support for VUI designers in the future, it is critical to understand how they currently perceive and pursue naturalness in their design practices. Hence, we interviewed 20 VUI designers to understand their notion of a natural conversational VUI and their practices and challenges of pursuing it. In this thesis, we present 12 characteristics of naturalness and classify these characteristics into 3 groups. We also identify 10 challenges that our designers are currently encountering and conclude with implications for developing better tools and guidelines for designing natural conversational VUIs.

Preface

This thesis was written based on the study approved by the UBC Behavioural Research Ethics Board (certificate number H18-01732).
This thesis extends a conference paper that is currently under review for publication. As the first author of the submitted paper, I designed and conducted the semi-structured interviews and analyzed the data under the supervision of Dr. Dongwook Yoon and Dr. Joanna McGrenere. More specifically, my two supervisors helped me formulate research questions, design the study and analyze the collected data. The submitted paper was written with great help from the two supervisors as well as from Mohi Reza, another co-author of the paper. Mohi Reza, an MSc student, provided a great amount of help in writing the introduction and related work sections of the submitted paper, as well as great insight for shaping findings and contributions. Mohi Reza also provided English writing assistance for the submitted paper.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Figures
Acknowledgements
1 Introduction
  1.1 Problem Definition
  1.2 Contributions
  1.3 Overview
2 Related Work
  2.1 VUI Literature in HCI
  2.2 Existing Definitions of Naturalness
  2.3 Characterizing Conversations
  2.4 Difference Between Spoken Language and Written Language
  2.5 Human-likeness in Embodied Agents
  2.6 Tools and Guidelines for VUI Design
3 Methods
  3.1 Participants
  3.2 Interviews
  3.3 User Tasks
  3.4 Procedure
  3.5 Data Analysis
  3.6 Apparatus
4 Results
  4.1 Designers Characterize Naturalness
    4.1.1 Fundamental Conversation Characteristics
    4.1.2 Social Conversation Characteristics
    4.1.3 Transactional Conversation Characteristics
  4.2 Designers Experience Challenges
5 Discussion and Implications for Design
  5.1 Discussion
    5.1.1 Capturing Naturalness in Context
    5.1.2 Positioning Naturalness of VUI in the Literature
    5.1.3 Contrasting Naturalness of Transactional vs. Social Agents
    5.1.4 Limitations
    5.1.5 Implications for VUI Design Tools
6 Conclusion and Future Work
Bibliography
Appendix
  A.1 The Recruitment Poster
  A.2 The Recruitment Message Posted Through SNS
  A.3 The Consent Form
  A.4 The Pre-interview Survey
  A.5 The Semi-structured Interview Script
  A.6 More Descriptions on the User Task
  A.7 The Post User-task Survey
  A.8 The Data Analysis Process

List of Figures

3.1 The familiarity with SSML
4.1 The twelve characteristics of naturalness that designers deem important
4.2 The 10 challenges that designers are currently encountering in designing natural VUIs

Acknowledgements

Firstly, I would like to express my sincere gratitude to my two supervisors, Dongwook Yoon and Joanna McGrenere. They were always generous with their time whenever I needed their help. Before I entered graduate school, I did not have much exposure to the Human-computer Interaction (HCI) field and research environment. My two great supervisors were always so patient and generous with my progress and encouraged me to explore and make my own decisions for my project. I really thank them for their extensive help throughout the whole project. My two supervisors always amazed me with their great professionalism and their endless passion for HCI research, and they set an example for me to follow. Secondly, I would like to thank Mohi Reza, an MSc student who helped me in writing a paper we submitted to a top-tier computing conference. He dedicated two weeks to helping with my project. I really enjoyed working with him and learned a lot from him. Thirdly, I would like to express my appreciation to Karon MacLean for agreeing to be the second reader of my thesis and being so generous with her time. I also want to thank her for her insightful guidance and help with the other project that I worked on with her student, Soheil Kianzad.
Lastly, I would like to thank the students in the Multimodal User eXperience lab for providing insightful feedback on my study and for offering much valuable advice on graduate life.

Chapter 1

Introduction

In this chapter, we first introduce an overview of the problem space. Then, we motivate and illustrate the contributions of our study, and outline the overall structure of the thesis.

1.1 Problem Definition

With substantial industrial interest, conversational Voice User Interfaces (VUIs) are becoming ubiquitous, with a plethora of everyday gadgets, from smartphones to home control systems, featuring voice assistants (e.g., Apple’s Siri, Amazon Alexa and Google Home Assistant). Conversational VUI systems are one of the two general types of VUI systems [1]. In a conversational VUI system, users perceive the voice agents as conversation partners and accomplish their goals by having conversations with the agents [1]. In a command-based VUI system, the other general type, users are instead expected to learn and use the appropriate voice commands to accomplish their goals [2]. Hereafter, we use the term ‘VUI’ to refer to ‘conversational VUI’, and we use the term ‘voice agent’ to refer to ‘conversational agent’ [3].

At the heart of the desired properties of VUIs is naturalness. According to prior work and industrial design guidelines, enabling users to accomplish their goals by having natural conversations with voice agents brings numerous benefits such as lowering cognitive load [4, 5], lowering the learning curve [5], and increasing accessibility [5, 6]. Hence, multiple VUI design textbooks and guidelines recommend that designers make VUIs that provide natural conversational experiences to the users [7, 5, 8]. VUIs have only become popular recently.
As such, designing a conversational voice user interface can be difficult due to the lack of standard and comprehensive design guidelines. As Robert and Raphael noted in their book, “conversational interfaces are at the stage that web interfaces were in 1996: the technologies are in the hands of masses, but mature design standards have not yet emerged around them.” In fact, currently available design resources suggest design approaches toward naturalness, but their characterization of the term is fragmented and hardly comprehensive. Multiple resources recommend different practices to designers to make VUIs sound more natural [9, 10], feature natural dialogues [11, 12, 13], or offer natural interactions [14, 15]. However, the term naturalness is an ill-defined construct lacking precision and clarity [16]. Therefore, the field is lacking a comprehensive and substantive characterization of naturalness in VUIs despite its advertised importance.

Bridging this conceptual gap is a critical step towards providing comprehensive guidance to designers who strive to create natural VUI experiences. The literature in communications and social science informs the characterization of naturalness in human dialogues. It suggests that people have different expectations and concepts of violations in interpersonal communication, depending on a class of situational factors, such as who is talking and the relationship between interlocutors [17, 18]. Given that modern voice assistants are often situated in complex and dynamic social settings [19], it is possible that conversational characteristics of a VUI considered natural in one setting are not perceived the same way in another setting (e.g., an extremely human-like voice agent can be considered deceptively anthropomorphic and uncanny [20]).
The extent to which specific characteristics of naturalness apply in different conversational settings remains an open question.

Within the broader discourse on Natural User Interface (NUI) design, the preliminary conceptions of naturalness offered have remained abstract and generic. Some have described naturalness as a property that refers to how the users “interact with and feel about a product” [21]. Others have used it to describe devices that “adapt to our needs and preferences”, and enable people to use technology in “whatever way is most comfortable and natural” [22]. Such broad characterizations make sense at a conceptual level. However, the extent to which they can be applied in the domain of VUI design remains uncertain.

While there are some existing VUI guidelines, they too are “too high-level and not easy to operationalise” [23]. To offer designers proper guidelines and tools, we need to seek answers to questions on how designers characterize naturalness, how they align such characteristics to their varying design goals, and what challenges they face in this pursuit of naturalness. Doing so will enable researchers to create conceptual and technical tools that support naturalness for VUI designers.

As a first step towards characterizing naturalness, we conducted semi-structured interviews with 20 VUI designers to understand how they define naturalness, what design process they use to enhance naturalness, and the challenges they face. Through reflexive thematic analysis [24], our study revealed a comprehensive set of characteristics, which we mapped into different categories according to the aspect of naturalness each characteristic contributes to: achieving a basic skill required for fluent verbal communication, providing social interactions, or helping users’ tasks.
Some of the characteristics mirror those found in the human-to-human conversation literature, but interestingly designers also identified characteristics that are “beyond-human”, which reflect machine-specific capabilities that outperform people, such as superior memory capacity and processing power. Our VUI designers also described significant challenges in achieving natural interaction related to a lack of adequate design tools and guidelines and in balancing the different characteristics based on the role of the voice agent (e.g., a social companion or a personal assistant).

1.2 Contributions

In this study, we recruited VUI designers, a group that to our knowledge was not previously explored in the HCI literature, and ran an empirical study to uncover their perceptions of naturalness and their current practices of pursuing it. Our work contributes the following:

1. We identified a set of 12 characteristics for naturalness as perceived by designers and categorized them based on the different aspects of naturalness each characteristic contributes to: Fundamental, Social and Transactional.

2. We identified and characterized the 10 challenges that hinder designers from creating natural VUIs.

3. We proposed design implications for the tools and design guidelines to support designers in creating natural VUIs based on our findings from the interviews.

1.3 Overview

In Chapter 2, we present relevant previous works. In Chapter 3, we describe our study design and analysis methodologies. Then, Chapter 4 introduces our study findings, and Chapter 5 discusses our reflections on the findings and presents design implications that we created based on our reflections and insights.
Finally, Chapter 6 provides the conclusion of the thesis and suggestions for future work.

Chapter 2

Related Work

We set the stage by first reviewing the existing body of literature on VUIs. Then, we look at the broad manner in which naturalness is currently conceived and employed, and the ways in which people have characterized conversations. We then focus on the rich body of work on anthropomorphism in conversational agents and beyond, a topic that is of particular importance to our discussion on naturalness. Finally, we look at orthogonal, yet important concerns, and review existing tools and guidelines for VUIs.

2.1 VUI Literature in HCI

Researchers have been investigating ways to support speech interactions since as early as the 50s. With rapid advancements in Natural Language Understanding (NLU), we transitioned from rudimentary speech-recognition based systems such as Audrey [25] and Harpy [26] in the 50s and 70s, to task-oriented systems like SpeechActs [27] in the 90s, and to the sophisticated conversational agents that we now have.

More recently, several studies from the HCI community have investigated how voice assistants impact users [19, 28, 29, 30]. In these studies, various issues have been explored, including how VUIs fit into everyday settings [19], how users perceive social and functional roles in conversation [28], and the disparity between high user expectations and low system capability [31].

A common thread between many of these studies is that they take into account the perspective of the users. As Wigdor [21] put it, naturalness is a powerful word because it elicits a range of ideas in those who hear it. In this study, we take the path less trodden, and see what designers think.

2.2 Existing Definitions of Naturalness

Given that naturalness is a construct, we see several angles from which existing studies define and use the term.

As a descriptor for human-likeness: Naturalness is often seen as a “mimicry of the real world” [21].
In the context of speech, the human is the natural entity of concern, and hence, behavioral realism, i.e. creating VUIs that act like real humans, has become a focus. We can trace the attribution of anthropomorphic traits onto computers to a seminal paper by Turing [32] on whether machines can think. In that paper, he assumes the “best strategy” to answer this question is to seek answers from machines that would be “naturally given by man”. The pervasive influence of such thought can be seen in existing definitions of naturalness in VUI literature: [33, 34, 35] all treat naturalness in this light, as a pursuit of human-likeness.

As a distinguishing term between the novel and traditional modes of input: The term is also used to contrast interfaces that leverage newer input modalities such as speech and gestures with more classical modes of input, namely, graphical and command-line interfaces [36]. In this definition, the term is in essence an umbrella descriptor of countless systems involving multi-touch [37, 38], hand-gestures [39, 40], speech [41, 42], and beyond [43, 44].

As interfaces that are unnoticeable to the user: Another usage draws from Mark Weiser’s notion of transparency introduced in his seminal article on ubiquitous computing [45]. In this formulation, naturalness is a descriptor for technologies that “vanish into the background” by leveraging natural human capabilities [46, 47].

As an external property: In this conception, the term does not refer to the device itself, but rather the experience of using it, i.e. the focus is on what users do and how they feel when using the device [48].
The characteristics that we present in this thesis can also be viewed from such an angle, i.e. designers form and utilize characteristics not because they make the VUI more natural, but rather because they make the experience of using it more natural.

The existing usage of the term has drawn heavy criticism from some: Hansen and Dalsgaard [16] describe naturalness as “unnuanced and marred by imprecision”, and find the non-neutral nature of the term to be problematic. In their view, the term has been misused to conflate “novel and unfamiliar” products with “positive associations”, akin to marketing propaganda.

Norman [49] contests the distinction between natural and non-natural systems and notes that there is nothing inherently more natural about newer modalities over traditional input methods. With speech, for example, he notes that utterances still have to be learned.

2.3 Characterizing Conversations

Existing literature [50, 51, 52] has demarcated different forms of human conversations based on purpose. Clark et al. [28] take these forms and classify them into two broad categories: social and transactional. In the former category, they note that the aim is to establish and maintain long-term relationships, whereas, in the latter, the focus is on completing tasks. In our study, we ground some of the characteristics that designers mention on the basis of these two categories.

2.4 Difference Between Spoken Language and Written Language

Previous studies from linguistics have identified the differences between spoken and written languages [53, 54, 55]. Researchers found that people use more complex words when writing than when speaking [53, 54, 56, 57]. Bennett reported that passive sentences are used more frequently in written texts [58]. Also, more complex syntactic structures are used in written language than in spoken language [56].
In our study, we analyzed our interview data based on these previous works to find out which specific aspects of spoken dialogue VUI designers find challenging to mimic when they are writing VUI dialogues.

2.5 Human-likeness in Embodied Agents

A rich body of studies explores issues revolving around human-likeness in embodied agents and our relationships with them. They investigate a plethora of concerns such as ways to transfer human qualities onto machines [59, 60], ways to maintain trust between users and computers [61, 62, 63], modeling human-computer relationships [64, 65], designing for different user groups such as older adults [66, 67] and children [68, 69], and stereotypes [70, 71].

Nass et al. conducted a series of studies on how people respond to voice assistants. Results from these studies suggest that people apply existing social norms to their interactions with voice assistants [72]. The “similarity attraction hypothesis” posits that people prefer interacting with computers that exhibit a personality that is similar to their own [73], and that cheerful voice agents can be undesirable to sad users.

In our study, designers reflect on issues that echo the literature by considering factors such as personality, trust, bias, and demographics in their VUI design practice.

2.6 Tools and Guidelines for VUI Design

Many large vendors of commercial voice assistants provide their own separate guidelines for designers [74, 12, 75]. These guidelines offer design advice tailored to developing applications for a specific platform. With regard to platform-independent options, some preliminary effort has been undertaken in the form of principles [76], models [77] and design tools [78] for VUIs. More specifically, Ross et al. provided a set of design principles for VUI applications that take the role of a faithful servant [76], while Myers et al. analyzed and modelled users’ behaviour patterns in interaction with unfamiliar VUIs [77]. Lastly, Klemmer et al. introduced their tool for an intuitive and fast VUI prototyping process [78]. Our study adds design implications for tools and design guidelines to help VUI designers create natural VUI experiences.

Chapter 3

Methods

To understand how VUI designers perceive naturalness in their design practices, we conducted semi-structured interviews with 20 VUI designers. We designed the interview questions with a constructivist epistemological stance, viewing the interview as a collaborative meaning-making process between the interviewer and the interviewee [79]. This chapter describes the design of our study and the methodologies we used for collecting and analyzing our data.

3.1 Participants

We recruited 20 VUI designers (7 female, 13 male) using purposeful sampling. To draw findings from a varied set of perspectives, we interviewed both amateur (N = 7) and professional (N = 13) VUI designers. We recruited the participants through flyers (Appendix A.1) and study invitation messages on social network services such as Facebook and LinkedIn (Appendix A.2). Participants’ ages ranged from 17 to 73 (M = 34.3, Median = 30.5, SD = 14.7). The nationalities of our participants were as follows: 4 American, 1 Belgian, 1 Brazilian, 5 Canadian, 1 Dutch, 1 German, 5 Indian, 1 Italian, and 1 Mexican.

In this study, we define professional VUI designers as people working full-time on designing VUI applications regardless of their actual job titles (e.g., VUI designer, Voice UX Manager, UX manager, CEO). Participants’ length of professional VUI design experience ranged from 9 months to 20 years (M = 4 years and 2 months, Median = 2 years, SD = 6 years). Our participants worked in companies that ranged considerably in size, from 3 employee startups to large corporations with over 5,000 employees.
Most of the professional VUI designers we recruited (8 out of 13) were working for relatively small companies (from 2 employees to 49 employees), and the remaining 2 participants were working for medium-sized companies (from 50 to 999 employees).

In addition to the information stated above, we collected the participants’ highest level of education (2 with a high school diploma, 2 with a technical training certificate, 9 with a bachelor’s degree, 5 with a master’s degree, and 2 with a doctoral degree), and their familiarity with Speech Synthesis Markup Language (SSML) (Figure 3.1). We collected information about SSML familiarity because SSML is the only available standard method to modulate synthesized voices across different platforms.

Figure 3.1: The distribution of the participants’ familiarity with SSML (N=20)

About half of the participants considered themselves unfamiliar with SSML, while about the same number considered themselves familiar with it.

All of our participants had previously designed at least one conversational voice user interface, including voice applications for Amazon’s and Google’s smart speakers, humanized Interactive Voice Response (IVR) systems, and VUIs for a virtual nurse, a smart home appliance and a companion robot.

3.2 Interviews

For each participant, we conducted one semi-structured interview session based on the interview script in Appendix A.5. The duration of the interviews varied from 30 minutes to about an hour, depending on the participants’ availability. All of the interviews were conducted by the thesis author. We arranged online interviews for the participants (17 out of 20) who could not come to the University of British Columbia for the interview.

Prior to each interview session, the participants were asked to fill out a survey asking about their demographic data, familiarity with SSML and previous VUI design experiences (Appendix A.4).
In this survey, we also asked about their most memorable VUI projects to contextualize our research questions based on their vivid memories. The data collected through the pre-interview surveys were analyzed using descriptive statistics to ensure the diversity of the participant demographics (i.e., gender and VUI design professional level).

The interview questions can be broadly divided into four sections. The goal of the first section was to understand the participants’ general VUI design practices and their previous VUI experiences in depth. In this part, we asked them to describe their design practices for the two most memorable VUI projects that they reported in the pre-interview survey. For the second section, we sought to understand the participants’ conceptions of a natural VUI. To achieve this goal, we asked them to provide their own definitions of a natural VUI. Then, we asked them how important it is for them to create a natural VUI, and if it is important, what benefits they expect to gain by doing so. The goal of the third section was to understand the participants’ design practices for creating natural VUIs and the challenges our participants are currently facing in creating natural VUIs. Therefore, we asked the participants what particular design steps they take for creating more natural VUIs, and what the most challenging aspects of carrying out those steps are. The last section of the questions was to understand how useful the current design guidelines and tools are for designing natural VUIs. In this part, we asked the participants what tools or design guidelines they are currently using and how helpful they are for designing natural VUIs.

3.3 User Tasks

Depending on the participants’ availability, 15 out of 20 participants had the time to do the user task after the interview.
In this user task, participants were asked to write relatively short VUI dialogues in two ways: by typing on the keyboard, and by using a voice typing tool that we created. However, one participant, aged 73, dropped out in the middle of the user task because he felt tired of typing. The duration of each user task was about 15 minutes. This user task was designed to accomplish two goals. The first goal was to understand the participants’ VUI dialogue writing procedures and their perceptions of the current synthesized voices. The second goal was to explore the possibilities of using voice typing instead of keyboard input for creating natural VUIs. Please note that the data collected from the user tasks did not directly inform our study results because the collected data lacked sufficient quantity and richness. However, to provide a more transparent understanding of our study, we lay out the whole procedure of the user task in Appendix A.6.

3.4 Procedure

Prior to the interview, each participant was asked to fill out the pre-interview survey (Appendix A.4) mentioned above. Before each interview session, an email containing the consent form (Appendix A.3) was sent to each participant, so that our participants could provide their consent by replying to the email.

Before the recording of each interview session began, the participants were informed that the recording would start. After the interview questions, the participants who were able to spend more time carried out the user task. After they finished the user task, they were asked to fill out the post-task survey (Appendix A.7) to provide their feedback on the task and the current design guidelines. We provided a link to the survey to each participant.

Lastly, we asked participants if they had any concerns or questions regarding this study. We addressed their questions if there were any, and if there were no further questions, we informed them that the interview was finished, and thanked them for their help in this study.
At the end of each interview session, each participant received $15/hour for their participation. The payment was made electronically through PayPal or Interac e-Transfer. Some of the participants declined payment and expressed their desire to help the study for free.

3.5 Data Analysis

All 20 interviews were transcribed before being analyzed. We used Braun and Clarke’s approach for reflexive thematic analysis [24] for analyzing the interview data. Their approach was particularly suited for our study because of its theoretical flexibility and rigour. The three members of the research team, including the thesis author and her two supervisors, had a one-hour weekly meeting where they developed the themes over the course of several months. Instead of seeking the objective truth, we took a crystallization approach [80], and developed a deeper understanding of the data by sharing each other’s interpretations of the data during each meeting. To facilitate a productive discussion, we organized the themes in several different ways concurrently, including using post-its, ‘Miro’, an online collaborative whiteboard platform, and ‘airtable’, an online collaborative spreadsheet application (Appendix A.8). All of the members in the team coded the interview data using the open coding methodology [81] “where the text is read reflectively to identify relevant categories.” The two supervisors each went through at least three interview transcripts, and the thesis author went through them all. We took both inductive and deductive approaches for coding the data and developed a set of coherent themes that form the basis of our findings.
Our deductive approach for coding was derived from the previous works on “the classification of human conversation” [52, 50, 51].

3.6 Apparatus

For the participants who were not able to visit the University of British Columbia to have an in-person interview, we used ‘Zoom’, an online communication software, to conduct the interviews and record the audio of the interviews. For the in-person interviews, ‘Easy Voice Recorder’, an Android application for audio recording, was used to record the audio of the interview. The interviewer also wrote interview notes on paper.

For the user task, the participants were asked to write VUI dialogues using Google Docs. Then, the interviewer used a software tool named ‘TTS reader’, which we built for playing the sound of VUI dialogues. TTS reader used Amazon Polly voices to read the VUI dialogues that the participants wrote. An add-on for Google Docs, ‘Kaizena’, was used to let the participants add voice comments on the dialogues that they wrote.

Chapter 4

Results

In this section, we first describe how our VUI designers characterize naturalness, followed by the challenges that they are facing in creating a natural VUI. We also explain how each challenge is related to the characteristics of naturalness defined by our designers.

4.1 Designers Characterize Naturalness

To contextualize our findings, we briefly summarize the most memorable projects described by our 20 participants. In total, we collected data about 38 VUI projects. There was only one chat-oriented dialogue system (e.g., [82, 83]), where a participant built a conversational agent for reducing elders’ loneliness, and the rest of them were grouped as task-oriented systems according to the definition provided by [84].
Among the types of applications mentioned during the interviews, there were 27 Intelligent Personal Assistant (IPA) systems for smart home speakers (23 Amazon Alexa, 4 Google Home), 8 Interactive Voice Response (IVR) phone systems, 1 voice agent system for a smart air-conditioner, 1 voice agent system for a mobile application, and 1 voice agent system for a humanoid robot.

When asked to provide their definitions of a natural VUI, our participants responded in terms of the characteristics of human conversation that they consider important for creating a natural VUI. Later, our thematic analysis revealed three categories for classifying these characteristics of human conversation, namely those that (1) are fundamental to any good conversation, (2) promote good social interactions, and (3) help users accomplish their tasks. Perhaps not surprisingly, the latter two categories echo classifications for human-to-human conversations in existing literature [52, 50, 51], labeled as “social conversation” and “transactional conversation”. To be consistent with that literature, we adopted those labels.

We found that instead of pursuing all the characteristics under the three categories, our designers selectively choose a particular set of characteristics depending on the context of their VUI applications and their design purposes. We also found that pursuing multiple characteristics at once can create conflicts.

The three categories we use in this study can help readers conceptualize the 12 characteristics that we found in this study. They can also serve as a lens to understand why designers pursue these characteristics and how these characteristics can often conflict with each other.
The following section provides detailed descriptions of each characteristic.

4.1.1 Fundamental Conversation Characteristics

Among the conversation characteristics mentioned by our participants, there was a set of fundamental verbal communication characteristics that a natural VUI should have, regardless of whether the aim is to support social conversations or transactional conversations. The participants consider a VUI that does not achieve these elements as unnatural. For example, synthesized voices that do not show appropriate prosodies (e.g., having a monotonous intonation) were frequently referred to as “robotic” (P1, P4, P6) or being a “machine” (P18).

Conversation Characteristics for Imbuing Naturalness in VUIs

Fundamental Characteristics
- Sound like a human speech.
- Understand and use variations in human language.
- Use appropriate prosody and intonation.
- Collaboratively repair conversation breakdowns.

Social Characteristics
- Express sympathy and empathy.
- Express interest in the user.
- Be interesting, charming and lovable.

Transactional Characteristics
- Proactively help users.
- Present a task-appropriate persona.
- Be capable to handle a wide range of topics in the task domain.
- Deliver information with machine-like speed and accuracy.*
- Maintain user profiles to deliver personalized services.*

* beyond-human aspects

Figure 4.1: The twelve characteristics of naturalness that designers deem important

Sound Like a Human Speech

Six participants (P2, P5, P6, P7, P13, P15) mentioned that utterances of a natural VUI should have characteristics of spoken language as opposed to written text. For example, people tend to use more abstract words and complex sentence structures when writing [85, 86].
Specifically, a natural VUI should use simple words: “When you’re writing it down, and you say it out loud, sometimes you realize that whatever you’ve written down is way too long or has way too many big words.” (P6) However, our participants said that simple words do not mean less formal words such as slang or vulgar expressions, but rather more typical words that the persona of the VUI would use. P4 mentioned that he avoided using words that are “too casual” for his virtual doctor application because “people would take it more seriously if they felt that it was a natural doctor.”

A natural VUI should also incorporate filler words [87], breathing, and pauses: “You need to introduce those pauses...conversation bits, like, ‘Um’, ‘Like’, ‘You know’ makes the conversation more natural.” (P13) The participants mentioned that the patterns for breathing and pausing should give the impression of a ‘mind’ in the VUI, and make the user feel more like they are talking to a human: “Pauses before something, like a joke...we need to create an anticipation for the final jokes.” (P15)

Understand and Use Variations in Human Language

Human language is immensely flexible, and we can express the same request in countless ways. Thirteen participants (P1, P2, P5, P6, P7, P9, P10, P11, P16, P17, P18, P19, P20) mentioned that a natural VUI should understand various synonymous expressions spoken by users, such as: “‘Increase the volume.’, ‘Turn up the volume.’ the end of the day, you just want [it] to increase the volume.” (P16) The participants considered VUIs that heavily restrict what users can say to be unnatural and of diminished value: “...if you are instructing people to speak in a certain way, that’s not how, I feel, voice technology can be used.” (P16)

In addition to understanding varied expressions, 4 participants (P6, P7, P17, P20) mentioned that a natural VUI should be able to respond using a varied set of expressions to avoid sounding repetitive.
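
Both halves of this characteristic can be illustrated with a minimal sketch. It assumes a toy exact-match phrase table in place of a real language-understanding service: several surface forms resolve to one intent, and replies rotate so that the same phrasing is never used twice in a row. The names `INTENT_PHRASES`, `REPLY_VARIANTS`, `match_intent`, and `next_reply` are hypothetical and do not come from the participants' systems; the example phrases are borrowed from the P16 and P6 quotes.

```python
import random

# Hypothetical phrase table: several surface forms map to one intent.
INTENT_PHRASES = {
    "volume_up": {"increase the volume", "turn up the volume"},
}

# Hypothetical interchangeable replies; the wording follows P6's example.
REPLY_VARIANTS = ["Good choice!", "Great choice!", "Awesome!"]


def match_intent(utterance):
    """Return the intent whose phrase set contains the normalized utterance, else None."""
    normalized = utterance.lower().strip(" .!?")
    for intent, phrases in INTENT_PHRASES.items():
        if normalized in phrases:
            return intent
    return None


def next_reply(previous, rng=random):
    """Pick a reply at random, excluding the one that was used last time."""
    candidates = [reply for reply in REPLY_VARIANTS if reply != previous]
    return rng.choice(candidates)
```

Passing the previously returned string to `next_reply` guarantees a different variant on the next turn, while `match_intent` returns `None` for a request outside the table, which a production system would more plausibly handle with a trained language-understanding model than with exact matching.
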
“...we don’t always say, ‘Good choice!’, we say, ‘Great choice!’, or we say, ‘Awesome!’” (P6) When users repeat the same input, a natural VUI should respond to it using different expressions: “So if you [the user] go through the application more than once, the structure [of the dialogue] is similar, but you will always hear different sentences.” (P7)

The participants mentioned that variation of expressions in human language should be considered within the context of the target user. For example, factors such as different age groups and even individual differences need to be considered: “...we found out like hanging up the house phone, and dial the phone clockwise. These are all words that older people use, while we don’t anymore.” (P8) Age aside, the same expression can mean very different things when coming from different individuals: “If I say ‘it’s hot’, then it’s different from you say ‘it’s hot’, right?” (P5)

However, there are certain use cases where language variation is undesirable. This characteristic is less important when the primary purpose of a VUI application is to hold a transactional conversation that helps users with their tasks, and the target users of the application are not the general public, but rather people from certain professions such as police officers or firefighters. This is because such target users are often trained to use special keywords and have a fixed workflow for faster and more efficient communication. Hence, in order to help them effectively, the application should stick to a fixed vocabulary: “So one of the main users of this type of application will be police or fire ambulance [drivers]...They’re used to a very rigid command set. So they’re always saying things in the same way.” (P2)

Use Appropriate Prosody and Intonation

Prosody “refers to the intonation contour, stress pattern, and tempo of an utterance, the acoustic correlates of which are pitch, amplitude, and duration.” [88].
Eleven participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13, P18) said that a natural VUI should present messages clearly with the appropriate prosody: “For me, it’s important to put the right intonation in some parts of the text to make it clear.” (P3) However, they highlighted that the appropriate prosody can differ across age groups. Many participants reported that Amazon Alexa, a popular commercial voice assistant, is “way too fast” (P11) by default and that prosodies should be modified for seniors: “...for seniors, you may want to slow down the speed and potentially increase the volume or put an emphasis on the words.” (P2) Since there is no voice customized for seniors, P11 had to put a break for each sentence manually. “’s like, ‘Okay, here’s your calendar!’, another break. ‘Today you have...’, a slight pause, like point two, point three-second pause.” (P11)

Collaboratively Repair Conversation Breakdowns

During verbal communication, we often encounter small conversation breakdowns when people do not respond in a timely way or do not understand what each other said. Four participants (P3, P6, P7, P8) mentioned that a natural VUI should solve these kinds of conversation breakdowns in a way similar to how humans collaboratively solve them by asking each other: “Like in this conversation, how often did we already say, ‘I don’t understand you’ or ‘What do you mean?’ or ‘Can you explain more?’.
That’s already for humans that way...and the robots and the user interfaces have to learn from humans.” (P7)

A VUI is considered as very unnatural and machine-like if it repeats the same information when the conversation breaks down: “...if you don’t understand something sometimes, [and if] the system just keeps repeating the same information [to get the response from you], just like a robot.” (P3)

4.1.2 Social Conversation Characteristics

While the main focus of our participants was on task-oriented applications as mentioned in section 4.1, they also emphasized the importance of providing positive social interactions to users. Humans have social conversations to build a positive relationship with each other [89]. In the VUI context, designers incorporate the social conversation characteristics for providing harmonious and positive interactions, a more realistic feeling of conversation, and a feeling of being heard.

Express Sympathy and Empathy

Ten participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13) mentioned the importance of providing empathetic responses to users’ sentiments to maintain harmonious interactions: “...if I know your favorite team won, I’d have a happy voice. If I know your favorite team lost, I’d have a sad voice.” (P2)

Most of the participants’ elaborations on this part were focused on showing sympathy when users experience negative sentiments.
The participants try to make the voice assistants console the users and present empathetic voices when users feel negative: “If they respond negatively, the Alexa responds, ‘Oh, I’m sorry to hear that.’” (P4)

Beyond being sympathetic, the participants even actively try to soothe users’ feelings in situations when they feel heightened emotions such as anger: “You have a calm reassuring voice when they’re upset because there’s a traffic.” (P9)

To find out if users feel negative, the participants use user responses, their personal information (e.g., their favorite sports teams) and the location of the conversations (e.g., hospital). No participant mentioned using a sentiment analysis approach, and one designer specifically mentioned that using such an approach required too much of a time commitment: “I don’t have time to know the APIs that can do sentiment detection.” (P11)

If there is no way to detect users’ real-time sentiments, then our designers choose to use a “flat voice” (P3) to prevent the happy voice of a voice agent from upsetting users who are currently feeling down, as suggested in [72]: “You have to control the tone of voice, because you can’t sound very enthusiastic, things like that, because you never know the situation of the person on the other side.” (P3)

Express Interest in Users

Four participants (P4, P6, P11, P12) said that they incorporate greetings, compliments, and words that express interest in the user. These words make the conversations appear “real-ish” (P11), and make users feel important: “I think the benefit of providing this type of responses, instead of just blank ones, is that it actually helps the person feel like their responses actually got heard.” (P4)

P11 and P6 mentioned that VUIs can even make users feel as if they have personal connections with the applications by providing daily greetings or feedback on users’ actions: “I’ll say, ‘See you tomorrow!’.
Little snippets of humanity.” (P11) “We could just say the recipe steps and all of that, and not have to ask questions like, ‘How’s the spice?’ and all of that, but if we do, then there’s some kind of personal connection.” (P6)

Be Interesting, Charming and Lovable

Social conversations include humour and gossip, which fulfill hedonic values [90, 91] that transactional conversations do not contain. In order to bring more user engagement to task-oriented applications, 4 participants (P1, P7, P13, P17) reported trying to write more interesting dialogues and to create a charming persona: “Interactive means using some good words. Something which sounds interesting to the user.” (P17)

The importance of being entertaining was emphasized, especially when the target users are children. “So, when it’s a kid’s application, you respond back in a very funny way. You use, terms like ‘Okie Dokie’.” (P13)

P7, who created an Alexa application for resolving conflicts between children, mentioned that the persona does not need to be loving and nice to be charming. It can be sarcastic and funny, instead: “She’s not loving and caring, but she’s maybe a little sarcastic. She makes fun of what they say and I would say she’s lovable, not loving.” (P7)

When the system fails to accomplish what the user asked for, designers mentioned that a VUI persona that presents socially preferable behaviours can abate negative emotions from users: “...if I had the sense that it understood its own limitation as opposed to telling me it can not do something...I’ll be more flexible with it.” (P1)

4.1.3 Transactional Conversation Characteristics

As mentioned in section 4.1, all participants, except one, grounded their answers in their design experiences for task-oriented applications. For transactional conversations, our designers want to achieve naturalness by leveraging the way users are used to exploiting verbal interactions with others to get things done.
P9’s example is especially illustrative: “To me, the greatest benefit [of voice interaction] is that it’s an interface that someone will naturally know how to use and won’t have to learn how to use a new [command]. Man, I know people, especially kids, are great at using phones and all the things they learn, but I really like how we can do [many things] with voice. One example I like to give is that our internet was down, and we had to reboot the router or whatever, and I was wondering if it worked, so I just said ‘Alexa, are you working?’ and it says ‘Yes, I’m working’ before I even thought about it.” (P9)

Interestingly, not all of the conversation characteristics they identified in this category referred to human-like conversation characteristics. Our designers considered machine-like speed and memory as characteristics that a natural VUI should have for transactional conversations, which we label as “Beyond-human” aspects. This was surprising given that existing notions of naturalness are primarily based on being human-like. In other words, the designers’ expectations of naturalness in VUIs extend beyond providing realistic conversation experiences, and include machine-specific benefits: “So I guess a natural agent would be close to having a real conversation with someone, but with all the added benefits of an actual application.” (P12)

Proactively Help Users

Eleven participants (P2, P3, P8, P9, P10, P12, P13, P15, P16, P18, P20) mentioned that a natural VUI should be efficient, and proactively “detect or even ask for the things that [the user] needs.” (P12)

P18 said a VUI should not wait for a command. If someone says they have a problem, a natural response for humans is to ask if the person needs help, even when not explicitly asked: “From a linguistic perspective, ‘Could you help me with my software?’ is a yes-no question. ‘I have a problem with my software.’ is not even a question yet.
So for ‘I have a problem’, bots need to be more proactive and ask a question, ‘Could I help you with the software?’” (P18) Hence, a natural VUI should understand the meaning behind the statement and take action proactively to help the users.

The efficiency of a VUI was not strictly measured by the total time taken for the task. Instead, the participants considered the quality of the results that users would obtain relative to the number of conversation turns that they took to finish: “...not necessarily as short of a time as possible. But, something that makes sense and it is value-driven for me as a user. So that means if I’m engaging in multiple [conversation] turns just to get more, I guess, more valuable information on my end, I’m okay with that.” (P12)

Relatedly, a natural VUI should avoid overloading users with excess information, and instead, it should minimize the number of conversation turns: “...you do not become a robot who can keep going on and on and on and on about all this information...You should not overload the user with a lot of information.
You should try to cut down as many decisions for the user as possible.” (P13) The number of conversation turns can be minimized by proactively “asking them [users] less and less and assuming more...” (P13) To ask fewer questions, a natural VUI should actively make decisions based on contextual information: “But if the user tells me the zip code correctly, I don’t ask him for city and state, I use some libraries to find the name of the city and state...We need to have a record of the entire conversation from top to bottom.” (P13)

Even though minimizing the number of questions is important, if the consequence of failing the task is considerable, a natural VUI should ask the user to confirm: “ if I say things like ‘You wanted your checking account. Is that correct?’ and I say ‘No, I want my savings account’ then that to me, that confirm and correct [strategy] is a very important part in making it more conversational.” (P10)

Present a Task-appropriate Persona

Four participants (P3, P4, P5, P13) said that a natural VUI application should present an appropriate persona for certain tasks. The tone of voice should match the application’s purpose to increase user trust and elicit more useful responses. For example, P5 mentioned that for financial applications, the voice agent should sound serious to make the application feel more reliable: “So when you’re creating these prompts, every company has a different tone of voice...Like do you want the machine to be quirky? Do you want a machine to be very serious? If you’re talking about your wealth management, you don’t want to have a fun guy. It has to be serious.” (P5)

As another example, P4 designed an application for collecting elders’ health status. He tried to make the voice agent sound like a real doctor as much as possible. This was done to ensure that users take the task more seriously and report their status correctly.
“ if someone was visiting their doctor and asking the was better than making it seems like you were having a conversation with a friend, because it was kind of a serious topic dealing with...people would take it more seriously if they felt that it was a natural doctor, something like that.” (P4)

Be Capable to Handle a Wide Range of Topics in the Task Domain

Nine participants (P2, P5, P6, P7, P9, P10, P11, P16, P17) mentioned that a natural VUI should not only be able to respond to the questions directly related to its task but also be able to handle a wide range of topics within the domain of its task: “I would think it [a natural VUI] would need to handle anything that is specific to that institution, right? If I call Bank of America and ask about my Bank of America go-card, you know you need to understand me.” (P10)

A natural VUI should also be able to handle changes in conversation topics as long as the topic belongs within the task domain. “Let’s say I want to book a table for three people, and I said [to a waiter] ‘I want to book a table.’ and [a waiter asked] ‘For how many people?’ and what if I say, ‘What do you have outdoor seating?’ I didn’t answer the question. I didn’t say like six people or two people, because I need to know this other piece of information, but I also didn’t say like ‘How tall is Barack Obama?’” (P9)

When a user brings up a topic that is beyond the task domain, a natural VUI should still continue the conversation and remind the user about the task domain in which it can help: “...if a person says ‘I want to order a pizza’, and your skill [Amazon VUI application] has no idea what that is...give them a helpful prompt saying ‘This is the senior housing voice assistant.
I can help you with finding when the next bus is, or finding when the next garbage day is, or this or this.” (P2)

To make users aware of the boundaries of the serviceable topic domain, the designers recommended preemptively providing context to users to help them understand what they can do with the application: “A lot of people make a mistake in the design by saying ‘Welcome to Toyota. How can I help you?’ And it’s like you’re going to fail right there because that’s so open-ended. No one will have an idea of what they can or can’t say. They will probably fail. So you have to be really ‘Welcome to Toyota’s repair center! Would you like to schedule an appointment?’” (P9)

However, when the target users are children, the designers should expect them to ask a lot of questions outside of the task domain: “ What’s your favorite color, Alexa?’ and they [children] would like to shift the conversation or just like go to a totally different topic.” (P7)

Beyond-human Aspect #1: Deliver Information With Machine-Like Speed and Accuracy

Our designers mentioned that, to accomplish its tasks in an efficient manner, a natural VUI should incorporate machine-specific attributes such as high processing power, and selectively mimic certain parts of human conversation instead of pursuing every aspect of a natural human conversation: “This machine will be able to talk to us as if it was a human of course...more efficient, of course, know you have a couple of these rules, but that’s how I personally define natural.” (P5)

P12 mentioned that a natural VUI should attain the human-level ability to maintain conversational context while being able to deliver accurate information in a blazing fast manner: “So it’s just super-fast processing times, being able to deliver information while maintaining conversational context.” (P12)

Our participants described natural human speech as often being indirect and inefficient, so these aspects of human conversation should be left out when designing for a natural VUI.

“There are
so much more words like in the human version of asking, but it sounds human. You know, it’s not as direct or not as efficient, but there is a kind of like a personality behind it, I guess?” (P1)

“Oh, no less conversational, because you don’t want...something that you’re using every day. You don’t want to have that be chatty and friendly right? You want to get your work done. so you know concentrating on being efficient and giving them the information and exactly the way that they want it.” (P10)

Beyond-human Aspect #2: Maintain User Profiles to Deliver Personalized Services

Humans’ memories are volatile in contrast to machine memories. Designers mentioned that a natural VUI should store a vast amount of personal information of users such as personal histories or family relationships to combine this information all together and provide a customized experience for users.

“We customize all the knowledge of the user.” (P8)

“We personalize things and make things fit each user. Suppose you have an allergy or specific dietary requirements, then we could filter out all of those recipes and only suggest you the recipes that fit your needs.” (P6)

However, designers are, of course, aware that storing a huge amount of information comes with concerns about privacy. So the importance of transparency on what data is stored was highlighted: “You need to be transparent about the collected and stored data.” (P8)

4.2 Designers Experience Challenges

In order to inform where and how we should invest our efforts for future technological and theoretical advancement, we asked our participants what makes designing for a natural VUI the most challenging. We grouped the challenges they described and ordered them in a list by the number of participants who mentioned the issue.
To contextualize their challenges, we also asked them to describe their design practices.

We found that the designers largely follow a user-centered design process that includes three phases, namely a user research phase, a high-level design phase, and a testing phase, each described next. This matches the VUI design process laid out by Google [92]. During the user research phase, they conduct user research to determine requirements for their applications, collect the user utterances, and create the personas for their applications. This phase involves multiple user observations and interviews. In the high-level design phase, designers create high-level designs such as sample dialogues and flowcharts of dialogues for their applications. Designers interactively develop high-level designs through collaborations with other designers. During this phase, while writing the dialogues, designers create the audio outputs of the dialogues and modify the audio and dialogues more or less in parallel. In the testing phase, designers create prototypes and conduct user tests to validate their high-level designs and iteratively develop real VUI applications through multiple rounds of user testing.

Here we present the most prevalent challenges, as reported by our participants (Figure 4.2), in designing a natural VUI. For each challenge, we illustrate how it is related to the twelve characteristics of naturalness described in Figure 4.1.

1. Synthesized Voice Fails to Convey Nuances and Emotion

The same sentence can convey different meanings depending on the way one narrates the text. Elements of paralanguage, such as intonation, pauses, volume, and prosody, are essential components in expressing subtle nuances and emotions in speech. During the high-level design phase, to make a VUI sound natural, the designers want to have control over the way the speech synthesizer will narrate their dialogues to the users.
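
As a rough illustration of the kind of control designers want, prosody is commonly specified by wrapping dialogue text in Speech Synthesis Markup Language (SSML). The sketch below assumes a synthesizer that accepts the standard `prosody` and `break` elements; the helper names `pause` and `senior_friendly_ssml` and their default values are hypothetical, chosen to mirror the slower, louder delivery and the 0.2 to 0.3 second pauses that P2 and P11 described for senior users.

```python
from xml.sax.saxutils import escape


def pause(seconds):
    """Return an SSML break element for a pause of the given length in seconds."""
    return f'<break time="{int(round(seconds * 1000))}ms"/>'


def senior_friendly_ssml(sentences, rate="slow", volume="loud", gap=0.3):
    """Wrap sentences in a single prosody element and separate them with pauses.

    The rate, volume, and gap defaults are illustrative assumptions, not
    values recommended by the study.
    """
    # Escape &, <, and > so that dialogue text cannot break the markup.
    body = pause(gap).join(escape(sentence) for sentence in sentences)
    return (
        f'<speak><prosody rate="{rate}" volume="{volume}">'
        f"{body}</prosody></speak>"
    )
```

For instance, `senior_friendly_ssml(["Okay, here's your calendar!", "Today you have..."])` yields one `speak` document in which the two sentences are separated by a 300 ms break. Each attribute still has to be tuned by ear, one value at a time.
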
However, 9 participants (P1, P4, P5, P7, P8, P10, P13, P18, P20) reported that current speech synthesis technology is lacking the expressivity to interpret the intended meaning of the dialogue text and convey it to the user. They felt that even the best speech synthesizer still sounds like “just a robo-voice” (P4) or like “just putting the sounds together” (P18) rather than “really meaning it [the script].” (P18) They think that the voice synthesis technology has a large gap to bridge, saying “there’s a long way to go for it to become very expressive.” (P7)

Primary 10 Challenges in Designing a Natural VUI
1. Synthesized voice fails to convey nuances and emotion.
2. SSML is time-consuming to use while not producing the desired results.
3. Existing VUI guidelines lack concrete and useful recommendations on how to design for naturalness.
4. Writing for spoken language is difficult.
5. Reconciling between “social” and “transactional” is hard.
6. Conveying messages clearly is difficult due to the limitation of synthesized voice.
7. Handling various spoken inputs from the users is difficult.
8. Impossible to capture all the possible situations.
9. Difficult to capture the users’ emotions.
10. Difficult to understand users’ perceived naturalness.

Figure 4.2: The 10 challenges that designers are currently encountering in designing natural VUIs

Hiring voice actors who can narrate the script in a natural tone and flow of a “real voice” was reported by many participants (P3, P6, P7, P8, P9, P13) as a common solution to make a VUI sound natural. However, recording audio is considered to be significantly limited in flexibility and scalability. When there is a need to change the narration, editing audio of a recorded speech is more laborious than re-synthesizing audio from an edited text: “...if we discover during research there are more words, then we have to hire that actor again to speak those words again.
So it was not practical at all.” (P8)

Moreover, using pre-baked recordings was not a scalable solution to generate spoken utterances of modern VUIs that are required to handle a wide variety of data and conversation context: “Of course, when you are using a real voice actor, it’s impossible because I should record every single street name that we have in Brazil.” (P3)

This challenge is more severe for non-English languages. “Also the way she [Amazon Alexa] is speaking for us in German it sounds very ironic and sarcastic, her tone of the voice.” (P18) “[The] Dutch language is not so developed yet for Google...English words, sometimes Dutch people use English words, but if Google is [speaking] in Dutch, then it’s sometimes complicated [not correct].” (P8)

Due to the limited expressivity of the synthesized voices, for the applications that need to convey human-like emotions, our designers reported that they often had no other choice than using their own voices to record the audio. For example, P1 found that the currently available synthesized voices were not good enough to express the nuanced emotions that he desired to express for his storytelling application: “...there are some subtleties that I couldn’t get Alexa to feel nostalgic about, you know, there is no command like nostalgia about the house party that you first met this guy that you are still in love with at, you know?” (P1)

Interestingly, for the applications that do not need to express emotions, the importance of the expressive voice was not highlighted, and was even downplayed. Being humorous is often considered to be relatively human-specific behaviour. P1 said that using Alexa’s “robotic” voice for telling a joke creates irony and makes the situation funnier.
Hence, he used Alexa’s voice for his VUI application in which Alexa asks about the users’ feces every day: “Alexa asks a question that only a human would ask and it’s like a very human written skill [Amazon VUI application] that sounds like very robotic and because you’re talking about poop because it’s like a joke...There’s something I think funnier about it.” (P1)

Relation with the characteristics of naturalness: The difficulty in expressing nuanced emotions inhibits the designers from achieving the social characteristics of the VUIs listed in Figure 4.1. P8 found that the synthesized voice used by IBM Watson was not able to produce natural laughing sounds and posed a risk of misrepresenting itself with unintended negative social expressions: “The robot can not laugh, because if the robot laughs and you just say, ‘ha, ha, ha’ it sounds sarcastic...older adults actually feels like Alice [the robot] is laughing at them. So that’s bad.”

2. SSML Is Time-Consuming to Use While Not Producing the Desired Results

Designers can emphasize parts of a dialogue or modify the prosody of the audio, such as pitch, volume, and speed, by using Speech Synthesis Markup Language (SSML) during the high-level design phase. Tech giants such as Google, Amazon and IBM continuously develop and support their own sets of SSML tags. However, many of our participants (P2, P3, P7, P9, P10, P11, P13, P15, P16) pointed out that writing and editing SSML tags is “time digging” (P13), while it frequently fails to yield the desired result: “I haven’t been very successful at doing this with SSML.
It makes very minor changes, but it doesn’t come close to what it would be if you use a voice actor, for example.” (P7)

It was difficult to make the whole sentence flow naturally, and it felt “still too mechanical” (P5), even after fine-tuning speech timings by meticulously entering numerical values (e.g., 0.5 seconds of whispering): “I think it’s not very natural, like another 0.5-second break here, another somewhat slower here, all those things.” (P15)

The poor design of SSML authoring interfaces, which made SSML too time-consuming to use, was another point of consternation. Most of our designers were using a simple text editor or generic XML mark-up tools for writing SSML tags. Hence, they had to (re)write and (re)listen to the whole sentence or paragraph even when only making a small change to their dialogues: “I mean even just in the best-case scenario, let’s say you listen to a prompt, you decided that you wanted to change one thing by using SSML. You change that thing. You listen to it again. That’s your best-case scenario, and right there you just spent, I don’t know, couple minutes maybe, and if you have a hundred prompts to do, it’s just not worth it for the small benefit you’ll get.” (P9) Also, it was hard to judge when an SSML tag had reached the optimal level of expressiveness. Hence, our designers often spent a lot of time iteratively modifying SSML tags without knowing when to stop: “Hard to stop, like, I’m not satisfied with what I got there, so I just keep on changing something here and there.” (P15)

Our designers had a hard time using SSML when trying to imbue a narration with emotional expressivity. Designing for a subjective experience of emotion requires holistic control of all prosodic features at the same time. However, SSML only offers control of one prosodic element at a time: “The technology should be mature a little bit for us to have an SSML tag that is empathetic. It has a very subjective nature.
The thing is that it’s not very objective. The objective things are being loud, silent...but this is completely different.” (P13)

Due to these limits of current SSML, most of the participants had abandoned SSML except for making simple changes to the audio, such as inserting breaks, slowing the speed, and correcting the pronunciation of mispronounced words. “In the SSML, I spent some time a while ago, like there is a prosody tag like that. I just tried and tried and, usually it just didn’t do what I wanted it to do.” (P9) “The prosody tag can be really difficult to deal with right?...I haven’t played with SSML for a couple [of] years.” (P10) “It’s been a while since I played with SSML. Well...[the SSML tag for putting a] break I use it all the time still.” (P7)

If the application targets multiple platforms (e.g., Google Home and Amazon Alexa), using SSML takes even more time. Even for the same tag, the resulting voices can differ between platforms, and the same set of SSML tags might not be available on every platform. This requires designers to test their SSML tags on each platform: “Different speech synthesizers are going to have different packages, so I want to be able to play with the SSML before I decide on how this is going to work.” (P10)

Relation with the characteristics of naturalness: Being able to produce an empathetic voice is essential for having positive social interactions with users, as stated in Figure 4.1. Even though SSML is the only available way of modifying synthesized voices, our designers reported that its limitations in truly creating an empathetic voice, and the significant amount of time it requires, are obstacles to achieving the social characteristics.

3. Existing VUI Guidelines Lack Concrete and Useful Recommendations on How to Design for Naturalness

In terms of writing VUI dialogues during the high-level design phase, 6 participants (P2, P5, P6, P10, P11, P19, and P20) mentioned 3 problems with the existing VUI guidelines.

First, they found the existing VUI guidelines easy to dismiss as cliche. More specifically, our designers mentioned that some of the existing guidelines are “somewhat common sense in terms of avoiding using technical language, try and make it casual and simple”, and easy to let go: “I feel like it’s kind of obvious and you know that when you’re creating something like a voice skill...I probably read it once, and I just left it.” (P6)

Secondly, our designers found that the existing design guidelines do not apply to every VUI, depending on the context of the project: “At the same time, I think every company will have its own set of these [Design guidelines]...I mean, some apps are made to comfort people and make them feel less alone, and those guidelines are completely irrelevant, so it does depend on the context.” (P5)

Lastly, they pointed out that the guidelines lack useful linguistic insights: “I feel like linguistic principles are a little more difficult to come by for voice interaction designers.” (P2) Linguists have been discovering hidden patterns of natural human conversation that people use without realizing it. To create VUIs that enable natural conversations with users, VUI designers should know how to use these linguistic insights on natural conversation, as well as how to incorporate them into their design strategies for better user experiences. However, our designers reported that there is a “disconnection” between the linguistic insights and their design strategies.

“So there’s a lot you can do with language, with pragmatics [a sub-field of linguistics] and this is not the rocket science. I mean, we have a lot of research about pragmatics already, since the 70s, since the 80s.
Well, they’re explored in research well enough, but they’re not connected well enough with the IT department.” (P18)

The designers who found the existing design guidelines useful reported that whenever they design VUI applications, they need to make an intentional effort to return to the guidelines and refresh their memory of them: “So for example, I need to work on confirmations. Let me go to refresh my memory on how to do confirmation style...I don’t have to like constantly go back to them [the design guidelines], but I certainly do go back in and look [at them]” (P9). Others reported not applying them because of time constraints: “So there are principles that I work with, but that’s just going to have to come to me at the moment. I’m not going to go back.” (P11)

Relation with the characteristics of naturalness: This challenge hinders designers from creating a VUI that provides a great user experience while achieving the fundamental characteristic of a natural VUI, ‘Sound like a human is speaking’. Our designers requested more specific design guidelines that provide actionable advice on how to connect these linguistic components with design intentions.

“I have no idea on how to use this construct behind these phrases, and how to break it down to use in my favorite design process...The contraction, phonetics, and ‘Umm’...I think understanding how meaning can change depending on how things are structured and organized in the dialogue” [is important]. (P20)

4. Writing for Spoken Language Is Difficult

When the participants write dialogues during the high-level design phase, they often find it hard to write for spoken language. When people talk, they adopt spoken language characteristics such as filler words, personal pronouns and colloquial words [53, 54, 55]. However, they show these characteristics subconsciously.
Hence, when our designers are writing, they often forget to incorporate these characteristics.

“What I want to say is when you just started [as a VUI designer], it’s difficult because you have to really understand that the text you’re writing is not something to be read. It’s something to be spoken. It’s for a spoken interface, so you have to really try to imagine yourself in that situation and, like I said, just write like you are writing a screenplay or something like that.” (P3)

Designers write the dialogues by typing on the keyboard instead of speaking it, and our designers reported that it is often hard to detect unnaturalness of the dialogue just by reading it: “Challenges? So the platform limitation is definitely a challenge. A lot of times, the conversation sounds good on paper, but you really have to just say it.” (P12)

Since we subconsciously use the characteristics of spoken language when conversing with others, P10 mentioned that the unnaturalness caused by writing can be hard to detect, and many people often treat this problem as something insignificant and hence do not put the effort in enhancing it: “I think the hard thing about these kinds of interfaces is that people feel like because they can talk, because they speak English, so they can write one of these interfaces.” (P10)

From our survey on the existing VUI design guidelines, we found that multiple guidelines already exist for supporting designers in writing for spoken dialogues [12, 13].
These guidelines tell VUI designers which characteristics of spoken language they should incorporate in their VUI dialogues. Since people use these characteristics unconsciously, designers who do not return to the guidelines and verify their dialogues manually often forget to apply the rules when writing. For example, even though there is a design guideline asking designers to avoid putting too much information in one line [93], designers often make mistakes: “The frequent mistake is people are giving a big amount of the texts and yeah just read it.” (P18)

This makes designing a natural VUI challenging. However, P3, a professional VUI designer, mentioned that this could be overcome as designers gain more experience: “At first, it is difficult, because, like I said, normally when you’re writing, you’re writing for someone to read it, not to speak it...[It] took something like maybe three months or so to really get used to this way of writing, the practice of writing dialogue.” (P3)

Relation with the characteristics of naturalness: This difficulty hinders designers from creating VUI dialogues that sound like spoken dialogues, and it thus runs counter to the fundamental characteristic of a natural VUI, ‘Sound like a human is speaking’.

5. Reconciling Between “Social” and “Transactional” Is Hard

Our designers tried to embrace the characteristics of social conversation to provide more realistic and interactive conversation experiences to their users. However, since most of them were designing task-oriented applications, they found that the goals of such applications often conflict with the desire to embrace social conversation characteristics.

Five designers (P5, P11, P15, P17, P20) mentioned their desire to add more components of social conversations to their VUI designs to make them more realistic.
However, in an attempt to do so, they found that the dialogue gets longer and it conflicts with the task-oriented goal to complete the tasks efficiently: “So obviously I wanted to write the dialogues that felt [like a] human [is speaking and] didn’t feel robotic, but I soon realized that things are more complicated. The more you want to add personality to things, then the longer becomes your dialogue.” (P20)

So, our designers felt that the two categories of the naturalness characteristics often don’t get along with one another: “...efficient, but it has to come up as like friendly [and] conversational.” (P5) “Challenges, I told you earlier, challenges are keeping it simple, yet interactive. It should sound familiar. It should sound friendly. It should not go out of the voice, so like that.” (P17)

Relation with the characteristics of naturalness: This often confused our designers when they were writing the dialogues and often made them give up incorporating social characteristics. “But again we’re still thinking, we’re still in the process of, ‘Should we actually put in those little sentences [for having social interactions] or not?’” (P6) “I would prefer, right now, to focus more on helping people achieve their goals and move on with their lives more than a kind of having these artificial entities talking to me in slang.” (P20)

6. Conveying Messages Clearly Is Difficult Due to the Limitation of Synthesized Voice

Our designers (P4, P8, P9, P12, P14) are facing a challenge in conveying precise meanings through synthesized voices. This is because synthesized voices often mispronounce certain types of words and sentences, and are not able to produce proper intonations and tones.

Our designers elaborated on three specific problematic cases of mispronunciation.
Synthesized voices often fail to: (1) put proper breaks when pronouncing long sentences: “...some words just kind of got mashed together, like into one whole long word, and it [the long sentence] sounded weird.” (P4); (2) pronounce contractions clearly: “That’s what it’s supposed to say, ‘What’ll it be?’, and then when it [Amazon Alexa] actually says it, it will be like, ‘Whatill it be?’” (P12); and (3) pronounce proper nouns such as names of cryptocurrencies and people: “...she can’t pronounce ethereum [a cryptocurrency] and mmm that’s a popular one.” (P14), “Names of people, Google says it differently.” (P8)

Our participants also elaborated on two problematic cases of inadequate tones and intonations: (1) sentences with a question mark, and (2) non-lexical words. P9 mentioned the difficulty of producing natural sounds for interrogative sentences ending with a question mark. “It’s so hard sometimes to get the text to speech to ask a question in a way that makes sound [...] I changed the question mark to a period, and then it said ‘Which one would you like?’, which is more how a person would actually say it, because a lot of times when we ask a question, we’re not doing a rising intonation.” (P9) P8 reported that synthesized voices do not produce proper tones for non-lexical words (i.e., words that do not have a defined meaning) such as laughter. Her design intention was to make her robot laugh with a happy tone, but it had a sarcastic tone instead: “The robot can not laugh, because if the robot laughs, and it just says, ‘Ha-ha-ha’, it sounds sarcastic.” (P8)

Relation with the characteristics of naturalness: The limitations of speech synthesis systems (i.e., mispronouncing words and sentences and not being able to produce proper intonations) hinder designers not only from attaining the fundamental characteristics of a natural VUI (‘Sound like a human is speaking’ and ‘Use appropriate prosody and intonation’) but also from achieving one of its social characteristics (‘Express sympathy and empathy’).
This is because non-lexical words are essential for social interaction in human conversations. People feel mutual understanding and compassion towards each other when they laugh together. Hence, without the ability to produce natural laughing sounds, it is more challenging to provide harmonious interactions with users.

7. Handling Various Spoken Inputs From the Users Is Difficult

Four participants (P2, P6, P9, P11) mentioned that the current NLU engine is still not good enough to understand the various expressions of our language, and this requires VUI designers to provide possible synonyms and expressions for training the NLU engine when they are writing dialogues: “Yeah, I still need to interview them [the VUI users]. I will, just because I don’t think the natural language engine is good enough to be able to figure out all the different ways you can ask for a bus.” (P2) However, they often found that the collected synonyms and expressions do not cover all the possible ones: “I don’t necessarily know what the users are going to say back. So I’ll make up something that you might say but that’s certainly not the same as a user who’s never used it [the VUI application] before.” (P9)

Designers pointed out that understanding human vocabulary can be especially challenging due to its personal aspect. The same word can be used to express different meanings depending on the conversational context (e.g., age and individual differences). Hence, designers mentioned that user utterances should be understood in a personalized context: “You need to personalize the VUI to a user...You can not talk with Alexa if you’re older because Alexa won’t understand you.” (P8)

Relation with the characteristics of naturalness: Understanding the specialized vocabulary that older adults use seems to be critically needed. The lack of capacity to correctly understand various words from users risks additional interaction breakdowns and results in more effort from designers to collect additional user utterances.
This prevents designers from achieving the fundamental characteristic, ‘Understand and use variations in human language’. However, as mentioned in that section, when the target audience of the application is a population that uses special terminology, this challenge may have less impact.

8. Impossible to Capture All the Possible Situations

Our participants reported that VUIs provide more flexibility in terms of user inputs than other user interfaces such as graphical user interfaces or text-based user interfaces: “...there’s so much more flexibility that you have in voice than with the searches [text-based search interfaces]” (P12)

Our designers (P2, P8, P13, P16) reported that this great amount of flexibility makes it hard for VUI designers to anticipate and prepare for all possible scenarios that can occur during user interaction: “I also have to worry about how to handle the language, if they say something completely different from what I expected.” (P2)

We found that our designers use two strategies to prevent conversation breakdowns caused by unexpected inputs. First, they strive to narrow down the set of available conversation pathways in advance, when the task context is ‘predictable’ (P6) and ‘typical’ (P3): “Normally, when you’re designing an interface like that, you have three typical errors that we have to prevent.” (P3) Second, they preemptively guide the users to ask questions only within the service domain. “We can’t handle all those things. So you really need to know how to guide the conversation to get the person to know what they can say, and help them say in a way that your technology can actually handle.” (P9)

P8, P13 and P16 mentioned that this is especially challenging when designing for children: “People can say anything.
Children can say more than anything...they can go random into anything” (P16) “I think that’s very important and that’s most challenging for me to anticipate what each of kids will say at each of the points because they’re very unpredictable.” (P13)

Even though most situations are predictable and can be handled with a common strategy, our designers still feel challenged in designing for children, because children do not follow the voice assistants’ guidance: “...they don’t follow instructions, usually. So they can go randomly into anything.” (P16)

Relation with the characteristics of naturalness: This challenge may hinder designers from attaining the fundamental characteristic of a natural VUI, ‘Collaboratively repair conversation breakdowns’, especially when the target VUI users are children.

9. Difficult to Capture the Users’ Emotions

Our designers found it challenging to write VUI dialogues that are empathetic to VUI users’ emotions, because they lack a way to capture those emotions and incorporate the detected emotions into their VUI dialogue designs. Hence, our participants mentioned that in their practice, they tried to avoid making their VUI assistants sound too happy in case their users are experiencing negative feelings: “You have to control the tone of voice because you can’t sound very enthusiastic, things like that because you never know the situation of the person on the other side.
You don’t know if the person is really emotionally ill or something more serious is happening at the time that the person is calling and interacting with the system.” (P3)

Our participants wished for a VUI design tool where they can write VUI conversation flows depending on the detected emotions of VUI users: “I think it would be good to identify emotions...More useful [thing] would be to detect emotional content on an utterance [from a VUI user] to give you the context” (P2)

Relation with the characteristics of naturalness: For having harmonious social interactions with users, understanding users’ emotions and providing truly empathetic messages are essential. Misunderstanding users’ emotions and saying something that breaks the emotional context may hinder designers from achieving the social characteristic, ‘Express sympathy and empathy’.

10. Difficult to Understand Users’ Perceived Naturalness

Our participants (P6, P12) reported that they lack a practical way to find out how VUI users really perceive the naturalness of VUI applications: “It’s also hard to detect if people find that, at scale, if things are unnatural. Because there’s, at least at this point, no real way of asking users, ‘Does this conversation seem natural?’ at the end of their experiences.” (P12)

Our participants mentioned that even with the best effort for creating natural VUIs (e.g., following guidelines for creating natural VUIs), VUI users’ real experiences with the applications could be different from designers’ expectations: “I mean the designers and project strategists might have an idea of what the ideal conversation is, and like, yeah, everything makes sense. But when you actually deploy [the VUI applications], the users are going to come back to you and say, ‘No, no, no, no, we don’t care about any of this.’” (P12)

There is built-in support from Google for letting VUI users rate VUI applications with a five-star rating system.
However, our designers found the rating results are not comprehensive enough for understanding their users’ experiences: “Google handles that by giving you a review...Option to review the application. If you’ve interacted with the Google action [a VUI application developed with Google’s platform]. Even that isn’t comprehensive.” (P12)

Instead of using the built-in systems from Google or Amazon, designing voice assistants to ask VUI users about their experiences was mentioned as an alternative solution for this problem by our participants: “I think down the line we would actually collect the responses [from the users]...For our feedback, we’d say, ‘Could you rate it one to five?’, and then we’d say, ‘If you want, you can say a little more about it.’, and then we’d start recording it.” (P6) However, P12 doubted that many people would answer the feedback question: “Unless we explicitly asked them, like some QA [quality assurance] tools do for web applications. But, are people going to stick around to answer those questions? Maybe not.” (P12) Instead, our participants reported the lack of tool support for understanding the perceived naturalness by VUI users: “It’s a lot harder to, I guess, find out if the actual conversation is natural, per se, at scale, because there just aren’t a lot of tools to help with that.” (P12)

Relation with the characteristics of naturalness: This challenge does not particularly hinder designers from attaining a specific characteristic of a natural VUI.
However, this makes it difficult for designers to validate their design practices for creating a natural VUI and makes it harder to choose design directions for enhancing the naturalness of their VUI applications.

Chapter 5

Discussion and Implications for Design

5.1 Discussion

5.1.1 Capturing Naturalness in Context

Despite the lack of a consistent definition of naturalness, our study demonstrates that characterizing the term is a feasible inquiry and that structuring the characteristics with respect to context is important. In our case, the primary context is the role of the conversation in which a VUI is situated: social (i.e., for having a harmonious social interaction) or transactional (i.e., for helping with VUI users’ tasks). Beyond the role of the conversation, additional contextual factors that shape naturalness include the target user: for example, user age (such as children and older adults) and occupation (such as physician vs. general public). Using this range of factors as a lens to interpret the designers’ practices enabled us to characterize where different challenges in design practice stem from. Building on these findings, we ideate approaches to address the identified challenges at the end of the Discussion.

These context-dependent characteristics of naturalness hinge primarily on the role that a VUI is expected to play. For the present, we found two roles, transactional and social, to be the prevalent types that designers could identify and discuss. However, given the pervasiveness of computing in people’s everyday lives [94] and the near-ubiquity of VUI-enabled devices, it is entirely possible that the role of VUIs will be extended and diversified beyond these two. Designers will then need to adjust their conception of naturalness in VUIs to the new roles accordingly.
For instance, if a VUI in an interactive learning system serves the role of an instructor [95], and not that of a task-oriented assistant nor a social companion,

We categorized the characteristics of naturalness according to the role of VUIs, drawing heavily on Clark et al.’s categories of conversation types [28]. Another researcher may leverage dimensions other than the role of VUIs to categorize the characteristics. For instance, a different but complementary approach could involve labeling each identified characteristic with different qualities of natural VUI experiences, such as anthropomorphism, transparency, efficiency, and pleasantness.

5.1.2 Positioning Naturalness of VUI in the Literature

As noted above, roles were not the only determinant of naturalness for a characteristic. The target user demographic was an apparent, though somewhat buried, factor to consider when designing for naturalness. For example, children tend to anthropomorphize VUIs [96] and the designers in our study expect their VUIs to offer children a more entertaining experience. Our designers focused on tailoring characteristics to the specific needs of the user. This is in alignment with the conception of naturalness from Wigdor et al. [21], where the term focuses on the user experience and not the device.

When asked what is natural in VUIs, designers tend to think of naturalness as a quality of the system that can provide a positive experience in the given application context. This concept differs from the existing notion of naturalness that equates it to behavioural realism (being like a human) or considers naturalness to be the Holy Grail of all VUI applications.
For example, the VUI can be designed to produce unnatural sounds for the purpose of entertainment or to use technical jargon for users with specialized occupations (e.g., police and firefighters) as mentioned in P1’s example from section 4.2.1 and P2’s example from section 4.1.1.

It is worth noting that our results provide a conceptual departure from the notion of naturalness as an imitation of the qualities in human-to-human interactions. In their seminal work on mediated communication [97], Hollan and Stornetta claim that creation of electronic media should go “beyond being there”, so that the new tools can offer beneficial interaction mechanisms that humans in the flesh cannot offer, rather than blindly mimicking the reality. Given that most of our findings on ‘beyond-human’ aspects were in transactional characteristics, it can be inspiring to contemplate what are the machine-specific characteristics that can enhance natural social encounters.

5.1.3 Contrasting Naturalness of Transactional vs. Social Agents

Characteristics of naturalness may not transfer across different roles that VUI plays. We identified conceptual groupings of naturalness characteristics bounded by types of conversation (Social or Transactional). For instance, answering a given question with high accuracy and speed may not be conducive to the naturalness of a social conversation, but may be highly desirable and feel perfectly natural in a transactional conversation.
As such, designers should not simply follow base naturalness characteristics; they need to start with the role of the conversation and select the appropriate related characteristics, and sometimes even demote a certain naturalness characteristic when it does not match the target application.

Our finding that naturalness characteristics differ according to the role of a VUI application (e.g., social or transactional) uncovered a significant design challenge that VUI designers are facing: namely, how to reconcile the dual role of helpful assistant and pleasant social interlocutor. Such agents offer a fundamentally transactional interaction that is structured in a conversational format only on the surface [19]. Most of the designers in our study were experienced primarily in designing task-oriented VUIs. When they try to imbue their interfaces with social characteristics, such as empathetic greetings or small chit-chat, they face difficulty in finding an appropriate tradeoff between designing an effective assistant and an affable companion; as P20 put it, “The more you want to add personality to things, then the longer becomes your dialogue.”

5.1.4 Limitations

We tried to recruit designers who worked on various types of VUIs such as Interactive Voice Response (IVR) systems, humanoid robots and smart home speakers such as Google Home and Amazon Alexa. However, more than half of the recruited designers (N=14) had experience only with smart home speakers. Hence, a future study should recruit similar numbers of designers for the different types of VUIs.

Our study also did not verify whether designers’ characterizations of a natural VUI match VUI users’ perceptions of a natural VUI.
Given the scope of this study, the impact on user experience of pursuing the characteristics of a natural VUI defined by our designers is unexplored as well. Finally, even though we recruited both professional and amateur VUI designers, we did not explore the patterns specific to each participant group due to time constraints; doing so could have enriched the findings of our study.

5.1.5 Implications for VUI Design Tools

According to our findings, all of the top 10 primary challenges were related to the high-level design phase, where VUI designers write VUI dialogues and create flowcharts for the dialogues. Delving into the designers’ practices for naturalness in the high-level design phase revealed two critical gaps in the way existing VUI-related technologies facilitate their jobs.

Towards More Natural Sounding Narration

Although fine-tuning the way VUIs sound was given great importance in providing a natural user experience, the VUI scene is lacking in design tools for producing narrations that sound rich and nuanced. Our designers regard SSML and voice talents as the only two approaches available to them, and the pros and cons of the two are complementary: working with voice talents gives designers a great deal of control over para-language (i.e., non-lexical attributes of speech, such as intonation, timing, pitch, etc.), but hiring them is expensive and recorded audio clips are not scalable. SSML lacks sufficient control but is scalable. VUI designers need a solution that is both scalable and offers a lot of control.

The existing interfaces for SSML editing have deep ‘gulfs of execution and evaluation’, a concept introduced by Norman and Draper [98]. In other words, current SSML editing interfaces lack convenient ways of producing natural sounds (‘a gulf of execution’) and of evaluating the naturalness of the produced sounds (‘a gulf of evaluation’). The designers use a simple text editor to implement SSML tags, which is laborious and error-prone.
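
To make the kind of hand-edited markup concrete, the following is a minimal sketch of a prompt that uses the two adjustments our participants mentioned most often, a timed break and a slowed speaking rate. The `speak`, `break`, and `prosody` elements and their `time` and `rate` attributes come from the W3C SSML specification; the helper name `build_prompt`, its parameters, and the sample sentences are hypothetical and chosen only for illustration. Which tags and attribute values a given platform honours is an assumption that would need to be checked per platform.

```python
import xml.etree.ElementTree as ET


def build_prompt(lead_text, slowed_text, pause="500ms", rate="slow"):
    """Build a small SSML prompt: lead text, a pause, then slowed text.

    Hypothetical helper for illustration; not part of any vendor toolkit.
    """
    speak = ET.Element("speak")              # SSML root element
    speak.text = lead_text
    pause_tag = ET.SubElement(speak, "break", {"time": pause})
    pause_tag.tail = " "                     # keep a space after the pause
    slowed = ET.SubElement(speak, "prosody", {"rate": rate})
    slowed.text = slowed_text
    # Serialising through ElementTree guarantees well-formed markup,
    # one of the error sources when tags are typed by hand.
    return ET.tostring(speak, encoding="unicode")


ssml = build_prompt("Your order is ready.", "Which one would you like?")
print(ssml)
```

Even in this small case, changing the pause from 500 ms to some other value means regenerating and re-listening to the whole prompt, which is the iteration cost the participants described.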
The gulf of evaluation is also very deep because the existing tools do not support spot listening of the edited part of a script. Designers must therefore replay the entire sentence repeatedly, which is tedious, just to check whether a minor change of tags yields the desired output.

There is a lot to learn from the way VUI designers guide voice talents to narrate a given script with the intended prosody and timing. Their directions are primarily demonstration-based, such as ghost narration. Hence, demonstration-based prosody editing that leverages our own voices [99] is a promising approach worth investigating. Also, multi-modal cues, such as hand gestures, are effective ways to convey and highlight intended changes to the voice talents. Similarly, graphical or direct manipulation of pitch, accent, and timing will enable faster and easier creation of natural-sounding narration.

Writing for More Natural Spoken Dialogues

One of the primary challenges for VUI designers is to write dialogues with spoken language characteristics, such as frequent use of filler words and fewer big words. Several dialogue design tools exist in the field that offer beneficial features like dialogue mapping and instant speech-based testing, but they are weak in evaluation: they neither assess nor recommend linguistic properties for a given script. Given that designers tend to dismiss this kind of naturalness characteristic easily, a proper scripting interface, similar to proof-reading tools, should warn the user when the dialogue has too many attributes of written language, or even suggest alternative phrasing in spoken language. Such tools could also offer editing suggestions tailored to the target users or the purpose of the application, as the required naturalness varies with such design contexts.

Implementing such tools will require linguistic modelling of spoken vs. written language, computational prediction for evaluating the given script, and alternative searching for recommending different expressions.
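As a rough illustration of the "warn the user" idea above, the sketch below flags a script that looks written rather than spoken. It is a toy heuristic, not a validated linguistic model: the filler list, the thresholds, and the `written_style_warnings` function are all assumptions made for this example, and a real tool would need the linguistic modelling described in the text.

```python
import re

# Hypothetical word list and thresholds, chosen only for illustration.
FILLERS = {"um", "uh", "well", "so", "okay", "hmm"}
MAX_WORDS_PER_SENTENCE = 15   # longer averages are treated as "written"
MAX_LONG_WORD_RATIO = 0.2     # share of "big words" tolerated
LONG_WORD_LENGTH = 9          # a "big word" has at least this many letters


def written_style_warnings(script):
    """Return warnings for attributes of written language in a script."""
    sentences = [s for s in re.split(r"[.!?]+", script) if s.strip()]
    words = re.findall(r"[A-Za-z']+", script.lower())
    if not sentences or not words:
        return []
    warnings = []
    average_length = len(words) / len(sentences)
    if average_length > MAX_WORDS_PER_SENTENCE:
        warnings.append("sentences are long for speech")
    long_ratio = sum(len(w) >= LONG_WORD_LENGTH for w in words) / len(words)
    if long_ratio > MAX_LONG_WORD_RATIO:
        warnings.append("too many big words")
    has_contraction = any("'" in w for w in words)
    has_filler = any(w in FILLERS for w in words)
    if not (has_contraction or has_filler):
        warnings.append("no contractions or filler words")
    return warnings


spoken = "Okay, so I've found three nearby cafes. Want to hear them?"
written = ("Pursuant to your request, the application has successfully "
           "identified three establishments in the immediate vicinity "
           "that correspond to the specified parameters.")

print(written_style_warnings(spoken))
print(written_style_warnings(written))
```

The thresholds would have to vary with the target users and the purpose of the application, which is the context-dependence the designers pointed out.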
Natural language processing techniques are becoming increasingly sophisticated. For example, the F-Score [100] is a linguistic measure of how formal a given text is. The Stanford Politeness API can identify linguistic markers that are indicative of politeness.

Chapter 6
Conclusion and Future Work

Naturalness is a construct that has been loosely defined in HCI. In this study, we articulated its meaning from the designers’ perspective. To this end, we conducted 20 interviews with VUI designers to see how they characterize the term, how they incorporate it in their design practice, and the challenges they face in doing so.

Through an inductive and deductive thematic analysis, we uncovered 12 characteristics of a natural VUI and introduced 3 categories for subdividing these characteristics: ‘Fundamental’, ‘Social’ and ‘Transactional’. While many of the traits we found were human-like in essence, designers mentioned that they saw naturalness in VUIs as a quality that goes beyond the human: in their conception, exhibiting machine-like speed, accuracy, and memory was natural.

In addition to these findings, we also elicited 10 challenges that VUI designers face in creating a natural VUI. We then used the 3 categories (‘Fundamental’, ‘Social’ and ‘Transactional’) as a lens to obtain a deeper understanding of these challenges. As a result, we found that most of these challenges hinder VUI designers from attaining the fundamental characteristic (‘Sound like a human is speaking’) and the social characteristic (‘Express sympathy and empathy’) of a natural VUI.
We also uncovered an interesting conflict between attaining social characteristics and achieving transactional characteristics, which is expected to become more problematic as the role of VUIs expands.

To resolve the primary challenges mentioned above, we developed design implications for future tool support for producing natural synthesized voices and for writing VUI dialogues that incorporate the characteristics of spoken language.

Bibliography

[1] G. Skantze, Error handling in spoken dialogue systems-managing uncertainty, grounding and miscommunication. Gabriel Skantze, 2007.
[2] T. K. Harris and R. Rosenfeld, “A universal speech interface for appliances,” in Eighth International Conference on Spoken Language Processing, 2004.
[3] D. Griol, J. Carbó, and J. M. Molina, “An automatic dialog simulation technique to develop and evaluate interactive conversational agents,” Applied Artificial Intelligence, vol. 27, no. 9, pp. 759–780, 2013.
[4] H. Hofmann, U. Ehrlich, S. Reichel, and A. Berton, “Development of a conversational speech interface using linguistic grammars,” in Adjunct Proceedings of the 5th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Eindhoven, The Netherlands, 2013.
[5] Amazon. (2020) Conversational ai. [Online]. Available:
[6] Shopify. (2020) Ui of the future: Conversational interfaces. [Online]. Available:
[7] Google. (2020) Conversation design. [Online]. Available:
[8] R. J. Moore and R. Arar, Conversational UX Design: A Practitioner’s Guide to the Natural Conversation Framework. Morgan & Claypool, 2019.
[9] M. H. Cohen, J. P. Giangola, and J. Balogh, Voice user interface design. Addison-Wesley Professional, 2004.
[10] Amazon. (2018) Things every alexa skill should do: Pass the one-breath test. [Online]. Available:
[11] INTUITY™ CONVERSANT® System Version 6.0. Lucent Technologies, Bell Labs Innovations.
[12] Google. (2020) Conversation design guideline.
[Online]. Available:
[13] Amazon. (2020) Write out a script with conversational turns. [Online]. Available:
[14] S. Inc. (2020) Natural dialogue and conversation with voice ui. [Online]. Available:
[15] Amazon. (2020) Voice design for alexa experiences: Be adaptable. [Online]. Available:
[16] L. K. Hansen and P. Dalsgaard, “Note to self: Stop calling interfaces “natural”,” in Proceedings of The Fifth Decennial Aarhus Conference on Critical Alternatives, ser. CA ’15. Aarhus N: Aarhus University Press, 2015, pp. 65–68. [Online]. Available:
[17] J. K. Burgoon, “Interpersonal expectations, expectancy violations, and emotional communication,” Journal of Language and Social Psychology, vol. 12, no. 1-2, pp. 30–48, 1993.
[18] E. Goffman et al., The presentation of self in everyday life. Harmondsworth London, 1978.
[19] M. Porcheron, J. E. Fischer, S. Reeves, and S. Sharples, “Voice interfaces in everyday life,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ser. CHI ’18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available:
[20] V. M. LLC. (2020) Google duplex puts ai into a social uncanny valley. [Online]. Available: us/article/d3kgkk/google-duplex-assistant-voice-call-dystopia
[21] D. Wigdor and D. Wixon, “Chapter 2 - the natural user interface,” in Brave NUI World, D. Wigdor and D. Wixon, Eds. Boston: Morgan Kaufmann, 2011, pp. 9–14. [Online]. Available:
[22] B. Gates, “The power of the natural user interface,” Oct 2011. [Online]. Available:
[23] M. F. McTear, “The rise of the conversational interface: A new kid on the block?” in Future and Emerging Trends in Language Technology. Machine Learning and Big Data, J. F. Quesada, F.-J. Martín Mateos, and T. López Soto, Eds. Cham: Springer International Publishing, 2017, pp. 38–49.
[24] V. Braun and V. Clarke, “Using thematic analysis in psychology,” Qualitative Research in Psychology, vol. 3, no. 2, pp. 77–101, 2006. [Online]. Available:
[25] K. H. Davis, R. Biddulph, and S.
Balashek, “Automatic recognition of spoken digits,” The Journal of the Acoustical Society of America, vol. 24, no. 6, pp. 637–642, 1952.
[26] B. T. Lowerre, “The harpy speech recognition system,” CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE, Tech. Rep., 1976.
[27] N. Yankelovich, G.-A. Levow, and M. Marx, “Designing speechacts: Issues in speech user interfaces,” in Proceedings of the SIGCHI conference on Human factors in computing systems, 1995, pp. 369–376.
[28] L. Clark, N. Pantidi, O. Cooney, P. Doyle, D. Garaialde, J. Edwards, B. Spillane, E. Gilmartin, C. Murad, C. Munteanu et al., “What makes a good conversation? challenges in designing truly conversational agents,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–12.
[29] T. Ammari, J. Kaye, J. Y. Tsai, and F. Bentley, “Music, search, and iot: How people (really) use voice assistants,” ACM Trans. Comput.-Hum. Interact., vol. 26, no. 3, Apr. 2019. [Online]. Available:
[30] A. Pyae and T. N. Joelsson, “Investigating the usability and user experiences of voice user interface: A case of google home smart speaker,” in Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, ser. MobileHCI ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 127–131. [Online]. Available:
[31] E. Luger and A. Sellen, ““like having a really bad pa”: The gulf between user expectation and experience of conversational agents,” in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ser. CHI ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 5286–5297. [Online]. Available:
[32] A. M. Turing, “Computing machinery and intelligence,” in Parsing the Turing Test. Springer, 2009, pp. 23–65.
[33] P. Garrido, F. Martinez, and C. Guetl, “Adding semantic web knowledge to intelligent personal assistant agents,” in Proceedings of the ISWC 2010 Workshops. Philippe Cudre-Mauroux, 2010, pp.
1–12.
[34] P. Braunger and W. Maier, “Natural language input for in-car spoken dialog systems: How natural is natural?” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 137–146.
[35] C. Nass, Y. Moon, B. J. Fogg, B. Reeves, and C. Dryer, “Can computer personalities be human personalities?” in Conference Companion on Human Factors in Computing Systems, ser. CHI ’95. New York, NY, USA: Association for Computing Machinery, 1995, pp. 228–229. [Online]. Available:
[36] A. Malizia and A. Bellucci, “The artificiality of natural user interfaces,” Commun. ACM, vol. 55, no. 3, pp. 36–38, Mar. 2012. [Online]. Available:
[37] A. Soro, S. A. Iacolina, R. Scateni, and S. Uras, “Evaluation of user gestures in multi-touch interaction: A case study in pair-programming,” in Proceedings of the 13th International Conference on Multimodal Interfaces, ser. ICMI ’11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 161–168. [Online]. Available:
[38] B. Loureiro and R. Rodrigues, “Multi-touch as a natural user interface for elders: A survey,” in 6th Iberian Conference on Information Systems and Technologies (CISTI 2011), June 2011, pp. 1–6.
[39] U. Lee and J. Tanaka, “Finger identification and hand gesture recognition techniques for natural user interface,” in Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, ser. APCHI ’13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 274–279. [Online]. Available:
[40] P. Głomb, M. Romaszewski, S. Opozda, and A. Sochan, “Choosing and modeling the hand gesture database for a natural user interface,” in Gesture and Sign Language in Human-Computer Interaction and Embodied Communication, E. Efthimiou, G. Kouroupetroglou, and S.-E. Fotinea, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 24–35.
[41] L. Bian and H. Holtzman, “Qooqle: Search with speech, gesture, and social media,” in Proceedings of the 13th International Conference on Ubiquitous Computing, ser. UbiComp ’11.
New York, NY, USA: Association for Computing Machinery, 2011, pp. 541–542. [Online]. Available:
[42] A. Adler and R. Davis, “Speech and sketching for multimodal design,” in ACM SIGGRAPH 2007 Courses, ser. SIGGRAPH ’07. New York, NY, USA: Association for Computing Machinery, 2007, p. 14–es. [Online]. Available:
[43] J. K. Zao, T. T. Gan, C. K. You, S. J. R. Méndez, C. E. Chung, Y. T. Wang, T. Mullen, and T. P. Jung, “Augmented brain computer interaction based on fog computing and linked data,” in 2014 International Conference on Intelligent Environments, June 2014, pp. 374–377.
[44] S. Siltanen and J. Hyväkkä, “Implementing a natural user interface for camera phones using visual tags,” in Proceedings of the 7th Australasian User Interface Conference - Volume 50, ser. AUIC ’06. AUS: Australian Computer Society, Inc., 2006, pp. 113–116.
[45] M. Weiser, “The computer for the 21st century,” SIGMOBILE Mob. Comput. Commun. Rev., vol. 3, no. 3, pp. 3–11, Jul. 1999. [Online]. Available:
[46] T. K. Hui and R. S. Sherratt, “Towards disappearing user interfaces for ubiquitous computing: human enhancement from sixth sense to super senses,” Journal of Ambient Intelligence and Humanized Computing, vol. 8, no. 3, pp. 449–465, 2017.
[47] A. Van Dam, “User interfaces: disappearing, dissolving, and evolving,” Communications of the ACM, vol. 44, no. 3, pp. 50–52, 2001.
[48] D. Wigdor and D. Wixon, Brave NUI World: Designing Natural User Interfaces for Touch and Gesture, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[49] D. A. Norman, “Natural user interfaces are not natural,” Interactions, vol. 17, no. 3, pp. 6–10, May 2010. [Online]. Available:
[50] S. Eggins and D. Slade, Analysing casual conversation. Equinox Publishing Ltd., 2005.
[51] K. P. Schneider, Small talk: Analyzing phatic discourse. Hitzeroth, 1988, vol. 1.
[52] G. Brown, G. D. Brown, G. R. Brown, B. Gillian, and G. Yule, Discourse analysis. Cambridge university press, 1983.
[53] C. H.
Woolbert, “Speaking and writing - a study of differences,” Quarterly Journal of Speech, vol. 8, no. 3, pp. 271–285, 1922.
[54] G. Borchers, “An approach to the problem of oral style,” Quarterly Journal of Speech, vol. 22, no. 1, pp. 114–117, 1936.
[55] G. Redeker, “On differences between spoken and written language,” Discourse processes, vol. 7, no. 1, pp. 43–55, 1984.
[56] G. H. Drieman, “Differences between written and spoken language: An exploratory study,” Acta Psychologica, vol. 20, pp. 78–100, 1962.
[57] R. C. O’Donnell, “Syntactic differences between speech and writing,” American Speech, vol. 49, no. 1/2, pp. 102–110, 1974.
[58] T. Bennett, “Verb voice in unplanned and planned narratives,” E. O. Keenan and T. Bennett (eds.): Discourse across time and space. SCOPIL, vol. 5, 1977.
[59] J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and H. Yan, “Embodiment in conversational interfaces: Rea,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’99. New York, NY, USA: Association for Computing Machinery, 1999, pp. 520–527. [Online]. Available:
[60] R. S. Nickerson, “On conversational interaction with computers,” in Proceedings of the ACM/SIGGRAPH Workshop on User-Oriented Design of Interactive Graphics Systems, ser. UODIGS ’76. New York, NY, USA: Association for Computing Machinery, 1976, pp. 101–113. [Online]. Available:
[61] T. Bickmore and J. Cassell, “Relational agents: A model and implementation of building user trust,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’01. New York, NY, USA: Association for Computing Machinery, 2001, pp. 396–403. [Online]. Available:
[62] N. Wang, D. V. Pynadath, S. G. Hill, and A. P. Ground, “Building trust in a human-robot team with automatically generated explanations,” in Proceedings of the Interservice/Industry Training, Simulation and Education Conference (I/ITSEC), vol. 15315, 2015, pp. 1–12.
[63] T. Sanders, K. E. Oleson, D. R. Billings, J. Y.
Chen, and P. A. Hancock, “A model of human-robot trust: Theoretical model development,” in Proceedings of the human factors and ergonomics society annual meeting, vol. 55, no. 1. SAGE Publications Sage CA: Los Angeles, CA, 2011, pp. 1432–1436.
[64] T. W. Bickmore and R. W. Picard, “Establishing and maintaining long-term human-computer relationships,” ACM Trans. Comput.-Hum. Interact., vol. 12, no. 2, pp. 293–327, Jun. 2005. [Online]. Available:
[65] M. D. Abrams, G. E. Lindamood, and T. N. Pyke, “Measuring and modelling man-computer interaction,” in Proceedings of the 1973 ACM SIGME Symposium, ser. SIGME ’73. New York, NY, USA: Association for Computing Machinery, 1973, pp. 136–142. [Online]. Available:
[66] L. P. Vardoulakis, L. Ring, B. Barry, C. L. Sidner, and T. Bickmore, “Designing relational agents as long term social companions for older adults,” in Intelligent Virtual Agents, Y. Nakano, M. Neff, A. Paiva, and M. Walker, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 289–302.
[67] S. Sayago, B. B. Neves, and B. R. Cowan, “Voice assistants and older people: Some open issues,” in Proceedings of the 1st International Conference on Conversational User Interfaces, ser. CUI ’19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available:
[68] R. Cole, D. W. Massaro, J. d. Villiers, B. Rundle, K. Shobaki, J. Wouters, M. Cohen, J. Baskow, P. Stone, P. Connors et al., “New tools for interactive speech and language training: using animated conversational agents in the classroom of profoundly deaf children,” in MATISSE-ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education, 1999.
[69] D. Pérez-Marín and I. Pascual-Nieto, “An exploratory study on how children interact with pedagogic conversational agents,” Behaviour & Information Technology, vol. 32, no. 9, pp. 955–964, 2013.
[70] C. Nass, Y. Moon, and N.
Green, “Are machines gender neutral? gender-stereotypic responses to computers with voices,” Journal of Applied Social Psychology, vol. 27, no. 10, pp. 864–876, 1997. [Online]. Available:
[71] R. C.-S. Chang, H.-P. Lu, and P. Yang, “Stereotypes or golden rules? exploring likable voice traits of social robots as active aging companions for tech-savvy baby boomers in taiwan,” Computers in Human Behavior, vol. 84, pp. 194–210, 2018. [Online]. Available:
[72] C. I. Nass and S. Brave, Wired for speech: How voice activates and advances the human-computer relationship. MIT press Cambridge, MA, 2005.
[73] C. Nass, Y. Moon, B. J. Fogg, B. Reeves, and C. Dryer, “Can computer personalities be human personalities?” in Conference Companion on Human Factors in Computing Systems, ser. CHI ’95. New York, NY, USA: Association for Computing Machinery, 1995, pp. 228–229. [Online]. Available:
[74] A. Inc, “Siri style guide - sirikit.” [Online]. Available:
[75] “Design process: The process of thinking through the design of a voice experience,” 2020, accessed: 2020-01-31. [Online]. Available:
[76] S. Ross, E. Brownholtz, and R. Armes, “Voice user interface principles for a conversational agent,” in Proceedings of the 9th International Conference on Intelligent User Interfaces, ser. IUI ’04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 364–365. [Online]. Available:
[77] C. M. Myers, D. Grethlein, A. Furqan, S. Ontañón, and J. Zhu, “Modeling behavior patterns with an unfamiliar voice user interface,” in Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, ser. UMAP ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 196–200. [Online]. Available:
[78] S. R. Klemmer, A. K. Sinha, J. Chen, J. A. Landay, N. Aboobaker, and A. Wang, “Suede: A wizard of oz prototyping tool for speech user interfaces,” in Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’00.
New York, NY, USA: Association for Computing Machinery, 2000, pp. 1–10. [Online]. Available:
[79] H. H. Hiller and L. DiLuzio, “The interviewee and the research interview: Analysing a neglected dimension in research*,” Canadian Review of Sociology/Revue canadienne de sociologie, vol. 41, no. 1, pp. 1–26, 2009. [Online]. Available:
[80] L. Richardson and E. St Pierre, “A method of inquiry,” Collecting and interpreting qualitative materials, vol. 3, no. 4, p. 473, 2008.
[81] A. Strauss and J. Corbin, Basics of qualitative research. Sage publications, 1990.
[82] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and implementation of xiaoice, an empathetic social chatbot,” Computational Linguistics, no. Just Accepted, pp. 1–62, 2018.
[83] R. E. Banchs and H. Li, “Iris: a chat-oriented dialogue system based on the vector space model,” in Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012, pp. 37–42.
[84] Z. Yu, L. Nicolich-Henkin, A. W. Black, and A. Rudnicky, “A wizard-of-oz study on a non-task-oriented dialog systems that reacts to user engagement,” in Proceedings of the 17th annual meeting of the Special Interest Group on Discourse and Dialogue, 2016, pp. 55–63.
[85] F. N. Akinnaso, “On the differences between spoken and written language,” Language and speech, vol. 25, no. 2, pp. 97–125, 1982.
[86] D. Tannen, “Oral and literate strategies in spoken and written narratives,” Language, pp. 1–21, 1982.
[87] C. M. Laserna, Y.-T. Seih, and J. W. Pennebaker, “Um... who like says you know: Filler word use as a function of age, gender, and personality,” Journal of Language and Social Psychology, vol. 33, no. 3, pp. 328–338, 2014.
[88] H. N. Nagel, L. P. Shapiro, and R. Nawy, “Prosody and the processing of filler-gap sentences,” Journal of psycholinguistic research, vol. 23, no. 6, pp. 473–485, 1994.
[89] J. C. Richards et al., The language teaching matrix. Cambridge University Press, 1990.
[90] S. Oishi, S. Kesebir, C.
Eggleston, and F. F. Miao, “A hedonic story has a transmission advantage over a eudaimonic story.” Journal of Experimental Psychology: General, vol. 143, no. 6, p. 2153, 2014.
[91] G. S. Berns, “Something funny happened to reward,” Trends in cognitive sciences, vol. 8, no. 5, pp. 193–194, 2004.
[92] Google. (2020) Google conversation design process. [Online]. Available:
[93] Google. (2020) Conversational components–informational statements. [Online]. Available:
[94] S. Bødker, “When second wave hci meets third wave challenges,” in Proceedings of the 4th Nordic conference on Human-computer interaction: changing roles, 2006, pp. 1–8.
[95] H. Jung, H. J. Kim, S. So, J. Kim, and C. Oh, “Turtletalk: an educational programming game for children with voice user interface,” in Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–6.
[96] C. Nass and Y. Moon, “Machines and mindlessness: Social responses to computers,” Journal of Social Issues, vol. 56, no. 1, pp. 81–103, 2000. [Online]. Available:
[97] J. Hollan and S. Stornetta, “Beyond being there,” in Proceedings of the SIGCHI conference on Human factors in computing systems, 1992, pp. 119–125.
[98] D. A. Norman and S. W. Draper, User centered system design: New perspectives on human-computer interaction. CRC Press, 1986.
[99] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[100] F. Heylighen and J.-M. Dewaele, “Variation in the contextuality of language: An empirical measure,” Foundations of science, vol. 7, no. 3, pp. 293–340, 2002.
[101] Google. (2020) Conversation design process - write sample dialogs. [Online]. Available:
[102] M. F. Scheier and C. S. Carver, “The self-consciousness scale: A revised version for use with general populations 1,” Journal of Applied Social Psychology, vol. 15, no. 8, pp.
687–699, 1985.

Appendix

This appendix includes every resource mentioned in Chapter 3 and 4 for conducting this study.

A.1 The Recruitment Poster

THE UNIVERSITY OF BRITISH COLUMBIA
Recruitment Form for ‘Voice User Interface Prototyping Study’
Department of Computer Science 2366 Main Mall Vancouver, B.C., V6T 1Z4

Interested in using Voice User Interfaces? Come help us! NO technical background is required.

Who we are: We are researchers from the University of British Columbia: Dr. Dongwook Yoon, Dr. Joanna McGrenere, and Yelim Kim. We are conducting a study to create design / technology solutions to create natural VUI dialogues.

We are looking for participants who are:
● 19 years of age or older
● Fluent english speakers

Receive a $23 honorarium for this 1.5 hours study ($15/hour)

Interested in participating? If you are interested in participating or would like more information, please contact Yelim Kim at [email] or [phone number]

Call for participation (Version 1.2 / May 21, 2019)

A.2 The Recruitment Message Posted Through SNS

Subject: Call for Study Participants - $15/hour for Voice User Interface Prototyping Study

Voice User Interface (VUI) is currently gaining high industrial and academic interest! Are you interested in using the latest voice technologies such as Google home or Amazon Alexa, or even making an application for it? If so, you can help us! We are conducting a study to create design / technology solutions to create natural dialogues for VUI applications.

Who are we?
We are researchers from the University of British Columbia: Dr. Dongwook Yoon, Dr. Joanna McGrenere, and Yelim Kim.

Whom are we looking for?
● Native level English speaker
● 19 years of age or older
You do not need to have any previous programming experience nor previous experience with any voice user interface (e.g., Google home)

How long will it take?
You will be asked to complete 1 session that takes at most 1 to 1 and a half hours.

What will you get?
We will pay $15 for 1 hour and $23 for 1 and a half hour in cash.

What will you do?
● Two short online surveys (Pre-task and Post-task)
● Creating or interacting with Voice User Interface
● One post-task interview

The audio recording may be obtained under your agreement which will be transcribed by Yelim Kim for the study.

For further information, please contact Yelim Kim (E-mail: [email], Phone: [phone number]). Yelim Kim will provide all the details of the study, and provide a consent form for the study. You can opt-out of further contact by noticing Yelim Kim at any point of the time.

Yelim Kim
MSc Student, UBC Computer Science
Version 1.1 / July 29, 2018

A.3 The Consent Form

THE UNIVERSITY OF BRITISH COLUMBIA
Consent Form for ‘Voice User Interface Prototyping Study’
Department of Computer Science 2366 Main Mall Vancouver, B.C., V6T 1Z4

Principal Investigator: Dongwook Yoon, Computer Science Department, University of British Columbia. Ph: [phone number]; Email: [email].

Co-Investigator(s): Joanna McGrenere, Computer Science Department, University of British Columbia. Ph: [phone number]; Email: [email]. Yelim Kim, Computer Science Department, University of British Columbia. Ph: [phone number]; Email: [email].

Sponsor: This project is funded by the Natural Sciences and Engineering Research Council (NSERC).

Purpose: Voice User Interface (VUI) has been gaining high interest from the human-computer interaction research field in recent years. When designing VUIs, one of the essential things to do is to write natural dialogues between the voice assistant and users. The dialogue is a written script of an expected conversation between the voice assistant and users, which also includes the prosodic information such as tones and pitches that the voice assistant should use when it speaks.
Since we have natural conversations with other people on a daily basis without knowing exactly how we do, it is often challenging for VUI designers when they are purposely trying to make their dialogue more natural. Therefore, in this study, we will aim to find the methods to help designers to create more natural VUI dialogues.   Study Procedure: After you have read this document, if you have any question or concern, please don’t hesitate to contact Yelim Kim through email. Yelim Kim will respond to you with detailed within one or two business days. Once you have signed this consent form, you will be asked to: Version 1.1 / February 25, 2019  Page 1   THE    UNIVERSITY    OF    BRITISH    COLUMBIA - Fill up the online survey asking the basic demographic data (i.e., gender, age, occupation, academic background) and your previous experience of designing VUIs  - Answer interview questions about your VUI design experiences, practices and your notion of natural VUI dialogues  - Complete a user task of creating VUI dialogue scripts with the keyboard typing and the voice typing - Answer interview questions asking about your experience of creating dialogues with keyboard typing and the voice typing  The online survey should take approximately 5~15 minutes to fill out, and the interview should take approximately one hour to one and a half hours and be completed in 1 session.   During the interview, the audio recording is required, and we may ask your permission to take pictures of the tools that you are using to create VUI dialogues. For the user task, you will have a choice to use your own computer or use the computer provided by the researcher. During the user task,to understand your experience of creating VUI dialogues better, we may ask your permission to take a screen recording to observe the process of creating VUI dialogues. 
We will share the pictures and the screen recordings we captured during the interview with you to make sure there is no element you do not want to share with us. Any identifiable element in the images, videos, or audios will be blurred or obscured to prevent identification.

Check off the box if you agree with our condition:
[   ] For the interview, I agree that an audio tape recording can be made of what I have to say.
[   ] For the interview, I agree that the research can take pictures of the tools I am using to create VUI
[   ] For the user task, I agree that the research can take screen recording to observe the process of creating VUI dialogues

Participants are free to withdraw without giving a reason. If you withdraw from the study, all of the data obtained from you will be permanently discarded on the date of your withdrawal and will not be used for this study. All of the electronic files will be permanently deleted and all of the physical copy of documents (e.g., handwritten notes, paper transcriptions and the consent form) obtained from you will be shredded. Your name will be also removed from code assignment file.

Project Outcomes: Although the project outcomes will be determined by the research findings, possible research products will include: journal articles, a report, and plain language summaries.

Potential Benefits: There are no explicit benefits to you by taking part in this study. However, you will be asked to describe your current practices of creating VUI dialogues which may help you reflect on your current practices, and think about new approaches to improve them.

Potential Risks: Since you are asked either to write the script with the voice typing or to interact with VUI using your voice, you might feel thirsty during the interaction. We can provide the water or find the water source for you.
This study may take one and a half hours, and you may feel tired during the session. If you need any break at any point of the time, you can always ask the researcher to have a 5-15 minutes of break. You can also withdraw your participation in the project at any time.   Confidentiality: The survey will be conducted through a UBC survey tool provided by Qualtrics. All the hard copies of documents will be identified only by code numbers that Yelim Kim will assign to each participant. Any of electronic file names will not contain any identifiable data such as participant name. You will not be identified by name in either the screen recording, the VUI scripts you wrote, survey data or interview transcript. The link between the code associated and the actual names will be stored in the “code assignment file”. The only documents containing your real name will be the “code assignment file” and this consent form.   This file will be stored on an encrypted hard drive of a password protected laptop owned by Yelim Kim, and there will not be any other copy stored than the one in the Yelim’s hard drive. The only documents containing your real name will be the “code assignment file” and this consent form. Audio recordings will be transcribed by Yelim Kim and the any identifiable data such as participant’s name will be removed from the transcription. As mentioned above, any identifiable element in the images, videos, or audios will be blurred or obscured to prevent identification.   
During the study any handwritten notes and paper transcripts will be kept in a locked cabinet in the researchers' laboratory (Room X508 at ICICS) with controlled access at UBC, and all electronic files including audio files and “code assignment file” will be Version 1.1 / February 25, 2019  Page 3   THE    UNIVERSITY    OF    BRITISH    COLUMBIA stored on an encrypted hard drive of a password protected laptop owned by Yelim Kim, and there will not be any other copy stored than the one in the Yelim’s hard drive.   In any of publishing contents (i.e., any reports, research papers, thesis documents, and presentations), participants will be named by the code Yelim Kim will assign, instead of their real names. There will be no identifiable data published. The audio files and the screen recordings will not be used in any of publishing content (i.e., any reports, research papers, thesis documents, and presentations), and the transcription of the audio files will be modified to remove any identifying details of the participants. The pictures of the tools may be used for the publication, however, any identifiable element will be blurred to prevent identification.   Remuneration/Compensation: In order to acknowledge the time you have taken to be involved in this project, each participant will receive $15 / hour.   Contact for information about the study: If you have any questions or desire further information with respect to this study, you may contact Yelim Kim (​phone number​; email: ​email​)  Contact for concerns or complaints about the study: If you have any concerns or complaints about your rights as a research participant and/or your experiences while participating in this study, contact the Research Participant Complaint Line in the UBC Office of Research Ethics at 604-822-8598 or if long distance e-mail or call toll free 1-877-822-8598.  
Consent: Your participation in this study is entirely voluntary and you may refuse to participate or withdraw from the study at any time. Your signature below indicates that you have received a copy of this consent form for your own records. Your signature indicates that you consent to participate in this study.

______________________________________________________________________
Subject Signature                                                         Date

______________________________
Printed Name

A.4 The Pre-interview Survey

A.5 The Semi-structured Interview Script

Participant Code: _______
Interview Date: 2019 /___ /___
Interview Time: ____:____ ~ ____:____
Interview Place: __________________

Recruitment checklist
1. Attach the consent form
2. Let them prepare the microphone
3. Send the pre-interview survey with the correct participant ID. Send the survey 2~3 days earlier so that I can understand their applications better.

Before starting the interview - Note to myself
1. Make sure the devices have enough battery
2. Prepare an extra laptop
3. Prepare the microphone and the voice recorder
4. Turn off the cell phones (before the interview)
5. Check that the recording devices are working
   a. Screen capturing
   b. Zoom audio
   c. Voice recorder
   d. Smart-phone voice recorder

Opening Script

Introduce Study Phase:
1. Answer the interview questions regarding VUI designing experience (At most 20 minutes)
2. Complete a user task for writing dialogues with voice typing and keyboard typing (At most 20 minutes)
3. Complete a post user task survey (At most 30 minutes)
4. Answer the interview questions regarding the user task (At most 20 minutes)

Pre-interview questionnaire

Make sure the participants finished the pre-interview questionnaire prior to the interview.

Interview questions prior to the user task

Q1. I would like to begin our interview by asking about some of your previous voice user interface projects. Please think of one or two of the most memorable projects (they don’t need to be the latest projects; I mean the projects whose procedure you remember most vividly). Could you tell me what the goal of the project was, the role you took in this project, and what you designed and built in the course of the projects? (This question will be skipped if the participant already answered it in detail in the survey.)

VUI project 1: (name:                                 )
VUI project 2: (name:                                 )

Were these applications conversational or task-oriented?
- If it’s a task-oriented application, how much effort did the designers make for the conversational aspect?
- How many goals to achieve?

“Non-task-oriented conversational systems do not have a stated goal to work towards.”

Q2. For this project, we are especially interested in the process of creating dialogue. Could you walk me through the steps you took to create the dialogues for one of the projects you mentioned earlier?

VUI project name: ______________________

Software or hardware tools used to create the dialogue (e.g., prototyping tool, audio recorder):

Were the same procedure and tools used for the other VUI project? (Yes / No)
- If not, what are the differences?

Was creating a natural dialogue one of your goals for this process? If so, how important?

Q3. How do you define “natural dialogue”? What is the expected value in creating a natural dialogue? (Dig deeper until I get concrete answers)

Subjective definition of naturalness:

Expected benefits of natural dialogues:

Q4. How do you create a dialogue to be more natural?

Strategies for creating natural dialogues (e.g., In which step does it apply, and how many people are involved?):

Criteria to determine the naturalness:

Of the tools you mentioned earlier, were any of them helpful in creating more natural dialogue?

The biggest challenges in creating more natural dialogue:

Q5. In any of the steps you took for creating the dialogue, was there any step where you used your voice (e.g., saying things aloud during the process in order to create a natural dialogue)?

User Task for Creating Dialogues
1. Practice task for creating dialogue with the voice input
   a. For the online meeting, turn off the cameras
   b. For the offline meeting, be away from the user
   c. Prepare the microphone
   d. Put the microphone UI button to the left side of the document
2. Go through the instruction
3. Complete user task with the text input
4. Complete user task with the voice input
5. Make them listen to two audios while asking them to compare the two audios regarding the naturalness
   a. Let them take a note if they want

SSML Task
1. Do the screen sharing to show my Google Doc page
2. Increase the speaker volume to create a note from the speaker sound
3. Turn on the Add-ons (Note this voice note takes time)
4. Ask them if they want to insert SSML to make the dialogue more expressive
5. For the same place where they used the SSML, ask them if they want to use their own voice to enhance the expressivity of the dialogue

Interview questions post to user task

Q6: Could you tell me about your experience of using the two modalities to create the dialogues?

Post-user task survey

This survey will ask
● How helpful the voice input was for each VUI design guideline
● About designers’ current practice of following the design guidelines
● About perceived helpfulness of using the voice input and text input to create a natural dialogue

Q7 (Follow-up of the survey): Have you ever heard of these guidelines before and do you think they are important?

Q8: How would you compare the two input modalities for creating dialogue? (Follow-up questions regarding the survey result)

Other differences regardless of the guidelines suggested in the survey:

Q9: Is there any situation in which you would use your voice to create VUI dialogues?

A helpful situation to use the voice:

Situation:

Why does it make it more helpful:

How should it be implemented to maximize the benefit:

In which step of the dialogue writing process does this situation occur:

Closing Script

Thank you so much for helping us! The data we obtained will be very helpful in guiding us to create a better VUI prototyping tool in the future. If you’d like to know the result of this study, I can send you a copy of the paper once we publish this work, so please let me know. Also, if you have any questions after this interview about our study, please don’t hesitate to contact me through the e-mail address written in the consent form. Again, we really appreciate your participation in this study!

A.6 More Descriptions on the User Task

Each participant who carried out the user task was given the requirements for writing two VUI dialogues, as below. The formats and the contents of the requirements were created to be as close as possible to the requirements used in real practices [101]. The requirements outline the use context and key use cases of the application, the desired persona of the voice assistant, the perspective persona of target users, and a high-level schema for the VUI dialogue. While the participants were writing the VUI dialogues, the researcher observed them and wrote notes on their VUI writing procedures.

One of the VUI dialogue scenarios was to help users lose weight by letting them reflect on their previous diet and providing suggestions for creating healthier eating plans. Another scenario was to help users have healthier exercise routines by letting them reflect on their current exercise practices and providing suggestions for better exercise plans.
The participant had to write one of the VUI dialogues with keyboard input and the other with their own voice, using Google Docs’ voice typing functionality. The scenario for each input method was chosen uniformly at random. By asking the participants to use voice typing instead of keyboard typing, we sought to understand whether the participants would see benefits of using voice input in writing natural VUI dialogues. Because using voice input in front of the interviewer may cause anxiety for some people, individual differences in self-consciousness could act as a latent variable. To account for this, we asked the participants to fill out the questions from the “Self-Consciousness Scale” [102] in the pre-survey.

After each participant wrote the two VUI dialogues, the researcher generated the audio for the dialogues with ‘TTS reader’, the tool we created to use the Amazon Alexa voice to read written scripts with SSML tags. We had the participants listen to the audio and asked them whether there was any part of the synthesized audio that they wished to change. If so, we had them demonstrate the desired changes with their own voices. Finally, we asked the participants to fill out the post-user task survey (Appendix A.7), where we asked about their experiences of using voice typing and keyboard input. In this survey, we also asked about their perceived effectiveness of the existing VUI design guidelines for creating natural VUIs.

Application Name: Dr. Nutrio (Nutrition counsellor)

User persona: Emily, 35, recently gained weight. Her goal is to create a healthier eating plan for the next week to help her lose 5 lbs. She regularly eats healthy meals, but often feels hungry late at night. She’s easily tempted by hunger, and her late night snacks of choice are pizza and chips. She doesn’t understand people who think raw vegetables are snacks.

User Context: On every Friday night, Emily uses Dr. Nutrio to create her diet plan for the following week.

Key use cases:
1. Helps users reflect on their previous diet
2. Gives suggestions for creating a healthier eating plan and provides tips for following the plan

High-level schema 1:
1. Dr. Nutrio and Emily engage with a small talk (Asking about last weekend)
2. Dr. Nutrio and Emily discuss about Emily’s current eating habit.
3. Emily reports her goal to Dr. Nutrio (e.g., losing 5 lb)
4. Dr. Nutrio suggest possible solutions to her
5. Dr. Nutrio and Emily negotiate the plan for the next week.
6. Farewell

Application Name: Dr. Fitna (Exercise counsellor)

User persona: Nick, 31, recently gained weight. His goal is to create an effective exercise plan for the next week to help him lose 5 lbs. Last week, he intended to wake up early to run in the morning, but as a “night owl”, Nick failed to wake up in time. As a runner, Nick is currently a beginner, running 2km per day. He doesn’t believe that he can increase his daily running distance in a short amount of time.

User Context: On every Friday night, Nick uses Dr. Fitna to create his exercise plan for the following week.

Key use cases:
1. Helps users reflect on their current exercise practices
2. Gives suggestions for creating a more effective exercise plan and provides tips for following the plan

System persona: Dr. Fitna, a cheerful and supportive exercise counsellor, believes that the best way to help users is by establishing a comfortable environment for sharing concerns and problem-solving. Therefore, she prefers to negotiate a plan with the users instead of telling them what to do.

High-level schema 1:
1. Dr. Fitna and Nick engage with a small talk (Asking about last weekend)
2. Dr. Fitna and Nick discuss about Nick’s current exercise practice
3. Nick reports his objective to Dr. Fitna (e.g., losing 5 lb)
4. Dr. Fitna suggest possible solutions to Nick
5. Dr. Fitna and Nick negotiate the plan for the next week.
6. Farewell

A.7 The Post User-task Survey

Reflection on User Task

Please indicate your level of agreement on the following statements.
(Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree)

During the user task, voice typing was helpful in making the dialogue sound like a natural spoken conversation.

During the user task, keyboard typing was helpful in making the dialogue sound like a natural spoken conversation.

During the user task, using my voice to create the dialogue made me feel self-conscious.

Design Guidelines and Voice Typing/Note

Please indicate your level of agreement on the following statements.
(Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree)

During the user task, voice typing helped me use less formal words.
Example: Instead of using 'extract' or 'eliminate', use 'remove'.

During the user task, voice typing helped me use common words instead of using overly technical words or jargon.
Example: Instead of saying “Your database transaction has failed due to the storage limit”, say “We can’t store the information, because you ran out of storage.”

During the user task, voice typing helped me use contractions (e.g., I’m, she’d, shouldn’t) to shorten words.
Example: Change ‘You would like to’ into ‘You’d like to’

During the user task, voice typing helped me create simple sentence structures.
Example: Instead of saying “I like the house that you bought, which has a green roof”, say “I like your house with the green roof”

During the user task, voice note helped me create an expressive voice that conveys nuances by adding vocal variety, breaks, and non-verbal sounds.
Example: Nuances can be conveyed in a voice by manipulating intonation, pitch, or volume. Breaks or non-verbal sounds (e.g. breathing) can also be added.
This manipulation can be done through either appropriate SSML tags or better speech synthesis engines. Using an SSML tag to emphasize: “I already told you I <emphasis level="strong">really love</emphasis> it.”

During the user task, voice typing helped me provide only relevant information, from the users' point of view, at each conversation turn.
Example: Let’s say a user wants to buy a ticket for a sport event. When booking a ticket for a sport event, the seating area (e.g., VIP seats, section A) is very important. Hence, don’t burden the user’s short-term memory by listing all of the information about seats and prices. First, introduce available areas. After the user chose the section, introduce the prices of seats.

During the user task, voice typing helped me use various ways of presenting the same questions and various ways of giving the same responses.
Example: A variation of “What movie do you wanna see?” might be “What movie would you like to watch?”

Your past design practice

Please indicate your level of agreement on the following statements.
(Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree)

When I was designing my VUI applications, I successfully followed the guideline of using less formal words.
Example: Instead of using ‘extract’ or ‘eliminate’, use ‘remove’.
You just selected either "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
- Because it was hard to follow it
- Because it was not important to me
- Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using common words instead of using overly technical words or jargons.
Example: Instead of saying “Your database transaction has failed due to the storage limit”, say “We can’t store the information, because you ran out of storage.”
You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
- Because it was hard to follow it
- Because it was not important to me
- Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using contractions (e.g., I’m, she’d, shouldn’t) to shorten words.
Example: Change ‘You would like to’ into ‘You’d like to’
You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
- Because it was hard to follow it
- Because it was not important to me
- Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using simple sentence structures.
Example: Instead of saying “I like the house which you bought that has a green roof”, say “I like your house with a green roof”
You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
- Because it was hard to follow it
- Because it was not important to me
- Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of creating an expressive voice that conveys nuances by manipulating the audio sound.
Example: Voice can be expressive by manipulating its intonation, pitch, or volume, and you can also add breaks or non-verbal sounds (e.g., breathing). This manipulation can be done through either appropriate SSML tags or better speech synthesis engine. Using a SSML tag to emphasize: “I already told you I <emphasis level="strong">really love</emphasis> it.”
You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
- Because it was hard to follow it
- Because it was not important to me
- Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of providing only relevant information, from the user’s point of view, at each conversation turn.
Example: Let’s say a user wants to buy a ticket for a sporting event. When searching for tickets, the date, time, and location of the event are the most optimally relevant information at the initial turn of the conversation. Information about seats and prices are not optimally relevant at this turn of the conversation. Introducing all this information at the initial turn of the conversation would likely serve as a burden to the user, rather than be helpful.
You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
When I was designing my VUI applications, I successfully followed the guideline of using various ways of presenting the same questions and various ways of giving the same responses.
Example: A variation of “What movie do you wanna see?” might be “What movie would you like to watch?”
(Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree)
You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?
- Because it was hard to follow it
- Because it was not important to me
- Neither of them, it was because (e.g., often forgot to follow it):

A.8 The Data Analysis Process

I used ‘airtable’, an online collaborative spreadsheet application, to organize the codes.

Grid view

# | Factor Categories | Factors | Factor elements | Sub-element | Factor Description
1 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | It is not expressive enough to display emotions
2 | Design Process | Design Constraints | Design Goals | Being efficient | Designing a VUI involves other goals than creating a natural VUI that are often conflicting
3 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | It doesn't have a proper intonation
4 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | SSML | The time spent for implementing SSML is not justified (Too many iterations required)
5 | Technologies/Frameworks | Limitations of the current technology | Artificial Intelligence | Natural Language Understanding Engine | NLU engine still can't understand various ways of expressing the same meaning
6 | Design Process | Design mistakes to be avoided | Dialogue Presentation | Spoken Language | Designers are often creating dialogues to be sound like the written language
7 | Technologies/Frameworks | Lack of technical / framework supp… | Artificial Intelligence | Natural Language Understanding Engine | Methods to understand personalized meanings or interaction are missing (Different generation, Individual differences)
8 | Design Process | Design Constraints | Design procedure | Pre-scripted nature | Designers script the dialogue before when it’s spoken
9 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | It doesn't pronounce words correctly
10 | Design Process | Design mistakes to be avoided | Dialogue Style Design | Spoken Language | Designers are often overloading the information
11 | Technologies/Frameworks | Lack of technical / framework supp… | Speech Synthesis | Evaluation Method | There is no evaluation method other than the instinct-based one
12 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | SSML | There is no clear way to utilize it (because it provides too low-level modifications)
13 | Technologies/Frameworks | Lack of technical / framework supp… | Design guideline | Specific guideline | More specific design guidelines are missing
14 | Design Process | Design Constraints | Design Environment | Access to end-users | Designers often can’t have direct contacts with their end-users
15 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | It doesn't handle punctuation properly
16 | Technologies/Frameworks | Limitations of the current technology | Artificial Intelligence | Natural Language Understanding Engine | NLU engine doesn't understand the meaning of the words based on the context
17 | Technologies/Frameworks | Lack of technical / framework supp… | Artificial Intelligence | Sentiment Analysis | Methods to collect real-time user sentiments are missing
18 | Technologies/Frameworks | Lack of technical / framework supp… | Design guideline | Personalization | There is no guidelines considering individual differences (Culture, Generation differences)
19 | Technologies/Frameworks | Lack of technical / framework supp… | Design guideline | Linguistics | Design guidelines tied to linguistic knowledge are missing
20 | Technologies/Frameworks | Lack of technical / framework supp… | Design Tools | Analyzing User Experience | Tools for analyzing user experiences are missing
21 | Design Process | Design mistakes to be avoided | Interaction Design | Rigidness | Designers are often making users to go through unnecessary conversation turns
22 | Design Process | Design mistakes to be avoided | Interaction Design | Rigidness | VUI often repeat the same information or stop the process when it doesn't know how to handle the current situation
23 | Design Process | Design Constraints | Design Goals | Lowering development cost | Designing a VUI involves other goals than creating a natural VUI that are often conflicting
24 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | Multi-language support is still very bad (e.g., German pronunciation sounds awkward)
25 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | It doesn't have the appropriate speed for certain groups (kids, elders)
26 | Design Process | Design Constraints | Design Domain | No invalid input | There is no invalid input for VUIs and this is even severe problem for kids and elders
27 | Technologies/Frameworks | Limitations of the current technology | Artificial Intelligence | Processing time | The processing time takes too long
28 | Technologies/Frameworks | Limitations of the current technology | Artificial Intelligence | Natural Language Understanding Engine | It doesn't understand slangs
29 | Technologies/Frameworks | Lack of technical / framework supp… | Design Tools | Dialogue Writing | Designers should manually vary words not to be repetitive
30 | Technologies/Frameworks | Lack of technical / framework supp… | Design Tools | Prototyping | Tools for rapid prototyping are missing
31 | Design Process | Design mistakes to be avoided | Dialogue Presentation | Rigidness | Designers are often forcing people to use special keywords
32 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | Synthesized Voice | It doesn't put a break properly
33 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | SSML | Prosody tags are not working well
34 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | SSML | The transition between the words are not smooth
35 | Technologies/Frameworks | Limitations of the current technology | Speech Synthesis | SSML | Different packages for SSML have different effects
36 | Technologies/Frameworks c… | Limitations of the current technology | Speech Synthesis | SSML | Feeling is not a subjective matter
37 | Design Process | | | |
SUM 110

# | Participant
1 | p8 p5 p20 p4 p18 p7 p10 p1 p13
2-6 | p15 p11 p10 p5 p20 p17 p13 p9 p5 p4 p3 p2 p12 p14 p13 p7 p9 p3 p16 p15 p15 p2 p9 p6 p11 p16 p3 p2 p17 p16 p12 p10
7 | p5 p8 p7 p16 p13
8 | p18 p8 p6 p2 p3
9 | p12 p8 p7 p4 p14
10 | p9 p7 p18 p13
11 | p14 p11 p10 p6
12 | p3 p5 p16 p14
13 | p6 p2 p5
14 | p5 p3 p2
15 | p9 p5 p4
16 | p16 p2
17 | p11 p18
18 | p16 p13
19 | p20 p18
20 | p6 p12
21 | p12 p10
22 | p6 p9 p1 p3
23 | p20 p6
24 | p18 p8
25 | p7 p11
26 | p16 p13
27 | p4
28 | p5
29 | p6
30 | p5
31 | p9
32 | p4
33 | p10
34 | p5
35 | p10
36 | p13
37 |

Need …: “yes” for row 31; empty for all other rows.

# | Project Main Purpose | Affected Goals - Creating natural VUIs; Affected Goals - other than for creating natural VUIs
1 | Entertainment; Education; Programming learning; Game; To make more efficient work process; Information Querying; To help daily life; For family bonding | Assistant; General People
2 | Entertainment; To help daily life; To make more efficient work process; Information Querying; Meditation | General People; General People
3 | Entertainment; Healthcare; To make more efficient work process; Customer Service; Health Check-ups; To help daily life | Conversationalist
4 | Entertainment; Meditation; Information Querying; Customer Service; To help daily life; For family bonding; Education; Programming learning; Game | Conversationalist
5 | Entertainment; To help daily life; Information Querying; Customer Service | General People
6 | To make more efficient work process; Information Querying; Healthcare; Entertainment; Customer Service | General People
7 | Entertainment; Education; Programming learning; Game; Information Querying; To help daily life; For family bonding; Companion support | Conversationalist
8 | Customer Service; To help daily life; Companion support; For family bonding; Education | Conversationalist
9 | Entertainment; Game; Healthcare; Health Check-ups; To help daily life; For family bonding; Companion support; To make more efficient work process | Conversationalist
10 | Entertainment; Education; Programming learning; Game; To help daily life; For family bonding | Assistant
11 | To help daily life; To make more efficient work process; Information Querying; Entertainment; Game | Effective development; Time consuming; No clear deve…
12 | Entertainment; Game; Information Querying; To help daily life; Customer Service | Conversationalist
13 | To help daily life; Customer Service | Effective development; No clear development direction
14 | Customer Service; To help daily life | Conversationalist
15 | Healthcare; Health Check-ups; To help daily life | Conversationalist
16 | Customer Service; Information Querying | Conversationalist
17 | Education; To help daily life; Information Querying | Assistant
18 | Entertainment; Education; Programming learning; Game; Information Querying | Conversationalist
19 | Education; To help daily life | Conversationalist
20 | Healthcare; To make more efficient work process; To help daily life | Effective development; No clear development direction
21 | To make more efficient work process; Information Querying; Healthcare | Assistant
22 | Customer Service; Entertainment | Conversationalist
23 | To help daily life | Effective development; Time consuming
24 | Companion support; For family bonding; To help daily life; Education | Conversationalist
25 | To help daily life; Information Querying; For family bonding | Conversationalist
26 | Entertainment; Education; Programming learning; Game; Information Querying | Conversationalist
27 | Healthcare; Health Check-ups | Conversationalist
28 | To help daily life | General People
29 | To help daily life | General People
30 | To help daily life | Effective development; Time consuming
31 | To help daily life | General People
32 | Healthcare; Health Check-ups | General People
33 | To make more efficient work process; Information Querying | Conversationalist
34 | To help daily life | Conversationalist
35 | To make more efficient work process; Information Querying | Effective development; Time consuming
36 | Entertainment; Education; Programming learning; Game | General People
37 | |

# | comments for now | Participant-SSML familiarity | Participant-Project
1 | | p8 p5 p20 p4 p18 p7 p10 p1 p13 | p8 p5 p20 p4 p18 p7 p10 p1 p13
2-6 | | p15 p11 p10 p5 p20 p17 p13 p9 p5 p4 p3 p2 p12 p14 [Other related goals] p13 p7 p9 p3 p16 p15 p15 p20 p2 p9 p6 p11 p16 p3 p2 p17 p16 p12 p10 | p15 p11 p10 p5 p20 p17 p13 p9 p5 p4 p3 p2 p12 p14 p13 p7 p9 p3 p16 p15 p15 p20 p2 p9 p6 p11 p16 p3 p2 p17 p16 p12 p10
7 | | p5 p8 p7 p16 p13 | p5 p8 p7 p16 p13
8 | Need to work again | p18 p8 p6 p2 p3 | p18 p8 p6 p2 p3
9 | | p12 p8 p7 p4 p14 | p12 p8 p7 p4 p14
10 | Need to work again | p9 p7 p18 p13 | p9 p7 p18 p13
11 | | p14 p11 p10 p6 | p14 p11 p10 p6
12 | | p3 p5 p16 p14 | p3 p5 p16 p14
13 | | p6 p2 p5 | p6 p2 p5
14 | Need to work again | p5 p3 p2 | p5 p3 p2
15 | | p9 p5 p4 | p9 p5 p4
16 | | p16 p2 | p16 p2
17 | | p11 p18 | p11 p18
18 | | p16 p13 | p16 p13
19 | | p20 p18 | p20 p18
20 | | p6 p12 | p6 p12
21 | Defined | p12 p10 | p12 p10
22 | | p1 p3 | p1 p3
23 | | p20 p6 | p20 p6
24 | N/A | p18 p8 | p18 p8
25 | | p7 p11 | p7 p11
26 | | p16 p13 | p16 p13
27 | | p4 | p4
28 | | p5 | p5
29 | | p6 | p6
30 | | p5 | p5
31 | Defined | p9 | p9
32 | | p4 | p4
33 | | p10 | p10
34 | | p5 | p5
35 | Other related goals | p10 | p10
36 | | p13 | p13
37 | | |

For visualizing the themes, I used ‘Miro’, as below. A more detailed version of the diagram is available online. Please note that the diagram below reflects the themes from the earlier phase of our study.
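As a rough illustration of how the participant column of the coding grid above could be tallied, the following Python sketch counts distinct participants per coded row and per factor category. It is not the Airtable workflow used in the study. The names `ROWS`, `participants_of`, `mentions_per_row`, and `participants_per_category` are hypothetical, and the sample uses only rows 1 and 7 to 10, whose row boundaries are read from the flattened grid and are therefore an assumption.

```python
from collections import defaultdict

# Illustrative subset of the coding grid (rows 1 and 7-10 only).
# Participant lists are read from the flattened grid; the row
# boundaries are an assumption made for this sketch.
ROWS = [
    {"id": 1, "category": "Technologies/Frameworks",
     "participants": "p8 p5 p20 p4 p18 p7 p10 p1 p13"},
    {"id": 7, "category": "Technologies/Frameworks",
     "participants": "p5 p8 p7 p16 p13"},
    {"id": 8, "category": "Design Process",
     "participants": "p18 p8 p6 p2 p3"},
    {"id": 9, "category": "Technologies/Frameworks",
     "participants": "p12 p8 p7 p4 p14"},
    {"id": 10, "category": "Design Process",
     "participants": "p9 p7 p18 p13"},
]


def participants_of(row):
    """Return the distinct participant codes attached to one coded row."""
    return set(row["participants"].split())


def mentions_per_row(rows):
    """Map each row id to the number of distinct participants on that row."""
    return {row["id"]: len(participants_of(row)) for row in rows}


def participants_per_category(rows):
    """Map each factor category to the number of distinct participants
    appearing on any row of that category."""
    by_category = defaultdict(set)
    for row in rows:
        by_category[row["category"]] |= participants_of(row)
    return {category: len(codes) for category, codes in by_category.items()}


if __name__ == "__main__":
    print(mentions_per_row(ROWS))
    print(participants_per_category(ROWS))
```

Counting sets rather than raw tokens means a participant who appears on several rows of the same category is counted once for that category.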

