Open Collections

UBC Theses and Dissertations

A camera-based approach to remote pointing interactions in the classroom Escalona Gonzalez, Francisco 2015

A Camera-Based Approach to Remote Pointing Interactions in the Classroom

by

Francisco Escalona Gonzalez

B.Sc., Universidad Nacional Autónoma de México, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

September 2015

© Francisco Escalona Gonzalez 2015

Abstract

Modern classrooms are brimming with technological resources, from sound systems and multiple large wall displays to the individual computers and cellphones of students. Leveraging this technology presents interesting opportunities in human-computer interaction to potentially create a richer experience that improves the effectiveness of classrooms both for lecturers and students. A key enabler of this experience is the ability for lecturers to manipulate objects on the screen by direct pointing, allowing them to be away from a computer for input. We designed and implemented a remote pointing technique that makes use of a web-camera and a pattern of shapes on the wall display to perform target tracking. A controlled study to evaluate performance and compare the camera-based technique to a traditional mouse for a target selection task in a classroom setting revealed that both devices have comparable error rates but that users are almost twice as fast with the mouse. The increased freedom of movement and immediacy of interaction provided by direct pointing makes the trade-off between speed and convenience reasonable. The technique does not require specialized hardware: the ubiquity of personal pocket cameras and computers makes target tracking with a camera a feasible future option for enabling direct pointing interactions on large wall displays in classroom settings.

Preface

The research presented in this thesis was carried out under the supervision of Dr. Kellogg S. Booth.
I was the primary researcher in all work presented. Peter Beshai provided the code for the i>Clicker driver that was used in the implementation of our pointing device.

Ethics approval for the experimental study with human participants was provided by the Behavioural Research Ethics Board at UBC under ID H11-01756.

The research reported in this thesis was funded under the Discovery Grant program by the Natural Sciences and Engineering Research Council of Canada and under the Network of Centres of Excellence program through GRAND, the Graphics, Animation and New Media Network of Centres of Excellence.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Two Important Input Devices for Pointing
    1.1.1 Light Pen
    1.1.2 Mouse
  1.2 Contributions
  1.3 Overview of the Thesis
2 Related Work
3 Chiroptic Tracker: Camera-Based Remote Pointing
  3.1 Design Constraints
  3.2 The Camera Tracking Process
  3.3 Implementation
    3.3.1 Grid of Markers
    3.3.2 Cursor
    3.3.3 Feature Extraction
    3.3.4 Computing Relative Coordinates
    3.3.5 Cursor Position
    3.3.6 Performance
  3.4 Known Limitations
    3.4.1 Occlusion Caused by the Grid
    3.4.2 High Color Contrast
    3.4.3 Chaotic Movement
    3.4.4 Lens Blur
    3.4.5 Lag
    3.4.6 Acute Angles
  3.5 Pilot Testing
4 Comparing Remote Pointing to the Mouse
  4.1 Hypotheses
  4.2 Empirical Models of Pointing Performance
  4.3 Method
    4.3.1 Participants
    4.3.2 Apparatus
    4.3.3 Study Design
    4.3.4 Procedure
  4.4 Results
    4.4.1 Movement Time
    4.4.2 Error Rates
    4.4.3 Throughput
    4.4.4 Subjective Data
  4.5 Discussion
5 Design Brief for a Classroom Interface
  5.1 Goals of the Interface
  5.2 Required Resources Available Today
  5.3 Styles of Interaction
  5.4 General Recommendations
  5.5 Example of Future In-Classroom Interaction
6 Conclusions and Future Work
Bibliography
Appendix
A Experimental Resources
  A.1 Consent Form
  A.2 Initial Questionnaire
  A.3 Final Questionnaire

List of Tables

4.1 The six pose conditions used in the study.
4.2 The four formulations of Fitts’s Law considered for our analysis. Each predicts movement time MT from target width W and amplitude (distance) of movement A.
4.3 Movement time models for the Fitts and Welford formulations. Significant differences in nested models are highlighted in bold.
4.4 Movement time models for the Shannon-Fitts and Shannon-Welford formulations. Significant differences in nested models are highlighted in bold.
4.5 Error rates for pose and target conditions.
4.6 Throughput means for all pose conditions, in bit/s.

List of Figures

1.1 The difference between relative and absolute pointing. In relative pointing (left) the cursor’s final position Pf depends on both the movement of the device and the cursor’s initial position, Pi. In absolute pointing (right) the cursor’s final position depends only on the device’s final position.
1.2 The light pen (top) and a diagram showing its composition (bottom). Taken from Sutherland’s dissertation.
1.3 A replica of the original mouse by Engelbart and English seen from its side and bottom. Images borrowed from the Computer History Museum’s webpage.
1.4 Optical system for a mouse. A light source, in this case an LED, shines through a plastic lens to illuminate the surface, while a small chip that contains a tiny camera processes the images to detect translation changes from one frame to the next. Image by Jeroen Domburg from ...
3.1 The three processes involved in chiroptic tracking. The display renders the cursor and screen contents, the sensor interprets what it sees in the display to extract a position, and the human brain adjusts for errors.
3.2 Two different perspectives of the same grid cell, as captured by the camera. Corresponding points are labeled with the same letter. The tracker has to deal with the distortions caused by perception. Note also the marked contrast in illuminations on the top and bottom sides of the left image.
3.3 The architecture of our tracker’s implementation.
3.4 Grid of fiducial markers. The black ellipses define a reference frame, the gray circles determine proper orientation, and the green squares encode row and column number.
3.5 A sequence of bits used to encode row and column numbers. Numbers from 0 to 15 are arranged so that the two high-order bits of one are the same as the two low-order bits of the next. Each of the 16 numbers appears once in the sequence.
3.6 Design of the cursor for the chiroptic tracker.
3.7 The results of feature extraction on one frame of the camera. On the left is the original image with perimeter pixels and the main axis of each ellipse highlighted in red. On the right is the binarized version used to find the ellipses, and their bounding boxes.
3.8 (Top) Difference between affine (a) and perspective (b) transformations. (Bottom) Although the affine transformation based on points A, B and C is not good enough to predict the position of point D, it finds a good approximation in point D′.
3.9 An early version of the chiroptic sensor that incorrectly identified four blobs, highlighted with red, green, blue and magenta, as the corners of a grid cell. The cause is extreme perspective distortion in the image and the use of weaker heuristics for grid identification.
3.10 Two of the grips used to hold the chiroptic tracker. The screwdriver grip on the left is commonly used to hold a laser pointer. The pencil grip on the right seems to be slightly more intuitive for aiming the tracker.
4.1 Illustration of the task performed by participants: (left) the cursor is moved from its starting position towards the darker rectangle, (center) if the user clicks correctly on the target the other rectangle becomes the new target, and (right) if instead the user misses the target it flashes red and the targets then switch.
4.2 (top) A picture of the room illustrating what participants saw. All lights were turned off during the study. (bottom) A diagram drawn to scale of the room layout. Positions #1 and #2 are perpendicular to the center of the screen, and position #3 is off to the side.
4.3 The physical prototype of the tracker device used by participants. The image also shows the pencil grip they were asked to use during the study.
4.4 Mean movement times in milliseconds at different indexes of difficulty for all pose conditions. Lines are included only for readability.
4.5 Comparison of average throughput values per participant. Participants were sorted by increasing throughput in the “no grid” condition, and the values for the “standing” condition are also shown.
4.6 Difficulty ratings subjectively reported by participants for the pointing task using the chiroptic tracker, from 1 (easy) to 5 (impossible).

Acknowledgements

Time flies when you are having fun, and two years at UBC can go by very quickly. Still, much has happened during this time that has allowed me to grow as a person, and I owe a debt of gratitude to many people who made my experience unforgettable.

First and foremost, I am thankful to all the people who took the time to share their knowledge with me and challenge me; my teachers, you are outstanding. Among them I have to single out my supervisor, Dr. Kellogg Booth, whose mind is a cornucopia of interesting ideas and projects, and whose kindness has given me space to grow in my field. Thank you for your time and your conversation.

One of the salient features that I love about the CS department at UBC is the willingness of its people to help each other, more so when everyone’s time is so constrained. I am thankful to my second reader, Dr. Jim Little, for his advice on this work that has made it better, and to all the others who likewise provided guidance and insight on my work.

My friends and fellow grad students are among the most capable and smart people I have met; working with them has been a pleasure. Thank you, Antoine, Kamyar and Derek, for our work together, and all the people in the MUX lab who inspire me with their work and friendship, especially Peter, Matt, Jessica and Oliver, for advice on things academic and otherwise.

To my family, most of them encouraging me from afar, but always present.
And to my wife, Jazmin, for the past and for the future.

I am blessed.

To Lulu, Héctor and Edel,
for everything.

Chapter 1
Introduction

The chiroptic¹ tracker is a pointing input device designed for instructors as a means of interacting directly with content on classroom displays. Those displays are usually large and out of reach, so the interaction happens at a distance and from all kinds of angles. The chiroptic tracker technique is based on the metaphor of pointing a camera at the large wall display as if a line were shooting out of its center to interact with the display at the point of intersection. We achieve this direct pointing interaction by interpreting the contents of the image captured by the camera, and figuring out a position on the screen to render the cursor. The tracker is aimed in much the same way as a laser pointer, but does not require more hardware resources than those currently available to instructors.

¹ This is a word we made up from two roots, chiro- and optic, defined in the Collins English Dictionary as:
    chiro- combining form indicating the hand; of or by means of the hand
    optic adj. (Anatomy) of or relating to the eye or vision
We are aware of a previous meaning of the word used in Chemistry to describe optical techniques for investigating chiral substances, but that is not how we will use the word. We use it to qualify an object that allows a person to “see through their hands”, possibly by holding a camera.

Modern classrooms are equipped with lots of resources: multiple displays driven by powerful projectors, sound systems, computers, networking, lighting controls, and all the personal devices contributed by their occupants, which means more displays, microphones, processors, cameras, etc. These resources are not being used to their full potential. Imagine some future time when all of these devices are the instruments of an orchestra: they tune and synchronize at the beginning of the class, identifying each other and their layout around the room, so that as the class develops they create a richer experience than what they could each bring in isolation. When students ask a question, microphones around them pick it up and send it to speakers at the other end of the room to echo it so that everybody can hear. Each personal laptop, tablet, or smartphone screen is a window into the content shown on the main displays that students can annotate together. The lecturer can move around freely and have conversations that engage everyone, and when an unexpected question comes up, both students and the lecturer have the freedom to modify the display without having to go all the way back to the lectern. These are real possibilities that can be explored with what is available right now. Direct pointing is an enabling capability for this kind of experience.

The highly specialized classrooms for subjects such as Geography, Chemistry and Art, with posters and tools always at hand, can be seen as interfaces that provide a better user experience. These classrooms make learning subject matter easier because the maps, burners and vials, or art supplies the class needs at any given time are readily accessible. A Geography classroom that does not have a map of a territory posted on a wall for reference when a class is trying to learn about its terrain is not doing its job well. In Don Norman’s terminology [34], it does not have good affordances for learning Geography. Other subjects such as Literature or History might require fewer visual displays, but they should present an interface conducive to discussion. Mathematics might require many surfaces to write on.
In sharp contrast to these examples, multi-purpose classrooms like those commonly found in universities are stripped of any specific content, and instead provide tools so that lecturers can “dress” them however they see fit for their class. This often means that the lecturer prepares beforehand a set of slides to show on the room’s main displays, to a great extent establishing the way the lecture will flow over time. This approach to learning is less organic and less flexible, and often lacks engagement for students.

Consider what a lecturer who wants to improvise would have to do to modify the contents of the screen. They often only have a limited mechanism that allows them to move back and forth between slides, so for a more complex maneuver they have to physically approach their computer to use the keyboard and mouse, getting engaged in their personal screen, setting things up and then glancing back towards the display to check that it is showing what they need before resuming where they left off. If we acknowledge that lecturing is a little like putting on a show, then this interaction is terrible because it leaves the audience unattended, completely ignored by the performer, disrupting their attention and the flow of the conversation. The interface is in the way of the users. These limitations of the classroom can be overcome with better interfaces, and direct pointing is a crucial element.

It is not that current input devices do not work, but they were designed for a different kind of environment, a more personal one. If we use them naively in the classroom, they create an obstacle between the users and what they want to do.
Current devices do not offer good affordances for the kind of interaction we believe is important in the modern classroom.

In this thesis we explore the problem of direct pointing on large wall displays in a classroom setting, using technology already available to lecturers and without prior calibration to the specific classroom in which it is deployed. We design a technique that matches the requirements that we identify, we describe its implementation, and then we test it against the mouse in a pointing task.

Our focus is on providing an improved input technique for the lecturer. While members of the audience could also use the technique presented here, they already have better means of interacting with the contents of the display, such as their network-connected personal computers, or personal response systems such as the i>Clicker. Previous research by Beshai [6] and Shi [43] has already looked into that side of classroom interactions, so we will not go further into it here. For an example of a system that allows the audience to share and control the large display, Liu and MacKenzie offer LACOME [23, 28] as a solution. In any case, it is worth noting that improving the experience for the lecturer will likely also improve the experience for the audience, by creating a more engaging and dynamic classroom environment.

Two fundamental assumptions guide our research:

1. Remote pointing in the classroom can be achieved without introducing any more hardware resources than those commonly available to lecturers: a general-purpose computer controls the contents of a large wall display, and the lecturer holds a camera that can communicate with the computer. Nothing else is required.

2. The lecturer’s interactions with the display are sporadic.
Among other things, this implies that it is acceptable to trade some pointing speed for the convenience of the lecturer and for clarity that keeps the audience engaged.

1.1 Two Important Input Devices for Pointing

Pointing devices can be used for either relative or absolute input, and some can be used for both modalities. Absolute pointing means that there is a 1–1 correspondence of positions in physical space to positions in the virtual display, so that physically pointing or moving to the same physical position will always give the same result in virtual space. A prime example of this is a touch-sensitive surface, such as modern tablets and smartphones, where touching a part of the screen means also “clicking” on whatever is shown at that position in the display. In contrast, other devices have relative positioning, like the mouse, where the cursor’s final position depends on its initial position on the screen, even when the exact same movement is performed with the mouse on a surface. Figure 1.1 illustrates these concepts.

Figure 1.1: The difference between relative and absolute pointing. In relative pointing (left) the cursor’s final position Pf depends on both the movement of the device and the cursor’s initial position, Pi. In absolute pointing (right) the cursor’s final position depends only on the device’s final position.

Another aspect of pointing devices is that their control-to-display ratio (or its reciprocal, gain) can be adjusted to tune their sensitivity. Gain is a multiplicative factor that is applied to the movement of the device to increase or decrease the movement in the virtual space. A small movement of a mouse with high gain can make the cursor go from one side of the screen to the other. This has the advantage that users have to spend less effort to move bigger distances in virtual space, but comes with the trade-off of lowering their precision, because it is harder to point at things precisely when even the slightest movement will cause a large change in position. On the other hand, a low gain value means that the device has to be moved a large distance to make the cursor go from side to side.

Users of the mouse who have experienced low gain settings are probably familiar with a technique in which, after moving the mouse through a large displacement, it is raised and brought back and moved again, repeating this perhaps many times to reach the desired target destination. When the mouse is raised it “stops working”, in the sense that it does not change the position of the cursor on the screen while it moves through air, but as soon as it is set back down the cursor starts moving again. This technique is called clutching and can be used in other devices as well to temporarily disable them while the user gets into position, preventing unwanted movement of the cursor. Naturally, an interaction that requires much clutching takes more time and can be frustrating, but low gain values increase precision.

Modern operating systems can choose to vary the control-to-display ratio of a pointing device, and often do this dynamically based on the speed of movement, usually without the user noticing. When they detect a fast movement they increase gain, making the assumption that the user wants to go to a far target, but the value is decreased for smaller movements to improve accuracy and precision. Varying gain dynamically can improve users’ performance with a device, automatically balancing speed and precision trade-offs.

Gain is most often associated with relative pointing, but it can also be used for absolute pointing. The defining property of absolute pointing is that there is a 1–1 correspondence between control position (the location of the input device) and display position (the location of the cursor).
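
As an illustration of the concepts above, the following Python sketch contrasts a relative mapping (with gain and a clutch) against an absolute one. It is a minimal sketch under stated assumptions, not code from this thesis: the names RelativePointer, absolute_position and dynamic_gain, the two-level gain values and the speed threshold are all hypothetical, and a real operating system may use a much smoother transfer function than the step shown here.

```python
from dataclasses import dataclass


@dataclass
class RelativePointer:
    """Relative pointing: the cursor moves by gain * device displacement."""
    x: float = 0.0
    y: float = 0.0
    gain: float = 1.0
    engaged: bool = True  # False while the device is clutched (e.g. mouse lifted)

    def move(self, dx, dy):
        # While clutched, device motion is ignored and the cursor stays put.
        if self.engaged:
            self.x += self.gain * dx
            self.y += self.gain * dy
        return (self.x, self.y)


def absolute_position(device_x, device_y, scale=1.0):
    """Absolute pointing: the cursor depends only on where the device is now.

    A scale factor other than one changes the gain, but the mapping is
    still one-to-one between control and display positions.
    """
    return (scale * device_x, scale * device_y)


def dynamic_gain(speed, low=1.0, high=4.0, threshold=10.0):
    """Hypothetical two-level gain: fast movements get a higher gain."""
    return high if speed > threshold else low
```

With a gain of 2, a device displacement of 5 units moves the cursor 10 units from wherever it started. The same displacement made while the pointer is clutched leaves the cursor where it was.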
When the correspondence has no scale factor, the gain is one, but when there is a scale factor between control and display the correspondence is still 1–1.

Because both relative and absolute pointing mechanisms can make use of clutching and gain, in a remote pointing setting like the one we are interested in, clutching can be used to activate the device only when the user needs it, and get it out of the way when they do not. This blurs the distinction between relative and absolute pointing, the difference being more about how clutching is triggered than about whether the mapping is relative or absolute.

Gain occurs naturally in absolute pointing: if users are close to the screen, the angles involved create a low gain that increases precision, and conversely when users are far away the angles create a high gain that allows them to reach all corners with reduced effort. Gain can be provided in software to simulate this, somewhat analogous to how camera angle changes viewing perspective. Changing gain for absolute pointing changes the angular mapping for cursor location.

In the next sections we briefly look into other devices that are relevant to our research for a variety of reasons: their popularity, the ideas they bring to the discussion, or because they are used in settings similar to the classroom where there is a large display and an imbalance of roles between lecturer and audience. We describe how each device works and highlight key ideas, weaknesses or strengths that they exhibit, with the purpose of informing our own design. Most of the historical claims in this section are based (without explicit citation) on Buxton’s excellent historical survey [8]. Other sources are cited explicitly in context.

1.1.1 Light Pen

In the early 1950s Robert Everett created an input device called the Light Gun that looked like an actual gun with a handle and barrel, which the user pointed to the screen to read the position of an object by pressing the trigger. The Light Pen was descended from it, designed in 1957 by Ben Gurley to allow users to interact with a computer via a stylus-shaped input device pointed directly at the screen. It consisted of a light sensor tuned to detect the distinct light signal emitted by the phosphor-coated glass in CRT displays of the era. The electron beam of the display would activate the phosphor at specific positions one dot at a time, and the light pen reacted when it saw the initial peak of the light emission. By keeping a record of the position of the dots being rendered, the pen’s reaction told the computer roughly at what part of the screen the user was aiming, enabling direct pointing input on the display.

Ivan Sutherland introduced Sketchpad [46] to the scientific community in 1963, a system that implements drawing on computer displays directly using a light pen as a pointing device. In his PhD dissertation he describes many ideas for interacting with computers in graphical ways. Some of the concepts he discusses have been adopted in modern interfaces, including the notion of direct manipulation by pressing a button to select an object or drag it around. A light pen like the one shown in Figure 1.2 was a central piece in his work, because it enabled a type of interaction with the computer that people could relate to well (it was like drawing on the screen), giving the user a greater freedom of movement and expression that was not possible using only buttons and knobs. Sutherland described in detail the techniques he designed to track the light pen’s position across the screen continuously, and some finer details of his implementation that serve as inspiration when designing other pointing devices.

A critical point is that the sensor needed to see some light in order to know its position, but when the user moved it away from the light source the position was lost again.
In order to track the pen continuously the system needed to make sure there was always some shape being drawn under it, or close enough that it could be seen. A clever solution to this problem was the use of the cursor. The cursor is a visual indicator of what position the user is pointing at, but it is itself a shape on the display, and if it is moved fast enough mirroring the light pen there will always be a shape to see and thus get position information from, creating a nice loop that enables tracking. With a field of view of about half an inch on the light pen, Sutherland was able to follow the path of the user’s movement at a speed of 20 inches per second across the screen if he refreshed the cursor position 100 times each second. This tracking process alone took 10% of the computer’s resources on the first version Sutherland implemented, but it worked well enough that he could develop his interface, Sketchpad.

Figure 1.2: The light pen (top) and a diagram showing its composition (bottom). Taken from Sutherland’s dissertation.

Users would first have to activate the light pen by pointing it at any shape on the screen, a process Sutherland called “inking up”, which would cause a cursor to be drawn to indicate that the light pen was now tracking. They would then move it to perform whatever action was required, and when they were done they would just have to flick it fast enough to make the sensor lose track of the cursor, causing it to disengage. That was the clutch mechanism for the light pen.
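
The figures quoted above can be checked with a short calculation, and the acquire, track and release cycle can be sketched as a small state machine. The Python below is only an illustration: it assumes a one-dimensional pen position, and it assumes that tracking is lost when the pen moves more than half the field of view between two cursor refreshes, which is a simplification made for this sketch rather than a rule stated by Sutherland. All names are hypothetical.

```python
def travel_per_refresh(speed_in_per_s, refresh_hz):
    """Distance (inches) the pen covers between two cursor refreshes."""
    return speed_in_per_s / refresh_hz


class LightPenTracker:
    """Toy one-dimensional model of acquiring, tracking and releasing a cursor."""

    def __init__(self, field_of_view=0.5, refresh_hz=100.0):
        # Assumption: the pen can still see the cursor if it is within
        # half the field of view of where the cursor was last drawn.
        self.reach = field_of_view / 2.0
        self.refresh_hz = refresh_hz
        self.cursor = None  # None means the pen is not tracking

    def ink_up(self, pen_pos):
        """Acquire: start tracking by drawing a cursor under the pen."""
        self.cursor = pen_pos

    def refresh(self, pen_pos):
        """One refresh step: follow the pen, or release if it moved too far."""
        if self.cursor is None:
            return None
        if abs(pen_pos - self.cursor) > self.reach:
            self.cursor = None  # a fast flick makes the sensor lose the cursor
        else:
            self.cursor = pen_pos  # redraw the cursor under the pen
        return self.cursor
```

At 20 inches per second and 100 refreshes per second the pen travels 0.2 inches between refreshes, which stays inside the 0.25 inch reach assumed here. A jump of more than 0.25 inches in a single step releases the cursor, mimicking the flick used as a clutch.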
A user would have to learn how to perform these three actions: acquiring the cursor, tracking, and releasing the cursor, to use the light pen effectively in his system.

Different cursor designs were tried for the light pen, from a cloud of random points to a tracking cross built from individual dots that could be spaced apart evenly, or as Sutherland chose to do it, logarithmically, presumably because this increases the concentration of points at the edges of the cross, which is the portion most likely to be in the sensor’s field of vision. He also experimented with the idea of anticipating cursor movement by using constant velocity and constant acceleration equations, although he reports that those techniques caused instability and he did not pursue them further. In Chapter 3 we will talk about why this happens, and what can be done to deal with it.

Sutherland gives another important insight in making a distinction between “actual” and “pseudo” locations. He realized that the actual position the user is pointing to is often not as important as what they mean to point to, which in the case of Sketchpad could be a line, a point of intersection, or even an abstract quantity such as the length of a line segment. He used a host of geometry heuristics and threshold values to try to tease out this information from the user’s actions, and would often render the cursor at its pseudo position to give better visual feedback to the light pen holder.

This is an important lesson of the light pen that we want to stress: the cursor is an efficient design that has multiple purposes, directed to different observers. On the one hand it indicates to the user the screen position or object they are interacting with, but on the other it allows the light pen’s position to be tracked by the computer. Each observer, user and computer, benefits differently from the presence of the cursor, and it thus needs to be tuned to assist both at the same time.

Light pen technology is not in use anymore. CRT-based line-drawing systems are hard to find because they have been displaced by raster devices that refresh the whole screen at the same time instead of one shape at a time. Rendering now takes place in a virtual canvas that is copied all at once many times per second onto the display, which means that we cannot use the same interrupt-driven mechanism to get position information from the shapes being drawn on the screen. We know that direct pointing is still a desirable interaction method, as evidenced by the prevalence of touch sensitive surfaces in all kinds of displays these days, but now it is performed differently. Instead of a light sensor we now use capacitive surfaces, infrared beams or other hardware to do it. Unfortunately, in a room with a large display that is potentially out of the user’s reach, direct touch is not a viable option.

Stepping back a bit, we see that the light pen was basically a light sensor, like a camera, mounted inside a pen-shaped holder and connected to the computer. A style of interaction similar to it is conceivable for remote pointing in a classroom. When users need to interact they point a camera at the screen and specifically ask to acquire a cursor. The interaction then takes place, after which the cursor goes away because it is not needed anymore. The ideas embodied in the light pen remain valid and inspirational. If the camera is in direct communication with the processor in charge of rendering the contents of the visual display, we could recover position information for pointing provided there is enough “stuff” to see in the field of view of the camera. The question is what needs to be displayed for the camera to get this position information.
This will be discussed in detail in Chapter 3, where we present the chiroptic tracker inspired by early light pens.

1.1.2 Mouse

The mouse is perhaps the best known pointing device because of its widespread adoption for desktop computers. It was developed by Doug Engelbart and William English at the Stanford Research Institute (SRI) in California in 1964, and presented to the world in December of 1968 in what became a very famous demonstration of new ideas for human-computer interaction. It was a successor to the trackball, a pointing device that consisted of a ball nested inside a mechanism that detected its rotation to move the cursor on the screen. The original mouse consisted of a hand-sized box with two perpendicular wheels attached to its bottom face that translated their rotation into X and Y displacements via rotary potentiometers. A replica from the Computer History Museum is shown in Figure 1.3.

Figure 1.3: A replica of the original mouse by Engelbart and English seen from its side and bottom. Images borrowed from the Computer History Museum’s webpage.

The idea to turn the trackball upside down so that the ball rested on a planar surface came in 1968 from a group in Germany led by Rainer Mallebrein, and independently again in 1973 from a group in Xerox PARC led by Ronald Rider. Ball mice were basically the same technology as the SRI original, but improved the experience of the user: the ball transferred its movement over a surface to two wheels inside the casing that translated it into screen coordinates. Another improvement came in 1981 from two independent teams, one led by Steven Kirsch and another by Richard Lyon, through the use of an optics system that read displacements in a pattern of shapes as the mouse was moved over it.
The advantage the optical mouse had over the mechanical one was that the latter usually picked up dust and dirt from the surface where it was used, eventually degrading the moving parts inside and damaging the mouse. A disadvantage of the optical mouse was that it needed the presence of the pattern to work. Almost 20 years later, in 1999, Agilent Technologies developed a new version of the optical mouse that got rid of that limitation by constantly taking “pictures” of the surface where it was being used and comparing them in sequence to determine the magnitude of movement. Optical mice are very popular devices these days, with an LED mounted on the bottom arranged next to a lens that refracts its light to shine on the working surface, so that the camera can pick it up. A laser diode can also be used, which improves the contrast of elements in the surface to make the mouse work almost anywhere.

We can simulate the workings of an optical mouse by setting up a camera to look at a desk at a constant angle while we move it over its surface, comparing successive images of the desktop to compute a translation value, and converting that into a command for cursor displacement. Our simulation would run much slower than the mouse, but the underlying principles that make it work would be the same. Optical mice are highly optimized hardware that works with very low resolution cameras, sometimes only 18×18 pixels, with a sampling rate of hundreds or thousands of frames per second. Figure 1.4 shows the elements of the optics system for a mouse.
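The simulation described above, comparing successive images to compute a translation, can be sketched in a few lines. This is a minimal illustration and not how any particular mouse sensor is implemented: it assumes integer pixel shifts, a camera that only translates over the surface, and an exhaustive search that minimises the mean squared difference between overlapping pixels. The names `grab_frame` and `estimate_shift` and the random "desk" texture are hypothetical.

```python
import random

FRAME = 18  # frame side in pixels, matching the 18x18 sensors mentioned in the text


def grab_frame(surface, left, top, size=FRAME):
    """Cut a size x size window out of the surface, as a camera at fixed height would see it."""
    return [row[left:left + size] for row in surface[top:top + size]]


def estimate_shift(prev, curr, max_shift=3):
    """Return the camera displacement (dx, dy), in pixels, between two frames.

    Every candidate shift within +/- max_shift is tried, and the one whose
    overlapping pixels differ least (mean squared difference) is kept.
    A pixel curr[y][x] is compared against prev[y + dy][x + dx].
    """
    height, width = len(prev), len(prev[0])
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err, count = 0.0, 0
            # Only visit pixels for which the shifted position is still inside prev.
            for y in range(max(0, -dy), min(height, height - dy)):
                for x in range(max(0, -dx), min(width, width - dx)):
                    diff = curr[y][x] - prev[y + dy][x + dx]
                    err += diff * diff
                    count += 1
            err /= count
            if err < best_err:
                best, best_err = (dx, dy), err
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical textured desk surface, 64x64 grey levels.
    desk = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
    before = grab_frame(desk, 20, 20)
    after = grab_frame(desk, 22, 19)  # camera moved 2 px right and 1 px up
    print(estimate_shift(before, after))
```

The exhaustive search is quadratic in the search radius, which is one reason a software simulation like this runs far slower than dedicated hardware working on tiny frames.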
Mice achieve such speeds by using low resolution images and doing computations in the mouse itself, so that when we move the mouse we perceive an instantaneous change in cursor position on the screen.

Figure 1.4: Optical system for a mouse. A light source, in this case an LED, shines through a plastic lens to illuminate the surface, while a small chip that contains a tiny camera processes the images to detect translation changes from one frame to the next. Image by Jeroen Domburg.

Engelbart’s 1968 demo presented many important ideas that have been very influential in modern interfaces, from video conferencing to hypertext, and the mouse became the tool that enabled what has been called “direct manipulation”, a technique by which the user can point and interact graphically with objects shown on the screen, and affect them directly by pressing buttons, to do things such as selecting, dragging or highlighting. It was a crucial part in enabling a new metaphor for human-computer interaction. To a lesser degree but in the same vein, we believe that remote pointing is an enabling technology for powerful interaction metaphors with large displays that can change how interactions take place in classrooms.

The mouse was designed for personal use on relatively small displays, and requires the user to be fairly stationary so the mouse can rest against a surface while the user operates it. In that sense it fits very well on a desktop with a personal screen, but not necessarily in a room with a large shared display where the user is roaming around. A mouse is usually used as a relative pointing device: each time we move it on the desk the cursor moves relative to its previous position on the screen, not to an absolute position determined by the mouse location on the desktop.
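The difference between relative and absolute pointing can be made concrete with a small sketch. The class and function names are hypothetical, and the `gain` factor and the clamping of the cursor to the screen edges are illustrative assumptions rather than details taken from the text.

```python
class RelativeCursor:
    """Cursor driven by displacements, as with a mouse."""

    def __init__(self, width, height, gain=1.0):
        self.width, self.height, self.gain = width, height, gain
        # Start in the middle of the screen; only later displacements matter.
        self.x, self.y = width / 2.0, height / 2.0

    def move(self, dx, dy):
        """Apply a device displacement; the result depends on the previous position."""
        self.x = min(max(self.x + self.gain * dx, 0.0), self.width - 1.0)
        self.y = min(max(self.y + self.gain * dy, 0.0), self.height - 1.0)
        return self.x, self.y


def absolute_cursor(u, v, width, height):
    """Absolute mapping: a normalized device position (u, v) in [0, 1]
    always lands on the same screen point, whatever came before."""
    return u * (width - 1.0), v * (height - 1.0)


if __name__ == "__main__":
    cursor = RelativeCursor(100, 100)
    print(cursor.move(10, 0))   # moves from the previous position
    print(cursor.move(10, 0))   # same input, different result
    print(absolute_cursor(0.5, 0.5, 101, 101))  # same input, same result every time
```

With the relative mapping the same device movement produces a different cursor position each time, whereas the absolute mapping ties every device position to one screen location.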
A mouse is also the prime example of clutching: by lifting it from the desk we can prevent the cursor from moving, and setting it back down in a more advantageous position brings it back into motion.

The cursor used in personal computers is often an arrow with the tip indicating the point of interaction. It is designed to be small enough that it minimizes occlusion of targets on the screen while still remaining visible to the user who is controlling the cursor. A user that loses sight of the cursor will often move the mouse around to find it again, which works due to our increased visual sensitivity to moving objects (see Ware [50]). This is an easy trick that most mouse users have learned (perhaps of necessity) on their own. It helps balance the ease of visually finding the cursor when it is lost with the designer’s desire that the cursor blend out of the way when not needed. Once users have found the cursor they can track it without difficulty because they are the ones moving it, and so they know where to look for it; but a passive observer will probably have more trouble following the cursor and thus the interaction that is taking place. In a shared setting the cursor design should be rethought to improve the experience of the audience if that is an important goal of the shared interaction.

The most common type of mouse is restricted to two dimensions by design: it works only when on a surface. This has a few advantages. For example, optical mice only need to be concerned with movement in two dimensions, so the computations from frame to frame are easier to perform and the lens can be optimized for the fixed distance of the camera to the surface. Another advantage is that by placing buttons on top of the mouse it can be used as an input device for selection: the buttons require the user to press on an axis perpendicular to the surface, so clicking them does not cause the cursor to move.
A six-degree-of-freedom freehand camera will have a greater challenge interpreting movement change, and placing buttons on it can potentially cause the user to accidentally change the pointing position while clicking a button.

A new user of the mouse has to learn to control it. Some users are so accustomed to the mouse that they can operate it without much thought, but initially they were not as proficient and only became skilled with time. For many it is a worthwhile investment, so it is important to recognize that proficiency with an input device may not be automatic. A simple exercise shows that we are not particularly good at using a mouse without visual feedback. Try the following: take a look at a computer screen with a mouse attached, notice the position of the cursor and choose a target some distance away, make a mental image of what you are seeing and close your eyes, then move the mouse to the target. Research by Phillips and Triggs [37] suggests that you will probably miss. This also happens outside the computer. Try the same exercise with objects on your desk: reaching for one of them with your eyes closed will probably make you miss, knock it over, or at best slow you down considerably.

But humans are adaptable and they learn new skills over time. Our sensorimotor system automatically adjusts to our movements based on feedback from the senses to help us reach our target, which is a fundamental part of why the mouse works so well: we do not even have to make a conscious effort to do it. However, for this to work there should not be too much delay between the movement and the feedback or our error correction mechanism starts to degrade, as research by MacKenzie and Ware [27] suggests.

Mice are ubiquitous pointing devices that for many years have provided tremendous value for our interactions with computers. They will likely continue to do so for some time.
They were designed for a different purpose than the one we are interested in, interaction with a large display in a classroom setting, but the insights gained from this brief review will be very valuable to us.

1.2 Contributions

This thesis presents the following four primary contributions:

• A description of an architecture for chiroptic trackers. The processes involved and the challenges that have to be overcome to design them are presented. This high-level view provides insights and advice for anyone interested in studying the technology. It forms a framework that allows many possible implementations.

• A proof-of-concept implementation of a chiroptic tracker, demonstrating that it has a realizable, low-cost instantiation that requires only a consumer-grade webcam. We provide details and clarification on many points that might not be obvious from the architecture’s abstract description, and we analyze the weaknesses still present in the prototype along with ideas on how to improve it.

• Results from a controlled experimental study that compares the proof-of-concept chiroptic tracker to a mouse and provides baseline information on performance. The study reveals that the tracker is not only feasible, it also works well for remote pointing tasks in the classroom setting. The results verify Fitts’s Law models of movement time for pointing tasks, which can be used by HCI researchers and interface designers.

• A design brief explaining the key ideas desirable when building a classroom interface using direct pointing with available hardware. The design can be realized with the chiroptic tracker or any other device that provides similar functionality.

1.3 Overview of the Thesis

In this first chapter we have presented a comprehensive introduction to the topic of the thesis and background information on the light pen and the mouse.
The light pen is the inspiration for the chiroptic tracker; the mouse is both the current dominant technology and the target for replacement. Chapter 2 provides further background. It discusses related work directly influencing ours, or that presents alternative approaches to solving the problem we are tackling.

The design and implementation of the chiroptic tracker is presented in detail in Chapter 3, where we also discuss its limitations and opportunities for further development. In Chapter 4 we report the results of the controlled study performed in a classroom at the University of British Columbia, where participants performed a Fitts’s Law-style pointing task with our camera-based chiroptic tracker and also with a mouse for comparison.

The research presented in this thesis validates the chiroptic tracker in various dimensions. In Chapter 5 we discuss how a novel classroom interface could be designed using it, considering the lessons learned in previous chapters. Lastly, Chapter 6 summarizes the work that has been presented and lays down a path for future work.

Chapter 2
Related Work

The idea of using the camera in a cellphone to control a cursor is not new. Madhavapeddy et al. [29] proposed enhancing widgets on the display with markers similar to the ones used for tracking in augmented reality environments, so that a user wielding a camera cellphone can twist, move or otherwise interact with the widgets by pointing at them. Rohs [40] went into detail about how this kind of marker can be used to detect position using the camera, describing the algorithms involved. Their markers are very similar to the two-dimensional barcodes that are a popular tool to encode a text string, and so in addition to giving an estimate of the camera’s six-degree-of-freedom position, they can be used to tell apart different targets or displays.

Ballagas et al.
[5] extended this work to create an interaction called “point and shoot”, where users aim through their phone screens at a target of interest on a large display, and by pressing a button a grid of markers is shown for a few moments while the camera captures a picture, which is then processed to decode the markers and decide what object they are pointing to. After this is done, the grid goes away and the user can continue interacting with the selected object using other means. While the same marker grid could be used to provide cursor tracking, the markers occlude most of the screen due to their size, so they are not ideal. Similarly, an approach like the one by Celozzi et al. [10] to detect camera position, using markers like those proposed by Fiala [12], can be an accurate tracker but is not optimal due to its denseness. Ballagas et al. also described another technique they call “sweep”, which is an analogous mechanism to the optical mouse. It works by analyzing motion flow of consecutive images, and the user does not have to point at the screen; any surface will do as long as it has recognizable features. Both of their techniques were reported to exhibit a 200 ms delay on the phone at the time, and selection with “point and shoot” can be done in around five seconds, but we should keep in mind that hardware is faster now, so these times have probably improved.

Decoding position information from shapes is an idea similar to the “structured light” technique used by computer vision researchers, where a known pattern is projected onto a surface and the result analyzed by a camera to determine properties such as shape and depth. Our problem is simpler because we are interested in a planar surface, but the ideas are insightful and will contribute to our discussion in Section 3.4.

Salvi et al. [41] present a survey of pattern encoding techniques.
Temporal encoding consists of sequences of patterns that change through time. A camera is carefully synchronized with the projector to pick up the way the pattern is distorted as light falls on the scene, and from that extracts a three-dimensional approximation of what it sees. Spatial encoding uses a two-dimensional pattern such that the neighborhood of each projected point is unique and can be recognized by finding corresponding points. Direct mapping projects shapes or numbers directly onto the scene that encode position. This is the most similar strategy to the marker-based approach discussed earlier.

Displaying the patterns for structured light can be very intrusive, but there are ways to make it invisible to the human eye, as explained by Fofi et al. [14]. One idea is to use infrared projection, which can be picked up by a camera but not by humans. A different approach uses high-speed projectors to quickly switch between the pattern and its negative, so that a camera with a very short shutter time can see it, but our perceptual system will fuse both patterns into a solid grey color. The main disadvantage we see with using structured light as described is that it requires equipping the classroom with more specialized equipment than what is commonly available right now; however, it might become an interesting line of research as this type of hardware becomes more common.

There have been efforts to do feature-based tracking using a camera without the need for extra visual clutter from markers. Jeon et al. [20] propose a range of techniques to do cursor manipulation using a camera phone, one of which is called “marker-cursor”, where, as the name suggests, the cursor is the marker. They use a square marker that allows them to calculate a coordinate transformation function (a homography), and the interior of the marker is a triangle that provides both a point of interaction and a way to fix the orientation of the homography. Jiang et al.
[21] use a different approach without the use of any specialized markers, cleverly taking the last two frames of the camera and using the cursor displacement to compute a different transformation function, an affine transformation, which they use to approximate the cursor position. Their approach is less powerful in that it only provides two-dimensional translation and rotation information, and when the cursor is relatively stationary the rotation information cannot be computed.

Both of these approaches suffer from the limitation that there is no recovery mechanism when the user moves so fast that the camera loses track of the cursor, meaning they have to go back to where it was left and begin dragging it again. This is an inherent limitation of the “marker-cursor” approach. More recently, Baldauf et al. [3] have adapted more advanced vision techniques to do feature-based tracking of a scene in real time with a camera phone, which does not need any markers at all. They describe a general framework for multi-user and multi-screen interaction that works by computing a homography from a baseline image provided by the screen, and matching of features on the camera image. This is a promising approach that gives full six-degree-of-freedom information of the camera pose, and is reported to work at interactive rates (although no numbers are provided). The challenge remains of how to do cursor tracking on a blank screen, for example for drawing applications.

Computing the camera pose from consecutive images of a scene is one aspect of estimating optical flow, the apparent change in position of image elements through time; brightness patches can change from one frame to the next for reasons other than motion, so observing optical flow does not necessarily mean a movement happened. By imposing certain constraints on our interpretation of a scene, algorithms can build a geometric model from optical flow data to estimate the position of the camera relative to the scene.
This model can be used to find the point of intersection of the camera’s optical axis and some other object, such as a large wall display.

One way of computing optical flow is through feature-tracking algorithms, which, as their name suggests, match individual points, edges or corners across different images. An example is the Lucas-Kanade algorithm [24], which makes three assumptions: matching points look the same in every image (they have the same brightness), movements are relatively small, and points move similarly to their neighbors. These assumptions provide enough constraints for an equation that estimates the magnitude and direction of movement from one image to the other, and by assuming small movements the time required to compute the solution is kept manageable. Some image features can be tracked better than others, specifically textured regions work better than plain regions, so the presence of trackable features is important when estimating motion. The technique presented in Chapter 3 is simpler because it assumes the presence of a specific pattern of features, that they all lie on a plane and that they do not move, so any movement comes from the camera: the algorithm was designed to take advantage of these constraints.

There are a variety of alternative techniques that can be used for interacting at a distance with large wall displays. The survey by Ballagas et al. [4] is a great resource for camera-based techniques, including direct and indirect pointing. An example is C-Blink by Miyaoku et al. [30], which consists of a screen-mounted camera that is trained to look for color sequence patterns on cellphone screens. Users can move their cursors on the screen by running a program that flashes one of these patterns, showing it to the camera as they move their arms around.
An easier approach is to use the touchscreen of a smartphone as a sort of remote control for the cursor, as illustrated by Deller and Ebert [11], where dragging on the phone’s screen moves the cursor on the large screen. This removes the benefits of direct pointing, but works on any display. Gallo et al. [16] implemented an algorithm that uses the phone’s camera to track the hand of the user in front of it, and the tip of their fingers becomes the cursor position. They did this at an impressive 30 fps on a camera phone with a very limited processor by today’s standards, but the algorithm’s performance suffers with increased background clutter.

Another common approach involves more advanced camera systems, like those used for motion capture by Vicon [47] and OptiTrack [19], and consists of equipping a room with various precisely calibrated infrared cameras that track reflective markers worn by users in the room. By placing the markers correctly, one can obtain a model of the user’s joints and bones, or of a hand-held wand, which can then be used to compute a virtual point of intersection with the display. The main benefit of this technique is that it is very precise, but it requires a lot of equipment that has to be previously installed and calibrated, and so is a more expensive solution.

Others have used depth-sensing cameras to approximate the model of the user’s body and carry out the same task with less precision. Muradov [31] implemented such a system where a Microsoft Kinect camera tracked the user’s movements, built a user model, and then computed a cursor on the screen as previously described.

Sharp et al. [42] created highly optimized algorithms that use the same sensor to recognize a hand with high accuracy. By using the joint positions of the index finger we can determine a cursor position on the screen where users point, without them wearing any special equipment.
The main disadvantage of depth sensors is that they have a restricted field of vision, only a couple of meters wide, so to cover a large area one needs to use multiple sensors carefully synchronized. This increases the cost of implementation and the system complexity considerably. In addition, the infrared light used by all of these devices may conflict with other technology in the room, like some models of personal response (clicker) systems.

Alternatively, the depth-sensing camera can be held in hand and pointed at the scene. KinectFusion, by Newcombe et al. [33], uses this approach to generate a volumetric surface reconstruction of room-sized scenes in real time, which continuously tracks the six-degree-of-freedom position of the depth camera while it moves around the room. Identifying the large display in the scene, the camera pose can be used to estimate where a user is pointing. As depth sensors become more widely available this technique should be of great relevance for direct pointing.

Laser pointers are commonly used by lecturers in presentations to highlight material on the screen. By using a room-mounted camera that is calibrated for the screen’s position and dimensions, a system can determine the position of the laser beam on the screen to enable interactions, as described by Kirstein and Mueller [22].

Olsen et al. [36] provide advice to implement such a system, and also proposed an interaction scheme based on synchronized collaboration with the windowing system to allow the user to press buttons, select items from drop-down menus, etc. They suggest that the cursor should change depending on the mode of interaction that is going on, and proceeded to test their system to measure its effectiveness. Their system had considerable lag and low sampling rate, due to technological limitations at the time, but worked sufficiently well for most of the purposes they designed.
Although their test was not a standard one, they observed that users took about twice as long to perform a series of tasks with the laser compared to the mouse. Among their findings they realized that there is a real problem of hand jitter that makes interaction with small objects difficult.

The jitter problem can be expected to increase the farther the user is from the display. A nice feature of using laser pointers like this is that the light on the screen is an effective cursor, easily seen by the audience and by the camera, and its position is updated without delay. On the other hand this means we cannot use techniques like adjusted gain to improve pointing precision. The laser tracking technique has seen many variations, and is a popular choice because it is relatively low-cost, requiring only the presence of a calibrated camera in the room. It can even be extended for multiple users by using modulated light patterns that help distinguish different lasers, as shown by Vogt et al. [49], and some researchers have experimented with infrared lasers, which are invisible to the human eye and so once more allow the virtual cursor to be separated from the laser beam, for example Cavens et al. [9]. Interestingly, users performed poorly with the IR laser when compared to a visible light laser, which the authors speculate is because of the delay caused by the rendering of the virtual cursor.

Comparisons of devices, like the ones performed in the cited research, can be carried out in different ways. One of the most frequent is to use the theory developed by Fitts [13], where he describes a model of movement time prediction for different tasks that involve users selecting targets and interacting with them in different ways. It consists of a linear model that takes into account the relationship between width of the targets and the amplitude of the movement, which he summarizes in a quantity known as the index of difficulty of the task.
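As a point of reference, the linear model just described can be written down directly. The sketch below uses Fitts's original index of difficulty, log2(2A/W), and also shows the Shannon formulation, log2(A/W + 1), that is common in HCI work; which variant is used later in this thesis is not stated in this passage, and the intercept and slope passed to `movement_time` are hypothetical values chosen only for illustration.

```python
import math


def index_of_difficulty(amplitude, width):
    """Fitts's original index of difficulty, in bits: log2(2A / W)."""
    return math.log2(2.0 * amplitude / width)


def index_of_difficulty_shannon(amplitude, width):
    """Shannon formulation of the index of difficulty, in bits: log2(A / W + 1)."""
    return math.log2(amplitude / width + 1.0)


def movement_time(amplitude, width, intercept, slope, id_fn=index_of_difficulty):
    """Linear model: predicted movement time = intercept + slope * ID.

    intercept (seconds) and slope (seconds per bit) are fitted per device
    from experimental data; the values used below are made up.
    """
    return intercept + slope * id_fn(amplitude, width)


if __name__ == "__main__":
    # A target 1 unit wide at a distance of 8 units.
    print(index_of_difficulty(8.0, 1.0))
    print(index_of_difficulty_shannon(8.0, 1.0))
    print(movement_time(8.0, 1.0, intercept=0.1, slope=0.2))
```

Doubling the movement amplitude or halving the target width adds one bit to the original index, which is what makes the model convenient for comparing devices across different target layouts.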
His results have been refined by Welford [51] to consider the individual differences in subjects and the separate contributions of width and amplitude to the model, recognizing two distinct phases, one of fast ballistic movement directly affected by amplitude, and one of homing into the target affected more by target width.

Fitts’s predictive model has been used by HCI researchers in the past due to its consistent reliability, and is now called Fitts’s Law. MacKenzie [25] gives a good introduction to how Fitts’s tests are carried out, and Soukoreff and MacKenzie [45] present guidelines for standard practices. Furthermore, Shoemaker et al. [44] compare different variants of Fitts’s Law and conclude that for pointing at a distance on large displays, particularly when gain values are manipulated, two-part models based on Welford’s analysis work better.

Myers et al. [32] performed comparisons of camera-tracked laser pointers and other devices, including the mouse, for pointing tasks from across the room in a manner similar to what MacKenzie and Jusoh [26] did for other remote pointing devices, and found that the laser pointer performed about one and a half times slower compared to the mouse. They also reinforced what was known about the problem of hand jitter affecting laser pointer precision. To the best of our knowledge, feature-based trackers that use markers have not been compared in this way to a baseline device like the mouse.

Chapter 3
Chiroptic Tracker: Camera-Based Remote Pointing

The goal of this chapter is to describe a technique that has the potential to be used effectively for remote pointing in a classroom setting without introducing any extra resources into the classroom, using only technology that is already there, and to explain the main challenges that have to be considered by designers who build similar tools. The basic idea is to have a pattern of shapes shown on the screen that encode position information.
A camera takes pictures of this pattern and then analyzes them to extract information and thus understand where it is looking. In this way a user holding the camera can “point it” at specific targets of interest on the display to initiate interaction.

We begin by specifying a set of design constraints that help guide the creation of the camera tracker, then describe the process of chiroptic tracking in the abstract: what systems are involved, and what their roles and dependencies are. This high-level view of things helps to understand the general concepts behind our tracking technique. We then look at a specific implementation made to serve as a proof of concept, compare it to other technologies, and measure its capabilities and deficiencies. We end by going over the current known limitations and problems with our implementation, and propose ways in which they can be avoided or overcome, paving the road for future research in this area.

3.1 Design Constraints

From our analysis of the problem, and looking at design lessons from other input devices, we came up with a set of design constraints that helped guide the construction and implementation of the chiroptic tracker. We will summarize them briefly and then explain in more detail what the rationale is behind each one.

For a classroom setting. The design should consider that the tracker is being designed for use in a classroom setting. That is one of our main objectives, so if it does not work there it is a bad design. Specifically, a solution should acknowledge the fact that there are multiple spectators involved, the lecturer and the audience, and that they have different roles. There is a natural imbalance of control in the roles and the tracker is meant for use mainly by the lecturer; however, the audience should not feel lost or abandoned while the lecturer is using the device. A solution should also consider how people will use it.
We expect an instructor to have infrequent interactions with the material that is shown on the display, and more frequent interaction with the audience, so the device should enable these interactions without getting in the way. Because pointing interactions are infrequent, it is okay if they take a little more effort than with other devices, as long as they keep the flow of the lecture going.

Use existing hardware. There are plenty of technological resources already available that are not being fully exploited. Our belief is that a solution should be possible using only the existing resources, so ideally no more hardware should be required. Specifically, we want to avoid having to introduce special lighting (such as infrared) into classrooms, or cameras that must be mounted in the classroom, or modified high-frame-rate projectors beyond the standard ones that are commonly found in today’s classrooms.

Focus on pointing. An interface meant for interacting with material on a large display involves tackling many different challenges. The tracker itself should only worry about addressing the problem of pointing at targets of interest. If it can also be used for other things, like selecting (clicking), dragging or gesturing, that is fine, but that functionality should be secondary to the main objective, which is pointing.

Flexible. The device should work by itself, without depending on other installations or third-party software. Specifically, it should be independent of whatever program is used to display slides or other content on the display. It should be possible to adapt it for other uses, so it is desirable to make it as general purpose as possible. It could seem a slight contradiction to require both that it work in the classroom setting and that it be as general purpose as possible, but that is not the case. While it should be designed to work in a classroom first, a solution should not be constrained to work only in a classroom.
Any solution should allow common techniques used in other pointing input devices, such as clutching and gain control.

Intuitive. Users should be comfortable using it, and it should not get in the way of how they want to naturally interact. It should work as they expect, although some minimal training on how to use it is OK and full proficiency may require a bit of practice. It should not require any calibration. It should “just work”.

The rest of this chapter explains the general idea for the chiroptic tracker and discusses an implementation. We believe our design meets most of the constraints set forth for it. In Chapter 4 we report the results of a controlled user study where we compared it with a mouse to get an idea of how well it performs, and in Chapter 5 we present a design brief on an interface for classroom interactions that uses it as a pointing input device.

3.2 The Camera Tracking Process

There are three systems involved in making the chiroptic tracker work: the display system, the chiroptic sensor and the user’s sensorimotor system, as diagrammed in Figure 3.1.

Figure 3.1: The three processes involved in chiroptic tracking. The display renders the cursor and screen contents, the sensor interprets what it sees in the display to extract a position, and the human brain adjusts for errors.

They work together in two loops that make up the camera tracking process. The first loop is between the display system and the chiroptic sensor, and its purpose is to update the position of the cursor based on where the sensor is being pointed. The second loop is between the user’s sensorimotor system and the chiroptic sensor, and its purpose is to adjust the position of the sensor until the cursor is on a target of interest. We rely on the user’s sensorimotor system to make the necessary adjustments to the position of the sensor in order to correct pointing errors without conscious effort, similarly to how we do with the mouse.
This is necessary because readings from the sensor are noisy due to diverse factors that will be discussed in Section 3.4, and because there is a delay in the whole system that has to be compensated for. The error correction mechanism is left almost entirely to the human. In our prototype we only provide slight help by smoothing the cursor movement, so we will not discuss the second loop in any more detail except when we later describe the smoothing algorithm. Instead we will focus on the first loop, between the sensor and the display, and discuss how that works. The overall process is as follows:

1. The display system renders a series of shapes on top of the regular content of the screen, including the tracking cross (the cursor). These shapes are called fiducial markers² and are used as a reference frame that encodes position information for the sensor.

2. The chiroptic sensor is made from a camera and a processing unit. The camera captures a frame that includes enough of the markers that the position information can be decoded. The frame is processed and the sensor relays the position it read to the display system.

3. The display system transforms the position read by the sensor into X and Y coordinates in the virtual display, and then uses that information to update a model for the cursor position, possibly also a model for the shape and position of the markers, and then it starts again from step 1.

² Fiducial markers are images that can be tracked with relative ease by vision algorithms. They serve as points of reference and are commonly used to create augmented reality systems or facilitate scene understanding.

The first step of the loop is relatively straightforward. The display should just make sure to show all relevant markers in a way that the camera can see them. The particular design of the markers is more interesting because of how it encodes position information. We will look at a possible set of markers when we talk about implementation in Section 3.3.

It is important to note that the position computed in step 2 by the sensor is relative to the markers. The markers create a reference frame that is used by the display system to transform the relative position to absolute coordinates. To get the relative position we usually choose a fixed point in the camera image, like the camera center, and then find the relative coordinates of that point with respect to the markers. This is necessary because the camera is capturing a frame out of context: it does not know what part of the display the frame represents, and does not care about the dimensions of the display or its resolution.

To process the image captured by the camera, the first step is to identify the fiducial markers and make sense of them. These problems are sometimes called feature detection and extraction, and scene analysis, and there are many techniques that help address them, from simply looking at individual values of pixels to sophisticated mathematical analysis. See for example the books by Szeliski [48] and Forsyth and Ponce [15] for an introduction to computer vision techniques. Depending on the design of the markers, it might be faster to use heuristics like geometric relationships to make sense of them.

When the sensor computes a position it has to take into account that the image from the camera will be distorted. It should compensate for this distortion or its reading will be off by a factor depending on how distorted the image is, or just plain wrong if the distortion is too severe. Distortion can come from different sources, including the lens of the camera, but most importantly from the perspective of the camera relative to the screen, which comes from the angle, distance and rotation between these two elements. To illustrate the point, Figure 3.2 shows an example of two images of the same fiducial markers captured from different camera positions.
Note that in the left image in the figure the distance between points A and B is perceptibly bigger than for points C and D; however, in the grid those two segments are the same size. That foreshortening is caused by perspective distortion in the image.

Figure 3.2: Two different perspectives of the same grid cell, as captured by the camera. Corresponding points are labeled with the same letter. The tracker has to deal with the distortions caused by perspective. Note also the marked contrast in illumination on the top and bottom sides of the left image.

To compensate for distance, the relative position can be expressed in “marker units”, a unit of measurement derived from the fiducial markers. It can be, for example, the distance between two of the markers. As long as the display knows the value of that distance in pixels it can transform sensor coordinates to screen coordinates. Compensating for angle and rotation is a little bit trickier because it requires that the sensor interpret the image to understand how the different fiducial markers are arranged, and from that get a transformation function that allows it to undo the effects of the distortion. This is called a perspective transformation, or homography, and computing it is a common problem in the area of computer vision, so there are many techniques that can be used to do it. The previously referenced books by Szeliski and by Forsyth and Ponce are good resources for this.

Step 3 begins by translating the raw reading from the sensor, which is the relative position of the cursor, to an absolute position in screen coordinates. This is generally a relatively straightforward step that depends on how the position was encoded in the markers.

The absolute position can be used to update several models.
The first of these is a model of the cursor position that can be used to predict a useful place to draw the cursor, but can also be used for more advanced techniques like target prediction, where one tries to move one step ahead and predict what object of interest the user is ultimately moving to. The second model is for the position of the markers on the screen. When the markers are fixed there is nothing to do, but if the markers move around based on the sensor readings the model can be used to predict good places to render them so that the sensor will see them on the next reading (this is similar to what Sutherland did for the light pen’s cursor). The third model that can be useful is for the position in space of the camera, relative to the screen. This position is determined by six parameters, the three coordinates and three angles of the camera, and computing them is known as the pose estimation problem in computer vision, which has a strong connection to the problem of homography estimation.

There are potential advantages gained from modeling the camera pose. For example, the pose itself is a good proxy for the position of the user in the room, which can be used by interface designers to create interactions that vary depending on where the user is physically located. Another advantage is that by correctly modeling camera pose we can predict where it is likely to move, and so the sensor can filter out readings that are inconsistent with the model: if the pose of the camera changes from one side of the room to the other in a single frame, then one of those readings is likely wrong.

There is an extra challenge involved when the shape and position of the markers change based on the cursor position, which is that they create what is called a closed-loop control system in control theory. We will not go into much detail about this³ except to explain some of the effects that such a system has on the overall tracking process.
In a few words, what happens is that the sensor cannot trust what it is reading from the screen because of the delay in the system, and so has to compensate for errors. To illustrate this, consider a camera running at 30 fps, in a system with a delay of 100 ms, with the cursor being sensed stably in the center of the camera. If the camera moves one cm to the right of its current position we would expect a similar movement to be seen on the markers in the display and then the system to return to a stable state where nothing moves. Instead, the first frame after the camera is moved will send a relative movement of a units to the display, which will take 100 ms to be seen by the camera again, and by that time the camera will have seen the same markers two more times, causing a movement of 3a units in total. When this error is finally discovered the camera will compensate, but that reading will also get repeated for a few frames because of the delay, and so the markers will again overshoot their movement in the opposite direction. This causes an oscillation, which makes the system unstable and is a problem that only gets more complicated if the delay of the system is variable. When the position of the markers is kept in place we do not have this problem because the sensor can be confident that it will always be computing positions relative to the current position of the markers. There are techniques such as damping, Kalman filters or PID controllers that are used to fix these issues, but we will not discuss them here.

³ But see the books by Astrom et al. [2] and Ogata [35] for introductory material to this field.

3.3 Implementation

We chose to make an absolute pointing system with a fixed grid of fiducial markers that are shown persistently on the screen. The architecture for our implementation is diagrammed in Figure 3.3.
There are two main processes that communicate in a loop: the display communicates with the sensor by showing a grid of markers that the camera captures, and the sensor communicates its readings to the display using a network socket. Within the sensor process there are several subsystems that carry out the steps required to compute a relative position. First, the camera image is analyzed to extract the markers using a blob detection algorithm, then the blobs are processed to identify four that are the corners of a cell in the grid. The sensor computes a homography from these four points, which is used to identify the rest of the components of the cell and to obtain the position of the camera center relative to it. The relative position and the cell row and column numbers are sent to the display process, which further transforms them to an absolute position, feeds it to the cursor model, and then renders the grid of markers and the updated cursor according to the model.

Figure 3.3: The architecture of our tracker’s implementation.

This implementation makes no assumptions about the camera itself, which can have any sampling rate or lens distortion. We are ignoring the radial distortion of the lens because we will be focusing on the center of the image, where the effect is minimized, and because for the cameras that we tested the negative effect on accuracy is very small and likely easily compensated by the user. The sampling rate of the sensor is independent of the refresh rate of the display, so the model for the cursor can be used to update its position even when the sensor readings are relatively infrequent.

3.3.1 Grid of Markers

The fiducial markers are arranged in a grid as shown in Figure 3.4. Groups of four ellipses with alternating horizontal and vertical orientations make up one cell of the grid. There are circular grey markers that indicate the correct orientation of the cell and green markers that encode its row and column number.
We use black ellipses for the corners because these are relatively easy to find, and they are the first objects that the sensor will look for to start making sense of the image. By following the main axis of an ellipse we can identify others in the same row or column of the grid. That information is very helpful in deciding which ellipses belong to which cell.

Figure 3.4: Grid of fiducial markers. The black ellipses define a reference frame, the grey circles determine proper orientation, and the green squares encode row and column number.

With this design the camera only needs to be able to see the four ellipses that make up one cell of the grid to ensure that it can compute a position relative to that cell, and because the green markers identify the cell uniquely, the display process can translate that into an absolute position. The resolution of the grid should be adjusted so that the camera can capture at least one cell in each frame while avoiding making the markers so small that they will not be seen by the camera correctly. The size of the individual markers can also be adjusted to improve sensing.

Each cell has unique row and column numbers, represented as bits by the green markers. If a green marker is present, it counts as a 1; if it is not, it counts as a 0. The four horizontal green markers are used for the column number and the vertical ones for the row. However, we exclude row and column 0 because of an artifact of motion blur: when the camera image gets blurred the ellipses can still be read sometimes, but not the green markers, and the sensor incorrectly reports a movement relative to cell (0, 0), which causes the cursor to suddenly jump on the display. So we end up with a design that supports a grid with a resolution of up to 15×15 cells. A final detail comes from the observation that contiguous cells share sides, so we need to arrange the numbers in a way that ensures the two high-order bits of one are the same as the two low-order bits of the next.
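One way to find an ordering with this overlap property is to treat each shared cell side as a 2-bit symbol and search for a chain of symbols in which every ordered pair occurs exactly once (a de Bruijn-style sequence). The sketch below is illustrative only: the names `marker_sequence` and `cell_numbers`, the backtracking search, and the choice of which symbol carries the low-order bits are all assumptions, and the result need not match the arrangement used in the prototype.

```python
def marker_sequence(symbols=4):
    """Return a chain of 2-bit symbols in which every ordered pair occurs once.

    Hypothetical helper: each symbol stands for the two green markers on one
    shared cell side, and each adjacent pair of symbols forms one 4-bit number.
    """
    total_pairs = symbols * symbols
    chain = [0, 0]          # start with the pair (0, 0), i.e. the number 0
    used = {(0, 0)}

    def extend():
        # Depth-first search with backtracking over the unused pairs.
        if len(used) == total_pairs:
            return True
        for nxt in range(symbols):
            pair = (chain[-1], nxt)
            if pair in used:
                continue
            used.add(pair)
            chain.append(nxt)
            if extend():
                return True
            chain.pop()
            used.remove(pair)
        return False

    if not extend():
        raise ValueError("no valid sequence found")
    return chain


def cell_numbers(chain):
    """Combine neighbouring symbols: first symbol is the low-order pair of bits."""
    return [chain[i + 1] * 4 + chain[i] for i in range(len(chain) - 1)]


chain = marker_sequence()
numbers = cell_numbers(chain)
# The first number is 0, which the text reserves, leaving 15 usable cells.
usable = numbers[1:]
print(usable)
```

Because the chain starts with the pair (0, 0), dropping the first number leaves fifteen distinct non-zero codes, consistent with the 15×15 limit stated above.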
One possible bit arrangement is shown in Figure 3.5.

Figure 3.5: A sequence of bits used to encode row and column numbers. Numbers from 0 to 15 are arranged so that the two high-order bits of one are the same as the two low-order bits of the next. Each of the 16 numbers appears once in the sequence.

3.3.2 Cursor

Our cursor design is shown in Figure 3.6.

Figure 3.6: Design of the cursor for the chiroptic tracker.

Its shape and size vary depending on its recent position history, with the idea that both the lecturer and the audience can follow it better. Previous research by Po et al. [38] shows that, for pointer interaction, orientation-neutral cursors or cursors aligned with the direction of movement generally work better, so our design for the resting cursor is a circle and for a moving cursor we show a trail, which also helps observers follow it with their gaze. The trail gets longer the faster the cursor moves and disappears when it moves slowly. The size of the circle also changes dynamically, expanding with fast movements and shrinking when the cursor is relatively stable. In this way we expect observers can track it more effectively when it is moving, but it will be small enough to afford precise selection when users dwell on targets. A final element of the cursor is an orthogonal cross made up of a vertical and a horizontal segment, which appears only when the cursor is fairly stationary and is meant to enable pixel-precision readings of its position. The specific sizes, shapes and colors in the cursor were determined empirically. Our only recommendation for now is that they should be clearly visible to the humans in the room, but invisible (or easy to tell apart and ignore) to the chiroptic sensor.

For our implementation the cursor is always rendered close to the actual position at which the user is pointing.
We smooth the actual position computed from the sensor using an exponential moving average, which is a weighted interpolation of the current sensor reading and the previous average, defined by the following equation:

$$C_t = \alpha \cdot P_t + (1 - \alpha) \cdot C_{t-1} \tag{3.1}$$

Here $P_t$ is the current sensor reading and $C_t$ is the smoothed cursor position. The value of α is a parameter that can be adjusted to change smoothness. For values closer to 1, the actual current position is weighted more heavily than the history, and so there is less smoothing. Values closer to 0 will make the cursor behave very smoothly, but movement will feel sluggish.

3.3.3 Feature Extraction

The image captured by the camera is just an array of color values with entries for every one of its pixels. To make sense of it we have to first extract features, like shapes or corners, that we can use to conduct a higher-level analysis. For our simple grid we chose to implement a naive blob detection algorithm that looks for the ellipses and filters everything else out. It works by first creating a copy of the original image where every pixel has been replaced by just black or white based on a threshold luminance value: if the pixel is darker than the threshold it is changed to black, and otherwise it is colored white. This binarized image is then scanned one line at a time looking for segments of black pixels, which are grouped together to form blobs. Those are our best guess at identifying the ellipses, so blobs that have a very small or very large area are filtered out as noise. Note that for this to work we require a strong contrast between the black color of the ellipses and the rest of the contents of the screen. This is one of the weaknesses of our method, and we will come back to discuss it at the end of the chapter.

As we find the blobs we also compute their bounding box, their perimeter and the direction of their main axis.
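The binarization and blob grouping steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the prototype's code: the function name `find_blobs`, the threshold, and the area limits are hypothetical, and a flood fill over 4-connected neighbours is swapped in for the scanline segment grouping that the text describes. Only the area filter and the bounding box are computed here.

```python
from collections import deque


def find_blobs(luminance, threshold=80, min_area=3, max_area=500):
    """Binarize a luminance image and group dark pixels into blobs.

    `luminance` is a list of rows of values in 0..255. The threshold and the
    area limits are illustrative assumptions, not values from the prototype.
    """
    height, width = len(luminance), len(luminance[0])
    # True marks a "black" pixel: luminance below the threshold.
    dark = [[value < threshold for value in row] for row in luminance]
    visited = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if not dark[y][x] or visited[y][x]:
                continue
            # Flood fill over 4-connected dark neighbours.
            queue = deque([(x, y)])
            visited[y][x] = True
            pixels = []
            while queue:
                px, py = queue.popleft()
                pixels.append((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py),
                               (px, py + 1), (px, py - 1)):
                    inside = nx in range(width) and ny in range(height)
                    if inside and dark[ny][nx] and not visited[ny][nx]:
                        visited[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) not in range(min_area, max_area + 1):
                continue  # too small or too large: treat as noise
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            box = (min(xs), min(ys), max(xs), max(ys))
            center = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
            blobs.append({"area": len(pixels), "box": box, "center": center})
    return blobs


# A tiny synthetic frame: one 3x2 dark blob and one isolated dark pixel (noise).
frame = [[255] * 8 for _ in range(6)]
for row in (1, 2):
    for col in (2, 3, 4):
        frame[row][col] = 10
frame[4][6] = 0
blobs = find_blobs(frame)
```

In the synthetic frame the isolated pixel falls below the assumed minimum area and is discarded, which mirrors the noise filtering described in the text.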
The bounding box is the smallest rectangle parallel to the X and Y axes that contains the blob, and if the blob is an ellipse then the center of the box is a good approximation to the center of the ellipse. Blob pixels that are adjacent to a white pixel are marked as part of the perimeter. They are then sorted by their distance to the center of the blob, and with the ten that are the farthest we find a regression line of best fit, which gives us an approximation to the direction of the main axis of the ellipse. Figure 3.7 shows an example of a frame captured by the camera, its binarized version, and the perimeters and main axes of the blobs that get identified with this method.

Figure 3.7: The results of feature extraction on one frame of the camera. On the left is the original image with perimeter pixels and the main axis of each ellipse highlighted in red. On the right is the binarized version used to find the ellipses, and their bounding boxes.

There are more advanced computer vision algorithms that perform robust feature detection with subpixel precision, which would improve the quality of the sensor readings. An advantage of our approach is that it can be done very fast and we can tune it to our specific needs, for example by exploiting the properties of the ellipses. Our technique is weak because it requires lighting conditions to be optimal in the room for the tracking to work, so it should be considered only as a proof of concept with much room for improvement.

3.3.4 Computing Relative Coordinates

The next goal is to identify four ellipses from the previous step that form the four corners of a grid cell, which is challenging due to the perspective distortion in the image. Note that in Figure 3.7 the main axes of the ellipses form a staircase pattern, and one “step” of this staircase forms a triangle that is half of a cell of the grid. Using a few heuristics we can identify such a triangle and get the fourth point from that. First, we assume that the blob that is closest to the camera center is part of the triangle. Then we compare the distance of all other blobs to the line formed by the original ellipse’s main axis, and choose the one that is closest. Similarly, the third blob is the one closest to the axis line of the second blob. All comparisons are made using the blob centers, and if there are any ties we break them by choosing the blob closest to the previous blob. In this way we find the triangle we were looking for, and now we only need to identify one final blob to complete the grid cell.

With the three ellipses found so far we can compute a function known as an affine transformation, which can undo some forms of distortion. The difference between an affine and a projective transformation is that the former preserves parallel lines while the latter can map them to lines that intersect at a point, so the affine transformation can be used to fix a distortion of our grid like the one shown in Figure 3.8a, but not like the one shown in Figure 3.8b. In a way, however, the affine transformation is a cheap way to approximate the perspective distortion of the image, as shown in Figure 3.8c. We use it to compute an expected position for the fourth point of the cell and choose the blob that is closest to that. Building the affine transformation relative to the cell corners is straightforward if we arrange them as in the figure, and define the position of point A to be (0, 0), of B to be (0, 1), and of C to be (1, 0), giving us two vectors that form a basis for the cell’s coordinate system.
By mapping the respective vectors in image coordinates to those in cell coordinates the affine transformation follows, and its inverse can be used to get the expected position of point (1, 1) in image coordinates.

Figure 3.8: Panels: (a) affine distortion, (b) perspective distortion, (c) affine approximation. (Top) Difference between affine (a) and perspective (b) transformations. (Bottom) Although the affine transformation based on points A, B and C is not good enough to predict the position of point D, it finds a good approximation in point D′.

Now that we have the four points that make up the corners of a grid cell, we can use them to estimate the perspective transformation, which can be used to reverse the distortion of the camera, by using an algorithm known as the Direct Linear Transform, described in the book by Hartley and Zisserman [18]. This algorithm requires that we provide the correspondence of four points in image space to their coordinates in real-world space, so by using a similar idea as before we can map point A to (0, 0), point B to (0, 1), point C to (1, 0) and point D to (1, 1). We use the OpenCV library [7] to solve the required system of equations and get in return a homography expressed as a matrix, which when multiplied by a point in image coordinates gives us the corresponding point in grid coordinates relative to the cell. We can also perform the opposite transformation using the matrix inverse. With the inverse we look for the grey circle in the position where we expect to find it for each of the four ellipses, and once we find it we know the correct orientation of the grid cell.

Using the inverse of the homography again we look at the positions where we expect the green markers to be. If we find a pixel value that has more green saturation than red or blue, we consider it to be a 1. If not, it is a 0. Putting together the eight green markers in the correct order we decode the row and column numbers of the cell.
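To make the role of the homography concrete, the following sketch builds one from four cell corners and uses it to express an image point in cell coordinates. It is an illustration under assumptions, not the prototype's method: instead of OpenCV's general solver it uses the closed-form solution that exists when the reference shape is a unit square, the function names are hypothetical, and the corner coordinates are made up for the example. Corners follow the correspondence given in the text: A to (0, 0), B to (0, 1), C to (1, 0), D to (1, 1).

```python
def square_to_quad(a, b, c, d):
    """3x3 matrix mapping cell coordinates to image coordinates.

    Closed-form unit-square solution (an assumption of this sketch), with
    a at (0, 0), b at (0, 1), c at (1, 0) and d at (1, 1).
    """
    (x0, y0), (x3, y3), (x1, y1), (x2, y2) = a, b, c, d
    dx1, dy1 = x1 - x2, y1 - y2
    dx2, dy2 = x3 - x2, y3 - y2
    sx, sy = x0 - x1 + x2 - x3, y0 - y1 + y2 - y3
    det = dx1 * dy2 - dx2 * dy1
    if det == 0:
        raise ValueError("degenerate corner configuration")
    g = (sx * dy2 - dx2 * sy) / det
    h = (dx1 * sy - sx * dy1) / det
    return [
        [x1 - x0 + g * x1, x3 - x0 + h * x3, x0],
        [y1 - y0 + g * y1, y3 - y0 + h * y3, y0],
        [g, h, 1.0],
    ]


def adjugate(m):
    """Adjugate of a 3x3 matrix: the inverse up to scale, which is enough
    for a homography because the scale cancels in the division by w."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]


def apply_homography(m, point):
    """Apply a 3x3 homography to a 2D point using homogeneous coordinates."""
    x, y = point
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    return ((m[0][0] * x + m[0][1] * y + m[0][2]) / w,
            (m[1][0] * x + m[1][1] * y + m[1][2]) / w)


# Made-up image positions of the four ellipse centres of one cell.
A, B, C, D = (100.0, 100.0), (110.0, 200.0), (220.0, 90.0), (240.0, 210.0)
cell_to_image = square_to_quad(A, B, C, D)
image_to_cell = adjugate(cell_to_image)

camera_center = (160.0, 150.0)   # hypothetical centre of the camera frame
u, v = apply_homography(image_to_cell, camera_center)
print(u, v)   # cell-relative position of the camera centre
```

The `cell_to_image` matrix plays the part of the inverse homography in the text (useful for probing where the grey and green markers should appear), while `image_to_cell` converts the camera center into cell-relative coordinates.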
We then use the homography one last time to transform the position of the camera center to grid coordinates, and send this relative position with the grid row and column numbers to the display process. With that, the sensor process is done and will go through everything again when a new frame comes from the camera. The display process uses its knowledge of the layout of the grid to perform a final transformation of the sensor’s reading, from cell-relative coordinates to absolute screen pixels, which it uses from there on.

3.3.5 Cursor Position

Every time the display process gets a reading from the sensor, which happens asynchronously, it uses the absolute coordinates to update its model of the cursor. Our current model is simple: it keeps its current position separate from the sensor reading, which is treated as a goal to reach eventually. In a separate processing thread, when the display process renders a new frame, it asks the model for a position where it should show the cursor. The model takes its previous position and the newest sensor reading and interpolates a value between the two using equation 3.1 to smooth the movement, storing the result for the next interpolation. As long as no new readings come from the sensor, the model will keep doing these interpolations so that the actual cursor position approaches its goal a little at a time, but when a new reading comes the goal is updated and the next interpolation will cause a movement towards that. This simple model works surprisingly well using low values for the α parameter in the interpolation. For a display refresh rate of 60 fps we empirically settled on a value of α = 0.2.

The cursor model has one other function. From time to time the sensor will get confused, either due to noise, motion blur or other problems, and will report an erroneous position.
The model is configured with a threshold value so that if the distance between a sensor reading and the previous one is too big, it will be counted as a fluke and ignored. This heuristic gives stability to the cursor on the screen; however, if the user makes a drastic movement that is legitimate, the model will incorrectly filter out the new sensor readings. For that reason these “erroneous” readings are stored in the model and if after a few of them the cursor seems to be stable in a new position, the model updates to move there. We currently filter only one reading, so if the sensor reports the same general position information two times in a row, we update. The values used for this filtering depend on how reliably the sensor can identify the markers in any given setting, and will probably not be needed when more robust vision techniques are used.

In our implementation, when the sensor cannot read position information for any reason or gets a reading wrong, the cursor simply continues inching towards its goal position and then stops. This means that when the user changes position quickly the cursor will seem to “stick” for a moment and then shoot towards the new position, causing a very perceptible delay in response. We tried an alternative model using second-order prediction (velocity and acceleration) to keep moving the cursor past its goal, but without imposing some synthetic deceleration, similar to the effect of friction, it created awkward cursor movements. With the friction it behaved very similarly to the simple model, so we did not pursue that idea any further.

The position obtained from the model is used to render the cursor directly. We keep a history of cursor positions that we use to vary the size and trail of the cursor, as explained before.
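The cursor model described in this section (a goal set by sensor readings, per-frame smoothing with equation 3.1, and a one-reading outlier filter) can be sketched as below. This is a minimal sketch under assumptions: the class name `CursorModel`, its method names, and the two distance parameters are hypothetical, and only α = 0.2 comes from the text. Threading between the sensor and render paths is left out.

```python
import math


class CursorModel:
    """Smoothed cursor that chases the most recent trusted sensor reading."""

    def __init__(self, alpha=0.2, jump_threshold=200.0, confirm_radius=50.0):
        self.alpha = alpha                    # smoothing weight from equation 3.1
        self.jump_threshold = jump_threshold  # assumed: max plausible jump, in pixels
        self.confirm_radius = confirm_radius  # assumed: "same general position"
        self.position = None   # smoothed cursor position C_t
        self.goal = None       # last accepted sensor reading P_t
        self.suspect = None    # rejected reading waiting for confirmation

    def on_sensor_reading(self, reading):
        """Called whenever the sensor reports an absolute position."""
        if self.goal is None:
            self.goal = reading
            self.position = reading
        elif math.dist(reading, self.goal) <= self.jump_threshold:
            self.goal = reading
            self.suspect = None
        elif (self.suspect is not None
              and math.dist(reading, self.suspect) <= self.confirm_radius):
            # Two far-away readings in a row agree: accept the jump.
            self.goal = reading
            self.suspect = None
        else:
            # Treat as a fluke for now, but remember it.
            self.suspect = reading

    def on_render_frame(self):
        """Called once per display frame; returns where to draw the cursor."""
        if self.position is None:
            return None
        a = self.alpha
        self.position = tuple(a * g + (1 - a) * c
                              for g, c in zip(self.goal, self.position))
        return self.position


model = CursorModel()
model.on_sensor_reading((0.0, 0.0))
model.on_sensor_reading((10.0, 0.0))
for _ in range(3):
    print(model.on_render_frame())   # moves a fraction closer each frame
```

Calling `on_render_frame` repeatedly without new readings reproduces the "inching towards the goal" behaviour noted above, since each call closes a fixed fraction of the remaining distance.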
Other implementations could look to use position information in different ways, for example choosing not to show the cursor but instead highlighting a predicted target of interest. The important point to keep in mind is that the computed position does not have to be the rendered position; they may serve different purposes if that helps improve the usability of the interface.

3.3.6 Performance

We implemented this prototype using the Processing 2.0 language, which is based on Java and uses the OpenGL library. The algorithms used are not particularly optimized, so they can probably be tuned to perform faster. Running both processes on an Intel i7-4710HQ CPU @ 2.5 GHz with 12 GB of RAM and an NVIDIA GeForce GTX 850M graphics card, the sensor code can run at slightly more than 240 fps, processing a 640×480 px image each time. In the lab we measured the delay from the time the grid is shown to the cursor position being updated at around 150 ms.

This is high-end equipment, but we expect that the same techniques should run on more modest resources at adequate sampling rates, so in principle it should be feasible to use something like a smartphone directly for processing. The biggest problems might be the delay in getting the image from the camera to the software, and the restricted memory available. Lowering the resolution of the camera would also improve time performance, and it is possible that users could deal with the lower sensor accuracy. All of this is speculation and remains to be tested, but we feel confident these ideas can be implemented on current hardware.

3.4 Known Limitations

Our implementation is enough for our purposes: it is a proof of concept showing that the idea is realizable, and it works well enough under controlled conditions that we can perform a study to compare the chiroptic tracker to a baseline device like the mouse and measure its performance to get an idea of its capabilities.
It is still an early prototype implementation, and as such has some problems and limitations that have to be addressed in future research. Here we discuss the more pressing ones.

3.4.1 Occlusion Caused by the Grid

The markers that are displayed for the sensor have to be shown on top of the contents of the display, which causes the obvious problem of occluding important information, and they are distracting for the audience. For the classroom setting this problem is mitigated by the assumption that the lecturer will have only sporadic interactions with the display, and so the grid only needs to be shown during those times. In Chapter 5 we present a design brief for a classroom interface that uses the tracker as it is now, so we believe that it can be made to work and that the benefits outweigh the cost.

Nevertheless there are things that can be done to fix or mitigate the issue. The simplest is to dynamically adjust the resolution of the grid and the size of the markers depending on the position of the camera, so that it can see them as needed while keeping the density of markers on the screen to a minimum. Making the grid change dynamically reintroduces the problem of dealing with a closed-loop control system that was discussed earlier, but it is a common problem with proven solutions in many engineering applications. Alternatively, we could create better prediction models for where the markers need to be, leaving only the four that are necessary to compute a homography and in this way removing most of the grid from the screen. In the extreme case the cursor itself could be made up of markers. This is also a closed-loop solution that we are starting to look into.

If the projector in the room has the capabilities, the grid could be shown in infrared light so that the audience cannot see it, and a camera without an infrared filter would still make an effective sensor.
Or, if the camera has a high enough frame rate and the projector can work at 120fps or faster, we could show the grid on only some frames and not others, or switch its colors in alternating frames so that they “cancel out”, in a way similar to the invisible structured light techniques discussed in Chapter 2. In theory, when done fast enough the audience would not even notice the presence of the grid. We feel these are weaker solutions because they depend on the availability of less common hardware. In particular, a blinking pattern could end up being more disruptive than a grid that is always on, while infrared could already be in use by a different system in the room, such as some models of clickers. However, they are still possibilities to look into.

3.4.2 High Color Contrast

The algorithms we use are not robust, and the sensor simply stops working when the colors on the display have little contrast. That is often the case when using projectors that have a weak light source, or with poor color balance settings. Another issue is ambient room light coming from sunlight or artificial light sources, which has the effect of attenuating the contrast of colors in the camera image. This is particularly bad when a light falls directly on the surface of the display, because the same rendered color will have different RGB values in the sensor’s image depending on how much light falls on it. The human perception system has color constancy, which allows us to identify two very different patches of light as the same color (for example, when the sun falls directly on half of a desk, the half in shadow and the half in the light look very different when taken individually, yet we know the desk surface is a single color). There are many techniques in computer vision that model color constancy, see for example work by Agarwal et al. [1] and by Gijsenij et al. [17], so this is a problem that can likely be solved.
As a partial solution the sensor could synchronize with the display to show a pattern that helps calibrate the expected color values, which would happen only at the beginning of the interaction or from time to time during normal operation.

Another alternative is to be less reliant on color, for example by using distinct shapes instead of distinct hues, or by using a different technique for tracking altogether. For example, we could perform point correspondence across consecutive frames directly on features found on the contents of the display, without having to show any extra markers, and get a homography from that. In a situation where there are not enough features on the display, like a drawing application with a blank canvas, we could show a gentle, neutral texture as a background that presents enough recognizable features to the camera. Szeliski’s [48] discussion on motion estimation algorithms would be a good starting point to go deeper into this.

3.4.3 Chaotic Movement

Natural hand jitter, noise in the camera sensor and motion blur can all cause erroneous position readings that send the cursor flying around the screen chaotically, which is very disruptive when trying to interact with the display. Some of the techniques we mentioned in the implementation, like smoothing and filtering out extreme changes in position, help to alleviate the problem at the cost of making the cursor feel somewhat more sluggish. Instead of smoothing we could try manipulating control-to-display ratio values, which would lower the impact of jitter. That would not help with problems caused by motion blur, for which smarter and more robust algorithms could be designed. Ultimately, better cameras with a higher sampling rate and improved sensors would get rid of many of these issues, so perhaps as hardware continues to evolve this limitation will fix itself.

3.4.4 Lens Blur

When the image from the camera is not properly focused the blur can cause the sensor to misbehave.
If the user is consistently interacting at a sufficient distance from the display this is not a problem, because most lens systems can focus objects at far distances accurately without major dynamic adjustments. Otherwise the tracker should have some way of dealing with this, like the auto-focusing features found on some cameras.

3.4.5 Lag

The 150 ms it takes for the sensor to capture an image and decode the cursor position, together with the smoothing techniques used to improve accuracy, make the system feel slow to respond. In Chapter 4 we will present the results of a study designed to measure the usability of the device as it is, but it would always be desirable to have less delay. This could be accomplished by using different software tools with faster access to the camera, and eventually we could manufacture dedicated hardware like it is done with the mouse today. In the meantime, an interesting approach is to create other models that help predict the user movements, either by keeping track of the six degrees-of-freedom of the camera pose in the room or by other predictive techniques.

3.4.6 Acute Angles

Figure 3.9: An early version of the chiroptic sensor that incorrectly identified four blobs, highlighted with red, green, blue and magenta, as the corners of a grid cell. The cause is extreme perspective distortion in the image and the use of weaker heuristics for grid identification.

When the distortion from perspective is too strong, our heuristics fail and the sensor reports erroneous information. Figure 3.9 shows an example from an early version of our prototype that used different heuristics for detecting the grid cell, where the sensor got confused and labeled four blobs incorrectly as the corners of a grid cell.
When the angle is too extreme the distortion is simply too much and the sensor will just not work, but there is a limit point where it sometimes works and sometimes does not, and that can be frustrating to the user because the cursor moves erratically on the screen. The use of better heuristics or a different marker design could help address this problem.

3.5 Pilot Testing

We piloted the study described in the next chapter with seven colleagues and made many interesting observations that were used to improve the experiment. We consider them valuable lessons for designing other similar studies, and also because they give insight into how users might interact with chiroptic trackers. We summarize them here.

Mouse gain was set to a value that all participants considered comfortable enough to reach all targets while still giving them enough precision for smaller movements. Windows 8.1 has a ten-notch slider for adjusting mouse gain; after trying different values all seven pilot participants independently suggested that the 6th notch was optimal for the task.

Our first design of a camera tracker used a glove in which all fingers except for the index had been cut out. The stripped-down plastic casing of the camera was attached to the tip of the finger with hot glue and the cable allowed to hang loose from it. Users reported that the weight of the camera was enough that they were getting tired by the middle of the study. From observations, it was also clear that the flexibility of the glove and the weight of the cable made the camera droop from the end of the finger, so that users had to compensate by pointing higher than their intuition suggested. We decided to drop the glove design and instead create a device similar to a laser pointer that was used for the study.
It is possible that pointing using just a finger is a feasible idea, but clearly doing it well is a design challenge of its own.

We reduced the number of trials per width and amplitude combination from 16 to 12, increased the duration of mandatory pauses between blocks of trials, and increased the frequency of optional pauses. This was because several of the participants in the pilot study mentioned that they were getting tired before the end. We also adjusted the side position to be difficult yet doable by all participants, bringing the user a little closer to the screen and with a more obtuse angle than what we had originally planned. The final position was nearly 4 meters from the screen center and roughly at a 40 degree angle; a smaller angle caused significantly more trouble for most users.

The grid of fiducial markers was adjusted so that the camera could perceive it correctly from all three positions. We used a grid of 6×3 cells, with ellipses of 24.4 cm major axis and 9.75 cm minor axis. We adjusted the smoothing factor α in Equation 3.1 to 0.2 so that the cursor position was the weighted sum of 20% of the most recent raw position reported by the sensor and 80% of the previous average. This was evaluated subjectively by pilot participants as a good trade-off between smoothness and responsiveness.

We had observed before that new users of the camera tracker tend to “drag it” a bit cautiously as if attempting not to lose it, while in reality this is not necessary because it is an absolute pointing device and will find itself on the screen even if it gets temporarily lost. This observation was corroborated with pilot participants. To try to encourage participants to take advantage of the affordances of absolute pointing, we experimented with a lower refresh rate for the cursor, which updated its actual position at full sampling rate, yet only provided updated visual feedback every 150ms.
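The two cursor behaviours described above, exponential smoothing with α = 0.2 and visual feedback that is refreshed only every 150 ms, can be sketched as follows. This is a minimal illustration in Python rather than the Processing code of the prototype; the class name `SmoothedCursor` and its fields are hypothetical, and timestamps are assumed to be supplied in milliseconds by the caller.

```python
from dataclasses import dataclass


@dataclass
class SmoothedCursor:
    """Exponentially smoothed cursor with optionally throttled visual feedback."""
    alpha: float = 0.2             # weight of the newest raw sample (pilot value)
    redraw_every_ms: float = 150   # interval between visual updates; 0 redraws every sample
    position: tuple = None         # smoothed position, updated on every sample
    displayed: tuple = None        # position most recently drawn on screen
    _last_draw_ms: float = None    # time of the most recent visual update

    def update(self, raw, now_ms):
        """Feed one raw sensor reading and return the position to draw."""
        if self.position is None:
            # First reading: nothing to average with yet.
            self.position = tuple(raw)
        else:
            a = self.alpha
            # New average = 20% newest raw reading + 80% previous average.
            self.position = tuple(a * r + (1 - a) * p
                                  for r, p in zip(raw, self.position))
        due = (self._last_draw_ms is None
               or now_ms - self._last_draw_ms >= self.redraw_every_ms)
        if due:
            self.displayed = self.position
            self._last_draw_ms = now_ms
        return self.displayed


cursor = SmoothedCursor()
for t_ms, raw in [(0, (100.0, 100.0)), (40, (200.0, 100.0)), (160, (200.0, 100.0))]:
    shown = cursor.update(raw, t_ms)
```

The smoothed position follows the sensor at the full sampling rate, while `displayed` changes only when the redraw interval has elapsed, which is the throttled-feedback variant tried in the pilot. Setting `redraw_every_ms` to 0 gives the ordinary smooth cursor.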
The lower refresh rate seemed to work, but qualitatively all pilot participants reported a strong preference for the smooth cursor. We resolved the dilemma with another observation: after dealing with the narrow targets that were farther apart, users usually discovered on their own that jumping to arbitrary positions was safe and effective, so we designed practice sessions to start with precisely those targets.

A final problem with the study design came from one of the tracker’s limitations. As mentioned previously, in certain situations like those caused by motion blur, the tracker algorithm gets confused and decides the user is pointing at a radically different position. This causes the cursor to fly around the screen in chaotic movement. While the problem was minimized with thresholding, it still happened from time to time. This would cause some pilot participants to lose the tracker completely and spend a long time, often tens of seconds, trying to reacquire it. We asked them to experiment with different grips, the two more common ones being the screwdriver grip and the pencil grip, as shown in Figure 3.10, and the pencil grip seemed to be the one with which they could reacquire the tracker faster. Accuracy and precision with both grips seemed comparable, so we decided to ask participants to use the pencil grip as a requirement for the study. We should mention that a pistol grip, similar to how a person holds a gun, was suggested as a superior option, but we did not pursue it because it could potentially make people uncomfortable. We also observed an effect, possibly transferred from mouse use, where participants would move the camera around when losing the cursor in an attempt to find it again. This would often result in motion blur in the camera, causing more chaotic movements and confusion for participants.
We preempted this problem during the study by explaining a better strategy for finding the cursor, which is described in the next chapter in Section 4.3.4.

Figure 3.10: Two of the grips used to hold the chiroptic tracker. The screwdriver grip on the left is commonly used to hold a laser pointer. The pencil grip on the right seems to be slightly more intuitive for aiming the tracker.

These observations suggest that usage of the tracker is not always intuitive, and users will benefit from understanding the general principles behind it, as well as from better design that takes into account these natural interaction affordances.

Chapter 4

Comparing Remote Pointing to the Mouse: a Study on the Feasibility of Chiroptic Devices

Having described how to implement the camera tracker, we now turn our attention to evaluating its performance and comparing it to what is probably the most common input device used in classroom presentations today, the mouse.

Our goal is to obtain initial models of user performance for the camera tracker, and in the process attempt to demonstrate that the camera can be a valid device for pointing at a distance. To do this we designed and ran a study in which we compared the camera to the mouse in a standard Fitts’s Law task [13].

One of the clear downsides of our current approach to the camera tracker is that the grid of fiducial markers has to be rendered on top of the screen contents, which causes occlusion of potential targets of interest and possibly confusion to the user. For this reason we were also interested in measuring the effect of the grid on the ability of users to carry out tasks with targets of different sizes.
Our design assumes that prolonged presence of the grid on the display will become “background noise” that users will learn to “filter out” after a while.

A secondary goal of the study was to collect data that might be used to inform the decisions of designers of interfaces that utilize the camera tracker.

4.1 Hypotheses

Based on our observations while developing and testing the camera tracker, we expected it to perform worse than the mouse, but still well enough that it can be considered an effective alternative input mechanism. The study tested the validity of the following hypotheses, each of which addressed some aspect of performance:

H1. Myers et al. [32] showed laser pointers are around 1.5 times slower and have slightly higher error rates than the mouse for tasks requiring pointing at a distance. Due to the similarities with the laser pointer, we expect that the mouse will perform better than the camera, with both a better throughput and a better error rate, but that the camera will be only 2 to 3 times slower than the mouse and 10-25% less precise.

H2. There will not be an effect on user performance with the mouse when the grid is on for relatively big targets (those targets that the fiducial markers can only occlude partially).

H3. Due to the nature of the computer vision algorithms employed, participant hand jitter, and motion blur effects, pointing with the camera tracker will be less effective the farther a user is from the screen and the more acute the angle to the screen is.

H4. Pointing performance with the camera while sitting down will be comparable to pointing with the camera while standing up.

4.2 Empirical Models of Pointing Performance

To test our hypotheses we used empirical models of pointing performance drawn from the literature. All are based on the well-known Fitts’s Law, first published in 1956 [13].
Fitts proposed a linear model for movement time that takes into account the index of difficulty of a movement task, which is a quantity that depends on the ratio of the amplitude of the movement to the width of the target (Equation 4.1). His original formulation (Equation 4.3) has been refined by Soukoreff and MacKenzie [45] to better match Shannon’s theory of information, which in part inspired Fitts’s model, using a different index of difficulty (Equation 4.2). The resulting model (Equation 4.4) has become an accepted standard for movement time studies.

\begin{align}
\text{(Index of Difficulty)} \quad & ID = \log_2\left(\frac{A}{W}\right) \tag{4.1} \\
\text{(Shannon Index of Difficulty)} \quad & ID = \log_2\left(\frac{A}{W} + 1\right) \tag{4.2} \\
\text{(Fitts)} \quad & MT = a + b \log_2\left(\frac{A}{W}\right) \tag{4.3} \\
\text{(Shannon-Fitts)} \quad & MT = a + b \log_2\left(\frac{A}{W} + 1\right) \tag{4.4}
\end{align}

Both of these movement time models are called one-part formulations because they consider only the ratio of amplitude and width; their individual magnitudes are not relevant. In 1968, Welford [51] recognized that there are two phases of movement, one of rapid ballistic motion affected mainly by amplitude, and one of homing in on the target, which depends mainly on target width. Welford proposed a two-part formulation (Equation 4.5), which explicitly recognizes the different contributions of width and amplitude. Recently, Shoemaker et al. [44] suggested a fourth formulation (Equation 4.6) that combines aspects of the Shannon-Fitts one-part formulation and Welford’s two-part formulation, which they used to analyze pointing tasks at a distance on large wall displays.

\begin{align}
\text{(Welford)} \quad & MT = a + b_1 \log_2(A) - b_2 \log_2(W) \tag{4.5} \\
\text{(Shannon-Welford)} \quad & MT = a + b_1 \log_2(A + W) - b_2 \log_2(W) \tag{4.6}
\end{align}

The b parameter in Equations 4.3 and 4.4 is the rate of change of movement time as the index of difficulty is varied. Looking at it helps to understand how a particular model behaves.
It can also be compared across different models, for example for different pointing devices, to understand their relative performance, but care should be taken because this comparison does not take into account the a parameter of the models. Instead, Soukoreff and MacKenzie [45] suggest a comparison based on throughput per individual, which they define as the average of ratios of effective index of difficulty over movement time. Throughput is measured in bits per second, in keeping with the information-theoretic interpretation of Fitts’s Law and its variants. For participant i in a study performing in conditions indexed by j,

\[ TP_i = \frac{1}{n} \sum_{j=1}^{n} \frac{ID_{ij}}{MT_{ij}} \tag{4.7} \]

where n is the number of target conditions (combinations of target width and amplitude). The average of all participant throughput values is the overall throughput for a given condition. This value can be used to compare conditions directly.

Welford also suggested that using the actual width of targets for the computations might be incorrect. For example, a person moving quickly towards a wide target will tend to tap it at a position closer to the near edge than the far edge of the target, and so the distribution of tap positions will likely fall on a region narrower than the target’s width. The effective width of a target is the region where most observations happen, irrespective of its actual width. A similar reasoning can be used to derive the concept of effective amplitude. Soukoreff and MacKenzie [45] encourage the use of effective widths and amplitudes for Equations 4.3 and 4.4, and we adopt this for our analysis of all model formulations. We use their definition of effective width as the region where approximately 96% of the observations occur, and so it is given by

\[ W_e = 4.133\,\sigma \tag{4.8} \]

where σ is the standard deviation of the end-point positions of the observations for the particular target.
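As an illustration of Equations 4.2, 4.7 and 4.8, the following Python sketch computes an effective width from tap end-points and a per-participant throughput. It is not the analysis code used for the study; the function names are hypothetical, the sample standard deviation is assumed for σ, and movement times are assumed to be in seconds so that throughput comes out in bits per second.

```python
import math
from statistics import mean, stdev


def effective_width(endpoints):
    """W_e = 4.133 * sigma (Equation 4.8).

    `endpoints` holds the tap positions recorded for one target.
    Assumption: sigma is the sample standard deviation.
    """
    return 4.133 * stdev(endpoints)


def shannon_id(amplitude, width):
    """Shannon index of difficulty in bits (Equation 4.2)."""
    return math.log2(amplitude / width + 1)


def throughput(conditions):
    """Throughput of one participant in bits per second (Equation 4.7).

    `conditions` is a list of (effective_id_bits, movement_time_s) pairs,
    one pair per target condition.
    """
    return mean(id_bits / mt_s for id_bits, mt_s in conditions)


# Hypothetical numbers, for illustration only.
w_e = effective_width([0.0, 2.0, 4.0])
id_e = shannon_id(700.0, 100.0)
tp = throughput([(3.0, 1.5), (2.0, 1.0)])
```

Averaging the per-condition ratios, rather than dividing a mean index of difficulty by a mean movement time, follows the definition quoted above ("the average of ratios").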
Effective amplitude is defined as the mean amplitude between successive observations for the target. We will write simply W and A for effective width and effective amplitude throughout the text, rather than the more cumbersome $W_e$ and $A_e$.

A thorough discussion of the four empirical models of pointing performance that we will use is provided by Shoemaker et al. [44]. We follow closely their approach for analyzing the results of our study and for comparing between the models.

4.3 Method

We conducted a controlled study that utilized the one-dimensional horizontal serial tapping task commonly used in Fitts’s Law studies in which participants have to point at and select alternating targets on the display. We based the design and analysis on similar experiments in previous HCI research, and we carried out some additional analyses suggested by Soukoreff and MacKenzie [45]. We also analyzed our data using Welford-style two-part formulations as suggested by Shoemaker et al. [44] to learn more about the separate effects of width and amplitude on movement time.

We chose a one-dimensional task instead of a two-dimensional task because we wanted to carry it out in a realistic setting, which for us is an auditorium-style classroom with a large display that is usually located high above and out of reach of the instructor. Such a setup creates an acute vertical angle between the user and the display that causes important differences in target perception due to foreshortening and, in early tests, it was seen to cause discomfort for some users. In the study participants were shown rectangular targets that they had to point to and select in alternating motion, as illustrated in Figure 4.1, switching between devices as the experiment progressed. The full details of the experiment are provided in the remaining sections of this chapter.

4.3.1 Participants

There were 24 participants who took part in the study (7 female) between the ages of 22 and 34.
They were recruited by advertising in UBC student email lists and by word of mouth. To avoid experimental bias due to handedness or personal handicaps, all our participants were screened by self-report to be right-handed, as well as having normal or corrected-to-normal vision, no color-blindness, and to be regular computer users averaging 8 hours a week or more of computer usage.

Figure 4.1: Illustration of the task performed by participants: (left) the cursor is moved from its starting position towards the darker rectangle, (center) if the user clicks correctly on the target the other rectangle becomes the new target, and (right) if instead the user misses the target it flashes red and the targets then switch.

The study had approval from the Behavioural Research Ethics Board at UBC. Participants were fully informed of the purpose of the study and of their right to withdraw at any point. They were compensated $10 for their time.

4.3.2 Apparatus

The study was run in an amphitheater-style classroom, which is the type of room the camera tracker was designed for. The display was 390cm wide and 248cm tall, raised 216cm above the floor. An ASUS laptop with an Intel i7-4710HQ CPU @2.5GHz, 12GB of RAM, and an NVIDIA GeForce GTX 850M discrete graphics card using 64-bit Windows 8.1 ran all of the experimental software. The laptop computer was connected via HDMI to an EPSON PowerLitePro Z8050W projector located at the back of the room. The display generated a 1280×800px resolution image at 60Hz. The laptop computer was used to record all experimental data.

During each condition participants were located in one of three positions in the room as shown in Figure 4.2. Positions #1 and #3 were closer to the screen and thus at little or no elevation relative to where instructors normally stand during lectures. Position #2 was farther away, where the floor is significantly raised relative to the front of the room, and so had an elevation of 74cm. Assuming eye-level at 120cm from the floor when sitting down and 170cm when standing up, vertical angles from eyes to the center of the display were approximately 27.5° and 22° for sitting and standing poses respectively in position #1, and 7.9° and 32° for a standing pose in positions #2 and #3 respectively.

Figure 4.2: (top) A picture of the room illustrating what participants saw. All lights were turned off during the study. (bottom) A diagram drawn to scale of the room layout. Positions #1 and #2 are perpendicular to the center of the screen, and position #3 is off to the side.

One of the main goals of the study was to make a fair comparison between the mouse and the camera, so both were tested from position #1, which had the line of sight perpendicular to the screen and provided an optimal angle of view for both devices. The other positions were only used for the camera. For this reason, position #1 is of main interest and is where most of the study took place.

We tried moving position #1 closer to the screen to increase ecological validity for the camera condition, but this proved too straining for users, particularly in the mouse conditions. Position #1 is close enough to an actual lecturing position that we believe it is a good compromise that allowed users to perform the study comfortably.

Position #2 increased the distance to the screen, so we could get an idea of how this factor affects performance and which formulation of Fitts’s Law works better. Position #3 was chosen because it represents more closely how a lecturer might use the tracker in actual practice. It is at the front of the classroom but off to the side, so pointing can be done without fully turning your back on the audience.

We used the OpenCV library [7], written in C/C++, for homography estimation and a Java-based driver for the i>Clicker devices that was developed by Shi [43] and refined by Beshai [6] to capture clicker interaction.
The rest of the software for the study was implemented natively in the Processing 2 language, which runs on the Java Virtual Machine and can maintain a refresh rate of 60fps.

Mouse

Participants were given a Microsoft Comfort Optical Mouse 1000, which they could move freely on a desk to reach all parts of the display, and use the left button to select targets. Mouse acceleration was disabled in the operating system, and the gain adjusted to a value at which participants could reach the more distant targets without clutching while still allowing them enough precision for smaller targets. This gain value was determined by early testing with multiple pilot participants. The mouse was connected to the computer by a 6m active USB extension to ensure a strong signal, and the end of the extension was anchored in place so that its weight would not pull on the mouse.

Camera

We adapted a generic consumer-grade low-end USB webcam with a resolution of 640×480px, sampled at 25fps with automatic exposure and white-balancing that could not be turned off. We removed extra parts from the plastic casing and attached it to a wooden dowel using hot glue to create a device resembling a laser pointer. The camera’s USB cable was twisted around the dowel and held in place with tape to ensure its weight would not pull on the front side of the pointer. The camera focus can be adjusted by twisting the screw-mounted lens on the front; it tends to change unpredictably with movement and vibrations, so it was brought to an appropriate state and held in place with a rubber band twisted on itself. Figure 4.3 shows the result, which is what was used for the study. The total cost of this device was less than $5.

For target selection, we wanted to avoid the problem of the pointer moving out of position when the user clicked a button on it, so instead participants held in the left hand an i>Clicker device that was synchronized with a base station connected to the experiment’s computer.
Any of the 5 buttons of the clicker could be used to indicate selection. As with the mouse, we used the active USB extension to connect the camera, similarly anchored, except for position #3 where we had no reliable way of doing so. For that position users held the end of the extension in their left hand together with the i>Clicker so the weight of the cable would not interfere with the right hand’s use of the camera.

Figure 4.3: The physical prototype of the tracker device used by participants. The image also shows the pencil grip they were asked to use during the study.

4.3.3 Study Design

Our study was within-subjects. Experimental conditions for the target rectangles were 3 widths (24, 48 and 96 pixels, or 7.31, 14.63 and 29.25 cm respectively) and 3 amplitudes (200, 400 and 800 pixels, or 60.94, 121.88 and 243.75 cm respectively), as measured from center to center of the rectangles. All 9 combinations of widths and amplitudes were used for the target conditions. Additionally there were 6 blocking pose conditions, summarized in Table 4.1.

Pose      Pos  Device  Details
--------  ---  ------  --------------------------------
no grid   #1   Mouse   Sitting down, grid off.
grid      #1   Mouse   Sitting down, with grid visible.
sitting   #1   Camera  Sitting down.
standing  #1   Camera  Standing up.
far       #2   Camera  Standing up.
side      #3   Camera  Standing up.

Table 4.1: The six pose conditions used in the study.

The first four poses, all in position #1, made up the main part of the experiment. We used those to determine the parameters for models of performance based on Fitts’s Law, to compare performance between the mouse and the camera, and to measure the effects of the grid being visible when the mouse was used. The order in which participants experienced these was fully counterbalanced to account for learning or tiredness effects. Those four conditions were always presented first.
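Full counterbalancing of four conditions yields 4! = 24 distinct orderings, which matches the number of participants in the study. The Python sketch below enumerates these orderings and checks that every pose appears equally often in every serial position. It is only an illustration: the assumption that each participant received a different ordering is ours, and the names `MAIN_POSES` and `counterbalanced_orders` are hypothetical.

```python
from collections import Counter
from itertools import permutations

# The four main poses, all carried out in position #1 (Table 4.1).
MAIN_POSES = ("no grid", "grid", "sitting", "standing")


def counterbalanced_orders(poses=MAIN_POSES):
    """Return every possible presentation order of the given poses."""
    return list(permutations(poses))


orders = counterbalanced_orders()

# Count how often each pose occupies each serial position (0 = first).
position_counts = Counter(
    (pose, index) for order in orders for index, pose in enumerate(order)
)
```

With 24 orderings each pose occupies each of the four serial positions exactly 6 times, so learning and tiredness effects are spread evenly over the poses.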
After that, participants would do the “far” and “side” poses, which were partially balanced between them so that of the 12 participants that experienced “standing” as the first camera pose, half of them did “far” first and the other half did “side” first, and similarly for the 12 participants that experienced “sitting” first. Target conditions were randomized within each block for every participant.

Everyone experienced all 6 pose conditions, performing 12 trials for each of the 9 targets, so there were 24 × 6 × 3 × 3 × 12 = 15,552 total trials. The whole session took about 45 minutes to complete.

4.3.4 Procedure

After reading and signing a consent form explaining their rights and what the experiment was about, as required by the UBC Behavioural Research Ethics Board, participants were presented with a questionnaire to make sure they met the requirements for participation.

They were told they would see different shapes on the display, in particular two vertical blue rectangles, one dark and one light, and that their task was to point at the dark rectangle and click, at which point the rectangles would switch colours and participants would then have to click on the other rectangle that had become dark, going back and forth between them until the end of a block. They were shown both devices, the mouse and the camera tracker, and given a brief explanation on how to operate them.

For the mouse they all had previous experience, and could use it as usual, pressing the left button to select a target.

For the camera we explained that it was a regular webcam, that they were supposed to hold it like a pencil and point it at the screen, and that for selecting they should hold the i>Clicker in their left hand, using any of the buttons on it to indicate a selection.
Other than the pencil grip, they were allowed to hold and move their arm in front of them however they wanted, but were advised that they could rest their arm while sitting down or hold it close and bend it while standing to avoid getting tired. They were told that if ever they lost the tracker, the best strategy was to avoid waving their hand in the air trying to find the tracker again, but instead to hold it still for a moment while pointing at the center of the display and let the tracker find itself.

Room lights were turned off for the remainder of the experiment, so the only light in the room came from the display.

Whenever participants clicked with the mouse or pressed a button on the clicker, the active (dark) rectangle would become inactive (light) and vice versa. If a participant missed a target, the target would flash red for a moment to indicate that an error had occurred. The cursor consisted both of a red circle and an orthogonal cross. Participants were told that as long as the center of the cross was inside the rectangle, it was considered a good tap. They were instructed to emphasize accuracy first and speed second.

Participants carried out practice sessions with the camera and with the mouse before doing the main task until both they and the experimenter felt confident about their proficiency with the devices and their understanding of the task. At a minimum, all participants did 3 practice taps for each of the rectangle conditions, all while sitting down in position #1. Two participants requested a second practice session with the camera. For practice with the mouse, visibility of the grid was synchronized with their first mouse condition, so those who would see the mouse with the grid condition first got practice with the grid visible but those who would see the no-grid condition first did not.

After each block consisting of all trials for one of the poses, there was a minimum 2 minute pause. During this time the experimenter would switch devices (if appropriate) and move the participant to a new position in the room as necessary. Additionally, within the block there were 2 optional pauses, one after every 3 target conditions. Most participants used the pauses to take a few seconds of rest when using the camera tracker. After finishing all 6 blocks for the poses, participants were presented with a questionnaire to gather qualitative data about their experience with the camera.

4.4 Results

For the analysis the first trial of each target condition was thrown out. From time to time during the study, the camera would have a series of bad readings causing the cursor to jump around the screen chaotically. When this happened, participants sometimes had trouble finding it and getting it to stabilize again. For this reason, of the remaining 14,256 trials, 14 were considered outliers for taking more than 10 seconds and were discarded (11 of those were in the “side” condition, all on the left target of the narrowest width and widest amplitude). A further 31 trials were thrown out because they ended 5 standard deviations or more away from the target center.

We encountered a problem during the experiment that was not found in piloting. After pressing a clicker button, the remote imposed a delay of 0.75 seconds where no other button press would go through. This meant that sometimes when participants clicked one target and then another in less than 0.75 seconds, the second click would not be registered, causing them to do a “double take” when the targets did not switch, increasing their registered movement time for that trial. We did not remove these trials from the analysis because they were sporadic and difficult to detect reliably, but the fact should be kept in mind when interpreting results.

We measured movement time and tap position for each trial, from which we extracted error rates for each condition.
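The trial bookkeeping and the two exclusion rules described above can be sketched in Python as follows. This is an illustration rather than the analysis code used in the study: the `Trial` record and the function names are hypothetical, and the standard deviation used for the 5 SD rule is assumed to be that of the tap positions in the same pose-target condition, which the text does not state explicitly.

```python
from dataclasses import dataclass

PARTICIPANTS, POSES, WIDTHS, AMPLITUDES, TRIALS = 24, 6, 3, 3, 12

# All recorded trials, and those left after dropping the first trial
# of every target condition.
total_trials = PARTICIPANTS * POSES * WIDTHS * AMPLITUDES * TRIALS
analysed_trials = PARTICIPANTS * POSES * WIDTHS * AMPLITUDES * (TRIALS - 1)


@dataclass
class Trial:
    movement_time_s: float  # time taken to select the target
    tap_x: float            # horizontal tap position (px)
    target_x: float         # horizontal center of the target (px)
    target_width: float     # width of the target (px)


def is_outlier(trial, tap_sd, max_time_s=10.0, max_sd=5.0):
    """True if the trial falls under either exclusion rule.

    `tap_sd` is the standard deviation of tap positions for the
    condition (an assumption about how the rule was applied).
    """
    too_slow = trial.movement_time_s > max_time_s
    too_far = abs(trial.tap_x - trial.target_x) >= max_sd * tap_sd
    return too_slow or too_far


def error_rate(trials):
    """Fraction of trials whose tap landed outside the target."""
    misses = sum(abs(t.tap_x - t.target_x) > t.target_width / 2 for t in trials)
    return misses / len(trials)
```

The two totals reproduce the counts given in the text: 15,552 recorded trials and 14,256 trials remaining once the first trial of each target condition is removed.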
Analyses for these variables and the throughput calculations suggested in the literature for Fitts’s Law experiments are reported in the subsections that follow.

Figure 4.4: Mean movement times in milliseconds at different indexes of difficulty for all pose conditions. Lines are included only for readability.

4.4.1 Movement Time

Data were aggregated by target and pose. Mean movement time, effective width and effective amplitude were computed for pose-target conditions as discussed in Section 4.2. Figure 4.4 shows the mean movement time values plotted against index of difficulty (Equation 4.2) from the Shannon formulation of Fitts’s Law, which is an accepted standard for this type of study.

Shoemaker et al. [44] recommend that two-part models be considered when analyzing Fitts’s tasks with multiple levels of gain. Although we did not vary gain in our study, we did vary distance from the screen, which has also been reported to be better modeled by a two-part formulation by Rajendran [39]. This is an area of much interest for current and future research, and so we present results for analyses using all four models discussed by Shoemaker et al., summarized in Table 4.2.

| | One-Part (Fitts) | Two-Part (Welford) |
|---|---|---|
| basic | $a + b \log_2\left(\frac{A}{W}\right)$ | $a + b_1 \log_2(A) - b_2 \log_2(W)$ |
| Shannon | $a + b \log_2\left(\frac{A}{W} + 1\right)$ | $a + b_1 \log_2(A + W) - b_2 \log_2(W)$ |

Table 4.2: The four formulations of Fitts’s Law considered for our analysis. Each predicts movement time MT from target width W and amplitude (distance) of movement A.

We also carried out F-test comparisons between pairs of nested models to see if the two-part formulations work better than the corresponding one-part models for our data.
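The nested comparison can be illustrated with a short sketch. It is not the analysis code used for the study: it assumes ordinary least-squares fits over nine target conditions per pose (three widths by three amplitudes), and the function names are hypothetical. Because a one-part model and its two-part counterpart are fit to the same movement times, the F statistic can be written in terms of their R² values.

```python
import math

def mt_fitts(a, b, A, W):
    """One-part basic model: MT = a + b * log2(A / W)."""
    return a + b * math.log2(A / W)

def mt_welford(a, b1, b2, A, W):
    """Two-part basic model: MT = a + b1 * log2(A) - b2 * log2(W)."""
    return a + b1 * math.log2(A) - b2 * math.log2(W)

def nested_f(r2_one_part, r2_two_part, n, p_one_part=2, p_two_part=3):
    """F statistic for two nested least-squares models fit to the same data.

    n is the number of fitted data points (assumed here, not stated in the text).
    """
    numerator = (r2_two_part - r2_one_part) / (p_two_part - p_one_part)
    denominator = (1.0 - r2_two_part) / (n - p_two_part)
    return numerator / denominator

# The two-part model reduces to the one-part model when b1 == b2,
# which is what makes the pair "nested". Coefficients here are arbitrary.
one_part = mt_fitts(150.0, 250.0, A=400, W=48)
two_part = mt_welford(150.0, 250.0, 250.0, A=400, W=48)

# Rounded R² values for the "no grid" pose in Table 4.3, with n = 9 assumed.
f_no_grid = nested_f(0.969, 0.994, n=9)
```

With the rounded R² values from Table 4.3 this gives an F value close to 25 for the “no grid” pose, in line with the reported 24.60; the small gap is plausibly due to rounding R² to three decimals.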
The results for Fitts and Welford models are shown in Table 4.3 and for Shannon and Shannon-Welford in Table 4.4.

| Pose | Fitts a | Fitts b | Fitts R² | Welford a | Welford b1 | Welford b2 | Welford R² | F | p |
|---|---|---|---|---|---|---|---|---|---|
| no grid | 287.45 | 229.82 | 0.969 | 657.96 | 201.23 | 276.76 | 0.994 | **24.60** | **0.003** |
| grid | 207.58 | 256.68 | 0.961 | 496.64 | 235.68 | 295.57 | 0.977 | 4.10 | 0.089 |
| sitting | 423.91 | 431.59 | 0.963 | 800.18 | 406.57 | 484.27 | 0.971 | 1.83 | 0.225 |
| standing | 270.28 | 472.59 | 0.930 | 659.64 | 454.38 | 540.75 | 0.940 | 1.06 | 0.342 |
| far | 343.00 | 489.98 | 0.907 | 1363.62 | 439.54 | 661.95 | 0.954 | **6.16** | **0.048** |
| side | 130.03 | 622.24 | 0.899 | 770.34 | 595.10 | 733.73 | 0.912 | 0.89 | 0.383 |

Table 4.3: Movement time models for the Fitts and Welford formulations. Significant differences in nested models are highlighted in bold.

| Pose | Shannon-Fitts a | Shannon-Fitts b | Shannon-Fitts R² | Shannon-Welford a | Shannon-Welford b1 | Shannon-Welford b2 | Shannon-Welford R² | F | p |
|---|---|---|---|---|---|---|---|---|---|
| no grid | 153.23 | 257.05 | 0.970 | 539.28 | 225.15 | 300.44 | 0.994 | **24.20** | **0.003** |
| grid | 53.77 | 288.66 | 0.966 | 357.04 | 265.05 | 325.39 | 0.982 | 5.45 | 0.058 |
| sitting | 150.03 | 490.01 | 0.967 | 540.01 | 461.85 | 539.21 | 0.976 | 2.17 | 0.191 |
| standing | -17.90 | 534.16 | 0.932 | 389.13 | 513.39 | 601.30 | 0.943 | 1.16 | 0.323 |
| far | 20.56 | 560.55 | 0.915 | 1079.70 | 503.28 | 727.42 | 0.963 | **7.90** | **0.031** |
| side | -297.77 | 718.35 | 0.908 | 366.29 | 687.16 | 827.14 | 0.922 | 1.01 | 0.353 |

Table 4.4: Movement time models for the Shannon-Fitts and Shannon-Welford formulations. Significant differences in nested models are highlighted in bold.

In every condition the two-part models outperform their corresponding one-part formulations, as measured by R² values. By the same measure, the Shannon models did better than their respective basic counterparts. The F-tests show a significant difference between one and two-part models only for the “no grid” and “far” conditions. From this data it seems the performance of the different models is generally comparable in our intended classroom setting: both one-part and two-part models work equally well, although the Shannon-Welford model might be slightly preferred.

For one-part models the b parameter is the slope, or rate of change of movement time as the index of difficulty increases.
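To see how slope and intercept combine, the Shannon-Fitts fits from Table 4.4 can be evaluated at the two ends of the difficulty range. The sketch below is only illustrative: it assumes the nominal target widths (24 to 96 px) and amplitudes (200 to 800 px) that label Table 4.5, whereas the fits themselves used effective widths and amplitudes, and it assumes the coefficients are in milliseconds as in Figure 4.4. Here “no grid” stands for the mouse and “standing” for the camera; the function names are hypothetical.

```python
import math

# Shannon-Fitts coefficients (a, b) copied from Table 4.4.
MODELS = {"no grid": (153.23, 257.05), "standing": (-17.90, 534.16)}

def index_of_difficulty(amplitude, width):
    """Shannon formulation: ID = log2(A / W + 1), in bits."""
    return math.log2(amplitude / width + 1)

def predicted_mt(pose, amplitude, width):
    """Predicted movement time a + b * ID for the given pose."""
    a, b = MODELS[pose]
    return a + b * index_of_difficulty(amplitude, width)

def camera_to_mouse_ratio(amplitude, width):
    """Ratio of predicted camera ("standing") to mouse ("no grid") times."""
    camera = predicted_mt("standing", amplitude, width)
    mouse = predicted_mt("no grid", amplitude, width)
    return camera / mouse

easiest = camera_to_mouse_ratio(amplitude=200, width=96)  # lowest nominal ID
hardest = camera_to_mouse_ratio(amplitude=800, width=24)  # highest nominal ID
```

Under these assumptions the predicted camera-to-mouse ratio is about 1.5 for the easiest condition and about 1.85 for the hardest, so a slope ratio of roughly two overstates the gap for easy targets, where the intercepts matter more.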
For our range of index of difficulty values, the camera models have a slope roughly twice those for the mouse, which suggests that pointing tasks can be expected to take roughly twice as long with the camera. This, however, is not entirely clear from the data because the a parameter, the intercept, also plays a role. We will come back to this point when we analyze throughput.

Note that the R² values are worst in conditions “far” and “side”, which makes sense because those correspond to tasks with a higher handicap, and are prone to more noise due to the bigger challenge presented to the camera’s vision algorithms. In addition the “side” models are oversimplifying, in that they group taps from both left and right targets together, while in reality tapping on the left side was considerably harder than the right side because of the reduced visual angle due to the geometry.

4.4.2 Error Rates

| Pose | W 24px / A 200px | W 24px / A 400px | W 24px / A 800px | W 48px / A 200px | W 48px / A 400px | W 48px / A 800px | W 96px / A 200px | W 96px / A 400px | W 96px / A 800px |
|---|---|---|---|---|---|---|---|---|---|
| no grid | 4.6% | 5.0% | 5.4% | 2.7% | 5.0% | 3.0% | 1.1% | 1.5% | 1.5% |
| grid | 3.8% | 3.8% | 3.8% | 1.9% | 4.6% | 4.6% | 0.4% | 0.8% | 0.8% |
| sitting | 6.4% | 8.7% | 7.3% | 3.4% | 3.4% | 4.6% | 1.9% | 2.7% | 1.1% |
| standing | 5.0% | 9.5% | 13.7% | 1.9% | 5.3% | 3.1% | 0.0% | 1.9% | 1.1% |
| far | 11.4% | 12.9% | 16.0% | 3.8% | 3.0% | 3.0% | 1.1% | 1.5% | 0.8% |
| side | 14.0% | 15.2% | 19.1% | 3.4% | 4.6% | 6.0% | 1.1% | 2.7% | 3.8% |

Table 4.5: Error rates for pose and target conditions. Columns are grouped by target width W, with movement amplitude A varying within each group.

Error rates for each pose and target condition are shown in Table 4.5. Conditions “no grid” and “standing” represent the fairest comparison between the mouse and camera, respectively, and they have similar error rates except for the narrower rectangles, where the mouse is more reliable.

There is no clear pattern between the “sitting” and “standing” conditions; however, the “grid” condition outperforms the “no grid” condition in all but one case, suggesting there might be a correlation between presence of the grid and improved error rate.

Errors for narrow rectangles in the “far” and “side” conditions rise drastically, suggesting that the camera tracker’s reliability falls with increased depth and acute angles, as expected. Hand jitter is probably a factor here, but so is the increased presence of chaotic movements due to deficiencies in the vision algorithms.

4.4.3 Throughput

We use the Shannon-Fitts formulation of index of difficulty (Equation 4.2) for the throughput equations described in Section 4.2. The results are presented in Table 4.6.

| Pose | grid | no grid | sitting | standing | far | side |
|---|---|---|---|---|---|---|
| Throughput | 3.54 | 3.58 | 2.04 | 2.08 | 1.92 | 1.79 |
| SD | 0.29 | 0.39 | 0.29 | 0.30 | 0.34 | 0.32 |

Table 4.6: Throughput means for all pose conditions, in bit/s.

The “standing” and “no grid” throughput values strengthen the argument that the mouse is roughly twice as efficient as the camera as an input technique in this setting. We ran a one-way repeated-measures ANOVA to detect possible effects of pose on throughput. Mauchly’s test revealed no violation of sphericity, χ²(14) = 22.55, p = 0.07, and the results show an effect of pose on participants’ throughput, F(5, 115) = 571.69, p < 0.001. Bonferroni post hoc tests show that there are significant differences between mouse conditions and camera conditions (all p < 0.001), between “side” and both “sitting” and “standing” conditions (both p < 0.001), and between “far” and “standing” conditions (p = 0.01). No other differences were significant; in particular, no difference was found between the mouse conditions “grid” and “no grid”, or between the camera conditions “sitting” and “standing”.

Figure 4.5: Comparison of average throughput values per participant. Participants were sorted by increasing throughput in the “no grid” condition, and the values for the “standing” condition are also shown.

Throughput values varied considerably between participants. Figure 4.5 shows average throughput values per participant for the main mouse and camera conditions, sorted by increasing throughput with the mouse. The data for the camera condition shows a clear upward trend, suggesting a positive correlation between performance with both devices. Only a few participants had markedly different performance levels between devices, performing strongly with the mouse but poorly with the camera, for example. This could be due to individual differences in how users interact with computer interfaces, or to individual strategies while carrying out the task (a couple of the participants with high performance did mention they consciously searched for an efficient strategy).

4.4.4 Subjective Data

We asked participants to rate the level of difficulty they perceived the task to be when using the camera, from 1 (Easy) to 5 (Impossible), to report any particular strategy that they might have employed, and to provide any other comments they had regarding their experience with the task. The average difficulty was rated as 2.292 (SD = 1); the details are summarized in Figure 4.6.

Figure 4.6: Difficulty ratings subjectively reported by participants for the pointing task using the chiroptic tracker, from 1 (easy) to 5 (impossible).

The most common strategy was that of doing a fast initial movement to launch the cursor close to the target, and then fine-tuning their pointing with precise movements, with 13 participants saying they did this. It is possible that the technique was encouraged by the defect in sensing described in Section 3.3.5 that makes the tracker “sticky” when suddenly moving a long distance.
Two participants who commented they were trying to be fast used a technique in which they would move quickly and then click while the cursor was passing over the target rectangle, so it did not matter if they overshot as long as they clicked in time. Three people tried to train their body to remember the distance they had to flick their wrist to go from one target to the other.

Several participants mentioned during the experiment that lag was perceptible. A few of the participants expressed frustration with the narrower targets, both in their written comments and while doing the experiment. Other comments were about too much shakiness/sensitivity in the cursor, fatigue, preference between devices, and ergonomics. Looking more into these issues would be beneficial, but from this data there does not seem to be a main theme or trend. In the end it might come down to just individual differences and preferences.

• Cursor sensitivity. P06 said “The lack of accurate and stable control makes the task difficult when using the camera”, P11 said “The pointer is shaky and too sensitive to movements”, and P08 said “Pointing the camera like a pen is probably not the most stable way, holding a flashlight might be more stable figure”.

• Fatigue. P22 said “Using this pointer for long periods of time may be exhausting”, and similarly P07 said “The arm is getting tired after a while, but I imagine in a real-world scenario one wouldn’t do so many interactions after another”. On the other hand P09 said “Time was not a factor. I was not tired.”

• Preference of device. Opinions were varied. P18 said “I liked the mouse so much more”, but in contrast P12 said “So far it is a great technique to look and point at objects”, while P24 said “Same difficulty as mouse, mostly. Exception is when it flitted all over the place”, possibly referring to the chaotic movements that happened from time to time when the sensor misbehaved. Most likely, each device has its place depending on the task, as P20 hints, by noting “The mouse may be more accurate but the camera felt easier and more intuitive to use.”

• Ergonomics. P17 said “If the camera can be held in [another] way [...] it might be easier and put less pressure on your wrist.”, and P21 summed it up as “I think it will be important to take ergonomics into account.” We have stated before that the design of a physical tracker is a challenge in itself because we should consider the comfort of users and balance their preference for grip with the benefits of one that promotes better accuracy and precision.

4.5 Discussion

Our results indicate that the chiroptic tracker is a valid technique for a pointing task in a classroom setting. While there is still much work to do, our findings are encouraging. Pointing with the camera seems to follow the Fitts paradigm, and while generally the Shannon-Welford model had the best fit, there was no real difference between formulations for most conditions.

We used a constant gain value for all of the conditions in the study. Had we varied gain, we might have been better able to determine whether a two-part Welford formulation is required, as has been discussed in the literature.

As discussed by Myers et al. [32], users of laser pointers experience significantly more vertical jitter in their hand movement than horizontal jitter. In future research it would be valuable to measure the effects of vertical movements on camera tracker performance.

In closing, we briefly summarize our four hypotheses and the degree to which our experimental data support each of them.

H1 was supported. Error rates between camera and mouse conditions in position #1 are comparable, and throughput values with both devices are within a factor of two of each other. While there seem to be varied preferences in devices, it is clear that participants can perform sufficiently well with a camera tracker.
The mouse has the home advantage, because all our participants are regular computer users familiar with its use. We do not expect users will ever match the mouse performance as they become more experienced with the camera (they are fundamentally different methods of interaction), but at least their subjective experience might be improved.

Good interface design will also increase usability. Narrow targets were distinctly harder to hit than the rest, so care should be taken by designers to work around the natural limitations that come with pointing at a distance.

H2 was supported. Analysis of throughput shows that the presence of the grid did not impact users’ performance with the mouse significantly, and their error rates were comparable. While the requirement to overlay the grid on top of the screen contents is unpleasing, it seems users can work around it effectively in this task. This is a promising finding as we continue to design a classroom interface using the chiroptic device.

H3 was supported. This is not a very surprising result, because a larger distance from the screen increases the effect of natural hand jitter, or, put another way, it reduces the width of the targets when measured in visual angles from the point of view of the user, making it more difficult to acquire them precisely. Sharper angles have the added problem of bringing the vision algorithms in the tracker to their limit, making the sensor less reliable. In the future we would like to perform more experiments manipulating control to display ratio at different distances from the screen to dig deeper into the effects of distance, and improving the sensor is still an active research area.

H4 was supported. There was no significant difference in throughput values for the “sitting” and “standing” conditions.
The error rates are only slightly larger for the thinnest rectangles at maximum movement amplitude for the “standing” pose.

Chapter 5

Design Brief for a Classroom Interface

In Chapter 4 we saw that users perform about twice as fast with the mouse as with the camera tracker, but the camera tracker has a strong advantage in that it allows for direct pointing that gives freedom to the lecturer to move around the room and perform relatively complex interactions without having to go back to the computer. It is this freedom that we believe greatly offsets the performance cost of the camera tracker.

Informed by our experience implementing and using the chiroptic tracker, in this chapter we present some key design ideas for anyone who wants to build a classroom interface with a direct pointing device. These ideas have not been tested yet; our goal is to motivate discussion and provide a starting point for future interface designers.

We will describe our design brief in abstract first, making the case for direct pointing in general, and later we will describe in more concrete terms an example of how our own prototype could be used as part of a larger system to support engagement in an interactive classroom. We present our recommendations as a set of assertions, each accompanied by a brief summary of our reasoning.

5.1 Goals of the Interface

Engaging for students. The interface should be a way for lecturers in a classroom setting to interact with a large wall display. Its main goal is to empower users to perform interactions that are more dynamic and complex than what is common today, which is largely restricted to flipping back and forth between slides and pointing at the screen with a laser. Even though lecturers are the primary target, the design should take care to acknowledge the needs of the rest of the occupants of the room. This means, especially, that the audience should be able to follow along with whatever action is taking place.

Tailored to expert users.
University lecturers can be expected to spend some effort in learning the interface, so while it should not be unnecessarily complex or obscure, a rich and powerful interface is more desirable than a simplistic one even if it requires a bit of training and practice to fully master it. Lecturers are professionals. They should have professional-quality tools and they should invest time and effort developing their skills with those tools.

Embedded in current practice. Interactions should flow naturally as part of the conversation between lecturer and audience that happens in a class. Disruptions to this conversation should be avoided as much as possible. When possible the interaction should mimic traditional classroom activity that has stood the test of time. Traditional chalk-board lectures had many advantages. We believe some of these can be reclaimed while still profiting from the many new opportunities digital media provide.

Multimodal and embodied. When people talk, they perform deictic gestures that make their words more precise or nuanced. Pointing a finger at something is a strong deictic gesture, as is switching your gaze towards an object: most people will instinctively look to see what is being pointed or looked at. This is the reason why direct pointing is such a desirable quality in a classroom interface, because the lecturer can use body language that the audience understands naturally, and if this body language is translated into updated interface elements, then the conversation will flow effortlessly. The action of turning towards the screen to point is an indication to the audience that they should turn their attention there. In contrast, a pointing technique that manipulates the display indirectly, like a mouse or a touch-sensitive surface that acts as a remote control, does not convey this natural human understanding.

5.2 Required Resources Available Today

We assume a lecturer has both hands available to carry out direct pointing interactions, one to hold the camera and another for pressing buttons. Obviously this will not always be the case (special versions of the equipment will be needed for those with disabilities, which is a challenge we will not address here beyond noting that this consideration is essential to the final design). The setup we used before conforms to this description, with the camera tracker in one hand and the 5-button clicker in the other. In the future, as the physical form of the tracker evolves, it might be possible to have everything integrated into a single device, or to free one hand entirely by placing the camera on the wrist, for example.

We further assume that instructors are more likely to tolerate some level of inconvenience if it grants them a richer set of interactions with students, which is the reason they are already willing to clip on microphones and bring their own laptop computers to class for lectures. Because our direct pointing techniques are intended for instructors giving lectures, not for students listening to lectures, convenience for the casual user is not (yet) a requirement.

A fundamental assumption is that there is a small but sufficient set of distinct buttons available to the user. There has to be at least one button to switch the device on and off (clutching). Two more buttons are enough to do most interactions because we can use one to toggle between modes and the other to trigger actions. Having more than just three buttons reduces the number of distinct modes required, but it also increases the complexity of the device. Our goal is not to provide the user with a mobile keyboard, so we will assume there are about five buttons available, which is what an i>Clicker has and thus is readily available in many classrooms.

Buttons on an i>Clicker can detect only a “press down” event.
It makes no difference if you keep a button pressed or let it go, so actions such as dragging, which is usually performed by clicking, moving, and then releasing, have to be done instead by pressing, moving, and then pressing again (i.e., we are restricted to point-and-click rather than drag-and-drop). This simple “button down” behavior is the absolute minimum functionality of a button, and is the way that many clicker systems work (including i>Clicker).

In our design the camera is part of a dedicated device for tracking. While it is possible that a device such as a cellphone can be used for tracking with the techniques described in earlier chapters, it is not optimal because user accuracy may suffer: the physical form factor of a phone makes pointing unintuitive. In general, it is best to make the interface design independent of the physical properties of the pointing device. As the research progresses, the tracking device could become more sophisticated and versatile; however, this should not be relied on in the high-level design. The simple approach taken in our prototype is sufficient to support a rich set of interactions.

The two pieces of hardware that we require are available today. The i>Clicker fully meets our requirements for the button device. Low cost cameras, configured as in our prototype, provide the tracker device. While these are not currently available as commercial products, the supporting technology is available at commodity prices. This means that in a practical sense all of the equipment is accessible for use in the classroom and ready for commercialization.
No additional hardware is required beyond what is already in place in most classrooms: an instructor’s laptop computer (or a built-in classroom computer), WiFi connectivity, and one or more display systems connected to the computer.

5.3 Styles of Interaction

Interactions happen mainly through direct pointing, but the meaning of the pointing gesture changes depending on what buttons are pressed. The interface should not require direct access to the computer’s resources, such as keyboard, mouse or personal screen, unless it is an unavoidable exception such as connecting to the WiFi network or the classroom displays, or enabling the chiroptic tracker support software on the computer.

Interactions are meant to be sporadic. The design of the interface does not require continuous input from the user for prolonged periods of time. The chiroptic device is not the best way to carry out prolonged interactions, even in the classroom, because that is not what it is designed for. For our prototype, this explains why the grid occluding the contents on the display screen is not a problem: the grid serves as an indication that an interaction is underway and goes away immediately when the interaction is finished, so if interactions are carried out quickly and sporadically, occlusion will not be a significant distraction and the lecture can continue once the interaction is complete.

Usually lecturers will be standing and positioned in front or to the side of the main wall display, with their backs to the display and looking at the audience. A lecturer will want to avoid blocking students’ view of the display, so the angle of pointing will usually be considerably skewed.
Lecturers will have some room to move around and may instinctively move to a more advantageous position for pointing when it is required.

Students in the audience might have secondary displays in front of them, such as computer screens or paper-based notebooks, but their attention will often be mainly focused on the lecturer and the main wall display.

5.4 General Recommendations

We make four recommendations about appropriate design choices that are particularly suited to the type of classroom-based interaction we want to support.

Modes. These allow a user to perform different actions using the same gesture because the gesture is interpreted in the context of the current mode. An illustrating analogy is the tool palette of drawing software such as Photoshop: clicking and dragging the mouse has a different effect based on what tool is selected. Our prototype uses modes. Some modes could transition automatically from others, and some could share functionality. For example “copying” a section of the display and “highlighting” both require the user to “draw” a region first, so those two actions (modes) could become available after the drawing mode.

Try to minimize mode switching. This will involve determining which modes are more common so they can be activated by default, and which ones are natural transitions from other modes, so the user does not have to switch explicitly.

Make buttons perform actions that are similar across modes. This helps the user build a simple mental model because the button mapping is consistent. For example, the button that switches tracking off could also be used to cancel an action, such as the drawing of a shape, or to reset the position of an object that is being dragged.

Be agnostic about other classroom software.
It is acceptable to treat the contents of the display in a special way that is optimized for a classroom interface with direct pointing, but do not tie the interface to a specific software package, like a slide presentation program. Ideally it should work independent of the software responsible for rendering the contents of the display, and the device driver for the tracker should handle its own set of commands and not expect them to be dealt with by plug-ins to other applications. As standard functionality, include a mode that makes the tracker and clicker work like a mouse with emulated left and right-click buttons, so that any software can be used remotely with the tracker. This should not be the default mode of interaction.

5.5 Example of Future In-Classroom Interaction

The recommendations in this chapter were designed in abstract where it was possible, independent of any specific implementation of the chiroptic tracker. In this section we give an example of how an interaction could take place with our current prototype and a 5-button clicker like the one used for the study described in Chapter 4. Our description is only one way in which the interaction can take place; the design space for a classroom interface based on direct pointing is vast. Exploring it more deeply is an interesting topic for future research.

We will consider the following scenario:

James is giving a math lecture discussing the implications of a theorem. His current slide is relevant to the context of the discussion. It includes an equation with multiple parts. A student asks a question and James realizes the answer involves only two isolated parts of the equation, so he decides to highlight them. He turns towards the display and presses button 5 on his clicker to activate the camera tracker he is holding in his right hand. This action overlays a grid of fiducial markers on the display. He points at the first relevant section and presses button 1 on his clicker, which starts drawing the outline of a rectangle at the current cursor position. Then he drags the cursor to enlarge the rectangle until it surrounds the first part that he wants to highlight in the equation, and then presses button 1 again to complete the rectangle. He does the same thing to create another rectangle surrounding the second part of the equation, and then presses button 2, which causes the two rectangles and their content to remain bright while the space outside them becomes darker, achieving a highlighting effect. At this point the grid and cursor disappear, and the camera is no longer active. James turns back to the student to explain the relevant concepts. When the question is resolved he presses button 4 on his clicker, which removes the highlights and returns to the original state of the slide, and he continues his lecture.

In this example a few things stand out. First, the interaction was not planned; it was improvised based on the dynamic requirements of the classroom. Second, the tracker and clicker only come into play during the actions taken by the lecturer to highlight regions of the screen. The rest of the time they are out of the way and the lecture proceeds as normal. Third, it is possible to carry out this example with our technique, and the resources to do so are available to an instructor now.

Designing an effective classroom interface that is based around direct pointing is a task that requires iteration and experimentation. We hope the discussion in this chapter provides enough insights to begin the next steps in the process.
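The button sequence in James’s scenario can be sketched as a small state machine driven only by “press down” events. This is an illustrative sketch, not the prototype’s code: the class name, the cursor tuples, and the choice to treat button 5 as a toggle that also cancels a rectangle in progress are assumptions made here, loosely following the recommendation that the tracking button can double as a cancel button.

```python
class HighlightSession:
    """Hypothetical press-only controller for the highlighting scenario."""

    def __init__(self):
        self.tracking = False     # grid and cursor visible, camera active
        self.anchor = None        # first corner of a rectangle in progress
        self.rectangles = []      # completed (corner, corner) pairs
        self.highlighted = False  # regions outside the rectangles are dimmed

    def press(self, button, cursor=None):
        """Handle one button press; cursor is the tracked (x, y) position."""
        if button == 5:
            # Clutch: switch tracking on or off, cancelling any open rectangle.
            self.tracking = not self.tracking
            self.anchor = None
        elif button == 1 and self.tracking:
            # Press-move-press replaces drag-and-drop on a press-only clicker.
            if self.anchor is None:
                self.anchor = cursor
            else:
                self.rectangles.append((self.anchor, cursor))
                self.anchor = None
        elif button == 2 and self.tracking and self.rectangles:
            # Apply the highlight; grid and cursor disappear.
            self.highlighted = True
            self.tracking = False
            self.anchor = None
        elif button == 4:
            # Remove highlights and return to the original slide.
            self.rectangles.clear()
            self.highlighted = False


# Replaying the scenario: activate, draw two rectangles, highlight.
session = HighlightSession()
session.press(5)
session.press(1, (100, 50))
session.press(1, (220, 90))
session.press(1, (300, 50))
session.press(1, (380, 90))
session.press(2)
```

After this sequence the session holds two rectangles, the highlight is active, and tracking is off; a later press of button 4 clears both. Keeping each button’s meaning stable across states is what keeps the table of transitions this small.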
Ultimately, the camera-based pointer and the interaction techniques that use it will be only a part of a larger multi-user interface for all the people in the classroom, an interface that might resemble an orchestra of smart instruments playing in coordination to the mutual benefit of the room’s occupants.

Chapter 6

Conclusions and Future Work

Technological resources in modern classrooms are not being used to their full potential. They could be leveraged to create a richer interface for the benefit of both lecturers and students. A key component to enable such an interface is a pointing device that allows direct manipulation of content on the screen without tethering users to the computer and ideally without introducing more hardware into the classroom.

We have explained how such a device can be created by using a pattern of structured shapes on the display to encode position information, paired with a hand-held camera that when pointed at the screen interprets the pattern and determines a position to draw the cursor. In addition, we provide an implementation of these ideas running reliably at sampling rates that are high enough to provide a smooth experience for users of the pointing device. Our implementation is still a proof of concept with many possible improvements, including the use of better vision algorithms for feature extraction that are more robust to varying lighting conditions and reduced color contrast. There is a vast pool of knowledge in computer vision that we have not fully explored, so it is possible that a completely different approach would work better, for example by tracking feature points of the display itself.

It would also be interesting to explore the creation of a closed-loop version of the device, where the shapes on the display are moved around depending on the user’s location, to both minimize clutter on the screen and ensure position information can always be obtained.
As long as we track four distinct points, we should be able to estimate a six-degree-of-freedom position for the camera. The closed-loop tracker could be stabilized using Kalman filters, PID controllers or particle filters, all of which have been used previously for this kind of problem. Stabilizing a closed-loop solution would allow us to do more interesting things with the tracker, like dynamically adjusting the markers’ positions and dimensions to make them optimal for both the camera and the audience.

The results of a controlled study comparing the camera pointing device to a mouse were reported, and they show that the mouse performs about twice as fast as the camera, while error rates are comparable between the devices. Our belief is that this is a good trade-off for classroom users, who often interact only infrequently with the display and will enjoy the increased usability of direct pointing. In the future we would like to work with lecturers who are willing to help us test the device in an actual classroom setting, to gather field data that will be very valuable in creating a better classroom interface. We would also like to run further studies where control to display ratio values are adjusted at different distances from the display, to measure the effects on device reliability and precision.

Many of our study participants mentioned the lag in the device was noticeable, but not problematic. Understanding the sources and effects of lag in the system would give us a better sense of how well the device can work with faster hardware. Our conjecture is that reduced system latency would increase user throughput.
We would like to run a second study that looks into the effect of lag on user performance with the chiroptic tracker, which would allow us to estimate how much improvement in performance could be expected from reduced lag.

We presented ideas for the creation of an interface that makes use of a direct pointing device and a 5-button clicker to encourage the discussion of what a classroom interface should be like. Future work will implement these techniques and test their effectiveness with real users.

The prototype implementation for the chiroptic tracker that was presented here exceeded some of our initial expectations. It is possible that it can be used effectively in settings other than the one we designed it for. We would like to explore the physical design further to create a technology that is flexible enough to be used with a camera phone (which, while not the optimal form factor, might be a convenient way to have a chiroptic tracker readily available), but that can also be embedded into clothing or other wearables. This flexibility would enable the creation of more sophisticated interfaces with interactions that are similar to those found in multi-touch interfaces. In any case, more effort should be put into the design of a chiroptic tracker as a dedicated device, one that grants lecturers a robust set of interactions with which to create a rich classroom experience.

Bibliography

[1] Vivek Agarwal, Besma R. Abidi, Andreas Koschan, and Mongi A. Abidi. An overview of color constancy algorithms. Journal of Pattern Recognition Research, 1(1):42–54, 2006.
[2] Karl Johan Åström and Richard M. Murray. Feedback systems: an introduction for scientists and engineers. Princeton University Press, 2010.
[3] Matthias Baldauf, Peter Fröhlich, and Katrin Lasinger. A scalable framework for markerless camera-based smartphone interaction with large public displays.
In Proceedings of the 2012 International Symposium on Pervasive Displays, PerDis ’12, pages 4:1–4:5, New York, NY, USA, 2012. ACM.
[4] Rafael Ballagas, Jan Borchers, Michael Rohs, and Jennifer G. Sheridan. The smart phone: a ubiquitous input device. Pervasive Computing, IEEE, 5(1):70–77, 2006.
[5] Rafael Ballagas, Michael Rohs, and Jennifer G. Sheridan. Sweep and point and shoot: Phonecam-based interactions for large public displays. In CHI ’05 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’05, pages 1200–1203, New York, NY, USA, 2005. ACM.
[6] Peter Beshai. Implementation and evaluation of a classroom synchronous participation system. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2014.
[7] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[8] Bill Buxton. Some milestones in computer input devices: an informal timeline. Accessed August 12, 2015.
[9] Duncan Cavens, Florian Vogt, Sidney Fels, and Michael Meitner. Interacting with the big screen: Pointers to ponder. In CHI ’02 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’02, pages 678–679, New York, NY, USA, 2002. ACM.
[10] Cesare Celozzi, Gianluca Paravati, Andrea Sanna, and Fabrizio Lamberti. A 6-dof ARTag-based tracking system. Consumer Electronics, IEEE Transactions on, 56(1):203–210, 2010.
[11] Matthias Deller and Achim Ebert. ModControl - mobile phones as a versatile interaction device for large screen applications. In Proceedings of the 13th IFIP TC 13 International Conference on Human-computer Interaction, Part II, INTERACT ’11, pages 289–296, Berlin, Heidelberg, 2011. Springer-Verlag.
[12] Mark Fiala. ARTag, a fiducial marker system using digital techniques. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
[13] Paul M Fitts. The information capacity of the human motor system in controlling the amplitude of movement.
Journal of experimental psychology, 47(6):381–391, 1954.
[14] David Fofi, Tadeusz Sliwa, and Yvon Voisin. A comparative survey on invisible structured light. In Proc. SPIE, volume 5303, pages 90–98. Machine Vision Applications in Industrial Inspection XII, 2004.
[15] David A. Forsyth and Jean Ponce. Computer vision: a modern approach. Prentice Hall, 2003.
[16] Orazio Gallo, Sonia M. Arteaga, and James E. Davis. Camera-based pointing interface for mobile devices. In 15th IEEE International Conference on Image Processing, ICIP ’08, pages 1420–1423. IEEE, 2008.
[17] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Computational color constancy: Survey and experiments. IEEE Transactions on Image Processing, 20(9):2475–2489, 2011.
[18] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, second edition, 2003.
[19] NaturalPoint Inc. OptiTrack. Accessed August 17, 2015.
[20] Seokhee Jeon, Jane Hwang, Gerard J. Kim, and Mark Billinghurst. Interaction techniques in large display environments using hand-held devices. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST ’06, pages 100–103, New York, NY, USA, 2006. ACM.
[21] Hao Jiang, Eyal Ofek, Neema Moraveji, and Yuanchun Shi. Direct Pointer: Direct manipulation for large-display interaction using hand-held cameras. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, pages 1107–1110, New York, NY, USA, 2006. ACM.
[22] Carsten Kirstein and Heinrich Mueller. Interaction with a projection screen using a camera-tracked laser pointer. In Proceedings of Multimedia Modeling, MMM ’98, pages 191–192. IEEE, 1998.
[23] Zhangbo Liu. LACOME: a cross-platform multi-user collaboration system for a shared large display. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2007.
[24] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proc.
7th Intl. Joint Conf. on Artificial Intelligence, volume 81 of IJCAI, pages 674–679, 1981.
[25] I. Scott MacKenzie. Movement time prediction in human-computer interfaces. In Ronald M. Baecker, Jonathan Grudin, William A. S. Buxton, and Saul Greenberg, editors, Human-computer Interaction, pages 483–492. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995.
[26] I. Scott MacKenzie and Shaidah Jusoh. An evaluation of two input devices for remote pointing. In Murray R. Little and Laurence Nigay, editors, Engineering for Human-Computer Interaction, volume 2254 of Lecture Notes in Computer Science, pages 235–250. Springer Berlin Heidelberg, 2001.
[27] I. Scott MacKenzie and Colin Ware. Lag as a determinant of human performance in interactive systems. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems, CHI ’93, pages 488–493, New York, NY, USA, 1993. ACM.
[28] Russell MacKenzie. LACOME: Early evaluation and further development of a multi-user collaboration system for shared large displays. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2010.
[29] Anil Madhavapeddy, David Scott, Richard Sharp, and Eben Upton. Using camera-phones to enhance human-computer interaction. In Sixth International Conference on Ubiquitous Computing (Adjunct Proceedings: Demos), 2004.
[30] Kento Miyaoku, Suguru Higashino, and Yoshinobu Tonomura. C-blink: A hue-difference-based light signal marker for large screen interaction via any mobile terminal. In Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology, UIST ’04, pages 147–156, New York, NY, USA, 2004. ACM.
[31] Orkhan Muradov. Feasibility of supporting pointing on large wall displays using off-the-shelf consumer-grade tracking equipment. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2013.
[32] Brad A. Myers, Rishi Bhatnagar, Jeffrey Nichols, Choon Hong Peck, Dave Kong, Robert Miller, and A. Chris Long.
Interacting at a distance: Measuring the performance of laser pointers and other devices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’02, pages 33–40, New York, NY, USA, 2002. ACM.
[33] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2011.
[34] Donald A. Norman. The psychology of everyday things. Basic Books, 1988.
[35] Katsuhiko Ogata. Modern Control Engineering. Prentice Hall, fifth edition, 2010.
[36] Dan R. Olsen Jr. and Travis Nielsen. Laser pointer interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’01, pages 17–22, New York, NY, USA, 2001. ACM.
[37] James G. Phillips and Thomas J. Triggs. Characteristics of cursor trajectories controlled by the computer mouse. Ergonomics, 44(5):527–536, 2001.
[38] Barry A. Po, Brian D. Fisher, and Kellogg S. Booth. Comparing cursor orientations for mouse, pointer, and pen interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’05, pages 291–300, New York, NY, USA, 2005. ACM.
[39] Vasanth Kumar Rajendran. Interaction with large stereoscopic displays: Fitts and multiple object tracking studies for virtual reality. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2012.
[40] Michael Rohs. Real-world interaction with camera phones. In Proceedings of the Second International Conference on Ubiquitous Computing Systems, UCS ’04, pages 74–89, Berlin, Heidelberg, 2005. Springer-Verlag.
[41] Joaquim Salvi, Jordi Pagès, and Joan Batlle. Pattern codification strategies in structured light systems.
Pattern Recognition, 37(4):827–849, 2004.
[42] Toby Sharp, Cem Keskin, Duncan Robertson, Jonathan Taylor, Jamie Shotton, David Kim, Christoph Rehmann, Ido Leichter, Alon Vinnikov, Yichen Wei, Daniel Freedman, Pushmeet Kohli, Eyal Krupka, Andrew Fitzgibbon, and Shahram Izadi. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, pages 3633–3642. ACM, 2015.
[43] Junhao Shi. Improve classroom interaction and collaboration using i>Clicker. Master’s thesis, University of British Columbia, Vancouver, BC, Canada, 2013.
[44] Garth Shoemaker, Takayuki Tsukitani, Yoshifumi Kitamura, and Kellogg S. Booth. Two-part models capture the impact of gain on pointing performance. ACM Transactions on Computer-Human Interaction (TOCHI), 19(4):28:1–28:34, December 2012.
[45] R. William Soukoreff and I. Scott MacKenzie. Towards a standard for pointing device evaluation, perspectives on 27 years of Fitts’ Law research in HCI. International Journal of Human-Computer Studies, 61(6):751–789, December 2004.
[46] Ivan E. Sutherland. Sketchpad: a man-machine graphical communication system. Technical Report 574, University of Cambridge, Computer Laboratory, September 2003.
[47] Vicon Motion Systems. Vicon. Accessed August 17, 2015.
[48] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
[49] Florian Vogt, Justin Wong, Sidney Fels, and Duncan Cavens. Tracking multiple laser pointers for large screen interaction. In Extended Abstracts of ACM UIST, pages 95–96, 2003.
[50] Colin Ware. Information Visualization: Perception for Design. Elsevier, 2012.
[51] Alan Traviss Welford. Fundamentals of skill. Methuen, 1968.

Appendix A
Experimental Resources

The following are the documents given to participants of the pointing study described in Chapter 4.

A.1 Consent Form

Participants were asked to read and sign this consent form.

A.2 Initial Questionnaire

This questionnaire was given to participants at the beginning of the study, to gather basic demographics and check they met the study’s requirements.

A.3 Final Questionnaire

This questionnaire was given to participants at the end of the study, and they were instructed to report only on their experience with the camera, to gather subjective and qualitative data.

