
UBC Theses and Dissertations

Automatic identification and description of software developers tasks Satterfield, Christopher 2020


Full Text


Automatic Identification and Description of Software Developers Tasks

by

Christopher Satterfield

B.Sc., University of British Columbia, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

April 2020

© Christopher Satterfield, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Automatic Identification and Description of Software Developers Tasks

submitted by Christopher Satterfield in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Gail Murphy, Computer Science (Supervisor)
Reid Holmes, Computer Science (Supervisory Committee Member)

Abstract

A software developer works on many tasks per day, frequently switching back and forth between their tasks. This constant churn of tasks makes it difficult for a developer to know the specifics of what tasks they worked on, and when they worked on them. Consequently, activities such as task resumption, planning, retrospection, and reporting become complicated. To help a developer determine which tasks they worked on and when these tasks were performed, we introduce two novel approaches. First, an approach that captures the contents of a developer's active window at regular intervals to create vector and visual representations of the work in a particular time interval. Second, an approach that automatically detects the times at which developers switch tasks, as well as coarse-grained information about the type of the task. To evaluate the first approach, we created a data set with multiple developers working on the same set of six information seeking tasks. To evaluate the second approach, we conducted two field studies, collecting data from a total of 25 professional developers.
Our analyses show that our approaches enable: 1) segments of a developer's work to be automatically associated with a task from a known set of tasks with average accuracy of 70.6%, 2) a visual representation of a segment of work performed such that a developer can recognize the task with average accuracy of 67.9%, 3) the boundaries of a developer's task to be detected with an accuracy as high as 84%, and 4) the coarse-grained type of a task that a developer works on to be detected with 61% accuracy.

Lay Summary

A software developer works on many tasks per day, frequently switching among these tasks. This constant churn of tasks makes it difficult for a developer to know the specifics of how they completed their tasks. To help a developer determine these specifics, we introduce two novel approaches. First, an approach that captures the contents of a developer's active window at regular intervals to create representations of the work contained within a task. Second, an approach that automatically detects the times at which developers switch tasks, as well as information about the kind of task being worked on. To evaluate the first approach, we created a data set with multiple developers working on the same set of six information seeking tasks. To evaluate the second approach, we conducted two field studies, collecting data from a total of 25 professional developers.

Preface

All of the work presented in this thesis was conducted in the Software Practices Lab at the University of British Columbia and the Software Evolution & Architecture Lab at the University of Zurich. All projects and associated methods were approved by the University of British Columbia's Research Ethics Board [certificate #H12-03701].

A version of Chapter 3 has been submitted to be published: [Chris Satterfield, Thomas Fritz, Gail C. Murphy. Identifying and Describing a Software Developer's Tasks].
I was the lead investigator for this work, responsible for concept formation, data collection, analysis, and manuscript composition.

A version of Chapter 4 has been accepted for publication in the IEEE Transactions on Software Engineering journal: [André N. Meyer, Chris Satterfield, Manuela Züger, Katja Kevic, Gail C. Murphy, Thomas Zimmermann, Thomas Fritz. Detecting Developer's Task Switches and Types]. The field studies in this chapter were designed and conducted by André N. Meyer and Manuela Züger. I assisted with the development and analysis of our machine learning approaches, as well as with manuscript composition.

Thomas Fritz and Gail C. Murphy were supervisory authors on this project and were involved throughout the project in concept formation and manuscript composition.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
    2.1 Determining Intent
        2.1.1 Intent from Interactions
        2.1.2 Intent from Documents
        2.1.3 Intent from a Combination of Interactions and Documents
        2.1.4 Finding Meaning in Artifacts
    2.2 Detecting Task Switches
3 Identifying and Describing Tasks
    3.1 Data Set Creation
        3.1.1 Developers
        3.1.2 Session
    3.2 Data Collected
        3.2.1 Data Annotation
    3.3 Generating Task Representations
        3.3.1 Screenshot Preprocessing
        3.3.2 Extracting Bags of Words
        3.3.3 Generating Task Representations
    3.4 RQ1: Associating Descriptions
        3.4.1 Gathering Task Descriptions
        3.4.2 Evaluation
        3.4.3 Results
    3.5 RQ2: Assigning Tags
        3.5.1 Survey
        3.5.2 Results
    3.6 Discussion
        3.6.1 Threats to Validity
        3.6.2 Applying the Approach
        3.6.3 Artifact Access
        3.6.4 Creating Vectors from Window Titles
4 Detecting Task Switches and Their Types
    4.1 Study Design
        4.1.1 Study 1 – Observations
        4.1.2 Study 2 – Self-Reports
    4.2 Data and Analysis
        4.2.1 Collected Data
        4.2.2 Time Window Segmentation
        4.2.3 Task Switch Features Extracted
        4.2.4 Task Type Features
        4.2.5 Outcome Measures
        4.2.6 Machine Learning Approach
    4.3 Results: Detecting Task Switches
        4.3.1 Descriptive Statistics of the Dataset
        4.3.2 Task Switch Detection Error Distance and Accuracy
        4.3.3 Task Switch Feature Evaluation
    4.4 Results: Detecting Task Types
        4.4.1 Identified Task Type Categories
        4.4.2 Descriptive Statistics of the Dataset
        4.4.3 Task Type Detection Accuracy
        4.4.4 Task Type Feature Evaluation
    4.5 Discussion
        4.5.1 Improving Task Switch Detection
        4.5.2 Improving Task Type Detection
        4.5.3 Reducing the Prediction Delay
        4.5.4 Applications for Automated Task Detection
    4.6 Threats to Validity
5 Discussion
    5.1 Improving Task Identification
    5.2 Improving Task Switch Detection
    5.3 Fully Automatic Task Support
6 Conclusion
Bibliography
A Laboratory Data Collection
    A.1 List of Tasks Performed By Participants
    A.2 Additional Figures

List of Tables

Table 3.1  Overview of controlled lab tasks.
Table 3.2  Full task description for the App Market Research task (PrMR) as presented to developers.
Table 3.3  Examples of descriptions received for the Viz Library Selection (Viz) task, together with one of the expert's ratings.
Table 3.4  Results of mapping task representations to task descriptions written by MT workers.
Table 4.1  Self-reports for Study 2.
Table 4.2  Features analyzed in our study and their importance for predicting task switches and task types.
Table 4.3  Overview of the performance of detecting task switches, for both individual and general models.
Table 4.4  Overview and descriptions of the task type categories, the average time developers spent on each task type per hour of work, and the performance of our task type detection approach, for both individual and general models.

List of Figures

Figure 3.1  An example of a developer working on several tasks over time, revisiting task 3 in two task segments.
Figure 3.2  Process followed for the generation of task representations (the light blue boxes represent task segments).
Figure 3.3  Word Clouds (WCs) generated for task segments from different tasks and developers.
Figure 3.4  Accuracy comparison for the 6 different combinations of techniques used to generate task representations (TK = tokenization).
Figure 3.5  Accuracy for mapping task representations to the correct task descriptions by lab developer. The dotted line represents the accuracy for a random classifier.
Figure 3.6  Accuracy for mapping task representations to the correct task descriptions by MT respondent. The dotted line represents the accuracy for a random classifier.
Figure 3.7  Accuracy for identifying the task based on our generated word clouds. The dotted red line indicates the accuracy for a random classifier.
Figure 3.8  Comparison of performance per task with window titles as the data source vs screen content. For both approaches, results are calculated across the 12 developers with window title data available.
Figure 4.1  Overview of the study design and outcomes.
Figure 4.2  Screenshot of the second page of the experience sampling pop-up that asked participants to self-report their task types.
Figure 4.3  Cross-validation approach for the individual models, leaving a gap of 10 samples before and after the test set to account for the dependence of samples in close temporal proximity.
Figure 4.4  Confusion matrix for task type prediction.
Figure A.1  An example screenshot captured with our tool of a developer's inbox at the start of a data collection session.

Acknowledgments

I would like to thank Gail Murphy and Thomas Fritz for fuelling and at times reining in my intellectual curiosity. This work would not have been possible without their guidance and support.

I would also like to thank my co-authors on the TSE paper which makes up the content of Chapter 4, and in particular André N. Meyer for their invaluable contributions.

Last but not least, I would like to thank my parents and my partner Ashley Wong for encouraging and supporting me throughout this project.

Chapter 1

Introduction

Software developers work on many tasks in a day, switching between them constantly [22, 44].
This constant switching, and the variety and high number of tasks, make it difficult for developers to keep track of which task they worked on when. Yet, developers can benefit from this information in several ways. For instance, knowing when and how long a task was worked on can help in the tracking of time spent on tasks, aiding planning, retrospection and reporting activities (e.g., [42]). As another example, knowing what information is accessed as part of the task can help to recall what information is needed when a task is resumed [29, 56, 58].

Some developers, as they work, manually track and note which information they access while performing a task as a form of externalization of the working state of a task [58]. This saved information can help a developer resume the same task later or can serve as a means of knowing which task was worked on when. This manual approach is time consuming and requires substantial effort from the developer. The Mylyn tool seeks to reduce this burden by enabling task descriptions in a development environment to be activated with a click of a button, after which the tool can automatically track the information a developer accesses and can associate it with the task [29]. When a developer returns to a task and reactivates the description, Mylyn can re-populate the environment with the information previously accessed as part of the task. However, Mylyn still requires the developer to manually write task descriptions, and developers must remember to activate and deactivate the task when work begins or stops on the task.

In this thesis, we explore how to move towards a fully automatic solution to supporting developers in managing the contexts of their tasks.
To this end, we consider four research questions:

RQ1: Can we automatically associate descriptions of developer's tasks with information they access as they work?

RQ2: Can we automatically assign tags to information accessed by a developer to describe the task being performed?

RQ3: Can we automatically detect the times at which a developer switches between tasks?

RQ4: Can we automatically determine the types of the tasks that a developer works on?

Approaches which can help to address these research questions would make it easier for developers to find and resume previous work (RQ1, RQ2, RQ4), enable us to tailor support systems depending on the type of task being performed (RQ4), and contribute to the improvement of existing task support systems by relaxing constraints (RQ1, RQ2, RQ3).

To explore these research questions, we developed and evaluated two approaches in separate studies. We first consider RQ1 and RQ2, developing an approach to generate vector and visual representations of a task using information extracted from a developer's screen content. In developing this approach, we made the assumption that the times at which a developer switches tasks were known, in order to simplify an otherwise intractable problem. We next considered RQ3 and RQ4, developing an approach to automatically detect the task switches of a developer, and to detect the type of task a developer works on. In comparison to the first approach we developed, this approach uses less data and a less invasive approach to data collection, but produces a more coarse-grained description of the content of a task.

We begin by exploring how to support developers by automatically identifying the tasks they work on.
To this end, we developed an approach that continuously captures the screen of a developer and retrospectively generates, based on given task switch information, representations of the tasks on which a developer worked. These representations can be used to help determine and describe the tasks performed during particular time periods. Specifically, our approach utilizes optical character recognition (OCR) to extract information directly from the developer's screen content. Using OCR allows us to capture only the relevant sections of the resources a user is viewing, namely the parts that the developer can see and interact with. From the OCR output, we apply natural language processing and information retrieval techniques to generate a vector representation of the task on which a developer worked. We experiment with a variety of techniques for generating vector representations of tasks as part of our investigations. Using the same techniques, we also generate a visual representation in the form of a word cloud that highlights the most relevant words that describe the task. Word clouds have been shown in previous work to be useful for aiding users in determining the topic of a document [23]. An advantage of our approach is that it is agnostic to the applications a developer is using to perform the work.

In order to evaluate the efficacy of our approach to vector and visual representation generation, we applied these techniques to a data set we generated in a controlled lab setting with 17 developers. We then conducted two separate surveys, one in which we asked respondents to summarize the work which the developers performed in the lab, and one in which we asked respondents to match our word cloud representations to the task from the controlled lab setting which generated them.
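As a rough illustration of this style of pipeline (a minimal sketch, not the thesis's actual implementation; the tokenization, TF-IDF weighting, and sample OCR text below are simplified stand-ins), one can build a weighted term vector per task segment from already-extracted screen text and take the top-weighted terms as word-cloud candidates:

```python
# Sketch: TF-IDF vectors over OCR'd task segments, with top-weighted
# terms as word-cloud candidates. Illustrative only; the segments and
# the whitespace tokenizer are hypothetical stand-ins.
import math
from collections import Counter

def tfidf_vectors(segments):
    """segments: list of OCR'd text blobs, one per task segment."""
    docs = [s.lower().split() for s in segments]
    n = len(docs)
    # document frequency: in how many segments does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def top_terms(vector, k=3):
    """Highest-weighted terms, e.g. to seed a word cloud."""
    return [t for t, _ in sorted(vector.items(), key=lambda kv: -kv[1])[:k]]

segments = [
    "chart library d3 svg chart render",        # visualization-flavored segment
    "android market app store reviews app",     # market-research-flavored segment
]
vecs = tfidf_vectors(segments)
print(top_terms(vecs[0]))
```

Terms that recur within one segment but rarely appear in others receive the highest weights, which is what makes them plausible candidates for both the vector representation and a recognizable word cloud.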
Across all task segments in our data set, we found we were able to associate task descriptions provided by our respondents correctly in 70.6% of cases, and that respondents were able to match a generated word cloud to the task segment correctly in 67.9% of cases.

A key assumption we made in the development of this approach is that the times at which developers switch tasks are known. We made this decision in order to give our approach the best possible scenario for successful performance, ensuring each task segment contains only information relevant to that task. But to move towards a fully automatic solution for supporting developers' tasks, it is imperative that we are able to detect task switches automatically. Previous works have attempted to automatically detect task switches (e.g., [30, 48, 55, 74, 75]). However, the evaluations performed of these techniques have been limited and the results have been poor in terms of prediction accuracy. In addition, many of these approaches that are specific to the software engineering domain focus on detecting switches within the IDE, meaning only development related work can be captured. Such work makes up a small portion of the actual work developers perform in a day [43].

To improve the practical applicability of our approach to addressing RQ1 and RQ2, we sought to improve upon the results of these previous approaches to task switch detection by developing our own machine learning approach. We also considered the problem of classifying the type of task that a developer is working on (e.g., development, awareness, planning, etc.). Such an approach to understanding the context of a task could help us to better support developers as they work, as well as enabling detailed retrospective time tracking. Using features extracted from developers' computer interaction information, we explored various machine learning methods to develop an approach for automatic task switch and type detection.
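To give a concrete flavor of what features extracted from computer interaction can look like, the following hypothetical sketch buckets interaction events into fixed-length time windows and counts event kinds per window; the event names and the 60-second window length are illustrative assumptions, not the feature set actually used in the studies:

```python
# Sketch: turning a stream of (timestamp, kind) interaction events into
# per-time-window count features that a classifier could consume.
# Event kinds and WINDOW_SECONDS are illustrative assumptions.
from collections import Counter

WINDOW_SECONDS = 60

def windowed_features(events):
    """events: list of (timestamp_seconds, kind) tuples,
    kind in {'window_switch', 'keystroke', 'mouse_click'}."""
    features = {}
    for ts, kind in events:
        bucket = int(ts // WINDOW_SECONDS)
        features.setdefault(bucket, Counter())[kind] += 1
    # one (window, switches, keystrokes, clicks) tuple per time window
    return [(b, f['window_switch'], f['keystroke'], f['mouse_click'])
            for b, f in sorted(features.items())]

events = [(3, 'keystroke'), (10, 'window_switch'), (62, 'window_switch'),
          (65, 'window_switch'), (70, 'mouse_click')]
print(windowed_features(events))
```

A spike in window switches within one bucket relative to its neighbors is the kind of temporal signal a task-switch classifier can pick up on.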
We evaluated this approach in two separate field studies, achieving an overall accuracy of task switch prediction of 84% and 61% accuracy for type detection.

Overall, our results show promise for being able to further automate task support for software developers performing information seeking tasks. This thesis makes five contributions:

• An application-agnostic approach to associate descriptions of information seeking software development tasks with periods of time in which a developer worked on the task, including an evaluation of the efficacy of the approach on a data set from a controlled lab setting.

• The evaluation of a variety of techniques for generating vectors describing information accessed as a developer works for the purpose of associating the information with tasks.

• An evaluation of the use of word clouds to describe work performed in segments of time that enables developers to recognize work belonging to a particular task.

• An approach to automatically detect task switches and types based on developers' computer interaction that is not limited to the IDE.

• An evaluation of our approach to task switch and type detection on data gathered from a field study with 25 professional developers.

We begin by comparing the problems we tackle and the approaches we investigate to previous work in this area for software development in Chapter 2. We then describe the data collection, representation generation, and evaluation of our approach for generating task representations in Chapter 3. Next we present our approach to task switch and type detection and their evaluation in Chapter 4. We discuss how these approaches complement each other and directions for future research in Chapter 5.
Finally, Chapter 6 concludes the thesis.

Chapter 2

Related Work

Work related to our research can roughly be categorized into work that focuses (a) on determining the intent of developers' actions based on their computer interactions and the documents and artifacts they access or produce, and (b) on detecting developers' task switches.

2.1 Determining Intent

The research questions we pose require some level of understanding of the intent of a developer in undertaking particular work. Determining the intent of a developer is a growing area of research. The more we know about a developer's intention, such as the task she is working on, the better we can support the developer, for example by providing better code recommendations (e.g., [16, 30]). Approaches have been developed to determine intent from how a developer interacts with the computer, from the documents produced by a developer, and from a mix of both. We describe approaches in each of these categories and also describe related work in finding meaning in artifacts.

2.1.1 Intent from Interactions

For some research systems, intent is specified through specific interactions a developer takes within the environment in which they work. In the Mylyn system, a developer can indicate through an explicit click of the button on which issue they are currently working: the text in an issue provides information about the developer's intent [29]. In the Jasper system, a developer can create special working areas of their development environment into which fragments of work can be placed for later recall [11]. The approach we consider in this thesis relieves the developer from a priori indicating work on a specific task.

Other researchers have attempted to determine automatically the higher-level activities developers perform based on their interaction with the computer. For example, Mirza et al.
used temporal and semantic features based on window interactions and the window titles over 5 minute time windows to predict one of six work activity categories: writing, reading, communicating, system browsing, web browsing, and miscellaneous [50]. In a controlled lab study and field study with 5 participants, they achieved an accuracy of 81%. Koldijk et al. investigated the predictive power of keyboard and mouse input, as well as application switches and the time of day, for predicting a larger set of 12 high-level task types—such as reading email, programming, creating a visualization—for a given 5 minute period of time [33]. Using classifiers trained on an individual basis, they were able to achieve up to 80% accuracy. However, they found that a classifier trained on one person is highly individual to that person and does not generalize well to other people. In an approach more specific to software developers, Bao et al. explore the use of conditional random fields (CRFs) to predict one of six development activities: coding, debugging, testing, navigation, search, or documentation [5]. Applying their approach to data collected from 10 software developers over a week, the authors found they were able to classify an activity with an accuracy of 73%. The results of Bao et al. point to the difficulty of determining at a fine granularity what a developer is working on at a specific moment. In our work, with our first approach we aim to determine the content of a developer's task rather than the kind of activity being undertaken. With our second approach, we also attempt to improve on the results of the previous authors in determining the coarse-grained type of the task that a developer is working on, using less data and a less invasive approach to data collection than the one used in our first approach.

2.1.2 Intent from Documents

Researchers have also looked into the extraction of intent from natural language documents associated with software development.
Early on, researchers tried to detect the coarse intent of sentences in emails and tried to summarize them, for example to add them to a to-do list (e.g., [14]). Di Sorbo et al. introduced the concept of intention mining in the context of emails in software development [79]. They used a Natural Language Processing (NLP) approach to classify the content of development emails according to the purpose of the emails, such as feature request or information seeking. The researchers defined six categories that describe the intent of a developer's sentence and reported a 90% precision and 70% recall for their approach in the context of email intent classification. Huang et al. attempted to generalize the approach of Di Sorbo et al. to developer discussions in other mediums, for example those contained in issue reports [25]. They found that the NLP patterns used did not adapt well to other mediums, achieving an accuracy of only 0.31. By refining the taxonomy of intentions defined by Di Sorbo et al., and applying a convolutional neural network (CNN) based approach, the authors were able to improve on the results of the original paper by 171%. These approaches aim to classify what the content of a document is attempting to state, as compared to our approach in this thesis, which aims to determine what the developer is attempting to do.

2.1.3 Intent from a Combination of Interactions and Documents

Shen et al. [75] use a combination of information about how a user interacts with windows on their screen and email messages the user handles in their TaskPredictor system. Using supervised machine learning, they predict on which task a user is working. However, this technique requires the user to pre-define the tasks on which they work so that they can be predicted, and the classifier needs to be trained on some of the user's data beforehand.
Our approach differs in assessing methods for representing the work being performed based on information that a developer works on, gathered through screen scraping; these representations can be used for predicting which of a known set of tasks the work represents, and for generating visual representations of the work that a developer can recognize irrespective of having a set of known tasks.

2.1.4 Finding Meaning in Artifacts

The content of artifacts created as part of, or about, software development contains significant meaning. Software engineering researchers have developed techniques to find particular meaning in artifacts that have similar characteristics to the approach we develop in this thesis. For example, Ponzanelli et al. present CodeTube, an approach that mines video tutorials from the web to enable developers to query the contents of the tutorial to retrieve relevant fragments [62]. The authors used OCR and speech recognition in order to extract text from the videos and evaluate the relevancy of fragments to the user's query. The determination of what a segment of video is about is similar to the problem we tackle of what a segment of a developer's work is about.

2.2 Detecting Task Switches

Several researchers have explored the detection of task switches, mostly for general knowledge workers. These approaches mainly differ in the features they use to identify the task boundaries or switches, ranging from semantic features to temporal features, the method they use, unsupervised versus supervised, and the way they evaluated their approach. One of the most prominent approaches is by Shen et al. [73, 74, 76, 81], which is mainly based on semantic features and supervised learning. They reused an approach, TaskTracer [18], that allows users to manually indicate the tasks they are working on, and additionally tracks their application interactions in the background, including window titles.
Based on the assumption that windows of the same task share common words in their titles, they create vectors from window titles and identify task switches based on a textual similarity measure, using the users' previously declared tasks and supervised learning. After the first version [73], they further improved their approach to reduce the number of false positives and to be able to predict task switches online [74, 76]. Their evaluation is based on a small set of two users and counts a task switch as accurate if it falls within a 4 to 5 minute time window of a real switch, which is a very coarse measure given the frequent task switching in today's environments that happens every few minutes [22, 43].

Based on the assumption that switches between windows of the same task occur more frequently in temporal proximity than to windows of a different task, Oliver et al. [55] examined a temporal feature of window switches within a 5 minute time window, in addition to semantic features, using an unsupervised approach. An evaluation based on 4 hours of a single participant showed a precision of 0.49 and recall of 0.72. Researchers have also used other temporal features, in particular the frequency of window events, to determine task switches. Under the assumption that users navigate between windows more frequently when they switch tasks, as opposed to during a task, Nair et al. [51] developed a system that calculates window event frequency based on fixed 5 minute time windows. An evaluation with 6 participants resulted in an accuracy of 50%. Mirza et al. [48] relaxed the constraint of a fixed time window, used adjusted frequency averages, and studied the various approaches with 10 graduate students. They found that their approach improved the accuracy, achieving an overall accuracy of 58%. Overall, previous research has shown that detecting task switches is difficult, even for short periods of time and in controlled environments.
In our work, we focus on software development work and extend these approaches by including and examining both semantic and temporal features of window events, as well as user input features, and by conducting two studies with professional software developers.

Little research has been performed on task switch detection in the software development domain, and all of this research has focused solely on software development tasks within the IDE. Among the first, Robillard and Murphy [69] proposed to use program navigation logs to infer development tasks; they built a prototype but did not evaluate it. In 2008, Coman and Sillitti [13] focused on splitting development sessions into task-related subsections based on temporal features of developers' access to source code methods and evaluated their approach with 3 participants over 70 minutes each, finding that they can get close to detecting the number of task switches, yet the point in time at which a switch happens is much more difficult to determine. Zou and Godfrey [89] replicated Coman and Sillitti's study in an industrial setting with six professional developers and found that the algorithm detects many more task switches than the ones self-reported by the participants, with an error of more than 70%. Finally, on a more fine-grained level, Kevic and Fritz [30] examined the detection of activity switches and types within a change task using semantic, temporal and structural features. In two studies with 21 participants, they found that activity switches as well as the six self-identified activity types can be predicted with more than 75% accuracy. In contrast to these approaches, we focus on all tasks a developer works on during a day, not just the change tasks within the IDE.

Chapter 3

Identifying and Describing Tasks

In this chapter we focus on the problem of whether the topic of work—a task—can be automatically identified based solely on the information that a software developer is accessing as part of a task.
To simplify the investigation of this problem, we assume for the purposes of the research in this chapter that the times at which developers switch tasks are known, simulating a developer manually marking task switches as they work. We consider two research questions:

RQ1: Can we automatically associate descriptions of a developer's tasks with information they access as they work?

RQ2: Can we automatically assign tags to information accessed by a developer to describe the task being performed?

Approaches that can help address the first question would enable a developer to locate when work was being performed on a particular known task. These approaches could help a developer look back in the history of their work to identify information accessed as part of a known task or could be used to help complete reports on time spent on particular known tasks. These approaches could also help identify which task a developer worked on when, relaxing constraints in tools like Mylyn [29] or Jasper [11] that require a developer to indicate when work begins or ends on a task. Approaches that can help address the second question would further relieve constraints associated with having to know the tasks being performed. Instead, a developer could access tags that describe the work, from which a task description could be written or associated.

To evaluate the ability of our approach to address the two research questions, we required a data set. To generate this data set, we had 17 experienced software developers work on six tasks in a controlled lab setting. We designed the six tasks to be representative of information seeking tasks often performed by software developers. Previous studies have shown that developers spend 31.9% of their day on such tasks [21]. Each task description consisted of a paragraph or more of text indicating what information was needed on a given topic. More information about the tasks developers worked on can be found in Chapter 3.1.
As a developer worked, we recorded their screen and took notes about when they worked on each task. We interrupted developers and prompted changes in the order in which tasks were performed to gather more realistic data that involves task switching. From this data, we generated representations of the tasks using our approach. We make this data set available to other researchers so they may build on it to investigate other approaches.

To investigate the first research question, we evaluated a variety of techniques for generating representations of the tasks from the data to compare to descriptions of the tasks. To reduce bias in the descriptions of tasks, we gathered multiple short summaries for the tasks using professional IT personnel sourced through Mechanical Turk. Our evaluations revealed that a simple approach using TF-IDF yielded the best results (Accuracy: 70.6% for task segments and 75.5% for all segments of a task).

To investigate the second research question, we used the best technique identified from our evaluation of the first research question (TF-IDF with word tokenization) to generate word clouds for task segments in our dataset. We then conducted a survey in which 28 software developers had to match a randomized subset of the generated word clouds to the original tasks. Overall, the respondents in this survey were able to identify the correct tasks for the word clouds in 67.9% of the cases.

3.1 Data Set Creation

To support the investigation of the two research questions, we created a data set from 17 developers working in a controlled laboratory setting on a set of six tasks over a 2 hour time period. We chose a laboratory setting to be able to gather data from multiple developers working on the same tasks. The full data set is available online [72].

3.1.1 Developers

We recruited 17 participants—that we will call developers in the following—through advertising at our university and personal contacts.
Of these developers, 10 were graduate students, 4 were upper year undergraduates, and 3 were interns at a mid-sized software company. All developers had several years of experience in software development, with an average of 6.4 (±2.4) years per developer. 10 of the developers were female, and 7 were male. All developers were residents of Canada.

3.1.2 Session

At the start of a session, a developer was presented with a list of 6 tasks to perform within a 2 hour time period. The tasks were presented in the form of unread emails sitting in an inbox accessed by a webmail client. Figure A.1 shows a screenshot of the inbox at the start of a session. The order in which the tasks appeared in the inbox for a developer was randomized. The tasks represent a variety of non-coding tasks that commonly form part of software development. We chose to focus on non-coding tasks because the developers could attempt more tasks in a two hour block compared to coding tasks, which would require gaining familiarity with a codebase, and because it broadened the approaches that could be used to extract meaning from the information accessed as part of a task by a developer. We discuss the implications of this choice further in Chapter 4.6.

Table 3.1 provides a short description of each of the six tasks, including a short name that we use in this chapter to refer to a specific task; the short task name and description were not presented to developers. An example of one of the actual task descriptions used in this study is presented in Table 3.2. The full task descriptions for each task can be found in Appendix A.
We intentionally designed the App Market Research Task and Recommend Tool Task as tasks which were likely to have very similar information accessed as part of working on the task, to allow us to assess the discriminative power of our approach.

We asked a developer to work on the tasks on a laptop with a 13.3 inch, 1440x900 screen running macOS, which was instrumented with our own recording tool, PersonalAnalytics [42]. As a developer worked on the tasks, the tool recorded screenshots of the developer's active window at 1 second intervals. Application names and window titles were also recorded whenever they changed. To simulate interruptions, the tool produced a popup at random intervals lasting from 6.5 to 16.5 minutes, which asked the developer to switch to a new task. The average time between popups was selected as 11.5 minutes, in accordance with González and Mark's findings on the average amount of time knowledge workers spend in a working sphere segment before switching [22]. To simulate the disruptive effects of a real external interruption, we also asked developers to solve an arithmetic question before switching to a new task. These popups were excluded from our tool's recordings to avoid biasing our results.

As a developer worked on the tasks, a researcher manually annotated the times at which the developer switched tasks, also keeping track of the task the developer was working on. After the session was complete, the times at which switches happened were verified and adjusted by reviewing a screen capture that ran in the background of the provided laptop, to ensure highly accurate task switches were recorded.

3.2 Data Collected

In total, we were able to collect screenshot data for all 17 developers and on all six tasks for each developer. Also, all but one developer completed the 6 tasks within the allowed time period. On average, developers took 91.2 (±17.5) minutes to complete the six tasks and we collected an average of 5131 screenshots per developer.
Due to a technical issue, we were able to gather window titles for only 12 of the 17 developers.

Table 3.1: Overview of controlled lab tasks.

Abbrev.  Short Task Name                  Short Task Description (by us)
BugD     Duplicate Bug Task               Examine a collection of bug reports from a Bugzilla repository to determine if any were duplicates.
Viz      Viz Library Selection Task       Research visualization libraries and identify one which is suitable for outlining the benefits of your company's tool, for creating a presentation to clients.
PrMR     App Market Research Task         Perform market research on three productivity apps. Identify common functionalities, similarities and differences, and report on your findings.
PrRec    Recommend Tool Task              Examine app store reviews for three productivity apps (the same ones as above) in order to recommend one to your coworker.
DeepL    Deep Learning Presentation Task  Prepare in advance answers to likely questions for a hypothetical presentation you are giving about potential deep learning applications.
BlC      Blockchain Expert Task           Answer your coworker's follow-up questions about a hypothetical presentation you gave about the different ways your company could make use of blockchain.

Table 3.2: Full task description for the App Market Research task (PrMR) as presented to developers.

The software company you work for is considering expanding into the productivity tool sphere. Your manager has asked you to do some market research on 3 of the most popular already existing apps in this domain: Microsoft To-do, Wunderlist, and Todoist. Provide a short written summary of the similarities and differences between these 3 apps.

Figure 3.1: An example of a developer working on several tasks over time, revisiting task 3 in two task segments.

3.2.1 Data Annotation

Using the task annotations collected by the researcher during each session, we annotated the collected data with the task switches and the task the developer was working on.
Each session with a developer resulted in the developer working in an interleaved fashion on the six tasks. Figure 3.1 depicts a portion of a developer's work, showing an example of the interleaving. We define a task segment as the period of time between two task switches, during which a developer was working on a specific task. We define a task segment grouping as the collection of all task segments that collectively represent work on a specific task. We use task segment groupings as a baseline for evaluating our approaches, as they represent the simplest possible scenario, in which we have the entirety of the information accessed during the work on a task available to us for our analysis.

3.3 Generating Task Representations

The notion of a task is an abstract concept. In order to interact with tasks in a meaningful way, we must be able to represent tasks in a concrete manner. We generate two types of task representations: vector space representations (vectors) and visual representations (word clouds). These representations are created from the screenshots of the active windows that we gathered in the data collection phase using a series of extraction and processing steps. An overview of the steps involved is depicted in Figure 3.2.

3.3.1 Screenshot Preprocessing

To prepare the collected screenshots of the active windows for optical character recognition (OCR) with Tesseract [82], we preprocess the screenshots in accordance with suggested best practices for the Tesseract tool. Specifically, we convert the colored screenshots to grayscale and scale the resolution down to 300 DPI. These steps are recommended as Tesseract was originally intended for reading paper documents (i.e., black on white).
In addition, since most application windows have bars, such as a menu or a bookmark bar, at the top of the window, and these bars generally do not contain information specific to the task at hand, we consider them noise and crop a percentage from the top of the screenshot to remove it. Through experimentation, we found that removing the top 15% of the screen across all application window screenshots provides the best balance between removing noise and not losing meaningful content for the data that we collected. Note that this percentage might have to be adjusted when working on a screen with a different resolution and size. All screenshot preprocessing steps were automated with the ImageMagick tool [26].

3.3.2 Extracting Bags of Words

After preprocessing, we used the Tesseract OCR engine to extract the textual content of each screenshot. Tesseract tries to also preserve the format of the text, and produces a structured string for each screenshot that we store in a document, one for each screenshot. Note that the structured strings produced by Tesseract still contain substantial noise even after the preprocessing. For instance, an 'I' could be misinterpreted as the number '1' or the letter 'l'. As well, many nonsensical artifacts were produced due to noise from items like images and menu bars on the screenshot.

Figure 3.2: Process followed for the generation of task representations (the light blue boxes represent task segments).

To break up these documents into usable pieces of information (words) and reduce some of the noise, we applied one of two techniques, either (a) tokenization, or (b) keyword extraction.
We chose tokenization since it is a common practice when processing natural language text, and we chose keyword extraction as an alternative since it might help us to reduce some of the noise in the extracted text. For the tokenization, we used the Natural Language Toolkit (NLTK) [54] Version 3.2.5 and applied standard word tokenization techniques based on whitespace and punctuation to generate lists of tokens containing all of the words in a screenshot. To further reduce the noise, we then performed a dictionary check and discarded any tokens that were not correct English words. Further, we removed all stop-words from the tokens. For the keyword extraction, we used an open source implementation [64] (version 1.0.4) of the RAKE algorithm [70]. Based on an input string, RAKE produces a set of keywords that is equal to 1/3 of the number of original words (not counting duplicates).

After breaking up each document into a set of words, we stemmed all the words using the Porter stemmer implementation from NLTK. Finally, we created a bag of words—that is, a record of the frequency of each word—for each task segment by aggregating all words extracted from all screenshots of a task segment. In order to produce baseline task representations from all information on a task (task segment grouping), we also created bags of words by aggregating all words from all task segments belonging to the same task.

3.3.3 Generating Task Representations

A bag of words is itself a primitive representation of a task, where the entries can be seen as words that describe the task and the frequency of each word in the bag as its importance. However, this representation only reflects the importance of a word with respect to the current document (task segment).
To also take into account the relevance of the word in the context of the overall work of the developer, and to further help filter the noise from the screenshots and the OCR, we experimented with several natural language processing (NLP) and information retrieval techniques. Since TF-IDF performed best in our comparison (see Chapter 3.4), we present the generation of task representations using TF-IDF in the following.

The formula for TF-IDF is defined as f_t * idf_t, where f_t represents the frequency with which a term t occurs within a single document, and idf_t measures the inverse document frequency of t. The value of idf_t is calculated as log(N / df_t), where N is the total number of documents and df_t is the document frequency of t, or more specifically the number of documents that the term t appears in.

For our purposes, we consider a document in this scenario to be the bag of words generated from each task segment ts, and idf_t is calculated based on the collection of all documents from the same developer. Note that for our baseline in which we do not distinguish between task segments but group all segments of a task together, we consider these groupings as the documents and the six documents (one per task) as the whole corpus for TF-IDF.

Using the TF-IDF scores for each word in a bag of words, we ranked the words found within each task segment from most to least relevant. We then used this list to produce (a) a visual representation in the form of a word cloud WC, and (b) a vector space representation V.

In generating the word cloud, we found that a task contained on average 1148 unique words after the removal of stop-words. Since the majority of the words are not strictly relevant to the task itself, we limited the selected words to the 100 most relevant words. We found a 100 word limit to achieve a balance between omitting relevant words and cluttering the visualization with irrelevant information. We further used the TF-IDF relevancy score of a word to determine the size of the word.
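The scoring just described can be sketched end to end: aggregate each segment's extracted words into a bag, score each word by f_t * log(N / df_t) against the developer's other segments, and keep the top-ranked words. The following is a simplified stdlib sketch; the stemming, stop-word removal, and dictionary check of Section 3.3.2 are elided, and the base of the logarithm is an assumption, since the formula above does not specify one.

```python
import math
from collections import Counter

def tfidf_rank(segment_bags, top_k=100):
    """segment_bags: one Counter (bag of words) per task segment of a developer.
    Returns, per segment, its words ranked by TF-IDF, most relevant first."""
    n_docs = len(segment_bags)
    df = Counter()
    for bag in segment_bags:
        df.update(bag.keys())  # document frequency: number of segments containing the word
    ranked = []
    for bag in segment_bags:
        scores = {w: f * math.log10(n_docs / df[w]) for w, f in bag.items()}
        ranked.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return ranked

bags = [Counter({"blockchain": 5, "ledger": 2, "menu": 3}),
        Counter({"bugzilla": 4, "duplicate": 3, "menu": 2})]
print(tfidf_rank(bags, top_k=2))
# -> [['blockchain', 'ledger'], ['bugzilla', 'duplicate']]
```

Note how "menu", which appears in every segment, receives an IDF of zero and drops out of the ranking; the word cloud then scales each surviving word by its score, and the vector representation keeps the full score dictionary as the non-zero entries.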
Figure 3.3 shows two examples of such word clouds. For the vector space representation, we formed multi-dimensional vectors with one dimension per unique word in the set of all words from all documents for a developer. Each task segment can then be represented as a vector that has a non-zero entry for each word in the bag of words that represents the task segment, the non-zero entry being the TF-IDF score of the word.

3.4 RQ1: Associating Descriptions

Our first research question asks whether we can automatically associate descriptions of a developer's tasks with the information the developer accesses as she works. Performing this association automatically is challenging because there are many ways in which a developer can complete a task and there are many ways in which a developer can describe the task on which they are working.

The data we collected in the lab setting (Chapter 3.1) includes a number of ways in which the tasks assigned could be completed. While participants in the lab setting had some overlap in the resources they accessed as part of a task, no two participants completed a task in exactly the same manner.

Figure 3.3: Word Clouds (WCs) generated for task segments from different tasks and developers: (a) WC for a Deep Learning Presentation (DeepL) task segment of D1; (b) WC for a Duplicate Bug (BugD) task segment of D4.

Similarly, developers are likely to tailor their task descriptions towards the ways they might approach a task. To study the first research question, we therefore also needed a range of descriptions of the tasks on which the developers had worked. To gather these descriptions, we employed Amazon's Mechanical Turk. Given a range of descriptions collected in this way, we are able to assess how the range of techniques we developed for generating task representations (Chapter 3.3) can address the first research question.

3.4.1 Gathering Task Descriptions

To capture a range of task descriptions, we distributed a survey via Amazon's Mechanical Turk.
As a requirement for responding to our survey, we asked that respondents be currently or previously employed in the software industry. In total we received responses from 29 respondents. These respondents represented a range of fluency with English and a range of experience in software development. On average, respondents had 6.2 (±5.4) years of software development experience, and 3.8 (±3.5) years of professional development experience. Of these respondents, 24 reported that they were native English speakers, while 3 reported being fully fluent and 2 reported being proficient.

Respondents of this survey were presented with the same set of six full task descriptions that we also used for the data set creation in the controlled lab setting. An example can be seen in Table 3.2. We asked respondents to "Please summarize the task described below in your own words, as you might write it for your own reference in a to-do list or similar. Please limit your response to at most 15 words." Thereby, we randomized the order in which the full task descriptions were presented.

To filter out irrelevant or low quality responses, we asked two external experts to rate the quality of every task description generated by each respondent. Each rater used a scale from 1-3 to indicate the relevancy and quality of the responses, with a score of 1 indicating an irrelevant response, 2 indicating a relevant but low quality response, and 3 representing a relevant and high quality response.
We found that the distinction between responses rated 2 or 3 varied greatly between our two experts, but that there was a strong consensus with regard to the responses which were rated 1 (irrelevant) (Cohen's Kappa: 0.74, indicating strong agreement [41]). These irrelevant responses tended to come in multiples from the same participants. We removed all participants with irrelevant responses and considered only those responses which both raters rated with a score of 2 or higher, leaving us with 20 participants and a total of 120 task descriptions. A sample of 3 responses for one task, with the ratings by one expert rater, is depicted in Table 3.3.

Table 3.3: Examples of descriptions received for the Viz Library Selection (Viz) task, together with one of the expert's ratings.

Rating            Task Description (Survey Response)
Irrelevant (1)    I would suggest SIMILE Exhibit or InfoVis Toolkit for Javascript libraries to create a visualization.
Low Quality (2)   Visualize workers work pattern.
High Quality (3)  Create visualizations for product benefits. Select libraries and give existing work examples.

3.4.2 Evaluation

From the controlled lab setting, we have 189 task segments and 102 task segment groupings. From the Amazon Mechanical Turk survey, we have 20 descriptions for each task, resulting in a total of 120 task descriptions. We wish to determine if the approach we developed for generating task representations can be used to determine which task segment (or task segment group) maps to which task description with sufficient precision and recall, even when these task descriptions might vary. We also wish to determine which choice of techniques within the approach gives optimal results.
As our ground truth, we use the annotations made by a researcher during the data set collection phase, which tell us what task description a specific task segment or task segment group correctly maps to.

Our evaluation consists of considering each task segment from a lab developer's work and mapping it to one of the six task descriptions produced by a Mechanical Turk respondent. We use this evaluation method, which assumes a complete set of descriptions, as we wish to assess how well our approach might work in a situation where a developer may be trying to determine, from a given set of tasks, when they performed work on each task. For the mapping, we generate a vector space representation of the task segment V_TS as well as one for each of the six task descriptions V_1 to V_6 produced by a respondent, and then calculate the cosine similarities between V_TS and each of V_1 to V_6. We choose the task description most similar to our generated task representation and evaluate it by comparing it to the ground truth to determine if it is correct.

For generating task representations from task segments in vector space format, we experimented with and compared six (3 x 2) different combinations of techniques: 3 different techniques for vectorization (term frequency, TF-IDF, and word2vec word embedding), and 2 different techniques for extracting bags of words (tokenization using NLTK, and keyword extraction using RAKE, as described in Chapter 3.3.2). TF-IDF vectors were generated as described in Chapter 3.3.3, while TF vectors were generated directly from the bag of words calculated from an individual task segment without taking into account inverse document frequency, scaling only by TF. For the word embedding vectors, we used the Gensim implementation of Word2Vec [46], training on a Wikipedia corpus collected in 2018.
Specifically, we created a word2vec vector for each word in the bag of words representing a task segment, and then averaged all these vectors to generate a single word2vec vector for the task segment.

To generate vectors from the task descriptions of Mechanical Turk (MT) workers, we tokenized the task descriptions using NLTK (keyword extraction is not useful in this case given the brevity of the descriptions), and then applied the exact same vectorization technique as used for the task segments, i.e., either TF, TF-IDF, or word2vec word embedding. TF-IDF vectors were generated on the fly during our evaluation, as the IDF calculation varies depending on the document set (i.e., the task segments) of the developer dataset being evaluated.

3.4.3 Results

Figure 3.4 illustrates the results of the comparison between the six different combinations of vectorization and word extraction techniques. Overall, the combination of TF-IDF with simple word tokenization performed the best; however, the differences are small compared to the combination with RAKE or using just TF. Ultimately, word2vec performed the worst for the generation of task representations and mapping to the task descriptions. Since word2vec is also the most computationally intensive, it was the least appropriate for this scenario. Based on these results, we selected the combination of TF-IDF with word tokenization (NLTK) as the approach that we use for the remainder of the chapter.

Figure 3.4: Accuracy comparison for the 6 different combinations of techniques used to generate task representations (TK = tokenization).

Table 3.4 presents the results of the evaluation when mapping task representations of task segments (or task segment groupings) to the task descriptions written by the 20 MT workers.
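The mapping step of Section 3.4.2 — representing a segment and the six candidate descriptions in a shared vocabulary and choosing the description with the highest cosine similarity — can be sketched as follows. This is a stdlib illustration only; the toy word-to-weight dictionaries stand in for the TF-IDF vectors described above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as word->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_match(segment_vec, description_vecs):
    """Return the index of the task description most similar to the segment."""
    sims = [cosine(segment_vec, d) for d in description_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

seg = {"blockchain": 1.5, "ledger": 0.6}          # TF-IDF-style segment vector
descs = [{"bug": 1.0, "duplicate": 1.0},          # e.g., a BugD description
         {"blockchain": 1.0, "presentation": 0.5}]  # e.g., a BlC description
print(best_match(seg, descs))  # -> 1, the blockchain description
```

The chosen index is then compared against the ground-truth annotation to count the prediction as correct or not.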
We report the precision and recall for each of the 6 tasks, calculated on a per task segment basis (or with the baseline of the per task segment grouping).

Table 3.4: Results of mapping task representations to task descriptions written by MT workers.

                      Task Segments                            Task Segment Groupings
           BugD   Viz    PrMR   PrRec  DeepL  BlC    BugD   Viz    PrMR   PrRec  DeepL  BlC
Precision  92.6%  72.7%  53.3%  44.7%  83.5%  80.9%  97.4%  78.1%  55.0%  55.7%  85.3%  86.4%
Recall     82.3%  81.5%  47.1%  65.4%  75.3%  70.0%  89.4%  89.4%  45.0%  70.9%  82.1%  76.5%

Overall, using only task segments, our approach achieved high accuracy across all tasks (70.6%) in comparison to a random classifier (16.7%). Accuracy for task segment groupings was moderately better (75.5%). This is a promising result as it indicates that there is often already sufficient information in an individual task segment to predict the task that is being performed; adding more information helps some but does not make a dramatic difference.

Our approach performed well at predicting tasks with a distinct focus, such as DeepL and BlC, with precision values over 80%. This result is unsurprising, as in order to perform these tasks, the developers in the lab setting tended to turn to resources that contained a dense amount of highly specific information related to these topics, such as the Wikipedia pages for blockchain and deep learning. The presence of dense, consistent information eases the production of accurate representations for the tasks. Our approach also performed well at recognizing the BugD task, with precision over 92%. We found this result surprising as we expected this task to be one of the more difficult tasks to predict, especially since our summary authors were given no information about the content of this task beyond that it involved finding duplicate bugs in a Bugzilla repository. As expected, the most difficult to predict tasks were the PrMR and PrRec tasks.
These tasks are very similar and, as such, resulted in very similar task representations as well as very similar task descriptions, and in turn in a high confusion between the two tasks.

Figure 3.5 and Figure 3.6 illustrate the results broken down on a per developer and a per MT respondent level. Despite differences in the way each developer performed each task, the results are fairly consistent across developers, ranging from a minimum accuracy of 62.9% to a maximum of 77.3%. Across the MT respondents that authored the task descriptions, the results are mostly consistent; however, there is a significant variation for a few respondents. The respondents for which the accuracy of mapping task representations to their task descriptions was rather low tended to be ones who authored multiple descriptions that were also rated lower by the experts (e.g., item 2 in Table 3.3 was written by author S6). These results demonstrate that the task representations are relatively robust across developers and different ways of performing the tasks, and that writing precise and somewhat detailed descriptions of the tasks being performed clearly impacts the results of our approach.

Figure 3.5: Accuracy for mapping task representations to the correct task descriptions by lab developer. The dotted line represents the accuracy for a random classifier.

3.5 RQ2: Assigning Tags

To address the question of whether we can automatically assign tags to information accessed by a developer that would help the developer identify what task she worked on during a specific period of time, we evaluated the word clouds we generated as described in Chapter 3.3.3.
Specifically, we asked 28 participants experienced in software development to match our generated word clouds to the original full task descriptions of the tasks that were performed during the data set creation.

3.5.1 Survey

To evaluate the quality of our automatically generated word clouds as a visual representation of a task, we conducted a survey with experienced software developers. Participants were recruited through personal and professional contacts, and as an incentive for responding were entered into a draw for one of two $25 gift cards if they desired. In total, we received survey responses from 28 individuals, with an average of 8.0 (±3.9) years of software development experience. 20 participants were male, while 8 were female. 9 participants reported they were native English speakers, 12 reported that they were fully fluent in English, and the remainder (7) reported that they were proficient in their understanding of English.

Figure 3.6: Accuracy for mapping task representations to the correct task descriptions by MT respondent. The dotted line represents the accuracy for a random classifier.

We asked our participants to match word clouds to corresponding tasks by presenting them with the list of the six full task descriptions that we also used for the data set creation. An example of one of the descriptions can be seen in Table 3.2. The word clouds used in the survey were generated following the procedure described in Chapter 3.3.3. Using the data that we collected across all 17 developers in the data set creation (Chapter 3.1), we randomly selected 4 task segments and 4 task segment groupings for each of the six tasks and generated word clouds for these, resulting in a total of 48 word clouds.
Since asking survey participants to examine a total of 48 word clouds would be impractical, we randomly selected 12 of the 48 for each participant, ending up with 2 word clouds (1 for a task segment, 1 for a task segment grouping) for each of the six tasks. Examples of two of these word clouds can be found in Figure 3.3. We asked participants to read the six full task descriptions and to then identify which task the presented word clouds describe best. Participants also had the option to indicate that a word cloud does not match any task.

3.5.2 Results

We aggregated the results of the survey responses to obtain accuracy ratings for the word clouds we generated. Overall, the average accuracy of mapping word clouds to the corresponding tasks was 67.9% for the word clouds generated from task segments, and 69.6% for the word clouds generated from groupings. Figure 3.7 shows the breakdown of the accuracy on a per task level. The success rates of our participants varied widely between tasks. For example, for the blockchain expert task BlC, our participants were able to correctly identify the task for the generated word cloud 100% of the time. Conversely, participants struggled to properly identify the task for the word clouds generated for the duplicate bug task BugD (35.7%). This task was by far the most difficult for participants to identify, and many participants reported that the word clouds generated for this task were not descriptive.

Unsurprisingly, participants frequently confused the word clouds generated for the app market research task PrMR and for the recommend tool task PrRec. These word clouds tended to have very similar key words, as both full task descriptions mentioned the same three productivity tools.

Comparing the results of the word clouds generated from segments to the ones generated from groupings did not reveal a substantial difference.
This is a promising result, as it indicates that enough data can be generated from within the bounds of most task segments to create word clouds that accurately represent the topic of a task as a whole.

3.6 Discussion

Decisions we have made in designing the approach we introduce are impacted by the evaluations we undertook. We discuss threats to the validity of these evaluations and consider alternatives that could make it easier to apply our approach.

Figure 3.7: Accuracy for identifying the task based on our generated word clouds. The dotted red line indicates the accuracy for a random classifier.

3.6.1 Threats to Validity

The evaluations of the approach we conducted rely on a data set that focused on six tasks. Although we chose these tasks to be examples of information finding tasks performed by developers, the range of tasks explored is small. By focusing on information finding tasks, we also exclude a significant category of tasks on which developers commonly work, namely coding-related tasks. We believe that with minor adaptations, such as tokenizing camel case words or parsing the OCR results to extract in-code comments, our approach could be made to work with coding tasks. If we took this approach to coding tasks, the quality of the code base in terms of documentation, naming conventions, and so on, could play a large role in the ability of our approach to make accurate predictions. It would be impossible to associate a developer description, or generate a meaningful visual representation, if the code base does not contain descriptive names and lacks documentation. We leave the investigation of the generalizability of our approach across a wider range of tasks to future study.

Another threat to the findings is the size of the tasks studied and the interleaving of work on different tasks. To fit within a reasonable time frame for a lab setting, the tasks worked on were relatively small in scope.
In reality, developers work on complex tasks that can have a huge scope and span multiple topics. In addition, although we caused developers to switch tasks, it is not possible to replicate the many task switches a developer undertakes as they work [44]. A field study is likely needed to mitigate these threats.

We also note that the tasks we designed may be more specific in their wording than those that might occur in a developer's normal work pattern. For example, a developer might work on a task in response to some relatively vague verbal request for help from a colleague. In such cases, it is unlikely that the summaries that the developer would write for these tasks are highly descriptive. We mitigate this threat by including a wide variety of low- and high-quality task summaries, written by a group of MT workers with diverse demographics, in our evaluation.

3.6.2 Applying the Approach

The approach we have introduced and evaluated assumes that the boundaries of a task segment can be known with high accuracy. Automatic detection of task segments (i.e., task switches) is a difficult problem (e.g., [48, 74]). We investigate this problem further in Chapter 4. While our own results are promising, and we are optimistic that techniques to detect task switches will continue to improve, future work should explore the performance of our approach in the absence of knowing task segment boundaries. It may be that missing or erroneously predicting a task switch could lead to degraded performance of our approach in practice.

It is also possible that in practice the vocabulary a developer uses to describe their task does not match exactly the words commonly found within the content of the task. For example, a developer might use the word “chart” in their task description, yet in the window content of the task the word “graph” might appear prominently instead. Applications of TF-IDF would miss this connection given their focus on exact word matches.
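One way to bridge such a vocabulary gap is to compare words by embedding similarity rather than by exact match. The sketch below illustrates the idea; the three-dimensional vectors are hand-made toy values standing in for a real pretrained word-embedding model such as Word2Vec, and the function names and threshold are our own illustrative choices.

```python
from math import sqrt

# Toy embedding table: hand-made vectors standing in for a real pretrained
# word-embedding model (e.g., Word2Vec); the values are illustrative only.
EMBEDDINGS = {
    "chart": (0.90, 0.80, 0.10),
    "graph": (0.85, 0.75, 0.20),
    "email": (0.10, 0.20, 0.90),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def expand_with_similar(description_words, window_words, threshold=0.9):
    """Return the description vocabulary expanded with window-content words
    that are semantically close to some description word."""
    expanded = set(description_words)
    for w in window_words:
        if w in expanded or w not in EMBEDDINGS:
            continue
        if any(d in EMBEDDINGS and cosine(EMBEDDINGS[d], EMBEDDINGS[w]) >= threshold
               for d in description_words):
            expanded.add(w)
    return expanded

# "graph" is pulled in because its vector is close to "chart"; "email" is not.
print(sorted(expand_with_similar({"chart"}, {"graph", "email"})))
# prints ['chart', 'graph']
```

With a real model, the embedding table would be replaced by pretrained vectors and the threshold tuned empirically.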
By incorporating some notion of semantic similarity into our approach, for example by adding Word2Vec or another word embedding model, we might be able to enhance task descriptions to also include semantically similar words. More experimentation in a more realistic setting is needed to investigate the impact of, and need for, semantic similarity.

3.6.3 Artifact Access

Using OCR and capturing a developer's screen content has several benefits. First, it is an application-agnostic approach that does not require any instrumentation of applications. As well, a screenshot shows us the exact content a developer is looking at in the moment. While OCR performed well for the purposes of our analysis, there are many drawbacks that could limit its usability in practice. For one, OCR is an extremely CPU-intensive task. Processing screenshots in real time in the background while a developer works may be impractical for this reason. An obvious alternative might be to send screenshots to the cloud for processing, but privacy concerns, from both the developer's and the company's perspective, limit the applicability of this approach.

Another issue is the noise generated when using OCR. This may be alleviated to some extent by using a commercial option rather than the open source Tesseract engine. However, we cannot guarantee that the product of a screenshot processed with OCR is exactly the same as the content a developer saw on their screen when the screenshot was taken.

An alternative which we will investigate in future work is to track all file accesses and edits made within the scope of a task segment. If we know which files a developer is interacting with, we can extract the contents of the files directly. The benefit of knowing exactly which information in a document is being viewed would be lost in such an approach. However, this loss may be outweighed by the ability to produce cleaner data and the much lower CPU usage.
The contents of web page visits could also be extracted relatively easily with the help of a browser extension. However, it could be difficult to obtain information from applications such as instant messaging and email clients, as there is a much wider range of choices for a developer to use in these cases. For this reason, producing a suite of instrumentations for all of the most commonly used applications is impractical. Further investigation is needed to determine how much predictive power is lost by the exclusion of these categories of applications.

3.6.4 Creating Vectors from Window Titles

As mentioned in Chapter 3.1, in addition to recording screenshots of a developer's active window, our tool also recorded the window title of every window the developer accessed. Unfortunately, due to a recording error, window title data was lost for 5 of the 17 developers in our data collection session.

To investigate whether the easier-to-collect information about application window titles might suffice for supporting our approach, we evaluated RQ1 with the window title data from the 12 developers, in place of the information extracted using OCR. Comparing the results of this evaluation with the results from the same 12 developers using screen content, we found that while the results were lower overall, the difference was modest (64.4% accuracy using window titles vs. 70.3% accuracy using screen content). Figure 3.8 illustrates the differences in performance on a per task level. While screen content is a superior choice of data source in almost all cases, window titles seem like a viable alternative, especially given the savings in CPU resources. Worth investigating is whether a combination of our approach using window titles and the other data extraction techniques mentioned above can rival the results we achieved using screen content.
Figure 3.8: Comparison of performance per task with window titles as the data source vs. screen content. For both approaches, results are calculated across the 12 developers with window title data available.

Chapter 4

Detecting Task Switches and Their Types

In the previous chapter, we showed that it is possible to determine and describe the content of a task, making the assumption that the times at which task switches occur are known. In this chapter, we investigate the feasibility of automatically predicting these task switches, as well as predicting at a coarse grain, and with minimal data, the type of task being performed. We consider the following two research questions:

RQ3: Can we automatically detect the times at which a developer switches between tasks?

RQ4: Can we automatically determine the types of the tasks that a developer works on?

Previous researchers have proposed approaches to automatically detect switches between tasks, varying mainly in the features used (e.g., user input or application based) and the method applied (e.g., supervised versus unsupervised machine learning) [48, 74, 76]. Yet, the evaluations performed to study these approaches are often fairly limited in terms of the tasks and number of participants, and the results show that it is very challenging to achieve high prediction accuracy of task switches without too many false positives [48, 51, 81], or that one has to accept a high deviation in time of 3 to 5 minutes between predicted and actual task switches [73, 74, 76].
In addition, existing approaches in the software engineering domain for detecting task switches are limited to the IDE and therefore do not capture non-development work, which can account for between 39% and 91% of the time developers spend at work [3, 21, 44, 61, 77].

In addressing RQ3, we investigate whether we can automatically detect task switches of professional software developers in the field with high accuracy, based on temporal and semantic features extracted from their computer interaction inside and outside the IDE.

We were also interested in classifying the type of task a developer is working on, since the better we understand the context of a task, the better we can support developers. To the best of our knowledge, there has been only one approach so far that looked at the automatic classification of developers' activities on a task level [5]. Yet, that examination was limited to specific development activities only, and did not consider the whole range of non-development tasks that developers work on, such as administrative or planning tasks.

In addressing RQ4, we investigate the task types that software developers are working on more holistically, and explore how accurately we can predict them in the field.

To address our research questions, we performed two field studies: one with 12 professional developers in which we observed their work over a 4-hour period and logged the task switches and types without interrupting their work; and one with 13 professional developers in which we regularly prompted participants to self-report their task switches and types over a period of about 4 workdays and conducted a post-study questionnaire. By varying the study methods, we wanted to achieve a higher generalizability of our results and ensure that we take into account the effects of self-reporting while also capturing the breadth of developers' tasks over multiple days.
For both field studies, we also collected the participants' computer interaction using a monitoring tool that we installed on their machines and that was running in the background. From the computer interaction data, we extracted a total of 109 temporal and semantic features. Our analysis of the data shows that we can use the automatically logged computer interaction data to train machine learning classifiers and predict task switches with a high accuracy of 87%, and within a short time window of less than 1.6 minutes of the actual task switch. Our analysis further shows that we are able to predict task types with an accuracy of 61%, yet that this accuracy varies by task type. The features based on mouse and keyboard interaction generally hold the highest predictive power, while the lexical features we extracted from the application names and window titles have the least predictive power in our approach.

4.1 Study Design

To investigate the use of computer interaction data for predicting task switches and types, we conducted two field studies, a 4-hour observational study and a multi-day study with experience sampling, with a total of 31 professional software developers initially. The observations and self-reports served as the ground truth of participants' task switches and types, while we additionally gathered computer interaction data to extract features for our predictions. In both studies, we used the same definitions of tasks, task switches, and task types, which we also shared with the participants. A brief overview of our study design is presented in Figure 4.1.

Figure 4.1: Overview of the study design and outcomes.

4.1.1 Study 1 – Observations

In our first study, we observed the work of 12 participants over a period of 4 hours each to gather a richer understanding of developers' task switches and the types of tasks they work on.

Procedure While conducting the study, the observer followed a detailed protocol that we developed before the study.
The first observation session was performed by both observers at the same time. A cross-check of the two observation logs showed an inter-rater agreement of 97%, suggesting a high overlap in observing the same tasks and task switches.

Before each observation session, the observer explained the study purpose and process to the participant and asked them to sign a consent form, to install a monitoring tool that tracks participants' computer interaction, and to describe the tasks they were planning to work on during the observation. The observer also introduced herself to nearby colleagues and asked them to ignore her as much as possible and to collaborate with the observed participant as they would normally do. After that, the observer placed herself behind the participant to prevent distractions, while still being able to see the screen contents on the participant's computer. Finally, the observer started the actual observation session and asked the participant to continue their work as usual.

We observed participants for a total of four hours each on a single workday: two hours before and two after lunch. For the observations, we followed Mintzberg's protocol of a structured observation session [47]. The observer wrote an entry in an observation log 1 each time the participant switched from one task to another. Each entry in the observation log consists of a timestamp, a description of the reason for the task switch, and a description of the task itself. We inferred tasks and their details from the active programs and their contents on the screen, as well as from discussions participants had with co-workers. After each session, the observer validated the observed tasks and task switches with the participant, by going through the list of observed tasks and accompanying notes and correcting mistakes made during the observation.

1 We used our own observation logging tool.

Participants We recruited 14 participants through professional and personal contacts from two large-sized and one medium-sized software companies.
We excluded two participants for whom we were not able to observe a sufficient number of task switches (fewer than 10). Of the remaining 12 participants, 1 was female and 11 were male. Throughout this chapter, we refer to these participants as P1 to P12. Our participants had an average of 10.8 (±7.4, ranging from 1 to 20) years of professional software development experience and were working in different roles: 8 participants identified themselves as individual contributors and 4 as developers in a leading position. All participants resided either in Canada or the United States.

Monitoring Tool To collect computer interaction data from developers, we developed and used our own monitoring tool, PersonalAnalytics 2, for the Windows operating system. The tool tracks participants' mouse and keyboard interaction, as well as their application usage. For the mouse, the tool tracks the clicks (coordinates and button), the movement (coordinates and moved distance in pixels), and the scrolling (coordinates and scrolled distance in pixels), along with the corresponding timestamp. For the keyboard, the tool records the type of each keystroke (regular, navigating, or backspace/delete key) along with the corresponding timestamp. For privacy reasons, we did not record specific keystrokes. Our tool further records the currently active application, along with the process name, window title, and timestamp, whenever the window title changes or the user switches to another application.

Task Type Inference We inferred task type categories by performing a Thematic Analysis [8] on the basis of related work and our observation logs. The analysis process included first familiarizing ourselves with the observed task switches, open coding the observed and participant-validated tasks and accompanying notes, identifying themes, and categorizing the resulting themes into higher-level task types.
This process resulted in nine task type categories: Development, Personal, Awareness & team, Administrative, Planned meeting, Unplanned meeting, Planning, Other, and Study. The task types are described in more detail in Table 4.4 and discussed in Chapter 4.4.1. In contrast to a task (and the task type), an activity describes an event or happening that does not necessarily need to have a particular purpose (or task). For example, the activity Web Browsing could be grouped into several task types, such as Development when the developer is reading API documentation online, and Planned meeting when the developer is using an online-conferencing tool.

2 Details can be found in [42].

4.1.2 Study 2 – Self-Reports

To capture a longer time period and more breadth in developers' work, we conducted a second field study with 13 participants over a period of 4 workdays each. For this study, we used experience sampling; in particular, we regularly prompted participants to self-report task switches and types. By using experience sampling, we also wanted to mitigate the risk of a bias in participants' behavior due to an observer sitting behind them, which, for example, could lead to participants being less likely to browse work-unrelated websites.

Procedure Before the study, we emailed participants a document explaining the study goal and high-level procedure, and asked them to sign a consent form and to answer a pre-study questionnaire with questions on demographics, their definition of a task, reasons for switching between tasks, and the task types they usually work on. Afterwards, participants received the study instructions, detailing the study goals, the definitions of task switches and types that we used for the study, and instructions on how to install and run the monitoring tool.
They were asked to install the same monitoring tool that we described above on their main computer. In case participants worked on multiple computers (e.g., a desktop and a laptop), we asked them to install the monitoring tool on both devices. Participants were further asked to read our definitions of a task, task switch, and task type, as well as instructions on how to use the self-reporting component that we added to our monitoring tool. Finally, participants were asked to pursue their work as usual for the next couple of workdays while also self-reporting their task switches and types when the pop-up prompts appeared.

For this study, our tool prompted participants once per hour to self-report their task switches and types for the previous hour. The self-reporting step is explained in more detail below. We intentionally decided to use an interval of one hour rather than a full day, to balance the intrusiveness of the prompts with the ability to accurately remember tasks and task switches over the previous time interval [84]. To further ensure high quality in the collected self-report data, we allowed participants to withdraw from the study at any point in time and to pick the time of their participation themselves. In addition, to avoid boredom or fatigue, we asked participants to respond to a total of 12 to 15 prompts, assuming an average of four self-reports per day and a total of three to four workdays of participation. This number was the result of several test runs over multiple weeks and of qualitative feedback gathered from a pilot participant, a professional developer. Furthermore, we provided support to postpone self-report prompts for 5 minutes, 15 minutes, or 6 hours, and built and refined the self-reporting component to require as little effort as possible to answer, e.g., by letting participants answer the required fields by simply clicking on elements instead of asking them for textual input.
Finally, each pop-up also asked participants to report their confidence in their self-reports.

Throughout the study, participants could check the number of completed pop-ups. Once they had completed 12 pop-ups, participants could notify us and upload the collected data and self-reports to our server. The upload wizard once again described the data collected and allowed participants to obfuscate the data before sharing it with us. At the end of the study, participants were asked to answer a post-study questionnaire with questions on the difficulties experienced when self-reporting task switches and task types, on further task types they were working on, and on how they could imagine using information on task switches and types. After completing the survey, participants were given two 10 US$ meal cards to compensate them for their efforts.

Participants We recruited 17 participants through professional and personal contacts from one large-sized software company. We discarded the data from three participants who self-reported fewer than 10 task switches in the days of their participation. We further discarded the data of one participant whose definition of a task switch was very different from ours and from the rest of the participants (i.e., he considered every application switch a task switch). Of the remaining 13 participants that we used for the analysis, 2 were female and 11 were male. Our participants had an average of 12.1 (±8.2, ranging from 1 to 30) years of professional software development experience and were working in different roles: 10 identified themselves as individual contributors and 3 as developers in a leading position (i.e., Lead or Manager). All participants resided in the United States.

Figure 4.2: Screenshot of the second page of the experience sampling pop-up that asked participants to self-report their task types.
In this chapter, we refer to these participants as P13 to P25.

Self-Reporting Component The self-reporting component is part of our monitoring tool and includes a pop-up with three pages. The first page asked participants to self-report the task switches they experienced in the past hour. It visualized participants' application usage on a timeline, using different colors for each application, and allowed them to self-report their task switches by clicking on the lines denoting application switches. We restricted the task switch self-reports to a granularity of application switches with a minimum length of 10 seconds for a variety of reasons: First, we assumed that most of participants' task switches coincide with application switches (e.g., switching from the email client to the IDE, or from the browser to an IM client) and fewer happen during a session uniquely spent within the same application (e.g., switching tasks directly in the IDE or in the browser). Second, we wanted to avoid cluttering the user interface of our self-reporting component and to simplify the reporting for participants. Similar to [76], the timeline visualization provided additional details when the participant hovered over an application, such as the application name, the time when it was used, the window title(s), and the user input produced in that application. As soon as participants completed self-reporting their task switches for the whole previous hour, they could proceed to the second page and self-report their task types (see Figure 4.2). On the second page, we visualized the same timeline as before, but added another row that prompted participants to select task types from a drop-down menu. After selecting the task types for all task segments, participants could proceed to the last page. The third page asked participants to self-report their confidence in their self-reports of task switches and task types on a 5-point Likert scale (5: very confident, 1: not at all confident) and optionally add a comment.
Capturing participants' confidence served as an indicator of the quality and accuracy of their self-reports. The user interface we used to collect the ground truth for task switches and types resembles the one by Mirza et al. [48–50].

The supplementary material [45] includes the study instructions we shared with participants, the pre- and post-study questionnaires they answered, and additional screenshots detailing the self-reporting component.

4.2 Data and Analysis

For this study, we collected two rich data sets, including observed or self-reported ground truth data, and automatically tracked computer interaction data. Prior to the main analysis of the data, we performed multiple pre-processing steps, including data segmentation and feature extraction, which are summarized in the remainder of this section.

4.2.1 Collected Data

For study 1, we collected observation logs for a total of 51.7 hours of work and an average of 4.3 (±1.3) hours per participant. For study 2, we collected self-reports for a total of 58 workdays and an average of 4.5 (±1.7) days per person. On average, participants reported a high confidence in their self-reports (>3) in 20.6 (±9.0) of the pop-ups they answered, and a medium or low confidence (≤3) in 22.2 (±16.7). 77.0% was the highest ratio of medium or low confidence self-reports for any one participant, and 16.7% was the lowest. We decided to only use the data of the 268 self-reports with a high confidence (>3), thus including a total of 268 hours of work and discarding the rest (289 self-reports). This allowed us to ensure we were training our models with data of high quality and accuracy. Future work could also account for over- or under-confidence in participants' self-reports.

Table 4.1 reports statistics on the self-reports.
Since overall only 11% of the pop-ups were postponed by participants, one reason for the relatively high number of self-reports with medium or low confidence could be that the pop-ups appeared at inopportune moments and participants did not remember that they could postpone them. Instead, participants might have just clicked through the pop-up and reported a low confidence so as not to distort the data.

Table 4.1: Self-reports for Study 2.

                                          All    per Part.
Days participated                          58    4.5 (±1.7)
Pop-ups displayed to participants         557   42.8 (±21.6)
Pop-ups answered by participants          268   20.6 (±9.0)
  - Pop-ups answered within 5 minutes     158   12.2 (±6.3)
  - Pop-ups answered after 5 minutes      110    8.5 (±5.4)
Pop-ups postponed by participants          62    4.8 (±3.5)
Pop-ups discarded by researchers          289   22.2 (±16.7)

4.2.2 Time Window Segmentation

To calculate and extract task switch detection features, we defined the time windows to be between two application switches, which we call application segments. Thus, the task switch detection model that we were going to build could recognize task switches whenever a developer switches between applications, but would miss task switches within an application, such as a switch from one work item to the next inside the IDE. We consider application segments to be an appropriate time window with minimal prediction delay, since developers spend on average only 1 to 2 minutes in an application before switching to another [22, 43, 44], and because this helped ensure that the accuracy of participants' self-reports (in study 2) was high. Threats to this classification are discussed in Chapter 4.6. In contrast, previous approaches predominantly used longer, fixed window lengths of 5 or 10 minutes [51, 55, 74, 76]. The shorter and more flexible time windows at the borders of application switches allow us to more accurately capture developers' behaviors, and to more precisely locate the point in time of the task switch.
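As a sketch of this segmentation step, the following groups a chronological stream of window-focus events into application segments; the event tuples and timestamps are hypothetical examples of the kind of data the monitoring tool records, and the function name is our own choice:

```python
from datetime import datetime

def to_app_segments(events, session_end):
    """Collapse a chronological stream of (timestamp, application) focus
    events into application segments: maximal spans spent in one application.
    Consecutive events in the same application (e.g., window-title changes)
    are merged into a single segment."""
    segments = []
    for i, (start, app) in enumerate(events):
        # A segment ends when the next event begins, or at the session end.
        end = events[i + 1][0] if i + 1 < len(events) else session_end
        if segments and segments[-1][2] == app:
            segments[-1] = (segments[-1][0], end, app)  # merge, same app
        else:
            segments.append((start, end, app))
    return segments

ts = lambda s: datetime.fromisoformat("2020-04-01T" + s)
events = [
    (ts("09:00:00"), "IDE"),
    (ts("09:02:10"), "WebBrowser"),
    (ts("09:02:40"), "WebBrowser"),  # window-title change, same application
    (ts("09:05:00"), "IDE"),
]
for start, end, app in to_app_segments(events, ts("09:10:00")):
    print(app, (end - start).total_seconds())
```

Features for the task switch classifier would then be computed per segment, with a candidate task switch at each segment border.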
For the task type detection features, we used the time windows between two task switches, as identified by our observations (study 1) or participants' self-reports (study 2), which we call task segments, for the feature extraction.

4.2.3 Task Switch Features Extracted

A next step towards building a classifier for task switch detection is to extract meaningful features from the raw computer interaction data collected by the monitoring tool. The features that we developed are based on heuristics that participants stated as indicative of their task switches in the post-study questionnaire (study 2), on features that have been linked to developers' task switching behavior in prior work, and on our own heuristics. The features we used are presented in Table 4.2 and are discussed in more detail in the remainder of this section.

Task switch detection is a special case of change-point detection [6, 24], which is the process of trying to detect abrupt changes in time-series data. This is why many of our features compare characteristics of the previous application segments with the current one, for example the difference in the number of keystrokes. To determine how many steps back one needs to compare the current application segment's features with those of previous segments, we ran the task switch detection taking into account 1 and up to 10 steps back into the past, comparing the resulting precision and recall. Our analysis of the results indicated that after an initial increase in the precision for detecting a switch, the precision and recall gradually drop as the number of steps increases. We therefore chose 2 as the number of steps to go back in terms of application segments.
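The difference features over the chosen two steps can be sketched as follows; the segment dictionaries and field names are hypothetical stand-ins for the per-second rates computed from the monitoring data:

```python
def diff_features(segments, i, keys=("keystrokes_per_sec", "clicks_per_sec")):
    """For application segment i, compute change-point style difference
    features: once between the current and the previous segment, and once
    between the two preceding segments (i.e., 2 steps back in total).
    Segment dicts and field names are illustrative stand-ins."""
    features = {}
    for back in (0, 1):  # (current vs. previous), (previous vs. one before)
        for key in keys:
            name = f"diff_{key}_step{back + 1}"
            if i - back - 1 < 0:
                features[name] = 0.0  # not enough history at the session start
            else:
                features[name] = segments[i - back][key] - segments[i - back - 1][key]
    return features

segments = [
    {"keystrokes_per_sec": 2.1, "clicks_per_sec": 0.10},
    {"keystrokes_per_sec": 0.3, "clicks_per_sec": 0.45},
    {"keystrokes_per_sec": 0.2, "clicks_per_sec": 0.50},
]
# A sharp drop in typing between segments suggests a possible task switch.
print(diff_features(segments, 2))
```

Each candidate border thus contributes one feature vector, which, combined with the other feature groups, is what the task switch classifier is trained on.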
As a result, the total number of features used for the task switch detection is 84, which is double the number of unique features used: each feature is calculated once to compare the current with the previous application segment, and once to compare the previous two application segments. In the following, we provide an overview of all the features used:

User Input Features The first feature group are user input features. They are based on keyboard and mouse interaction, such as the difference in the number of keystrokes the participant pressed per second between this and the previous time window segment.

Table 4.2: Features analyzed in our study and their importance for predicting task switches and task types.

Features | Import. Switch | Import. Type All | Import. Type UI

User Input Features | 45.8% | 52.0% | 81.0%
Keystroke differences (4): difference in the number of navigate/backspace/normal/total keystrokes pressed per second between the previous and current application/task segment [5, 27, 33, 50] | 16.4% | 19.1% | 29.9%
Mouse click differences (4): difference in the number of left/right/other/total mouse clicks per second between the previous and current application/task segment [5, 27, 33] | 17.9% | 13.9% | 29.4%
Mouse moved distance (1): total moved distance (in pixels) of the mouse per second [27] | 6.8% | 8.7% | 13.5%
Mouse scrolled distance (1): total scrolled distance (in pixels) of the mouse per second [5, 33] | 4.7% | 5.0% | 8.2%

Application Category Features | 29.4% | 34.3% | NA
Switch to/from specific application category (26): switch to/from a specific application category (e.g., messaging), while the previous one was different.
Application categories considered: CodeReview [PS], DeveloperTool [44], IDE [44], Idle [PS, 37], IM [PS, 28], Mail [PS, 28], Music [PS, 5, 28, 48], Navigate, Read/Write Document [44], TestingTool [44], Utility, WebBrowser [PS, 44], Unknown. Importance: 28.0% | NA | NA
- Same application category (1): the current application category is the same as the one in the previous application segment, e.g., both are messaging [28, 48]. Importance: 1.4% | NA | NA
- Time spent per application category (13): the percentage of the total duration of the task segment that was spent in each of the 13 application categories [33, 50, 76]. Importance: NA | 34.3% | NA

Switching Frequency Features (16.6% | 13.7% | 19.0%)
- Difference in the window switch frequency (1): difference in the number of switches between windows of the same or a different application per second between the current and the previous application/task segment [33, 55, 76]. Importance: 7.2% | 13.7% | 19.0%
- Difference in the time spent in an application (1): difference in the total duration spent between the current and the previous application segment [76]. Importance: 9.4% | NA | NA

Lexical Features (8.2% | 0% | 0%)
- Code in window title (1): the window titles of the current and previous application/task segments both contain code, as identified by text written in camelCase or snake_case. Can also distinguish between development and other file types. Importance: 1.4% | 0% | 0%
- Lexical similarity of the window titles and application names (2): cosine similarity based on the term frequency-inverse document frequency (TF-IDF) between the current and previous application segments' window titles or application names [9, 55, 76]. Importance: 6.8% | NA | NA

References on these features are either to previous related work or to participants' suggestions (PS). A feature importance of NA denotes that the feature was not used for the prediction group. For the task type columns, 'All' denotes that all features were considered, and 'UI' indicates that only the user interaction features were used and the application category features were ignored.
Numbers in brackets show feature counts.

Application Category Features
We categorized commonly used applications into one of 13 predefined application categories, based on our classification in previous work [44] and on participants' suggestions of what they consider to be good indicators for switching to another task. These include categories specific to software engineering, such as DeveloperTool, CodeReview or TestingTool, but also more general ones, such as Read/Write Document, Email and Web Browser. They are leveraged in 26 features that capture switches to or from a specific application category, such as switching to a messaging application or becoming idle. Since switching to another application might be another indication of a task switch [28, 48], we added one feature that captures whether the current application category is the same as the previous one.

Switching Frequency Features
In the post-study questionnaire, participants mentioned that they often navigate through several applications to clean up their computer right before starting a new task, which is why we added a temporal feature based on the window switching frequency. Another feature captures the difference in the time spent in an application, since this might be another indicator for a task switch: a switch is less likely immediately after a task switch, and the likelihood of a task switch increases as time passes [76].

Lexical Features
Inspired by prior work [9, 55, 76], we also added three lexical/semantic features that are extracted from application names and their window titles. The textual data was first pre-processed to produce lists of words via tokenization on punctuation and whitespace. From these lists we also removed common stop words such as "and", "the", and "or". Since window titles might include code snippets, such as a class or method name or a development file type, we added a feature that captures whether the window title contains text written in camelCase or snake_case, and whether this is different from the previous segment.
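A minimal sketch of these lexical features follows (our own illustration, not the study's code; the regexes and tokenizer are simplified assumptions, and the TF-IDF weighting is replaced by a plain term-frequency cosine for brevity):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"and", "the", "or"}
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")  # e.g., computeFeatures
SNAKE_CASE = re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b")   # e.g., task_switch

def title_contains_code(title: str) -> bool:
    """Feature: window title contains camelCase or snake_case identifiers."""
    return bool(CAMEL_CASE.search(title) or SNAKE_CASE.search(title))

def tokenize(text: str):
    """Tokenize on punctuation/whitespace and drop common stop words."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text)
            if t and t.lower() not in STOP_WORDS]

def title_similarity(a: str, b: str) -> float:
    """Cosine similarity of term-frequency vectors of two window titles."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

For example, `title_contains_code("fix computeFeatures bug")` returns True, while two window titles with no shared tokens yield a similarity of 0.0.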
To determine whether the previous and current application segments have a contextual similarity, two features are calculated based on the cosine similarity of the window titles and application names using the term frequency-inverse document frequency (TF-IDF). Note that the application name and window titles were also used to determine the application category features. In addition, and unlike some previous work, we explicitly did not capture file contents, to reduce intrusiveness and avoid privacy concerns [65, 80, 88].

4.2.4 Task Type Features

For the task type detection, we reused the same features as in the task switch detection whenever possible. However, some features required adaptation or made no sense in this context. First, as the time window for task type detection encompasses one or multiple application segments, we replaced the application category features with a feature that captures the ratio between the time spent in the specific application category and the time spent in the task segment. This allowed us to determine the dominant application category in a task segment. Second, we eliminated the lexical similarity features, as these are computed based on an application segment's similarity to another segment; in the task type detection scenario, we have no comparable ground truth with which to calculate such features. This resulted in a total of 25 features used for the task type detection.

4.2.5 Outcome Measures

For the task switch detection, we labeled each application segment either with Switch or NoSwitch, depending on whether we observed a task switch (in study 1) or whether the participant self-reported a task switch (in study 2). While our model is able to detect task switches at the granularity of application segments, an actual switch might happen while using the same application.
Thus, our task switch detection approach is at most the duration of an application segment away from the actual task switch, which was an average of 1.6 minutes (±2.2) in our study. For the task type detection, we labeled each task segment with the observed or self-reported task type. Descriptive statistics regarding participants' task switching behavior and the task types they worked on can be found in Chapter 4.3.1 and Chapter 4.4.2, respectively.

4.2.6 Machine Learning Approach

We used scikit-learn [59], a widely used machine learning library for Python, to predict task switches and task types. We evaluated several classifiers by applying them to our feature set and testing different hyperparameters. A RandomForest classifier with 500 estimators outperformed all other approaches, including a Gradient Boosting classifier, Support Vector Machine (SVM), Neural Network and Hidden Naïve Bayes classifier. Details on the hyperparameters of the evaluated classifiers can be found in the supplementary material [45]. A RandomForest classifier is a form of ensemble learning that creates multiple decision tree classifiers and aggregates their predictions using a voting mechanism [10, 35]. It does not require a pre-selection of features and can handle a large feature space that also contains correlated features. Hence, for the remainder of this chapter, the presented results were obtained using a RandomForest classifier. Prior to classification, we impute missing values by replacing them with the mean and apply standardization of the features, which centers the data to 0 and scales the standard deviation to 1. These common steps in a machine learning pipeline can improve a classifier's performance [60]. For the task switch detection, we further apply Lemaître's implementation of SMOTE, a method for oversampling that can considerably boost a classifier's performance in the case of an imbalanced dataset such as ours [34].
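A minimal sketch of such a pipeline with scikit-learn follows (our own illustration on synthetic data; the SMOTE oversampling step from the imbalanced-learn library is omitted here, and class imbalance is instead handled with a class-weight penalty, as used for the task type detection):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Mean imputation and standardization, followed by a RandomForest with
# 500 estimators; class_weight="balanced" penalizes errors on the
# minority class (for switch detection, SMOTE would be applied instead).
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=500,
                                   class_weight="balanced",
                                   random_state=0)),
])

# Synthetic stand-in for the feature matrix: 100 samples, 5 features,
# with some missing values and binary Switch/NoSwitch labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[::10, 0] = np.nan
y = (X[:, 1] > 0.5).astype(int)
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

Bundling imputation and scaling into the pipeline ensures they are fitted on training data only when the pipeline is cross-validated.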
For the task type detection, where as much as 80-90% of the reported types are of the Development class, we instead employ penalized classification to correct problems caused by class imbalance, as SMOTE has significant drawbacks when the minority classes have a limited number of samples [53].

We built both individual and general models, where an individual model is trained and tested with data solely from one participant, and a general model is trained on data from all participants except one and tested on the remaining one. Individual models often have a higher accuracy since they are trained on a person's unique behavioral patterns. On the other hand, general models are usually less accurate but have the advantage of solving the cold-start problem, which means that no prior training phase is required and the model can be applied to new users immediately.

To evaluate the individual models, we applied a 10-fold cross-validation approach, where the model was iteratively tested on 1/10 of the dataset while being trained on the remaining data. We adapted the cross-validation approach to account for the temporal dependency of the samples. In particular, there is a dependency between samples in close temporal proximity, since data from the preceding samples is incorporated in the features. To ensure a valid and realistic evaluation of the model [63], we therefore deleted h samples on either side of the test set block. In our case, we chose h=10 since we included up to 10 preceding samples in the feature calculation (see Chapter 4.2.3). The cross-validation approach is illustrated in Figure 4.3.

Figure 4.3: Cross-validation approach for the individual models, leaving a gap of 10 samples before and after the test set to account for the dependence of samples in close temporal proximity.

4.3 Results: Detecting Task Switches

4.3.1 Descriptive Statistics of the Dataset

Participants switched frequently between tasks, with a mean task switch rate of 6.0 (±3.7, min: 1.8, max: 18.9) times per hour.
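The gapped cross-validation illustrated in Figure 4.3 can be sketched as a sequential k-fold split that drops the h samples adjacent to each test block from the training set (the helper below is our own illustration, not the authors' code):

```python
def gapped_kfold(n_samples, n_folds=10, h=10):
    """Yield (train, test) index lists; the training set excludes a gap
    of h samples on either side of the test block to limit temporal
    leakage between dependent neighboring samples."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        stop = (k + 1) * fold_size if k < n_folds - 1 else n_samples
        test = list(range(start, stop))
        train = [i for i in range(n_samples)
                 if i < start - h or i >= stop + h]
        yield train, test
```

For example, with 100 samples and h = 10, the first fold tests on samples 0-9 and trains only on samples 20-99.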
The average time spent on each task was 13.2 (±7.3, min: 3.1, max: 30.8) minutes.[3] Developers' task switch behaviors are similar to those reported in previous work [22, 43].

[3] We do not report individual results for the two studies, since the task switch rate (p-value = .056) and time spent on a task (p-value = .215) are not significantly different in the two datasets.

4.3.2 Task Switch Detection Error Distance and Accuracy

To analyze how well our task switch predictions work, we ran a first discrete analysis by calculating the error distance between each predicted and the actual task switch. The average error distance is 2.79 (±2.30) application switches, meaning that in case a task switch was not detected at the exact moment, it is on average 2.79 application switches before or after the predicted one. To put this in context, multiplying the average application segment length of 1.6 (±2.2) minutes (Chapter 4.2.5) by the error distance results in an average of only 4.46 minutes that a task switch is predicted before or after the actual one. Of all task switches that were not detected at the exact moment, 44.7% have an error distance of 1 application switch, 15.8% have a distance of 2 application switches, and 39.5% have a distance of 3 or more application switches.

Table 4.3 gives an overview of the task switch detection performance of individual and general models. We split the presentation of the data into the two studies, since the data were collected with different methods. As a baseline, we report the results of a random classifier, where the likelihood of predicting a certain class is based on the class distribution of the training set.

Overall, our analysis revealed that we can detect task switches at their exact location with a high averaged accuracy of 84% (precision: 62%, recall: 35%, kappa: 0.34) when trained with individual models.
Applying the general model, we achieved an averaged accuracy of 73% as well as a higher recall of 55% and lower scores in both precision (46%) and kappa (0.27). Overall, despite these differences, we found the two models to be similar in performance as judged by both AUC (74% for individual vs. 75% for general) and F1-score (43% vs. 40%). Compared with our baseline classifier, both the individual and general model show substantial improvements across the board, with the exception of recall in the individual model. Note that this does not mean that the baseline necessarily performed better in this case, only that our model was much more selective in its predictions, as is reflected in the higher precision score. For the individual models, we compared the results of each participant's model (see supplementary material [45]). This comparison reveals that the prediction performance varies quite substantially across participants.

Table 4.3: Overview of the performance of detecting task switches, for both individual and general models (Accuracy | AUC | F-Score | Precision | Recall).

Study 1: Observations. Individual: 82% | 76% | 44% | 57% | 38%. General: 70% | 78% | 46% | 46% | 61%
Study 2: Self-Reports. Individual: 86% | 72% | 42% | 67% | 31%. General: 66% | 72% | 36% | 43% | 55%
All. Individual: 84% | 74% | 43% | 62% | 35%. General: 67% | 75% | 40% | 46% | 57%
Baseline. Individual: 55% | 49% | 24% | 18% | 42%. General: 51% | 50% | 25% | 18% | 48%

4.3.3 Task Switch Feature Evaluation

A RandomForest classifier can deal well with a larger number of features, which makes prior dimensionality reduction of our 84 features obsolete [10, 35]. While we do not apply a feature selection technique in our approach, since it would only select the most predictive features in the model, we are still interested in learning whether certain features are generally more important, especially across different participants. The second column of Table 4.2 contains the feature importance as attributed by the RandomForest classifier using all features, averaged over all participants' individual models.
To calculate the feature importance metrics, we used the Gini impurity measure from scikit-learn, which captures a feature's ability to avoid misclassification [59]. The most predictive feature groups are user input (45.8%) and application category (29.4%). The feature group with the least predictive power is the lexical features (8.2%). The supplementary material includes the importances of each individual feature [45].

4.4 Results: Detecting Task Types

4.4.1 Identified Task Type Categories

As described in more detail in Chapter 4.1, we inferred task type categories after collecting task and task switch data from observing 12 developers at work and performing a Thematic Analysis. This resulted in the nine task type categories described in Table 4.4. In the post-study questionnaires of study 2, participants reported that they agreed with the identified task types and generally had no issues assigning them. However, two participants mentioned that a task type for Support duties was missing:

“[Support]-Duties. These are very specific tasks that require a lot of different things to do. It's not Development and it can be a lot of ad-hoc and requires many context switches.” - P14

Two participants mentioned that it was sometimes difficult to know if time spent on emails should be assigned to Development or Awareness & team:

“I was sometimes unsure of how to classify the time I spent responding to emails. I generally classified it as development since most of the emails were development-related.” - P21

Most of our task type categories are consistent with previous work that investigated knowledge workers' tasks [15, 31, 33, 66]. For example, Meetings, Administrative, Planning and Private were also prevalent in both Kim et al.'s and Czerwinski et al.'s work [15, 31]. Kim et al. further divided project work (in our case Development tasks) into Documenting and Conceptualizing Ideas, Environment and Development and Design.
We did not make these finer-grained distinctions since we did not want to make the self-reporting of task types in the second study too complicated, which would degrade the quality of self-reports.

Table 4.4: Overview and descriptions of the task type categories, the average time developers spent on each task type per hour of work, and the performance of our task type detection approach, for both individual and general models. (Each entry lists the average minutes per hour, the sample size, then precision/recall for the individual models with All and UI features, and for the general model with All and UI features.)

Development: bug-fix, refactoring, code review, implementing new feature, reading/understanding documentation/code, testing, version control, dev.-related learning. 37.2 (±12.2) mins/h, 612 samples. Individual: 70%/85% (All), 62%/77% (UI). General: 59%/77% (All), 50%/78% (UI).

Personal: work-unrelated web browsing, private emails or texts, (bio or lunch) break. 9.7 (±7.0) mins/h, 170 samples. Individual: 48%/45% (All), 42%/34% (UI). General: 33%/32% (All), 32%/23% (UI).

Awareness & team: reading/writing emails, discussions/answering questions in IM. 5.3 (±6.0) mins/h, 234 samples. Individual: 64%/53% (All), 40%/35% (UI). General: NA.

Administrative: often routine tasks, e.g., reporting work-time, expenses report, paperwork. 4.0 (±3.6) mins/h, 12 samples. Individual: 50%/17% (All), 33%/8% (UI). General: NA.

Planned Meeting: attending a scheduled meeting/call, e.g., weekly scrum, weekly planning meeting. 3.6 (±2.7) mins/h, 94 samples. Individual: 40%/40% (All), 33%/31% (UI). General: 10%/4% (All), 2%/1% (UI).

Unplanned Meeting: attending an ad-hoc, informal meeting, usually with one team-member only, e.g., unscheduled phone call, colleague asking a question. 3.1 (±2.8) mins/h, 90 samples. Individual: 43%/29% (All), 36%/29% (UI). General: 41%/25% (All), 30%/18% (UI).

Planning: in the calendar, task list, work item tracker. 3.0 (±3.5) mins/h, 90 samples. Individual: 31%/24% (All), 26%/16% (UI). General: 16%/1% (All), 1%/0% (UI).

Other: tasks that do not fit into the other categories.
Participants mentioned that these were support-duty, document writing (e.g., in PowerPoint, Word) and product development/innovation. 2.7 (±3.9) mins/h, 40 samples. Individual: 47%/25% (All), 3.7%/2.5% (UI). General: 0%/0% (All), 2%/1% (UI).

Study: work related to this study (e.g., talking to observer, filling out questionnaire). 1.9 (±1.9) mins/h, 64 samples. Individual: 67%/58% (All), 49%/41% (UI). General: 78%/69% (All), 29%/20% (UI).

All: 1406 samples. Individual: 59%/61% (All), 46%/49% (UI). General: 44%/50% (All), 33%/41% (UI).

Baseline: 1406 samples. Individual: 30%/30% (All), 30%/30% (UI). General: 24%/24% (All), 24%/24% (UI).

The 'All Features' entries show results from models trained with all features, while the 'UI Features' entries show results from models trained using only user interaction features (i.e., excluding application category features).

4.4.2 Descriptive Statistics of the Dataset

On average, developers worked on 6.1 (±1.6, min: 3, max: 9) different task types during the studied time periods, indicating that most of the identified task types are relevant to all developers. The majority of developers worked on Development, Awareness & team, Personal and Planning tasks on a daily basis. Only five developers worked on Administrative tasks during the study period, indicating that for many developers this is not a task they spend time on very often. The task type participants self-reported having spent the most time on is Development, with an average of 37 (±12) minutes spent for every hour of work. Participants also spent a surprisingly high amount of time, almost 10 minutes per hour of work, on Personal tasks, including work-unrelated browsing and messaging. Table 4.4 reports details for all task types as well as the number of participants who self-reported having worked on the task type.

We also analyzed whether having a higher diversity in work (i.e., working on more different task types) correlates with developers switching more between tasks.
There is a weak, not statistically significant positive correlation (Pearson's r = 0.32, p = .12), which suggests that there are other, more important reasons causing developers to switch tasks.

4.4.3 Task Type Detection Accuracy

Table 4.4 shows the results of our task type detection approach across all 9 task type categories. We omit the accuracy metric in this table: recall is a measure of individual class accuracy, and since the recall presented in the All row is weighted by class size, it assumes exactly the same value as accuracy. As with the task switch detection analysis, we trained both individual models and one general model trained on all participants. The Administrative task type was not predicted a single time by the general classifier, and thus the precision scores were undefined for this class. Similarly, the task types Planning and Other had low precision and recall values, since the sample size used for training these types was small. In general, the individual models (precision 59%, recall 61%) outperformed the general model by a large margin (precision 44%, recall 50%).

Figure 4.4: Confusion matrix for task type prediction.

One important aspect of our approach that distinguishes our classifier from previous work (e.g., [5, 28, 48]) is its ability to make predictions even on previously unseen applications. To demonstrate this, we split the results into two categories: with the manual application category mappings (All Features) and without (UI Features). The UI Features include all user interaction features, but exclude application features. While the combined approach proved to be superior, user input features still proved to have high predictive power on their own. Overall, there was a 28.2% increase in precision when including the application category features, and a 24.5% increase in recall.

We also found there was a substantial difference in performance depending on the task type category.
The Development task type proved to be the easiest to predict, achieving high recall (85%) and precision (70%) scores. Conversely, the Planning task type saw very poor results, with only 24% recall and 31% precision. These results are somewhat in line with what one might expect; naturally, some task categories are more difficult to predict than others. For instance, discerning the nature of a meeting (planned or unplanned) based purely on a user's application usage and input activity seems to be nearly impossible. As seen in Figure 4.4, there is substantial confusion between some categories, especially between the two meeting categories (Planned Meeting and Unplanned Meeting) and the Personal category. These categories tended to have a high amount of time spent idle, meaning the participant was away from their computer, which naturally makes correct predictions exceptionally difficult. As a consequence of the dominance of Development samples in our dataset, our classifier also exhibits a strong bias towards predicting the Development category. While a larger sample size would likely help reduce this bias, it is of note that the Development category is also the one participants spent the majority of their time in, 37.2 (±12.2) minutes on average for every hour of work.

4.4.4 Task Type Feature Evaluation

The third and fourth columns of Table 4.2 show the Gini feature importances we calculated for our RandomForest classifiers, averaged over all participants. When considering All Features, we found the time spent per application category features to have by far the greatest importance (39.1%), followed by the keystroke features (17.5%). However, the combined user input feature group contributed more than any other feature group (47.6%). The lexical features did not contribute at all to the results of the classifier, which suggests there is room for improvement in this area, as window titles can contain a substantial number of hints that could help identify a specific task.
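Retrieving such Gini importances from a fitted scikit-learn RandomForest can be sketched as follows (our own illustration on synthetic data; the feature names are examples from Table 4.2, not the full feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["keystroke_diff", "mouse_click_diff", "time_in_ide_pct"]

# Synthetic data in which only the third feature determines the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, len(feature_names)))
y = (X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ holds the mean decrease in Gini impurity per
# feature, normalized to sum to 1 across all features.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: -pair[1])
```

Averaging these per-model importances over all participants yields aggregate values like those reported in Table 4.2.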
The supplementary material includes individual task type featureimportances.4.5 DiscussionIn this section, we discuss implications of our results, possible improvements toautomated task switch and type detection, and practical applications of automatedtask detection in real-world scenarios.4.5.1 Improving Task Switch DetectionWe found that for the task switch detection, the individual models perform quitesimilarly to the general model overall, even though the prediction performancevaries quite substantially for each participant. This suggests that using the general60classifier is accurate enough to solve the cold-start problem. For practical, real-world applications, we therefore suggest using a general model as a default, andthen allowing the user to improve the classifier by training it. As we found in study2, collecting periodic self-reports over just a few days is feasible in real-worldscenarios and may even lead to some insights about work itself.More research is required to explore reasons for and better balance the individualdifferences in developers’ task switching behaviors. This includes investigating thecharacteristics of inaccurately classified task switches and consider additional datasources. For example, we could imagine to include information about a developer’spersonality and company culture to train a classifier that works well for developerswith similar work habits, instead of building a general one for everyone. Futurework could also study the predictive power of features extracted from additional datasources, such as emails, calendars, biometrics (e.g., detecting when a user is awayfrom the computer), and more detailed development related data (e.g., activitiesinside the IDE).The relatively low feature importances of our lexical features shows furtherpotential to more effectively leverage contextual information. 
Besides calculating lexical similarity based on the cosine similarity (TF-IDF) of window titles, we also experimented with variations, such as an unweighted term frequency metric and two different word embedding models: one trained on Wikipedia, and one trained on StackOverflow, which has been shown to produce embeddings that are more closely related to the domain of software engineering [19]. They led to even less predictive features, which is why we did not report them separately. One reason could be the limited overlap in the window title data. Window titles generally capture only the application name and the name of the current file, document, email or website, which limits overlaps with other window titles. Including the actual contents of these resources could be one way to overcome these limitations, but could result in privacy concerns, as discussed in more detail in the next section.

4.5.2 Improving Task Type Detection

For the task type predictions, we found that the individual models outperform the general model, with an overall accuracy of 61% compared to 50%. Even though we collected data from a rather large sample of 25 participants (compared to similar work), we were not yet able to build highly reliable general models that could solve the cold-start problem. The difficulty of discovering common patterns across all participants emphasizes how individual and diverse developers' tasks are.

We see our work as a first step towards better understanding and automatically characterizing developers' task context. With our models' ability to automatically detect task switches based on data collected through our computer interaction monitoring, a next step could be to collect a more in-depth set of data in between two task switches, from more participants over a longer period of time. For example, IDE extensions (e.g., Feedbag [2] or WatchDog [7]) could be leveraged to identify the code files, code reviews, and projects the developer has been working on, browser trackers (e.g.,
RescueTime [67]) could identify and classify the websites a developer visited, and integrations into the email or IM client could help to understand which people a developer communicated with. To better manage these large amounts of data, research will need to come up with approaches to model and summarize task context, and task types are a first step in that direction. We have so far found little work on automatically detecting, characterizing and summarizing (developers') tasks [5, 33, 50].

While more fine-grained lexical data, such as file or website contents (as applied in [65, 80, 88]) or participants' actual keyboard input, could be leveraged to improve our models, it might also reveal details about the company's products or the developers' work and personal life that they are not comfortable sharing. To minimize privacy concerns, we had to find a trade-off between intrusiveness, by capturing only a minimal set of data, and completeness, by monitoring as much as possible to get enough data to predict task switches and types in the field. To earn participants' trust when capturing potentially sensitive data, we were also transparent about what data we collect and how it is used, allowed participants to review the data before sharing it with us, and made it possible to pause the monitoring application at any time that seemed particularly sensitive to them.

4.5.3 Reducing the Prediction Delay

Ideally, task switch and type detection would be close to real-time, i.e., close to the exact time a switch occurs. With our approach, there can be a prediction delay of at most one unique application segment, on average 1.6 minutes (±2.2), when predicting a task switch. This delay is considerably smaller compared to previous approaches that applied fixed window lengths of (usually) 5 minutes (e.g., [33, 50, 51, 55, 74, 76]).
Nonetheless, future work could further reduce the prediction delay by shortening the smallest possible segment size, in our case application switches. This would also allow identifying switches within an application, such as when a developer is switching tasks inside the web browser or IDE.

4.5.4 Applications for Automated Task Detection

An active area of research aims to better support developers' frequent task switching, for example by supporting the resumption of interrupted tasks or by easing task switching. So far, most approaches are limited to developers' manual identification of task switches, and their evaluations have pointed out the challenges this poses for them. Our approach demonstrates the feasibility of automatically detecting task switches and types in the field, based on a few hours of training data, which makes it possible to increase the value of previous approaches significantly and stimulate new research and tool support. Notably, tool support would greatly benefit from the improvements we discussed in the sections above. In the post-study questionnaire of study 2, participants described concrete applications that we qualitatively analyzed and related to prior work, which resulted in the following three main opportunities for applying automated task detection:

One application of an almost real-time detection of task switches that 8 (out of 13) participants described is to actively reduce task switching. This includes automatically blocking notifications from email, instant messaging or social networks when a developer is focused on a (challenging) task, to allow extended times of deep focus:

“What if Windows has a built-in and personalized model about when to give you notifications.
I feel like there is a good middle ground between forcing the user to turn off notifications from the OS and having too many notifications interrupting the user.” - P25

Reducing task switching at times of high focus could greatly reduce multi-tasking, a major source of stress and quality issues [22, 38–40]. Similarly, an automated task switch detection could improve interruptibility classifiers and postpone in-person interruptions from co-workers to task switch borders, times at which they are less costly [20, 90, 91].

Another application of automated task detection could be to support the resumption of suspended or interrupted tasks. Participants did not suggest this application themselves, but 8 (out of 13) rated it as 'useful' or 'very useful' in a follow-up question of the final questionnaire. According to Parnin and Rugaber, a major challenge of task resumption is rebuilding the interrupted task's context [57]. Summarization approaches similar to those seen in other areas of software development [52, 85] could be used to present cues to the user upon returning to the suspended task, which has been shown to considerably reduce the resumption lag [1, 4, 71]. While previous approaches, such as TaskTracer [18], Scalable Fabric [68], GroupBar [78] and Mylyn [29], allow the capturing and presentation of task context, they require the user to manually group related artifacts or manually state the start and end of a task, thus reducing the chances of long-term adoption. Even though there is room for improvement as discussed above, our approach can serve as a starting point to automate these approaches, since it can already be beneficial to receive help with resuming some tasks, as long as they are detected correctly.

A third application that 10 (out of 13) participants suggested is to use automated task detection to increase their awareness about task work and time spent on tasks, which could help to identify opportunities for work habit and productivity improvements.
This is in line with a survey of 379 developers that showed the most often mentioned measurement of interest when reflecting on productivity is the tasks developers made progress on and completed in a workday [43]. An aggregated visualization of the automatically inferred tasks could give developers insights such as how much time they spend on different tasks, when they worked on planned versus unplanned tasks, or their multi-tasking behaviors:

“It can help point out different working styles that are also effective and efficient. Not everyone works in the same way.” - P24

Recently, researchers started building retrospective dashboards for developers [7, 12, 42, 86] and other knowledge workers [32, 67, 87], usually by visualizing data on the level of applications or application categories, but suggesting that a per-task level would be more beneficial. An increased awareness of one's task switching behavior could support developers in identifying goals that help to maintain and improve good work habits, such as reducing multi-tasking or actively blocking notifications from distracting services and websites at times they need to focus. Participants further suggested that the data could help to reduce administrative workloads that require them to report time spent at work:

“We're often asked to report at the end of the month how much time we spent on support requests (...) versus development work. That kind of info is tedious to track manually, but a tool could generate an automatic report as needed, allowing for more accurate counts.” - P22

Lu et al. recently showed that the lack of logs of activities and tasks is often a hindrance to transferring them into time reports [36].
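To illustrate the kind of aggregation such an automatic report could perform, the following is a minimal sketch that sums per-task time over automatically labeled segments; the segment data and task names are hypothetical, not drawn from our studies:

```python
from datetime import datetime, timedelta
from collections import defaultdict

def aggregate_task_time(segments):
    """Sum up time per task label from (start, end, task) segments."""
    totals = defaultdict(timedelta)
    for start, end, task in segments:
        totals[task] += end - start
    return dict(totals)

# Hypothetical segments, as a task detector might emit them.
t0 = datetime(2020, 3, 2, 9, 0)
segments = [
    (t0, t0 + timedelta(minutes=50), "development"),
    (t0 + timedelta(minutes=50), t0 + timedelta(minutes=65), "support request"),
    (t0 + timedelta(minutes=65), t0 + timedelta(minutes=95), "development"),
]

report = aggregate_task_time(segments)
# report maps "development" to 80 minutes and "support request" to 15 minutes
```

A real report generator would additionally need to merge adjacent segments of the same task and handle idle periods, but the core computation remains this aggregation.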
While a few time-tracking tools already exist (e.g., DeskTime [17], TimeDoctor [83]), they all require users to manually specify the start and end of a task.

4.6 Threats to Validity

Observing Developers in Study 1

The internal validity of our results might be threatened by the presence of the observers during the observation sessions, causing developers to diverge from their regular work habits, e.g., taking fewer breaks than usual. Observing participants on a single day only might not be representative of a participant's regular workday. We tried to mitigate these risks by not interacting with participants during the observations, splitting up the session into two two-hour blocks, sitting as far away from the participant as possible, telling co-workers beforehand that they could still communicate and interrupt as usual, and by allowing the participant to pick an optimal timeslot that is representative of their usual work. Our observational study has the advantage that, rather than performing a lab study or experimental exercise, participants were observed during their real-world work, thus increasing generalizability and realism. However, the above-mentioned risks of observing developers at their workplace make it very difficult to scale observational studies and observe developers over many days. Hence, we did not rely only on observations, but also on participants' self-reports, thereby combining two methods and strengthening our overall approach.

Self-Reporting in Study 2

While collecting participants' task data using self-reports has proven to be a valuable approach to scale the collection of labeled data for supervised learning, there are a few limitations. First, we rely on the accuracy of participants' self-reports. For example, they might not always have been able to accurately remember their tasks, or filling out the pop-up regularly might be perceived as cumbersome after a while.
In Chapter 4.1.2, we describe our actions to minimize these risks in detail, including the ability to postpone a pop-up and the collection of confidence ratings. Aiming to make the self-reporting as easy as possible required limiting the self-reports to segments with the granularity of an application switch and excluding application switches shorter than 10 seconds. This is why our models are unable to detect task switches within an application, as well as very short ones. Since developers switch between applications very frequently, on average every 1.6 minutes (±2.2), our model is able to predict a task switch within the same time frame. Future work could investigate how to give participants good-enough cues that allow them to accurately self-report switches within applications (e.g., switching from a news website to the work item tracker in the browser) without making the interface too cluttered. Finally, the reliance on collecting computer interaction data only, instead of also including other sensors such as heart-rate monitors or cameras, limits our knowledge of what is happening when there is no input to the computer, e.g., in the case of idle times from using the smartphone, reading a document without scrolling, or a discussion with a co-worker.

Sample Size

A further threat to the external validity of our results could be the number of participants. A higher number of participants might have led to a more robust general model to predict task switches and task types. Nonetheless, collecting task data from 25 participants is considerably more than what was reported in previous work (between 1 and 11 participants). We tried to mitigate this threat by selecting participants from four different software companies in various locations.

Task Definitions

The construct validity of our results might be threatened by our definitions of a task (switch) and our open coding approach to identify task type categories.
To minimize this risk, we based our definitions of task, task switch and task type on previous work, and asked participants about their own definitions in both studies (Chapter 4.1).

Chapter 5

Discussion

In the previous two chapters we presented two different approaches: the first to automatically identify and describe a software developer's tasks, and the second to automatically detect a software developer's task switches and types. While we developed and evaluated these approaches in separate studies, they are highly complementary. In this chapter we discuss some of the ways in which the approaches could complement each other in future applications.

5.1 Improving Task Identification

In developing the task identification approach presented in Chapter 3, we made the assumption that a developer's task switches were already known. In a real use case scenario, we cannot expect a developer to manually indicate all of their task switches. Such a requirement would limit the uptake of this approach, and even if developers intended to indicate their task switches, it is highly likely that they would frequently forget to do so [29]. By incorporating the approach presented in Chapter 4 for task switch and type detection, we could fully automate the process by automatically detecting the boundaries of tasks. In addition, the rough type categorization generated could be used to improve the generation of task representations. For example, certain task types might have some words which are strongly associated with them. If these words occur within the information extracted from a task, they could be assigned a more prominent relevancy score.

5.2 Improving Task Switch Detection

The approach to task switch detection presented in Chapter 4 achieves reasonably high accuracy, precision and recall scores, especially in comparison to previous work. However, there is still substantial room for improvement.
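The type-based boosting suggested in Section 5.1 could be sketched as follows; the type lexicons, boost factor, and base relevancy scores here are purely illustrative assumptions, not values from our studies:

```python
# Hypothetical lexicons of words strongly associated with a task type.
TYPE_LEXICONS = {
    "development": {"refactor", "compile", "merge", "debug"},
    "planning": {"milestone", "estimate", "backlog", "sprint"},
}

def boost_relevancy(scores, task_type, boost=1.5):
    """Raise the relevancy score of words associated with the detected task type."""
    lexicon = TYPE_LEXICONS.get(task_type, set())
    return {word: score * boost if word in lexicon else score
            for word, score in scores.items()}

# Hypothetical base scores produced by the task identification approach.
base = {"debug": 0.40, "browser": 0.25, "sprint": 0.10}
boosted = boost_relevancy(base, "development")
# "debug" is boosted; "browser" and "sprint" are unchanged for this type
```

In practice, the lexicons themselves could be learned from labeled segments rather than curated by hand, and the boost factor tuned against held-out data.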
One direction for future work to investigate could be the use of a semi-automatic approach, in which task switches are predicted automatically but users are prompted occasionally to correct erroneously detected switches. Since it can be difficult for developers to remember what they worked on at a specific time during their work day, the task identification approach presented in Chapter 3 could be used to create word clouds that serve as visual aids to help a developer determine where the boundaries of a task correctly belong. Such a semi-automatic approach would also present an opportunity to gather further ground truth information. By adapting our task switch detection approach to use a different kind of model, for example a long short-term memory (LSTM) based model, this information could be used for online learning. This would allow more individualized models to be made available to all users. LSTM models are also particularly effective at processing sequential data such as the data gathered from a developer's computer interactions.

5.3 Fully Automatic Task Support

The approaches presented in Chapters 3 and 4 represent a step towards truly fully automatic task support systems. Revisiting the Mylyn system, the existing constraints which contribute to developer overhead when using the system lie mainly in requiring developers to indicate their task switches and to label their tasks. The former issue is resolved by our approach to task switch detection, while our approach to associating task descriptions and generating word clouds for tasks resolves the latter. Incorporating these approaches into an updated tool could mean developers would reap all the benefits of the Mylyn system without any of the overhead.
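To make the word-cloud labeling concrete, here is a minimal sketch of TF-IDF scoring over per-segment bags of words; the segments and the smoothed IDF variant are illustrative assumptions, while the thesis's actual pipeline builds the bags from OCR'd screen content using NLTK:

```python
import math
from collections import Counter

def tfidf(segments):
    """Score each word in each segment: term frequency x smoothed inverse document frequency."""
    n = len(segments)
    # Document frequency: in how many segments does each word appear?
    df = Counter(word for seg in segments for word in set(seg))
    scores = []
    for seg in segments:
        tf = Counter(seg)
        scores.append({
            w: (c / len(seg)) * (math.log((1 + n) / (1 + df[w])) + 1)
            for w, c in tf.items()
        })
    return scores

# Hypothetical bags of words extracted from three task segments.
segments = [
    ["bug", "report", "duplicate", "bug"],
    ["library", "visualization", "chart"],
    ["bug", "fix", "commit"],
]
scores = tfidf(segments)
top = max(scores[0], key=scores[0].get)  # most prominent word of segment 0
```

The per-word scores are what a word-cloud renderer would map to font sizes, so the most distinctive words of a segment dominate its visual summary.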
Future research will be required to investigate the efficacy of such an update in comparison to the manual alternative, as well as developer opinions.

Chapter 6

Conclusion

Have you ever wondered what you worked on throughout a day, possibly to record time spent on different projects? Have you ever wanted to look back and find where you worked on a particular task, to find what resources you consulted as part of the task?

This thesis introduces and evaluates approaches to help support these goals. First, an approach to automatic task identification which extracts the contents of the active window being worked with on a regular basis, uses optical character recognition (OCR) to transform the contents into tokens and words, and applies TF-IDF with word tokenization (NLTK) to form a vector representation of the task segment. For the purposes of this approach, task segments were formed using manually indicated task switches. The vector representation generated from this approach can be used to help identify which task a segment represents, for a known set of tasks, with an averaged accuracy of 70.6%. Visual representations of a task segment were also generated using TF-IDF scores for each word in a bag of words formed from the screen content of the task segment as a developer worked. Through a survey, we found that participants could determine which task a word cloud for a segment of work represented with reasonable accuracy (67.9% on average) for several tasks. Interestingly, the accuracy rose only modestly when considering identifying the task based on a word cloud formed from all segments comprising work on a task.

Second, an approach to automatic task switch and type detection that extends previous work by using a broader range of temporal and semantic features, by developing new features based on a developer's computer interactions, and by not being limited to capturing task switches and types within the IDE only.
The evaluation of this approach in a field study with 25 professional developers, compared to 1 to 11 participants in previous work, revealed higher accuracy (84% for switches and 61% for types) and less delay in the predictions than comparable prior work.

The results from the evaluations of these approaches show promise for helping to automatically determine the intent and boundaries of a developer's many tasks throughout the day. By enabling the detection of developer intent within the context of a specific task segment, various tools that a developer relies upon can be improved, and new tools can be introduced to help support activities such as time tracking and task resumption.

Bibliography

[1] E. M. Altmann and J. G. Trafton. Task interruption: Resumption lag and the role of cues. 2004.

[2] S. Amann, S. Proksch, and S. Nadi. FeedBaG: An interaction tracker for Visual Studio. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC), pages 1–3. IEEE, 2016.

[3] S. Astromskis, G. Bavota, A. Janes, B. Russo, and M. D. Penta. Patterns of developers' behaviour: A 1000-hour industrial study. Journal of Systems and Software, 132:85–97, 2017.

[4] B. P. Bailey, J. A. Konstan, and J. V. Carlis. The effects of interruptions on task performance, annoyance, and anxiety in the user interface. In INTERACT, 2001.

[5] L. Bao, Z. Xing, X. Xia, D. Lo, and A. E. Hassan. Inference of development activities from interaction with uninstrumented applications. Empirical Software Engineering, 23(3):1313–1351, June 2018. doi:10.1007/s10664-017-9547-8.

[6] M. Basseville, I. V. Nikiforov, et al. Detection of abrupt changes: theory and application, volume 104. Prentice Hall, Englewood Cliffs, 1993.

[7] M. Beller, I. Levaja, A. Panichella, G. Gousios, and A. Zaidman. How to catch 'em all: WatchDog, a family of IDE plug-ins to assess testing. In 2016 IEEE/ACM 3rd International Workshop on Software Engineering Research and Industrial Practice (SER&IP), pages 53–56, 2016.

[8] V.
Braun and V. Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3:77–101, 2006.

[9] O. Brdiczka, N. M. Su, and J. B. Begole. Temporal task footprinting: identifying routine tasks by their temporal patterns. In Proceedings of the 15th International Conference on Intelligent User Interfaces, pages 281–284. ACM, 2010.

[10] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[11] M. J. Coblenz, A. J. Ko, and B. A. Myers. Jasper: An Eclipse plug-in to facilitate software maintenance tasks. In Proceedings of the 2006 OOPSLA Workshop on Eclipse Technology eXchange, eclipse '06, pages 65–69, New York, NY, USA, 2006. Association for Computing Machinery. doi:10.1145/1188835.1188849.

[12] Codealike, 2020. Retrieved March 19, 2020.

[13] I. D. Coman and A. Sillitti. Automated identification of tasks in development sessions. In 2008 16th IEEE International Conference on Program Comprehension, pages 212–217, 2008.

[14] S. Corston-Oliver, E. Ringger, M. Gamon, and R. Campbell. Task-focused summarization of email. In Text Summarization Branches Out, pages 43–50, 2004.

[15] M. Czerwinski, E. Horvitz, and S. Wilhite. A diary study of task switching and interruptions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 175–182. ACM, 2004.

[16] K. Damevski, H. Chen, D. C. Shepherd, N. A. Kraft, and L. Pollock. Predicting future developer behavior in the IDE using topic models. IEEE Transactions on Software Engineering, 44(11):1100–1111, Nov 2018. doi:10.1109/TSE.2017.2748134.

[17] DeskTime, 2020. Retrieved March 19, 2020.

[18] A. N. Dragunov, T. G. Dietterich, K. Johnsrude, M. McLaughlin, L. Li, and J. L. Herlocker. TaskTracer: A desktop environment to support multi-tasking knowledge workers. In Proceedings of the 10th International Conference on Intelligent User Interfaces, IUI '05, pages 75–82. ACM, 2005.

[19] V. Efstathiou, C. Chatzilenas, and D. Spinellis.
Word embeddings for the software engineering domain. In Proceedings of the 15th International Conference on Mining Software Repositories, pages 38–41. ACM, 2018.

[20] J. Fogarty, A. J. Ko, H. H. Aung, E. Golden, K. P. Tang, and S. E. Hudson. Examining task engagement in sensor-based statistical models of human interruptibility. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '05, page 331, 2005.

[21] M. K. Gonçalves, C. R. de Souza, and V. M. González. Collaboration, information seeking and communication: An observational study of software developers' work practices. J. UCS, 17(14):1913–1930, 2011.

[22] V. M. González and G. Mark. Constant, constant, multi-tasking craziness: Managing multiple working spheres. 6(1):8, 2004.

[23] T. Gottron. Document word clouds: Visualising web documents as tag clouds to aid users in relevance decisions. In M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, and G. Tsakonas, editors, Research and Advanced Technology for Digital Libraries, pages 94–105, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.

[24] F. Gustafsson. Adaptive filtering and change detection, volume 1. Citeseer, 2000.

[25] Q. Huang, X. Xia, D. Lo, and G. C. Murphy. Automating intention mining. IEEE Transactions on Software Engineering, pages 1–1, 2018. doi:10.1109/TSE.2018.2876340.

[26] ImageMagick, 2020. [Accessed March 5, 2020].

[27] S. T. Iqbal and B. P. Bailey. Understanding and developing models for detecting and differentiating breakpoints during interactive tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '07, pages 697–706. ACM, 2007.

[28] S. T. Iqbal and B. P. Bailey. Effects of intelligent notification management on users and their tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, pages 93–102, 2008.

[29] M. Kersten and G. C.
Murphy. Using task context to improve programmer productivity. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 1–11, New York, NY, USA, 2006. ACM. doi:10.1145/1181775.1181777.

[30] K. Kevic and T. Fritz. Towards activity-aware tool support for change tasks. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 171–182. IEEE, Oct 2017. doi:10.1109/ICSME.2017.48.

[31] Y.-H. Kim and E. K. Choe. Understanding personal productivity: How knowledge workers define, evaluate, and reflect on their productivity. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, 2019.

[32] Y.-H. Kim, J. H. Jeon, E. K. Choe, B. Lee, K. Kim, and J. Seo. TimeAware: Leveraging framing effects to enhance personal productivity. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16), pages 272–283, 2016.

[33] S. Koldijk, M. van Staalduinen, M. Neerincx, and W. Kraaij. Real-time task recognition based on knowledge workers' computer activities. In Proceedings of the 30th European Conference on Cognitive Ergonomics, ECCE '12, pages 152–159, Edinburgh, United Kingdom, Aug. 2012. Association for Computing Machinery. doi:10.1145/2448136.2448170.

[34] G. Lemaître, F. Nogueira, and C. K. Aridas. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.

[35] A. Liaw, M. Wiener, et al. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[36] D. Lu, J. Marlow, R. Kocielnik, and D. Avrahami. Challenges and opportunities for technology-supported activity reporting in the workplace. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pages 170:1–170:12. ACM, 2018.

[37] W. Maalej, M.
Ellmann, and R. Robbes. Using contexts similarity to predict relationships between tasks. Journal of Systems and Software, 128:267–284, 2017.

[38] G. Mark, D. Gudith, and U. Klocke. The cost of interrupted work: More speed and stress. In CHI 2008: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 107–110, 2008.

[39] G. Mark, S. Iqbal, and M. Czerwinski. How blocking distractions affects workplace focus and productivity. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, UbiComp '17, pages 928–934, 2017.

[40] G. Mark, M. Czerwinski, and S. T. Iqbal. Effects of individual differences in blocking workplace distractions. In CHI '18. ACM, 2018.

[41] M. L. McHugh. Interrater reliability: the kappa statistic. Biochemia Medica, 22(3):276–282, Oct. 2012.

[42] A. Meyer, G. C. Murphy, T. Zimmermann, and T. Fritz. Design recommendations for self-monitoring in the workplace: Studies in software development. PACM on Human-Computer Interaction, 1(CSCW):1–24, 2017. doi:10.1145/3134714.

[43] A. N. Meyer, T. Fritz, G. C. Murphy, and T. Zimmermann. Software developers' perceptions of productivity. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 19–29. ACM, 2014.

[44] A. N. Meyer, L. E. Barton, G. C. Murphy, T. Zimmermann, and T. Fritz. The work life of developers: Activities, switches and perceived productivity. IEEE Transactions on Software Engineering, 43(12):1178–1193, Dec 2017. doi:10.1109/TSE.2017.2656886.

[45] A. N. Meyer, C. Satterfield, M. Züger, K. Kevic, G. C. Murphy, T. Zimmermann, and T. Fritz, 2020. Supplementary material.

[46] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality.
In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[47] H. Mintzberg. The nature of managerial work. Theory of Management Policy series. Prentice-Hall, 1980.

[48] H. T. Mirza, L. Chen, G. Chen, I. Hussain, and X. He. Switch detector: an activity spotting system for desktop. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2285–2288, Glasgow, Scotland, UK, Oct. 2011. Association for Computing Machinery. doi:10.1145/2063576.2063947.

[49] H. T. Mirza, L. Chen, A. Majid, and G. Chen. Building user task space by mining temporally proximate desktop actions. Cybernetics and Systems, 42(8):585–604, 2011.

[50] H. T. Mirza, L. Chen, I. Hussain, A. Majid, and G. Chen. A study on automatic classification of users' desktop interactions. Cybernetics and Systems, 46(5):320–341, July 2015. doi:10.1080/01969722.2015.1012372.

[51] R. Nair, S. Voida, and E. D. Mynatt. Frequency-based detection of task switches. In Proceedings of the 19th British HCI Group Annual Conference, volume 2, pages 94–99, 2005.

[52] N. Nazar, Y. Hu, and H. Jiang. Summarizing software artifacts: A literature review. Journal of Computer Science and Technology, 31(5):883–909, Sep 2016.

[53] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. Learning pattern classification tasks with imbalanced data sets. Pattern Recognition, Oct. 2009. doi:10.5772/7544.

[54] NLTK. Natural Language Toolkit (NLTK), 2020. [Accessed March 5, 2020].

[55] N. Oliver, G. Smith, C. Thakkar, and A. C. Surendran. SWISH: semantic analysis of window titles and switching history. In Proceedings of the 11th International Conference on Intelligent User Interfaces, pages 194–201, 2006.

[56] C. Parnin and R. DeLine. Evaluating cues for resuming interrupted programming tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 93–102, Atlanta, Georgia, USA, Apr. 2010. Association for Computing Machinery.
doi:10.1145/1753326.1753342.

[57] C. Parnin and S. Rugaber. Resumption strategies for interrupted programming tasks. In 2009 IEEE 17th International Conference on Program Comprehension, pages 80–89, 2009.

[58] C. Parnin and S. Rugaber. Resumption strategies for interrupted programming tasks. Software Quality Journal, 19(1):5–34, Mar. 2011. doi:10.1007/s11219-010-9104-9.

[59] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[60] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[61] D. E. Perry, N. Staudenmayer, and L. G. Votta. People, organizations, and process improvement. IEEE Software, 11(4):36–45, July 1994.

[62] L. Ponzanelli, G. Bavota, A. Mocci, M. Di Penta, R. Oliveto, B. Russo, S. Haiduc, and M. Lanza. CodeTube: Extracting relevant fragments from software development video tutorials. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), pages 645–648, May 2016.

[63] J. Racine. Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1):39–61, 2000.

[64] RAKE. Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK, 2020. [Accessed March 5, 2020].

[65] T. Rattenbury and J. Canny. CAAD: An automatic task support system. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '07, pages 687–696. ACM, 2007.

[66] W. Reinhardt, B. Schmidt, P. Sloep, and H. Drachsler. Knowledge worker roles and actions - results of two empirical studies.
Knowledge and Process Management, 18(3):150–174, 2011.

[67] RescueTime, 2020. Retrieved March 19, 2020.

[68] G. Robertson, E. Horvitz, M. Czerwinski, P. Baudisch, D. R. Hutchings, B. Meyers, D. Robbins, and G. Smith. Scalable Fabric: Flexible task management. In Proceedings of the Working Conference on Advanced Visual Interfaces, pages 85–89. ACM, 2004.

[69] M. Robillard and G. Murphy. Program navigation analysis to support task-aware software development environments. pages 83–88, 2004.

[70] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory, pages 1–20. Mar. 2010. doi:10.1002/9780470689646.ch1.

[71] A. Rule, A. Tabard, K. Boyd, and J. Hollan. Restoring the context of interrupted work with desktop thumbnails. In 37th Annual Meeting of the Cognitive Science Society, pages 1–6. Cognitive Science Society, July 2015.

[72] C. Satterfield, 2020. Task Identification and Description Dataset.

[73] J. Shen, L. Li, T. G. Dietterich, and J. L. Herlocker. A hybrid learning system for recognizing user tasks from desktop activities and email messages. In Proceedings of the 11th International Conference on Intelligent User Interfaces, IUI '06, pages 86–92. ACM, 2006.

[74] J. Shen, L. Li, and T. G. Dietterich. Real-time detection of task switches of desktop users. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI '07, pages 2868–2873, Hyderabad, India, Jan. 2007. Morgan Kaufmann Publishers Inc.

[75] J. Shen, W. Geyer, M. Muller, C. Dugan, B. Brownholtz, and D. R. Millen. Automatically finding and recommending resources to support knowledge workers' activities. In Proceedings of the 13th International Conference on Intelligent User Interfaces, IUI '08, pages 207–216, Gran Canaria, Spain, Jan. 2008. Association for Computing Machinery. doi:10.1145/1378773.1378801.

[76] J.
Shen, J. Irvine, X. Bao, M. Goodman, S. Kolibaba, A. Tran, F. Carl, B. Kirschner, S. Stumpf, and T. Dietterich. Detecting and correcting user activity switches: Algorithms and interfaces. pages 117–126, 2009.

[77] J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil. An examination of software engineering work practices. In CASCON First Decade High Impact Papers, CASCON '10, pages 174–188. IBM Corporation, 2010.

[78] G. Smith, P. Baudisch, G. Robertson, M. Czerwinski, B. Meyers, D. Robbins, and D. Andrews. GroupBar: The taskbar evolved. In Proceedings of OZCHI, volume 3, pages 1–10, 2003.

[79] A. D. Sorbo, S. Panichella, C. A. Visaggio, M. D. Penta, G. Canfora, and H. C. Gall. Development emails content analyzer: Intention mining in developer discussions (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 12–23, Nov. 2015. doi:10.1109/ASE.2015.12.

[80] C. A. N. Soules and G. R. Ganger. Connections: Using context to enhance file search. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP '05, pages 119–132. ACM, 2005.

[81] S. Stumpf, X. Bao, A. Dragunov, T. G. Dietterich, J. Herlocker, K. Johnsrude, L. Li, and J. Shen. Predicting user tasks: I know what you're doing. In 20th National Conference on Artificial Intelligence (AAAI-05), Workshop on Human Comprehensible Machine Learning, 2005.

[82] Tesseract. Tesseract open source OCR engine, 2020. [Accessed March 5, 2020].

[83] TimeDoctor, 2020. Retrieved March 19, 2020.

[84] R. Tourangeau, L. J. Rips, and K. Rasinski. The psychology of survey response. Cambridge University Press, 2000.

[85] C. Treude, F. Figueira Filho, and U. Kulesza. Summarizing and measuring development activity. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 625–636. ACM, 2015.

[86] WakaTime, 2020. Retrieved March 19, 2020.

[87] S. Whittaker, V. Hollis, and A. Guydish.
Don't waste my time: Use of time information improves focus. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16), 2016.

[88] I.-C. Wu, D.-R. Liu, and W.-H. Chen. Task-stage knowledge support: coupling user information needs with stage identification. In IRI - 2005 IEEE International Conference on Information Reuse and Integration, pages 19–24, 2005.

[89] L. Zou and M. W. Godfrey. An industrial case study of Coman's automated task detection algorithm: What worked, what didn't, and why. In 2012 28th IEEE International Conference on Software Maintenance (ICSM), pages 6–14, 2012.

[90] M. Züger, C. Corley, A. N. Meyer, B. Li, T. Fritz, D. Shepherd, V. Augustine, P. Francis, N. Kraft, and W. Snipes. Reducing interruptions at work: A large-scale field study of FlowLight. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 61–72, 2017.

[91] M. Züger, S. C. Müller, A. N. Meyer, and T. Fritz. Sensing interruptibility in the office: A field study on the use of biometric and computer interaction sensors. In CHI, 2018.

Appendix A

Laboratory Data Collection

This appendix contains additional information pertaining to the data collection done in Chapter 3.

A.1 List of Tasks Performed By Participants

Duplicate Bug Task  Your manager has noticed that there has been a substantial influx of duplicate bug reports recently. Explore the provided list of bug reports, identify whether there is a duplicate, and provide its ID. Search the bugs with the IDs: 2264, 2268, 2271, 2277.

Viz Library Selection Task  Your team has developed an application for optimizing developer work patterns - reducing the number of impactful interruptions. You have been tasked with creating visualizations for a presentation outlining the benefits of your product to potential clients. You have the following data available to you for use: application usage times; interruption times, durations, and disruptiveness levels; keyboard and mouse activity levels.
What libraries would you suggest for creating data representations? Give some examples of other works that have been created using these libraries. At a minimum the visualizations should include a before/after comparison of developers' work days using the product vs. not using the product.

App Market Research Task  The software company you work for is considering expanding into the productivity tool sphere. Your manager has asked you to do some market research on 3 of the most popular already existing apps in this domain: Microsoft To-do, Wunderlist, and Todoist. Provide a short written summary of the similarities and differences between these 3 apps.

Recommend Tool Task  Your coworker is having difficulty deciding on which productivity app they should use, and has asked you for a recommendation. They have narrowed their decision to 3 apps: Microsoft To-do, Wunderlist, and Todoist. Based only on app store reviews, which of these apps would you recommend? Identify any reviews that were particularly influential in your decision.

Deep Learning Presentation Task  You are preparing to give a presentation on potential deep learning applications to the CTO of your company. While you have already completed the slides for the presentation, you should also prepare answers for a few questions which are likely to arise during the presentation.

1. The lines drawn between layers of the network on the included slide represent weighted inputs from one layer to the next. How does the network decide on what weights to choose during the training process?

2. Most of the technologies behind deep learning have already been around for over 30 years. Why is deep learning only becoming popular now? What has changed?

3. What kind of performance increases can be seen by using GPUs instead of CPUs?
Are GPUs always superior with respect to deep learning applications?

Blockchain Expert Task  You recently gave a short presentation to your colleagues outlining different ways the company may be able to make use of blockchain. One of your co-workers felt a little bit lost during the presentation, and emailed you a couple of follow-up questions afterwards.

1. You mentioned that blockchain is a form of distributed ledger. What does this mean? What advantages are offered over traditional client-server database ledger systems?

2. Could you explain what "proof-of-work" means? What is it? What does it do? Why is it necessary?

A.2 Additional Figures

Figure A.1: An example screenshot captured with our tool of a developer's inbox at the start of a data collection session.

