SkinProbe 2.0
Development of a System for Low-Cost Measurement of Human Soft Tissues

by

Alistair Wick

M.Eng., University of Bristol, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

October 2019

© Alistair Wick, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

SkinProbe 2.0: Development of a System for Low-Cost Measurement of Human Soft Tissues

submitted by Alistair Wick in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Dinesh K. Pai, University of British Columbia (Supervisor)
James J. Little, University of British Columbia (Examining Committee)

Abstract

We present "SkinProbe 2.0," a prototype system for low cost, high volume measurement of the physical properties of human soft tissues through direct contact and perturbation of the skin. Our solution encompasses a handheld device and associated cloud-based AI processing pipeline, and derives physically-representative values of stiffness and thickness directly from video. These input videos include images of the surface under contact, and of a "flexure," our novel apparatus for optical force measurement. Videos are captured using a smartphone embedded in the device.

Our system processes these videos, generating dense optical flow fields for selected frames, and passing these frames and flow fields through two bespoke neural networks: one providing estimated force readings, and one providing estimates of soft-body material properties in the contact vicinity.

We automate the collection of training data for our networks with robotics and a 3D-printed apparatus, along with custom-made silicone tissue phantoms, and a cloud pipeline for data collection, storage, and retrieval. This allows us to scale to thousands of samples in each training dataset, with minimal human involvement in collection, and a highly repeatable collection process.

We demonstrate the functionality of our measurement device, cloud pipeline, and force estimation system, and show promising material estimation results on our tissue phantoms. We further consider directions for future research in improving our system, both for handheld data collection, and for eventual usage on human subjects.

Lay Summary

We developed "SkinProbe 2.0," a handheld probing device, which we hope to use in the future to measure how people's skin moves and reacts to small amounts of pressure. Our device is currently a prototype, and not yet suitable for use on people. In the future, these results could let us use computers to help create customized clothing with a perfect fit, or allow us to automatically optimize prosthetics for an individual, making them more comfortable and secure. The device is low-cost and highly portable, and uses cloud computing and machine learning to help generate its results. We lowered the cost by basing the device around an ordinary smartphone, and using the phone's camera for all our sensing needs – it observes movement of both the skin, and of a special spring-like device we call a "flexure," which allows the camera to measure the pressure that is applied.

Preface

The UBC Sensorimotor System Laboratory's "Skincap" project is the direct predecessor to, and parent of this work.
Within Skincap, we sought to develop and refineexperimental techniques and tooling for the measurement of human soft tissues,with the goal of improving the physical accuracy of simulations for animations,and for tasks like computer-aided design of clothing. I participated extensively inthis project, designing the version 1 Skin Probe (Section 1.4) and major portions ofthe associated capture software; other members of the lab collected data from hu-man participants using this system, provided feedback on the design, and obtainedresults from human participants. This work led to our 2018 paper The HumanTouch: Measuring Contact with Real Human Soft Tissues [34] (ACM Transactionson Graphics), on which I was one of many co-authors. This was the first publica-tion describing and utilizing the SkinProbe V1. The SkinProbe 2.0 project wouldnot have been conceived without the hard work of all the authors on that paper;that said, nobody other than Dr. Pai and I have made direct contributions to thedevelopment of SkinProbe 2.0.As the name implies, SkinProbe 2.0 is positioned as a successor of, and poten-tial replacement for the V1 probe. In Section 1.4, I outline the design and usage ofthis original probing system, and explore the shortcomings which led to a desire forthis replacement version. Section 1.4.1 is based substantially on my own work inthe paper discussed above [34], section 3, “Design and Fabrication of Skin Probe”.My supervisor, Dr. Pai, was largely responsible for conceiving this project, andprovided valuable input throughout. I carried out the development of this systemmyself: I wrote the code, designed and fabricated parts for the prototype device,and carried out experiments with the system. Dr. Pai and I are the only contributorsvto the work presented in this thesis, again excepting the previously-published V1probe discussed in Section 1.4.With the exception of Section 1.4.1, I wrote the entirety of this thesis as anoriginal work; no other sections are based on any other sources, published or un-published.We did not require ethics board approval for this work, as no human subjectswere involved in testing this prototype system.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2.1 Material Parameterization . . . . . . . . . . . . . . . . . 21.2.2 Data Capture . . . . . . . . . . . . . . . . . . . . . . . . 31.2.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 41.2.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . 81.2.5 Processing Pipeline . . . . . . . . . . . . . . . . . . . . . 91.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.3.1 Regression Models . . . . . . . . . . . . . . . . . . . . . 
10vii1.3.2 Force Measurement . . . . . . . . . . . . . . . . . . . . . 111.3.3 Physical Analyses and Simulation of Human Tissues andOther Materials . . . . . . . . . . . . . . . . . . . . . . . 121.3.4 Tissue Phantoms . . . . . . . . . . . . . . . . . . . . . . 141.4 SkinProbe 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151.4.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 151.4.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 192 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.1 Probe Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.1.1 Flexure . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.2 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.2.1 Force Estimator . . . . . . . . . . . . . . . . . . . . . . . 282.2.2 Material Estimator . . . . . . . . . . . . . . . . . . . . . 292.2.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.2.4 Optical Flow Generation . . . . . . . . . . . . . . . . . . 382.3 Robotic Data Collection . . . . . . . . . . . . . . . . . . . . . . 392.3.1 Hardware Design . . . . . . . . . . . . . . . . . . . . . . 402.3.2 Tissue Phantoms . . . . . . . . . . . . . . . . . . . . . . 412.4 Automation Server . . . . . . . . . . . . . . . . . . . . . . . . . 442.4.1 Automation Programs . . . . . . . . . . . . . . . . . . . 452.4.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . 462.4.3 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . 502.5 Experiment Software . . . . . . . . . . . . . . . . . . . . . . . . 512.5.1 Android Application . . . . . . . . . . . . . . . . . . . . 512.5.2 Cloud Server Suite . . . . . . . . . . . . . . . . . . . . . 563 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593.1 Force Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . 593.1.1 Performance on Training and Validation Data . . . . . . . 603.1.2 Performance on Test Data . . . . . . . . . . . . . . . . . 633.1.3 Performance on Novel Test Data . . . . . . . . . . . . . . 643.2 Material Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 75viii3.2.1 Validation and Test Data . . . . . . . . . . . . . . . . . . 753.2.2 Handheld Comparison Experiment . . . . . . . . . . . . . 794 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.2 Project Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864.3.1 Force Sensing . . . . . . . . . . . . . . . . . . . . . . . . 864.3.2 Material Estimation . . . . . . . . . . . . . . . . . . . . . 874.3.3 Human Trials . . . . . . . . . . . . . . . . . . . . . . . . 874.3.4 Cloud Pipeline . . . . . . . . . . . . . . . . . . . . . . . 884.4 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . 88Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90ixList of TablesTable 2.1 Threshold forces for phantom µ¯ measurement for the differentphantom materials . . . . . . . . . . . . . . . . . . . . . . . . 36Table 2.2 Control commands available in our Python automation program-ming environment, and their functions. . . . . . . . . . . . . . 47Table 2.3 Probe application camera settings. . . . . . . . . . . . . . . . 53Table 3.1 Force estimator test metrics . . . . . . . . . . . . . . . . . . . 
60Table 3.2 Material estimator metrics . . . . . . . . . . . . . . . . . . . . 77Table 3.3 Material estimator handheld comparison metrics . . . . . . . . 80xList of FiguresFigure 1.1 Visual descriptions of our material parameterization . . . . . 2Figure 1.2 Raw video frames as captured by the probe’s camera. . . . . . 3Figure 1.3 Overview of our estimation architecture. . . . . . . . . . . . . 5Figure 1.4 Structure of the force estimation model, showing the input,output and hidden layers of the deep neural network (DNN). . 6Figure 1.5 Color-mapped rendering of a flow field input to the materialestimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Figure 1.6 Structure of the material estimation model, showing the input,output and hidden layers of the DNN. . . . . . . . . . . . . . 8Figure 1.7 Labeled image of SkinProbe V1, reproduced from [34] Fig. 2 16Figure 1.8 Labeled image of SkinProbe V1 head section, reproduced from[34] Fig. 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 17Figure 1.9 SkinProbe V1 software, reproduced from [34] Fig. 4 . . . . . 18Figure 2.1 computer-aided design (CAD) model of the SkinProbe 2.0 de-vice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22Figure 2.2 SkinProbe 2.0 rear, showing the frame, flexure mounting, andhardware attachment points. . . . . . . . . . . . . . . . . . . 23Figure 2.3 Side-on view of probe flexure under no load, and under 4N load. 24Figure 2.4 Probe flexure force-displacement behavior in the 0-4N range. . 24Figure 2.5 The flexure’s lattice silicone bonding structures, and break-away mold removal. . . . . . . . . . . . . . . . . . . . . . . 26Figure 2.6 The rigid force training plate . . . . . . . . . . . . . . . . . . 33xiFigure 2.7 Capture apparatus, sampling results, and final parameteriza-tions for material ground-truth data . . . . . . . . . . . . . . 35Figure 2.8 The “automation rig,” with the delta robot platform, siliconephantom tray, and mounted probe visible. . . . . . . . . . . . 41Figure 2.9 The assembled phantom tray. . . . . . . . . . . . . . . . . . . 42Figure 2.10 Two silicone phantoms cast outside the tray, demonstrating dif-ferent levels of softness with variant resting configurations. . . 42Figure 2.11 Cutaway view of a silicone phantom and its 3D-printed tray,used for material model training. . . . . . . . . . . . . . . . . 43Figure 2.12 Close-up of the phantom tray before silicone casting, showinglattice base and removable (disposable) mold walls. . . . . . . 43Figure 2.13 Phantom tray during silicone casting, with mold caps installed. 44Figure 2.14 The automation server’s web-based user interface. . . . . . . 48Figure 2.15 Annotated screenshot of the probe’s smartphone application. . 52Figure 3.1 Performance of the force estimation network on its own train-ing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61Figure 3.2 Performance of the force estimation network on its own vali-dation dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 61Figure 3.3 Force estimation training loss graph. . . . . . . . . . . . . . . 62Figure 3.4 Force estimates on a previously-unseen test dataset. . . . . . . 63Figure 3.5 Testing force estimation in contact with a phantom. . . . . . . 64Figure 3.6 Force estimates for contact with a silicone phantom . . . . . . 65Figure 3.7 Poor force estimation on a pathological case, in contact with asoft silicone phantom. . . . . . . . . . . . . . . . . . . . . . 66Figure 3.8 Contact force errors, with distribution, in the phantom force test. 
66Figure 3.9 Poorly estimated frame of force network input, compared to aframe with no force applied. . . . . . . . . . . . . . . . . . . 67Figure 3.10 Testing force estimation in contact with a leather strip (laidover a phantom). . . . . . . . . . . . . . . . . . . . . . . . . 68Figure 3.11 Force estimates for contact with a leather strip laid over a phan-tom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68Figure 3.12 Contact force errors, with distribution, in the leather strip test. 69xiiFigure 3.13 Testing force estimation in direct sunlight, in contact with leather. 70Figure 3.14 Force estimates while under direct sunlight. . . . . . . . . . . 70Figure 3.15 Moving the probe by hand in its mount. . . . . . . . . . . . . 71Figure 3.16 Selected force estimates with the probe handheld, but mountedon the rig. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72Figure 3.17 Selected force estimates with the probe freely handheld, touch-ing the phantoms. . . . . . . . . . . . . . . . . . . . . . . . . 73Figure 3.18 Force estimates on a previously-unseen test dataset, capturedafter thousands of touches post-training. . . . . . . . . . . . . 74Figure 3.19 Post-stress force error correction. . . . . . . . . . . . . . . . 74Figure 3.20 Material network performance on its own validation set. . . . 76Figure 3.21 Material network performance on a previously-unseen test dataset. 78Figure 3.22 The dismounted prototype, ready for freehand usage. . . . . . 79Figure 3.23 Material estimator results for robotically- and manually-captureddatasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81xiiiGlossaryAJAX Asynchronous JavaScript and XML, a set of tools and techniques whichenables live interaction between client-side JavaScript and a web server.API Application programming interface, a generic term for a set of technical com-mands and capabilities exposed by a software component.AR Augmented reality, an interactive medium which mixes sensory input fromthe real world with generated virtual input.AWS Amazon Web Services, a collection of cloud computing platforms and ser-vices offered by Amazon.CAD Computer-aided design, the use of computer software in developing andmodifying engineering designs.CI Continuous integration, a development methodology where new code, modelsand data are frequently (continuously) loaded onto a production-like envi-ronment for automated and/or manual testing.CNN Convolutional neural network, a DNN employing one or more convolutionallayers – essentially, learned filters – for signal or image processing.DNN Deep neural network, an artificial neural network with several stacked “hid-den” layers, commonly used in modern machine learning tasks.DRY Don’t repeat yourself, a software development methodology where repeti-tion of code and data is reduced or eliminated, providing a single locationfor editing each piece of “knowledge.”xivEM Electromagnetic, a term describing forces and fields around electrically chargedparticles, relating electricity and magnetism.FEM Finite element method, a numerical method for discretizing problems suchas fluid and soft-body simulations.FPS Frames per second, the number of frames of video recorded or displayed persecond.GPU Graphics processing unit, a dedicated, highly parallel co-processor com-monly found in modern desktops and servers, used to accelerate paralleliz-able tasks, such as image processing and large matrix operations.GUI Graphical user interface, a visual, virtual, interactive interface consisting ofdisplay 
and interactive elements such as text, images, and push-buttons.HTML HyperText Markup Language, an XML-like markup language used to de-fine documents for display in a web browser.HTTP HyperText Transfer Protocol, a common communications protocol for re-questing and receiving documents from web servers.IPC Inter-process communication, methods allowing for data exchange betweenseparate software processes, possibly on different computers.JSON JavaScript Object Notation, a widely-used, human-readable serialized dataformat. Often used to replace XML in client-server communication.LSTM Long short term memory, a popular type of recurrent neural network whichworks well in practice.MAE Mean absolute error, a distance metric between two sets of values, whichsums their element-wise absolute differences:1nn∑i=1|yi− xi|xvMRI Magnetic resonance imaging, a method for non-invasive imaging of the in-terior of the body using powerful magnetic fields.MSE Mean squared error, a distance metric between two sets of values, whichsums their element-wise squared differences:1nn∑i=1(yi− xi)2NSERC Natural Sciences and Engineering Research Council of Canada, a fund-ing body for scientific research in Canada.OOD Out-of-distribution [data], describing data which is anomalous or otherwiseoutside of the distribution of training data in a machine learning task.PLA Polylactide, a polyester thermoplastic made with plant-based compounds,commonly used for 3D printing. Also known as “polylactic acid.”ReLU Rectified linear unit, a simple and popular activation function for artificialneural networks, essentially comprising y = max(0,x)RGB Red-green-blue, an additive color model where red, green, and blue prin-cipal components are added to create a range of colors, typically used fordigital storage of color images.RMSE Root mean square error, a common outlier-sensitive metric for regressionmodel performance. Lower values indicate a better fit, with 0 being a perfectfit.S3 Simple Storage Service, a bulk data storage service in AWS.SVD Singular value decomposition, a factorization which separates a matrix Minto the form USV T .xviAcknowledgmentsThis research was funded by the Natural Sciences and Engineering Research Coun-cil of Canada (NSERC) Idea to Innovation (I2I) program and Vital Mechanics Re-search.I’d like to thank Dinesh Pai for allowing me to join his lab, for his support anddirection in achieving this milestone, and for allowing me the professional freedomto experiment with different techniques and technologies, as well as the personalfreedom to maintain links with family and friends at home. I also wish to thank JimLittle for agreeing to be my second reader, and for his timely and helpful feedback.To my labmates, past and present: Austin Rothwell, Pearson Wyder-Hodge,Jan Hansen, Seung Heon Sheen, Prashant Sachdeva, Egor Larionov, Ziheng Liang,Ye Fan, Matt Dietrich, Yuchi Zhang, Edwin Chen and others. Thank you for anenjoyable and enlightening time, and for all the discussions and help.I’ve had many wonderful experiences here, for which I must thank the manyfriends I’ve made since arriving: Siddhesh Khandelwal, Giovanni Viviani, Hay-ley Guillou, Neil Newman, Kuba Karpierz, Yasha Pushak, Anna Scholtz, ClementFung, Fabian Ruffy, Michael Oppermann, Nico Ritschel, Puneet Mehrotra, JanPilzer, Adam Hutter, and many more. 
You’ve hiked, biked, road-tripped and inflat-able kayak-ed with me around this beautiful country, and helped me tour many abeautiful pub.Living away from family and friends can be difficult, especially when you’re a9-hour flight from them. I couldn’t have survived here without the support of thepeople I left behind. Thank you to my parents, my friends, and the family at homewho supported my decision to explore life across the pond.Zahra; thank you. I couldn’t have done it without you.xviiChapter 1Introduction1.1 GoalsWith SkinProbe 2.0, we aim to simplify and accelerate the collection of large vol-umes of data on the physical properties of human soft tissues. We hope that thisdata will reveal scientifically and practically useful information on the distributionof these properties over wide populations, enhancing the modeling and simulationof both humans in general, and of individuals in particular. We break down theselofty goals into practical targets:• Low cost of production and usage• Portability – ability to collect data “in the field”• Usability – little training or expertise required of experimenters• Rapid data collection• Accuracy – measure physically meaningful properties of soft tissuesWith these goals in mind, we opted for a significantly different approach to theoriginal V1 probe (Section 1.4). We base our solution around a consumer smart-phone and cloud computing technology, with only the calibration of each devicerequiring specialized, immobile hardware. This minimalist approach is enabledby our use of a novel force-sensing “flexure,” machine learning, and off-the-shelf1(a) µ¯ Stiffness (b) L0 ThicknessFigure 1.1: Visual descriptions of our material parameterizationoptical flow estimation, allowing us to infer material properties using only opticaldata. Usage in the field is supported with only the probe and a working internetconnection. Our use of a consumer smartphone and consumer-grade 3D printingmeans the incremental cost of constructing a new probe is low. Each probe is op-erated through the embedded phone’s touchscreen, so we achieve usability withinterface elements which are easily recognizable to the millions of people who usesmartphones every day. We achieve rapid data collection by requiring no externalhardware, and by providing a cloud pipeline for processing new data, which pro-vides rapid feedback to experimenters. The accuracy of our system is not currentlyproduction-ready, but we present our preliminary results as a proof of concept jus-tifying further refinement of our techniques.1.2 OverviewHere we provide a brief overview of the project’s components and methods.1.2.1 Material ParameterizationEstimating the properties of soft materials requires defining those properties, whichwe do here. We use a simple model which attempts to capture the stiffness and thethickness of the sample under consideration. 
We limit the model to two variablesout of a desire to limit the necessary volume of data – specifically, the number of2(a) No contact (b) 1N normal force (c) 3N normal forceFigure 1.2: Raw video frames as captured by the probe’s camera.different material samples – necessary for training our material model.We treat each sample as a depth-wise homogeneous block, backed by a rigidunderlying surface, and define µ¯ as the slope of the force/displacement curve at thepoint of contact with the material – that is, for small displacements (Figure 1.1a).We define L0 as the thickness of the material, in millimeters, to the (real or imag-ined) rigid underlying surface (Figure 1.1b).Our “ground truth” material properties are measured with the same equipmentused to generate training data for our system, with only minor modifications.1.2.2 Data CaptureThe smartphone in the probe serves three main purposes: it displays controls forregistering a participant, and for starting and stopping recordings; it captures op-tical data (videos); and it communicates with the cloud pipeline – uploading data,and downloading results. Though modern smartphones are powerful computers intheir own right, and are now capable of performing deep neural network (DNN)inference at a reasonable pace1, doing so presents issues with regards to batterylife and overheating in a portable application. We offload all significant process-ing “to the cloud,” leaving to the phone’s lesser computational powers the (stillchallenging) task of capturing, compressing and uploading high quality video.The data we are interested in collecting are short videos of the probing shaftbeing pressed into the participant – we refer to a single capture, and resulting datafrom of one of these videos as a “touch.” An experimenter starts recording, moves1TensorFlow Lite (https://www.tensorflow.org/lite) permits deployment of fairly advanced imageclassification and object detection models on mobile devices.3the probe to touch the stationary participant, and applies a small amount of force(less than 4N) through the shaft. She then retracts the probe, and stops the record-ing, with the whole process having taken 3-6 seconds. See Figure 1.2 for severalrepresentative frames of captured video, but note that it shows contact with a sili-cone phantom, not a human subject.The scope of experiments which could be carried out with the probe is broad,but in general, we foresee the device being used in multiple locations on each par-ticipant. Part of the utility of rapid measurements is to enable many measurements,generating a map of material values covering a significant region of the partici-pant’s body. In the future, we envision the probe’s application software guiding theexperimenter through captures for different locations on the body.1.2.3 EstimationThe desired final output of our system, as applied to a particular recording, is sim-ply a pair of material parameter estimates. As such, we want a material model“matEst” which accepts a frame sequence x, and outputs a material estimate vectormˆ with our selected material properties:mˆ =[µ¯L0]= matEst(x)Rather than attempting to tackle this problem as a monolithic challenge, ourestimation pipeline breaks down the task into two major sub-problems: force esti-mation for frame selection, and material estimation using the selected frames. Wesketch this overall pipeline in Figure 1.3. We attempt to solve these sub-problemswith two distinct convolutional neural network (CNN) regressors, which we trainfrom scratch. 
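As a concrete but purely illustrative sketch of what a small CNN regressor of this kind looks like, the PyTorch module below maps an image crop to a continuous output vector. The layer counts, channel sizes, and kernel shapes are placeholders, not the actual architectures shown in Figures 1.4 and 1.6.

```python
import torch.nn as nn

class CNNRegressor(nn.Module):
    """Illustrative CNN regressor: convolutional feature extraction followed by
    fully-connected layers ending in a real-valued output vector (no softmax).
    Layer sizes are placeholders, not the thesis architectures."""

    def __init__(self, in_channels=3, out_dims=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, out_dims),  # e.g. 1 for a scalar force, 2 for (mu_bar, L0)
        )

    def forward(self, x):
        return self.head(self.features(x))
```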
These are related to, but distinct from popular CNN discrete classi-fication models – this class of models is famous for its excellent performance onobject recognition tasks, where a particular image is categorized into one of severaldifferent classes. Our regression models instead output numeric estimate vectors,which are continuously-varying across the input space.The video captured for each touch includes a view of both the subject’s skin,and of the flexure, which includes the probing shaft. The flexure is a novel mechan-4Figure 1.3: Our estimation architecture – raw video is fed frame-wise intothe force estimator, which is used to select frames for optical flow. Thegenerated optical flow is fed into the material estimator, along with es-timated forces for the selected frames.ical device, used to transform applied forces into visible deflection in a repeatablemanner. We extract these forces using our “force model” (Figure 1.4), the first ofour CNNs, which infers the force by observing this deflection in a cropped portionof the video. We apply our force model in an instantaneous frame-by-frame man-ner, ignoring the state of the flexure over time, as we found this provided sufficientaccuracy while keeping the model relatively simple.So, for a given frame of video xi, our model outputs an estimate fˆi of the trueforce vector fi :fˆi = forceEst(xi)≈ fiNote that, for this prototype, we only estimate the normal force, so these force5Figure 1.4: Structure of the force estimation model, showing the input, out-put and hidden layers of the DNN. Here, “channels” refer to the outputs,akin to filtered images, of each distinct CNN kernel in each layer. Weuse x→ y as a shorthand for a layer with an input size of x channels/neu-rons, and and output size of y channels/neurons. Where x is omitted, itis the size of the last layer’s output.estimates are in fact scalar. However, the approach described here could be used togenerate and process 3D forces.Next, we compute the input for the material estimator, another CNN regressionmodel. Given the estimated input forces fˆi , we select two frames of the video –a “contact” frame xc, and a “pressure” frame xp – with which to compute opticalflows. These frames are selected as the estimated force magnitude crosses twoseparate thresholds: contact, at 0.15N; and pressure at 2N. We chose these valuesmanually, aiming to reduce the effect of any noisy output from our force estimator,and to ensure significant visible deflection between the two frames.Optical flow is a measurement of the apparent motion of objects, surfaces, orother features in a sequence of images, traditionally in pairs of sequential frames ofvideo. We use dense optical flow between the selected frames, where an estimateof motion in image space is generated at every point (conceptually, at every pixel)in the image. 
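To make the dense flow computation concrete, the snippet below uses OpenCV's classical Farneback method as an off-the-shelf stand-in for the learned flow estimator described next; the frame file names are placeholders.

```python
import cv2

# Two frames selected by the force thresholds above (file names are placeholders).
contact = cv2.cvtColor(cv2.imread("contact_frame.png"), cv2.COLOR_BGR2GRAY)
pressure = cv2.cvtColor(cv2.imread("pressure_frame.png"), cv2.COLOR_BGR2GRAY)

# Dense optical flow: one (dx, dy) motion vector, in pixels, for every pixel.
# Positional arguments after the two frames: flow, pyr_scale, levels, winsize,
# iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(contact, pressure, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

print(flow.shape)  # (height, width, 2): a dense field of 2D flow vectors
```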
For a pair of images, the output of this generation is a “flow,” a densefield of 2D vectors indicating the motion in pixel units of the feature at that pixel’slocation, from the first to the second image in the pair.6Figure 1.5: Color-mapped rendering of a 120× 120 flow field input to thematerial estimator, with the tip of the flexure shaft – which has littlerelative motion to the camera – visible on the right of the image.We compute our optical flow estimates (Figure 1.5) using an off-the-shelf solu-tion – an open-source implementation of FlowNet2.0 [14], written in Tensorflow.We input the selected frames xc and xp for a touch into this estimator, obtaining aflow estimate Oˆ:Oˆ = flowNet2(xc,xp)The material estimator itself has broadly similar structure to the force estimator,though it is somewhat larger. It is a CNN regressor comprising three convolutionallayers, and five fully-connected layers (Figure 1.6). The optical flow image Oˆ iscropped to a region near the point of contact with the flexure, and this cropped flowis input to our material estimator model, which outputs our final vector of estimatedmaterial properties:mˆ =[µ¯L0]= matModel(Oˆ)7Figure 1.6: Structure of the material estimation model, showing the input,output and hidden layers of the DNN. The initial flow input is 2-channelas it is a field of 2D vectors (flows), while the two force inputs are theestimates fˆc and fˆp for the frames used to generate the flows.1.2.4 Model TrainingAs our models for force and material estimation are DNNs, they must be trainedto do anything useful. Since we have specific ground-truth data, our models aretrained in a supervised manner. This means providing example inputs and the cor-responding expected outputs – the difference between the calculated and expectedresults provides a loss, or error, which is minimized to train each model throughbackpropagation. This process typically requires large volumes of training data,enough to cover the expected distribution of real-world (test) data and outputs. De-pending on the complexity of the model, this can mean thousands to millions ofexamples.We provide large amounts of training data for our models with our robotic “au-tomation rig,” described in detail in Section 2.3. It holds the probe, and generatesartificial touches against silicone tissue phantoms. We automate this data collec-tion process with custom sequential control software for our robot, additionallycontrolling the probe’s smartphone application over WiFi, to indicate the start andend of recordings. The rig integrates an industrial force/torque sensor, providing8reliable force readings, and the robot provides accurate position readings, whichwe also save with the training data.We designed the tissue phantoms to provide a variety of stiffness and thicknessresults. 
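In practice, the supervised training just described reduces to a standard minimization loop: predict, compare against ground truth, and backpropagate. The sketch below shows this pattern for a force-style regressor, reusing the illustrative CNNRegressor above; the hyperparameters, tensor shapes, and random stand-in data are placeholders for the real recordings and force/torque readings collected on the rig.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: cropped frames paired with ground-truth forces. In the real
# pipeline these come from the automation rig's camera and force/torque sensor.
frames = torch.randn(256, 3, 120, 120)
forces = torch.randn(256, 1)
train_loader = DataLoader(TensorDataset(frames, forces), batch_size=32, shuffle=True)

model = CNNRegressor(out_dims=1)                  # illustrative regressor from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()                            # error between predicted and expected output

for epoch in range(20):                           # epoch count is arbitrary here
    for x, f_true in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), f_true)          # loss to be minimized
        loss.backward()                           # backpropagation
        optimizer.step()
```

Training the material estimator follows the same pattern, with cropped flow fields (plus the two force estimates) as inputs and the measured µ̄ and L0 values as targets.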
We achieve this by varying the type of silicone resins used in the differentphantoms, and by varying the depth of the material within each phantom usingspecially-designed trays.We obtain the material training data in two ways: µ¯ by probing the tissuephantoms and analyzing the resulting force/displacement curves; and L0 by simplyanalyzing the depth of each material sample across its surface, given the computer-aided design (CAD) model of the silicone tray.1.2.5 Processing PipelineWith the smartphone’s storage limited to a few gigabytes, and processing takingplace remotely, we opt to immediately offload all recorded videos to cloud storage.The phone application records to local mp4 video files, which are then uploaded toa cloud storage bucket, where they are sorted according to the experiment’s sessionID and current recording number. The application then deletes the local temporaryfile.During a live experiment, the material estimation processing takes place on acloud computer equipped with a powerful graphics processing unit (GPU), whichhelps accelerate the large tensor operations involved in DNN inference. We alsouse this cloud machine for all optical flow processing, and the majority of theforce estimation processing necessary for training and testing our force and mate-rial models. This provides some of the assurances of continuous integration (CI) –ensuring our development models function in the “production” cloud environment– and allows us to run the full FlowNet2 model, which has significant memoryand compute requirements, and which requires special environmental configura-tion which proved difficult on our local workstations.Our instance runs three custom services, and a Redis database application,which together make up our cloud pipeline:• Redis database: an off-the-shelf, in-memory database providing fast accessesand asynchronous primitives for inter-process communication (IPC).9• Flask server: the instance’s outward-facing interface. Provides HyperTextTransfer Protocol (HTTP) endpoints for starting, monitoring, and retrievingthe results of processing jobs.• PyTorch service: loads and executes our force and material models on datawhich it pulls from our cloud storage.• Flow service: loads the FlowNet2 model, and uses it to generate optical flowestimates on demand.These services communicate on the instance to carry out the processing stepsoutlined across the previous pages of this section. This processing is initiated bythe upload of a new video recording from the probe’s smartphone application, andreturns results back to the application upon completion, where they are displayedto the experimenter. All results are additionally stored securely in the cloud, in thepopular (and widely-readable) JavaScript Object Notation (JSON) format.We discuss this cloud pipeline in greater detail in Section 2.5.2.1.3 Related Work1.3.1 Regression ModelsWhile the problems we attempt to solve with deep regression models are highlyspecific to our project, DNNs have been used extensively to solve a wide varietyof computer vision problems in the last several years. Much of the focus of thiswork has been on supervised categorical learning – classifying one or more objectsin an image [17], or semantically segmenting an image to its component parts [24]– but this is not the only type of vision task where deep learning has proven use-ful. 
The natural output of a neural network is a numeric vector, and while this canbe interpreted (and optimized) as a likelihood distribution over categories, it canalso be used for regression for many different target variables [1]. Regression fromimages may sound like an odd task, but is in fact very useful in a variety of do-mains. For example, human pose estimation from red-green-blue (RGB) images isa regression task, producing a vector describing a skeleton configuration. Outputsmay come in the form of 2D joint locations [45], a heatmap of joint locations [48],10or, more recently, as a map of 3D limb orientations [25]. Depth estimation fromimages is also fundamentally a regression task, and has seen recent strides [7][23].A common example of image-based regression, with some similarity to our forceestimation task, is bounding-box estimation, where objects of interest are locatedin image space. Our force estimator works, at some level, in a similar fashion, bylocating the flexure shaft in the frame. Examples include architectures with convo-lutions followed by fully-connected layers (as in our models) [12], and networkswhich regress an object mask output [43]. Many of these regression approachespropose advanced techniques which we do not make use of, such as multi-passrefinement, but our DNN architectures draw from these examples and more.1.3.2 Force MeasurementThe measurement of forces is an old and extremely widely-studied problem, whichwe could not hope to cover here. While our apparatus is novel, the basic idea ofoptically deriving a force estimate from the deflection of elastic material is not.Optical force measurement systems may employ mirrors and lasers to amplify ap-parent motion for detection – this technique is known as optical beam deflection[49][37][35], and produces highly accurate measurements from relatively small de-flections. The features used for estimation in these techniques are generally moresimple than images – generally, the position or brightness of spot(s) of light aretransformed by a simple mathematical model to the final output. An advancedstudy of similar techniques can be seen in [13], where Hirose and Yoneda de-veloped a packaged 6-axis optical force sensor, and discuss several methods oftransforming light based on deflection.Deflection-based force measurement has also been utilized without optical sens-ing components. Zhou et al [10] use a carbon-infused silicone rubber to create awearable, capacitive tactile sensor, whose electrical capacitance changes as thematerial is deformed. Magnets placed in soft materials can be localized usinghall-effect or magnetic field sensors to infer the forces applied to those materi-als [5][21][47], relating deflection to force output in similar manner to the opticalmethods discussed above. These magnetic and capacitive measurements sufferfrom environmental inaccuracies: for example, the human body affects the capac-11itance of objects it is near to or in contact with, and electronic devices generateelectromagnetic (EM) interference which can affect the analog measurement ofa magnet’s position. In the case of powerful magnetic fields, such as those in amagnetic resonance imaging (MRI) machine, the distortion may be severe, andmagnetic force sensors may become inoperable.Our idea of monitoring macro-scale deflection for force detection is inspired inpart by visual servoing, a robotic control technique where a robot’s pose is moni-tored by camera in a closed-loop control system. 
This type of servoing has indeedbeen used for force control, as in Nelson et al’s work [29].Many other forms of force sensing are common, but typically require special-ized electronic hardware. This includes accurate, low-deflection strain gauges usedin industrial force/torque sensors. We use this type of sensor for training our forcemodel, and the original V1 probe also relied on this technology. These sensorsrequire additional hardware for reading and interpreting analog electrical signals,are often tied to laptops or desktops, and have a very high unit cost (often $1,000or more).1.3.3 Physical Analyses and Simulation of Human Tissues and OtherMaterialsThe human body is an exceptionally well-studied object, and this study has natu-rally extended to measurement and modeling of the dynamic response of humansoft tissues. Our work ultimately aims to aid this analysis at a large scale, and sowe pay homage to the enormous number of projects which have come before.Much work has gone into visually plausible simulation of soft tissues, pri-marily for artistic applications in movies and, more recently, video games [9].The behavior of soft, deformable materials is well-approximated by finite elementmethods (FEMs) (refer to [40] for a thorough overview), and recent projects havepushed towards highly detailed, anatomically-based human body models repletewith muscles, tendons and fat. Fan et al used MRI data to model distinct muscleswith targeted active and passive shapes [8]; Sueda et al matched the movementof muscles, tendons and bones in the hand to artistic animations [41]; and Si etal modeled a complete human musculoskeletal structure, conforming its motion tonatural swimming gaits [39]. Additional examples of this approach may be found12in Lee et al 2009 [22] and Teran et al 2005 [44]. While structurally advanced,these approaches lack physical realism – they produce visually convincing resultsin many cases, but cannot meaningfully predict the force response of human tissuesunder direct contact, as they do not use real-world measurements of the behaviorof these tissues.Parameter fitting for these types of models has also seen significant work.Sifakis et al used dense motion capture markers, and a volumetric model, to es-timate muscle and tissue properties within a male subject’s face based on its mo-tion. Similarly, Pons-Moll et al [36] used 4D, full-body motion capture data totrain a statistical model predicting the future shape of a human body mesh, usingits current physical state. This model produced realistic jiggles and bulges, butfailed to generalize to the effects of external forces. Kim et al [15] extended thiswork with an FEM-simulated tissue layer over a statistically-modeled inner body,learning the physical material parameters from motion capture data, and allowinggeneralization to external forces. This work is appealing in its elegance, but it doesnot include data on compression behavior, and so cannot be relied on to reproducethis behavior accurately. It also optimizes a linear material model, which is unableto accurately simulate tissue compression. Wang et al [46] supply another exampleof parameter estimation in a linear model, using Kinect depth sensors to analyzethe motion of complex (but inanimate) objects under external forces.When attempting to capture compression behavior of soft tissues, contact mea-surement provides directly relevant data. 
In a work which heavily inspired thisproject, Pai et al [33] used a robotic probing system to capture the shape and phys-ical behavior of objects such as plush toys. Kry and Pai [19] used motion andforce capture to estimate the compliance behavior of joints (rather than soft tissuesspecifically) in the human hand – in this case, the contact was between the hand,and an object being grasped. Bickel et al [2] improved upon these techniques witha nonlinear model for heterogeneous materials. This approach was somewhat lim-ited by their use of a basic, resistive force sensor in their probe. Miguel et al [28]estimated properties of cloth, skin, and internal anatomy with an advanced hyper-elastic model. Finally, in our own lab’s work, Pai et al [34] used direct captureof force-deflection behavior to parameterize a nonlinear FEM “sliding thick skin”layer over a rigid inner body. We captured this data using the SkinProbe V1, which13is discussed further in Section 1.4.More generally, the haptic behaviors of complex structures – not limited to thehuman body – have been captured and modeled in several different ways. MacLeancreated the “Haptic Camera” [26] for automated, robotic capture of haptic environ-ments, measuring and modeling a toggle switch’s behavior. While actuated in onlyone dimension, this system was then able to convincingly “replay” the switch’s be-havior to a user. Pai and Rizun developed the “WHaT” [32] wireless haptic texturesensor, a handheld device incorporating a single-axis piezoelectric force sensor andmultiaxis accelerometer, which allowed a user to measure surface properties likeroughness. The handheld, wireless nature of the device made it easy to use oncomplex, three-dimensional objects – we hope our handheld, wireless probe willachieve much the same.1.3.4 Tissue PhantomsBeyond simply studying the body, significant research effort has gone into recreat-ing the dynamic behaviors of human tissues in artificial analogs, referred to in theliterature as phantoms. These are often designed for surgical training and practice;SynDaver® products have been used for surgical simulation [30], and range fromsingle-material tissue and skin phantoms to complex, multi-layered phantoms withsimulated fat, muscle, veins and skin. These commercial products are made withsilicone and other gel-like materials, and aim to emulate the look and feel of realhuman tissues under surgical manipulations – palpation, cutting, stitching etc.Other phantoms have been designed for calibrating and testing medical devices,such as ultrasounds [4]. Phantoms may also serve double-duty as training aids forthese same devices, and need not consist of artificial materials – Sekhar et al [38]discuss the use of bovine liver and rib segments, along with pimento olive “tumors,”as a phantom for liver biopsy training. We avoid the use of animal flesh and otherorganic materials in our phantoms for obvious reasons: our automated data capturesessions can last many hours, and we wish to make these sessions repeatable andmess-free.The production of tissue phantoms has also seen some targeted research. Hallet al [11] characterized the force-displacement behavior of a variety of gelatin-14based materials using a motorized capture system. 
Derler et al [6] evaluated thefrictional response of various skin phantoms on textiles, comparing the phantom’sresponses to that of real human skin.1.4 SkinProbe 1.0Version 1 of our SkinProbe was, naturally, the immediate predecessor to this project,and our goals and decisions for SkinProbe 2.0 were informed significantly bylessons learned designing and using version 1.The original SkinProbe generated qualitatively different data to SkinProbe 2.0,providing force-displacement measurements and multi-view optical flows. We2locally optimized a reality-based model [31] we called “sliding thick skin” [34],with a thick FEM layer which slid smoothly over a parametric inner body.1.4.1 DesignThe SkinProbe3 (Figure 1.7) was a custom in-house design, tailored for the Skin-Cap project, and produced with rapid prototyping techniques. We wrote customsoftware to operate the probe and record data, which ran on an attached “host” PC.HardwareThe probe housed three miniature high-definition RGB cameras, a six-axis force/-torque sensor, three user input buttons, and a micro-controller board. The probeprovided visual feedback to the experimenter during operation through a graphicaluser interface (GUI) displayed on a separate monitor, which could be controlleddirectly using the buttons on the probe body. The cameras were placed to providea clear view of the skin patch around the probing point, with overlapping fields ofview that eliminated occlusions around the probe, except at the point of contact.The design underwent rapid iterative changes throughout the project. An initialprototyping phase with non-functional sample parts resulted in general decisionson the overall shape of the probe, including the pistol-grip design – which opti-mized grip angle for the probing locations we were considering – and rear place-2In Section 1.4, “we” refers to the authors Pai et al [34], including me.3Section 1.4.1 is based on my own work in [34] section 3: Design and Fabrication of Skin Probe.1561743285109Figure 1.7: SkinProbe V1. 1) 3D tracking marker arrangement on the “ten-tacle”; 2) Single-piece probe “head,” holding cameras and force sensor;3) Carbon-fiber composite probe stem; 4) Quick-change, magnetic-lockprobe tip; 5) Front-facing cameras; 6) Front trigger button; 7) Rear inputbuttons; 8) Ergonomic handle, housing micro-controller; 9) Externallyaccessible reprogramming port; 10) Exit point for probe cables, withinternal strain relief. Reproduced from [34] Fig. 2.ment of the marker tracking assembly, which kept the assembly out of the way ofthe probe cameras and experimenter. After this initial prototyping phase, the de-vice saw five major iterations, with a general trend of increasing size and additionalfeatures, such as the addition of buttons for experiment control.We optimized for ergonomics, durability over multiple experiments, and easeof use. This included use by a solo experimenter: the probe’s input buttons allowedan individual to perform an initial calibration, then select and create recording “tri-als” with no other experimenters present. This was desirable as it improved partic-ipant privacy and comfort, and left other project members free for data processingtasks. With this in mind, and drawing inspiration from video-game console con-trollers, we laid out the probe buttons for easy operation with one hand. Two but-tons were positioned for operation with the experimenter’s thumb, on the probe’sback – that is, the side facing the experimenter. 
The third button was integratedinto a “trigger” on the rear of the probe, operated with a forefinger. In addition, wedesigned the probe’s large, organically-shaped handle to be held comfortably with16124635Figure 1.8: Cutaway view of SkinProbe V1’s “head” section. 1) Embed-ded force/torque sensor; 2) 3D printed probe stem mount; 3) Integratedprobe camera mounts; 4) Probing tip, also showing the internal channelhousing the locking magnets; 5) Endoscope camera (one of three); 6)Camera and force sensor cable routing. Reproduced from [34] Fig. 3.a one- or two-handed grip.With a sensitive force transducer mounted back from the cameras (Figure 1.8),a relatively long shaft was needed to reach the probing point. This had to belightweight, to minimize the influence of the shaft mass on the sensor; compact, toavoid occluding the camera’s views of the skin patch; and extremely stiff, to ensurea rigid transformation between the probing location and the probe itself, even asforces were applied to the tip. After experimenting with metal shafts and tubes, weswitched to a lightweight carbon-fiber shaft with a custom 3D printed mountingand a magnetically-locking, quick-change tip.In an application where experiment hardware is in direct contact with membersof the public, safety is an especially great concern. We kept contact forces for ourexperiments low – easily achieved with a human experimenter in control – and tookcare to avoid sharp edges and exposed electronics in the probe’s design, minimizingrisk of injury to all parties involved.SoftwareThe capture software for SkinProbe V1 (Figure 1.9) served primarily to monitordata streams from the various sensors and the Vicon motion capture system, and to171 23Figure 1.9: SkinProbe V1’s capture software. 1) Live feed of forces andtorques applied to the probe tip, showing a recent touch; 2) 3D visualizerwindow, displaying real-time positions of the participant markers andprobe tip from a user-adjustable virtual camera; 3) Live split view fromthe three probe cameras, showing the shaft and tip in-frame, along withthe participant’s skin and attached markers. Reproduced from [34] Fig.4.consolidate these streams into recordings sampled at up to 120Hz. It also allowedthe experimenter to execute automated optical flow post-processing on the capturedvideo feeds. The software was controlled through a GUI written in C++ using theQt 5 application library. An additional 3D view of the capture volume was providedthrough our visualizer application, a separate program written in Python and usingOpenGL, which we used to display both live and pre-recorded data from the mainapplication.Another important role of the software was to calibrate the SkinProbe beforeexperiments, and to store and re-use these calibrations as far as practicable. This in-cluded establishing the relative transformations between the probe’s tracking “ten-tacle” and the tip, and between the tentacle and cameras – we considered thesetransformations to be relatively stable between sessions. The software also in-cluded the ability to tare the force sensor, and to establish the weight of the shaftand attached tip. This allowed us to accurately zero our force readings, leaving18only the externally-applied forces on the probe tip in our recorded data.We used Vicon Blade software to monitor the Vicon system. Blade capturedthe 3D position of any detected markers in the capture volume, and provided ro-bust tracking of the transformations of rigid “props” like the probe tentacle. 
Bladerelayed this data via Vicon’s proprietary streaming application programming inter-face (API) to our main capture application, where we implemented our own pre-dictive marker tracking solution for the less-than-rigid bodies of our experimentparticipants.1.4.2 LimitationsThe most significant limitation of the V1 probe is the total system cost and com-plexity. An accurate, real-time, 3D motion capture system costs tens to hundredsof thousands of dollars. Full-body 3D scanners can cost hundreds of thousands ofdollars. Accurate multi-axis force sensors range from thousands, to tens of thou-sands of dollars. Along with a powerful host PC, and thousands of dollars in re-quired commercial software, these hardware components make for a system whichis prohibitively expensive to create, use, and service.These cost and hardware restrictions also make scaling the system infeasible.Only one participant may be run at a time, as the motion capture system consti-tutes an open area which cannot be shared for privacy reasons. The motion capturevolume and body scanner are also immobile, requiring major effort to move, andextensive set-up at any new location. This means that participants must be able totravel to the laboratory to be tested, severely limiting both the number of availableparticipants, and the geographic region from which they may be pooled. Dupli-cating the system in new locations would repeat the massive cost of a completelynew set-up, and would require significant work from technicians to replicate thesoftware environment and hardware connectivity of the original system.Less serious limitations include convenience and duration issues. The systemrequires each participant to partially strip, and to wear a set of markers whichmust be carefully and professionally placed (this requires substantial practice). Thesystem also requires a full body scan, with the markers. Errors in motion tracking,which we are fortunately able to detect, require the experiment to be paused and the19participant to stand for re-identification of the tracking markers. These additionalsteps beyond the core recorded touches introduce friction and use up time in theexperimental process, frequently leading to session times of 90 minutes or more.This further limits the available pool of participants: in short, while individualmeasurements are rapid, the sessions are neither quick nor easy.Finally, we found that the optical data gathered from the probe’s trinocular vi-sion system was of questionable quality. The high-resolution images tended to begrainy and out of focus. We also had little control over values such as exposureand frame-rate, which was typically 15 frames per second (FPS), and we struggledwith the camera’s lack of metadata output – for example, no accurate frame cap-ture times were provided. The camera sensors were not well-seated inside theirhousings, tending to point slightly off-axis, and the cameras themselves were alsonotoriously unreliable, requiring frequent replacement. With this poor data, wefound that our optimization process struggled to match virtual flows on simulatedmaterial to the real-world flow estimates our system generated, and generally re-sorted to using the much higher-quality force/displacement data generated with theprobe’s force sensor and the motion capture system.20Chapter 2Methods2.1 Probe HardwareThe probe itself is, of course, central to the project, providing the core data collec-tion and experiment control capabilities of our system. 
Experimenters operate theprobe through its touchscreen, and take measurements by gently pressing the prob-ing tip into a participant or material sample. Results, including estimates of theforces exerted through the flexure, are generated in the cloud before being returnedto the device, and displayed on the screen for experimenters to check.Modern smartphones offer a wide variety of sensors and capabilities, includinghigh-resolution multi-touch screens, high-speed internet over WiFi and 4G, veryhigh-fidelity imaging through two or more cameras, accurate orientation sensing,GPS location, and, in newer devices, inside-out SLAM (typically used for aug-mented reality (AR) applications). They are perhaps the densest, best-value sensorsuites available today, and are widely used and understood. We opted to use onein the probe to make the most of these capabilities. The probe is built around aconsumer Android smartphone – we use a Samsung Galaxy S81, but only require amodern Android device with a back-facing camera, running our capture application(Section 2.5.1). Where the original SkinProbe (Section 1.4) used push-buttons, anelectronic force/torque sensor, host PC, and external motion capture system, thisnew device contains and requires no electronics other than the smartphone – the1https://www.samsung.com/global/galaxy/galaxy-s8/21Figure 2.1: CAD model of the SkinProbe 2.0 device.device is operated through its touchscreen, and relies solely on the back-facingcamera for data capture.A 3D-printed frame securely holds the phone, and provides mounting pointsfor attaching the flexure and ergonomic handgrips (Figure 2.1). The frame alsoserves to allow mounting the device in the robotic automation rig, which we use forautomated data collection (see Section 2.3). The probe’s force-sensing flexure isfitted to the back of the device, attaching to the frame (Figure 2.2). The orientationof the flexure is adjusted relative to absolute normal by mounting it indirectly,on slanted angle adaptor pieces. We position the flexure to make it substantially22Figure 2.2: SkinProbe 2.0 rear, showing the frame, flexure mounting, andhardware attachment points.visible to the phone’s back camera, while not completely taking over its field ofview. The various mounting points along the frame allow for adjustment of theflexure position, with further fine-tuning achieved by varying the flexure and angleadaptor designs.2.1.1 FlexureThe probe’s flexure serves both to provide contact with the participant’s tissue,and to translate the applied force into visible motion, detectable by the probe’smain camera. We fabricated the flexure with a combination of 3D printing andsilicone resin casting: the rigid, printed pieces serve as the probing shaft and frame;the silicone cross-bars connect the rigid pieces, and provide repeatable deflectionbehavior under force (Figure 2.3). We use some novel techniques to combine thesilicone and printed plastic, made possible by our use of an Ultimaker 32 dual-extrusion printer (Ultimaker, Utrecht Netherlands).The Ecoflex3 00-50 silicone rubber (Smooth-On Inc., Macungie PA) we se-lected for the flexure was chosen for its near-perfect elasticity and remarkabledurability. 
We observed no plastic deformation on the silicone parts of the flexure2https://ultimaker.com/3d-printers/ultimaker-33https://www.smooth-on.com/product-line/ecoflex/23(a) No load (b) 4N normal loadFigure 2.3: Side-on view of probe flexure under no load, and under 4N load.Figure 2.4: Probe flexure force-displacement behavior in the 0-4N range,with a 3mm/s contact speed. The lower line – where the plot loopsback – shows some hysteresis, where the material takes some time torecover after being strained.after tens of thousands of touches on our automation rig, with forces ranging up to3 Newtons; however, we did observe plastic deformation after exposure to higherforce levels (this is discussed further in Section 3.1). Additionally, we found thatusing this silicone yielded a flexure force-deflection response with a near-lineartrend, and low hysteresis (Figure 2.4).24Single-Piece FabricationWe developed a process whereby flexures are fabricated with a single 3D print, us-ing integrated single-use mold pieces which are broken away after silicone casting.These mold pieces are made with Ultimaker Breakaway Material4, which forms amechanically breakable, but watertight bond with polylactide (PLA) plastic. Wealso use this material as removable support for the flexure’s complex geometries,which is its original purpose. The silicone resin is poured directly into the printedpart, through removable “risers” which allow resin ingress and air egress. Once thesilicone has set, we remove the breakaway material. This conveniently leaves sili-cone areas with dimensions defined in our 3D design. We do not rely, for example,on accurately pouring the resin up to a level. It also removes the need to cut awayleaked resin, which we found was a problem with rigid 3D-printed molds, wheresmall gaps allowed the resin to leak from the part. Finally, this method simpli-fies the process of designing the molds, as it only requires defining a thin shell ofbreakaway material around the silicone volumes, rather than separate, solid moldswhich can be attached and removed intact.Silicone does not adhere to 3D printed plastic parts when cast in contact withthem, and in fact will not adhere at all to most materials. This is an attribute whichmakes it ideal for mold-making, but difficult to work with for our purposes. It alsocannot be glued with standard compounds. We opt to mechanically join the siliconewith our prints by integrating rigid 3D lattices, which the silicone resin flows intoand around before curing, forming a multi-layered mechanical joint. We generatethese lattices using Ultimaker’s Cura 4.25 3D printing slicer software, by settingthe lattice volumes to print with no solid walls, and with an “infill” of parallel linesat alternating angles per-layer. These lattices are shown in the slicer in Figure 2.5a,and during a print in Figure 2.5b. They are completely obscured by the siliconepost-curing (Figure 2.5d). Note that neither the dual-material breakaway molddesign, nor the complex 3D bonding structures would be possible with traditionalmanufacturing techniques.We calibrated the design of the flexure through rapid prototyping and iteration,starting with an initial concept where we used a simple strip of flexible 3D-printed4https://ultimaker.com/materials/breakaway5https://ultimaker.com/software/ultimaker-cura25(a) Slicer cutaway (b) During printing(c) Silicone casting (d) Post-processingFigure 2.5: a) A cut-away view of the flexure in Ultimaker Cura 3D print-ing slicer software. 
Note the visible lattice areas in black plastic, andthe white breakaway material. b) The flexure during printing, againwith lattice structures and white breakaway material visible. c) Sili-cone ready for casting into risers. d) The breakaway mold material isremoved after casting and curing, revealing the set silicone cross-bar(blue material).26plastic. Our final design provides some accommodation in all three axes of motion,and though we do not currently take advantage of this capability, we believe thiswill allow us to extract 3D force estimates in the future.2.2 EstimatorsRecall that two bespoke regression models are used in our system: the force es-timator, which provides per-frame force estimates in Newtons, and the materialestimator, which provides the final estimates of the physical material properties.The overall flow of data from raw video to material estimates is depicted in Fig-ure 1.3.We split the estimation pipeline in this way for several reasons. Force estima-tion allows us to impose constraints on the data. We can, for example, ensure thattouches exceed a useful level of force for parameterizing the material. It acts asa kind of domain knowledge injection – we know that force estimation is at somelevel necessary for the task – and also allows us to perform frame selection fortwo-frame (rather than video sequence) optical flows, which keeps the individualmodels as simple feed-forward networks, and reduces the dimensionality of the in-put data to the material estimator. Given enough training data on handheld toucheswith human skin, we could expect a recurrent model, like a long short term mem-ory (LSTM) network, to regress material properties directly from a video sequenceof optical flow fields and corresponding flexure images. This is an appealing so-lution, but acquiring a sufficiently large sample of this in-distribution training data(and labeling it accurately) would be a huge undertaking. A single two-frame flow“image” between two points in a touch is far more constrained than the flow se-quence from the full video of the touch, and our hope is that this allows robotically-collected data to be closer in distribution to the same data collected by an unsteadyhuman hand.The best-in-class results achieved on image-based tasks by other authors leadus to use CNNs as our regression models, for both force estimation and materialproperty estimation. Both of our models have broadly similar structures and train-ing methods, with minor tweaks for each task. We also apply some limited dataaugmentation, which we found improved robustness in practice.27Our models stack several convolutional layers, with each layer’s output chan-nels fed to the next layer’s inputs. We apply max-pooling [3] between the layers,which is a common technique for improving the robustness of vision models, andhas the added benefit of reducing the working set size with each layer. We ap-pend several fully-connected layers in an architecture resembling LeNet [20], withsuccessively shrinking layer sizes collapsing down to the required output dimen-sionality – 1× 2 for the material estimator, 1× 1 for the force estimator. We userectified linear unit (ReLU) activations on each layer, a standard choice for DNNs,and train with mean squared error (MSE) loss, another common choice for re-gression models. 
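In PyTorch terms, this family of models can be sketched roughly as follows. The channel counts, hidden layer sizes, and input shape below are illustrative placeholders rather than the exact values of our networks (those are given in the next two subsections), but the overall pattern is the one we use: a small convolution/max-pooling stack, flattened into successively shrinking fully-connected layers, trained against an MSE loss.

```python
import torch
import torch.nn as nn

class ConvRegressor(nn.Module):
    """LeNet-style regressor: a conv/max-pool feature stack feeding shrinking FC layers."""

    def __init__(self, in_shape=(1, 256, 256), out_dim=1):
        super().__init__()
        channels = in_shape[0]
        self.features = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 8, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 4, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Work out the flattened feature size for this input resolution with a dummy pass.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, *in_shape)).numel()
        self.regressor = nn.Sequential(
            nn.Linear(n_flat, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, out_dim),   # 1 output for force, 2 for the material parameters
        )

    def forward(self, x):
        h = self.features(x).flatten(start_dim=1)  # stack of 2D feature maps -> one vector
        return self.regressor(h)

# Mean squared error is the training loss for both regression tasks.
loss_fn = nn.MSELoss()
```

The force and material estimators described next differ from this sketch mainly in their input shapes, kernel sizes, channel counts, and fully-connected depths.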
Our models are trained and run using PyTorch (https://pytorch.org/), utilizing GPU acceleration.

2.2.1 Force Estimator

The force estimation network generates the intermediate force estimates f̂i, which are used for frame selection in the overall pipeline. The estimates for the selected frames are also input to the material estimator.

The force network accepts a cropped, normalized, 286×527 grayscale image, and inputs this directly to its convolutional stack. We use 3×3 kernels and 2×2 2D max-pooling, expanding to 16 output channels in the first layer and dropping to 4 in the third and final convolutional layer. The output of the final convolutional layer is flattened to 1×8704, and fed into the network's fully connected layers, which run 3 deep and output a lone 1×1 normalized force estimate, representing the normal force applied to the flexure. We settled on these layer sizes and depths by trial and error, finding that new or larger layers did not improve the results, but did slow down training and evaluation times.

We expect the convolutional layers in the force estimator to act as powerful feature detectors, identifying elements in the images (and on the flexure) such as corners and boundary lines, with the layer outputs consisting of stacked 2D feature maps. We then expect the first fully-connected layer to focus not just on the magnitude, but on the location of feature activations in the convolutional output channels, with subsequent layers non-linearly regressing these feature movements into the normalized output forces. In essence, we expect the network to learn to track notable features on the flexure across the images, and use the positions of these features to infer the forces applied to the flexure. It is also reasonable to expect the network to identify image features which appear or disappear at certain force levels, such as specular highlights on the flexure shaft or silicone sections, or specific patterns on the silicone surface as it deforms.

2.2.2 Material Estimator

Recall from Section 1.2.3 that, from frames of video xi, we select contact and pressure frames xc and xp, and, with the material model, estimate a material parameter vector m̂ using the optical flow between the two frames:

$$\hat{m} = \begin{bmatrix} \bar{\mu} \\ L_0 \end{bmatrix} = \mathrm{matModel}\left(\mathrm{flowNet2}(x_c, x_p)\right)$$

The selection of these contact and pressure frames runs as follows. Given the estimated input forces f̂i, the selection process picks the first of several consecutive frames where the estimated force magnitude has passed a manually chosen threshold value. With a window size of w frames, contact force threshold Ftc, and n total frames in the touch, the chosen contact frame index c is:

$$c = \min\left\{\, i \in 0 \ldots n-1 \;:\; \left\lVert \mathrm{forceEst}(x_{i+j}) \right\rVert \geq F_{tc} \quad \forall\, j \in 0 \ldots w-1 \,\right\}$$

with corresponding force estimate f̂c = forceEst(xc). The subsequent pressure frame index p is selected similarly, with Ftp. We set our threshold values at Ftc = 0.15 N and Ftp = 1.5 N, thus selecting frames xc and xp for flow calculation. We calculate an optical flow estimate using FlowNet2 for these frames (from xc to xp), and input this flow Ô, along with the contact and pressure force estimates f̂c and f̂p, to the material estimation process.

We crop the generated flow to a 2×120×120 region around the contact point, a fairly small area in the overall image space which displays the most obvious surface displacements, and input this to the convolutional stack at the "start" of the material estimator network. We apply a large 5×5 convolutional kernel in the first layer, reasoning that features are fairly sparse in the smooth flow image.
The sub-29sequent two layers apply standard 3×3 kernels. All three convolutional layers use2×2 2D max-pooling, and we use fairly limited channel counts of 8, 6, and 4. Thefinal convolutional output is flattened and passed through the first fully-connectedlayer (as in the force network), the output of which is then concatenated with thenormalized force inputs. With all inputs accounted for, we gradually reduce thelayer size through 4 more fully-connected layers to the final 1×2 normalized ma-terial parameter output. We normalize to avoid prioritizing the loss of one param-eter over another – the scaling of µ¯ estimates in natural units is over 100× greaterthan L0. As in the force estimator, we use ReLU activations throughout.2.2.3 TrainingWhile neural networks with different overall structure and size perform differently,and are suited to different tasks, a newly-initialized network of any design willgenerally perform no better than random at its intended task. This is because theparameters, typically “weights” of the various connections in the net, must be tunedfrom a random initialized state to produce the desired behavior. This tuning isreferred to as “training,” and is analogous to training a person or animal to performa task.In the standard supervised learning paradigm7, training a neural network in-volves repeatedly applying the network to its intended task using some set of“training data,” and calculating a “loss” (essentially, the difference) between thedesired results – the “ground truth” – and the network’s output. Training aims tominimize this loss, and typically this is achieved through backpropagation. Back-propagation calculates a gradient for each learnable parameter in the network withrespect to the loss, allowing an optimizer to be applied which “follows” these gra-dients to a greater or lesser extent. Some or all parameters are tweaked by a smallamount with each update, in the direction indicated by the back-propagated gradi-ents. Most training optimizers provide a user-tunable “learning rate,” a so-calledhyperparameter which adjusts the step size for tuning. Many optimizers provideadditional hyperparameters, which alter their behavior in other ways.Training is a time- and resource-intense process, with significant computational7Other types of unsupervised training are possible, but are not directly relevant to this project.30requirements for both the forward and backward passes through the network, scal-ing with the network’s size. Each set of updates utilizing all examples in the train-ing set is referred to as an “epoch.” A network may need hundreds, thousands,or even millions of epochs of training before its performance is optimal, and eventhen, it may perform poorly on real-world data for any number of reasons. A net-work may overfit to its training data, learning to recognize specific training exam-ples or spurious noise rather than the “real,” relevant information contained in thedata. More general out-of-distribution (OOD) data issues can also occur, where thedistribution of the training data does not properly match up with that of real-worldtest data.As long as new training data is in-distribution, increasing and diversifyingtraining datasets for neural networks has been observed to monotonically improvetheir real-world performance [42]. This drove our desire to capture large amountsof training data with reliability and repeatability, leading to the automated capturesolution discussed in Section 2.3. 
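To make the mechanics of this training procedure concrete, a single supervised update in PyTorch consists of a forward pass, a loss against the ground truth, backpropagation of gradients, and an optimizer step scaled by the learning rate. The function and variable names in the sketch below are generic placeholders; our actual procedure is described under "Training Methodology" later in this chapter.

```python
import torch

def training_step(model, optimizer, loss_fn, inputs, targets):
    """One mini-batch update of a supervised regression model."""
    model.train()
    optimizer.zero_grad()                    # clear gradients left from the previous step
    loss = loss_fn(model(inputs), targets)   # forward pass and loss against the ground truth
    loss.backward()                          # backpropagation: d(loss)/d(parameter) for every parameter
    optimizer.step()                         # follow the gradients by one learning-rate-sized step
    return loss.item()

# The optimizer's learning rate is the hyperparameter controlling the step size, e.g.:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```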
We use this system to capture all of our trainingdata for both models, and much of our validation data.Dataset ProcessingStoring, loading and utilizing large datasets for machine learning presents severaldifficulties. We currently store half a terabyte of training data on Amazon WebServices (AWS) Simple Storage Service (S3), and have encountered limits on disk,system memory, and GPU memory. We bypass these limits with custom datasetcaching and loading classes, which cache requested data from S3 to local storage,unpack the data as needed, and support loading this data as PyTorch tensors. Thesetensors can then be moved into GPU memory for training or inference.Our dataset loaders support several useful features:• Checking the local hard drive cache, and only downloading missing or in-complete files.• Partial dataset loading – request specific recordings, and only those record-ings will be downloaded and loaded into memory.• Lazy loading – on memory constrained systems or very large datasets, data31may be loaded from disc as needed, not kept in system memory.• Unpacking mp4 video files to png images with ffmpeg8.• Storing, loading, and displaying Middlebury-format optical flow flo files.• As-needed optical flow and force estimate generation, starting and commu-nicating with the cloud GPU machine as needed to obtain these estimates.We use the same dataset loaders for local training and for live inference in thecloud, providing don’t repeat yourself (DRY) guarantees that data will be loadedidentically for training and “production.” We also perform some limited data aug-mentation dynamically in these loaders.Several other critical parts of the overall pipeline are implemented by the datasetloaders. This includes: the force-based frame selection logic discussed above;normalization and de-normalization of forces and material parameters for inputs,training targets, and outputs; image and flow cropping; data augmentation; anddata selection (e.g. checking the minimum force was hit).Force Training DataThe training data for our force network is captured by the robotic automation rig,discussed in Section 2.3. The data consists of the individual video frames from aseries of touches on a rigid plate (Figure 2.6), with randomized target depths inthe normal axis, and skew motions in the off-normal axes. Typically these trainingdatasets number a few hundred touches, with tens of thousands of frames (eachframe is a training example). This corresponds to just a few minutes of video at 30FPS, but we found this was more than sufficient. Validation datasets are separatelycaptured, with much the same programming and with the same hardware setup, butfewer touches – generally 30-50 touches, with 1000-2000 frames.One of the difficulties with training and using DNNs is the unexpected behav-ior they can display, especially when testing with data slightly outside the trainingdistribution. We found that while the probe’s smartphone torch provided extremelyconsistent lighting in the flexure images, our initial force models were extremely8https://ffmpeg.org/32Figure 2.6: Rigid force training plate, with various shaded patterns to createa variety of backscattered lighting effects on the flexure.sensitive to the effect of backscattered lighting changes – that is, changes in thealbedo and texture of the material being probed. 
An especially bright or dark mate-rial would reflect light from the torch back on to the flexure, and could significantlybias the resulting force estimates: in fact, when training on the material phantomtray, the force model learned to directly correlate the brightness of reflected lightto the output force. Bringing the probe closer to the surface increased the inten-sity of backscattered light, and the probe’s proximity to the surface was roughlyproportional to the level of force applied. We countered this unexpected ingenuityon the network’s part by adding patterns to the force plate, which create a varietyof backscattered lighting conditions when the plate is touched in different areas.We also made the input images grayscale, to prevent any dependence on the mate-rial’s hue, and augmented the data with additive lightness variations across wholeimages. The model was then able to predict accurately on materials of differentcolors and patterns.Another issue arose when we tested the force network after removing and re-mounting the flexure. This resulted in a small shift in the flexure’s position inthe image, moving the feature locations on which the network relied to make itspredictions. Testing the force predictions under this scenario resulted in signifi-cant mis-predictions, and a massive increase in average loss. Given that the shift33was small, and given that stationary features existed in the images (the flexureframe), we reasoned that the network should be able to learn to recognize the rel-ative position of the flexure shaft, and infer the force from that. This would makethe estimates invariant to small relocations of the flexure in the image after be-ing jogged out of alignment, or re-mounted. We achieved this with another layerof data augmentation, shifting and rotating the images by small, random amountsbefore cropping to the shaft area. This augmentation scheme draws heavily fromthe random-cropping augmentations used to great effect in CNN image recognitiontasks.Material Training DataOur material model training data is captured by the automation rig, in much thesame way as the force training data. However, rather than touching the rigid forceplate, we touch the phantom material samples – silicone slabs of variable thicknessand stiffness, mounted in the phantom tray (see Section 2.3.2). In contrast to theforce data, where every frame of video is a training example, each material exampleconsists of a complete touch. So, to obtain the thousands of examples necessary fortraining, we have to collect videos of thousands of touches, making these materialdatasets comparatively large in terms of raw storage.As discussed in Section 2.2.2, the actual inputs to the material network consistof a cropped optical flow between two selected frames from a touch, and the forceestimator’s output for those two frames. The training targets are also provided bythe material dataset loader, and are drawn from material data gathered ahead-of-time on the phantom tray.We obtain L0 ground-truth data directly from our CAD models of the tray, butestimating µ¯ requires some additional work. Using a rigid probing shaft (Fig-ure 2.7a), we sampled accurate force-displacement curves at small intervals acrosseach phantom’s surface, with repeated touches at each point. For each location,we then estimated the initial slope of this force-displacement curve – the slope atcontact – by fitting a line to a short post-contact region in the aggregated data. 
This involved thresholding the force magnitudes to identify points between contact and a low target force (Table 2.1), and cropping the data to this region. We then aggregated cropped data from multiple repeats at each location, and fitted a line to this multi-sampled point cloud. We took the slope of this line as µ̄ for the location, with units in N/m. Figures 2.7b and 2.7c show the sampled forces, and resulting parameterizations for each phantom.

Figure 2.7: Capture apparatus, sampling results, and final parameterizations for material ground-truth data. a) Rigid material probing shaft. b) Samples obtained with the rigid shaft – µ̄ estimates are derived from these. c) Parameterization of the material phantoms, including µ̄ and L0.

Point          Force threshold
Contact        0.02 N
00-10 target   0.15 N
00-30 target   0.2 N
00-50 target   0.3 N

Table 2.1: Threshold forces for phantom µ̄ measurement for the different phantom materials.

Returning to captures with the probe, a primary concern of ours when capturing the material training data was accounting for the differences between robotic, on-rails motion of the probe and the more fluid, unsteady human-held motion during actual experiments. Our solution is straightforward, and ties into the choice of a two-frame flow for our data – we simply randomize the direction of the probe's movement vector post-contact, within reasonable bounds, in the x and y axes. Ignoring hysteresis and other time-varying effects (like motion blur, which should be minimal), we reason that the optical flow between two points of a random walk – as in unsteady handheld movement – should be nearly identical to the flow between those same two points in a robotically linear motion. We additionally randomize the contact speed of the touches.

The touches are otherwise straightforward. We continue pushing into the material up to a target force of 3 N, quickly retract, and pause to stop any material hysteresis affecting subsequent touches (especially on the softer 00-10 phantom, which displays significant hysteresis). Samples of the material are distributed randomly along the x and y axes, and we perform the same number of touches (typically 1000-2000) on each of the three phantoms. We capture validation and testing datasets in the same way, simply reducing the number of touches to 200 per material.

Training Methodology

We trained our models on a local workstation equipped with an Nvidia GTX Titan X GPU, and packaged and uploaded our trained models to S3 upon completion. Both the force and material models were trained with similar techniques, using our custom data-loading code to formulate the raw video and force readings, captured by the automation rig, into usable input and target output data for training. Each instance of either network was trained with a particular, named dataset and validated on another named dataset, usually much smaller and captured in the same way as the training set.

Our models were trained up to an epoch limit, which we varied based on each network's convergence rate. The material net was trained with relatively little data, and continued to converge over thousands of epochs, while the force network was trained with tens of thousands of examples, and converged in dozens to hundreds of epochs. We randomized the order of training examples in each epoch. We found that the networks converged well with a mini-batch size of 32, which is fairly standard in the literature.
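Condensed into code, this methodology (shuffled mini-batches of 32 up to an epoch limit, together with the per-epoch validation pass and checkpointing described next) looks roughly as follows in PyTorch. The dataset objects, loss, and optimizer arguments stand in for our actual data-loading code and hyperparameters.

```python
import copy
import torch
from torch.utils.data import DataLoader

def fit(model, train_set, val_set, loss_fn, optimizer, max_epochs, batch_size=32):
    """Train up to an epoch limit, keeping the parameters with the best validation loss."""
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)  # re-shuffled every epoch
    val_loader = DataLoader(val_set, batch_size=batch_size)
    best_loss, best_state = float("inf"), None

    for epoch in range(max_epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()

        # Forward-only pass over the validation set after each epoch.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss < best_loss:  # save the network state whenever the best score is beaten
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())

    if best_state is not None:
        model.load_state_dict(best_state)  # re-load the gold-standard state at the end of training
    return best_loss
```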
After each epoch, we ran the network (forwardpass with no backpropagation) on the validation dataset, obtaining a loss scorewhich we recorded and compared to the training session’s best (lowest) score. Wesaved the network state whenever the best score was “beaten,” and re-loaded thisgold-standard network state at the end of training. This avoided the problem ofworsening validation loss towards the end of training, as the networks overfit totheir training data.Packaging and uploading the trained networks involved naming them (usu-ally by the training dataset, structure, and epoch count), gathering relevant meta-data (such as normalization parameters), and uploading the meta-data and PyTorchstate dict for the network to a models directory on S3. A model’s statedictionary holds its trainable parameters, including the all-important connectionweights.We used the Adam optimizer [16], and found that learning rates of 5e-4 onthe force network, and 1e-3 on the material network worked well for our tasks. Weexperimented with variable (during training) learning rates, with exponential decayand stepped functions, but found little success. We subsequently reverted back to37fixed learning rates, to reduce complexity and avoid them as a potential sourceof confusing errors. We also found that standard regularization techniques, suchas batch-norm and dropout, tended to yield worse results. Dropout in particularappeared to completely prevent convergence in our tasks, so we stopped usingthese techniques.2.2.4 Optical Flow GenerationAs discussed in Section 1.2.3, optical flow is a high-level feature derived fromimage sequences – usually, image pairs – and describes the apparent motion ofobjects from one frame to another in 2D image space. Estimating flows given animage pair is a computationally difficult problem, especially when dealing withlarge displacements and with real images (which have artifacts like specular high-lights) [18].We generate optical flows for image pairs in the captured video of participant’sskin, and these flows necessarily encode both the motion of the probe relative to theparticipant, and the deformation of the participant’s tissues under contact. Theseseparate but related forms of motion are left coupled – we leave the interpretationof the signal up to our material estimator network.Our use of flows rather than the images themselves (or raw image pairs) is jus-tified by a desire for robustness, and careful consideration of the problem. We donot believe that the texture and color of human skin is the most useful indicator ofits (or the underlying tissue’s) mechanical properties, and even if it were – say, byindicating the participant’s age – we cannot expect to capture any relevant corre-lations when generating training data with our tissue phantoms. To do so wouldrequire advance knowledge of those correlations so that we might encode them inthe phantom designs, and such knowledge could only be established using a systemlike the probe we are developing; a catch-22. Trying to capture this variation wouldalso greatly expand the necessary range of tissue phantoms we had to develop, andthe amount of training data we would have to collect for each probe. 
Finally, itwould run the risk of introducing problematic bias – a tattooed participant wouldvery likely display textures and colors outside the distribution of any conceivabletraining set, leading to (perhaps dramatically) incorrect estimates.38Ultimately, we hypothesize that the most relevant information for visually de-termining a soft material’s physical properties is in its deformation, and we encodethis belief into our system by the use of optical flows.We use a publicly available9 Tensorflow10 implementation of FlowNet 2 [14], acutting edge optical flow estimator, which is itself a particularly large and complexCNN regression model. We use pre-trained weights for this model, and run italongside our own models on our cloud virtual machine. We do not modify or“fine tune” the model in any way, using it off-the-shelf.2.3 Robotic Data CollectionTraining a DNN from scratch, through back-propagation, requires a very largeamount of labelled data, often thousands to millions of examples for robust perfor-mance. The amount of training data required for a Computer Vision task can oftenbe reduced by using a “pre-trained” network, either in whole or in part, whoseparameterized weights have already been tuned for some difficult task, such as ob-ject labeling in RGB images across hundreds or thousands of labeled classes. Theidea is that typical images are made up of patches of common textures, colors,and shapes, and that a network trained on a widely-scoped task will necessarily becapable of identifying these common features. Re-training, or “fine-tuning” thesenetworks to, for example, identify a different set of objects that those it was origi-nally trained for, is far more tractable than the original training task working fromrandomly-initialized weights. For example, a DNN trained to recognize cars wouldhave little trouble learning to recognize semi trucks: low-level texture features suchas the presence of smooth body panels, asphalt road surfaces, and different typesof skies in the images would often be present in both classes of image; higher-levelfeatures like tyres, grilles, and window panels could also be shared, recognized inthe new dataset, and the mechanisms of recognition so reused.The tasks we attack with DNNs bear little resemblance to these more traditionalimage classification and interpretation problems. We operate on close-up imagesof novel hardware, with a small, relatively fixed feature set, and 2D optical flow9FlowNet2 Implementation: https://github.com/vt-vl-lab/tf flownet210Tensorflow: https://www.tensorflow.org/39fields drawn from video footage of human skin. A standard image classificationnetwork like ImageNet would have little to say on the configuration of our probingflexure, and – as these pre-trained networks are often very large – would introducea heavy computational burden.Our use of DNNs in our force and material estimation models therefore ne-cessitates the collection of a large volume of data – data which, incidentally, mayvary qualitatively with each constructed probing flexure, and with each smartphonemodel used to build the probe. Manual data collection is a laborious task, slow andimprecise. Synthetic, software-generated data is another option, but comes withcaveats. 
Rendering images which properly cover the true distribution – imageswhich are, in other words, photorealistic, and calibrated to the properties of ourphysical camera system – would be a research project unto itself, and would ad-ditionally involve accurate physical simulation of the flexure and material underexamination. We opt instead to use the universe as our real-time “simulation,” us-ing a robotic system coupled to a local server to reliably and repeatably generatelarge datasets.2.3.1 Hardware DesignThe robotic “automation rig” (Figure 2.8) is designed to produce data containingsimple touches of some material sample(s) under consideration – typically a setof silicone phantoms, discussed further in Section 2.3.2. A touch consists of po-sitioning the tip of the probing shaft above the sampling location, pressing the tipinto the material (thereby applying some measurable force), and retracting. Force,position and video data are captured in real time as each touch is carried out.Our rig is based on a Force Dimension Delta 311 robot platform, which pro-vides 3 axes of motion. The robot’s control libraries allow positioning the effector,and applying forces in 3D cartesian space. An ATI Mini4012 force/torque sensorprovides precise ground-truth force readings, and the probe itself (Section 2.1) ismounted on the robot’s end-effector.We arrange this hardware with two 3D-printed assemblies: a “base,” which fas-tens securely to the robot’s stationary frame, mounting the phantom tray (described11http://www.forcedimension.com/products/delta-3/overview12https://www.ati-ia.com/products/ft/ft models.aspx?id=Mini4040Figure 2.8: The “automation rig,” with the delta robot platform, siliconephantom tray, and mounted probe visible.in Section 2.3.2) through the in-line force sensor; and the probe mounting, a sim-ple structural addition to the robot’s end-effector, which facilitates fastening andremoving the probe. We fabricate these assemblies using the same Ultimaker 3133D printer we use to fabricate the probe and flexure.2.3.2 Tissue PhantomsIn lieu of a software simulation of human tissue, and without a human (or othermammal) to confine in our data capture apparatus, we revert to the next best thing:a set of silicone “tissue phantoms” (Figure 2.9), which attempt to mimic the soft,viscoelastic behavior of human tissue, and cover a range of material properties –mimicking, for instance, tensed or slack muscles; fatty belly tissues; or the thinskin-on-bone found around the clavicle.We use a small selection of different casting silicones, covering a range ofstiffness levels (Figure 2.10). These are Ecoflex14 00-10, 00-30, and 00-50 silicone13https://ultimaker.com/en/products/ultimaker-314https://www.smooth-on.com/product-line/ecoflex/41Figure 2.9: The assembled phantom tray. 
Silicone phantoms are incorporatedin order of increasing stiffness, from softest at the top to stiffest on thebottom.Figure 2.10: Two silicone phantoms cast outside the tray, demonstrating dif-ferent levels of softness with variant resting configurations.42TraySiliconeFigure 2.11: Cutaway view of a silicone phantom and its 3D-printed tray,used for material model training.Figure 2.12: Close-up of the phantom tray before silicone casting, showinglattice base and removable (disposable) mold walls.rubbers (Smooth-On Inc., Macungie PA), which we occasionally refer to as “S-10,”“S-30,” and “S-50” respectively.These silicone resins are cast directly into the 3D-printed “phantom tray,” whichadditionally varies the thickness of the soft materials beneath a uniform surface(Figure 2.11). This thickness then becomes one of the material properties learnedby our material estimator, and creates additional variety in the measured µ¯ stiffnessof the materials.The phantom tray itself is designed with some of the same fabrication tech-niques as the probe flexure, discussed in Section 2.1.1. The base of each phantomvolume is constructed of a dense, rigid, three-dimensional lattice of printed plastic(Figure 2.12). This lattice provides an extremely robust connection between the43Figure 2.13: Phantom tray during silicone casting, with mold caps installed.silicone and the plastic tray, as the liquid silicone resin flows into and around thelattice before curing. As with the flexure, the walls of the casting volumes are madeof disposable breakaway material, which is printed as an integrated part of the traywith an Ultimaker 3 dual-extrusion printer. The walls are broken away post-curing,leaving a smooth finish, and well-defined geometry for the phantoms. We ensurethe top surface is perfectly level (and that the phantoms display some surface tex-ture) with printed “cap” pieces for the molds, which slot in once the silicone resinshave been poured (Figure 2.13).2.4 Automation ServerThe automation server operates and monitors the automation rig. This involvescontrolling the robot, reading values from the force sensor, remotely operating theprobe’s smartphone application, and providing a web-based graphical user inter-face (Figure 2.14) for user programming and control of automated capture sessions.The architecture of this system is built around a custom Python Flask15 serverwhich provides API endpoints: these endpoints activate the server’s various capa-bilities, can be polled for the system status, and generally provide the “back-end”functionality of the user interface. We run high-performance robot and force sensordata acquisition in background C/C++ threads, which interface with our Python15Flask documentation: https://flask.palletsprojects.com/en/1.1.x/44code.2.4.1 Automation ProgramsAutomated recording sessions generate large numbers of recordings with our au-tomation hardware, with limited or no user interaction beyond initial programming.To program the system, the user creates an “automation script,” which controls therobot and probe – we discuss the functionality of this scripting system here.Automation scripts are entered, edited, and run from our web UI. The scriptsthemselves are sent to the automation server for execution.Recordings are grouped into named “sessions,” and can be tagged individuallywith arbitrary metadata. 
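As a sketch of how one of the automation server's endpoints looks, the snippet below serves a status route of the kind the web UI polls for live readings. The route name, the shared-state dictionary, and the way it is filled are illustrative stand-ins; in the real server this state is produced by the C/C++ acquisition threads, and all endpoints additionally sit behind username/password authorization.

```python
from threading import Lock
from flask import Flask, jsonify

app = Flask(__name__)

# Shared state updated by the background acquisition threads (robot position, forces).
# The field names and initial values here are invented for illustration.
state_lock = Lock()
latest_state = {"position": [0.0, 0.0, 0.0], "normal_force_n": 0.0, "recording": False}

@app.route("/api/status")            # endpoint name is illustrative
def status():
    """Return the most recent readings for the UI to poll."""
    with state_lock:
        return jsonify(latest_state)

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)   # exposed only to the local network in practice
```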
Automated recordings hold not just the probe’s videos,but also high-frequency measurements from the force-transducer, and calibratedposition readings from the robot.The programming of our automation system is built around Python scripting,which provides a familiar and exceedingly powerful environment for generatingcommands. It provides the ability to generate commands in loops and other high-level structures, and the ability to use Python’s vast selection of libraries. Theability to use numpy proved particularly useful, as we are working primarily withnumbers and vector positions. We can, for example, target random locations withina defined range by simply using numpy’s random number generation facilities.Our scripts are written with special functions which emit commands for theautomation server to execute through its sequential controller. We summarize theavailable commands in Table 2.2. These commands are collected into a completesequence when the script is first run – that is to say, the entire program is executedahead of time. We structure the programming this way – as opposed to dynamicscripts, which might be able to read and respond to the system state – as it allowsus to perform basic correctness checks on the scripts before execution. If any partof the script produces an error (such as a divide by zero), the script fails to run,whereas a dynamically-executed Python script would only fail when it reached thepoint where the exception was generated. Since these automation programs cantake several hours to run, a failure part way through (or at the end!) of a programwould be a frustrating waste of our researcher’s time. This structure also allows us45to perform more high-level correctness checks: for example, we can guarantee thatall start record calls are matched by a stop record call. Finally, having afixed-sized list of commands for each execution run allows us to display a progressbar, and a completion time estimate.Python is a fully-fledged scripting language capable of, among other things,full filesystem access. When combined with the fact our scripts are entered througha web interface, this represents a potential security issue, as the scripts are executedremotely. We mitigate this by exposing the server only to the local network, andby requiring username and password authorization to access the UI, and all APIendpoints.There are some features here worth discussing, which go beyond the basicrecord/move/sleep commands. We use await input to create partially auto-mated scripts, where the user may move the probe manually for some recordings,or where the material tray or flexure should be swapped out (or any other aspect ofthe hardware manually reconfigured). We use this in combination with enableand disable motors to move the probe to a target location, then allow the userto disengage the motors and start a recording with one press, before re-engagingand finishing the recording with a second press. We make extensive use of theability to move up to a force limit, which helps keep data consistent, and preventsdamage to the flexure and phantoms by limiting maximum strain.2.4.2 User InterfaceOur web-based user interface (Figure 2.14) features a live readout of various met-rics direct from the automation hardware. The interface is a HyperText MarkupLanguage (HTML) webpage, accessed via a client browser. The page is serveddirectly by the Flask application. 
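To make the scripting model concrete, the short program below sketches an automated session of randomized touches using the commands summarized in Table 2.2 (written here with underscores). The session name, metadata fields, positions, and numeric ranges are invented for illustration, and the command functions themselves are provided by the automation server's scripting environment rather than imported.

```python
import numpy as np

# Illustrative automation program: names, positions, and ranges are invented,
# and units follow whatever calibration is active on the rig.

start_session("example-material-touches")
sync()                 # align server and phone clocks
bias()                 # zero the force sensor before touching anything

rng = np.random.default_rng(seed=0)
hover_height = 10.0    # height above the sample surface, in rig units

for i in range(100):
    # Random sampling location and contact speed within an (invented) region of the tray.
    x, y = rng.uniform(-20.0, 20.0, size=2)
    speed = rng.uniform(1.0, 3.0)

    start_record({"touch_index": int(i), "target_speed": float(speed)})
    move([x, y, hover_height], 5.0)        # position above the sample
    move([x, y, -5.0], speed, 0.5, 3.0)    # press: stop near the target or at a 3 N force limit
    move([x, y, hover_height], 5.0)        # retract
    sleep(2.0)                             # let the silicone recover before the next touch
    stop_record()

end_session()          # package and upload the session's data to S3
```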
We built the interface with Bootstrap user inter-face elements, and poll with asynchronous JavaScript and XML (AJAX) requeststo load live data. The interface allows the user to monitor the robot’s position – theposition of the flexure tip, assuming no deflection – and the force being applied tothe material tray.The user can monitor the robot’s position through direct numeric readouts,which show both the raw robot-space position, and calibrated phantom-space po-46Command Functionstart session(s) Start a new session, named s. Recordings startedafter this command will be grouped under thissession. Must be matched by a subsequentend session call.end session() End the current session, packaging and uploadingthe recorded data to S3.start record(m) Start a new recording, attaching the providedmetadata object m – typically a Python dictionary,which must be serializable to JSON. Must bematched by a subsequent stop record call.stop record() Stop the current recording, holding while data issaved and video is uploaded.sleep(t) Stop processing new commands for t seconds.move(p[,s[,t[,f]]]) Move the robot to phantom-frame position p withspeed s. The command (but not necessarily themove) ends when the robot is within thresholddistance t of the target, or normal-force f is ex-ceeded.sync() Synchronize the server and smartphone clocks,obtaining a new ∆t measurement as described inSection 2.5.1.bias() Bias the force sensor readings to a new zero-point.set headless() Disable smartphone recordings. Used formaterial force-displacement measurements forground-truth data.mute() Disable saving and upload of any data (outputs awarning). Used for developing and testing scripts.await input(m) Display a message m and wait for user input – theuser may click “Continue” in the UI, or “Go” inthe app to continue execution.enable motors() Enable the robot’s motors.disable motors() Disable the robot’s motors.Table 2.2: Control commands available in our Python automation program-ming environment, and their functions.471234(a) Direct monitoring and control elements. 1) Interactive 3D visualiza-tion of flexure tip position relative to phantom tray CAD model. 2) Positiontransformation calibration (controls and results) 3) Live normal force graph. 4)Direct system control - “regulation” enables or disables the motors, and move-ment controls allow manually moving the robot.132(b) Programming controls. 1) Program execution controls, and programsave/load (on client machine). 2) Program execution progress bar. 3) Interac-tive editor with syntax highlighting, multi-select.Figure 2.14: The automation server’s web-based user interface.48sition. The position is additionally displayed in an interactive WebGL view. Thisview shows the phantom tray using its 3D CAD model, and represents the ide-alized tip position as a small yellow marker with intersecting lines. These linesextend out from the marker point to provide additional 3D reference for its loca-tion relative to the phantom tray. The user can shift their view of the tray and tipby rotating, panning, and dollying16 the virtual camera with mouse controls. Ad-ditional markers serve as calibration points: if the reference and measured markersoverlap completely, the calibration is perfect (see Section 2.4.3 for more details onthe calibration process).Forces are monitored in two ways: through direct numeric readouts, whichshow both the raw force reading and a moving average; and through a live HTMLcanvas graph, which shows only the normal component of the moving-averageforce reading. 
This graph is scalable in the Y -axis, using the mouse wheel. Wefound no existing JavaScript packages which worked well for providing this typeof live, scrolling view, so we developed a custom solution which is functional, butbasic. Our graph places one “reading” (response from the automation server) ateach screen pixel, which corresponds to an approximately 30-second long window.The main issue with this approach is that slow or intermittent responses, whichcan occur when the page is de-focused, will unpredictably and unevenly shift andscale measurements in the time axis. It also means the graph cannot show high-frequency changes: the sensor itself is polled at approximately 300Hz, but the UIretrieves a moving-averaged force reading at only 15Hz.Direct control of the robot is enabled through push-buttons in the UI. The robotcan be moved in small increments along each axis by adjusting its “target position”(displayed above the position readouts) with the “+” and “−” buttons. It can alsobe brought immediately to a safe position with the “home” button. Finally, the usercan completely disable the robot’s motors by switching off position regulation;re-enabling regulation will cause the robot to hold its new position, so that theend-effector can be placed by hand and left stationary.16Dollying refers to moving a camera forwards and backwards along its line of sight. This isdistinct from “zooming,” where the field of view is changed on a static camera to enlarge or shrinkobjects in the frame.492.4.3 CalibrationThe Delta-3 robot’s proprietary control software calculates the cartesian position ofits end-effector by applying forward kinematics with its measured servo positions.This is a helpful capability, which we rely on heavily, but we wish to obtain thetip position (translated from the end-effector), and want this position relative tothe phantom tray. This makes the automation programs easier to write (with acoordinate origin in the tray center) and more repeatable – re-assembling or re-attaching the phantom tray mounting in a slightly different position should notchange the results.Our calibration process revolves around measuring three fixed points X on thephantom tray in the robot’s coordinate frame, which have known positions Y in thephantom’s coordinate frame. For calibration points, we use the three bolts whichattach the phantom tray (or other material tray) to the force sensor assembly. Arigid “calibration probe” may be used to manually register these bolt’s positions in3D space by following these steps:1. Replace the flexure with the calibration probe2. Support the probe with one hand, and disable the motors using theRegulate/Off UI control3. Move the probe so that the calibration probe slots over the top bolt4. Click Calibrate to mark the calibration point5. Repeat steps 3 and 4 for the remaining two bolts, in clockwise order6. Return the probe to a safe position, and re-engage the motors withRegulate/OnWhen the three positions have been sampled, we automatically calculate andsave a new calibration. This involves constructing a transformation composed of arotation matrix R, and translation vector t . This transformation maps points fromthe robot frame to our phantom frame. Let x¯ be the mean of the measured positions,50and y¯ be the mean of the phantom frame positions. We can then calculate the trans-formation, using the singular value decomposition (SVD) solution [27] to Wahba’sproblem to find the rotation matrix. 
Note that we weight all points equally:

$$
\begin{aligned}
B &= \sum_{i=1}^{3} (x_i - \bar{x})(y_i - \bar{y})^T, \\
B &= U S V^T, \\
R &= U \operatorname{diag}\!\left(1,\ 1,\ \det(U)\cdot\det(V^T)\right) V^T, \\
t &= \bar{y} - R\,\bar{x}
\end{aligned}
\qquad (2.1)
$$

We can then transform any point x in the robot's coordinate frame (such as a position reading) to and from the corresponding phantom frame position y as follows:

$$
\begin{aligned}
y &= R^T (x - t), \\
x &= R\, y + t
\end{aligned}
\qquad (2.2)
$$

2.5 Experiment Software

Our software runs during data capture sessions, facilitating the collection of samples in the form of video and raw data files. The probe runs our Android Application, gathering and uploading video data, while a virtual machine in the cloud runs a suite of server applications which the probe interacts with. Additionally, in automated capture sessions, a local server runs our robotic automation rig (Section 2.3), provides a web interface for controlling and monitoring these automated capture sessions, and operates the probe via local WiFi.

2.5.1 Android Application

The Android application (Figure 2.15) runs on the probe's smartphone, and serves to capture the video data on which our material predictions are based. We operate the application in two ways: with standard touchscreen controls, visible in the screenshot; and remotely, through the automation server discussed in Section 2.4. The application uploads captured videos directly to S3, and is additionally able to issue commands to, and display results from the cloud server discussed in Section 2.5.2.

Figure 2.15: Screenshot of the probe's smartphone application, with the phantom tray visible in the camera preview (the probe is mounted in the automation rig). 1) Live camera preview. 2) Action button for partially automated captures. 3,4) Session control buttons, for creating new sessions and starting and stopping manual recordings. 5) Material estimation results display. 6) Material estimator model selection (spinner). 7) Phone system status: battery level and temperature.

Video Capture

We ensure consistent video footage for training and inference by manually setting the camera's exposure, focus, and color-adjustment values. We additionally enable the flashlight for video recordings, which provides highly consistent lighting in our images – within the 6cm the probe's camera operates at, the flashlight drowns out nearly all ambient light sources. We reduce exposure times to account for this, which has the added benefit of reducing or eliminating motion blur artifacts. The manually-set values we use are summarized in Table 2.3. We additionally set the color transformation and color correction gains by reading out values from an auto-corrected image taken with our other settings held constant.

                          Preview    Capture
FPS                       30         30
Focus distance            55mm       55mm
Auto-focus                No         No
Auto-exposure             No         No
Optical stabilization     No         No
Video stabilization       No         No
Torch                     No         Yes
Exposure time             10ms       3ms
ISO Sensitivity           1500       70

Table 2.3: Probe application camera settings for Samsung Galaxy S8 rear camera. (Refer to the Android Camera2 CaptureRequest documentation for additional information on the meaning of these settings: https://developer.android.com/reference/android/hardware/camera2/CaptureRequest)

Videos are recorded as 1024×768 resolution mp4 files with a high quality setting for compression, and no audio. We (optionally) upload the video files to the current session's directory on S3, and immediately remove them from the phone's local storage post-recording to avoid exhausting the available capacity. Uploading requires a functional internet connection, using either mobile data or WiFi.
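The upload path itself is simple: one object per recording, keyed under the current session. The capture application is Android code, but purely as an illustration the same operation can be sketched in Python with boto3; the bucket name, key layout, and retry policy below are invented stand-ins for the app's behavior.

```python
import time
import boto3

s3 = boto3.client("s3")

def upload_recording(local_path, session, recording, bucket="skinprobe-captures", attempts=3):
    """Upload one recorded mp4 under its session prefix, retrying on failure."""
    key = f"sessions/{session}/{recording}.mp4"   # illustrative key layout
    for attempt in range(attempts):
        try:
            s3.upload_file(local_path, bucket, key)
            return key
        except Exception:                          # e.g. no connectivity; wait and retry
            if attempt == attempts - 1:
                raise
            time.sleep(5.0)
```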
Failed uploads are re-attempted, and in automated capture sessions, no new recordings will be made until upload has completed.

Automation

For automated capture sessions, the phone application maintains a background TCP connection with the automation server over a local WiFi network, with an additional layer of detection for dropped connections, and the ability to resend dropped messages. Messages are exchanged in a simple JSON format, and consist primarily of simple instructions for the phone from the server, and confirmation from the phone that execution of those instructions completed. This confirmation is necessary as some actions, like starting a recording, can take several seconds. We thus established robust, asynchronous communications between the phone and server.

When operating in tandem with the automation server and robotic platform, accurately recording the timing of image captures becomes essential. Even a small delay could lead to images of the flexure and material samples being systematically mis-labeled with force and position readings, which are recorded separately by the automation server. To correct for this, we establish a precise time offset between the phone and server system clocks, assuming similar latency in both communication directions. We calculate a time delta between the two devices by recording the start time t0 on the server, requesting a timestamp tp from the phone, and recording the response time t1, again on the server. The time delta is then:

$$\Delta t = t_p - t_0 - \frac{t_1 - t_0}{2}$$

where at any time t:

$$t_{\mathrm{phone}} = t_{\mathrm{server}} + \Delta t$$

We repeat this process 200 times, and take the average ∆t. We found that the clocks drift without frequent synchronization – on the order of 0.01 s an hour – and this drift can build up enough to affect synchronization of force and position readings with individual frames of video. We avoid drift between the two clocks by repeating this synchronization process frequently during longer capture sessions, which can last 12 hours or more.

Operating a smartphone for extended periods is somewhat outside such a device's intended usage, and presents its own difficulties. Recording video is a CPU- and power-intensive task, and we found that the phone's battery would often drain despite being plugged in to the AC adaptor. Evidently, the adaptor did not provide enough power alone, so the phone's battery had to make up the difference. This high power usage would, after some time, completely drain the battery, causing an uncontrolled shutdown. It also created a secondary issue: power use in electronics invariably translates to heat, and the phone's overheat protection would throttle CPU use and, eventually, shut down the device. We solved these issues in software, by monitoring the phone's battery level and temperature, and completely disabling the camera whenever these metrics passed critical levels. This allowed the phone to cool and/or charge until the metrics passed above a second set of target levels. The application maintains a state machine with transition queueing to ensure that, for example, starting a new recording is deferred until recordings are resumed, and that the camera is not inadvertently disabled during an active recording.

Live Experiments

In a live experiment, the application's role changes in several ways.
A connec-tion to the automation server is, naturally, no longer required, and the applicationbecomes fully controlled by the experimenter, through the touchscreen.The experimenter can start a new “session,” and then make an unlimited num-ber of recordings in each session. “Live” recordings are treated differently to au-tomated ones by both the phone application and server suite, in large part due tothe lack of external sensing hardware – there are no force sensor or robot positionreadings to store, and no automation server to synchronize with or send updates to.Instead of the automation server generating and uploading a metadata json filefor each recording, as in automated touches, the phone application generates thisfile. It includes accurate capture times for each video frame, and the start and stoptimes of the recording. These files are uploaded individually, rather than packagedin a zip archive as in automation sessions. Videos are uploaded as normal, albeitto a different S3 folder.Once each live touch has been completed (when recording is stopped), and theresulting video has been uploaded, the phone application sends an HTTP requestto our cloud instance, demanding a new material estimate for the recording. Therequest specifies the session, recording, and model to use to generate the results.The model to use is selected by the experimenter through a drop-down UI, whichis populated by listing the available models on S3. The job request occurs on an in-dependent background thread, which then continues to monitor the progress of thematerial estimation job in the background, polling the instance with additional re-quests. Multiple simultaneous requests are supported, whether from a single probedevice or many – jobs are simply queued by the instance, and several monitor-ing threads may run at once within the phone application, though it may not be55clear which set of results corresponds to which recording; currently, only one setof results is displayed at a time. The application alerts the user when estimationhas completed (or failed) via a UI update. The resulting µ¯ and L0 estimates areshown in the results display within the application (Figure 2.15) as soon as theyare available, and are also stored on S3 for later processing.Older “live” recordings may be subsequently reevaluated on-demand usingnewer, or different models. New estimates are indexed by the model used to createthem, not overwritten. This reevaluation is not currently possible from the phoneapplication itself, which provides no ability to browse or play back old recordings,or to examine old results – however, these capabilities are perfectly feasible withsoftware updates to the application.2.5.2 Cloud Server SuiteWe perform live, GPU-intensive computations for the SkinProbe 2.0 system on anAWS P3.2xlarge instance, which we operate on-demand. The instance per-forms three main tasks: force inference, optical flow calculation, and material pa-rameter estimation. The force and flow tasks are generally rolled into a materialestimation calculation – as discussed in Section 2.2, force estimates and opticalflow fields for a given recording are necessary for generating these final mate-rial estimates. However, we split these tasks semantically so we can maintain thecapability to perform them separately. 
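A minimal sketch of how such a split can look as a Redis-backed job queue follows. It is illustrative only: the queue names, job fields, and key layout are hypothetical rather than the exact ones used on our instance.

```python
import json
import uuid
import redis

r = redis.Redis()  # local Redis instance used for inter-process communication

def enqueue_job(job_type, params):
    """Post a force, flow, or material job and create its status entry."""
    job_id = str(uuid.uuid4())
    r.hset(f"job:{job_id}", mapping={"status": "queued", "type": job_type})
    r.rpush(f"queue:{job_type}", json.dumps({"id": job_id, **params}))
    return job_id

def worker_loop(job_type, handler):
    """Synchronous worker: block until a job arrives, process it, record the result."""
    while True:
        _, raw = r.blpop(f"queue:{job_type}")   # blocks, matching the polling style above
        job = json.loads(raw)
        r.hset(f"job:{job['id']}", "status", "running")
        result = handler(job)
        r.hset(f"job:{job['id']}",
               mapping={"status": "complete", "result": json.dumps(result)})
```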
This allows us to test separately, and to au-tomatically obtain force estimates and flows from the cloud instance for training.The three tasks are split across two main systemd services on the instance,with an additional service providing the public-facing API endpoints using Flask18,and an automatic update service which runs at startup, pulling in new code. Finally,we also run a standard Redis in-memory database on the instance, which we usefor high-performance IPC, with synchronous polling and robust queueing.Workflow for Live ExperimentsDuring a live experiment, the smartphone application communicates with the cloudinstance to obtain results. Upon receipt of a request for new material estimates,18This service should not be confused with the local automation server, which also uses Flask56the Flask server places the job (and its associated parameters) on a Redis queue,and creates a new entry in the database representing the job’s status and results.The PyTorch server, which synchronously19 polls from this queue, receives thejob and begins processing it. Meanwhile, the Flask server responds to the phoneapplication with a job ID, which can then be used in subsequent queries to monitorthe process status.As discussed in Section 2.2, generating a material estimate involves first es-timating forces applied throughout the video, selecting frames for optical flow,generating that flow, and executing the material model with the flow as input. ThePyTorch service executes most of this process, communicating with the flow serverthrough Redis to trigger flow generation (the flows are then available on the localdisc).Some implementation details complicate the pipeline we have previously dis-cussed. As the video data is initially uploaded to S3, and is not immediately avail-able locally, the server must download the new data, and unpack it to an imagesequence on the instance’s virtual storage – we achieve this using the same dataloading libraries we use for training and testing our regression models. The forceand material models requested for the job may also not be available locally; in thiscase, they are dynamically loaded from S3, where they are stored in their fully-trained state. When processing material estimation requests, which require both amaterial and force model, the downloaded material model’s metadata then speci-fies the particular force model to download and use for the force estimation step.As only one pass through each dataset is required, we use our data loader’s lazyloading ability to keep memory usage in check.The optical flow service runs in a separate process to the material and forceestimation, and exists to operate the Tensorflow FlowNet2 model. It uses Redis forinter-process communication in precisely the same way as the estimation service:when optical flow pairs are required by the material data loader in the main esti-mation process, a new flow job is posted to a Redis queue, requesting those flowpairs. The flow service polls this job from its queue, loads the images from disc,estimates the optical flow fields, and saves them back to disc in Middlebury flo19In this context, synchronous means blocking. The PyTorch service waits idle until a job becomesavailable on the queue.57format. 
Once all pairs are processed, the flow service updates the job to indicatethat the results are available for loading by the estimation process.After completing the estimation pipeline to obtain normalized results, we de-normalize the results to real units using each model’s normalization parameters.The generated force and material estimates are then gathered and stored on thejob’s Redis entry, and the job is marked as “complete.” We finally add the generatedestimates to the session’s estimates files, indexed by the model used to create them,and (re-)upload them to S3. These files constitute a basic database of completedestimates, which can then be downloaded for offline analysis.Throughout the job’s processing, which takes several seconds, status updatesare written to the previously-created Redis database entry. These updates may beread asynchronously by the smartphone – or any other – client by simply polling astatus HTTP endpoint with the job ID. The client obtains final results in the sameway – once they are made available on Redis, the generated estimates are returnedin the status response, for display to the experimenter through the smartphone ap-plication UI.58Chapter 3ResultsHere we outline and discuss the results of our work, analyzing the accuracy ofthe probe’s flexure-based force estimation, and of the generated material resultsfor the synthetic tissue phantoms – the same phantoms used to train the probe’sneural nets. We establish the prototype probe’s functionality under controlled con-ditions, and with this limited set of materials, and further evaluate performance ofthe system when probing is done by hand.3.1 Force PredictionsThe overall design of our pipeline for SkinProbe 2.0 means that accurate forcepredictions are an important foundation for accurate overall results. Poor force es-timates could add noise or, worse, bias to the final material estimates by leadingto incorrect frame selection – with a target of 1N, underestimating the force andmistakenly selecting frames at a true applied force of 1.2N could lead to a µ¯ under-estimate, as the observed deflection would be greater than at the levels the materialestimator was trained for.We therefore tested the force predictor experimentally, to validate its perfor-mance in isolation from the rest of the pipeline. We captured data for these ex-periments using the robotic automation rig, generating probe touches on either therigid force-training plate, or the phantom tray. We programmed for touches withvarious angles of attack, speeds, and maximum forces – these variables were ran-59Loss (Newtons)Dataset MSE (L2) MAE (L1) RMSE R2Training 4.44e-03 0.0432 0.066 0.9977Validation 2.74e-03 0.0359 0.071 0.9972Test 5.08e-03 0.0422 0.069 0.9964Soft contact 7.43e-03 0.0592 0.086 0.9931Leather contact 1.03e-02 0.0791 0.101 0.9916Guided handheld 8.90e-03 0.0655 0.094 0.9927Freely handheld 3.75e-03 0.0426 0.061 0.9961Post-stress 1.10e-01 0.229 0.332 0.9185Post-stress corrected 6.04e-03 0.0492 0.078 0.9955Post-stress retrained 4.81e-03 0.0409 0.070 0.9961Leather in direct sunlight 3.24e-01 0.445 0.569 0.7676Table 3.1: Force estimator network MSE and mean absolute error (MAE)losses, root mean square error (RMSE) and R2 metrics. The losses andRMSE are measured in Newtons, which the network outputs directly.domized across the different samples. 
We further evaluate the model in handheldusage, with “randomization” of touches carried out by the experimenter.We summarize the accuracy of our force estimation results numerically in Ta-ble 3.1, and expand on these results in detail in the remainder of this section.3.1.1 Performance on Training and Validation DataWe first examine the performance of the force model on its own training and vali-dation datasets. We plot individual touch recordings from the datasets for clarity -note again that the force readings are calculated per-frame, so while some of theseplots run over time, each estimate is calculated independently.The net performs well on its own training data (Figure 3.1), with 87% of thetraining samples yielding predictions within∼ 0.1N of the measured ground truth.This is not surprising – in fact, neural networks often over-fit to their training data,picking up on artifacts and specific patterns which may appear there, but which areabsent in real-world data. We might expect the network to do better here than ithas, and indeed leaving the network to train does improve results on the training60Figure 3.1: As expected, the force network predicts forces in its training setextremely accurately; results of several consecutive touches are shown,with different maximum depths (maximum force levels) and contactspeeds.Figure 3.2: Performance of the ForceNet on its own validation dataset, whichis captured in the same way as the training set. The net is not trainedon this data, but the network state which performs best on this data isselected. The results here very nearly match performance on trainingdata.61Figure 3.3: Progression of the average per-item training and validation losseswhile training the force estimator. Note the logarithmic y-axis. Trainingloss continues to drop long after validation loss has plateaued, demon-strating over-fitting – to counter this, we save and use the network whichperforms best on validation data.set to the point of near-perfect accuracy (Figure 3.3). Since we save the networkduring training at the point it performs best on its validation data (which may occurrelatively early in training), this effect is not seen in these results – performance onthe training data is imperfect. We show how the network performs on its validationdata in Figure 3.2.In fact, comparing the MAE for training and validation data shows that thisparticular network performs better on its validation set than on its own training set,with 0.0432N MAE for training data, versus 0.0359N for validation data. 91% ofvalidation data predictions fall within 0.1N of the ground-truth.This improvement over training data performance is slightly surprising, but isperhaps explained by outlier predictions in the training set. It seems that fewer such“difficult” cases appear in the validation set, which is far smaller, and which maysimply miss these types of outliers by chance. Another explanation is stochasticityin the validation performance – examining the loss graph in Figure 3.3 shows that62Figure 3.4: Performance of the ForceNet on an unseen test dataset, whichwas captured in the same way as force training and validation data. Thenet was not trained or validated on this data.the validation loss fluctuates noisily, even long after it plateaus. If the performancehappens to peak at an early point (while training data performance is still worse),this network will be saved and kept. 
In this case, the network would be over-fit tothe validation data.3.1.2 Performance on Test DataThe force network also performs well on test data – data not seen in any way duringtraining1 – with some caveats as our test data moves away from the distribution oftraining and validation data. This first test set is a small collection of rigid platetouches, captured in the same way as the network’s validation data. We show thenetwork’s results on this data in Figure 3.4.The network accurately tracks the overall shape of the force curves, with noapparent noise and with only minor inaccuracies. The MAE of 0.0422N comparesfavorably with results on the training and validation datasets, with the networkperforming only slightly worse than on its validation data, and marginally out-performing its training data results.1As discussed above, our networks could be slightly over-fitted to their validation data, as weexplicitly keep the network weights which yield best performance on that data.63Figure 3.5: Testing force estimation in contact with a phantom.3.1.3 Performance on Novel Test DataTesting the neural net only on data captured robotically, in a fashion identical toits training data, is somewhat misguided. In our system, the force predictions areintended to eventually be used in contact with soft human tissue, which displaysa variety of colors and textures that may inadvertently be shown to the networkthrough reflections, and other optical artifacts. Motion of the probe when heldby a human, or when in contact with different materials, may also differ quali-tatively from robotic motion in contact with a rigid plate. These different typesof motion could reveal flexure configurations which are not reached (or not com-monly reached) in the robotic training data. In the remainder of this sub-section,we test the network’s resilience to different contact materials, surface aesthetics,and modes of motion.First, we capture data in contact with one of the flesh-colored silicone phan-toms (Figure 3.5). Here, the network performs well in general (Figure 3.6), butstruggles with cases displaying unusual (with respect to the training data) configu-rations of the flexure. Summarizing performance over 30 touches yields an MAEof 0.0592N, which is worse than the tests on a rigid backing, but this averagederror does not tell the whole story. On mid-to-high force touches, the network per-forms as normal, but it struggles to predict on some of the lower-force touches,pathologically continuing to predict near-zero forces at levels which are normally64Figure 3.6: Performance of the ForceNet on a test dataset containing toucheswith a silicone phantom. These touches are otherwise similar to thosecarried out on the rigid force plate for training.easily detectable. Figure 3.7 shows one touch where this effect was particularlydramatic.In this case, the network barely deviates at all from a prediction of 0N force,despite clear contact being made. Examining the data tells us why – Figure 3.9shows the frame with the highest error, and compares it to one with no force ap-plied. The two frames are visually very similar, making a zero-force predictionunderstandable. This touch involved a shallow contact with significant lateral mo-tion – the probe was moved to the left during contact, pulling the flexure tip tothe right, away from the camera, and exerting nearly as much lateral force as nor-mal force (which is what the network predicts). 
The network is not trained withdata like this, for the simple reason that this type of motion results in sliding onthe rigid plate; the flexure cannot be moved into a configuration with proportion-ally high lateral force without a grippy, or slanted surface to press against. Thebest way to train the network to predict accurately on touches like this would beto include similar data, with large lateral forces, in the training set. This wouldrequire a force plate with a grippy surface, presenting a shortcoming in the rigid65Figure 3.7: Poor force estimation on a pathological case, in contact with asoft silicone phantom.Figure 3.8: Contact force errors, with distribution, in the phantom force test.We separate estimates at near-zero measured force from those at higherforce levels to highlight the network’s relatively high accuracy nearzero, and the network’s tendency to incorrectly predict zero force evenat fairly high force levels – up to −0.4N.66Figure 3.9: Poorly estimated frame of force network input, compared to aframe with no force applied. Here, the network predicted −0.003N, fora frame with a true measured force of −0.342N.plate design, and an opportunity for future improvement.Luckily for our purposes, we do not consider this type of contact typical for thetouches used for probing, which largely move directly into and out of the target –this “blind spot” in the network’s training should not affect the intended use case.Examining the distribution of prediction errors directly (Figure 3.8) shows usthat the network has a strong preference for predicting a zero force level. This istroubling, but is likely a simple reflection of a skewed distribution in the trainingdata. The network learns that images with no force applied are extremely common,and while this is true, feeding the network a proportionally large number of thesecases – as we currently do – may be counterproductive for training and predictionperformance, as they are all extremely similar. These zero-force frames are alsoclearly not captured during touches, making them of little interest to us.We captured an additional dataset in contact with leather, laid over the samephantom as above. The leather provides a far darker, more specular surface (Fig-ure 3.10), as well as different contact dynamics. Examining the prediction graph(Figure 3.11), we see that the network follows the true force curve, but often doesso with an erroneous offset. In general, it appears to have a troubling tendency to67Figure 3.10: Testing force estimation in contact with a leather strip (laid overa phantom).Figure 3.11: Performance of the ForceNet on a test dataset containingtouches with leather, overlaid on a silicone phantom. These touchesare otherwise similar to those carried out on the rigid force plate fortraining. Note the tendency to underestimate the force magnitude.68Figure 3.12: Contact force errors, with distribution, in the leather strip test.We separate estimates at near-zero measured force as before, high-lighting the relatively poor (and skewed) results at higher forces in thiscase.underestimate the magnitude of the applied force. We show the error distributionfor the leather dataset in Figure 3.12.We speculate that this may be due to differences in cast light from the leatheraltering the pattern of features detected by the network, when compared to thetraining data using the lighter rigid force plate. As the rigid plate is in fact designedto create different levels of cast light, this demonstrates another shortcoming of theplate’s design. 
The plate’s surface is fairly diffuse, so it scatters the light from thesmartphone’s flashlight back across the probe and flexure, while the leather absorbsor specularly reflects the light.Remaining with the theme of robustness to lighting conditions, we capturedan experimental data set in direct sunlight (Figure 3.13). These data representa significant departure from those seen in the training set, as the sunlight is brightenough to overwhelm the phone’s flashlight on the contact surface (the same leatherused in the previous test). As such, the network makes significant errors in itsprediction, but interestingly is not completely thrown off – a strong signal remains69Figure 3.13: Testing force estimation in direct sunlight, in contact withleather.Figure 3.14: Performance of the ForceNet on a test dataset containingtouches with leather in direct sunlight. We do not expect (and do notsee) accurate results here, but a strong signal remains.70Figure 3.15: Moving the probe by hand in its mount.(Figure 3.14), and the predictions are stable from frame to frame, if not particularlyaccurate. That said, the MAE of 0.445N is unacceptable for real-world usage.Further work, likely including new force training data, would be needed for thesystem to function in the presence of such a bright light source.Handheld UsageAs discussed above, types of motion outside the training data’s range can present achallenge for the force predictor, and so handheld motion of the probe could provedifficult for the force predictor to follow. Here, we captured datasets comprisingassorted hand-guided and fully handheld touches on the silicone phantoms.The first dataset has the probe mounted on the automation rig, but with themotors disabled. All motion of the probe is created directly by hand, using a handleon the probe’s base, but the rig holds the orientation of the probe perfectly steady.We made several touches on different phantoms and at various force levels, up toapproximately 4N, and sometimes retracting and re-inserting the probe tip duringa touch (a type of motion not seen in training data).Results here were promising (Figure 3.16), with metrics very comparable torobotic touches on the same phantoms. While predictions worsened when the proberemained in contact with the phantoms, predictions during the compression phase71Figure 3.16: ForceNet predictions for phantom touches, made by hand butwith the probe mounted in the automation rig (with the motors dis-abled).of each touch were excellent, and it is these predictions which are critical for thematerial estimation process. We suspect this degradation in performance duringtouches is due to unaccounted hysteresis in the flexure itself, and could potentiallybe accounted for with a more advanced model – one which takes into accountthe flexure’s dynamics. This could be a similar model with multiple image frameinputs, or even a “tacked-on” post-processing step accepting consecutive force es-timates, and outputting a revised, dynamics-aware estimate.We followed up by detaching the probe from its mounting entirely, and recordedfreely handheld touches, intended to resemble material estimation touches by anexperimenter: slow-to-medium contact speed, up to approximately 3N. In thisscenario, the flexure and predictor network also performed very well (Figure 3.17),with an MAE of 0.0426N; no worse than in robotic testing. 
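(For reference, the MAE, RMSE, and R² values reported in Table 3.1 and throughout this chapter follow their standard definitions; a minimal NumPy sketch, assuming arrays of predicted and measured forces in Newtons:)

```python
import numpy as np

def error_metrics(predicted, measured):
    """Standard regression metrics, in the same units as the inputs (Newtons here)."""
    errors = predicted - measured
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((measured - np.mean(measured)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```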
The only meaningfulfailure in this test is the system’s obvious inability to detect negative (or, strictlyspeaking, positive) normal forces; as with other shortcomings we have observed,it is not trained to do this, as the rigid plate does not display the 00-10 silicone’sdistinctive tackiness. We reason this shortcoming is unimportant for our use case,72Figure 3.17: ForceNet predictions for phantom touches, made fully by hand– the probe was not mounted or in any way attached to the automationrig, though the phantom tray was (in order to detect forces).as we are interested in material behavior under compression, rather than these briefperiods of tension.Post-Stress Test DataWhile we account for slight shifts in the mounting position of the flexure usingdata augmentation, the force prediction process does not account for plastic defor-mation (“settling”) of the flexure after extended use. The data set in Figure 3.18was captured on the rigid plate after approximately 6,000 touches at relatively highforce levels, up to 7N. While the zero-point has not majorly shifted, the networksignificantly over-predicts forces at higher levels of deflection, sometimes overes-timating by a full Newton. Overall accuracy reduced to an MAE of 0.229N, overan order-of-magnitude decrease in performance.Currently the only way to solve this problem in actual usage is to re-train with amore recent data set, which yields results at the normal level of accuracy, returningto an MAE of 0.0409N. However, in Figure 3.19, we show that the error is approx-imately a function of the estimated force. It could therefore be feasibly corrected73Figure 3.18: Performance of the ForceNet on an unseen test dataset capturedafter thousands of post-training touches, recorded in the same way asforce training and validation data. The net was not trained or validatedon this data.Figure 3.19: Force network residuals on the post-stress dataset, with andwithout a polynomial correction. Raw prediction errors are shownabove in gray, while corrected errors are shown below in blue.74using a simple calibration process, transforming the network output to match re-ality using a small dataset. We optimized a fourth-degree polynomial correctionon the output of the force model, and show that these “corrected” errors are onlymarginally worse than a retrained network. With a corrected MAE of 0.0492N,this correction may be sufficiently accurate for our methods – though this metric,taken alone, could display some over-fitting of the correction to this particular case.We believe the problem of mechanical changes to the flexure during extendeduse may indeed be solvable with a calibration process, which would remove theneed for large-scale force data collection for re-training. In any case, this problemonly occurs after the flexure is exposed to force levels outside its range of use – wedid not observe this problem when limiting touches to 3−4N, even after thousandsof touches.3.2 Material Estimation3.2.1 Validation and Test DataHere we examine the accuracy of the material network’s estimates, beginning withthe training and validation data performance. Recall that robotically-collected ma-terial data takes the form of a scan of hundreds to thousands of touches across eachof the three material phantoms, in order of increasing stiffness. Each full touch isa single data-point for the material network, resolving to a single flow input. Foreach touch, the material network provides a material estimate vector containing µ¯and L0 predictions. 
Touches take place left-to-right on the phantoms, going fromthe deep towards the shallow side of the phantoms, are randomly spaced with auniform distribution, and randomly offset in the Y (up/down) direction. Refer toSection 2.2.3 for additional details on the material data collection.The material network performs well on its validation data, though notablyworse than on its training data, demonstrating the overfitting common to manyDNNs. Results are somewhat noisy for both dimensions of material property es-timates, but the underlying signal is very clear, and the network rarely makes sig-nificant mis-predictions on this data. L0 predictions are shown in Figure 3.20a,and µ¯ predictions in Figure 3.20b. Table 3.2 provides useful metrics for the per-75(a) L0 results(b) µ¯ resultsFigure 3.20: Material network performance on its own validation set, pre-dicting a) L0 material depth and b) µ¯ surface stiffness. Estimates areshown in blue, and corresponding estimate error (to ground truth) isshown in red.76MAE RMSE R2TrainingL0 0.050 mm 0.064 mm 0.9995µ¯ 6.909 N/m 9.035 N/m 0.9994ValidationL0 0.259 mm 0.389 mm 0.9816µ¯ 25.468 N/m 51.601 N/m 0.9782TestL0 0.438 mm 0.706 mm 0.9447µ¯ 35.793 N/m 69.765 N/m 0.9662Table 3.2: Material estimator MAE, RMSE and R2 metrics for the two differ-ent material properties. Recall that L0 measures thickness, and µ¯ mea-sures stiffness. Note that training error is, by several measures, signifi-cantly lower than validation error, which is itself lower than the test error.formance on both properties. Promisingly, we can see that most L0 predictions fallwithin half a millimeter of the true value, and that the handful of outliers fall within1.5mm. The majority of µ¯ predictions here are also excellent, but the few outliersare fairly pronounced, and there are some regions where predictions are incorrectlyskewed.Testing with data unseen by the network during training shows performancewhich is worse again than the validation performance (Figure 3.21), but still promis-ing. In this case, most L0 predictions fall within two millimeters of the true value– slightly, but not majorly worse than on validation data – and we once again ob-serve a handful of more extreme outliers. The majority of µ¯ predictions are againexcellent, but we see significant outliers and some regions where predictions areincorrectly skewed – the predictions seem to be particularly bad around the pointthe phantom flattens out at 1mm, where the flexure could simply be getting de-flected down the slope unpredictably. L0 results seem to be most accurate in thefinal thin section of each phantom, while the opposite appears to be true for the µ¯predictions.77(a) L0 results(b) µ¯ resultsFigure 3.21: Material network performance on an unseen test dataset, pre-dicting a) L0 material depth and b) µ¯ surface stiffness. Estimates areshown in blue, and corresponding estimate error (to ground truth) isshown in red.78(a) Front (b) Top view, showing flexure ori-entationFigure 3.22: The dismounted prototype, ready for freehand usage.3.2.2 Handheld Comparison ExperimentWe set up an experiment to analyze the probe’s true performance on the phantoms,and to help eliminate variables as sources of error. We collected three datasets,with different methodologies:• Handheld-Guided: The probe was left mounted in the automation rig, butthe robot’s motors were disabled (near-zero resistance), and the probe washeld and moved by the experimenter. 
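As noted in Section 2.5.2, these outputs are produced in normalized form and mapped back to physical units using the normalization parameters stored with each trained model – typically a simple affine transform. A minimal sketch follows; the constants are placeholders for illustration only, not the values used by any of our models.

```python
import numpy as np

# Hypothetical per-model normalization parameters (mean, scale) for (mu_bar, L0).
OUTPUT_MEAN = np.array([500.0, 5.0])    # N/m, mm -- placeholder values
OUTPUT_SCALE = np.array([300.0, 3.0])   # N/m, mm -- placeholder values

def denormalize(prediction):
    """Convert a normalized (mu_bar, L0) network output to physical units."""
    return prediction * OUTPUT_SCALE + OUTPUT_MEAN
```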
This kept the probe constrained to afixed orientation, but left motion in the three cartesian axes to the experi-menter’s hand movements.• Handheld: The probe was completely detached from the automation rig,and used freehand (Figure 3.22). The phantom tray was also detached, andplaced flat on a table (not in the standard “wall-mounted” orientation). Allmotion of the probe was left to the experimenter, including rotation, thoughthe experimenter attempted to maintain a perpendicular angle of attack, asbefore.• Robotic: The probe was mounted and robotically articulated in much thesame way as was used to generate training and validation data. This wasmeant to serve as a “control.”Each set of three “locations” corresponds to a different phantom – 00-10, 00-30 and 00-50 from left to right in the figures. We sampled each phantom with 5repeats at each of 3 locations – on either end, at the extremes of thickness, and79MAE RMSE R2RoboticL0 0.316 mm 0.481 mm 0.9829µ¯ 34.524 N/m 65.161 N/m 0.9856GuidedL0 0.377 mm 0.627 mm 0.9709µ¯ 101.470 N/m 232.215 N/m 0.8165HandheldL0 1.179 mm 1.713 mm 0.7827µ¯ 217.282 N/m 359.104 N/m 0.5612Table 3.3: Material estimator MAE, RMSE and R2 metrics for L0 and µ¯ . Notethat these metrics represent only 45 samples per dataset, far fewer thanthe other test results discussed above.in the center. We chose this coarse sampling to enable us to target touches withreasonable accuracy with the probe handheld.The material estimates for these datasets (Figure 3.23) followed a general trendof decreasing accuracy from robotic, to guided, to handheld results (Table 3.3).The robotic measurement accuracy was comparable to the previously-discussedtest results for the material estimator, and we observe robotic estimates generallyclustered around the ground truth values for both material properties. Estimateson the handheld-guided dataset were also fairly accurate – while performance inthe µ¯ dimension suffered, predictions for tissue thickness were only slightly worsethan those with the robotically-captured data. Estimates with the freely handhelddataset were notably less accurate than either dataset with the probe mounted, withworse performance in both parameter dimensions.There are several possible explanations for the poorer results with hand-guidedand fully handheld readings: these are necessarily limited to differences betweenthe test data and the training data, and corresponding differences in downstreamresults, including the force predictions and flows. We have established that theforce predictions remain accurate even as the probe is manipulated by hand, andthe phantoms contacted in these tests have not changed from those the materialestimator was trained on – so, we suspect differences in the probe’s motion between80Figure 3.23: Material estimator results for robotically- and manually-captured datasets.81the training and test data to be the cause of the mis-predictions seen here.As with any handheld object, handheld articulation of the probe can createsmall “tremor” motions, which the material network does not see in training –while we attempt to account for this in training by dynamically moving the probewhile in contact with the phantoms, we may not have accounted for all possiblevariations. Another possible difference is that the speed of a handheld touch mayvary through the duration of the touch, while in the robotic touches seen in training,the speed is consistent throughout each touch. 
When fully handheld, dynamicrotations may also occur relative to the phantom surface; rotation is not presentin the training data, robotic data, or in the handheld-guided data, as the probe’sorientation is fixed: On its mounting, the probe moves only in the 3 cartesian axespermitted by our robot.In this case, experimenter inaccuracy may be an additional cause of someamount of the error seen in these graphs. Small differences in the location ofeach touch on the phantoms could lead to measurement of substantially differentmaterial properties, as both measured properties can change rapidly across eachphantom. This experiment assumes that the experimenter has contacted the phan-toms accurately in each touch, at the designated positions. This is necessary forthe handheld data, as we have no simple way of measuring the probe’s positiononce it is removed from its mounting. However, this leaves room for human error,particularly at locations 1, 4, and 7 in the middles of the phantoms.Interestingly, L0 predictions appear to be best at the shallower end of eachphantom, where µ¯ predictions are worst. This may be an issue of differentiationin the data – visually, there is little difference between pressing into a very thinstrip of soft silicone, and pressing into the same depth of stiffer silicone; whatis obvious is that both strips are thin. The µ¯ estimates for the thin end of eachphantom appear to jump between the 00-10, 00-30 and 00-50 stiffness values for1mm thickness, demonstrating both that the network has over-fit to the particularproperties of the three phantoms, and that there remains uncertainty even in differ-entiating just these three cases. The material estimator also seems to struggle withthe deeper L0 predictions, especially for 00-50: in this stiffer phantom, there maybe less surface-level behavioral changes due to increasing depth than for the softerphantoms, particularly the 00-10.82Chapter 4Conclusions4.1 ContributionsOur primary contributions are the hardware and software making up our system:• We built a low-cost prototype probe device capable of accurate force sens-ing from purely optical data, and limited detection of material properties incontacted surfaces.• We developed application software for the probe’s smartphone, which con-trols camera parameters, communicates with cloud services, and providesboth remote-control capabilities and a user-operated graphical user interface.• We built a robotic system for automated capture of training data for our esti-mators, developed control software for this platform permitting custom userprograms, and developed cloud storage clients for storing and retrieving thisdata in useful formats.• We deployed a cloud pipeline for rapid, on-demand estimation and storageof forces, optical flows, and material data for one or more probe devices,using GPU-accelerated machine learning models.• We characterized the performance of our prototype through a series of labo-ratory tests.83To summarize, we believe the notable contributions of this project include theuse of a smartphone and “flexure” to estimate forces and material properties withonly optical data, the cloud pipeline for estimating those results, and the roboticsystem used to help train the estimators.4.2 Project GoalsWe set out to create a system with these targets in mind:• Low cost, both to produce and to use• Portable• Usable with little training• Rapid data collection• Accurate, useful resultsOur device is certainly low-cost, especially when considered in the contextof the SkinProbe V1. 
Building a new probe requires a one-off investment in adual-extrusion 3D printer, which is a professional, but relatively affordable devicecosting in the region of $4000. Training estimators for a new probe requires a 3-axis robot and force sensor, together costing no more than $10,000. Training alsorequires access to a GPU-equipped workstation or cloud computer. The incremen-tal cost of a SkinProbe 2.0 device is then no more than $1000, the bulk of whichis the cost of a new smartphone. This compares to an initial cost of up to severalhundred thousand dollars for the V1 probe, which, unlike our prototype, could notbe duplicated for use in parallel – a new host computer and motion capture setupwould be required at each measurement facility.Using the device for experiments with live feedback imposes a cloud hostingcost of approximately $5 per hour, which pays for GPU instance rental. This costcould be reduced by using non-GPU instances, or lower-tier GPUs for inference,though estimation speeds are likely to suffer as a result. Note that this cost could beeffectively reduced if multiple probes were in use at once – a single cloud instance,with a fixed hourly price could support multiple “client” probes.84Use of a smartphone makes our system both user-friendly and portable. Ourcurrent phone application is usable for the limited trials we have run, with verystraightforward touchscreen controls – essentially just two buttons for experimen-tal data capture, which any smartphone user should be comfortable with. Thissoftware could also be easily expanded to serve many additional roles, changed tobe made more user-friendly (for example, by the addition of in-app tutorials), oreven localized to different local languages. Carrying out touches is not demanding,simply requiring a steady hand. The probe’s two-handed grip makes manipulatingit easy.We support experimental data collection with only an active internet connec-tion, making use on mobile data networks possible – an experimenter would beable to travel to other locations (or countries!) to gather data with human sub-jects, solving one of the major drawbacks of the V1 probe. There remain somelimitations around working environment: we have seen that force estimation failsin direct sunlight, and we can reason that strong shadows cast on a participant’sskin would interfere with optical flow generation. For these reasons, our systemcan likely only be used effectively in an indoor environment, without strong directlighting (e.g. from floodlights). That said, there are no other constraints on theambient lighting conditions – our use of the phone’s flashlight ensures consistentillumination, so long as the flashlight source is not overwhelmed.Collection of data is indeed rapid, though the response time for the cloud-basedgeneration of results is not ideal, typically taking≈ 10s. While new recordings canbe made during processing (data collection need not be stopped), real-time feed-back would allow experimenters to know if their experimental collection is workingas they go. The majority of this processing time is spent uploading, downloading,and unpacking the video files, and this time could be significantly reduced by re-ducing the size of the videos. For this prototype, we took a highly conservativeapproach to video compression, using high-bitrate, high-quality encodings to de-crease the possibility of compression artifacts tampering with our results.The final, and perhaps most important goal is accuracy. 
Our force-estimationpipeline works well in practice, including when handheld and in contact with novelmaterials – though performance does degrade in some cases, we believe these pre-dictions are still accurate enough to be useful, and hope that the minor remaining85issues with the force estimation are solvable with further work. Processing imagesto estimated forces is fast, and the network is small and simple enough that, in thefuture, it could feasibly be run on a mobile device, providing instant feedback toexperimenters.However, in this project, the force estimations are only a means to an end;our material estimation results are, for now, not reliable enough to be useful inpractice, and our system is not yet usable on humans. This is clearly a significantdrawback, but we do not see this challenge as insurmountable: merely deservingof further research and development on the probe’s software, primarily its estima-tion algorithms. Transitioning to use on human subjects may also require furtherdevelopment of the training phantoms, use of human data for training, or both.4.3 Future WorkAs this is a prototype system, there are many aspects of our work that could beimproved upon or expanded.4.3.1 Force SensingThe probe’s flexure deflects in more than just the normal direction – our intentionwas to support force estimation in three dimensions, and we believe this shouldbe feasible. Supporting this would be a matter of expanding the force estimationmodel to a three-dimensional output, verifying the results, and scaling each di-mension of the network output targets appropriately to ensure suitable levels ofaccuracy in all three axes. This would also likely require replacing the rigid forceplate with a grippy surface, allowing significant lateral forces to be applied duringtraining.Another notable drawback of our force sensing system is its lack of awarenessof the flexure’s dynamics from frame to frame. Improving the force estimationmodel to account for these dynamics, including hysteresis, would allow the accu-racy of our force estimates to be further improved, especially in cases of sustainedcontact with the subject. This could be achieved with a time-sequence model, suchas an LSTM, or even with a simpler model working as a post-process on a sequenceof single-frame estimates.86Robustness could be improved by adding darker patches to the rigid plate, tack-ling the problem seen in the leather contact test, or potentially by using a moretraditional pre-processing pass on the flexure images to negate the effects of differ-ent lighting conditions. There may also be some benefits to adding high-contrast,“trackable” features to the flexure shaft and frame, which the network (or a moretraditional computer vision pipeline) would be able to pick out more robustly.4.3.2 Material EstimationOur current material estimation system has several limitations, not least its loweraccuracy in handheld use. Our estimates tend to draw from the observed distribu-tion of phantoms seen in training, regardless of the actual properties of the materialbeing tested – we see this in the tendency for estimates to “jump” between thegroupings for the different silicone types. This clearly is of little use for analyzingthe properties of human soft tissues, and so we expect making this model general-ize may require a significantly wider variety of phantom data for training, includingphantoms designed to better approximate human tissue. 
One approach for gather-ing this data could be to generate it synthetically, covering a very wide range ofmaterial properties, with the use of optical flows and force data as inputs serving toremove the need for photorealistic image rendering – flows from a roughly texturedskin rendering should be very similar to the real-world equivalent. Another wouldbe to construct a wider variety of physical phantoms, training with these, and po-tentially fine-tuning with data captured from humans. Capturing human data wouldrequire using either the V1 probe, or a hybrid motion-tracked V2 probe to establishground-truth measurements.We would also suggest expanding the range of material properties estimated.Material force-displacement behavior could be parameterized in far greater detailthan our basic stiffness and thickness model. Terms could including volume preser-vation and anisotropic properties.4.3.3 Human TrialsSkinProbe 2.0 is ultimately intended to be used on humans, and so this is a majordirection for future work. Building new probes, and bringing them into the field for87human trials would allow data collection at a large scale, for application in diversefields such as character animation, medicine, and clothing and prosthesis designand customization. New probes could be equipped with mobile data SIM cards foruse away from dedicated research facilities.4.3.4 Cloud PipelineAs mentioned above, generation of estimation results is hardly real-time: there issignificant room for improvement in the speed of our cloud pipeline. Reducing thefile size of our videos would be a shortcut to reducing these processing times, andit is likely that lower-bitrate video encodings would still work well, especially ifour estimation networks were trained using videos encoded in the same way. If ourcurrent bitrate of 20-30,000 Kbps could be successfully reduced to a more typicalrate of 2-4,000 Kbps for our 1024× 768 resolution, we could see 5− 10× speedincreases in this transfer and unpacking portion of the results generation. Thereare also other straightforward optimizations which could be made: by unpackingvideos in-memory, we could perform force estimation directly on image framesas they are decoded, rather than waiting for all frames to be written to disc be-fore reading them back into memory one-by-one. Developing high-speed videoupload direct to the cloud instance would remove the interim step of S3 uploadand download; we would then be able to transfer videos to S3 from the instance,ensuring the same level of data retention as before. Better still, if the phone couldstream video to the instance, frames could be processed for force estimates evenbefore the recording was completed. At that point, frame selection and materialestimation could be completed even before the experimenter retracts the probe!The cost of operating the cloud pipeline could be reduced by switching to in-stances with less-powerful, or no GPUs, though this could present technical chal-lenges for running the FlowNet2 model.4.4 Final ThoughtsThis prototype is our first step towards large-scale measurement of human tissueproperties. Our system is not yet ready for prime time, but we have demonstratedfunctionality in several key areas, and developed robust, low-cost hardware ready88for further software improvements. On the technical side, we hope this work in-spires other researchers to use smartphones, optical force sensing, and cloud com-puting in their own projects.89Bibliography[1] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab. 
Robustoptimization for deep regression. In Proceedings of IEEE InternationalConference on Computer Vision, December 2015. → page 10[2] B. Bickel, M. Ba¨cher, M. A. Otaduy, W. Matusik, H. Pfister, and M. Gross.Capture and modeling of non-linear heterogeneous soft tissue. ACMTransactions on Graphics, 28(3):89:1–89:9, July 2009. ISSN 0730-0301.doi:10.1145/1531326.1531395. URLhttp://doi.acm.org/10.1145/1531326.1531395. → page 13[3] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of featurepooling in visual recognition. In Proceedings of 27th InternationalConference on Machine Learning, pages 111–118. ICML, 2010. → page 28[4] R. O. Bude and R. S. Adler. An easily made, low-cost, tissue-like ultrasoundphantom material. Journal of Clinical Ultrasound, 23(4):271–273, 1995.doi:10.1002/jcu.1870230413. URLhttps://onlinelibrary.wiley.com/doi/abs/10.1002/jcu.1870230413. → page 14[5] J. J. Clark. A magnetic field based compliance matching sensor for highresolution, high compliance tactile sensing. In Proceedings of IEEEInternational Conference on Robotics and Automation, pages 772–777 vol.2,April 1988. doi:10.1109/ROBOT.1988.12152. → page 11[6] S. Derler, U. Schrade, and L.-C. Gerhardt. Tribology of human skin andmechanical skin equivalents in contact with textiles. Wear, 263(7):1112 –1116, 2007. ISSN 0043-1648.doi:https://doi.org/10.1016/j.wear.2006.11.031. URLhttp://www.sciencedirect.com/science/article/pii/S0043164807003535. 16thInternational Conference on Wear of Materials. → page 15[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a singleimage using a multi-scale deep network. In Advances in Neural Information90Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.URL http://papers.nips.cc/paper/5539-depth-map-prediction-from-a-single-image-using-a-multi-scale-deep-network.pdf. → page 11[8] Y. Fan, J. Litven, and D. K. Pai. Active volumetric musculoskeletal systems.ACM Transactions on Graphics, 33(4):152:1–152:9, July 2014. ISSN0730-0301. doi:10.1145/2601097.2601215. URLhttp://doi.acm.org/10.1145/2601097.2601215. → page 12[9] S. Glassenberg and M. Yaeger. Gastro ex: real-time interactive fluids andsoft tissues on mobile and vr. In ACM SIGGRAPH 2018 Real-Time Live!,page 3. ACM, 2018. → page 12[10] X. Guo, Y. Huang, X. Cai, C. Liu, and P. Liu. Capacitive wearable tactilesensor based on smart textile substrate with carbon black/silicone rubbercomposite dielectric. Measurement Science and Technology, 27(4):045105,2016. → page 11[11] T. J. Hall, M. Bilgen, M. F. Insana, and T. A. Krouskop. Phantom materialsfor elastography. IEEE Transactions on Ultrasonics, Ferroelectrics, andFrequency Control, 44(6):1355–1365, Nov 1997. ISSN 0885-3010.doi:10.1109/58.656639. → page 14[12] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deepregression networks. In Proceedings of European Conference on ComputerVision, pages 749–765. Springer, 2016. → page 11[13] S. Hirose and K. Yoneda. Development of optical six-axial force sensor andits signal calibration considering nonlinear interference. In Proceedings ofIEEE International Conference on Robotics and Automation, pages 46–53vol.1, May 1990. doi:10.1109/ROBOT.1990.125944. → page 11[14] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox.Flownet 2.0: Evolution of optical flow estimation with deep networks. InProceedings of IEEE Conference on Computer Vision and PatternRecognition, pages 2462–2470, 2017. → pages 7, 39[15] M. Kim, G. Pons-Moll, S. Pujades, S. Bang, J. Kim, M. J. 
Black, and S.-H.Lee. Data-driven physics for human soft tissue animation. ACMTransactions on Graphics, 36(4):54:1–54:12, July 2017. ISSN 0730-0301.doi:10.1145/3072959.3073685. URLhttp://doi.acm.org/10.1145/3072959.3073685. → page 1391[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXivpreprint arXiv:1412.6980, 2014. → page 37[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in Neural InformationProcessing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. → page 10[18] T. Kroeger, R. Timofte, D. Dai, and L. Van Gool. Fast optical flow usingdense inverse search. In Proceedings of European Conference on ComputerVision, pages 471–488, Cham, 2016. Springer International Publishing.ISBN 978-3-319-46493-0. → page 38[19] P. G. Kry and D. K. Pai. Interaction capture and synthesis. ACMTransactions on Graphics, 25(3):872–880, July 2006. ISSN 0730-0301.doi:10.1145/1141911.1141969. URLhttp://doi.acm.org/10.1145/1141911.1141969. → page 13[20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learningapplied to document recognition. Proceedings of IEEE, 86(11):2278–2324,1998. → page 28[21] C. Ledermann, S. Wirges, D. Oertel, M. Mende, and H. Woern. Tactilesensor on a magnetic basis using novel 3d hall sensor - first prototypes andresults. In Proceedings of IEEE International Conference on IntelligentEngineering Systems, pages 55–60, June 2013.doi:10.1109/INES.2013.6632782. → page 11[22] S.-H. Lee, E. Sifakis, and D. Terzopoulos. Comprehensive biomechanicalmodeling and simulation of the upper body. ACM Transactions on Graphics,28(4):99:1–99:17, Sept. 2009. ISSN 0730-0301.doi:10.1145/1559755.1559756. URLhttp://doi.acm.org/10.1145/1559755.1559756. → page 13[23] Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu, and W. T. Freeman.Learning the depths of moving people by watching frozen people. InProceedings of IEEE Conference on Computer Vision and PatternRecognition, June 2019. → page 11[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks forsemantic segmentation. In Proceedings of IEEE Conference on ComputerVision and Pattern Recognition, June 2015. → page 1092[25] C. Luo, X. Chu, and A. Yuille. Orinet: A fully convolutional network for 3dhuman pose estimation. In Proceedings of British Machine VisionConference, 2018. → page 11[26] K. E. MacLean. The ‘haptic camera’: A technique for characterizing andplaying back haptic properties of real environments. Proceedings of HapticInterfaces for Virtual Environments and Teleoperator Systems, pages459–467, 1996. → page 14[27] L. Markley. Attitude determination using vector observations and thesingular value decomposition. Journal of the Astronautical Sciences, 38:245–258, 11 1987. → page 51[28] E. Miguel, D. Miraut, and M. A. Otaduy. Modeling and estimation ofenergy-based hyperelastic objects. Computer Graphics Forum, 35(2):385–396, 2016. doi:10.1111/cgf.12840. URLhttps://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12840. → page 13[29] B. J. Nelson, J. D. Morrow, and P. K. Khosla. Improved force controlthrough visual servoing. In Proceedings of American Control Conference,volume 1, pages 380–386 vol.1, June 1995. doi:10.1109/ACC.1995.529274.→ page 12[30] L. Nicholas, K. Toren, J. Bingham, and J. Marquart. Simulation indermatologic surgery: A new paradigm in training. Dermatologic Surgery,39(1pt1):76–81, 2013. 
doi:10.1111/dsu.12032. URLhttps://onlinelibrary.wiley.com/doi/abs/10.1111/dsu.12032. → page 14[31] D. K. Pai. Robotics in reality-based modeling. In Robotics Research, pages353–358, London, 2000. Springer London. ISBN 978-1-4471-0765-1. →page 15[32] D. K. Pai and P. Rizun. The what: A wireless haptic texture sensor. InProceedings of Haptic Interfaces for Virtual Environments and TeleoperatorSystems, pages 3–9. IEEE, 2003. → page 14[33] D. K. Pai, K. van den Doel, D. L. James, J. Lang, J. E. Lloyd, J. L.Richmond, and S. H. Yau. Scanning physical interaction behavior of 3dobjects. In Proceedings of ACM SIGGRAPH, pages 87–96. ACM, 2001. →page 13[34] D. K. Pai, A. Rothwell, P. Wyder-Hodge, A. Wick, Y. Fan, E. Larionov,D. Harrison, D. R. Neog, and C. Shing. The human touch: Measuring93contact with real human soft tissues. ACM Transactions on Graphics, 37(4):58, 2018. → pages v, xi, 13, 15, 16, 17, 18[35] J. Peirs, J. Clijnen, D. Reynaerts, H. V. Brussel, P. Herijgers, B. Corteville,and S. Boone. A micro optical force sensor for force feedback duringminimally invasive robotic surgery. Sensors and Actuators A: Physical, 115(2):447 – 455, 2004. ISSN 0924-4247.doi:https://doi.org/10.1016/j.sna.2004.04.057. URLhttp://www.sciencedirect.com/science/article/pii/S0924424704003917. The17th European Conference on Solid-State Transducers. → page 11[36] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model ofdynamic human shape in motion. ACM Transactions on Graphics, 34(4):120:1–120:14, July 2015. ISSN 0730-0301. doi:10.1145/2766993. URLhttp://doi.acm.org/10.1145/2766993. → page 13[37] P. Puangmali, H. Liu, L. D. Seneviratne, P. Dasgupta, and K. Althoefer.Miniature 3-axis distal force sensor for minimally invasive surgicalpalpation. IEEE/ASME Transactions on Mechatronics, 17(4):646–656, Aug2012. ISSN 1083-4435. doi:10.1109/TMECH.2011.2116033. → page 11[38] A. Sekhar, M. R. Sun, and B. Siewert. A tissue phantom model for trainingresidents in ultrasound-guided liver biopsy. Academic radiology, 21(7):902–908, 2014. → page 14[39] W. Si, S.-H. Lee, E. Sifakis, and D. Terzopoulos. Realistic biomechanicalsimulation and control of human swimming. ACM Transactions onGraphics, 34(1):10:1–10:15, Dec. 2014. ISSN 0730-0301.doi:10.1145/2626346. URL http://doi.acm.org/10.1145/2626346. → page12[40] E. Sifakis and J. Barbicˇ. Finite element method simulation of 3d deformablesolids. Synthesis Lectures on Visual Computing: Computer Graphics,Animation, Computational Photography, and Imaging, 1(1):1–69, 2015. →page 12[41] S. Sueda, A. Kaufman, and D. K. Pai. Musculotendon simulation for handanimation. ACM Transactions on Graphics, 27(3):83:1–83:8, Aug. 2008.ISSN 0730-0301. doi:10.1145/1360612.1360682. URLhttp://doi.acm.org/10.1145/1360612.1360682. → page 1294[42] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonableeffectiveness of data in deep learning era. CoRR, abs/1707.02968, 2017.URL http://arxiv.org/abs/1707.02968. → page 31[43] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for objectdetection. In Advances in Neural Information Processing Systems 26, pages2553–2561. Curran Associates, Inc., 2013. URLhttp://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf. → page 11[44] J. Teran, E. Sifakis, S. S. Blemker, V. Ng-Thow-Hing, C. Lau, andR. Fedkiw. Creating and simulating skeletal muscle from the visible humandata set. IEEE Transactions on Visualization and Computer Graphics, 11(3):317–328, May 2005. doi:10.1109/TVCG.2005.42. → page 13[45] A. Toshev and C. 
Szegedy. Deeppose: Human pose estimation via deepneural networks. In Proceedings of IEEE Conference on Computer Visionand Pattern Recognition, pages 1653–1660. IEEE, 2014. → page 10[46] B. Wang, L. Wu, K. Yin, U. Ascher, L. Liu, and H. Huang. Deformationcapture and modeling of soft objects. ACM Transactions on Graphics, 34(4):94:1–94:12, July 2015. ISSN 0730-0301. doi:10.1145/2766911. URLhttp://doi.acm.org/10.1145/2766911. → page 13[47] H. Wang, G. De Boer, J. Kow, A. Alazmani, M. Ghajari, R. Hewson, andP. Culmer. Design methodology for magnetic field-based soft tri-axis tactilesensors. Sensors, 16(9), 2016. ISSN 1424-8220. doi:10.3390/s16091356.URL https://www.mdpi.com/1424-8220/16/9/1356. → page 11[48] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional posemachines. In Proceedings of IEEE Conference on Computer Vision andPattern Recognition, June 2016. → page 10[49] Y. Zhou, B. J. Nelson, and B. Vikramaditya. Fusing force and visionfeedback for micromanipulation. In Proceedings of IEEE InternationalConference on Robotics and Automation, volume 2, pages 1220–1225 vol.2,May 1998. doi:10.1109/ROBOT.1998.677265. → page 1195
