A UNIFIED RECOGNITION AND STEREO VISION SYSTEM FOR SIZE ASSESSMENT

By Andrew Naiberg
B.Sc. (Physics), Queen's University

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (COMPUTER SCIENCE)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
July 1994
© Andrew Naiberg, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

ABSTRACT

This paper presents a unified recognition and stereo vision system which locates objects and determines their distances and sizes from stereo image pairs. Unlike other such systems, stereo information is not the input to the recognition stage. Instead, recognition is performed first and its output forms the input to stereo processing. This permits successful analysis of images captured in poor conditions (murky water, specularities, poor camera alignment) and reduces time requirements.

Model-based recognition is accomplished in two stages. The first stage seeks feature matches by comparing the absolute orientation, relative orientation and relative length of each image segment to those of the model segments in order to find chains of adjacent segments in the image which match those of the models. The absolute orientation constraint can be suppressed to permit identification of randomly-oriented objects.

The second stage verifies candidate matches by comparing the relative locations of matched image features to the relative locations of the corresponding model features.
The models themselves are generated semi-automatically from images of the desired objects. In addition to providing distance estimates, feature-based stereo information is used to disambiguate any multiple or questionable matches.

Although the system and issues presented herein are quite general, the discussion and testing are primarily related to the motivating task of non-invasively assessing the size of sea-cage salmon.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Figures
Acknowledgements
1 Introduction
1.1 The NIFE System: Non-Invasive Fish Enumeration
1.2 The Scope of this Thesis
1.3 Thesis Overview
2 Previous Work
2.1 Previous Systems Related to Aquaculture
2.2 Edge Detection
2.3 Recognition
2.4 Stereo
2.4.1 Correlation
2.4.2 Feature-Based Stereo
3 The Manual Stereo System
3.1 Acquiring Stereo Images
3.2 Camera Correction and Calibration
3.3 Size Assessment
3.4 Results
4 The Automatic System: Design and Implementation Details
4.1 Design Decisions
4.1.1 Model-Based Recognition
4.1.2 Stereo
4.2 Implementation Details
4.2.1 Pre-Processing
4.2.2 Model Creation and Use
4.2.3 Matching
4.2.4 Verification
4.2.5 Stereo
5 Results
5.1 Example #1
5.2 Example #2
5.3 Example #3
5.4 Example #4
5.5 General Observations
5.6 Complexity
6 Conclusions & Future Work
6.1 Conclusions
6.2 Future Work
Appendices
A Binocular Stereo Geometry
B Imaging Geometry
Bibliography

LIST OF FIGURES

1.1 The stages of processing
2.1 The need for super-segments
2.2 A super-segment for Structural Indexing
3.1 Apparatus for gathering images
3.2 Correction of lens distortion
4.1 Modified super-segment
4.2 Polygonal approximations
4.3 Schematic of procedure
4.4 Segmentation of a curve
4.5 Sample segment information
4.6 Handling discontinuous contours
4.7 Models with shifted centrepoints
5.1 Images for first example
5.2 Images for second example
5.3 Images for third example
5.4 False matches due to similar shapes
5.5 Images for fourth example
6.1 Missing segments during matching
A.1 Stereo imaging geometry
B.1 Imaging geometry

I was a less than serious student in college. If I had it to do over again, I would be far more serious. I did play a lot of golf. But I don't think that's any reflection on my ability to lead this nation.
- U.S. Vice President Dan Quayle

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. Jim Little, for his continued support, guidance and easy-going demeanour despite my frequent deviation from the straight and narrow. I also express my gratitude to the second reader, Dr. David Lowe, who first stimulated my interest in computer vision and provided sound advice throughout.

A chorus of thank-yous to all who made it fun: Bill Gates, Michele & Sara McIntosh, Larry Palazzi, Lynette Alderson, Andrew Beatty, the Subaru & the mother of all tapes, Andrew Robinson, Hubert Lai, Rob Burton and the j4, Roz Shafran, the Jericho Laser fleet, Dan Kovacs, Jen Butler, Homer Simpson, Colin Savage, Trev Neufeld, Ted Charlton, Whistler/Blackcomb, the Blackcomb Ski School desk (especially John Riedl and Susie Price), rejects & rejectors too numerous to mention, Mom, Dad & the family, Gull and all those who helped along the way.

Finally, I would like to thank the Department of Computer Science and the British Columbia Science Council, whose support allowed me to maintain my opulent lifestyle of the oppressed and downtrodden.

CHAPTER 1
INTRODUCTION

Last year I went fishing with Salvador Dali. He was using a dotted line.
He caught every other fish.
- Steven Wright

Vision provides humans (and most animals) with a staggering amount of information about their surroundings. A passing glance at a scene can provide an observer with a detailed account of the objects in the scene, their locations, their states of motion with respect to the observer and the spatial relationships between them. Furthermore, all this can be accomplished even in the presence of unfamiliar objects, random orientations and configurations, unfavourable lighting, varying degrees of overlap and similar hindrances.

Biological vision systems derive this information from many sources. Objects may be recognized based on their shape, colour or context, while distances may be calculated from stereoscopic information or inferred using such clues as occlusion, relative sizes and familiarity. Still more information is available from lighting and shadows, motion and the scale of visible detail. Despite the amount of information available and the number of sources, biological vision systems interpret scenes rapidly and effortlessly.

The development of successful computer-based vision systems, in contrast, has been neither rapid nor effortless. In fact, continued research has underscored the complexity of the world and the attendant intricacies of trying to interpret general visual information. In an unconstrained environment, many factors such as uneven lighting, coincidental alignments and marked changes in the appearances of objects as they are rotated can give rise to confusing or misleading information; the correct interpretation often requires the assimilation of several sources of information at various scales. In short, the visual world is often an inhospitable place.

Nonetheless, guided by both continued study of biological vision systems and consistent thought, researchers in computer science have devised many techniques which accomplish important steps in the vision process. Also, although no existing computer-based vision system even approaches the generality, speed and robustness of human vision, we do have useful computer systems that tackle various aspects of the vision process and/or derive specific information from images under restricted circumstances.

This thesis describes the development of a non-invasive size assessment system using model-based recognition and stereoscopic vision. The term "non-invasive" means that no specialized handling or presentation of the objects is required for size assessment; the objects are simply videotaped as they would typically be found and size estimates are obtained from the information on the resulting videotape alone.[1]

[1] This, however, does not preclude either strategic placement of the cameras to capture the best possible images or the inclusion of a custom background to enhance image contrast or clarity.

Although the system was designed to be as general as possible and is hence suitable for generic size-assessment applications, it was written as part of a project to assess the size of net-pen salmon non-invasively. For this reason, the system is described and tested in the context of this application.

1.1 THE NIFE SYSTEM: NON-INVASIVE FISH ENUMERATION

Fish size is an important factor affecting the economy of any aquaculture enterprise.
Size information can be used to monitor the growth of the fish, determine the amounts of feed and medication required and would aid in deciding when to grade, sort or harvest the fish. However, despite its economic importance, fish size in tanks and sea cages is measured infrequently because most available methods of measurement are time-consuming, labour-intensive, insufficiently accurate and harmful to the fish.

Most current measurement techniques involve measuring a sample of approximately 200 fish from the population, either by capturing the fish and measuring them above water or by videotaping the fish as they are shuttled through a narrow portal and taking measurements from the videotape. Many farmers have reported error on the order of 15-25% using these techniques ([Klo93, PMO86]) where they would like size information within 5% of actual values ([Klo93]). The large error has been attributed to several factors: the sample may be too small to provide an accurate representation of the entire population, the sample may be skewed because some fish are more easily captured than others and the measurements may reflect inaccuracy resulting from human error.

In addition, these methods cause the fish considerable stress and damage ([Klo93, PMO86]). In a population of salmon, stress results in lower disease tolerance and an increase in mortalities ([SS78, Sch82]). In addition to stress, methods requiring capture or that the fish be shuttled through a narrow portal cause bruising and scale loss as a result of collisions and handling, thereby reducing the value of the fish.

Finally, as a result of these problems and the required labour, size assessments are performed infrequently. Whereas farmers would like to monitor growth on a weekly basis, practicality dictates that assessments be performed when the fish are being handled for other reasons, such as when harvesting, changing nets, grading or sorting.

In response to these problems, the Department of Bio-Resource Engineering at the University of British Columbia set out to develop a system capable of providing reliable size distribution information without harming the fish or, indeed, encroaching upon them in any way. Proposed by Trev Neufeld in [Neu83], the goal of the system was to provide Non-Invasive Fish Enumeration, thus it became known as the NIFE system.

The proposed system was to accomplish its goal by videotaping the fish as they swim freely in their enclosure and calculating the size of individual fish which appear in the images. The size of each fish assessed would then be added to a database whence a statistical size distribution would be derived after enough data had been collected. Because all necessary information would be obtained from the videotape, the fish would not have to be manipulated in any way. Furthermore, if the system were fully automated, fish farmers could update their information inexpensively and frequently without risking harm to the fish.

Conceptually, the automated system operates by locating in stereo image pairs the salient features which delimit the desired dimensions of the fish.
To determine the length of a fish, these features are the head and tail; the breadth of a fish may be delimited by its dorsal and ventral fins. Once the apparent dimensions of the fish have been obtained from these features, the distance to the fish is calculated from the stereoscopic information. Given the apparent dimensions of an object and the distance to that object, its actual dimensions follow straightforwardly. It is expected that a relationship can be found between the exterior dimensions of a fish and its biomass so that the system can provide a biomass distribution rather than a length distribution. The stages of processing are shown in Figure 1.1.

Figure 1.1: The stages of image processing: (a) The edge detector has traced the outline of one fish in each image and some noise. (b) All contours returned by the edge detector have been approximated using linear segments. (c) The recognition system has identified potential heads and tails in both images and omitted all other contours. (d) The verifier has eliminated the falsely-matched features and the stereo procedure has calculated the disparity d of both the head and tail and two estimates of the apparent length l, one from each image.

The system was developed in stages. The first stage was to design and assemble the necessary video equipment and examine precisely what information can be derived from still images of fish. The results of this first stage indicated that the system would require stereo video in order to deliver the desired information. Thus the second stage was to develop a manually-operated prototype stereo system in order to illustrate the technique and test the results.

The ultimate goal of the NIFE project is a fully automated system which would capture stereo images from videotape, locate as many fish as possible in the images, determine their lengths and biomass and enter the information in a database for statistical analysis, all without external assistance. This would allow a fish farmer to videotape the fish during the day, initiate the computerized evaluation process in the evening and view the data the next morning. The system would stop analyzing image pairs automatically once a predetermined number of fish or images had been analyzed, as determined by statistical requirements.

1.2 THE SCOPE OF THIS THESIS

As was mentioned earlier, vision systems typically have great difficulty operating in unconstrained environments. The specific requirements of the NIFE system do little to simplify the situation. The first complication is the marine environment: images captured underwater frequently exhibit either low contrast in murky conditions or pronounced back-scattering and uneven illumination in brighter conditions. In addition, there is always the possibility of a physical contaminant such as silt or algae further obscuring the images.

Fish are themselves not an easy object to recognize despite having distinct features. First off, fish generally school in dense groups, thus images of fish are often complicated by overlap, hidden features and incomplete outlines (see Figure 5.5a). Secondly, the surface of fish can give rise to specular reflections that complicate the edge-detection process.

However, even if complete outlines were guaranteed, there are still the problems of size and shape. Because fish can be found at any stage of growth, size is an unknown which complicates the determination of distance or scale rather than a constant from which distance or scale can be calculated.
Finally, recognition is hindered by the fact that not all fish look alike: even within a single species, biological variation ensures that different fish often have slightly different shapes.

As a result of these complications, this thesis does not attempt to solve the general problem of assessing the size of random three-dimensional objects in arbitrary orientations. Instead, this thesis takes the more successful approach outlined earlier, that of extracting specific information under restricted circumstances.

The first of these restrictions specifies a requirement of the images themselves. It is assumed that the images being analyzed are of sufficient quality that a good edge detector can extract highly-connected contours; the recognition system relies upon extended features derived from these contours. It is also implicitly assumed that images are easily collected, thereby allowing problematic images to be discarded without fear of being left with too few.

The remaining assumptions specify the nature of the objects being analyzed. It is assumed that the objects being measured are fairly rigid and are being imaged from restricted viewpoints, generally such that the desired measurements are roughly parallel to the image plane, making the object effectively two-dimensional. Although the inclusion of more models could accommodate deformable objects and several viewpoints, this would become very computationally expensive. This restriction is intended to specify that the system does not deform its models in an effort to achieve a match (i.e., the system models are assumed rigid) nor does it solve for arbitrary perspective projection.[2] Finally, the features being recognized are assumed to be parts of larger objects as the system is designed to calculate dimensions as the distances between pairs of features found on a single larger object.

[2] Although fish deform as they swim, they are videotaped in side profile, thus the deformations are in a direction parallel to the optical axis and appear as relatively small changes in length.

The system returns the apparent dimensions and disparity of individual objects and features. With correct calibration, this information can provide actual dimensions.[3] The conversion from linear dimensions to biomass and the calculation of size distributions from biomass estimates are left to further work in biology and statistics.

[3] The manual system described in Chapter 3 was fully calibrated and the calibration process is explained in Section 3.2.

1.3 THESIS OVERVIEW

This thesis is organized in six chapters. Chapter 2 reviews some of the related work in both computer vision and size assessment of fish. Chapter 3 introduces the prototype manual system and explains how images are obtained and prepared for both the manual and automatic systems. Chapter 4 explains the design and operation of the automatic system. Chapter 5 presents the results of performing size assessment on several stereo image pairs. Chapter 6 concludes the thesis by evaluating the success of the system and suggesting some possible directions for future work.

CHAPTER 2
PREVIOUS WORK

The simplest schoolboy is now familiar with facts for which Archimedes would have sacrificed his life.
- Ernest Renan

One unfortunate result of the degree of specialization of current research is that it leads researchers to focus only on their particular problem, assuming that the related problems either have been or will be solved for them.
Thus many feature detection algorithms rely upon receiving clear, connected segments and, in turn, stereo algorithms rely on good feature detection and localization.

An interesting aspect of the NIFE system is that, being primarily an engineering rather than an academic project, no part of the system could be taken for granted and, furthermore, no restrictions were placed on how the system achieves its results. For these reasons, almost every step in the vision process required decisions.

This chapter describes some of the previous work related to the current system in order to outline the evolution of ideas and explain the rationale behind some of these decisions.

2.1 PREVIOUS SYSTEMS RELATED TO AQUACULTURE

With hopes of keeping the system simple and inexpensive, preliminary investigation sought a dimensionless number, some ratio of measurements on the fish that varies appreciably with length or biomass. If found, such a value could provide the length or biomass of the fish from a single image without knowing the distance of the fish from the camera. After approximately a year of study indicated that no such ratio exists, this programme was abandoned and it was acknowledged that the system would require some means of determining the distance to features in the field of view. Without this ability, it would be impossible to differentiate between a large fish located far from the camera and a small fish close to the camera. Stereo vision was the obvious choice because it is entirely non-invasive, requiring no handling of the fish or special lighting.

While stereo is a well-known technique for determining distances and has been studied by computer scientists, cartographers and others, some more closely-related work has been done to determine both the three-dimensional positions of individual fish and the three-dimensional structure of fish schools.

In their research into the structure of fish schools, Cullen et al [CSB65] state that prior work in that area was based on two-dimensional photographs of fish schools taken from above and did not take into account the vertical distance between fish. In an effort to correct this oversight, their system used a single still camera fitted with a stereo-prism attachment to produce stereo images on a single negative. The camera was mounted above the tank and white styrene sheets were placed on the bottom of the tank. Disparity was measured directly from enlarged images using a ruler. The same group devised a second method to determine the distance to the fish using the parallax between a fish and its shadow on the white bottom-sheet given the projection of a 10cm rod placed vertically on the bottom-sheet. In both systems, the position of a fish in the plane parallel to the image plane was determined from its position in the images relative to a grid marked on the bottom-sheet.

Van Sciver [VS72] describes a similar method for determining the size of objects underwater. His system uses a single underwater camera and requires that the objects remain stationary while the camera is physically moved sideways between exposures to obtain a stereo pair. Disparity was once again measured from enlargements of equal magnification.

Van Long et al [VLAI85] describes a more advanced system for determining the location and size of marine organisms in situ using underwater stereo cameras and a stroboscope to illuminate the cameras' field of view.
In order to ensure simultaneous exposure of the stereo photographs, thereby permitting motion of the organisms, the cameras had no shutters and films were simply advanced after each flash of the strobe. Disparity was again measured manually from enlargements of equal magnification. In a test of the system, Long et al took many stereo photographs of both a test pattern and a scale bar in a 10m diameter aquarium. Using the photographs of the test pattern along with actual measurements of the three-dimensional locations of points, they calculated correction factors to convert from the calculated locations of points to actual locations in all three dimensions. They quote mean absolute errors of 0.77% in estimating distance to the scale bar and 0.75% in estimating its length. Van Long and Agayama [VLA85] describes what appears to be the same system with the addition of a computer and graphics tablet to aid in calculating distances, lengths and bearing angles.

Pitcher and Lawrence [PL84] describe a stereo system which implements video equipment very much like that used by the NIFE project. Pitcher explains that although video equipment offers low running costs, the ability to re-record and special effects such as slow-motion and time-lapse, its major drawback is its relatively low resolution compared to photographic film. This drawback was amplified in stereo video, which at the time was accomplished using a video mixer that displayed both images simultaneously in "split-screen" format, thereby halving the already inadequate resolution. To overcome this problem, Pitcher proposed a rapid switch circuit to route still images alternately from two video cameras to a single video-cassette recorder (VCR). The apparatus was calibrated with a grid in the field of view and was used to measure the three-dimensional positions of individual fish in schools.

Naiberg et al [NPSN93] describes a computerized stereo system for estimating the distance, size and biomass of individual fish. With this system, the operator uses a mouse to mark the head and tail of a single fish in both images of a stereo pair. The system then uses these points to calculate the distance to and length of that fish. This manually-operated system was used as a prototype to demonstrate the stereo concept and was generally well-received by local salmon farmers, motivating the automated system described herein. The prototype system is discussed in Chapter 3.

2.2 EDGE DETECTION

Because it is intended that the NIFE system recognize and locate fish automatically in images, reliable detection of edges in the images is clearly crucial to its success.

The central assumption of edge detection is that brightness discontinuities in an image correspond to relevant scene features such as illumination or reflectance boundaries or, ideally, the edges of objects. It is the job of an edge detector to accurately locate these brightness discontinuities by performing numerical differentiation on the intensity values in an image. Unfortunately, no method has been found that accurately locates all edges with no erroneous responses under all circumstances, thus different edge detectors adhere to different criteria and offer different trade-offs between competing goals. Colin Savage, one of the original members of the NIFE project, has tested various edge detectors extensively for his Masters thesis.
Savage concluded that the Canny edge detector [Can86] performs better than the alternatives on images of fish in sea cages, which tend to exhibit low contrast and high noise levels.

The two primary design criteria for the Canny edge detector are that it offer good detection, meaning it detects real edges well and rarely marks edges falsely, and good localization, meaning the points it marks as edges are very close to the actual edges. A third criterion was that a single edge give rise to a single response. Canny points out that the probability of failing to mark real edges and the probability of marking edges falsely both decrease with increasing signal-to-noise ratio, thus good detection corresponds to maximizing signal-to-noise ratio. Also, because the Canny edge detector is a gradient operator, meaning it responds with a peak to mark the location of an edge, good localization corresponds to narrowing its response peak.

In working through the mathematical formulation of the above criteria as applied to step edges, Canny found that there is a theoretical trade-off between the primary goals of detection and localization: those steps which increase signal-to-noise ratio to improve detection broaden the response peak and those which narrow the response peak to improve localization decrease signal-to-noise ratio. With this trade-off, it is possible to derive a single optimal operator.

Because the Canny edge detector responds with a peak to mark edges, its output must be thresholded to discriminate edges from noise. This step often introduces excessive streaking, the breaking up of edge contours caused by fluctuation of the detector output above and below a single threshold. To minimize streaking, the Canny detector implements hysteresis with two thresholds: once a pixel with intensity above a pre-set high threshold is found and marked, all connected pixels are marked provided their intensity is above a pre-set lower threshold.
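To make the hysteresis step concrete, the following C sketch marks edge pixels in a gradient-magnitude image using two thresholds. It is a minimal illustration under stated assumptions, not the thesis implementation: the row-major array layout, the recursive 8-connected fill and the parameter names are all assumptions.

/* Hysteresis thresholding sketch: pixels with gradient magnitude above
 * t_high seed edges; any pixel connected to a seed with magnitude above
 * t_low is also marked. mag[] and edge[] are w*h row-major arrays. */
static void follow(const float *mag, unsigned char *edge,
                   int w, int h, int x, int y, float t_low)
{
    if (x < 0 || x >= w || y < 0 || y >= h) return;
    if (edge[y * w + x] || mag[y * w + x] < t_low) return;
    edge[y * w + x] = 1;            /* mark, then recurse to the 8 neighbours */
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            if (dx || dy)
                follow(mag, edge, w, h, x + dx, y + dy, t_low);
}

void hysteresis(const float *mag, unsigned char *edge,
                int w, int h, float t_low, float t_high)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (!edge[y * w + x] && mag[y * w + x] >= t_high)
                follow(mag, edge, w, h, x, y, t_low);
}

Using two thresholds in this way means weak but connected edge pixels survive while isolated weak responses, which are likely noise, are discarded.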
2.3 RECOGNITION

Simply stated, object recognition consists of determining what objects are present in an image and where they are located. Specifically, a model-based object recognition system,[1] given a collection of object descriptions ("models") and an image containing one or more of these objects, attempts to match the object descriptions to objects in the image and determine precisely where the objects are located in the image. These tasks are often referred to as "labeling" and "pose estimation" respectively.

[1] As opposed to, for example, a neural network or statistical pattern classifier.

There have been many approaches to object recognition over the years and it remains a vibrant field in computer vision because there is no consensus on how to represent objects or search for them in the presence of noise, occlusion, translation, rotation and changes of scale. Clearly there exists at least one combination of object representation and matching strategy which yields excellent performance, for humans are able to recognize almost instantly a vast array of objects from almost any viewpoint even when faced with occlusion, clutter, poor lighting and any of a number of other hindrances. Furthermore, humans are able to correctly classify objects they have never seen before, such as the first time one sees a mountain bike or hand-held video-cassette recorder, and objects that appear in different forms and sizes, such as cars, chairs and dogs.

Computer vision often derives its shape representations from the boundary contours of objects in images. These contours can generally be detected under a wide range of imaging conditions without prior processing or knowledge of the image content. A number of strategies have been devised to describe contours and to compare image contours to stored descriptions of model contours.

One of the earliest attempts at shape representation is chain coding, described in [FF65]. Essentially, chain coding is performed by numbering one through eight the eight pixels surrounding the central pixel in a three-by-three pixel box. Any arbitrary line is then described by beginning at any pixel and recording the numerical code of the next pixel comprising the line, then the next and so on. This yields a chain of numbers representing the path followed by the line.

Chain-code correlation is the associated method of shape recognition. Line segments in an image are coded as described above and correlated against stored "model" chains such that a perfect match link for link results in a score of one and any differences between the links in the image chain and those in the model chain reduce this score. A match between an image chain and a model chain is declared whenever this score is above a threshold. Davis [DR76] explains that chain-code correlation is not rotation-invariant and is very sensitive to both small changes in shape and global changes in scale, concluding that, "it cannot be considered a generally useful tool for shape matching" ([DR76], p.61).

Chain-code correlation is one example of a number of correlation-based algorithms which perform template matching. Most of these algorithms suffer from the same drawbacks and are therefore useful only when dealing with non-occluded, non-rotated objects of fixed scale ([Ste92], p.11).
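As a concrete illustration of the encoding itself, the C sketch below converts a connected pixel contour into a Freeman-style chain of direction codes. The particular direction numbering (0 for a step east, increasing counter-clockwise, with image y increasing downward) is one common convention assumed here; [FF65] may number the neighbours differently.

typedef struct { int x, y; } Point;

/* Map a unit step (dx, dy) between adjacent pixels to a code 0..7;
 * returns -1 for a non-adjacent or zero step. */
static int step_code(int dx, int dy)
{
    static const int codes[3][3] = {    /* indexed by [dy + 1][dx + 1] */
        { 3,  2, 1 },
        { 4, -1, 0 },
        { 5,  6, 7 }
    };
    if (dx < -1 || dx > 1 || dy < -1 || dy > 1) return -1;
    return codes[dy + 1][dx + 1];
}

/* Encode a contour of n pixels as up to n-1 chain codes; returns the
 * number of codes written to chain[]. */
int chain_encode(const Point *contour, int n, int *chain)
{
    int m = 0;
    for (int i = 1; i < n; i++) {
        int c = step_code(contour[i].x - contour[i - 1].x,
                          contour[i].y - contour[i - 1].y);
        if (c >= 0)
            chain[m++] = c;
    }
    return m;
}

Comparing two such chains link for link, as chain-code correlation does, makes the representation's weaknesses apparent: rotating the object permutes every code, and rescaling changes the number of links.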
Since this early attempt, much work has been devoted to developing more general and robust recognition systems. Davis [DR76] describes a large class of shape recognition systems which use sets of numerical values as shape descriptions and match these sets of values using techniques from statistical pattern recognition. The values used include various moments, Fourier shape descriptors and measures of compactness and elongatedness. However, Davis explains, these systems cannot be used to match a part of one shape to part of a larger shape because the description of any part of a shape is generally not related to the description of the entire shape in any simple way. This is a serious drawback from both practical and theoretical points of view. In practice, we would like a system to recognize partially occluded or incomplete objects for enhanced robustness, particularly because it is likely that the system will receive incomplete edges. From a theoretical point of view, we would like artificial intelligence to explain human capabilities, thus any system which cannot recognize incomplete objects is unsatisfying for it cannot adequately explain human visual processes. For these reasons, most recent research in object recognition is concerned with the general case of identifying objects from arbitrary viewpoints allowing for rotation, scale changes, partial occlusion, moderate distortion and noisy images.

Davis [DR76] presents a two-stage shape matching procedure in which shapes are represented as sequences of angles and their positions. This representation has the advantage that it is invariant to both rotation and a wide range of scale changes. Furthermore, it facilitates detection of partial matches as a series of matches between angles can begin at any point in the sequence. A model is the polygonal (i.e., piecewise-linear) representation of a curve M = M1, M2, ..., Mn, where each Mi = <xi, yi, magi>, with xi and yi providing the location of the angle and magi the magnitude of the angle. The polygonal representation of a curve in the image is similarly represented as I = I1, I2, ..., Im where m < n. A match of I against M is comprised of (1) a mapping associating the elements of I with the appropriate elements of M and (2) a coordinate transformation from model coordinates to image coordinates to provide spatial registration.

The first stage of the process searches for potential matches. The matching function regards the models as being spring-loaded templates, thus any model can be made to match any image curve if one is willing to expend the energy required to deform the springs as required. A perfect match occurs when the model curve lies perfectly over the image curve with no energy required to deform the springs. The match evaluation function sums the spring deformation energy required to obtain a match and some function of the difference between the relative distance between image segments and the relative distance between the corresponding model segments, adding a further penalty for missing angles, to assess the "cost" of any given association. The task of matching image curves to model curves is now the dynamic programming problem of finding the association functions F which minimize cost. To avoid the computational requirements of dynamic programming, Davis uses a parallel, iterative procedure with local evaluation functions.

Matched segments are entered in an "association graph" in which each association of an image curve with a model curve corresponds to a node in the graph with a weight equal to the cost of that association. Two nodes are considered connected if the two associations they represent can reasonably occur simultaneously. This decision is made by determining whether the spring deformation energy required to connect the two nodes is below a threshold.

The second stage of the process attempts to eliminate incorrect matches from the association graph by means of relaxation techniques. Relaxation is an iterated parallel operation which discards a node from the association graph if it is not sufficiently consistent with enough of its neighbour nodes.

A relaxation procedure similar to Davis's is explained concisely by Rosenfeld in [RHZ76]. Rosenfeld's algorithm begins with the set of all possible labels for all objects and, at each iteration, discards from each feature all those labelings which would result in a related feature having no consistent label.[2] In this way it ensures that each node is labeled so as to be consistent with all of its neighbour nodes. These iterations proceed until no further labels can be discarded or, ideally, each feature has exactly one consistent label, yielding a consistent match. Davis's algorithm was able to discard all but the correct match in 47 of 50 tests matching portions of digitized coastlines.

[2] Two features are related if they mutually constrain each other's labeling.
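The discrete relaxation step that Rosenfeld describes can be sketched in a few lines of C. This is an illustrative skeleton only: the table layout and the related() and consistent() predicates, which would encode the problem-specific constraints, are assumptions.

#define MAX_LABELS 64

/* labels[i][l] is nonzero while label l remains possible for feature i.
 * A label is discarded when some related feature j has no surviving
 * label consistent with it; iterate until nothing changes. */
void relax(unsigned char labels[][MAX_LABELS], int n_feat, int n_lab,
           int related(int i, int j),
           int consistent(int i, int l, int j, int k))
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n_feat; i++)
            for (int l = 0; l < n_lab; l++) {
                if (!labels[i][l]) continue;
                for (int j = 0; j < n_feat; j++) {
                    if (j == i || !related(i, j)) continue;
                    int ok = 0;   /* does j keep any label consistent with (i, l)? */
                    for (int k = 0; k < n_lab; k++)
                        if (labels[j][k] && consistent(i, l, j, k)) { ok = 1; break; }
                    if (!ok) { labels[i][l] = 0; changed = 1; break; }
                }
            }
    }
}

Each pass can only remove labels, so the loop is guaranteed to terminate; ideally it leaves exactly one label per feature.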
Groupingadjacent segments provides more information and alleviates the potential confusion.While the recognition system used in NIFE, like Davis’s, operates in two stages and uses thelocations and magnitudes of angles in its representation of shapes, it is more like two-dimensionalStructural Indexing presented in [SM92, Ste921. For recognition of two-dimensional objects,the outlines of the objects provide a simple representation. Structural Indexing is based on theobservation that the system must operate with only parts of outlines if it is to successfully handleocclusion and noise, yet individual segments are too local to be used as matching primitives2Two features are related if they mutually constrain each other’s labeling.23Chapter 2: Previous Work 16because they provide very little information (see Figure 2.1). Thus the basic features in thismethod, called super-segments, are formed by grouping a fixed number of adjacent segments.Super-segments are described using the following terms (see Figure 2.2):“3_ “Figure 2.2: A super-segment of cardinality four which describes the tail of a fish. The location ismarked L and the orientation vector is marked tT. The effipse to the right shows the eccentricityof the super-segment.Cardinality The cardinality of a super-segment is the number of segments of which it iscomprised.Arclength The arclength of a super-segment is the sum of the lengths of the individual segments. Arclength is not scale-invariant.Angle Angles are measured between successive segments. Thus a super-segment with n segments has n— 1 angles.Location For super-segments of even cardinality, the location of the super-segment is definedto be that of the middle vertex; for super-segments of odd cardinality, the location is theSi‘a1,/I-,LS3S4Chapter 2: Previous Work 17midpoint of the middle segment.Orientation The orientation of a super-segment is the direction of the vector between thepredecessor and successor of its middle vertex.Eccentricity The eccentricity of a super-segment is computed from its second moment ofinertia. It is essentially a measure of the aspect ratio of the super-segment and is usedbecause, without length information, two super-segments can have very different shapesdespite having identical angles.Because the polygonal approximation of a contour depends on its size in the image and theresolution of the segmentation, Structural Indexing uses polygonal approximations of variousresolutions simultaneously.To generate models for Structural Indexing, the Canny edge detector is first applied tothe scene and the resulting contours are segmented to compute polygonal approximations.Super-segments are formed by grouping a fixed number of connected linear segments. Thesesuper-segments are stored as entries in a hash table. Each super segment is assigned a codebased on the angles between its composite segments and its eccentricity. This code serves asthe key for the hash table, providing fast access.In the hypothesis-generation stage of recognition, the image is processed as described aboveand the angles and eccentricities of the resulting super-segments are compared to the angles andeccentricities of the super-segments in the candidate models. Hypothesized matches betweenthe image and the models are retrieved from the hash table using the encoded keys.Once all possible matches between image super-segments and model super-segments havebeen computed, they are passed to a verification procedure. 
Once all possible matches between image super-segments and model super-segments have been computed, they are passed to a verification procedure. This procedure begins by dividing the hypotheses by model and storing them in a table with the models as keys and hypotheses as entries. These hypotheses are then grouped into consistent clusters.

A match is declared if three consistent hypotheses are found to support it. Consistency is determined using the constraints suggested by Grimson and Lozano-Perez in [GLP84] (a code sketch of the first two constraints follows at the end of this section):

Distance The distance between two super-segments in the image must be close to the distance between the corresponding super-segments in the model. Differences of scale are accommodated by factoring in the ratio of arclengths between the super-segment in the image and that in the model.

Angle The angle between the orientation vectors of two image super-segments must be close to the angle between orientation vectors of the corresponding model super-segments.

Direction The range of components of a vector spanning the two image super-segments in the direction of their normal vectors must be close to the corresponding range for the model super-segments.

The final step is to apply two iterations of least-squares fitting to compute the transformation from model coordinates to scene coordinates. The second iteration is generally necessary as the first one provides only a rough estimate of the transformation, not a perfect match.

Stein states that Structural Indexing performed well on both real and synthetic data exhibiting translation, rotation, scaling, occlusion and noise. It does, however, require dense image data and fails to detect objects if the edges exhibit severe discontinuity. Its efficiency is given to be O(n) for model building and O(n²m) for recognition, where n is the number of features in the scene and m is the number of candidate models.
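The distance and angle constraints lend themselves to a compact implementation. The C sketch below tests whether two match hypotheses are mutually consistent, estimating scale from arclength ratios as described above; the tolerances, field names and the simple averaging of arclengths are assumptions, and the direction constraint is omitted.

#include <math.h>

#define PI 3.14159265358979

typedef struct { double x, y, ox, oy, arclength; } Feat;   /* super-segment summary */
typedef struct { Feat img, mdl; } Hypothesis;              /* image-model pairing */

int consistent_pair(const Hypothesis *a, const Hypothesis *b,
                    double dist_tol, double angle_tol)
{
    /* Distance constraint, with scale factored in via the arclength ratio. */
    double scale = (a->img.arclength + b->img.arclength) /
                   (a->mdl.arclength + b->mdl.arclength);
    double d_img = hypot(a->img.x - b->img.x, a->img.y - b->img.y);
    double d_mdl = hypot(a->mdl.x - b->mdl.x, a->mdl.y - b->mdl.y) * scale;
    if (fabs(d_img - d_mdl) > dist_tol)
        return 0;

    /* Angle constraint: the angle between the two orientation vectors
     * must agree between image and model. */
    double a_img = atan2(a->img.oy, a->img.ox) - atan2(b->img.oy, b->img.ox);
    double a_mdl = atan2(a->mdl.oy, a->mdl.ox) - atan2(b->mdl.oy, b->mdl.ox);
    double diff = a_img - a_mdl;
    while (diff >  PI) diff -= 2 * PI;   /* normalize to (-pi, pi] */
    while (diff < -PI) diff += 2 * PI;
    return fabs(diff) <= angle_tol;
}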
2.4 STEREO

The use of stereopsis to determine distances is based on the principle that, given two images of a scene taken simultaneously from slightly different vantage points, a given point in the scene will be imaged to a different point in each of the two images unless steps are taken to compensate. The human stereopsis system compensates by adjusting the vergence angle of the eyes to make certain image points coincide, then uses this vergence angle to determine the distance from the viewer to those points. The farther the point is from the observer, the less crossed the eyes must be; if the point were infinitely distant, both eyes would be looking straight ahead. This system works well because the brain can correctly adjust the vergence angle of the eyes very rapidly and because at any given time, a human can focus on only a fairly restricted region in the field of view, thus the vergence angle can be adjusted to obtain stereo information only for that region.

Instead of adjusting the vergence angle of the cameras, most computer-based stereo imaging systems use fixed cameras, resulting in the aforementioned relative shift in the location of a given scene point between the two stereo images. The distance to a point in the scene can then be determined from this shift because its magnitude is inversely proportional to the distance of the point from the cameras. Thus, points near the cameras may appear in opposite corners of the two images while points infinitely far from the cameras will appear at the same location in both images. A derivation of this relationship is provided in Appendix A.

In order to determine the distance from the cameras to some object visible in both images, one must locate the projection of at least one point on that object in one of the images, then locate the projection of the same point in the other image and determine the magnitude of the relative shift in position. Given this relative shift in position, called disparity, and a few parameters of the imaging hardware, the distance to the point can be calculated quite simply. In practice, the second step in this process, finding the point in the second image which corresponds to a given point in the first image, is the difficult step. This is known as the correspondence problem and there are two general approaches to solving it.

2.4.1 CORRELATION

The correlation-based approach is based on the assumption that corresponding points in the two images will lie in regions with similar brightness patterns. To search for the point P_R in the right image which corresponds to a given point P_L in the left image, a correlation algorithm compares the brightness of each pixel in an m x n (usually square) "window" centred on P_L to each pixel in identical windows centred on each possible P_R in the right image. The correlation function is generally defined such that a close match between brightness patterns in the two regions yields a high correlation score while large differences between the two brightness patterns result in a low score. In other words, correlation performs essentially by laying a template window from one image over candidate windows in the second image and scoring the closeness of the matches. Whichever window in the right image produces the highest correlation score is the best match and, provided that score is above some threshold, its centre pixel is considered P_R. [Fua91] presents such a correlation algorithm with an extra verification step: the two images are used symmetrically so that if window a_R yields the closest match to b_L of all possible windows in the right image, b_L must provide the closest match to a_R of all possible windows in the left image in order to declare a correct match. That is, the two windows must "choose" each other as the best possible match.

Correlation has the advantages that it is a simple, general algorithm requiring no prior knowledge of the application and has the capability to produce very dense disparity maps because it attempts to determine disparity at each point in an image. Furthermore, the correlation need not be performed on raw intensity values; frequently derivatives of intensity provide more stable values on which to perform correlation.

One problem with correlation is that it may not operate well when confronted by highly specular surfaces because the shift in viewpoint between the two images may give rise to corresponding regions with very different brightness patterns. Also, in order to match windows reliably, correlation requires intensity patterns that vary sufficiently over the area of the correlation window being used. This could be a problem with murky or low-contrast images. Possibly the biggest theoretical problem with correlation is that it fails at object boundaries: where a correlation window overlaps a depth discontinuity, covering both foreground and background, there will be no matching window in the second stereo image. Most mature systems analyze boundaries to address this issue.
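The following C sketch illustrates window matching along an epipolar line together with the symmetric verification just described: a disparity is accepted only if the left and right windows choose each other. It uses a sum of squared differences (lower is better) rather than a true correlation score, and the window size, disparity range and image layout are assumptions.

#include <float.h>

#define HALF 3                          /* 7x7 matching window */

/* Sum of squared differences between a window centred at (xa, y) in A
 * and (xb, y) in B; the caller keeps y at least HALF pixels away from
 * the top and bottom borders. Images are w pixels wide, row-major. */
static double ssd(const unsigned char *A, const unsigned char *B,
                  int w, int xa, int xb, int y)
{
    double s = 0;
    for (int dy = -HALF; dy <= HALF; dy++)
        for (int dx = -HALF; dx <= HALF; dx++) {
            double d = (double)A[(y + dy) * w + xa + dx]
                     - (double)B[(y + dy) * w + xb + dx];
            s += d * d;
        }
    return s;
}

/* Best-matching column in B for column x of A, searching disparities
 * 0..max_d in the direction given by sign (-1: match lies to the left). */
static int best_match(const unsigned char *A, const unsigned char *B,
                      int w, int x, int y, int max_d, int sign)
{
    int best_x = -1;
    double best = DBL_MAX;
    for (int d = 0; d <= max_d; d++) {
        int xb = x + sign * d;
        if (xb < HALF || xb >= w - HALF) break;
        double s = ssd(A, B, w, x, xb, y);
        if (s < best) { best = s; best_x = xb; }
    }
    return best_x;
}

/* Disparity at (xl, y) in the left image, or -1 if the symmetric
 * left-right check fails. */
int disparity_at(const unsigned char *L, const unsigned char *R,
                 int w, int xl, int y, int max_d)
{
    int xr = best_match(L, R, w, xl, y, max_d, -1);
    if (xr < 0 || best_match(R, L, w, xr, y, max_d, +1) != xl)
        return -1;
    return xl - xr;
}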
Perhaps the major practical drawback of correlation is that it is a very computationally intensive procedure because it operates at every pixel in the images. To minimize computation, stereo cameras are typically aligned with their optical axes parallel so that corresponding points are constrained to lie at the same vertical level in the two images. This simplification, called the epipolar constraint, reduces the search space from two dimensions to one.[3] Furthermore, only a certain range of disparities is examined to further reduce the one-dimensional search. However, even with these reductions, correlation must perform a computation for each pair of windows being compared; with each computation being a compound step such as adding the square of a difference between differences, correlation is not for the computationally faint of heart.

[3] With the camera axes aligned parallel, epipolar lines are parallel and horizontal; if the cameras are not parallel, the epipolar lines diverge.

2.4.2 FEATURE-BASED STEREO

The second approach to stereo matching seeks to identify a salient feature in one image and a matching feature located plausibly in the second image. For example, if a corner is identified in the left image at some point P, the algorithm searches the region in the right image to the left of P along the epipolar line for a matching corner. Marr and Poggio [MP79] proposed such a feature-based procedure as a model of human stereopsis.

Feature-based algorithms are not as computationally intensive as correlation-based algorithms because they operate only on salient features rather than on every pixel in the images. They often perform well provided the images do not exhibit many instances of similar features on the same epipolar lines.

Assuming the required salient features exist in the images, the drawback of feature-based stereo is that it requires reliable identification and location of these features, thus it is only as good as the feature detection system providing this information. Depending on which features are examined, different problems must be addressed. If general features such as zero-crossings are examined, the system may have difficulty because there will typically be many features to interpret; if higher-level semantic features are examined, the system will typically require prior knowledge of what features are likely to be in the images and from what viewpoints. Regardless of the approach, feature-based stereo provides relatively sparse disparity maps for it can determine disparity only at recognized features rather than at every pixel.
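A minimal C sketch of the feature-based alternative follows: for a feature in the left image, candidate features on the same scanline and within the disparity range are scored by the similarity of a small attribute vector, and the most similar one is kept. The attribute vector and cost threshold are assumptions; a real system would also apply checks such as the mutual-choice and disparity-gradient constraints discussed in this chapter.

typedef struct { int x, y; double attr[4]; } ImgFeature;

/* Return the index in right[] of the best match for left-image feature
 * f, or -1 if no candidate scores below max_cost. Candidates must lie
 * on the same epipolar line (same y) at a positive disparity <= max_d. */
int match_feature(const ImgFeature *f, const ImgFeature *right, int n,
                  int max_d, double max_cost)
{
    int best = -1;
    double best_cost = max_cost;
    for (int i = 0; i < n; i++) {
        int d = f->x - right[i].x;               /* disparity if matched */
        if (right[i].y != f->y || d <= 0 || d > max_d)
            continue;                            /* epipolar + range tests */
        double cost = 0;
        for (int k = 0; k < 4; k++) {
            double diff = f->attr[k] - right[i].attr[k];
            cost += diff * diff;                 /* attribute dissimilarity */
        }
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;
}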
Finally, studies of stereopsis have revealed several properties which can be used to improve the reliability of stereo algorithms. Pollard, Mayhew and Frisby [PMF85] demonstrate that placing a limit on disparity gradients provides a means of resolving ambiguities without severely restricting the generality of the stereo system. Mohan, Medioni and Nevatia [MMN89] present a method to detect and correct disparity errors by exploiting figural continuity, the property that disparity changes smoothly along a contour.

CHAPTER 3
THE MANUAL STEREO SYSTEM

"He's suffering from Politicians' Logic."
"Something must be done, this is something, therefore we must do it."
- Jonathan Lynn & Antony Jay, "Yes, Prime Minister"

In order to familiarize ourselves with the use of stereopsis to determine length and to gauge reaction to the idea, a prototype system was developed and demonstrated to salmon farmers, equipment suppliers and government representatives.[1] The prototype was also presented at an aquaculture conference [NPSN93]. This method uses the same equipment and procedure to gather stereo images as the automatic system described in Chapter 4. However, this system requires the user to manually locate and mark the heads and tails of fish in the stereo images, thereby performing the recognition and correspondence steps for the system. The computer then calculates and displays the disparity, distance, length and estimated biomass of the fish being assessed.

[1] This work was supported by a research grant from the British Columbia Science Council.

3.1 ACQUIRING STEREO IMAGES

The cameras used are Panasonic WBVD400 black and white video cameras. Designed for surveillance, they are well-suited to low-light applications, offering sensitivity down to 0.5 lux. They are fitted with Cosmicar 4.8mm wide-angle lenses and placed in custom-built underwater housings with lexan dome-port lenses. Application of Tsai's camera calibration procedure [Tsa87] estimated the effective focal length of the lens/dome-port combination to be 5.4mm.[2] These wide-angle lenses were chosen to maximize the field of view. The housings are bolted to an adjustable aluminum platform at a predetermined separation with alignment pins to keep the camera axes parallel. The platform and cameras are lowered into the tank or sea-cage on an aluminum column and the entire apparatus is painted black to minimize disturbance to the fish. Power to and communication with the cameras are through waterproof umbilical cables to the surface.

[2] Thanks to Rod Barman of the UBC Laboratory for Computational Intelligence for programming and performing the Tsai calibration.

The video signals from the cameras are routed through a Panasonic WJFS10 digital frame switcher and recorded on a Panasonic AG-1960 S-VHS video-cassette recorder (VCR). The digital frame switcher routes still images to the VCR from the two cameras alternately at 1/15 second intervals. This permits the recording of images from both cameras on a single videotape and greatly simplifies the pairing of stereo images: each left image is followed by the corresponding right image. Unfortunately, the 1/15 second interval between the corresponding images of a stereo pair can cause significant error in disparity if the fish are swimming quickly. To overcome this problem, the signal from the second camera in the stereo sequence should be routed to the digital frame switcher through a 1/15 second frame-delay module.[3] The equipment is shown in Figure 3.1.

[3] The frame delay was not used in the prototype system because testing was carried out almost exclusively with stationary targets.

Figure 3.1: The apparatus used to gather stereo images underwater.

To facilitate both the location of points on the fish and accurate length estimates, we would like the fish to be videotaped with their centrelines parallel to the image plane of the cameras. Fortunately, Pacific salmon tend to circulate in an annular region in the sea-cage, hence placing the cameras on the outside of this annulus pointing radially inward captures many fish this way.
Video footage was gathered with the cameras at several different locations in the sea-cages; more research is required into the behaviour of fish in sea-cages to determine how many and which locations are required to obtain accurate information about the fish.

3.2 CAMERA CORRECTION AND CALIBRATION

The ideal camera separation is determined by the estimated range of distances to the targets: we wish to maximize disparity by increasing camera separation while maintaining adequate overlap between the two cameras to obtain stereo information over a reasonable area. If the targets are expected to be distant, the cameras must generally be farther apart to compensate for the decrease in disparity with increased distance. Fortunately, this is feasible because the overlap region increases with distance. Once the estimated range of distances and a few constants of the imaging hardware are known, the appropriate camera separation can be calculated from the equations of Appendix A.

Before accurate measurements can be made, various lens distortions must be removed from the images. These distortions include barrel distortion introduced by the wide-angle lenses[4] and ripples introduced by the hand-polished lexan dome-port lenses on the waterproof camera housings.

[4] Barrel distortion causes horizontal and vertical lines to bow increasingly outward toward the edges of the image.

The distortions are corrected by capturing an image of a plane square grid placed parallel to the image plane of the camera. This image typically shows the lines of the grid to be straight in the centre of the image and increasingly curved and rippled toward the edges. By examining the central, essentially undistorted lines, one can determine the number of pixels which should separate the lines in an ideal, undistorted image. By comparing the locations of the grid points in the actual image to where they would be in an undistorted image, one can determine the correction values for each intersection point, the number of pixels each point must be moved both horizontally and vertically to correct the lens distortions (see Fig. 3.2).

Figure 3.2: A simulation of a distorted grid (dashed line) over an ideal grid. The magnified view shows the horizontal and vertical correction values, marked dx and dy respectively.

A program was written which requires the user to mark the grid intersection points using a mouse-operated cross-hair, then calculates the horizontal and vertical correction values and writes the intersection points and their associated correction values to a correction file on disk.

A second program reads in the correction file and uses the locations of the intersection points and their associated correction values to move each pixel in the image as necessary to remove distortions. This program uses linear extrapolation and cubic spline interpolation to generate rows of smoothly-varying correction values based on the known correction values at intersection points. Given a point in a distorted image, its correction values are determined by finding the closest known correction value(s) and either linearly extrapolating or linearly or bilinearly interpolating as necessary, depending on the location of the point relative to the known intersection points.[5]

[5] Although more sophisticated extrapolation methods are available, linear extrapolation was found to produce the best results.
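For a point inside the grid, the interpolation of correction values can be sketched as below: the correction vector at an arbitrary pixel is blended bilinearly from the four surrounding intersection points. This is a simplified illustration assuming a uniformly spaced grid with per-node corrections stored in arrays; the actual program also uses linear extrapolation outside the grid and cubic splines along rows, which are omitted here.

typedef struct {
    int rows, cols;         /* number of grid intersection points */
    double x0, y0;          /* image location of grid node (0, 0) */
    double spacing;         /* ideal pixel spacing between nodes */
    const double *dx, *dy;  /* per-node corrections, row-major */
} CorrectionGrid;

/* Undistort the point (x, y) by bilinear interpolation of the known
 * correction values at the four surrounding intersection points. */
void correct_point(const CorrectionGrid *g, double x, double y,
                   double *cx, double *cy)
{
    double u = (x - g->x0) / g->spacing;
    double v = (y - g->y0) / g->spacing;
    int j = (int)u, i = (int)v;
    if (j < 0) j = 0; if (j > g->cols - 2) j = g->cols - 2;  /* clamp to grid */
    if (i < 0) i = 0; if (i > g->rows - 2) i = g->rows - 2;
    double fu = u - j, fv = v - i;

    #define NODE(r, c) ((r) * g->cols + (c))
    double ddx = (1 - fv) * ((1 - fu) * g->dx[NODE(i, j)]     + fu * g->dx[NODE(i, j + 1)])
               +      fv  * ((1 - fu) * g->dx[NODE(i + 1, j)] + fu * g->dx[NODE(i + 1, j + 1)]);
    double ddy = (1 - fv) * ((1 - fu) * g->dy[NODE(i, j)]     + fu * g->dy[NODE(i, j + 1)])
               +      fv  * ((1 - fu) * g->dy[NODE(i + 1, j)] + fu * g->dy[NODE(i + 1, j + 1)]);
    #undef NODE

    *cx = x + ddx;          /* corrected (undistorted) location */
    *cy = y + ddy;
}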
The accuracy of the correction can be tested by correcting the original image of the grid and examining how close the corrected image is to a square grid.

A similar procedure was written to align the epipolar lines in the two images of a stereo pair. After capturing a stereo pair of the calibration grid, the user marks pairs of corresponding points in the two images; the procedure leaves one image untouched and aligns the epipolars by calculating the vertical shifts required in the second image to make corresponding points lie at the same vertical level. These methods of correcting distortion were chosen because the hand-polishing of the dome-ports gives rise to random, non-uniform distortions which cannot be described by an analytical model of distortion.

During size assessment, distortion correction is performed in real time only on the points marked by the user. Because very few points are used during size assessment,⁶ performing correction on the entire image in advance requires inordinate amounts of processing time and storage space. A second reason is that during correction, many points in a distorted image are relocated beyond the borders of the image, hence these points would be lost. By performing correction internally after a point is marked, the entire original image is usable.

⁶In the demonstration system, the user clicks on only four pixels per fish in a 512 × 480 pixel image.

Once the lens distortions have been corrected, the final calibration step is to calculate the constants required to convert measurements of disparity, length in pixels and height in pixels to measurements of distance, actual length and actual height respectively. This is performed by obtaining a stereo pair of images of an object of known size placed a known distance in front of the cameras. The calibration grid used earlier serves well because it provides many "objects" of known size in many orientations throughout the image. Knowing both the actual distance to the object and its resulting disparity, the constant converting disparity to distance is easily calculated for the camera separation used. Similarly, knowing the apparent length and height of the object (in pixels) and its distance, it is straightforward to calculate the constants converting corrected length and height (in pixels) to actual length and height for the imaging hardware used. Separate constants are required for the horizontal and vertical directions because the aspect ratio of pixels is typically not 1:1. These values follow from the equations of Appendix A.

3.3 SIZE ASSESSMENT

Fish size is determined from pairs of uncorrected stereo images of the fish in side profile. The coordinates are corrected in real time as outlined above. The programs were written in ANSI C on an Intel 486 machine equipped with an ITEX Overlay Frame Grabber card for acquiring and processing images. The ITEX card provides only frame RAM; all image processing routines are implemented serially in software.
This equipment was chosen for its availability and relatively low cost in the hope of making the system accessible to local salmon farmers.

Using a mouse-operated cross-hair, the user must mark the nose, tail-fork, dorsal and ventral sides (at the widest point) of a flat, non-occluded fish in both the left and right images.⁷ The program automatically displays the second image of the stereo pair once the user has marked all four points in the first image.⁸

⁷Locations on the dorsal and ventral sides are required because the most accurate formula found relating exterior dimensions to biomass involved the "height" of the fish as well as length.
⁸Future versions of the system could analyze more points to check for consistency.

The disparity of the nose of the fish is calculated by subtracting the horizontal coordinate of the nose in the right image from that in the left image. Because distance varies inversely with disparity, the distance to the nose is calculated by dividing the previously-determined distance constant by the disparity. The same procedure provides the distance to the tail-fork.

The length of the fish is calculated using the Pythagorean theorem in three dimensions to obtain accurate estimates regardless of the orientation of the fish in the image. The horizontal and vertical components of length are calculated by subtracting the x and y coordinates of the head and tail and converting these to metres using the distance and the constants determined earlier. These components are determined from both images and the larger values chosen because contortions may make the fish appear shorter in one image but never longer.⁹ The third component of length, parallel to the optical axis, is calculated by subtracting the distance to the tail from the distance to the head. Clearly if the fish is oriented parallel to the image plane, this value is zero. The height of the fish (from dorsal side to ventral side) is calculated analogously except the estimated distance to the centre of the fish is based on the distances to the nose and tail because there are no features which provide reliable disparity values in the centre of the fish.

⁹Recall the two images are actually separated by 1/15 second, thus regular swimming motions may cause the fish to appear shorter in one image than the other.

Once the length and height of the fish have been calculated, the biomass is determined from a formula which takes into account condition factors for the particular fish in question. The program can easily be tailored to suit different conditions or groups of fish by providing different biomass formulae as necessary. For each fish, the program displays the length and disparity in pixels, the distance, length and height in centimetres and the biomass in grams.
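The core computation is compact enough to sketch in C. The fragment below is an illustration under simplifying assumptions, not the system's actual code: dist_const, kx and ky stand in for the calibration constants of Section 3.2, the coordinates are assumed to be already distortion-corrected, and the "larger component" rule and the separate height calculation are omitted.

    #include <math.h>

    /* Estimate the length of a fish from corrected stereo coordinates of
       its nose and tail-fork.  Distance varies inversely with disparity;
       kx and ky convert pixel offsets to metres per metre of distance. */
    double fish_length(double nose_lx, double nose_ly, double nose_rx,
                       double tail_lx, double tail_ly, double tail_rx,
                       double dist_const, double kx, double ky)
    {
        double z_nose = dist_const / (nose_lx - nose_rx); /* distance to nose */
        double z_tail = dist_const / (tail_lx - tail_rx); /* distance to tail */

        /* horizontal and vertical components, scaled by distance */
        double dx = (nose_lx - tail_lx) * kx * z_nose;
        double dy = (nose_ly - tail_ly) * ky * z_nose;

        /* third component, parallel to the optical axis */
        double dz = z_nose - z_tail;

        /* Pythagorean theorem in three dimensions */
        return sqrt(dx * dx + dy * dy + dz * dz);
    }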
3.4 RESULTS

Feedback from fish farmers and industry representatives was generally very positive. In fact, representatives from an aquaculture equipment distributor offered a development grant in exchange for distribution rights. Most users felt comfortable with the system and could assess fish quite quickly with very little practice. All agreed that the technique is much easier than existing methods. As will be discussed shortly, some users reported difficulty in determining precisely which points to mark under certain circumstances.

The accuracy of the system was tested by estimating the distance and dimensions of both artificial targets and dead or anesthetized fish submerged in test tanks. The necessary constants for each test were determined using the method described above and averaging several readings of disparity and apparent dimensions taken randomly throughout the images. The distortion correction files for individual cameras¹⁰ were known in advance. The stereo pair used for calibration was analyzed to ensure the accuracy of the calibration but was not used during testing.

¹⁰For the correction files, the term "camera" includes the lens and housing as the distortion is affected by the dome port and the precise position of the camera within the housing.

The artificial targets included the calibration grid and several white plastic shapes. The grid is an excellent target because the distance between any two intersection points is easily calculated, thus the grid offers many targets in various orientations throughout the images. The targets were placed between 30 and 130 centimetres from the focal plane of the cameras and were not necessarily parallel to the image plane.

In repeated tests on these well-defined targets, the estimates for distance, length and height were always within 4% of the correct values when the targets were parallel to the image plane. When the targets were not parallel to the image plane, tests showed average errors of 5.6% for length and 4.3% for height. The orientation of the object (horizontal or diagonal) in the plane parallel to the image plane had no effect on the error.

In limited testing conducted by suspending dead fish in front of the cameras, average errors were 2.7% for length estimates and 4.3% for height estimates. The maximum errors were 4.7% for length and 8.3% for height.

A more thorough evaluation of the system was performed by Andrew Ohara as an undergraduate thesis project [Oha93]. Testing was carried out on three dead and 42 anesthetized chinook salmon¹¹ by measuring them with calipers and then holding them underwater in front of the cameras in various orientations. Once the anesthetized fish regained consciousness, they were videotaped swimming freely in the tank together; still frames were captured later.

¹¹The fish were anesthetized with 2.5 grams of tricaine-methane-sulfonate.

Ohara's analysis of these images indicates that the system provides accurate length estimates at shorter distances (50 cm) but that the accuracy of the length estimates decreases slightly with increased distance (up to 110 cm). The system was very accurate with the fish parallel to the image plane but tended to underestimate lengths increasingly as the fish were rotated further out of the image plane. The width estimates were not as accurate and showed no obvious relationships to distance or orientation.

The fact that the system provided very accurate estimates with well-defined targets but could not equal this accuracy with fish indicates that it may be difficult to mark points correctly on ill-defined targets. Some users reported this problem. This is further supported by the increase in error with distance because as distance increases, the outline of the targets becomes more difficult to locate precisely.
Furthermore, errors introduced by increased distance are compounded because as distance increases, each pixel represents a larger increment of length or height, thus any error is also magnified.

Incorrect marking of points could lead to a compound error because the same points are used to determine both apparent length and disparity. If the user erroneously marks a point in one image which makes the fish appear larger to the system, this exaggerated length will be chosen as explained in Section 3.3 above. If this erroneous point also causes an apparent decrease in disparity, the distance estimate will be too high and the exaggerated apparent length will be further magnified as a result of the exaggerated distance. If the manual system were to be used commercially, it would benefit from some form of image processing such as edge detection to facilitate accurate location of points.

Another complication in determining the length of fish is that the swimming motion involves swinging the tail from side to side, causing the fish to appear shorter when viewed from the side. This motion is often modeled by a sine wave traveling along the body of the fish. The precise shape of this wave varies with each species but for fin fish the amplitude of the wave is typically negligible over the front 2/3 of the body. Accepting this model, swinging the aft third of the body 35° from the centreline results in an apparent length approximately 6% shorter than the correct length: the front 2/3 of the body projects at full length while the aft 1/3 is foreshortened by a factor of cos 35° ≈ 0.82, giving an apparent length of 2/3 + 0.82/3 ≈ 0.94 of the true length.¹² There is no obvious solution to this problem when assessing individual fish because it is often impossible to determine from an image if the fish is contorted or flat. When assessing a large group of fish, the solution may be to adjust the system constants such that the length of flat fish is slightly overestimated and that of contorted fish is slightly underestimated. This may result in accurate statistical predictions for the group as a whole.

¹²Once again, it is for this reason the larger estimate from the two images is chosen by the system.

The one simple conclusion from this testing was that the cameras must be securely locked into position in their housings. Before this was done, removing the cameras from their housings and replacing them was found to necessitate changes in the constants of up to 4% to account for small shifts in the position of the camera within the housing. Furthermore, although the two cameras should both function equally well in either the left or right position, interchanging them resulted in errors of up to 20%, indicating less than perfect alignment within the housings.

CHAPTER 4

THE AUTOMATIC SYSTEM: DESIGN AND IMPLEMENTATION DETAILS

There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
- C.A.R. Hoare

4.1 DESIGN DECISIONS

We wish to identify fish in images and determine automatically their length from head to tail-fork. As mentioned in Section 2.1, the first method proposed to estimate size sought to discover some non-uniformity in the growth of a fish which gives rise to a ratio of measurements that changes appreciably as a fish grows. The value of this ratio for a particular fish would then accurately reflect the stage of growth and size of the fish.
The chief advantage of this ratio is that, being a ratio, any effects of scale would divide out, thus the size of the fish could be estimated from a single image without knowledge of distance or scale. Unfortunately, fish were found to grow uniformly, hence all such ratios remain constant (to within biological diversity) throughout the life of the fish and provide no useful information.

Given that no ratio can be used to determine the size of a fish, any measurement taken from an image will clearly reflect the distance of the fish from the camera as much as it does the actual size of the fish because apparent length varies directly with actual length and inversely with distance (see Appendix B). Thus the system must be able to determine the distance to the fish in order to eliminate this unknown.

Sonar ranging is one method capable of furnishing distance information and has been used in aquaculture research to provide information about fish population and density. Sonar equipment works by emitting a sound wave which is reflected back to a detector by the entire school of fish. Based on the strength of the return signal and the time required for it to reach the detector, the density of and distance to the fish population can be calculated. However, because the emitted wave front spreads as the wave travels, resulting in a cone of energy, sonar equipment cannot be aimed so as to give the distance to a particular fish. In fact, no single-camera method exists which can provide accurate distance information without specialized lighting, background or the inclusion of markers in the image. Binocular stereopsis was chosen because it is entirely non-invasive, requiring no handling of the fish or specialized background or lighting.

The model-based recognition/feature-based stereo approach is not the only way to solve the problem at hand. Many recognition/stereo systems begin by performing correlation on a pair of stereo images to obtain the disparity at each point, then locate objects by searching for regions of approximately constant or smoothly-varying disparity. This approach was rejected because correlation had difficulty dealing with the poor contrast and specularities found in real images and, moreover, the approach misdirects effort, generating more information than can be used and not making optimal use of that which it does generate (see "Problems with Correlation"). As will be discussed presently (Section 4.1.1), the model-based recognition/feature-based stereo approach taken, on the other hand, handles difficult images well, provides only the information required and makes very good use of this information.

4.1.1 MODEL-BASED RECOGNITION

A recognition system adequate to the task at hand must be able to locate the heads and tails of fish in images. This task is complicated by the poor contrast often exhibited in underwater images, which makes it difficult to extract complete edges, and by subtle variations in shape between fish, which make it more difficult to obtain a precise match. Matching leniently and using multiple models and segmentations should help overcome these problems.

Fortunately, the tail of a fish is a distinct shape characterized by sharp angles.
This makes segmentation and matching more reliable because a sharp angle signifies clearly where one linear segment should end and the next begin, thereby facilitating accurate determination of the lengths of segments and the angles between them. However, the heads of certain fish are much more rounded (i.e., no sharp angles), hence the dividing points between segments become rather arbitrary and the lengths of and angles between segments become correspondingly unreliable.

It was decided that a two-dimensional recognition system would be sufficient because, as was mentioned in Section 3.1, the cameras can generally be positioned so as to provide images of the fish in side profile, thereby simplifying the problem. Furthermore, tests on the manual system showed measurements to be more accurate when the fish are oriented roughly parallel to the image plane. A two-dimensional system should also be much simpler and faster than a general three-dimensional system, which must calculate projections to solve for arbitrary orientation.

Given that we wish to identify two-dimensional objects in two-dimensional images, it is natural to use the boundary contours of objects as features. With this in mind, the system implemented uses a shape representation similar to that used in Stein's two-dimensional Structural Indexing described in Section 2.3 and [SM92, Ste92]. Recall that this method seeks to match shapes based on descriptions of the piecewise-linear segmentations of their outlines.

The basic feature used herein is Stein's super-segment with a change in the way lengths are used, the addition of orientation angles and removal of the eccentricity (see Figure 4.1). The orientation angle, which describes the absolute orientation of a segment, was included to exploit any regularity of orientation exhibited by the objects being recognized, in this case the tendency of Pacific salmon to swim horizontally.¹ Angles between successive segments are also used and are referred to as "relative angles."

¹It should be pointed out that orientation angles cannot be used for all objects or even all fish. Atlantic salmon, for example, swim randomly in all directions and thus have no preferred orientation.

The length information introduced problems because the system must be scale invariant. Nonetheless, inclusion of length information proved beneficial once the length of each segment was divided by the length of its predecessor to normalize the measurements. The eccentricity is redundant once length is included because the lengths and angles completely specify the shape of a super-segment. The orientation as Stein defines it is unnecessary in this application.

Because a given shape can have several polygonal approximations depending on the resolution of the segmentation or the size of the shape at a fixed resolution, many recognition systems permit the use of multiple models and/or multiple image segmentations to make recognition more robust.

Figure 4.1: A super-segment modified to include orientation angles θᵢ. Relative angles αᵢ are similar to those used by Stein. Segment S₀ is not part of the super-segment but is included to show the relative angle α₁. The relative length of segment Sₙ is calculated by dividing the length of Sₙ by the length of Sₙ₋₁. The location of this super-segment is marked L.
These capabilities are particularly important to the NIFE system because, in addition to segmentation and scale differences, recognition must also contend with shape differences arising naturally as a result of biological diversity. Using multiple image segmentations means that the image being processed is approximated at several resolutions² and each approximation is compared to the models. This increases robustness by increasing the likelihood of a segmented feature in the image matching one of the models (see Figure 4.2).

²The resolution is determined by the deviation permitted between a contour and its linear approximation.

Figure 4.2: (a) The linked outline of a fish. (b) and (c) Segmentations (polygonal approximations) at lower and higher resolutions respectively.

Multiple models may be either multiple segmentations of a single instance of a feature at different resolutions (to accommodate differences of scale and detail) or different instances of a given feature (to better accommodate biological diversity). To accommodate multiple models of a single feature, each model in the NIFE system can be given both a name and a feature identification number. The name of each model is intended to be unique in order to discriminate between specific models while the identification number is intended to be the same for all models of a given feature so as to reflect the equivalence of multiple models of a single feature. Thus, all models of a fish tail would be given the same identification number and it is this identification number that is referenced during verification or when determining disparities and lengths. This ensures that the verification and stereo procedures handle equivalent features uniformly regardless of which particular model was matched.

Finally, because the system is designed to determine the dimensions of objects as delimited by specific features, the models are generally of the specific features involved in measurements rather than of entire objects. For example, the length of a fish is measured from nose to tail-fork, thus separate models are provided for the nose and the tail-fork rather than a single model of an entire fish. This way, the system is able to isolate the constituent features of an object individually and a desired dimension is calculated as the distance between two delimiting features. The user specifies these features by their identification numbers.³

³The recognition system can make use of models of entire objects but it cannot use these matches to determine the dimensions of a single object; to do this it must isolate at least two features of the object individually.

The general recognition strategy is to match features leniently and then perform strict verification. Although strict matching would decrease susceptibility to false matches by admitting only very close matches, it may do so at the expense of failing to match desired features. However, verification provides a second chance to eliminate falsely matched features while failure to match a feature results in its being lost for good.
Thus, because the NIFE system has both a verification step and a stereo step offering ample opportunity and information to discriminate correct matches from false matches, all reasonable matches are admitted and false matches are eliminated later in the process.

Matching is performed by comparing the angle and length information between the image and the models segment-by-segment, searching for super-segments over a fixed range of cardinalities with average error below a predetermined threshold. All plausible matches (those with average error below the threshold) are passed to the verifier.

The verification step checks whether candidate feature matches are in plausible locations relative to one another by comparing the distance and angle between pairs of different features in the image to the distance and angle between the corresponding features in the model. To accommodate differences of scale, the distance between features is scaled by the ratio of arclengths between the image super-segment and the model super-segment. The verifier assumes that each object has only one instance of any given feature. This is a valid assumption in the case of fish heads and tails but does not work for bicycle wheels, airplane wings or chair legs.

4.1.2 STEREO

PROBLEMS WITH CORRELATION-BASED STEREO

Although correlation is a straightforward, off-the-shelf technique for stereo correspondence matching, it has a number of drawbacks which led to its rejection.

Because the size assessment system requires distance estimates only at the nose and tail of each fish to calculate its most reliable length estimate (i.e., to employ the Pythagorean theorem in three dimensions), the dense disparity maps produced by performing correlation at every pixel in an image provide much more information and require more computation than is necessary. Tests of a PC-based implementation of Fua's algorithm [Fua91] indicate that correlation would require roughly 24 minutes per image pair. Although the program could be further optimized, it does not appear that this time requirement can be substantially decreased without an expensive upgrade such as a card which implements parallel convolution in hardware. Because the goal of NIFE is a system which analyzes many images and runs on inexpensive PCs, correlation is likely simply too computationally intensive a process. Furthermore, the disparity maps produced by correlation are only a first step; they must be interpreted in order to locate the required features.

Even if the time requirement were acceptable, limited testing indicates that the specular surface of the fish combined with the difficult lighting conditions encountered underwater, either very low light and poor contrast or pronounced backscattering, make it very difficult for correlation to function effectively. The correlation program was unable to provide accurate disparity measurements even where edge detectors had no trouble recovering edge detail.

Finally, correlation requires near-perfect epipolar alignment of the two images in order to perform well because it attempts to match regions pixel-by-pixel. If the epipolar lines in the two images are not aligned properly, a candidate window scanned along a horizontal line in one image is never properly aligned with the corresponding window in the other image, resulting in inaccurate correlation scores (see Section 2.4).
Unfortunately, because of non-uniform ripples introduced by the dome-ports and the fact that the cameras are moved around and removed from their housings often, epipolar alignment has been less than perfect and cannot be sufficiently rectified by the software or a known geometric correction.

FEATURE-BASED STEREO

Feature-based stereo offers an excellent solution to these problems. First, feature-based stereo uses the features returned by the recognition system as the basis for its disparity calculations. Because the recognition system retains only those image features necessary to provide size estimates, feature-based stereo calculates disparity estimates only for those points at which disparity is required. This results in much lower computational requirements than correlation-based stereo because the number of features in an image is typically much smaller than the number of pixels. Furthermore, the feature-based verification step associates heads and tails thought to belong to the same fish, thereby identifying ambiguous matches and providing the apparent length of the fish in addition to the disparity. Correlation cannot provide such associations because it operates on syntactic primitives (pixels) rather than meaningful (semantic) features.

Secondly, feature-based stereo is less susceptible than correlation to the specific problems associated with images of fish mentioned above. Because it does not endeavour to match brightness patterns directly, feature-based matching is not as sensitive to specular reflections or poor contrast. Provided the edge detector locates the outline of the same identifiable feature in both images, the stereo algorithm should be able to identify the features as a correspondence match. As mentioned above, the Canny edge detector was able to recover the necessary edge information in places where the correlation algorithm failed.

Finally, because feature-based stereo does not endeavour to match regions pixel-by-pixel, it does not require perfect epipolar alignment. Instead, features are sought independently in the two images and two features are declared epipolar if their vertical separation is below a predetermined threshold.

Determination of the distance between two given features in a pair of stereo images requires that (1) both features be identified in at least one image to provide the apparent distance between them and (2) at least one and ideally both features be identified in the second image in order to provide the distance to one or both features. With this in mind, the stereo matcher was designed to accept features from both images and attempt to interpret ambiguous matches based on the disparity estimates in addition to making simple correspondence matches.
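A minimal C sketch of the epipolar test and the resulting disparity is given below. The Feature type, its field names and the tolerance value are hypothetical, and the real system must additionally disambiguate multiple candidates as described above.

    #include <math.h>

    #define EPIPOLAR_TOL 4.0   /* assumed vertical tolerance, in pixels */

    typedef struct {
        int    feature_id;   /* e.g., the head or tail identification number */
        double x, y;         /* corrected image location of the feature */
    } Feature;

    /* If `left` and `right` are the same feature type and lie on (nearly)
       the same epipolar line, store their disparity and return 1. */
    int correspond(const Feature *left, const Feature *right,
                   double *disparity)
    {
        if (left->feature_id != right->feature_id)
            return 0;
        if (fabs(left->y - right->y) > EPIPOLAR_TOL)
            return 0;
        *disparity = left->x - right->x;
        return 1;
    }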
4.2 IMPLEMENTATION DETAILS

The system was implemented entirely in ANSI C using the UBC Vista library. Vista is a collection of programs, subroutines and data structures developed by UBC's Laboratory for Computational Intelligence for use in image processing and computer vision. It was adopted because it provides representations for images and edge sets as well as a number of related tools such as the edge detector, linker and command-line parser. A schematic diagram of the entire process is provided in Figure 4.3.

Figure 4.3: A schematic diagram of the entire procedure showing inputs and outputs. (Model creation: Canny edge detection, edge linking, segmentation at multiple resolutions, image cleaning, and model creation and editing, producing an ASCII model file. Stereo image analysis: Canny edge detection, linking and segmentation of both images, matching of image segments to model segments, verification using the relative positions of features, location of all epipolar features, disambiguation of any multiple matches, and determination of disparity and length.)

4.2.1 PRE-PROCESSING

Because objects and models are represented by their outlines, edges are first extracted from images using an edge detector. As mentioned earlier, the Canny edge detector (described in Section 2.2) provided the best results. The Canny edge detector⁴ first smooths the image intensities with a Gaussian filter and then calculates the gradient of the intensity at each point in the resulting image. Local maxima in the gradient are marked with a value proportional to the magnitude of the gradient at that point; pixels that are not local maxima are set to zero. The edge detector returns the resulting gradient map.

⁴The Vista Canny edge detector was written by Richard Pollock, Daniel Ko and Art Pope.

The linker accepts these gradient maps and joins connected non-zero pixels to form edges. Because the gradient magnitudes supplied by the edge detector represent the "strength" of each edge, these values should be thresholded to discriminate true edges from noise. Like the Canny edge detector, the Vista linker⁵ performs Canny hysteresis using two thresholds to minimize streaking (see Section 2.2). The linker returns an edge set in which each contour is numbered and the coordinates of each pixel in each contour are specified. These contours may be of arbitrary length and shape as each contour follows the intensity gradient contour traced by the edge detector and a new contour begins only where there is a discontinuity in the current gradient contour.

⁵The Vista linker was written by David Lowe.

These contours are next broken into piecewise-linear segments to generate the polygonal approximations used in matching. This segmentation is accomplished by joining the endpoints of each contour with a straight line and locating the point on the contour of maximum perpendicular deviation from this straight line. If this deviation is below a predetermined threshold, the original contour is replaced by the straight line joining its endpoints and the next contour is examined. If the deviation exceeds the threshold, the contour is divided at this point of maximum deviation and the two resulting partial contours are passed back to the procedure recursively for further subdivision. This process ensures that each contour is approximated by a set of straight segments such that no segment deviates from the original contour by more than the allowed distance (see Figure 4.4).

Figure 4.4: (a) The polygonal approximation of contour AB is computed by joining its endpoints and finding the point of maximum perpendicular deviation C. (b) If this deviation d is above a pre-set threshold, the contour AB is replaced with two straight segments AC and CB. (c) These two new segments are themselves subdivided if necessary.
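This recursive subdivision is essentially the classic split procedure often attributed to Ramer, Douglas and Peucker; a self-contained C sketch follows. The Point type and the emit_segment callback are hypothetical stand-ins for the Vista edge-set machinery.

    #include <math.h>

    typedef struct { double x, y; } Point;

    /* Perpendicular distance from p to the line through a and b. */
    static double perp_dist(Point p, Point a, Point b)
    {
        double ux = b.x - a.x, uy = b.y - a.y;
        double len = sqrt(ux * ux + uy * uy);
        return fabs(ux * (p.y - a.y) - uy * (p.x - a.x)) / len;
    }

    /* Approximate the contour pts[first..last] by straight segments such
       that no point deviates from its segment by more than tol pixels. */
    void subdivide(Point *pts, int first, int last, double tol,
                   void (*emit_segment)(Point, Point))
    {
        int i, worst = first;
        double d, dmax = 0.0;

        for (i = first + 1; i < last; i++) {
            d = perp_dist(pts[i], pts[first], pts[last]);
            if (d > dmax) { dmax = d; worst = i; }
        }
        if (dmax > tol) {
            /* split at the point of maximum deviation and recurse */
            subdivide(pts, first, worst, tol, emit_segment);
            subdivide(pts, worst, last, tol, emit_segment);
        } else {
            emit_segment(pts[first], pts[last]);
        }
    }

Varying tol yields the multiple resolutions used during matching.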
When the recursive subdivision is complete, the entire set of straight segments is written to a new edge set. This new edge set is a more compact representation than the input edge set because all the segments are linear, thus only the coordinates of the endpoints are retained. By varying the deviation allowed between a contour and its linear approximation, the original contours can be approximated at different resolutions.

Both the linker and the segmentation program can help remove noise from the images by retaining only those contours which exceed a pre-set minimum length (in pixels). Elimination of short contours is a valuable time-saving step because most of the contours resulting from noise are very short. If the edge detector were guaranteed to extract complete outlines of objects, this threshold could be set quite high to remove virtually all contours except these outlines. Unfortunately, because discontinuities often cause a long contour to be broken into a number of shorter contours, setting this threshold too high will likely remove valuable information from the image.

The data structure used to store Vista edge sets (such as those created during segmentation) permits access to information about individual segments only in numerical order. In order to compute and match super-segments efficiently, we must have random access to information about the segments, including knowledge of which segments are connected. Random access is required because connected segments are not necessarily numbered sequentially throughout an image: sequentially-numbered segments may lie at opposite corners of an image while any segment may be connected to several segments numbered seemingly randomly.

To accommodate this, each segment in the edge set is entered into an array with its identification number serving as the array index. Each entry in the array provides the length, endpoints and orientation angle of the segment and a pointer to a list of all the connected segments (see Figure 4.5). This array provides fast access to all the information about any segment without having to calculate values repeatedly. The relative angle of a segment is calculated from the absolute angle of the segment and that of the previous segment.

Figure 4.5: The segment information calculated for segment 12 (length: 87.4; orientation: 0.1; points: (50, 92) and (137, 100); connections to segments 11, 17 and 33). Segments 11, 17 and 33 are considered connected to segment 12 because each has an endpoint within the prescribed radius of an endpoint of segment 12 (dashed circle); segment 37 is not connected. The orientation angle is in radians from the horizontal. The connect field describes what type of connection exists between the specified segments.

Two segments are considered connected if either end of one segment lies within a predetermined radius of either end of the other segment. Using the start-points and endpoints assigned to each segment by the segmenter, the connections are labeled by the type of connection that exists between them. If a single contour is divided into segments, the endpoint of one segment will meet the start-point of the next. However, discontinuities in the contours and contours resulting from noise and nearby objects often lead to segments joined start-to-start, end-to-end or start-to-end.
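A C rendering of this record might look as follows; the field and type names are illustrative rather than the system's actual declarations.

    typedef enum {            /* how two segments' endpoints meet */
        START_TO_START, START_TO_END, END_TO_START, END_TO_END
    } ConnectType;

    typedef struct Connection {
        int                segment;   /* index of the connected segment */
        ConnectType        connect;   /* type of connection (see text) */
        struct Connection *next;
    } Connection;

    typedef struct {
        double      length;           /* in pixels */
        double      orientation;      /* radians from the horizontal */
        double      x1, y1, x2, y2;   /* start-point and endpoint */
        Connection *connections;      /* all segments within the radius */
    } SegmentInfo;

    /* With SegmentInfo segments[MAX_SEGMENTS], segment 12's record is
       segments[12], giving constant-time access to its neighbours. */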
4.2.2 MODEL CREATION AND USE

Models are created from single images of isolated objects. Because only the shape of the outline is required, a clean image of the object to be modeled can be created by cropping the original image, enhancing contrast or including only those contours longer than some minimum length (in pixels). In fact, a model can be created from an image of a black paper cutout of the object against a white background.

The pre-processing follows the same course described above with one additional step performed after segmentation. This step, called "cleaning" the image, reorders the segments so that connected segments are numbered consecutively and joins those segments which are likely part of the same closed shape.

Irregular ordering often results from discontinuities in the edges traced by the edge detector. Such discontinuities can cause the linker to declare two separate contours where there should be only one. During segmentation, these two separate contours may not be processed consecutively, in which case the segments approximating the second contour will not numerically follow those approximating the first contour. Cleaning the image after segmentation reorders the segments such that each one is followed numerically by the closest remaining segment which has not already been numbered. This method is based on the notion that contours which were separated by erroneous discontinuities will still lie close to one another and will thus be re-numbered consecutively.

After reordering, each segment is extended in both directions to determine if its extension crosses within a predetermined radius of either endpoint of the next segment, in which case it is extended to meet the next segment. The overall effect of the cleaning process is thus to repair breaks caused by imperfect edge detection and order segments such that their numbers increase sequentially around the perimeter of a closed figure. This step is helpful when creating models because it makes the segment lengths more accurate and ensures that information about connected segments is stored sequentially in the model file. Cleaning is performed only on the images used for model creation because they are assumed to have relatively few segments outlining only the one object being modeled⁶ while real images upon which recognition is performed are often littered with too many spurious and disconnected segments to be cleaned effectively.

⁶Recall that these images can be cropped and contrast-enhanced.
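The reordering half of the cleaning step can be read as a simple greedy nearest-neighbour pass over the segments; the sketch below is one plausible rendering, not the system's code. It assumes the SegmentInfo type from the previous sketch and a hypothetical helper endpoint_gap returning the smallest distance between the endpoints of two segments.

    #define MAX_SEGMENTS 1024   /* assumed capacity */

    double endpoint_gap(const SegmentInfo *a, const SegmentInfo *b);

    /* Reorder segments so that each is followed by the closest segment
       not yet numbered; order[] receives the new sequence of indices. */
    void reorder_segments(const SegmentInfo *seg, int *order, int n)
    {
        int used[MAX_SEGMENTS] = {0};
        int i, j, cur = 0;

        order[0] = 0;
        used[0] = 1;
        for (i = 1; i < n; i++) {
            int best = -1;
            double d, bestd = 1e30;
            for (j = 0; j < n; j++) {
                if (used[j]) continue;
                d = endpoint_gap(&seg[cur], &seg[j]);
                if (d < bestd) { bestd = d; best = j; }
            }
            order[i] = best;
            used[best] = 1;
            cur = best;
        }
    }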
A model file is simply a text file which provides the length and angle information for each segment in the model. It is created semi-automatically. A model is itself a super-segment describing a single feature, either a head or a tail, at a given resolution. Heads and tails are connected only during verification and each model includes information specifying by which models it can be verified.

Each model file has a header line which specifies the unique name of the model, the number of segments in the model, a flag indicating whether to compare orientation angles when matching the model⁷ and the length of the segment preceding the model in order to calculate a relative length for the first model segment. Each line in the file refers to one segment and includes its length, orientation angle, relative angle (relative to the previous segment) and the coordinates of its endpoints. The user is free to add text after the last information line of the file (see Table 4.1).

⁷Orientation angles need not be compared if the object being recognized exhibits no regularity of orientation. Furthermore, orientation-dependent models can be used together with orientation-independent models if some of the objects to be recognized have expected orientations and others do not.

    shad_head1  4  1  63.8
    66.5 252.5  27.5 304.5  56.6  2.2  2.7
    27.5 304.5  37.5 306.5  10.0  0.1  1.0
    37.5 306.5  40.5 324.5  18.2  1.4  1.9
    40.5 324.5  93.5 339.5  55.1  0.3  2.0
    93.5 339.5 185.5 342.5  92.0  0.0  2.8
    Image: fish1.v, sigma = 1.0, segmented at deviation 3

Table 4.1: A sample model file. The '1' on the first line indicates that orientation angles are to be examined. The last line is a note to the user. Note that although the first line states that the model consists of four segments, there are actually five segments in the file. The fifth segment, despite not being part of the model proper, describes the next segment in the polygonal approximation and provides relative angle and length information when examined in conjunction with the previous segment.

Because the models are stored as ASCII files, they can be edited easily. The user typically creates a model of a specific feature by generating a model of the entire object and deleting from this model all the segment information lines except those which describe the desired feature. By doing this repeatedly, separate models for all the features of a given object can be created from a single model of the entire object.

For reasons that will be explained when verification is discussed (Section 4.2.4), all features of a single object must be extracted from images which show the object at the same scale and in the same location in the image. In the development and testing, all models were taken from different segmentations of the same image of a fish.

Most applications of the system require several models. To simplify use, the user may create a file providing the number of models being used, their names, the feature identification number associated with each one and a list of those models which can verify each model.⁸ Specifying the system models, feature identification numbers and verifying models in this separate file allows a single collection of models to be used for several different recognition tasks.

⁸This information can also be entered from the keyboard during execution but this is very slow.
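Reading such a model file is straightforward. The sketch below parses the header line and the segment lines in the apparent column order of Table 4.1 (endpoints, length, orientation angle, relative angle); that ordering, the ModelSegment type and the buffer sizes are assumptions inferred from the table rather than the system's actual definitions.

    #include <stdio.h>

    typedef struct {
        double x1, y1, x2, y2;   /* endpoints */
        double length;
        double orientation;      /* absolute angle from the horizontal */
        double rel_angle;        /* angle relative to the previous segment */
    } ModelSegment;

    /* Read a model file in the format of Table 4.1.  Returns the number
       of segment lines read (nseg + 1, including the extra trailing
       segment), or -1 on a malformed file. */
    int read_model(FILE *fp, char name[64], int *nseg,
                   int *check_orientation, double *prev_length,
                   ModelSegment segs[], int max_segs)
    {
        int i;

        if (fscanf(fp, "%63s %d %d %lf",
                   name, nseg, check_orientation, prev_length) != 4)
            return -1;

        /* one extra line follows the model proper (see Table 4.1);
           any free-form note after it is simply not read */
        for (i = 0; i <= *nseg && i < max_segs; i++)
            if (fscanf(fp, "%lf %lf %lf %lf %lf %lf %lf",
                       &segs[i].x1, &segs[i].y1, &segs[i].x2, &segs[i].y2,
                       &segs[i].length, &segs[i].orientation,
                       &segs[i].rel_angle) != 7)
                return -1;
        return i;
    }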
4.2.3 MATCHING

After segmenting the current image at the current resolution, the program examines each image segment and attempts to match a super-segment beginning with the current image segment against every possible super-segment in each candidate model. A super-segment may begin on any segment of either the image or any model provided a super-segment of the specified minimum cardinality is attainable: it is pointless to examine a match beginning on the third segment of a four-segment model or image feature if the minimum cardinality is three.

Matching is performed by comparing the angle and length information between the image and the models segment-by-segment, searching for super-segments over a fixed range of cardinalities with average error below a predetermined threshold. The three values compared are as follows:

Orientation Angle The absolute orientation of the segments, measured relative to the horizontal, is examined to exploit any regularity of orientation exhibited by the objects, in this case the tendency of Pacific salmon to swim horizontally. This value can be ignored to identify randomly-oriented objects.

Relative Angle The relative angle is the angle between a segment and the previous segment in the super-segment. This value is invariant to changes in scale and orientation.

Relative Length The length of a segment is divided by the length of the previous segment in order to accommodate differences of scale.

Matching is accomplished by summing the squares of the discrepancies in these three values between the current image segment and model segment. Prior to squaring, each discrepancy is divided by a pre-set constant which specifies a "pivot point," the value which separates an acceptable discrepancy which decreases when squared from a significant discrepancy which increases when squared. Division by a constant also permits specification of an equivalence between errors in relative length and errors in angle. Setting the angle constant to 0.15 and the length constant to 0.1, for example, specifies that an angular discrepancy of 0.15 radian is equivalent to a 10% error in relative length. Thus the total error for a super-segment is given by:

\[
\mathrm{error} \;=\; \frac{1}{n}\sum_{j=1}^{n}\left[
\left(\frac{\alpha_{m_j}-\alpha_{i_j}}{c_{\mathrm{angle}}}\right)^{2}+
\left(\frac{\theta_{m_j}-\theta_{i_j}}{c_{\mathrm{angle}}}\right)^{2}+
\left(\frac{\mathit{length}_{m_j}-\mathit{length}_{i_j}}{c_{\mathrm{length}}}\right)^{2}
\right]
\]

where α is the relative angle, θ the orientation angle, length the (relative) length of the segment, subscripts mⱼ and iⱼ refer to the jth model segment and image segment of the current super-segment respectively and c_angle and c_length are the constants explained above. The cumulative error is divided by n, the number of segments matched, so that each segment may contribute a small error without the average error exceeding the allowed maximum. Without this division, every long super-segment (high cardinality) would likely exceed the error limit even if each of its constituent image segments matched the corresponding model segment quite closely. The total error could also be divided by some function of n so as to encourage longer or shorter matches as desired. Any super-segment having average error below the pre-set limit is considered a match; lowering this limit makes the matching more strict.

Orientation angles are restricted to the interval −π < θ ≤ π. Restricting the orientation angles to this interval creates an unwanted discontinuity at ±π. Switching to degrees for simplicity, a segment oriented at 183° to the horizontal is recorded as lying at −177°, thus the discrepancy in orientation angles between a segment oriented at 178° and one oriented at −177° is 178 − (−177) = 355° where it should be 5°. To compensate for this discontinuity, any orientation angle within a predetermined ε of ±π is examined both as is and after adding or subtracting 2π. In the previous example, the system would realize that adding 360° to −177° results in the correct discrepancy of 178 − ((−177) + 360) = 5°.
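In code, the compensation amounts to taking the smaller of the direct discrepancy and its 2π complement. A one-function C sketch follows; for simplicity it applies the test to every angle rather than only to those within ε of ±π, which gives the same result.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Discrepancy between two orientation angles in (-pi, pi], allowing
       for the wrap-around at +/- pi. */
    double angle_discrepancy(double theta_model, double theta_image)
    {
        double d = fabs(theta_model - theta_image);
        double wrapped = 2.0 * M_PI - d;   /* after adding/subtracting 2*pi */
        return d < wrapped ? d : wrapped;
    }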
The angle information is useful and easy to work with because angles are invariant to scale changes. The length information poses a problem in this respect because segment lengths are largely determined by the scale of an object in the image. This makes it difficult to compare image segment lengths to model segment lengths when the scale difference between the image and model objects is unknown. To compensate for this, the length of each segment is divided by the length of the previous connected segment to provide a scale-invariant ratio (see "Travel, Traversal & Reversal"). Unfortunately, this ratio is very sensitive to the precise locations of the breaks between segments because the division introduces a compounding effect: if a relatively straight contour is divided into a three-unit segment followed by a four-unit segment, the ratio is 4/3; if the order of these two segments is reversed, the ratio is only 3/4. However, such discrepancies generally occur only when the contour is fairly straight, resulting in somewhat arbitrary breakpoints between segments; the method is reliable whenever sharp angles between segments provide clear breakpoints.⁹

⁹Using the angle between segments to gauge reliability is discussed in Section 6.2.

The matching process is simplified by the fact that model segments are stored sequentially, thus there is never any doubt regarding which model segment should follow the current one or which is the previous segment. However, this is not the case with image segments: any segment may be connected to any number of other segments.

To accommodate this, all matching after the first segment of a super-segment is carried out by a procedure which uses the segment connection lists (see Figure 4.5) to traverse all possible paths through connected segments depth-first. Using recursion, this procedure keeps track of the current path and the cumulative error to this point, thus it can always back up to a previous juncture to try another path without repeating the work performed to reach that juncture. The procedure always tries to construct the longest match possible, stopping only when the accumulated error exceeds the threshold or the cardinality of the match reaches the specified upper limit. Although this recursive procedure introduces combinatorial complexity, it is the only way to ensure that all possible super-segments are examined. This is necessary because noise and nearby objects in the images often give rise to unwanted segments which appear to be connected to other segments, requiring that several paths along the connected segments be examined in order to find a proper match.
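The depth-first extension can be sketched as follows. This is a skeleton under stated assumptions, not the system's code: seg_error stands for the bracketed per-segment term of the error formula above, record_match, the limits and model_nsegs are hypothetical, the segments[] array and Connection type are those of the Figure 4.5 sketch, and connection types, traversal direction and the first-segment special cases are omitted for brevity.

    /* Assumed context (see the Figure 4.5 sketch): */
    extern SegmentInfo segments[];
    extern int model_nsegs;
    double seg_error(int model_seg, int image_seg);
    void record_match(int m, int i, int depth, double avg_err);
    #define MIN_CARD    3     /* assumed cardinality limits */
    #define MAX_CARD    8
    #define MAX_AVG_ERR 1.0   /* assumed error threshold */

    /* Model segment m has just been matched to image segment i, with
       cumulative error err over `depth` segments; recurse on every
       segment connected to i. */
    void extend_match(int m, int i, double err, int depth)
    {
        const Connection *c;

        err += seg_error(m, i);
        if (err / depth > MAX_AVG_ERR)
            return;                              /* prune this path */

        if (depth >= MIN_CARD)
            record_match(m, i, depth, err / depth);

        if (depth == MAX_CARD || m + 1 >= model_nsegs)
            return;                              /* cannot grow further */

        for (c = segments[i].connections; c != NULL; c = c->next)
            extend_match(m + 1, c->segment, err, depth + 1);
    }

Because the recursion unwinds to the last untried connection, no work is repeated when an alternative path is explored, exactly as described above.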
MATCHING THE FIRST SEGMENT

Evaluating the first segment of a potential match calls for special care because the previous connected segment, required in order to calculate the relative length and relative angle of the current segment, is not always obvious. Often noise and nearby objects give rise to more than one connected segment in the image and make the resulting relative length and angle values correspondingly unreliable.

The system offers a choice of two methods to deal with this problem. The first method examines each connected segment and calculates the relative value errors only if exactly one segment bears the appropriate connection to the current segment. This solution is based on the notion that if only one previous segment is suitable, it is likely the correct one. If the system locates either fewer or more than one suitable segment, no relative value errors are added to the accumulated error, thereby tending to favour a match.

The second method examines each suitable connected segment and chooses the one which minimizes the accumulated error in relative angle and length, thereby also tending to favour a match. Although this second method would seem to find the correct matches while admitting fewer false matches than the first method, it was found to offer no advantages to justify its slightly increased computation. Nonetheless, the second method was left in the system and can be selected upon execution; the system defaults to the first method for all orientation-dependent matching.

Also assessed during examination of the first segment are the likely directions of traversal and travel of the current object, discussed presently.

TRAVEL, TRAVERSAL & REVERSAL

The linker and segmenter typically traverse boundaries clockwise, meaning that the segment identification numbers increase sequentially as one travels clockwise around the perimeter of an object. Assuming the model was traversed clockwise, matching is most straightforward when the perimeter of the object in the image was also traversed clockwise. In this situation, the first image segment of the object would match one segment in the appropriate model without adjusting the angles or relative lengths, the next image segment would match the next model segment and so forth until the match is complete.

Under certain circumstances, however, some object boundaries may be traversed counter-clockwise in an image. In order to detect matches when this has occurred, the program is able to reverse the direction in which the model is traversed. As mentioned earlier, the program attempts to begin a match on every potentially-successful segment of each model. If a reverse match beginning with the current model segment is possible (i.e., there are enough previous segments in the current model to achieve the minimum cardinality), the error in orientation angle is calculated both normally and after reversing the traversal direction of the current model segment. If the error for the reversed segment is lower than for the forward segment, the system reverses the direction in which the model is traversed. This reversal entails moving backwards through the segments (i.e., examining segment n − 1 after segment n), reversing the direction of each individual segment in order to adjust its orientation angle and calculating the relative angles and lengths relative to the segment which follows rather than precedes the current segment in the model (i.e., the relative length of segment n is given by lengthₙ/lengthₙ₊₁ rather than lengthₙ/lengthₙ₋₁). This reverse traversal is also attempted whenever only a reverse match is possible from the current model segment (i.e., there are enough previous segments in the model but not enough subsequent segments to achieve the minimum cardinality).

If a discontinuity divides the outline of a shape into two contours, it is possible for the two contours to be traversed separately from their opposite ends such that the resulting two sets of segments meet at the discontinuity.
This situation can be detected because the segments where the two traversals meet will be joined endpoint-to-endpoint rather than endpoint-to-startpoint as with a single continuous traversal. When this is detected, the system continues traversing forward through the model segments but reverses the direction of the individual image segment in order to adjust the angles. It then examines segments connected to the start-point of the current segment. If the current segment has been reversed, the start-point of the current segment should meet the endpoint of the next segment, thus the system will generally remain in "reverse" mode until the match is completed. If, however, the start-point of the current segment meets the start-point of the next segment, another reversal has occurred and the system returns to "forward" mode. By doing this, the system can handle any number of reversals (see Figure 4.6).

Figure 4.6: Segments 6 and 7 were derived from a single contour, thus they meet endpoint-to-startpoint. However, segment 7 meets segment 23 endpoint-to-endpoint, thus the system moves into "reverse" mode. The system then finds segment 22 by examining the start-point of segment 23. Because segment 23 meets segment 22 startpoint-to-endpoint, the system remains in "reverse" mode. However, segment 37 meets segment 22 startpoint-to-startpoint, thus the system returns to "forward" operation.

Another possible reversal is in the direction of travel of the object in the image: even if orientation angles are being used to specify that the fish swim horizontally, they may swim left-to-right or right-to-left in the images. This corresponds to a perfectly-legitimate reflection of the object (and each of its constituent features and segments) about a vertical axis. The decision of whether the object in the image is traveling parallel or anti-parallel to the object in the model¹⁰ is also made during examination of the first segment of the match. Once again, the decision is made by comparing the error in orientation angle for both possible orientations of the model segment; the system selects whichever direction of travel minimizes the error. Orientation angles are used to determine the directions of traversal and travel because the orientation angle is the only matching criterion that is intrinsic to a single segment; all other criteria require the identification of an adjacent connected segment and thereby introduce a possible source of error.

¹⁰If orientation angles are being used, the model must describe the object in (one of) its expected orientation(s).

MATCHING RANDOMLY-ORIENTED OBJECTS

Because the system was designed to identify objects which are generally found in a particular orientation, it would be foolish to neglect this source of information when matching features. However, the system is also able to ignore the absolute orientation of segments in order to identify objects that do not exhibit any preferred orientation. The system can be instructed to do this either upon execution, in which case no orientation angles are used, or within individual model files, which allows the user to specify that orientation is to be examined for certain objects being identified and not for others.

When matching randomly-oriented objects, the relative lengths of the segments and the relative angles between the segments are the only criteria available to the system.
The matching process is slightly simplified when orientation is not being examined because the system need not ascertain the direction of travel: only the orientation angles change if travel is reversed; relative angles and lengths remain the same. However, the system still must determine the direction in which to traverse the model. This is accomplished by computing the relative angle and length for both traversal directions of the current model segment;[11] the traversal direction which minimizes the resulting accumulated error is chosen.

[11] Recall that reversing the traversal direction also changes the end to which the previous segment must be connected (see "Travel, Traversal & Reversal").

Once the first segment has been matched and a traversal direction has been selected, matching the remainder of the super-segment proceeds precisely as with orientation-dependent matching except that discrepancies in orientation angle between image and model are ignored.

As each segment of a super-segment is matched, the difference in orientation angle between the image segment and the corresponding model segment is added to a running total. Upon completion of the super-segment, this total is divided by the number of segments in the super-segment in order to provide an estimate of how the image feature is rotated relative to the model.

ELIMINATING REDUNDANT MATCHES

One difficulty of using multiple models and segmentations is that the same image feature may be matched more than once. That is, a single feature in an image may be matched to two different models of that feature or at two different resolutions. Although these superfluous matches can cause confusion during the stereo processing stage, a match cannot be discarded simply because the same image feature was matched to an equivalent model earlier: the new match may be more precise or complete, thereby providing a more accurate location of the feature. Furthermore, it is difficult to determine whether a feature has already been matched because different segmentations result in different endpoints and numberings for the segments, thus one cannot simply check whether the current segment has already been matched.

However, because the segmentation process operates repeatedly on a single set of contours, it is virtually guaranteed that the sets of segments comprising different segmentations of a given contour will share at least one common endpoint.[12] Exploiting this property to eliminate redundant matches, the system examines the list of matches for the current image each time a match is found; if a previous equivalent match shares a segment endpoint with the current match, only the match with the longer arclength is retained.

[12] The only time this will not occur is if one segmentation is very low-resolution, the other very high-resolution and the contour exhibits high curvature.
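The redundancy test might be sketched as follows, assuming each match record carries a feature identification number, its segment endpoints and its arclength (hypothetical field names, not the actual implementation):

    def redundant_loser(match_a, match_b, tol=1.0):
        """If two matches of the same feature share a segment endpoint
        (to within `tol` pixels), return the one to discard: the match
        with the shorter arclength.  Returns None when both are kept."""
        if match_a.feature_id != match_b.feature_id:
            return None                     # different features: keep both
        shared = any(abs(xa - xb) <= tol and abs(ya - yb) <= tol
                     for (xa, ya) in match_a.endpoints
                     for (xb, yb) in match_b.endpoints)
        if not shared:
            return None                     # no common endpoint: keep both
        return match_a if match_a.arclength < match_b.arclength else match_b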
Any of several values could have been used to compare matches, but arclength was found to be the most satisfactory. Choosing the match made at the higher resolution could be misleading because a match made at a lower resolution may represent a larger portion of the feature and hence more accurately reflect the location of the feature. Similarly, comparison based on the number of segments comprising a match is unsatisfactory because a higher-resolution match could represent a small portion of the feature despite having more segments than a lower-resolution match of the entire feature. Selecting the match with lower cumulative error could also result in a shorter match being chosen over a more complete match. Although the proximity of two matches may indicate whether the two are redundant, this test is not as reliable as comparing segment endpoints, and the locations cannot indicate which match is preferable.

Comparing arclengths takes care of a number of these problems. First, because both matches are of the same feature derived from a single set of contours, the comparison will not be influenced by differences of scale. Secondly, the "completeness" problem is solved automatically: if a low-resolution match comprises significantly more of the feature than a higher-resolution match, it should have a longer arclength and will thus be chosen. If, however, two matches comprise the same "amount" of a feature, the higher-resolution match will show greater detail and will thus have a longer arclength, causing it to be chosen as we would like.

All super-segments that match a model (i.e., have normalized cumulative error below the allowed threshold) and are not redundant are entered in a linked list of matches. Each entry in this list contains all the information about the match, including the feature identification number and the particular model matched, the cardinality of the match, the starting segment, location and arclength of the super-segment in both the model and the image, the relative rotation between the image feature and the corresponding model feature, the hypothesized directions of travel and traversal, the resolution at which the feature was matched, and a pointer to a list of supporting features which will be constructed during verification.

Both images of the stereo pair are passed to the matching procedure separately, resulting in two linked lists of features.

4.2.4 VERIFICATION

Candidate feature matches are verified by testing whether they are located plausibly relative to one another. The verification procedure examines the matched features in one image at a time, comparing the distance and angle between pairs of different but related image features to the distance and angle between the corresponding model features. These consistency criteria, also used by Stein, are those introduced by Grimson and Lozano-Perez ([GLP84]) and are calculated as follows:[13]

Distance The distance between features in two dimensions. To compensate for differences of scale between the model and the image, the distance between the two features in the model is scaled by the ratio of super-segment arclengths between the image and the model.

Angle The angle between features is measured relative to the horizontal. For objects of known orientation, only direction reversal is allowed without penalty; for randomly-oriented objects, arbitrary rotation is permitted.

[13] Grimson and Lozano-Perez (and Stein) also compare the ranges of components of vectors spanning the two scene super-segments and the two model super-segments. This criterion was not used in the current system.
Related features are simply features which belong to the same object and are thus eligible to verify one another. The user specifies the related features for each model; this helps prevent false verifications when features belonging to more than one object are being identified. For example, if two different types of fish are being identified, the user can specify that a head and tail must belong to the same type of fish in order to verify each other, thereby preventing a head from fish A from falsely verifying a tail from fish B.

The method by which the consistency of a pair of matches is determined is very similar to the method used to find matches: the discrepancies in distance and angle between the pair of image features and the corresponding model features are squared, summed and compared to a predetermined threshold. As was done when matching, the discrepancies are divided by predetermined constants prior to being squared, once again to provide a "pivot point" and to define an equivalence between discrepancies in length and in angle. Thus the verification error is given by

$$\mathrm{error} = \left( \frac{\theta_i - \theta_m}{c_{angle}} \right)^2 + \left( \frac{d_i - \frac{a_i}{a_m}\,d_m}{c_{distance}} \right)^2$$

where $\theta$ and $d$ refer to the angle and distance between the two features, $a$ refers to the arclength of the matched super-segment, $c_{angle}$ and $c_{distance}$ are predetermined constants, and the subscripts $i$ and $m$ refer to values in the image and model respectively. Two features are considered consistent with one another if the cumulative error in their relative positions is below a predetermined threshold; lowering this threshold makes the verification more strict.

To expedite verification, the list of matches is first sorted so that all instances of a particular feature are grouped consecutively, followed by all instances of the next feature and so forth. Given t different feature identification numbers, verification now proceeds by checking all instances of feature m against all instances of related feature n for 1 <= m < t and m < n <= t. In English, each instance of each feature is compared to each instance of every related feature bearing a higher feature identification number.[14] Because features are sorted by increasing identification number, the comparisons begin with the first feature in the list bearing a higher identification number than the current feature and continue to the end of the list of features. The symmetric nature of verification (i.e., A verifies B implies B verifies A) ensures that checking a feature against features with lower identification numbers would be redundant.

[14] These identification numbers actually refer to the position of a group of equivalent features within the sorted list and need not correspond to the assigned feature identification numbers.
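A direct transcription of this error into code, under the same notation (the angle-wrapping helper simply folds differences into [-180°, 180°)); the constants are illustrative, not the thesis's actual settings:

    def wrap_angle(a):
        """Fold an angle difference into [-180, 180) degrees."""
        return (a + 180.0) % 360.0 - 180.0

    def verification_error(theta_i, d_i, theta_m, d_m, a_i, a_m,
                           c_angle=15.0, c_distance=20.0):
        """Consistency error between a pair of image features and the
        corresponding pair of model features.  The model distance d_m
        is first scaled by the arclength ratio a_i / a_m to remove
        differences of scale between image and model."""
        e_angle = wrap_angle(theta_i - theta_m) / c_angle
        e_dist = (d_i - (a_i / a_m) * d_m) / c_distance
        return e_angle ** 2 + e_dist ** 2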
The angle between features is handled differently depending upon whether the objects being identified have definite or random orientation. If the objects have definite orientation, the angle between the image features must match either the angle between the corresponding model features (if the image object was identified traveling parallel to the model object) or the reflection of this angle about the vertical axis (if the image object was identified traveling anti-parallel to the model object) (see "Travel, Traversal & Reversal"). Two orientation-definite features can verify each other only if they were both identified traveling in the same direction.

If the objects exhibit no preferred orientation, the verifier must determine where two image features should be located relative to each other in order to determine whether or not they are consistent. To do this, the verifier uses the previously-calculated estimate of the relative rotation between the current image feature and the corresponding model feature (see "Matching Randomly-Oriented Objects"). The angle between the two model features is rotated by this angle before calculating the discrepancy in orientation. This effectively aligns the model with the first image feature and then determines whether the second image feature being considered is in the appropriate location. The system cannot check whether the two features were traveling in the same direction because no direction of travel is calculated for randomly-oriented objects.

The user specifies the number of consistent features required to constitute verification. In the NIFE system, only one consistent feature is necessary for verification as only heads and tails are available to verify each other; if another application were to identify several features on each object, the user could demand more consistent matches to make verification more rigorous.

Each feature maintains both a counter and a list to keep track of how many and which features have been found to verify it. Whenever a consistent pair of features is found, both features have their associated counters incremented to reflect this new support, and a pointer to the appropriate supporting feature is appended to each support list. Thus if feature B is found to be consistent with feature A, both A and B will have their support counters incremented and a pointer to B will be added to the support list for A while a pointer to A will be added to the support list for B. This once again highlights the symmetry of verification.

By keeping track of supporting features in this way, the verification process not only helps eliminate false matches, those that are not consistent with any others (as judged by location), but also serves to associate features that likely belong to the same object. With the NIFE system, for example, if a head is found to be consistent with a tail, they are probably parts of the same fish and the apparent length of that fish is simply the distance between them.

The verification step relies upon one of the assumptions made during the design of the system: the assumption that each object has at most one of each feature. Because the verification procedure checks each feature against only features with different identification numbers, two equivalent features[15] would not be checked against one another. Thus the two wheels of a bicycle would not be examined to verify one another. If this restriction were removed, steps would have to be taken to prevent a feature from verifying itself.

[15] Recall that equivalent features are features that have been matched against models bearing the same feature identification number; multiple models of a tail acquired at different resolutions would all have the same identification number and would be equivalent models.

Finally, because the verification procedure calculates distances and angles between features in the models, all models of features belonging to a single object must be acquired from images which show the features in the same locations and at the same scale. If the model of a head were taken from the top left corner of the image and the model of a tail from the bottom left, the calculated distance between the head and tail would be arbitrary and the calculated angle between them would be 90° (vertical) rather than 0° (horizontal). Of course, this could all be accounted for by scaling and translating features or models as necessary, but it has proven easier to acquire models from similar images. All models of a single fish used in testing were acquired from different segmentations of the same image.
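Closing out the verification stage, the scan over the sorted match list and the symmetric support bookkeeping might look roughly as follows; `error_fn`, `related` and the per-feature support requirement are stand-ins for the mechanisms described above, not the actual implementation:

    def verify(groups, related, error_fn, threshold, required_support=1):
        """`groups` holds the sorted matches, one list per feature
        identification number.  Each instance of feature m is compared
        to each instance of every related feature n > m; consistent
        pairs support one another, so both support lists are updated."""
        for m in range(len(groups)):
            for n in range(m + 1, len(groups)):
                if not related(m, n):
                    continue
                for fa in groups[m]:
                    for fb in groups[n]:
                        if error_fn(fa, fb) < threshold:
                            fa.support.append(fb)   # symmetric bookkeeping:
                            fb.support.append(fa)   # A verifies B, B verifies A
        # A match is verified once it has gathered enough supporters.
        return [f for g in groups for f in g
                if len(f.support) >= required_support]

Restricting the inner loops to n > m is what makes the symmetry pay off: each pair is tested once, yet both members receive the resulting support.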
4.2.5 STEREO

Once the features have been verified and associated with their supporting features, the stereo procedure attempts to ascertain the distances to individual fish by identifying pairs of corresponding features in the two images of a stereo pair and examining the disparity between them (see Appendix A). The stereo procedure also examines the locations of features in both images to interpret ambiguous matches, those in which one head, for example, is consistent with two tails. The system favours accuracy of results over quantity, thus any matches which cannot be interpreted reliably are discarded. This policy should not result in a shortage of usable data because images are easily collected: at 7.5 stereo pairs per second,[16] a one-hour videotape of the fish circulating in a net pen provides 27000 image pairs, each of which may show several fish.

[16] Standard video captures one image every 1/15 second.

In order to determine the actual length of a fish, the system requires the apparent length of the fish (in pixels) as an indication of size and the distance to the fish to establish scale. The apparent length can be calculated provided the system can locate both the head and tail of a single fish in one image. The distance to a fish is calculated from the disparity exhibited by a particular feature of the fish, requiring that that feature be identified in both images of a stereo pair. Hence in order to determine the actual size of a fish from a stereo pair, the system must be able to locate both the head and tail of a single fish in one image and either the head or tail of the same fish in the other image. Of course, more information can be put to use: if both the head and tail are present in both images, the system can double-check or average two apparent length measurements and can obtain estimates of the distances to the head and tail independently.

Dimensions are calculated as distances between pairs of features, thus the stereo procedure is provided with the identification numbers of those features which delimit the particular dimension desired. For example, if the heads, tails, dorsal fins and ventral fins have been located and are identified by the numbers 1, 2, 3 and 4 respectively, the length of the fish is determined by having the stereo procedure find the distance between features 1 and 2, and the width of the fish can be determined from the same lists of matches by having the stereo procedure calculate the distance between features 3 and 4. By doing this, the system can perform all the matching and verification at once, regardless of how many dimensions are being calculated, and avoids redundancy, particularly when a single feature delimits more than one dimension.
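Assuming the parallel-axis stereo geometry of Appendix A and the perspective projection of Appendix B, the two conversions reduce to a pair of one-line formulas; the camera parameters in the example below are invented purely for illustration:

    def distance_from_disparity(disparity_px, focal_px, baseline):
        """Parallel-axis stereo (cf. Appendix A): depth Z = f * b / d,
        with the focal length f expressed in pixels and the baseline b
        in whatever unit is desired for Z."""
        return focal_px * baseline / disparity_px

    def actual_length(apparent_px, distance, focal_px):
        """Perspective projection (cf. Appendix B): an object spanning
        s pixels at depth Z is roughly s * Z / f long, assuming it lies
        roughly parallel to the image plane."""
        return apparent_px * distance / focal_px

    # Example: a 175-pixel fish at disparity 64 px, with f = 800 px and
    # a 10 cm baseline, lies at 800 * 0.10 / 64 = 1.25 m and is thus
    # about 175 * 1.25 / 800 = 0.27 m long.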
The stereo procedure begins by searching for an instance of a head in the left image[17] and, if one is found, searches for other heads on the same epipolar line of the same image. Two features are considered epipolar if their vertical separation (in pixels) is below a pre-set limit or if this separation would be below the limit had the two features been matched to the same model. This second test is required because one instance of a feature may, as a result of different segmentation, have its centrepoint (which is considered its location) at a slightly different level than some other instance of the same feature. If these two features were matched to two models with similarly-displaced centrepoints, the system would scale the vertical displacement between the centrepoints of the models by the ratio of the image arclength to the model arclength and examine the locations again with this offset taken into account (see Figure 4.7).

[17] This explanation will use the concrete example of estimating the length of fish by identifying heads and tails for ease of understanding; the system will of course calculate the dimension between any two arbitrary features.

Figure 4.7: Two instances of the same feature at different resolutions. The right-hand instance is located above the left-hand instance even though both features occupy the same region of the image.[18] Provided these two instances matched two models with similarly-displaced centrepoints, the offset d would be taken into account and the two would be considered epipolar.

[18] Recall that for super-segments of even cardinality, the location of the super-segment is that of the middle vertex; for super-segments of odd cardinality, the location is the midpoint of the middle segment.

If more than one verified head is found on the current epipolar line of this image, the system ignores this line and searches for another head elsewhere in the image. It does this because if multiple heads were verified on the same epipolar line, the information on that line will likely be difficult to interpret reliably; although the system does attempt disambiguation, it is preferable to begin with at least one unambiguous feature. Furthermore, as was mentioned above, there should be enough images available that there is no need to go to great trouble or computational expense to interpret particularly difficult data, especially when such data is less likely to provide reliable results.[19] Note that the system accepts unverified features because they often supply useful information provided the missing information is available elsewhere (as will be explained shortly).

[19] With improved disambiguation and record-keeping, this shortcoming could be addressed in future work.

Assuming only the one verified head is found on the current epipolar line, the system examines its list of supporting features to determine if it was verified by one or more tails. The right image is then searched for corresponding heads and tails.
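A sketch of the epipolar test with the centrepoint-offset compensation of Figure 4.7; the attribute names are illustrative, not those of the actual implementation:

    def epipolar(feat_l, feat_r, limit):
        """Treat two features as epipolar if their vertical separation
        is within `limit` pixels, or would be once differently-placed
        model centrepoints are compensated for.  Each offset is the
        model centrepoint height scaled by the image/model arclength
        ratio (cf. Figure 4.7)."""
        dy = feat_l.y - feat_r.y
        if abs(dy) <= limit:
            return True
        off_l = feat_l.model_centre_y * (feat_l.arclength / feat_l.model_arclength)
        off_r = feat_r.model_centre_y * (feat_r.arclength / feat_r.model_arclength)
        return abs(dy - (off_l - off_r)) <= limit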
Of course, the ideal situation would be to locate a single head supported by a single tail in the left image and the corresponding head and tail in the right image. If this happens, the system obtains two independent values for the disparity, one from each pair of corresponding features, and two independent estimates of the apparent length, one from each image. The difference between the two disparities and the percent difference between the two length estimates must be below pre-set limits to help detect false matches. Assuming the values are acceptable, the system creates a new entry in a linked list of "objects" and records the two disparities, the average of the two apparent lengths and the features involved. The two apparent lengths are averaged because the system has no way to determine which one is more accurate; if it is found to be advantageous, the system can be made to choose the larger length on the assumption that contortions may make the fish look shorter than it really is but never longer. Both disparities are retained because the two features could legitimately be at different distances from the cameras.

Unfortunately, factors such as multiple objects, occlusion, noise, poor contrast and the like often result in more or fewer features being detected. If fewer features are detected, the system attempts to make use of the information that is available. Recall that in order to calculate an actual length, the system must locate both the head and tail of a single fish in one image (to obtain apparent length) and either the head or tail of the same fish in the other image (to determine distance). With this in mind, if the system locates a head with no corresponding tail (i.e., an unverified head) in the left image, it examines the epipolar line in the right image for a corresponding head which has been associated (verified) with at least one tail. If it finds such a head, it can attempt to ascertain the apparent length of the fish from the head and tail in the right image and the disparity from the two corresponding heads. Similarly, if the head located in the left image was verified by a tail, the system can calculate the apparent length from these features and the disparity given only an unverified head or tail in the right image. Of course, when one of the features is missing, the system has no way to double-check the accuracy of either the length or the disparity estimate, nor can it determine whether both the head and tail are the same distance from the cameras. Nonetheless, the ability to make use of all available information is important: even in the absence of noise and occlusion, frequently the disparity between a pair of stereo images will alone result in a feature being absent from one image while the entire object is visible in the other image. It is for this reason that the stereo system examines unverified features.

The opposite situation occurs when the recognition system locates more features than are expected, as would happen if one fish were partially hidden behind another fish such that only part of it were visible. If this occurs, the system has three procedures by which it identifies the features that most likely offer the correct information. These procedures are based on the following criteria:

Disparity Range Because disparity is a function of distance, the range of acceptable disparities specifies the range of distances within which the objects are either confined or likely to be found.

Discrepancy Limit This limit specifies the maximum allowable difference between the disparity of a pair of heads and that of the corresponding pair of tails. Limiting this discrepancy helps eliminate false matches and gives the user some control over the range of orientations that will be accepted.
The first procedure is used when at least one head is found in one image and more than one epipolar head is found in the other image. The disparity is calculated for each possible pair of heads in the two images; if only one pair yields a disparity within the allowed range, those two heads are considered correct. This procedure is particularly valuable if the objects are confined to a region such that it is impossible (rather than just unlikely) for the disparity to be outside the specified range.

The second procedure is called upon when an uncontested disparity has been obtained for a pair of heads but one or both of the images has more than one tail associated with the head. This situation could easily occur if one fish is partially occluded by another fish such that only its tail is visible, resulting in the head and tail of the closer fish appearing in both images and the tail of the more distant fish appearing in at least one image (see Section 5.3). If this occurs, the disparity between the two more distant tails should be significantly smaller than that of the heads simply because the tails are more distant than the heads. Similarly, the "disparity" between a closer tail and a more distant tail should also differ significantly from the disparity of the heads because it reflects a shift in the absolute location of one of the features in addition to a shift in camera position. Thus only the disparity between the two correct (closer) tails should be sufficiently close to the disparity between the heads. To identify these two tails, the disparity is calculated for every contending pair of tails and, if the disparity of exactly one pair of tails matches the disparity of the heads to within the allowed discrepancy, that pair of tails is chosen.

The final procedure is used if the recognition system has located one head in the left image, one or more epipolar heads in the right image and one or more corresponding tails in both images. This procedure mates the head in the left image with each head in the right image, calculates the resulting disparity, then tests every possible pair of tails to determine which head (in the right image) and pair of tails provide a disparity match within the allowed disparity range and discrepancy. Once again, if exactly one pair of heads and one pair of tails are found to satisfy these criteria, these pairs are chosen. This same procedure could in principle be used in the event that multiple heads and multiple tails were found in both images, but this situation cannot arise because the system discards any occurrence of multiple epipolar verified heads in the left image. As was mentioned earlier, such a collection of features is unlikely to be interpreted reliably when every feature is open to multiple interpretations.

In keeping with the reliability requirement, each of these procedures succeeds only if exactly one allowable match is found: if more than one is found, the system cannot reliably select any one and discards the features.

The latter two of these disambiguation techniques rely on the disparity between tails equaling the disparity between heads to within the allowed discrepancy. This implies that the heads and tails are expected to be found at roughly the same distance from the cameras, meaning the fish are roughly parallel to the image plane.[20]

[20] The routine which compares the disparity between heads to that between tails when exactly one pair of each was found makes the same assumption.
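The second procedure, for instance, reduces to a uniqueness test over candidate tail pairs; a sketch follows, with disparity taken as x_left - x_right and illustrative field names:

    def disambiguate_tails(head_disparity, tails_left, tails_right, discrepancy):
        """Try every contending pair of tails and accept a pair only if
        it is the unique pair whose disparity matches the heads'
        disparity to within the allowed discrepancy."""
        candidates = [(tl, tr)
                      for tl in tails_left
                      for tr in tails_right
                      if abs((tl.x - tr.x) - head_disparity) <= discrepancy]
        # Reliability rule: succeed only when exactly one pair qualifies.
        return candidates[0] if len(candidates) == 1 else None

Returning nothing when several pairs qualify is exactly the "exactly one allowable match" policy: an ambiguous measurement is discarded rather than guessed at.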
While this situation can generally be arranged when videotaping fish for the NIFE system, it may be difficult or impossible to arrange with other objects in other applications. In such applications, the allowed discrepancy may have to be increased in order for any feature pairs to be accepted. If the objects are free to assume any orientation, the only solution may be to set the limit to the maximum possible discrepancy, presumably allowing a difference in distance almost equal to the length of the objects. Although this would appear to defeat the purpose of having such a limit, the limit may still serve to eliminate preposterous pairings between falsely-matched features and would also give the user some control over the range of orientations which will be accepted.

Finally, there is an undesired side-effect to these arrangements: the choices of which feature and image are examined first can affect which particular fish are measured. Assume the recognition system has located two verified heads and one tail on a given epipolar line in the left image and one head and one tail on the same epipolar line in the right image. If the stereo system is instructed to begin by searching the left image for heads, it will find two on the same epipolar line and will discard this line because it is too difficult to interpret reliably. However, if the system begins by searching the left image for tails, it will find the one tail and will attempt to assess the size of that fish. If the right image is examined first, it makes no difference which feature is sought first because the stereo system will find one of either feature and will attempt to determine the size of the fish. Similarly, had no head been found in the left image, a search for heads in this image would yield nothing while a search for tails or a search of the right image would likely result in the fish being evaluated. Nonetheless, the choices of which feature and image to examine first are arbitrary: the goal of the system is to provide a statistical estimate of the size distribution and it is assumed that any effects of these decisions will vanish in any statistically significant sample.

The information returned to the user consists of a list of those feature pairs which the system considers to comprise objects, the apparent distance between these pairs of features (in pixels) and either one or two disparity values (in pixels), depending on which features were located. Once the system has been calibrated, the disparity values can be converted into distance estimates (using the equations of Appendix A) and, once this is done, the apparent lengths can be converted into actual lengths (using the equations of Appendix B). For development and testing purposes, the current system also provides both lists and images showing the candidate features identified in each image and those candidate features that were successfully verified.

CHAPTER 5

RESULTS
That’s why ravenchicks are such a delight to their parents and mother apes find their babies exquisitelybeautiful.- Thomas More, “Utopia”This section provides some examples which illustrate the successes and shortcomings ofthe system. Model and test images for the first three examples were created using a set ofblack paper cutouts of an American Shad and a Requiem Shark.’ These were placed on asheet of textured Styrofoam and imaged from above. All models were created from differentsegmentations of two images, one of the shad and one of the shark. The models show the fishtraveling from right to left and boundaries are traversed clockwise. The images used to generatemodels were not used in subsequent testing.The images for the fourth example were captured in a commercial sea cage using the NIFEapparatus (described in Section 3.1). These images show Pacific salmon; models were generatedfrom other images captured in the same tank.5.1 EXAMPLE #1The first stereo pair was created from a single image using a graphics editor. The originalimage was produced by photocopying the American Shad at three different magnifications,placing the three resulting cutouts on a sheet of white Styrofoam and capturing an image withthe Styrofoam sheet roughly parallel to the image plane. This image was copied and, using agraphics editor, the three fish in the copied image were shifted to the right as they would haveoutlines were photocopied from [Sau92].63Chapter 5: Resultsbeen had a second image been captured simultaneously by a camera located slightly to the leftof the camera used. Thus a simulated stereo pair was created. To show that the system istolerant to imprecise camera calibration, the fish in one image were shifted vertically by smallamounts. One of the fish was also slightly shortened in one image. The graphics editor wasthen used to clutter both images with various shapes, some random and some resembling partsof fish. The resulting images are shown in Figure 5.la.The program was supplied with two models of the shad head and one model of the tail.The system segmented the images at two resolutions and sought all super-segments consistingof between three and five segments.The results of matching stage are displayed in Figure 5.lb and detailed (for the right image)in Table 5.1, both generated by the matching procedure. The images show that the matchingprocedure successfully located all the correct features plus a few of the added features. In theright image, one of the added features resembling a head and one resembling a tail were falselymatched (second head and tail from the bottom).These images illustrate the value of multiple models: the head of the middle fish (see leftimage) does not show any mouth detail while the upper and lower heads show the mouthclearly. Correspondingly, the middle head (feature #5 in Table 5.1) was matched to model #0while the other two heads (features #1 and #4) were matched to model #1.2 Because they areequivalent features, all heads have feature ID #1 (column #4).Figure 5.lc and Table 5.2 show those matches that were successfully verified. All falsematches were correctly identified and removed.Table 5.3 displays the final output of the system: the apparent lengths and associateddisparities of the three objects identified in the images. 
Note that the disparities of the head and tail of the second fish differ because this fish was shortened slightly in one image only.

The locations of the six verified features indicate that the epipolar alignment was off by between 7 and 11 pixels between the two images, a misalignment that would have hampered the performance of correlation-based stereo.

By segmenting only those contours over 24 pixels in length,[3] the number of contours was reduced from 621 to 115 in the left image and from 760 to 183 in the right. Running on a Sparc2 workstation, the system required 22.8 seconds including the time to generate and store the diagnostic images and information.[4] Adding an extra head model increased this time requirement by only 0.7 seconds, but the extra model resulted in a false match which could not be reliably disambiguated, thus only two of the three fish were successfully measured.

[3] The minimum length is set upon execution.
[4] Because linked contour images form the input to the system, times given in this paper do not include the time required for edge detection or linking. These typically required 18 and 0.5 seconds respectively.

Table 5.1: Candidate matches: The left # column provides reference numbers for the features, the Edges column refers to edge numbers in the image of the matches, model gives the number of the model matched, ID shows the feature identification number of the match, seg gives the number of the starting model segment of the match, res tells at which resolution the feature was matched, nsegs gives the cardinality of the match and score provides the cumulative discrepancy. The two right-hand columns give the location of the feature match in the image and the model.

 # | Edges       | Model | ID | Seg | Res | nsegs | Score | image_centre   | model_centre
 1 | 0 1 2       | 1     | 1  | 0   | 3   | 3     | 3.30  | (6.5, 466.0)   | (32.5, 305.5)
 2 | 3 4 5       | 0     | 1  | 0   | 3   | 3     | 15.37 | (226.5, 293.0) | (34.5, 309.5)
 3 | 6 7 8       | 0     | 1  | 2   | 3   | 3     | 17.45 | (71.5, 199.0)  | (34.5, 309.5)
 4 | 9 10 11     | 1     | 1  | 0   | 3   | 3     | 8.18  | (133.5, 146.0) | (32.5, 305.5)
 5 | 12 13 14    | 0     | 1  | 0   | 5   | 3     | 21.84 | (84.0, 325.5)  | (34.5, 309.5)
 6 | 15 16 17    | 2     | 2  | 2   | 3   | 3     | 12.81 | (29.5, 427.0)  | (433.5, 273.0)
 7 | 18 19 20    | 2     | 2  | 3   | 3   | 3     | 15.68 | (144.0, 383.0) | (425.0, 208.5)
 8 | 21 22 23    | 2     | 2  | 2   | 3   | 3     | 11.20 | (378.5, 362.5) | (433.5, 273.0)
 9 | 24 25 26    | 2     | 2  | 0   | 3   | 3     | 9.71  | (311.5, 219.5) | (433.5, 273.0)
10 | 27 28 29 30 | 2     | 2  | 0   | 5   | 4     | 19.03 | (180.5, 453.5) | (409.5, 242.5)
11 | 31 32 33 34 | 2     | 2  | 0   | 5   | 4     | 7.44  | (331.5, 283.5) | (409.5, 242.5)
12 | 35 36 37 38 | 2     | 2  | 0   | 5   | 4     | 20.13 | (333.5, 152.5) | (409.5, 242.5)

Table 5.2: Verified matches: This table shows the details of each feature successfully verified. The Edges column gives edge numbers in the image of verified features while the # column corresponds to the list of candidate matches (Table 5.1).

 # | Feature | Model | Edges       | image_centre
 1 | head3   | 1     | 0 1 2       | (6.5, 466.0)
 4 | head3   | 1     | 3 4 5       | (133.5, 146.0)
 5 | head2   | 0     | 6 7 8       | (84.0, 325.5)
10 | tail    | 2     | 9 10 11 12  | (180.5, 453.5)
11 | tail    | 2     | 13 14 15 16 | (331.5, 283.5)
12 | tail    | 2     | 17 18 19 20 | (333.5, 152.5)

Table 5.3: Measured objects: The apparent length is given in pixels, the two disparities (in pixels) are of the two pairs of features (heads & tails) and the remaining four columns indicate which features were used to obtain these measurements (feature numbers refer to Table 5.1).

App_Length | disp1 | disp2 | f1l | f2l | f1r | f2r
174.4      | 16.0  | 16.0  | 1   | 9   | 1   | 10
201.7      | 70.0  | 73.0  | 3   | 11  | 4   | 12
251.0      | 64.0  | 64.0  | 5   | 10  | 5   | 11

Figure 5.1a: The left and right image contours for example #1.
Figure 5.1b: Candidate matches from the left and right images.
Figure 5.1c: Verified matches from the left and right images.

5.2 EXAMPLE #2

The images in the second example contain both a shad and a shark and were created using a pair of stereo cameras. The images are shown in Figure 5.2a.

Six models were provided: three shad heads, one shark head and one of each tail. The system segmented the image at two resolutions and sought all super-segments with cardinality between three and five.

Table 5.4: The four verified matches in the left image.

 # | Feature  | Model | Edges     | image_centre
 1 | shktail1 | 3     | 0 1 2 3 4 | (438.0, 362.5)
 2 | shkhead1 | 2     | 5 6 7     | (253.0, 373.5)
 3 | tail     | 1     | 8 9 10 11 | (454.5, 183.5)
 4 | head2    | 0     | 12 13 14  | (298.5, 205.0)

Table 5.5: The two objects identified in the images.

App_Length | disp1 | disp2 | f1l | f2l | f1r | f2r
161.4      | 241.5 | 233.0 | 4   | 3   | 3   | 4
185.7      | 225.0 | 224.0 | 2   | 1   | 2   | 1

Figure 5.2b shows the candidate feature matches. Only the correct image features were identified during the matching stage, hence the images of verified features are identical. Examination of the polygonal approximations reveals that the shad outline was traversed counter-clockwise in the right image, requiring the system to reverse its traversal of the models to locate this fish. The list of verified matches (see Table 5.4) shows that the four required features were all correctly identified. The resulting measurements are shown in Table 5.5.

Once again, to show that the system is tolerant of imprecise camera calibration, the cameras were crudely adjusted by hand while viewing their output on a monitor. The locations of the four verified features show epipolar misalignment of up to 8.5 pixels.

Segmenting only contours over 24 pixels in length reduced the number of contours from 238 to 47 in the left image and from 325 to 68 in the right, resulting in an average time requirement of 3.0 seconds. Diagnostic information indicated that only four of the six models were used; eliminating the two unused models reduced the time requirement to an average of 2.7 seconds.

Figure 5.2a: The left and right image contours for example #2.
Figure 5.2b: Candidate matches from the left and right images; all are correct.

5.3 EXAMPLE #3

This image pair was created by placing the shad horizontally and the shark diagonally on the Styrofoam. The shark was also raised slightly to give it a different disparity. To demonstrate stereo disambiguation, an extra shad tail was placed behind the shad, thereby simulating a partially occluded shad. The Styrofoam sheet was then cluttered with random laboratory junk and the scene was captured with a pair of stereo cameras. The images are shown in Figure 5.3a.

The system was provided with two models of a shad head, one model of a shark head and one model of each tail. The shad models were declared orientation-dependent, exploiting the horizontal orientation of the shad, while the shark models were declared orientation-independent to identify the shark in its diagonal orientation.
Once again, images were segmented at two resolutions and all super-segments with cardinality between three and five were sought.

Table 5.6: The size and disparity estimates.

App_Length | disp1 | disp2 | f1l | f2l | f1r | f2r
166.4      | 192.5 | 198.0 | 1   | 15  | 1   | 56
305.8      | 254.5 | 252.0 | 38  | 18  | 45  | 24

The images of candidate matches (see Figure 5.3b) show both the correct matches (all correctly identified) and many false matches. The false matches are caused partially by the many random objects, some of which resemble fish parts when segmented, and partially by the many models and the arbitrary orientation permitted for the shark models. For example, if the model of the shark tail is rotated roughly 150°, three of its segments match the dorsal fin of a shad very closely (see Figure 5.4). This accounts for the matched shad dorsal fin in the left image.

Inspection of the polygonal approximations reveals that the shark was traversed counter-clockwise in the right image and the segments constituting the shark head were numbered erratically in both images. The shad in this image pair is reflected about the vertical axis relative to the shad models.[5] The system handled all these abnormalities without difficulty.

[5] Recall that, to accommodate fish crossing the cameras in both directions, this is the only orientation difference allowed without penalty in orientation-dependent matching.

The verifier managed to eliminate most of the false matches from the left image, but several remained in the right image because of coincidental image features (see Figure 5.3c). However, these false matches do not affect the final size information; all are eliminated during stereo processing either through disambiguation or because there are no corresponding features in the left image. The resulting size information is shown in Table 5.6.

Examination of the diagnostic information reveals that the stereo disambiguation procedure selected the correct pair of tails by comparing the disparities of all possible pairs of shad tails to the disparity of the heads and finding only the correct pair of tails within the allowed disparity discrepancy.

These images contained 521 (left) and 570 contours (reduced to 113 and 146 respectively) and required 11.2 seconds to analyze. Supplying three extra models increased the time requirement by only 0.8 seconds but once again yielded a false match that could not be disambiguated, resulting in measurements for only one of the two fish.

Figure 5.3a: The left and right image contours for example #3.
Figure 5.3b: Candidate matches from the left and right images.
Figure 5.3c: Verified matches from the left and right images.
Figure 5.4: (a) A portion of shark tail rotated 150°. (b) Shark head. (c) Shad dorsal fin. Segments with corresponding numbers could be confused, thereby causing false matches even when several adjacent segments are examined.

5.4 EXAMPLE #4

The images for the final example were captured at a salmon farm using the NIFE equipment. The fish are Pacific Salmon and the only change to their environment was the addition of a light-coloured backdrop which the farmers said would not disturb the fish at all.
The images are shown in Figure 5.5a and the resulting contours in Figure 5.5b.

Two models, one head and one tail, were generated from a fish found in another image. This fish was chosen for the models because it appeared similar to the fish in the test image, but the resulting models were not altered in any way to aid the system.

The system parameters were set as for previous examples except that the matching threshold had to be increased by roughly a factor of two in order to identify necessary features.

The images of candidate matches (Figure 5.5c) show that the head and tail of the uppermost fish were correctly identified in both images, as were two other tails in the left image and one head in the right image; the remaining matches are incorrect. The large fish (centre of left image) was not measured because its tail is partially occluded and its head is missing in the right image. Although the head and tail likely could be identified in the left image if more models were provided, this fish cannot be evaluated because neither its head nor tail is identifiable in the right image.

The verification stage removed half of the incorrect matches in both images (see Figure 5.5d). The false matches do not lead to incorrect measurements because for each false match in one image, there is no corresponding feature in the other. The system provides the correct values for the one fish correctly identified in both images:

App_Length | disp1 | disp2 | f1l | f2l | f1r | f2r
176.6      | 106.5 | 95.5  | 1   | 14  | 1   | 10

The system analyzed 217 of 775 segments in the left image and 235 of 1156 in the right image; the time required was 15.3 seconds including the time to generate and store the diagnostic information (approximately 0.9s).

Figure 5.5a: The left and right images for example #4.
Figure 5.5b: The resulting contours for example #4.
Figure 5.5c: Candidate matches from the left and right images.
Figure 5.5d: Verified matches from the left and right images.

5.5 GENERAL OBSERVATIONS

A number of general trends became apparent while working with the system:

• To minimize false matches, the shortest acceptable match (lowest cardinality) should generally be three segments. It is unlikely for three segments to join in a configuration similar enough to a model to produce a false match. However, when the minimum required cardinality was reduced to two, the number of false matches resulting from coincidental arrangements increased sharply. Although this does not prevent the system from identifying the correct features, the system may fail to measure identified objects if the false matches cannot be successfully eliminated or disambiguated.

• Setting the matching threshold too high or providing extra models likely increases the number of false matches. Once again, even if the desired features are correctly matched, the false matches may result in a loss of measurement information if they cannot be successfully eliminated or disambiguated.

• The system locates matches much more successfully when there are sharp angles between successive segments in the model shapes. Such angles provide well-defined breakpoints between segments, resulting in reliable segment lengths and angles. During testing, the system located the highly-angular shad tails and shark tails very reliably despite having only one model of each. The system had difficulty identifying the much more rounded shad head even with multiple models because the lack of sharp angles leads to somewhat arbitrary breakpoints between segments, hence lengths and angles are less reliable.

• The shark head illustrates a difficulty with the system.
Because the two long segments running from the top of the shark ventral fin to the front of the head (see Figure 5.4b, segments 2 and 3) form an almost-straight line, they are often approximated as a single straight segment. When this happens, the shark head becomes a two-segment shape resembling a "V" on its side and, consisting of only two segments, is ignored when the minimum required cardinality is three. The simple solution is to add a two-segment model of the head and reduce the minimum required cardinality to two. Unfortunately, as was mentioned above, lowering the minimum cardinality to two will likely increase the incidence of false matches.

• The majority of unexplained false matches[6] are to small features in the image; rarely are features that appear large in the image falsely matched without explanation.

[6] Matches that cannot be explained by obvious similarity to one of the models.

• Although the system did sometimes fail to measure objects in images, never did features in an image pair conspire to yield measurements for a non-existent object.

5.6 COMPLEXITY

The time complexity is difficult to calculate because it depends not only on the number of image segments but also on their configuration and shape. It is clear, however, that the most time-consuming operations are, in order, matching and segmentation, which process image segments and contours respectively. The verification and stereo procedures, which examine matched features (hence less information), complete their tasks almost instantly. The execution times for the examples above are given in Table 5.7.

Table 5.7: Time requirements for the examples above. Note that examples 1 and 3 have similar numbers of segments but example 1, even with half as many models, requires roughly twice the time.

Example | Contours[7] | Models | Segmentation Time (s)[8] | Total Time (s)
1       | 298         | 4      | 4.3                      | 22.8
1       | 298         | 3      | 4.3                      | 23.5
2       | 115         | 6      | 1.1                      | 3.0
2       | 115         | 4      | 1.1                      | 2.7
3       | 259         | 8      | 2.2                      | 12.0
3       | 259         | 5      | 2.2                      | 11.2
4       | 452         | 2      | 2.0                      | 15.3

[7] Total number of contours over 24 pixels in length in both images.
[8] The total time to segment both images at both resolutions; does not include the time to read in or store the images.

Because the recursive matching procedure traverses all possible paths through connected segments, its time requirement is higher if many of the segment endpoints lie close to one another, resulting in many possible paths and longer paths. On a random image, the time required to locate candidate matches should increase with the number of image segments, itself a function of the segmentation resolution and the number of resolutions analyzed, and the number of model segments.

The time requirement of the segmentation procedure, which increases with the number of contours in the image, is reduced by segmenting only those contours exceeding a pre-set minimum length. However, it also depends on the length and shape of the contours. The time required to locate the point of maximum deviation on a contour increases with the number of pixels in the contour, thus shorter contours are processed more quickly. Also, straighter contours are generally divided into fewer straight segments and are hence segmented more quickly than contours with many curves or high curvature.

Interestingly, the number of models has only a very small effect on the time requirement. Even if the segmentation time and I/O time are subtracted from the total to leave only the matching time, the time difference between executions with different numbers of models is not explained by the number of models.
This result indicates that several models can be included to increase robustness without causing unreasonable execution times.

A number of system parameters can also affect the time requirement:

• Higher-cardinality matches generally take longer to construct and involve the examination of more paths. However, shorter matches cause the system to attempt more matches because fewer model segments are required to achieve the minimum cardinality.

• Increasing the segmentation resolution increases the number of segments involved in matching. Of course, each segmentation analyzed involves matching two new images (left and right).

• Increasing the radius within which two segment endpoints are considered connected increases the number of connected segments and hence the number of paths that must be attempted.

• Reducing the minimum required contour length results in more contours being segmented and analyzed.
The information returned is easily converted todistance and size information using the equations of stereopsis and optics.With many features and promising performance, the current system provides an excellentframework for research and development in recognition and stereo.6.2 FUTURE WORKYou never really finish a master’s thesis, you just decide to stop working on it.- Scott FlinnThe first step in improving the system would be to make the recognition stage more robust.A number of steps exist which could give the matcher more power and increase the amount ofinformation brought to bear upon matching within the existing framework.• The system has difficulty calculating lengths and angles reliably without fairly sharpangles between segments. This difficulty arises because, as the angle between two segmentsapproaches 180°, meaning the two segments form a fairly straight line, the location of thebreakpoint between them becomes somewhat arbitrary. To remedy this, the angle betweensegments could be used to estimate the accuracy of calculated values and the resultingdiscrepancies could be weighted accordingly. Thus, as the angle between two segmentsapproaches 1800, the system could reduce the effect of the resulting discrepancies onthe cumulative error of the match to reflect the fact that the values are not as reliable.Conversely, if two segments met at a relatively sharp angle such as 90°, the effect of theresulting discrepancies could be magnified. This weighting could be built into the models,where the angles between segments remain constant, or calculated for each pair of imagesegments.Chapter 6: Conclusions Future Work 80The segment-by-segment matching strategy likely fails if a segment is missing from theimage or model (see Figure 6.la). Although this situation could be accommodated byproviding more models, it could be handled better by aflowing the matcher to “lookahead” one segment if a promising match, perhaps two or more segments long, encountersan image segment that does not match the current model segment. If the following imagesegment matched the model segment, the system could assume that a segment was missingand continue the match. Ideally the matcher could skip either an image or a modelsegment because the segment could be missing from a or an image.• A problem similar to that described above arises if two segments in the model appear asone segment in the image or vice-versa, as could happen if the two segments meet at closeto 1800 (see Figure 6.lb). Rather than provide another model or reduce the minimumcardinality, the matcher could combine two such segments and compare the resultingsingle segment to the corresponding image or model segment.• Having two distinct modes for recognition is somewhat restrictive and may neglect valuable information about objects. During orientation-dependent matching, penalizing apotential match for any deviation from the model orientation may be too strict in all butthe most controlled environments.1 Conversely, during orientation-independent matching, the freedom to rotate models arbitrarily is unnecessarily permissive and discardsvaluable information: Atlantic salmon may swim randomly in all directions but theydon’t swim upside-down. Furthermore, this arbitrary rotation can cause false matches(see Figure 5.4). Instead of these two distinct modes, each model could specify an allowable range of orientations for that object; this range could be as restrictive or permissiveas the objects and environment dictate. 
• Because most false matches involve very small image features, a minimum "size" for matches could be specified either for the entire matching process or within each model. This limit would reflect both the size of the feature and the fact that, as an object moves away from the cameras and gets smaller, the likelihood of correctly identifying corresponding features and deriving accurate information about it decreases.

Figure 6.1: (a) This shad head illustrates the problem of missing segments: if the marked segment in the left-hand shape were "skipped," the two shapes would match very closely. (b) This problem with the shark head was mentioned in Section 5.5: if the two marked segments in the left shape were combined, the resulting single segment would yield a perfect match between the two shapes.
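To illustrate the angle-dependent weighting proposed in the first item of the list above, here is a minimal sketch. The function names and the particular fall-off curve are illustrative assumptions, not part of the thesis system:

```python
def breakpoint_reliability(angle_deg: float) -> float:
    """Reliability of values computed at the joint between two segments.

    Near-straight joints (angle close to 180 degrees) have arbitrary
    breakpoints, so their discrepancies should count for little; sharp,
    well-localized joints should count at full strength.  The linear
    fall-off used here is one plausible choice, not the thesis's.
    """
    return max(0.0, min(1.0, (180.0 - angle_deg) / 90.0))

def weighted_match_error(discrepancies, angles_deg):
    """Accumulate discrepancies, each scaled by its joint's reliability."""
    return sum(d * breakpoint_reliability(a)
               for d, a in zip(discrepancies, angles_deg))
```

A match's cumulative error would then discount discrepancies measured at nearly-straight joints while keeping penalties at sharp, well-localized joints at full strength.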
In addition to these relatively simple modifications, there are changes that would mandate substantial alterations to the recognition system.

• The efficiency of matching could be increased by indexing the models and candidate matches; a sketch of such an index follows this list. Stein ([Ste92]) uses such indexing for fast retrieval of models and to group clusters of related or consistent matches.

• With an appropriate indexing scheme, matching could proceed by examining each image segment or group of segments and retrieving only similar models which are likely to yield a match. This would be much more efficient and elegant than the current brute-force method of comparing each group of image segments to each model. In addition, this would be similar to the human vision system, which can recognize similarities between objects and can provide at least a range of possible explanations given restricted knowledge or a partial view of an object.

• Instead of searching for features as though each one stands alone in the image, recognition could be guided by features already identified. If a fish tail is located in an image, the system can estimate where the corresponding head and, if desired, dorsal and ventral fins should be. These estimates could be used to influence which models are examined or which feature is chosen in the event of multiple matches.

• Using previous matches to guide recognition (discussed above) could be extended to a general model-fitting approach. This would also provide a more accurate means of verification. However, model-fitting would be difficult because of the shape differences between fish.

• Saund ([Sau92]) describes a shape representation designed to capture subtleties between similar shapes by detailing many geometric properties and spatial relationships. The representation is a hierarchical system in which low-level features are combined to create a vocabulary of abstract features. Features are indexed by location and scale on a "Scale-Space Blackboard" to expedite matching. Saund's sample shape domain is the dorsal fins of different species of fish. Incorporation of such a representation would be necessary if the system were to distinguish between similar objects. The NIFE system could use it to discriminate grilse, salmon undergoing precocious maturation, from fish maturing normally; the ratio of grilse to "normal" fish is an important factor affecting harvest dates.

• Because the video apparatus provides continuous stereo footage, the system could incorporate motion sequences in its analysis. This could aid in feature recognition because real features would tend to retain their state of motion from image to image while spurious features and coincidental alignments would vanish over time. Analysis of motion sequences could also help interpret partially occluded objects. In addition, if the objects being recognized travel at known speeds, the camera-to-object distance can be calculated from the known distance traveled between images and the apparent distance traveled between images (in pixels). This would provide distance information from single images (i.e., without stereo). Unfortunately, because fish swim at different speeds, the distance to a fish would still be required to calculate the actual distance traveled between images, and vice versa.
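To make the indexing idea in the first item concrete, the following is a minimal sketch of a hash-style index over quantized segment descriptions, in the spirit of Stein's structural indexing ([Ste92]). The key format, bin sizes and class names are illustrative assumptions only:

```python
from collections import defaultdict

# Bin sizes for quantizing a (relative length, angle) chain description.
# These values are arbitrary placeholders, not tuned system parameters.
LENGTH_BIN = 0.25   # relative-length quantum
ANGLE_BIN = 15.0    # degrees per angle quantum

def index_key(rel_lengths, angles_deg):
    """Quantize a chain of adjacent segments into a hashable key."""
    return (tuple(round(l / LENGTH_BIN) for l in rel_lengths),
            tuple(round(a / ANGLE_BIN) for a in angles_deg))

class ModelIndex:
    """Map quantized segment chains to the models containing them."""
    def __init__(self):
        self._table = defaultdict(list)

    def add(self, model_name, rel_lengths, angles_deg):
        self._table[index_key(rel_lengths, angles_deg)].append(model_name)

    def candidates(self, rel_lengths, angles_deg):
        """Return only those models whose chains quantize like this one."""
        return self._table.get(index_key(rel_lengths, angles_deg), [])
```

Instead of comparing every group of image segments to every model, the matcher would quantize each image chain once and compare it only against the candidates returned; neighbouring bins would also have to be probed to tolerate quantization at bin boundaries.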
A few final ideas relate to other aspects of the system.

• The accuracy of size measurements and disparities could be increased by integrating a better way to specify the locations of features. Currently, the location of a feature is determined by the centre of its super-segment. This does not necessarily correspond to the point on the feature which correctly delimits the desired dimensions. Furthermore, if the feature is only partially matched, its centrepoint is shifted accordingly. A better method would be to specify the location of a feature in its model such that the system could properly locate it even in the event of a partial match. While on the subject of dimensions, it might also be advantageous to provide maximum and minimum dimensions of objects as yet another means of eliminating false matches.

• Several aspects of the entire process merit further study. First, we require more information about where the cameras should be positioned in the sea-cages in order to collect images which accurately reflect the fish population. During pre-processing, both the standard deviation (σ) of the Gaussian with which the image is first smoothed and the minimum contour length could have significant effects on the reliability and efficiency of the system. Many more parameters, such as the matching and verification thresholds, the radius within which segments are considered connected, and the permissible disparity range and discrepancy, should also be examined to yield optimum results under different conditions. Most of these can be set upon execution.

• The stereo disambiguation stage could be expanded to become a powerful addition to the system. In addition, images and features could be used symmetrically during stereo analysis to improve reliability and ensure identical results regardless of the order in which images or features are examined. These modifications would likely necessitate more thorough examination of local features and better record-keeping of which features are available and which have been used.

• One last idea that may be somewhat far-fetched is the incorporation of intelligence in the linking stage. If it were possible to favour elongated contours or to guide the linking of intensity contours with descriptions of desired shapes, the linker might be able to extract more complete outlines or mark potential matches, thereby making subsequent recognition easier or faster.

• Finally, if the system were to be used commercially, the current user-hostile interface should be replaced with a snazzy graphical user interface.

APPENDIX A

BINOCULAR STEREO GEOMETRY

Figure A.1: The geometry for binocular stereo imaging using two parallel cameras of focal length f separated horizontally by distance s. We wish to determine the distance d from the focal plane to arbitrary point P.

The diagram above shows the geometry used to determine distance from a stereo pair of images. The cameras have focal length f and are mounted on a horizontal plane with their optical axes parallel and separated by a known distance s. With the cameras arranged thus, a pair of images taken simultaneously by the left and right cameras will have substantial overlap in the centre, with points in the right image displaced to the left relative to corresponding points in the left image. This relative displacement is called disparity, and it varies inversely with the desired distance d, as shown below.

$$x_l = f \cdot \tan(\alpha) \qquad x_r = f \cdot \tan(\beta) \qquad (A.1)$$

$$\tan(\alpha) = \frac{x + s/2}{d} \qquad \tan(\beta) = \frac{x - s/2}{d} \qquad (A.2)$$

$$\text{disparity} = x_l - x_r = f \cdot \left| \tan(\alpha) - \tan(\beta) \right| \qquad (A.3)$$

$$= \frac{f}{d} \cdot \left| \left( x + \frac{s}{2} \right) - \left( x - \frac{s}{2} \right) \right| \qquad (A.4)$$

$$= \frac{f \cdot s}{d} \qquad (A.5)$$

Thus the distance d is given by

$$d = \frac{f \cdot s}{|x_l - x_r|} = \frac{f \cdot s}{\text{disparity}} \qquad (A.6)$$

This derivation applies when disparity is continuous, as when it is measured from stereo photographs. Because disparity takes on discrete values in digitized images, another constant is required which relates distance to the corresponding number of pixels. Thus, for digitized images, the distance d is given by

$$d = \frac{c \cdot f \cdot s}{\text{disparity}}$$

where c reflects the spatial resolution of the imaging hardware.

Note that:

• Disparity varies inversely with distance.

• This result is independent of the location of the point in the image.
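A direct transcription of the digitized form of equation A.6 into code might look as follows. The function name and the guard against non-positive disparity are my additions, and the calibration constant c must come from the hardware calibration procedure:

```python
def distance_from_disparity(disparity_px: float, f: float, s: float,
                            c: float) -> float:
    """Camera-to-point distance via d = c * f * s / disparity (eq. A.6).

    disparity_px -- displacement between matched points, in pixels
    f            -- focal length of the (identical) cameras
    s            -- horizontal separation of the optical axes
    c            -- spatial-resolution constant of the imaging hardware
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point "
                         "in front of the cameras")
    return c * f * s / disparity_px
```

For example, with hypothetical values f = 8 mm, s = 300 mm and c = 100 px/mm, a disparity of 120 pixels places the point 100 · 8 · 300 / 120 = 2000 mm from the focal plane.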
APPENDIX B

IMAGING GEOMETRY

Figure B.1: Geometry relating the actual length $l$ of parallel projection AB to its apparent length $l_i$ in the image.

Figure B.1 shows the projection of object PB onto the image plane through a perfect lens of focal length f. Because PB is oblique to the image plane, the lens images AB, the component of PB lying parallel to the focal plane a distance d in front of it.

From similar triangles,

$$\frac{AB}{A'B'} = \frac{BC}{B'C} \qquad (B.1)$$

$$= \frac{d / \cos\alpha}{f / \cos\alpha} \qquad (B.2)$$

$$= \frac{d}{f} \qquad (B.3)$$

Thus, the actual length of the object in a continuous image is given by

$$l = \frac{d \cdot l_i}{f}$$

In a digitized image the length $l_i$ is given as a discrete number of pixels, thus another constant is required relating length to number of pixels, yielding

$$l = \frac{c \cdot d \cdot l_i}{f}$$

for some constant c which is a parameter of the imaging hardware.

Note that:

• Apparent length in the image is directly proportional to the actual length of the parallel projection of the object.

• Apparent length is inversely proportional to the distance of the object from the camera.

• The focal length and pixel density are constant, leaving distance as the only unknown in the right-hand side of the equation.

• Rather than a single constant c, one will likely require a $c_x$ for the horizontal direction and a $c_y$ for the vertical direction because the aspect ratio of a pixel is seldom 1:1.
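Combining the two appendices, object size follows from two measurements: the disparity fixes d (Appendix A), and d then converts apparent pixel length into actual length (Appendix B). The sketch below chains the two formulas; all numeric values and parameter names are invented for illustration only:

```python
def actual_length(apparent_px: float, d: float, f: float, c: float) -> float:
    """Actual length of a parallel projection: l = c * d * l_i / f."""
    return c * d * apparent_px / f

# Example: size a feature from its stereo disparity plus apparent length.
f, s = 8.0, 300.0   # focal length and camera separation (mm), hypothetical
c_disp = 100.0      # disparity constant (pixels per mm), hypothetical
c_len = 0.01        # length constant (mm per pixel), hypothetical

d = c_disp * f * s / 120.0                       # eq. A.6, 120 px disparity
fish_len = actual_length(apparent_px=180.0, d=d, f=f, c=c_len)
print(f"distance: {d:.0f} mm, length: {fish_len:.0f} mm")  # 2000 mm, 450 mm
```

In practice the two hardware constants would be determined once, during the camera calibration stage, and reused for every measurement.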
BIBLIOGRAPHY

[Can86] J.F. Canny. A Computational Approach to Edge Detection. IEEE PAMI, 8(6):679–98, 1986.

[CSB65] J.M. Cullen, E. Shaw, and H.A. Baldwin. Methods for Measuring the Three-Dimensional Structure of Fish Schools. Animal Behaviour, 13(4):534–43, 1965.

[DR76] Larry S. Davis and A. Rosenfeld. Applications of Relaxation Labeling: Spring-Loaded Template Matching. In Proc. 3rd Int. Joint Conf. Pattern Recognition, pages 591–597, 1976.

[FF65] J. Feder and H. Freeman. Segment Fitting of Curves in Pattern Analysis Using Chain Correlation. AD619525, March 1965.

[Fua91] Pascal Fua. A Parallel Stereo Algorithm that Produces Dense Depth Maps and Preserves Image Features. INRIA Technical Report 1369, January 1991.

[GLP84] W.E.L. Grimson and T. Lozano-Perez. Model-based Recognition and Localization from Sparse Range or Tactile Data. Int. J. Robotics Res., 3(3):3–35, 1984.

[Klo93] G.W. Klontz. Producing a Marketable Fish. Part V: Inventory Techniques. Aquaculture, 10:21–25, 1993.

[MMN89] Rakesh Mohan, Gerard Medioni, and Ramakant Nevatia. Stereo Error Detection, Correction and Evaluation. IEEE PAMI, 11(2):113–120, 1989.

[MP79] D. Marr and T. Poggio. A Theory of Human Stereo Vision. Proc. Roy. Soc. Lond. B, 204:301–328, 1979.

[Neu83] T. Neufeld. The Use of Underwater Stereotelevision to Determine Three-Dimensional Orientation and Distribution of Sea-Cage Salmonids. Proposal for course of study, UBC Bio-Resource Engineering, 1983.

[NPSN93] A. Naiberg, R.J. Petrell, C.R. Savage, and T. Neufeld. Non-Invasive Fish Size Assessment Method for Tanks and Sea Cages Using Stereo Video. In Proc. ASAE Conf. on Techniques for Modern Aquaculture, Spokane, Washington, June 1993.

[Oha93] Andrew Ohara. The Evaluation of Error in a Non-Invasive Salmon Inventory System. Undergraduate thesis, UBC Bio-Resource Engineering, April 1993.

[PL84] T. Pitcher and E. Lawrence. A Simple Stereo Television System with Application to the Measurement of Three-Dimensional Coordinates of Fish in Schools. Behaviour Research Methods, Instruments & Computers, 16(6):495–501, 1984.

[PMF85] S. Pollard, J. Mayhew, and J. Frisby. PMF: A Stereo Correspondence Algorithm Using a Disparity Gradient Limit. Perception, 14:449–70, 1985.

[PMO86] R. Piper, I. McElwain, L. Orme, J. McCraren, L. Fowler, and J. Leonard. Fish Hatchery Management. United States Department of the Interior, Fish and Wildlife Service, Washington D.C., 1986.

[RHZ76] A. Rosenfeld, Robert A. Hummel, and Stephen W. Zucker. Scene Labeling by Relaxation Operations. IEEE SMC, 6(6):420–433, 1976.

[Sau92] Eric Saund. Putting Knowledge into a Visual Shape Representation. Artificial Intelligence, 54(1-2):71–119, 1992.

[Sch82] C.B. Schreck. Stress and Rearing Salmonids. Aquaculture, 28:241–50, 1982.

[SM92] F. Stein and G. Medioni. Structural Indexing: Efficient 2-D Object Recognition. IEEE PAMI, 14(2):125–44, 1992.

[SS78] R.J. Strange and C.B. Schreck. Anaesthetic and Handling Stress on Survival and Cortisol Concentration in Yearling Chinook Salmon, 1978.

[Ste92] Fridtjof J. Stein. Structural Indexing for Object Recognition. PhD thesis, University of Southern California, April 1992.

[Tsa87] Roger Y. Tsai. A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Using Off-the-Shelf TV Cameras and Lenses. IEEE J. Robotics Automat., RA-3(4):323–344, 1987.

[VLA85] Le Van Long and T. Agayama. Photographic Measurement for Obtaining the Length, Aspect and Bearing of Free-Swimming Fish from their Spatial Position. Bulletin of the Japanese Society of Scientific Fisheries, 51(2):191–5, 1985.

[VLAI85] Le Van Long, T. Agayama, and T. Inagaki. A Stereo Photographic Method for Measuring the Spatial Position of Fish. Bulletin of the Japanese Society of Scientific Fisheries, 51(2):183–90, 1985.

[VS72] W.J. Van Sciver. Scale Determination of Unrecognized Undersea Objects by Stereographic Photography. MTS Journal, 6(4):14–16, 1972.
