AN EMBEDDED TREE DATA MODEL FOR WEB CONTENT ADAPTATION

by

YANMING WANG
B. Eng., Beijing Institute of Technology, 1997

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF APPLIED SCIENCE
in
THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 2004
© Yanming Wang, 2004

ABSTRACT

This thesis presents a novel data model named Content Container Tree (CCT) to enable dynamic web content adaptation for heterogeneous terminal devices. The rationale for the CCT data model is that a web page can be visually divided into several segments, and each segment can be iteratively divided into smaller sub-segments until a certain granularity level is reached; the hierarchy of the web page segments can be expressed as a tree structure; thus, by traversing this segmentation tree, the user can locate a particular segment of interest whose dimensions are comparable to the screen size of the terminal device. The CCT data model organizes the semantics of the hierarchical content segments with a tree structure and embeds this semantic tree into the content DOM tree, enabling the user to navigate through the content segments at different levels and retrieve the target content segment according to its semantics. Based on the CCT data model, high-level structural adaptation can be explicitly separated from low-level presentational adaptation, resulting in a highly modular and extensible framework for web content adaptation, which is also proposed in this thesis. The CCT web content adaptation approach is distinguished from other approaches in the following aspects: first, the CCT approach is data oriented rather than process oriented; second, the CCT approach supports multilevel granularity of the contents; finally, the CCT approach is dynamic, so result pages can be generated on the fly by request. As an empirical evaluation, a content adaptation application is built according to the CCT adaptation framework to show the feasibility and effectiveness of our approach.
Table of Contents

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS
CHAPTER 1: INTRODUCTION
  1.1 Motivation
  1.2 Two Levels of Content Adaptation
  1.3 Organization of This Thesis
CHAPTER 2: BACKGROUND
  2.1 Related Works
    2.1.1 Server Side Approaches
    2.1.2 Client Side Approaches
    2.1.3 Intermediary Approaches
  2.2 Enabling Technologies
    2.2.1 XML
    2.2.2 XHTML
    2.2.3 XSL and XSLT
    2.2.4 XPath
    2.2.5 JAXP
    2.2.6 WAP
    2.2.7 WBI
CHAPTER 3: CCT DATA MODEL
  3.1 The Web Content Adaptation Framework
  3.2 Multilevel Web Page Segmentation Rationale
  3.3 The CCT Data Model
  3.4 The Layered View of the CCT Data Model
CHAPTER 4: IMPLEMENTATION OF CCT WITH XHTML
  4.1 Different Approaches to Implement the CCT Data Model
  4.2 New Container Element in XHTML
  4.3 Validation and Implementation Guidelines
  4.4 User Interface Design
  4.5 Content Extraction from CCT Documents
  4.6 Extraction of the Embedded Container Tree
CHAPTER 5: CCT APPLICATION DESIGN AND EXPERIMENTS
  5.1 CCT Adaptive Web Server Design
  5.2 Extending the Adaptive Web Server for Automatic Content Adaptation
  5.3 Test Bed Setup
  5.4 Experiment Results
    5.4.1 Experiment with UBC Homepage
    5.4.2 Experiment with Yahoo Homepage
    5.4.3 Conclusions on the Experiment Results
CHAPTER 6: CONCLUSIONS AND FUTURE WORK
  6.1 Conclusions
  6.2 Future Work
BIBLIOGRAPHY

List of Tables

Table 2-1: An HTTP request containing both a header part and a body part
Table 2-2: An HTTP response containing both a header part and a body part
Table 4-1: Example granularity definition

List of Figures

Figure 2-1: Screenshot of same page with power browser using text summarization
Figure 2-2: Digestor web content adaptation framework
Figure 2-3: Two-level hierarchy with thumbnail view representation
Figure 2-4: WAP 2.0 programming model
Figure 2-5: WAP 2.0 optional proxy model
Figure 2-6: A transaction flows through a series of WBI MEGs
Figure 3-1: CCT web content adaptation framework
Figure 3-2: A sample web page layout with multiple segments
Figure 3-3: CCT data model for the UBC home page
Figure 3-4: The layered CCT data model
Figure 4-1: Container element coding example
Figure 4-2: CCT container index user interface generation
Figure 4-3: User interface for the UBC homepage
Figure 4-4: An XSLT script example for generating container index page
Figure 4-5: A container index page generated with XSLT
Figure 4-6: XSLT script for extracting content node
Figure 4-7: A content page generated with XSLT
Figure 4-8: <ct> tags embedded in HTML tags
Figure 5-1: ServerFileGenerator design
Figure 5-2: CctXslProcEditor design
Figure 5-3: XSLT style sheet template
Figure 5-4: Test bed configuration
Figure 5-5: The UBC homepage on the CCT adaptive web server
Figure 5-6: Granularity hierarchy of the content in the right column container
Figure 5-7: The Yahoo homepage on the CCT adaptive web server

List of Abbreviations
ASP     Active Server Page
AST     Abstract Syntax Tree
BMP     Bit Map
CC/PP   Composite Capability/Preference Profiles
CCT     Content Container Tree
CDPD    Cellular Digital Packet Data
CDMA    Code Division Multiple Access
CSS     Cascading Style Sheet
DOM     Document Object Model
GIF     Graphics Interchange Format
GSM     Global System for Mobile Communication
HTML    Hyper Text Markup Language
HTTP    Hyper Text Transfer Protocol
J2ME    Java 2 Micro Edition
J2SE    Java 2 Standard Edition
JAXP    Java API for XML Processing
JPEG    Joint Photographic Experts Group
JRE     Java Runtime Environment
LAN     Local Area Network
MEG     Monitor Editor Generator
PC      Personal Computer
PDA     Personal Digital Assistant
PHP     Personal Home Page
SAX     Simple API for XML Parsing
SGML    Standard Generalized Markup Language
STU     Semantic Text Unit
URL     Uniform Resource Locator
VIPS    Vision-based Page Segmentation
W3C     World Wide Web Consortium
WAE     Wireless Application Environment
WAP     Wireless Application Protocol
WBI     Web Intermediaries
WML     Wireless Markup Language
XHTML   Extensible Hypertext Markup Language
XML     Extensible Markup Language
XPath   XML Path Language
XSL     Extensible Style-sheet Language
XSLT    Extensible Style-sheet Language for Transformation

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my gratitude to my supervisors Dr. Victor Leung and Dr. Mathew Yedlin for their great support and input to this research. I would also like to give my sincere thanks to Dr. Philippe Kruchten and Dr. Lee Iverson for their very constructive suggestions and valuable opinions on this thesis.

Yanming Bruce Wang
The University of British Columbia
September, 2004

CHAPTER 1: INTRODUCTION

1.1 Motivation

The World Wide Web experienced explosive growth in the last decade due to the wide spread of the Internet and the rapid increase in the number of personal computers (PCs). The next decade will see the booming of the mobile Internet, which poses many new challenges to the web in terms of devices, protocols, networks and user preferences. One of these challenges is the diversity of the terminal devices in input/output capability, processing capability, supported standards, etc. To meet this challenge, a well-defined data model is needed for web content organization and presentation.

The existing HTML-based web content data model is mainly inherited from the traditional paper-based publication industry. It focuses on the layouts and visual effects rather than the semantics and the structure of the content. In paper-based publications such as newspapers, magazines and books, the size of the paper for printing is pre-determined; similarly, a personal computer with a display resolution of at least 800 x 600 pixels is usually assumed when most of today's web pages are designed. Hyper Text Markup Language (HTML) is a language that mainly describes how content is displayed rather than what the content is about and how its parts are related. Although Cascading Style Sheets (CSS) can be applied to bring more flexibility to the presentation of a web document, their main purpose is still defining different visual effects for the content rather than its semantics. For this reason, there is an incompatibility between the display-oriented web content data model and the capabilities of heterogeneous terminal devices, and content adaptation is needed to overcome this incompatibility so that web content can be accessed from heterogeneous terminal devices.
Content adaptation is about generating multiple versions of presentations for the contents according to the particular characteristics of the client device that is trying to access that content. However, today's display-oriented web content data model makes it hard to dynamically modify the presentation of the contents because some information is conveyed implicitly by the layouts and the visual effects of the content display. Changing the presentation of the contents may cause loss of information and yield illegible results. Thus, we want to design a data model which can overcome the shortcomings of the current HTML-based web content data model by including semantics in the data model. This data model should also be compatible with the HTML-based data model to enable dynamic adaptation of existing web pages.

1.2 Two Levels of Content Adaptation

The term "content adaptation" has two levels of meaning. The first level, which we call the low-level adaptation, is about adaptation of the format or the encoding of the content. For instance, a web page is written in HTML, but the terminal device only understands Wireless Markup Language (WML), so it is necessary to translate the HTML syntax into the WML syntax. For another example, the content being requested is a 32-bit color, 800x600 pixel Bit Map (BMP) image, but the terminal device can only render Graphics Interchange Format (GIF) images with 256 colors and no larger than 300x200 pixels, so converting the image from its original format, size and color depth into the format, size and color depth supported by the terminal device is necessary. The low-level adaptation only involves the encoding and the format of particular web content; it has little to do with the semantics of that content or its relationship with other contents, so it is adaptation of the content data.

The second level, which we call the high-level adaptation, is about adaptations of the structural, navigational and functional aspects of the content. For instance, a typical web page which is designed for rendering on a desktop computer is not suitable for a mobile phone's screen. To fit it in, we need to divide it into several sub-pages, which is structural adaptation; we need to design a scheme to let the user navigate through all sub-pages, which is navigational adaptation; moreover, we might want to remove some contents that are not suitable for the terminal device, such as advertisements, animations and embedded script code, which is functional adaptation. To achieve the high-level adaptation, knowledge about the structures and functions of the target contents, and sometimes the intention of the author, is needed. So the high-level adaptation operates on information about the content. However, as the primary purpose of HTML is to define the appearance rather than the structure and function of the contents, the high-level adaptation is more difficult than the low-level adaptation. There has been some research on abstracting the structural and functional information of a web page by analyzing its source code and visual effects.

To manage both the information for high-level adaptation and the content data for low-level adaptation, we design a novel data model named the Content Container Tree (CCT) data model. The CCT data model organizes both the high-level information and the low-level content data of a web page with an embedded hierarchical tree structure.
With the CCT data model, web content adaptation can be achieved in three steps. First, analyze and abstract the semantics and the structure of a web page to form the CCT. Second, navigate through the CCT and retrieve the target content data according to the user's requirements. The first step and the second step together can be viewed as high-level adaptation. The third step is to perform low-level adaptation on the selected content data according to the terminal device's capabilities. If the served document already conforms to the CCT data model, the first step can be omitted; in this case, the first step of adaptation is actually performed in the authoring stage.

The advantages of the CCT data model include:

•  By applying the CCT data model, the high-level adaptation and the low-level adaptation can be explicitly separated, thus making it easier to study and build web content adaptation systems.

•  The CCT data model embeds the structural and functional information into a web page without requiring extensive changes to the existing HTML-based data model. So, an existing HTML document can be easily modified to conform to the CCT data model. Moreover, the current web architecture does not need any changes either.

•  Besides content adaptation, the CCT data model can also be utilized by other web applications, such as search engines, to improve their efficiency and accuracy based on the high-level semantic information.

1.3 Organization of This Thesis

The rest of this thesis is organized as follows. In chapter 2, the related works and technology background on web content adaptation are discussed. In chapter 3, we present the CCT web content adaptation framework as well as the formal definition of the CCT data model. In chapter 4, we discuss the implementation of the CCT data model with XHTML in detail. In chapter 5, we present the design of the CCT adaptive web server application and the experiment results with this application. Finally, conclusions and future work are given in chapter 6.

CHAPTER 2: BACKGROUND

2.1 Related Works

To enable heterogeneous terminal devices to access web content, many efforts have been made and various solutions have been proposed. These solutions fall into three general categories: server side approaches, client side approaches and intermediary approaches.

2.1.1 Server Side Approaches

Server side approaches include device-specific authoring and dynamic content generation. Device-specific authoring involves authoring a set of web pages for a particular type of terminal device. For example, there exist web sites that are authored in WML for Wireless Application Protocol (WAP) compatible mobile phones. The advantages of this approach include the simplicity of implementation and the high quality of the result. A content provider simply creates a set of new web pages based on the capabilities of a particular type of terminal device. And since the pages are designed intentionally by human designers for that particular type of terminal device, the display effect is usually the best among all approaches. On the downside, the users of this particular type of device will only have access to the web pages that are intentionally designed for them, preventing them from taking advantage of the huge number of other web pages on the web.
On the other hand, due to the heterogeneous nature of handheld terminal devices, content providers may have to provide many versions of web pages for the same content to fit various terminal devices. This imposes a heavy burden on the content providers to generate, maintain, synchronize and update their web sites.

Dynamic content generation systems generate different presentations from one common source content on the fly to fit various terminal device capabilities and user preferences. One example of such a dynamic content generation system is Motorola Labs' web-enabled museum system [1]. The system retrieves exhibit data from a content server based on a visitor's context, including location, information interests, and device capabilities. Motorola's effort proved that dynamic content generation is feasible and can dynamically generate acceptable results for various terminal devices. However, its approach is very complex and requires a totally different architecture from the current file-retrieval web server architecture, because all the relevant information, including the source content and the context information, is stored in databases. Thus, it is still impossible for legacy web sites to be accessed from the various terminal devices.

2.1.2 Client Side Approaches

Client side approaches are mainly about techniques that give users the ability to interactively navigate a single web page by altering the portion of it that is displayed at any given time. Scroll bars and back and forward buttons are very common techniques falling in this category. More complex examples include the PAD++ system [2], which lets the user zoom and pan the device display over the document, and Active Outlining [3], which lets the user dynamically expand and collapse sections of the document under their respective section headings. Client side navigation can work well if a good set of viewing techniques can be developed. But it requires the entire document to be delivered to the client device at once, which may waste valuable wireless bandwidth and memory. Besides, client-side navigation places high requirements on the hardware and software of the terminal devices, which may not be met by many mobile terminal devices.

2.1.3 Intermediary Approaches

Intermediary approaches involve developing software that can take an arbitrary web document designed for the desktop environment, along with the characteristics of the target rendering device, and re-author the document through a series of adaptations so that it can be properly rendered on the device. This process is usually performed on a proxy, but it can also be performed either on the client side or on the server side. Examples of automatic content adaptation include the text summarization facility [4] developed by Buyukkokten et al. at Stanford University; the Digestor system [5] developed by Bickmore et al. at MIT; browsing via thumbnail view [6] developed by Yu Chen et al. at Microsoft Research; and the Aurora system [7] developed by Anita Huang et al. at IBM Almaden Research Center. Intermediary approaches do not require many extra changes on the content server side or the client side. The re-authored documents are usually lightweight and require less bandwidth for transmission and less memory for storage than the original ones. So if an intermediary approach can produce acceptable results, which are legible, navigable and aesthetically pleasing, it will be preferred over other approaches.
In Stanford's text summarization facility, an accordion representation is generated and the detailed content can be folded or unfolded. The proxy begins by partitioning the page into 'semantic textual units' (STUs). The result of this step is shown by the rectangles around sections of the page in Figure 2-1(a), which are not present in the actual HTML. In summary, STUs are page fragments such as paragraphs, lists, or ALT tags that describe images. In a second step, the proxy uses font and other structural information to identify a hierarchy of STUs. For example, the elements within a list are considered to be item STUs nested within a list STU. Similarly, elements in a table, or frames on a page, are nested. Hiding the nested STUs finally completes summarization. Figure 2-1(b) exemplifies the result. Initially, each STU is represented by a single line on the personal digital assistant (PDA) screen. The arrows in Figure 2-1 denote the correspondence of STUs with their single-line representations. (The line numbers on the PDA are only for convenience of explanation and do not appear on the actual display.) The '+' symbols next to some STU representations indicate the presence of hidden STUs at a deeper nesting level. Users may view STUs at those deeper levels by tapping their pen on the appropriate '+' symbol, or by performing a left-to-right pen gesture. In response, the '+' turns into a '-' symbol, and the nested STUs are displayed indented. For example, the STU of line 4 in Figure 2-1(b) has been expanded, revealing lines 5-9. Then the STU of line 9 was expanded to reveal lines 10-13. The STU of line 3 has not been expanded, hence the '+' on that line. Initially, only the top level of the STU hierarchy is shown on the screen. In Figure 2-1(b) this top level consists of four STUs in lines 1-4. Lines 5-13 were initially blank. Using this 'accordion' approach to Web browsing, users can initially get a good high-level overview of a Web page, and then "zoom into" the portions that are most relevant.

Figure 2-1: Screenshot of same page with power browser using text summarization

In our proposed CCT data model, we apply an 'index' approach, which is similar to the 'accordion' approach of Stanford's text summarization facility, to help the user navigate through a web page and locate the most relevant portion of the contents. The main differences between the CCT data model and the text summarization facility are:

1. The text summarization facility abstracts only the text content from a web page; images and other HTML elements are abandoned. The CCT data model, on the other hand, preserves the full original HTML document. When a portion of the contents is selected, the respective segment of the web page is returned rather than just the text content.

2. The text summarization granularity is fixed: a summary line corresponds to one particular text paragraph, and the user can only choose to view the summary or the full text, no matter what type of terminal device he is using. The CCT data model, in contrast, supports multilevel granularity, which means the user can choose the granularity of the content to display according to the characteristics of his terminal device. Refer to section 3.3 for a detailed explanation of multilevel granularity.
3. The text summarization facility applies static link anchors to organize the summary and its respective contents, while the CCT data model uses a tree structure to define the relationships between the index and the contents, allowing dynamic operations for navigating and locating relevant content with tree-structure queries.

MIT's Digestor system re-authors web documents through a series of transformations and links the resulting individual pieces. The first thing that users of Digestor will typically do is specify the size of their device's display and indicate the size of their default browser font; these are required in order to estimate the screen area requirements of the text blocks. Based on this estimation, Digestor performs a series of transformations such as scaling down some images and replacing some text with links. The Digestor system and the CCT system both take into account the characteristics of heterogeneous terminal devices, so they share a similar framework for content adaptation. As shown in Figure 2-2, the Digestor system is implemented as a proxy server, which receives a request for an HTML document, retrieves the document from the specified HTTP server, parses the HTML and constructs an Abstract Syntax Tree (AST), labels each of the AST nodes with a unique identifier, and then retrieves any embedded images so that their size can be determined (as necessary). Once this has been accomplished, the planner is initialized with a state containing the AST for the original document. During each planning cycle, the planner selects the state with the best document version so far, then selects the best applicable transformation technique and applies it, resulting in a new state and document version being generated.

Figure 2-2: Digestor web content adaptation framework

The CCT content adaptation system can also be implemented as a proxy server. A Content Container Tree (CCT) is generated from the original HTML document. However, unlike the Digestor system, which chooses a series of transformations to re-author the web page according to the terminal device's screen size, the CCT system generates a multilevel index for the user to navigate through the web page; only the chosen portion of the contents is further transformed on the fly according to the terminal device's characteristics. Moreover, the Digestor system does not distinguish between navigational transforms such as outlining and content transforms such as image reduction; all transformation techniques are applied in one stage. In the CCT system, we explicitly define two types of transformation: the high-level transformation of the navigation and structure of the web page, and the low-level transformation of the syntax, format and encoding of the contents in a web page. These two types of transformation are accomplished in two separate stages in the CCT web content adaptation system. Please refer to section 3.1 for a detailed introduction to the CCT content adaptation framework.

Microsoft's Thumbnail View system provides a graphical overview of the web page and allows the user to select a desired portion of the web page to zoom in for details. As shown in Figure 2-3, a web page is organized into a two-level hierarchy with a thumbnail representation providing a global view, with each block of semantically related content represented by a different color.
By tapping on a block in the thumbnail, a user can easily go to view the corresponding content, which is formatted to fit well into a small screen.

Figure 2-3: Two-level hierarchy with thumbnail view representation

The thumbnail view system best preserves the look and feel of the original web page. But its two-level hierarchy is insufficient for displaying large and complex web pages on a small-screen terminal device. Moreover, its graphic-based navigation method constrains its application to only those terminal devices which have good support for graphics. In contrast, the CCT system adopts more flexible multi-level indexes for navigation through a web page. The chosen contents can be displayed just as in the original web page or can be further adapted according to the terminal device's characteristics. The similarity between the thumbnail view system and the CCT system lies in that they both apply vision-based segmentation to re-organize the contents in a web page.

IBM's Aurora system adapts web content based on semantic rather than syntactic constructs, facilitating navigation by streamlining the web interface according to abstract user goals. Instead of attempting to generalize for every web page, Aurora analyzes web objects based on their semantic functions within aggregations of web pages. For example, within the context of a search engine, it treats a web object as a search box rather than just an HTML form element. The goal is to make the abstract services provided by these collections accessible to handheld device users. The advantage of Aurora is its ability to provide concept-oriented content adaptation. The trade-off is its scalability, since Aurora depends on a repository of custom content-extraction rules (one set of rules per web site). The generation and maintenance of these rules can be very labor-intensive and time-consuming. As a result, only a very small portion of web sites can be supported by Aurora.

The idea of adapting web contents based on semantics is also adopted by CCT, but for a different purpose. Unlike the Aurora system, which works on the semantics to determine how to transform a particular web object according to its function in the context, the CCT system works on the semantics for content segment organization and navigation. In short, the CCT system utilizes the semantics of the hierarchical content segments to achieve high-level adaptation of structure and navigation, while Aurora utilizes the semantics of web objects to achieve low-level adaptations of presentation and transcoding.

In conclusion, each of the server side, client side and intermediary approaches has pros and cons. None of them can independently solve the problem of web content adaptation. The future architecture for web content adaptation is more likely to be a combination of more than one approach. First, on the server side, the content data model should be improved to be more structural and informative to allow intelligent processing. Second, on the client side, the terminal devices' input, output and processing capabilities should be improved to a certain level. A mono screen which can only display several lines of text will never provide satisfaction for web content browsing. With the rapid development in the personal handheld data device industry, colorful graphic screens with a certain level of resolution will soon become the mainstream, providing a baseline for web content adaptation.
Finally, the development of intermediary technologies will make the web smarter and more adaptable. By comprehensively considering the various approaches to web content adaptation, the CCT data model overcomes the shortcomings of some existing attempts at web content adaptation and presents a flexible and extensible method to enable heterogeneous terminal devices to access web contents.

2.2 Enabling Technologies

This section introduces some state-of-the-art technologies that can be applied to enable web content adaptation.

2.2.1 XML

In some sense, the success of the World Wide Web is attributable to the simplicity of HTML, the de facto standard markup language for authoring web documents. However, it can be argued that this simplicity is also one of its weaknesses. The tags in HTML are mainly used for describing the visual presentation characteristics of a document, but have little to do with carrying structural and semantic information about the content. For example, the <H1> to <H6> tags are defined to emphasize the titles and subtitles of an essay, but in practice, text tagged with <H1> to <H6> is not necessarily a title; many people use these tags only to get a certain display effect on the font size or font type. Moreover, due to the loose grammar of HTML, it is hardly capable of supporting advanced data manipulation and extraction.

To overcome the shortcomings of HTML, XML (Extensible Markup Language) [21] emerged in the mid 90's as the standard for data representation and exchange on the Web. Note that XML is not a data model, but a meta-model to represent various documents. In a way, XML representations can be regarded as representations of semistructured data. Semistructured data is data that has some structure, but the structure may not be rigid, regular, or complete, and generally the data does not conform to a fixed schema (sometimes the term schema-less or self-describing is used to describe such data). In semistructured data, the information that is normally associated with a schema is conveyed by the data itself. An example of semistructured data is text content represented as title, author, affiliation and paragraphs. Such data is not fully structured like relational structures, but it has partial structure. Semistructured data has gained notice recently because it may be desirable to treat Web sources like databases which have no fixed schemas.

XML is a restricted version of SGML (Standard Generalized Markup Language), designed especially for Web documents. As a meta-language (a language for describing other languages), XML enables designers to create their own customized tags to provide functionality not available in HTML. In this way, XML enables different kinds of data to be exchanged over the Web. The advantages of XML include: simplicity, an open standard that is platform/vendor-independent, extensibility, separation of content and presentation, support for the integration of data from multiple sources, the ability to describe data from a wide variety of applications, and more advanced search engines. The components of XML documents include the following:

•  Declaration: the XML declaration defines the XML version and the character encoding used in the document. For example, <?xml version="1.0" encoding="ISO-8859-1"?> means the document conforms to the 1.0 specification of XML and uses the ISO-8859-1 (Latin-1/West European) character set.
•  Elements: elements, or tags, are the most common form of markup. The first element must be a root element; an XML document must have exactly one root element, and all other elements must be within this root element. All elements can have sub-elements (child elements), and sub-elements must be correctly nested within their parent element. For example:

    <root>
      <child>
        <subchild>...</subchild>
      </child>
    </root>

XML elements are extensible and they have relationships: XML documents can be extended to carry more information, and elements are related as parents and children.

•  Attributes: attributes are name-value pairs that contain descriptive information about an element. The attribute is placed inside the start-tag after the corresponding element name, with the attribute value enclosed in quotes. For example, <file type="text">.

•  Entity references: an entity reference is used in an XML document to refer to an entity. An entity reference starts with an ampersand (&) and ends with a semicolon (;); for example, &lt; refers to the left angle bracket (<), which is a reserved character.

•  Comments: the syntax for comments in XML is similar to that of HTML. For example, "<!-- This is a comment -->".

•  CDATA sections and processing instructions: a CDATA section instructs the XML processor to ignore markup characters and pass the enclosed text directly to the application without interpretation. A processing instruction is of the form <?name pidata?>, where name identifies the processing instruction to the application. Since the instructions are application specific, an XML document may have multiple processing instructions that tell different applications to do similar things, but perhaps in different ways.

Also note that, with XML, elements are ordered, which means two XML segments with the same elements appearing in a different order are different. In contrast, attributes in XML are not ordered, so it does not matter which attribute comes first in an element.

One of the most outstanding characteristics of XML is the separation of content and presentation. Unlike HTML, XML simply structures a collection of data to store in a text format document; it does not deal with how the data looks. Another significant characteristic of XML is that every XML document can be expressed as a well-formed tree, making it easy to process with XML parsers. Since many other technologies, including XHTML, XSLT and XPath, are based on XML, XML is the most basic enabling technology for our research.

2.2.2 XHTML

XHTML (Extensible HTML) [24] 1.0 is a reformulation of HTML 4.01 in XML 1.0, and it is intended to be the next generation of HTML. It is basically a stricter and cleaner version of HTML. The main differences between XHTML and HTML are:

•  XHTML elements must be properly nested
•  XHTML documents must be well-formed
•  Tag names must be in lowercase
•  All XHTML elements must be closed

XHTML applies a method called modularization to achieve user-agent interoperability. An XHTML module is a set of existing HTML elements (or tags) or further elements which offer a specific functionality. The modules can be combined with each other and with the desired feature sets of different user agents to form different XHTML subsets, different XHTML-conforming document types, or XHTML-family markup languages. The modularization of XHTML facilitates the creation of new markup languages.
A certain XHTML-family markup language, i.e., a collection of XHTML modules, is used to encode Web content so as to best fit the capabilities of a certain type of user agent. Every such language must include the core modules, and may also use other XHTML-provided modules, other World Wide Web Consortium (W3C) defined modules, or any other module that is correctly defined. One example of an XHTML-family markup language is "XHTML Basic," which includes the minimal set of XHTML modules and some other modules such as Images, Basic Forms, Basic Tables, and Objects. It is designed for thin clients that do not support the full set of XHTML features. Another example of an XHTML-family markup language is XHTML Mobile Profile (XHTML MP). XHTML MP is intended to provide a content authoring language suitable for resource-constrained devices such as mobile phones. It extends XHTML Basic by adding some presentation elements and support for internal style sheets. XHTML MP has been adopted by the WAP Forum as its client-side markup language in the latest WAP 2.0 specification.

Because XHTML is linked to both HTML and XML, we make use of it to implement the proposed CCT data model in our research. Such an XHTML implementation maintains the basic structure and functions of a conventional HTML web document while bringing in the data processing capabilities of XML. Please refer to section 4.2 for a detailed discussion of how to utilize XHTML for implementing the CCT data model.

2.2.3 XSL and XSLT

In HTML, default styling is built into browsers because the tag set for HTML is predefined and fixed. The CSS (Cascading Style Sheets) specification allows the developer to provide an alternative rendering for the tags. Similarly, the Extensible Stylesheet Language (XSL) [23] is the standard style definition language for XML. XSL is a formal W3C recommendation that has been created specifically to define how an XML document's data is rendered and to define how one XML document can be transformed into another document. XSL is the counterpart in XML of CSS, but more powerful.

Extensible Stylesheet Language for Transformations (XSLT) forms a subset of XSL. It is a language in both the markup and the programming sense, in that it provides a mechanism to transform one XML structure into another XML structure. While it can be used to create the display output of a Web page, XSLT's main ability is to change the underlying structures rather than simply the media representations of those structures, as is the case with CSS. XSLT is important because it provides a mechanism for dynamically changing the view of a document and for filtering data. In our research, we make use of XSLT to extract the proper content from a CCT document to dynamically generate and format a web page. Please refer to section 4.5 for detailed discussions and examples of how to use XSLT.

2.2.4 XPath

XML Path Language (XPath) [24] is a declarative query language for XML that provides a simple syntax for addressing parts of an XML document. It was designed for use within XSLT (for pattern matching). With XPath, collections of elements can be retrieved by specifying a directory-like path expression with predicates. XPath uses a compact, string-based syntax, rather than the XML element-based syntax. XPath treats an XML document as a logical (ordered) tree with nodes for each element, attribute, text, processing instruction, comment, namespace, and root. The basis of the addressing mechanism is the context node (a starting point) and the location path, which describes a path from one point in an XML document to another, providing a way for the items in an XML document to be addressed. XPath can be used to specify an absolute location or a relative location. A location path is composed of a series of steps joined with '/', which serves much the same function as '/' in a directory path.
The basis of the addressing mechanism is the context node (a starting point) and the location path, which describes a path from one point in an X M L document to another, providing a way for the items in an X M L document to be addressed. XPath can be used to specify an absolute location or a relative location. A location path is composed of a series of steps joined with 7', which serves much the same function as 7' in a directory path. Each 7'  22  moves down the tree from the preceding step. An XSLT operation can take the XPath expression as an argument. For example, <xsl:copy-of select=7catalog/cd" /> selects cd node from the catalog. XPath also defines a set of standard functions for working with strings, numbers and Boolean expressions. For example, the XPath expression '7catalog/cd[price>10.80]" selects all the cd elements that have a price element with a value larger than 10.80. In our research, we make use of XPath to locate and select particular contents or information from a CCT document. Please refer to the section 4.5 for detail discussions and examples on how to use XPath together with XSLT. 2.2.5 JAXP  Many web developers have come to the conclusion that X M L and Java programming language form the perfect pair because they complement each other so well. X M L contributes platform-independent data — portable documents and data. Java contributes platform-independent processing — portable object oriented software solutions. Thanks to the combination of X M L and Java, many applications that cannot be accomplished within the limitations of HTML are now solvable. Although there are many X M L tools and libraries based on other languages such as Python, Perl, and C, the majority of X M L development is focused on Java, which is emerging as the language of choice for processing X M L . Java API for X M L Processing (JAXP) [25] makes it easy to process X M L data using applications written in the Java programming language. JAXP leverages the parser standards SAX (Simple API for X M L Parsing) and DOM (Document Object Model) so that developers can choose to parse their data as a stream of events or to build a tree-  23  structured representation of it. The latest versions of JAXP also support the XSLT (XML Stylesheet Language Transformations) standard, giving control over the presentation of the data and enabling converting the data to other X M L documents or to other formats, such as HTML. JAXP also provides namespace support. In our research, we use JAXP along with X M L , XHTML, XPath and XSLT to build applications that implement the poposed CCT data model for web content adaptation. Please refer to the section 5.1 for how these technologies can be integrated together to create a concrete application. 2.2.6 W A P  WAP (Wireless Application Protocol) [26] is an open specification developed by the WAP forum to empower mobile users to access the Internet via wireless communication networks. WAP works with most wireless networks such as CDPD, GSM, GPRS, CDMA, TDMA, and 3G. What WAP provides is an application environment and a suite of communication protocols for mobile devices to access the Internet and advanced telephony service. The latest version of WAP is WAP 2.0 which adopts more existing Internet and Web technologies. WAP 2.0 incorporates HTTP and TCP into its communication protocol stack and migrates its content encoding language from WML to XHTML. The resulting WAP programming model is more closely aligned with the Web programming model. 
In the earlier versions of WAP, namely WAP 1.0 and WAP 1.1, a WAP proxy (often referred to as a WAP gateway) was required to handle the protocol inter-working between the client and the origin server. The WAP proxy communicated with the client using the WAP protocols that are based largely on Internet communication protocols, and  24  it communicated with the origin server using the standard Internet protocols. WAP 2.0 does not require a WAP proxy, since the communication between the client and the origin server can be conducted using standard Internet protocol HTTP/1.1 over TCP/IP. Nevertheless, although the deploying of a WAP proxy is no longer mandatory in WAP 2.0, it can optimize the communications process and may offer mobile service enhancements. The WAP 2.0 programming model is shown in figure 2-4. The WAP Optional Proxy Model is shown in figure 2-5.  Figure 2-4 W A P 2.0 programming model  Figure 2-5: W A P 2.0 optional proxy model  25  In our research, we'll use the WAP 2.0 programming model and optional proxy model as the basic model for mobile handheld devices to access the adaptive web content. We'll also conduct our experiments with WAP 2.0 compliant simulators. Please refer to the section 5.3 for details. 2.2.7 W B I IBM's WBI [17] is a programmable web proxy and web server. 'WBI' stands for Web Intermediaries. The WBI Development Kit provides a set of convenient and flexible APIs for programming intermediaries on the web. In this case, intermediaries are computational entities that lie along the HTTP stream and are programmed to tailor, customize, personalize, or otherwise enhance data as they flow along the stream. WBI has a data model and a processing model. Briefly, the data model describes the way requests and responses are accessed and manipulated. The processing model describes (a) the way different modules within WBI are activated when requests are received from the browser or other client, and (b) the way these different modules work together to produce the response that is sent back to the browser. WBI's data model is based on the request/response structure of HTTP version 1.0. Each request and each response consists of a structured part and a stream part. The structured part corresponds to the header and the stream part corresponds to the body. Table 2-1 shows an HTTP request that contains both a header and a body. When WBI receives this request, it is parsed into these two parts. The header information is stored in an object of class Documentlnfo. The body information is made available through an object of class MeglnputStream. The body information can then be read from the MeglnputStream using its read(...) method. This technique is used because the body  26  portion of a request or a response can be very long and it is not practical to store it in memory as a string. HEADER (Documentlnfo) POST http://www.ibm.com/java HTTP/1.0 User-agent: Mozilla/4.0 Accept: text/html Content-length: 15 BODY (MeglnputStream) This is a sample http request. Table 2-1: A n H T T P request containing both a header (structured) part and a body (stream) part.  HEADER (Documentlnfo) HTTP/1.0 200 Ok Server: MyWebServer Content-type: text/html Content-length: 36 BODY (MeglnputStream) <html> <hl>Hello, world</hl> </html> Table 2-2: A n H T T P response containing both a header (structured) part and a body (stream) part.  Table 2-2 shows a typical HTTP response. When WBI receives this response, it is parsed in the same way as a request. 
The header information is stored in a DocumentInfo object and the body is made available through a MegInputStream object. To produce new requests and responses, a WBI module is given a DocumentInfo object and a MegOutputStream object to manipulate. The module sets a property of the DocumentInfo object using either the setRequestHeader(...) or setResponseHeader(...) methods. When the header information has been set appropriately, the WBI module may begin writing the body content to the MegOutputStream using its write(...) methods. In summary, WBI divides both HTTP requests and HTTP responses into two parts: the structured header portion and the unstructured body portion. The header is manipulated through a DocumentInfo object and the body is manipulated through MegInputStream and MegOutputStream objects.

WBI's processing model consists of four types of processing entities: the request editor, the generator, the document editor and the monitor. WBI is a programmable HTTP request and response processor. It receives an HTTP request from a client, such as a web browser, and produces an HTTP response that is returned to the client. The processing that happens in between is controlled by the modules programmed into WBI.

Figure 2-6: A transaction (request/response) flows through a series of WBI MEGs (monitors, editors, generators).

Figure 2-6 shows the flow through a typical WBI transaction. It goes through three basic stages: request editors, generators, and editors (which would be more appropriately called "document editors"). Request Editors receive a request and have the freedom to modify the request before passing it along. Generators receive a request and produce a corresponding response (i.e., a document). Editors receive a response and have the freedom to modify the response before passing it along. When all the steps are completed, the response is sent to the originating client. A fourth type of element, the Monitor, can be designated to receive a copy of the request and response but cannot otherwise modify the data flow. The Monitor, Editor, Request Editor and Generator modules are collectively referred to as MEGs.

WBI dynamically constructs a data path through the various MEGs for each transaction. To configure the route for a particular request, WBI has a rule and a priority number associated with each MEG. The rule specifies a boolean condition that indicates whether the MEG should be involved in a transaction. The boolean condition may test any aspect of the request header or response header, including the URL, content-type, client address, server name, etc. Priority numbers are used to order the MEGs whose rules are satisfied by a given request/response. When it receives a request, WBI follows these steps (a simplified sketch of this rule-and-priority routing appears after the list):

1. The original request is compared with the rules for all Request Editors. The Request Editors whose rule conditions are satisfied by the request are allowed to edit the request, in priority order.

2. The request that results from this Request Editor chain is compared with the rules for all Generators. The request is sent to the highest priority Generator whose rule is satisfied. If that Generator rejects the request, subsequent valid Generators are called in priority order until one produces a document.

3. The request and response are used to determine which Editors and Monitors should see the document on its way back to the client. The document is modified by each Editor whose rule condition is satisfied, in priority order. Monitors are also configured to monitor the document either (a) as it is produced from the generator, (b) as it is delivered back to the client, or (c) after a particular Editor.

4. Finally, the response is delivered to the requester.
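The rule-and-priority dispatch described in these steps can be pictured with a small amount of code. The sketch below models the selection logic only; it does not use the actual WBI Development Kit classes or method signatures, and the Meg and Rule types, together with the module names in main, are hypothetical stand-ins introduced for this illustration.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical stand-in for WBI's MEG concept: each module exposes a boolean
    // rule over the transaction and a priority used to order the matching modules.
    interface Rule { boolean matches(String requestHeader); }

    class Meg {
        final String name;
        final int priority;   // higher priority runs first
        final Rule rule;      // condition on the request/response headers
        Meg(String name, int priority, Rule rule) {
            this.name = name; this.priority = priority; this.rule = rule;
        }
    }

    public class MegRouter {
        // Return the MEGs whose rules are satisfied, ordered by descending priority,
        // mirroring how request editors and document editors are chained.
        static List<Meg> route(List<Meg> megs, String requestHeader) {
            List<Meg> selected = new ArrayList<>();
            for (Meg m : megs) {
                if (m.rule.matches(requestHeader)) selected.add(m);
            }
            selected.sort(Comparator.comparingInt((Meg m) -> m.priority).reversed());
            return selected;
        }

        public static void main(String[] args) {
            List<Meg> editors = new ArrayList<>();
            editors.add(new Meg("imageReducer", 10, h -> h.contains("Accept: image")));
            editors.add(new Meg("cctIndexEditor", 50, h -> h.contains("text/html")));
            // A request whose headers accept HTML is routed to cctIndexEditor only.
            route(editors, "GET / HTTP/1.0\nAccept: text/html")
                .forEach(m -> System.out.println(m.name));
        }
    }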
A WBI application is usually composed of a number of MEGs that operate together to produce a new function. Such a group of MEGs is a WBI plugin. Plugins define the basic unit for installation, configuration, and enabling/disabling. A plugin bears some resemblance to a Java applet: it is a class that WBI instantiates at start-up, similar to how a browser loads an applet. In our research, we develop our CCT application as a WBI plugin. For design details, please refer to section 5.1, "CCT adaptive web server design," for an introduction to developing the application with WBI.

CHAPTER 3: CCT DATA MODEL

3.1 The Web Content Adaptation Framework

To apply the CCT data model for dynamic web content adaptation, we propose a proxy-based framework as shown in Figure 3-1.

Figure 3-1: CCT web content adaptation framework

This framework mainly consists of three functional modules: the HTTP proxy module, the content reformation module and the adaptive transcoding module.

1. The HTTP proxy module receives the HTTP requests from the client, interprets each request and forwards it to the web server. This module is also responsible for analyzing the HTTP request for information about the client and the target network characteristics. Based on the request and this information, the control logic coordinates the work of the other two functional modules.

2. The content reformation module is an intelligent agent responsible for analyzing the structure of the web page from the original web server and changing that web page to comply with the CCT data model. The resulting intermediate CCT document is saved in a cache for further consumption by the adaptive transcoding module. In the case that the original web page already conforms to the CCT data model, this module can be omitted, and the system becomes an adaptive content generation system.

3. The adaptive transcoding module makes queries to the CCT document in the cache according to the HTTP request. Based on the client information gathered by the HTTP proxy module, the adaptive transcoding module performs the respective low-level content transcoding on the query results from the CCT document to fit the terminal device's capability requirements.

Comparing this CCT web content adaptation framework with the Digestor web content adaptation system shown in Figure 2-2, we can see some similarities between the two frameworks. First, the two frameworks both have proxy modules which are used to accept and forward HTTP requests. Second, they both have intelligent modules to re-author a web page. Finally, they both use caches to temporarily store re-authoring results for further consumption by the user.
The major difference between the two frameworks is that the Digestor system performs all the transformations in one step and stores the final transformation results in the cache for the user to access sequentially, while the CCT system performs the transformation in two separate steps: the first step is to re-author the web document into an intermediate format and store it in the cache; the second step is to extract a particular portion of the contents from the intermediate document and perform content transcoding on the fly according to the terminal device's characteristics. The first step of transformation is accomplished by the content reformation module, while the second step is accomplished by the adaptive transcoding module.

The main advantage of the CCT web content adaptation framework over the Digestor system and other frameworks for web content adaptation is that it explicitly separates the high-level adaptation and the low-level adaptation. The content reformation module is mainly responsible for the high-level adaptation, analyzing the semantics and the structure of the original web page and adding this information to the intermediate CCT document. The adaptive transcoding module is mainly responsible for the low-level adaptation, such as markup language transcoding and image reduction, on the chosen portion of the content. The adaptive transcoding module can be viewed as a miniature of a content adaptation system such as Digestor, except that it only deals with portions of content that are of a size comparable to the terminal device's screen. For this reason, it is much easier to determine the proper transformation techniques, because the high-level navigational and structural transformation techniques such as outlining can be omitted. By separating the high-level adaptation and the low-level adaptation, the complex problem of web content adaptation can be reduced to two smaller sub-problems: finding algorithms to analyze the semantics and structures of a web page, and finding techniques to transcode various types of contents.

The pivot connecting the whole content adaptation system is the intermediate document, which is the output of the content reformation module and the input of the adaptive transcoding module. Thus, the data model for this intermediate document should be carefully designed to be able to carry the necessary high-level semantic and structural information acquired by the content reformation module. The intermediate data model should also possess a structure that allows the adaptive transcoding module to easily and efficiently extract both the high-level information and the content data from it. To achieve such a data model, both the display-oriented nature of HTML web pages and the heterogeneous nature of the terminal devices should be considered. In the following sections in this chapter, we'll discuss the design of such a data model in detail.
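To summarize the division of labour among the three modules, the following outline shows how a single request could travel through the framework of Figure 3-1. It is an illustration of the framework only, not code from the thesis application; every type and method name in it is a hypothetical placeholder.

    // Illustrative outline of the proxy flow in Figure 3-1; all names are placeholders.
    public class CctProxySketch {

        String handle(String url, String xpathQuery, DeviceProfile device,
                      ContentReformer reformer, AdaptiveTranscoder transcoder,
                      CctCache cache, OriginServer origin) {

            // 1. HTTP proxy module: fetch the original page on the client's behalf.
            String originalPage = origin.fetch(url);

            // 2. Content reformation module (high-level adaptation): build the
            //    intermediate CCT document once and cache it. This step is skipped
            //    when the page was authored as a CCT document in the first place.
            String cct = cache.get(url);
            if (cct == null) {
                cct = reformer.reform(originalPage);   // segment + embed container tree
                cache.put(url, cct);
            }

            // 3. Adaptive transcoding module (low-level adaptation): extract the
            //    requested container (e.g. via an XPath query) and transcode it to
            //    match the device's markup, image and screen constraints.
            String fragment = transcoder.extract(cct, xpathQuery);
            return transcoder.transcode(fragment, device);
        }

        // Minimal collaborator interfaces so the sketch is self-contained.
        interface OriginServer    { String fetch(String url); }
        interface ContentReformer { String reform(String html); }
        interface CctCache        { String get(String url); void put(String url, String cct); }
        interface AdaptiveTranscoder {
            String extract(String cctDocument, String xpathQuery);
            String transcode(String fragment, DeviceProfile device);
        }
        interface DeviceProfile   { int screenWidth(); int screenHeight(); String markup(); }
    }

The point of the ordering is that the expensive high-level analysis happens once per page and is cached, while the low-level extraction and transcoding happen on the fly for each request.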
The granularity of these sub-pages should be decided according to the screen size of the terminal device.

2. Find an efficient method to allow the user to navigate through the sub-pages and pick the most relevant one.

3. Find a set of transcoding techniques to further transform the chosen sub-page to make it more suitable for being rendered on the terminal device.

Microsoft's thumbnail view browsing system applies a two-level hierarchy to manage the sub-pages and a graphical overview to enable the user to navigate through the sub-pages. The limitation of such a design is that with the two-level hierarchy, only one granularity level of the sub-pages can be defined, so only terminals with a certain screen size are supported. The graphical overview navigation system also limits its usage to terminal devices with a certain graphical display capability. Because the thumbnail view browsing system assumes the terminal devices are capable of displaying graphical HTML pages, no further transformation is made to the chosen sub-page.

To enable heterogeneous terminal devices to access web contents, we need a more flexible design of the data structure to organize the multilevel segmentation of a web page, so that we can have sub-pages at various granularity levels for the various screen sizes of the terminal devices. We also need a method to help the user traverse this multilevel hierarchy to find the target sub-page at the proper granularity. Based on this rationale, we design the CCT data model, which mainly describes how the multilevel segmentation of a web page should be organized and how to traverse the multilevel hierarchy to locate and retrieve the relevant information and data.

Before we introduce the detailed design of the CCT data model, we first study an example of web page segmentation to gain some insight into the vision-based segmentation method, its principles and its process. Because HTML web page design is focused on visual effects and content layout, the semantics and structural information of the web page are usually conveyed implicitly by the visual appearance of the content and the page layout. For example, content in a large or bold font is usually more important than content in a small or light font; content in the central area of the page is usually more important than content in the edge areas; logically related contents are usually spatially adjacent; logically independent contents are usually separated by visual cues such as blank areas, vertical or horizontal lines, or by applying different visual patterns including fonts, colors, etc.

As an example, figure 3-2 shows the home page of UBC, which has a very typical layout design for today's web pages. In figure 3-2 we can see that the web page can be visually divided into five main segments which are logically independent of each other. They are:

• The header bar, which contains the logo, the organization name and some navigational links.

• The left column, which is mainly used to display navigational links and indexes.

• The main content area, which is used to display the main content of the web page.

• The right column, which is used to display less important or irrelevant content.

• The footer bar, which is used to provide complementary information about the web page such as address, phone numbers, copyright and some additional links.

In addition, the main content segment and the right column segment can be further divided into several sub-segments.
This segmentation process can go on iteratively until the predefined finest granularity level is reached.

Figure 3-2: A sample web page layout with multiple segments

Automatic web page segmentation can be achieved with algorithms such as Vision-based Page Segmentation (VIPS) [11-13]. VIPS makes full use of page layout features such as font, color and size. It first extracts all the suitable nodes from the HTML DOM tree, and then finds the separators between these nodes. Here, separators denote the horizontal or vertical lines in a web page that visually do not cross any node. Based on these separators, the semantic tree of the web page is constructed. A value called the degree of coherence (DOC) is assigned to each node to indicate how coherent it is. Consequently, VIPS can efficiently keep related content together while separating semantically different blocks from each other. The granularity of segmentation in VIPS is controlled by a predefined degree of coherence (PDOC), which serves as a threshold for the most appropriate granularity for different applications. The segmentation stops only when the DOCs of all blocks are no smaller than the PDOC. For a detailed description of the VIPS algorithm, please refer to [12].

The display-oriented web page design works well for human readers equipped with large-screen devices, because human viewers can easily discover the patterns and logical relationships hidden behind the graphical layout. However, it is hard for machines to recognize the patterns and understand the information conveyed by the graphical layout.
It is also unsuitable for human viewers with small-screen devices. Because the web page cannot fit within one screen, the viewer has to scroll both horizontally and vertically to get the whole picture of the page, which makes the discovery of the patterns difficult. Moreover, in some situations, the constraints on the display of handheld devices (such as color, images and font sets) can even make it impossible to render a web page properly on the handheld device. For this reason, it is necessary to explicitly express the semantics and relations of the segments of a web page to support small-screen devices and machine processing. The segmentation semantics can be embedded in the content by the author or be discovered by an intelligent agent program with segmentation algorithms such as VIPS. With the segmentation semantics, the user can navigate through the segment hierarchy of a web page and choose the target content segment at the proper granularity to display. In this way, the high level adaptation of in-page navigation is achieved.

3.3 The CCT Data Model

To manage the multilevel segmentation hierarchy of a web page, we design a special tree-structured data model named the Content Container Tree (CCT) data model. Unlike the thumbnail view browsing system, which uses flat sub-pages, the CCT introduces the concepts of 'container' and 'content'. The 'content' is simply a piece of HTML code that corresponds to a particular segment in a web page. The 'container' is like a box that holds the 'content'.

The CCT data model works in a way similar to how we organize our belongings when we move house. When we move, we put a category of items into a box and put a label on the box to tell what is in it; for example, we may put all of our clothes in a box with a label "clothes" and put all of our shoes in another box with a label "shoes". Then, for ease of transportation, we may use larger boxes to hold related categories of items. For example, we may use a large box to hold the clothes box, the shoes box and the quilt box, and we may use another large box to hold the kitchen utensil box and the garage tool box. We may also want to apply labels such as 'bedroom stuff' and 'kitchen and garage stuff' to these large boxes to give us some idea of what kinds of items are in them. Finally, we put all the boxes in a vehicle for delivery to the destination. The advantage of such a hierarchical organization is that we can quickly locate an article by inspecting the labels of the boxes without going through the detailed contents of each box. For example, if we want to find a pan for cooking, we know it must be in the 'kitchen and garage' box. After we open this box and find there are two smaller boxes, 'kitchen utensils' and 'garage tools', we know it must be in the 'kitchen utensils' box. Then we can open the kitchen utensil box to dig out the pan we need.

The idea of the CCT 'container' is very similar to the moving-box concept, but with some differences. When we say "open a container", we mean something different than opening a moving box; we use the term "expanding a container" to express the similar meaning of opening a box. We'll explain this difference in detail later in this section. For now, we just want to show the basic idea of how the CCT containers work with the moving-box analogy. Similar to a moving box, a CCT container is used to hold a piece of HTML code that represents a particular segment in a web page.
A CCT container can also be used to hold other CCT containers to form a hierarchy, and this container hierarchy reflects the multilevel segmentation hierarchy described in the foregoing section. That is, an upper-level container corresponds to an upper-level segment and a lower-level container corresponds to a lower-level segment. The advantage of this "content and container" design is that it does not require opening the detailed contents to know what the content is about. To locate target content, we only need to traverse the container hierarchy and check the labels of the containers. After we find the right container, which contains the target content, we can open that container and retrieve the content, just as we do to find an article in the moving boxes.

The hierarchical structure of the containers and contents can be expressed as a tree. The root node of the tree stands for the uppermost container, which contains the whole page; inner nodes are the coarser top-level containers; and all leaf nodes are flat segments of the web page content. We refer to the leaf nodes as the content nodes; we refer to all the non-leaf nodes, which have at least one child node, as the container nodes. A container node may have one or many children nodes. These children nodes can be either other container nodes or content nodes. A container preserves semantic information about the content segment it corresponds to.

Applying the description method of document representation in [14], the basic model of CCT is described as follows. A web page (root container) ω can be represented as a triple ω = {Ω, Φ, Δ}, where Ω = {ω1, ω2, ..., ωN} is a finite set of sub-containers. These sub-containers do not overlap. Each container can be recursively viewed as a sub-page with a structure induced from the whole page structure. Φ = {φ1, φ2, ..., φT} is a finite set of flat content segments. A container may contain only content nodes, in which case Ω = NULL; or it may contain only sub-container nodes, in which case Φ = NULL. In the case where neither Ω nor Φ is NULL, the current container contains some content that is not in any of its sub-containers. This is possible when some content is meaningful only in the upper-level segment, for instance a horizontal line that visually separates two sub-segments of the upper-level segment. Δ = {δ1, δ2, ..., δM} is a set of rules that defines the relationships among the sub-containers of the current node; for instance, δ1 may define a rule such as "sub-container ω1 precedes all other sub-containers".

Figure 3-3 shows the diagram of the CCT data model for the UBC home page presented in figure 3-2. Note that the circles in the diagram stand for the content nodes, while the rectangles stand for the container nodes. The hierarchy of the web page segmentation is reflected by the tree structure of the container nodes. The data model in figure 3-3 shows that the document container consists of five sub-containers: the header-bar container, the left-column container, the main-area container, the right-column container and the footer-bar container. The main-area container further consists of four sub-containers, each of which contains a text paragraph. The right-column container also consists of three sub-containers. Such a tree structure of the containers can help the user navigate through the content segments from a small-screen terminal device.
Figure 3-3: CCT data model for the UBC home page

The granularity of the segmentation is decided based on the size of the smallest screen supported, which means the iterative segmentation process can stop when the size of the content segment is comparable to the smallest screen size supported by design. However, a user with a large-screen device does not have to go through the whole segmentation hierarchy and display content segments at the finest granularity level. In fact, a user can choose a content segment at any granularity level that the screen of his device can support. For instance, instead of displaying the three articles in the main area one by one, all the contents in the main area segment can be displayed together in one page. In the extreme case, the whole web page can be displayed in one page if the terminal device is a personal computer. This flexible-granularity feature is also a significant advantage of the CCT data model over other web content adaptation approaches.

Now we can explain the difference between the CCT container and its moving-box analogue. Due to physical constraints, when we open an outer moving box, we can only see the inner boxes. If we want to get something out, we must open the boxes one by one down the hierarchy until we reach the innermost box which holds the item we want. However, such a physical constraint does not apply to the CCT containers. When we 'open' a CCT container, what we actually mean is to retrieve the content segment this container corresponds to, no matter how many sub-levels of containers it has. So we define the 'opening' operation in CCT as retrieving the content segment the container corresponds to, which includes all of its own content nodes as well as those of its descendant containers. To allow the user to inspect the sub-containers of the current container, we define another operation, 'expanding'. 'Expanding' a container means reading the labels of its immediate sub-containers. The 'expanding' operation is similar to opening an outer-level moving box to inspect what inner boxes are in it.

Not all containers are eligible for both the 'opening' and the 'expanding' operations. For instance, the containers which contain only content nodes are not 'expandable'. The conditions under which a container is 'openable' are a little more complex. Basically, all the containers which contain only content nodes are 'openable'. For those containers which have sub-containers, whether they are 'openable' or not is determined on the fly by comparing the granularity level of the content segments they correspond to with the display capability of the terminal device that makes the request. If the granularity level of a content segment is much coarser than the terminal device's display capability, the respective container is set to 'not openable' for this terminal device; otherwise it is 'openable'. In practice, we can specify that if the width of a content segment is more than twice the width of the terminal device's screen, or the height of the content segment is more than four times the height of the terminal device's screen, then the respective container is 'not openable'. The ratios given here are only estimates and should be optimized in practice by testing with different types of terminal devices.
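To make the 'expandable' and 'openable' tests concrete, the following is a minimal sketch of how a container node and these two checks might be modelled. It is not taken from the thesis implementation; the class name CctContainer, its fields and the 2x-width / 4x-height thresholds are illustrative assumptions based on the description above.

import java.util.ArrayList;
import java.util.List;

// Hypothetical container node; names and fields are assumptions for illustration.
public class CctContainer {
    String id;
    String label;
    int segmentWidth;    // pixel width of the content segment this container maps to
    int segmentHeight;   // pixel height of the content segment this container maps to
    List<CctContainer> subContainers = new ArrayList<>();

    // 'Expandable' is a static property: it only depends on whether the
    // container has sub-containers whose labels can be listed.
    boolean isExpandable() {
        return !subContainers.isEmpty();
    }

    // 'Openable' is decided at request time against the requesting device's
    // screen, using the rough thresholds suggested in the text. Containers
    // holding only content nodes are treated as always openable.
    boolean isOpenable(int screenWidth, int screenHeight) {
        if (subContainers.isEmpty()) {
            return true;
        }
        return segmentWidth <= 2 * screenWidth
                && segmentHeight <= 4 * screenHeight;
    }
}

In an actual system the thresholds, or a mapping from segment dimensions to granularity levels, would be tuned per device class rather than hard-coded as here.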
Thus, the determination of whether a container is 'openable' is dynamic, while the determination of whether a container is 'expandable' is static.

3.4 The Layered View of the CCT Data Model

In the foregoing CCT data model, we did not define the data format for the content nodes. Theoretically, the content nodes may contain any type of data, whether text or binary. In practice, the data on the web is organized as hypertext documents. As this thesis is targeted at web content adaptation, we will focus our discussion on the application of the CCT data model to HTML web pages.

Since an HTML web page has its own internal structure and can be expressed as a DOM [15] tree, we define a layered view of the CCT data model for a clearer understanding of how the CCT data model works with HTML web pages. That is, instead of viewing the content nodes and container nodes in the same tree, we view them as two different trees with mapping relations between the nodes of the two trees. From the layered viewpoint, we can divide the CCT data model into two sub-layers, the data layer and the semantic layer. The data layer contains the HTML DOM tree. The semantic layer contains the container tree, which is similar to the CCT tree described in the previous section except that all the content nodes are removed. Each container node maps to one or more sub-trees in the data layer, and the sub-trees mapped from a parent container node always contain the sub-trees mapped from its children containers. However, the sub-trees mapped from two sibling container nodes should not overlap with each other, because the segments of a web page do not overlap. Moreover, overlapping containers would violate the "well formed" rule of the XML syntax and make our embedded implementation impossible. We will discuss the implementation of the CCT data model in the next chapter. This layered structure provides a three-dimensional view of the CCT data model, which consists of two flat tree structures interconnected with each other.

In the layered view of the CCT data model, the container tree in the semantic layer can be expressed as a triple ω = {Ω, Φ, Δ}, in which the definitions of Ω and Δ remain the same as in the foregoing CCT data model. Ω = {ω1, ω2, ..., ωN} defines a finite set of containers, each of which can be recursively defined as ω = {Ω, Φ, Δ}. Δ = {δ1, δ2, ..., δM} is a set of rules that defines the relationships among the sub-containers of the current node. The definition of Φ is slightly different from that of the CCT data model: in the layered view, Φ defines a set of sub-trees of the content DOM tree rather than a set of content nodes. Φ = {φ1, φ2, ..., φT}, in which each φi defines a sub-tree in the content DOM tree. Any two sub-trees φi and φj are mutually exclusive, which means they do not share any common nodes. This is because, in a DOM tree, a node can have at most one parent node.

Figure 3-4 shows the layered view of the CCT data model. It consists of two layers, the data layer and the semantic layer. The HTML DOM tree of the web page lies in the data layer, and the container tree lies in the semantic layer. The arrows from the semantic layer to the data layer indicate the mapping relations between the container nodes and their respective DOM sub-trees. Each sub-tree in the data layer starts from the node that is pointed to by an arrow from the semantic layer.
For example, the root node of the container tree maps to the entire content DOM tree, the root node's first child container node maps to two sub-trees of the DOM tree in the data layer, and the other child node of the root of the container tree maps to another sub-tree of the DOM tree. Figure 3-4 also indicates that the sub-trees mapped from a parent container node always contain the sub-trees mapped from its children container nodes, as we have mentioned earlier.

Figure 3-4: The layered CCT data model

The layered CCT data model provides an alternative view of the CCT data model. It clearly separates the semantic structure of the document from the document content itself, while still maintaining the internal structure of the content. The container tree in the semantic layer conveys the semantic and structural information of the web page. The DOM tree in the data layer conveys the content data for end users. By applying a layered view, it is easier to understand how the CCT data model can be used for automatic web content adaptation: the information in the semantic layer is processed to locate and retrieve portions of the web content in the data layer.

CHAPTER 4  IMPLEMENTATION OF CCT WITH XHTML

4.1 Different Approaches to Implement the CCT Data Model

We have several choices of approach for implementing the CCT data model for HTML web pages. First, we could embed a set of special elements in an HTML web page to label the beginning and the end of the segments. Second, we could add attributes to the existing elements in an HTML web page to indicate whether or not these elements are used for labeling content segments. Finally, we could define a separate reference document to describe the hierarchical segmentation structure of an HTML web page. Each of these options has its pros and cons.

The embedded element approach conforms to the layered view of the CCT data model. A container is labeled by the starting and closing tags of a special container element. All the HTML elements between the starting tag and closing tag of a container element are regarded as the DOM sub-tree this container corresponds to. A container element can be nested inside another container element to form the hierarchical structure of the container tree. The advantage of this embedded approach is its simplicity of implementation. The mappings between the container tree and the content DOM tree are implied by the nesting relations of the container elements and the HTML elements. The disadvantage of this approach is that it introduces a new element into the HTML language, so further validation may be needed to avoid breaking the standard.

The attribute approach does not require dedicated elements to define the range of the containers. Instead, it utilizes the existing HTML elements to label the beginning and ending positions of the containers. To distinguish those HTML elements which are used for labeling containers, new attributes are needed. In the attribute approach, an HTML element may serve as a node both in the content DOM tree and in the container tree at the same time. Thus, the mapping relations are easy to define. The main shortcoming of the attribute approach is that the definition of a container is constrained by the structure of the HTML document, and sometimes it is even impossible to find a proper HTML element in the document to serve as the container for a particular segment.
For example, we may have two parallel text paragraphs which are marked up as "<p> paragraph 1 text </p> <p> paragraph 2 text </p>", and we want to define a container to contain these two paragraphs; since there is no HTML element that is the parent of exactly these two paragraph elements, it is impossible to define such a container with the attribute approach. Another concern with this approach is its inefficiency for processing. Because each HTML element is potentially a container, we have to test all the HTML elements to find out which of them are containers. Moreover, although the attribute approach does not require new container elements, it does require new attributes on the existing HTML elements to indicate whether they are containers. Thus, further validation is still necessary. Comparing the attribute approach with the embedded approach, the attribute approach does not offer any advantages over the embedded approach. On the contrary, it is less flexible and less efficient, so we rule this approach out of the options.

The separate reference document approach directly conforms to the layered view of the CCT data model. In this approach, the original HTML document is left untouched, and a separate document is created to describe the hierarchical structure of the containers and the mapping relationships between the container nodes and their respective sub-trees in the HTML document. The reference document can be generated as an XML document with the container elements forming the container tree. A container element can have attributes and child elements to describe its properties and define the HTML sub-tree it maps to. The separate reference document approach does not perform any modifications to the original HTML document. All information in the semantic layer of the CCT data model is kept in the separate reference document, so no validation is needed to keep the HTML standard unbroken. Moreover, because the reference document is dedicated to describing the container tree, it can be designed to be highly efficient for semantic layer processing without interfering with the data layer. The main problem of this approach lies in the difficulty of defining the mapping relations with the HTML DOM sub-trees. For a complex web page with hundreds of elements, the expression to locate a particular element in the HTML document can be very complex too. For this reason, we keep the separate reference document approach as an option and will study how to implement the CCT data model with this approach in future work. In this thesis, we will focus on the embedded approach due to its simplicity.

4.2 New Container Element in XHTML

Because the HTML syntax is not strict, we apply the Extensible HyperText Markup Language (XHTML) [16] for the embedded approach to implement the CCT data model. The advantages of XHTML include:

• XHTML is an official W3C recommendation.

• XHTML is applied by WAP 2.0 as the specified markup language for its Wireless Application Environment.

• XHTML is a stricter and cleaner version of HTML. Transcoding between HTML and XHTML is easy.

• XHTML strictly conforms to XML syntax, so it can be processed like an XML document (a brief sketch follows this list).

• XHTML is modular, so extensions can be added to XHTML using XML without breaking the XHTML standard.
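As a small illustration of the fourth point, the sketch below parses a CCT-annotated XHTML file with the standard JAXP DOM parser and lists its <ct> elements. It is a hedged example rather than the thesis's own code; the file name ubchomepage.xml is hypothetical.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CctParseDemo {
    public static void main(String[] args) throws Exception {
        // Because the CCT document is well-formed XHTML, any XML parser can read it.
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse("ubchomepage.xml");  // hypothetical file name

        // List every container element with its id and label attributes.
        NodeList containers = doc.getElementsByTagName("ct");
        for (int i = 0; i < containers.getLength(); i++) {
            Element ct = (Element) containers.item(i);
            System.out.println(ct.getAttribute("id") + ": " + ct.getAttribute("label"));
        }
    }
}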
Because XHTML is just a reformulation of HTML in XML, we can use conversion tools such as TIDY [18] to convert an HTML document into an XHTML document before applying the CCT data model to it. To implement the CCT data model, we introduce a new element named "container" for embedding into XHTML. The container element is used to define the range of a container. The tag <ct> is used to represent the container element. The definition of the container element is given below.

Definition

<ct id = "unique alphanumeric identifier"
    label = "index title"
    desc = "content summary text"
    gran = "level of content granularity">
</ct>

Element Specific Attributes

• id (required): This attribute defines a unique alphanumeric identifier for the container element. The id attribute is used to locate a container.

• label (optional): This attribute contains a short text serving as the index entry of the container. The label attribute should be meaningful and descriptive of the respective content in the container, because it will be used to generate the hierarchical index for the user interface.

• desc (optional): This attribute contains a summary text of the content in the container. It can be used for fast keyword search or displayed as complementary information to the label index.

• gran (optional): This attribute defines the granularity level of the content in the container. The granularity level can be defined according to the width and height of the respective content segment and the W3C's Composite Capability/Preference Profiles (CC/PP) [19] or the J2ME Profile [20], to indicate for what devices the content in this container is suitable. The value of the granularity level is determined during the segmentation process, and it is used during the adaptation process to determine whether a container is 'openable' for a particular terminal device. The lower the granularity level, the larger the screen size of the terminal device needed to view the content. For instance, granularity level '1' means the content segment is only suitable for rendering on the screen of a desktop computer with at least 800 * 600 graphical display resolution; granularity level '2' means the content segment is suitable for PDAs with at least 400 * 300 graphical display resolution; granularity level '3' means the content segment is suitable for smart phones with 200 * 150 graphical display resolution; granularity level '4' means the content segment is suitable for cell phones with 100 * 75 graphical display resolution. When the granularity level of a container is equal to or higher than the designated granularity level for a terminal device, this container is 'openable' for that terminal device; otherwise it is 'not openable'. Table 4-1 shows an example granularity definition.

Granularity Level | Segment Dimension      | Designated Device | 'Open' Eligible Devices
1                 | 800*600 to 1600*2400   | PC                | PC
2                 | 400*300 to 800*600     | PDA               | PC, PDA
3                 | 200*150 to 400*300     | Smart phone       | PC, PDA, Smart phone
4                 | 100*75 to 200*150      | Cell phone        | PC, PDA, Smart phone, Cell phone

Table 4-1: Example granularity definition

Contents

The <ct> tag encloses a chunk of XHTML code, which should be well formed according to the XHTML syntax.

Example

A CCT document example is given below in figure 4-1, based on the web page of figure 3-2 and its respective container tree structure shown in figure 3-3. The nested container elements form the container tree.
The omitted parts in the example are the HTML elements from the original web page, which form the content nodes of the CCT data model.

<html>
<head> ... </head>
<body>
  <ct id="0" label="header bar" desc="logo and site links" gran="3"> ... </ct>
  <ct id="1" label="left column" desc="site navigations" gran="3"> ... </ct>
  <ct id="2" label="main content" desc="main text and graphical content" gran="2">
    <ct id="3" label="Welcome to UBC" desc="photo of the two campuses" gran="3"> ... </ct>
    <ct id="4" label="UBC to Establish Okanagan Campus" desc="spotlight text" gran="3"> ... </ct>
    <ct id="5" label="Trek 2010: Green Paper" desc="news callouts" gran="3"> ... </ct>
    <ct id="6" label="More UBC News" desc="news bulletins" gran="3"> ... </ct>
  </ct>
  <ct id="7" label="right column" desc="Quick Find" gran="2">
    <ct id="8" label="Quick Find" desc="hyperlinks" gran="3"> ... </ct>
    <ct id="9" label="Search UBC" desc="search engines" gran="3"> ... </ct>
    <ct id="10" label="Wayfinding at UBC" desc="searchable campus maps" gran="3"> ... </ct>
  </ct>
  <ct id="11" label="footer bar" desc="more information about contact, copyright" gran="3"> ... </ct>
</body>
</html>

Figure 4-1: Container element coding example

4.3 Validation and Implementation Guidelines

Since we introduce a new element to XHTML in the embedded approach, we need to update the XHTML Document Type Definition (DTD) to reflect the change, so that an XHTML document with the CCT implementation can be correctly validated. To add the new container element to the XHTML DTD, we simply need to add the following lines to the DTD:

<!ELEMENT ct ANY>
<!ATTLIST ct id ID #REQUIRED>
<!ATTLIST ct label CDATA #IMPLIED>
<!ATTLIST ct desc CDATA #IMPLIED>
<!ATTLIST ct gran CDATA #IMPLIED>

In this thesis, we use CCT only for structuring the intermediate document. Thus, we can build our own DTD by extending the XHTML DTD with the new element we introduced and validate the intermediate CCT document against this DTD. There is no need to update the standard XHTML DTD, because the adaptation results for the end user will not contain the container elements. The <ct> tags will be removed from the content segment by the adaptive transcoding module after it is retrieved from the intermediate CCT document.

To implement the CCT data model in an XHTML document with the embedded approach, there are some rules of thumb to follow.

1. The segmentation rule: The segmentation of a web page should be based on both visual cues and logical cues. The segmentation process should proceed iteratively from the coarsest granularity level to the finest. It can stop when either there are no more cues available to further divide a segment or the segment has reached the predefined finest granularity level.

2. The well-formed rule: The positions for embedding the container elements are usually determined by the segmentation algorithm. However, it should be noted that embedding the container elements must not violate the well-formed rule of the XHTML syntax. The embedded container elements must be properly nested with the other XHTML elements.
If the position for embedding a container element determined by the segmentation algorithm violates the well-formed rule, it should be adjusted to another position that conforms to the well-formed rule.

3. The labels rule: As the labels of the containers will serve as the index for the user interface, they should be meaningful and descriptive. If the CCT document is designed by a human, the designer should choose short, descriptive text that best conveys the nature of the respective content segment. If the CCT document is generated automatically by a program, it is desirable that a brief summary of the content be generated with an artificial intelligence algorithm and used as the label of the container. However, until an algorithm that produces satisfying results is available, we can use the absolute position of a segment within its parent segment, together with its shape, as its label, for example "header stripe", "left column", "center box", etc. Finding techniques for effective content summarization is also left as future work for this thesis.

4.4 User Interface Design

To enable the user to efficiently traverse the embedded tree structure of the containers and retrieve the content segment at the desired granularity, we design a multilevel hierarchical index user interface. The end user can take advantage of this multilevel index to navigate a web page and locate the target content segment. The multilevel index can be generated dynamically as follows. At the beginning, all children containers of the root node (first-level containers) are listed, and the system then waits for the user's instruction. A container can be "expanded" or "opened" if applicable. If a container is "opened" by the user, its respective content DOM tree is retrieved and the process ends. If a container is "expanded" by the user, its immediate sub-containers are listed and the system waits for the user's instruction again. The process iterates until the user chooses to open a container. The pseudo code for this process is presented in figure 4-2.

Set current_node = root;
Loop {
    list_children(current_node);
    // wait for user's instruction
    Set select_node = the container the user selected;
    Set select_mode = the operation the user selected;
    If (select_mode == "open") {
        retrieve_content(select_node);
        End loop;
    } else {
        Set current_node = select_node;
    }
}

Figure 4-2: CCT container index user interface generation

Figure 4-3 shows the user interface for the example UBC homepage of figure 3-2.

Figure 4-3: User interface for the UBC homepage

Panel (a) in figure 4-3 shows the main index page that is returned when the web page is requested for the first time. In the index page, the user can open a container by tapping on the respective label, or expand a container by tapping on the plus symbol "+" after it when applicable. For those containers which are not 'expandable', no plus symbol is displayed after the label, for instance the "header bar", "footer bar" and "left column" containers in panel (a). For those containers which are not 'openable' for the current terminal device, their index labels are not underlined.
For instance, in panel (a) the "main area" container and the "right column" container are not underlined, because the supposed terminal device is a smart phone, which has a higher designated granularity level than the two containers. The sub-container index shown in panel (b) is returned when the "main area" container is expanded by tapping the '+' symbol after it. If the "more UBC news" container in panel (b) is opened by tapping on its index label, the content page shown in panel (c) is returned.

Additional navigational tools can be provided in the user interface to assist the user in moving around. For example, we may provide the following quick links at the bottom of each page:

• Left arrow "←": expands the previous sibling container of the current node if applicable, otherwise opens it if applicable.

• Right arrow "→": expands the next sibling container of the current node if applicable, otherwise opens it if applicable.

• Up arrow "↑": expands the parent container of the current node.

• Home link: goes to the initial index page.

For instance, suppose the current page is the content page shown in panel (c). The user can display the article in the container "Trek 2010" by tapping the left arrow; the user can go to the index page of panel (b) by tapping the up arrow; the user can go to the main index page of panel (a) by tapping the home link. Since there is no next sibling of the current container, the right arrow has no effect in this page.

4.5 Content Extraction from CCT Documents

With the CCT data model, we can apply XML technologies such as XPath and XQuery to easily extract the target content from the respective container, and it is easy to generate the container indexes described in the previous section (the user interface). For example, we can create an index of the containers by listing their label attributes. An XPath query like "ct/@label" returns the label attributes of all children containers of the current node.

To generate the index shown in figure 4-3 (b) from the CCT document of figure 4-1, we can evaluate the XPath query "//ct[@id = '2']/ct/@label" to get the labels of the children containers of the "main content" container. While the foregoing example can be regarded as "expanding" a container, the XPath query "//ct[@id = 'container id']" can be used to "open" a container and return the respective content segment of that container. For instance, to get the content displayed in figure 4-3 (c) from the CCT document of figure 4-1, the XPath query "//ct[@id = '6']" should be evaluated.

We can use XSLT to format the XPath query results and generate web pages suitable for display on the terminal devices. For example, the XSLT script shown in figure 4-4 can generate a container index page as shown in figure 4-5.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head>
        <xsl:copy-of select="//head/*"/>
      </head>
      <body>
        <h2><xsl:value-of select="//head/title"/></h2>
        <ul>
          <xsl:for-each select="//body/ct">
            <li><xsl:value-of select="@label"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

Figure 4-4: An XSLT script example for generating a container index page
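To produce an actual index page from such a style sheet outside of WBI, the standard JAXP transformation API can be used. The snippet below is a hedged, minimal sketch rather than the thesis's own code; the file names index-stylesheet.xsl, ubchomepage.xml and index.html are illustrative assumptions.

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class IndexPageGenerator {
    public static void main(String[] args) throws Exception {
        // Compile a style sheet like the one in figure 4-4 and apply it to a CCT document.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("index-stylesheet.xsl"));
        transformer.transform(new StreamSource("ubchomepage.xml"),
                new StreamResult("index.html"));
    }
}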
Figure 4-5: A container index page generated with XSLT

Similarly, we can use XSLT to generate and format a content page. Since the HTML elements for visual formatting are kept in the content nodes, the resulting content page will preserve the visual appearance of the original page if the browser supports HTML. Otherwise, transcoding may be needed to convert the content page into a coding scheme that is supported by the browser. Since XHTML is the standard markup language in WAP 2.0, we assume it is supported by most types of terminal devices. The XSLT script shown in figure 4-6 extracts the content of a sub-container of the right column container, and the result is the content page shown in figure 4-7.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head>
        <xsl:copy-of select="//head/*"/>
      </head>
      <body>
        <xsl:copy-of select="//ct[@id = '9']"/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

Figure 4-6: XSLT script for extracting a content node

Figure 4-7: A content page generated with XSLT

Note that we use the <xsl:copy-of> element instead of the <xsl:value-of> element in the example XSLT scripts to preserve the HTML elements of the original web page.

4.6 Extraction of the Embedded Container Tree

In the embedded CCT implementation with XHTML, the <ct> tags are mingled with other HTML tags, so there may be zero to many elements between two container elements. For example, we may want to make an HTML table the content node of the <ct id = "table"> container element, and we may also want to put the content of each row of the table in a <ct id = "row number"> container element. To keep the completeness of the table structure, the table element should be a child node of the <ct id = "table"> container element; to conform to the well-formed rule, the <ct id = "row number"> elements should be placed between the starting tag <tr> and the closing tag </tr> of the table row elements. The resulting CCT document will look like the example code in figure 4-8.

<ct id="table">
  <table>
    <tr> <ct id="row1"> ... </ct> </tr>
    <tr> <ct id="row2"> ... </ct> </tr>
  </table>
</ct>

Figure 4-8: <ct> tags embedded in HTML tags

In the above example, the <ct id = "row number"> container elements are descendants of the <ct id = "table"> element. They are children of the respective table row elements, which are children of the table element, which in turn is a child of the <ct id = "table"> container element. But from the layered viewpoint of the CCT data model, we do not care how many table row elements, table elements or any other HTML elements lie between two container elements in the hierarchy; we view all the container elements as a separate tree in the semantic layer.
Thus, those <ct id = "row number"> container elements are immediate children of the <ct id = "table"> container element in the container tree, and they are sibling nodes of each other. When we use an XPath query such as "./ct" to locate a <ct> element, we should clarify that the query is evaluated against the container tree only. Because there may be other elements between the current node and the target <ct> element in a DOM tree, we would have to use an XPath query like ".//ct[with no other container nodes between the current node and the target node]" on the DOM tree to express the same query as "./ct" on the container tree. Unfortunately, such a query cannot be expressed in XPath 1.0, because it requires universal quantifiers, which are not supported by XPath 1.0. The good news is that universal quantifiers will be supported in the coming XPath 2.0. There are then basically two methods to query a container element. The first method is to extract the container tree from the embedded CCT document with text processing tools first, and then apply an XPath query to the container tree to locate a container.
The HTTP response header and file content will be sent to the client directly unless certain rules are matched to trigger the CctXslProcEditor. In this application, we set the rule to be "the requested content is of CCT document type." If this rule is matched which means a CCT document is returned by the generator, an object of the CctXslProcEditor will be instantiated. This editor object will load a template of XSLT style sheet into the memory, and modify the style sheet based on the parameters retrieved by the generator from the original HTTP request header and passed through with WBI transaction data mechanism. Then the modified style sheet will be applied on the requested document coming from the MeglnputStream to generate proper result content which flows out through the MegOutputStream. The output of the CctXslProEditor is either a container index page or a content page depending on what kind of query to evaluate to the CCT document. To retrieve a document from the local file system, we take use of a WBI MEG bean named FileGenerator provided by the WBI API. The ServerFileGenerator simply analyzes the input HTTP request header, generates file path based on the request and forward the file path and the request to the FileGenerator bean. The FileGenerator will take care of getting the document from the local file system and put the content into the output steam. Figure 5-1 shows the UML diagram of the ServerFileGenerator.  66  com.ibm.wbi.protocol.http HKpCemtatot  }<—i  ca.ubc.ece.cct jaya.lang< String h^-  ca.ubc.ece.cct  ServerFileGenerator  •T) CctWebServerPlugin |  % filePath' String % requestPathPretixLength: int •  com.ibm.wbii  ^ handleRequestO. void ^ ServerF.'eGeneratorQ: void «i  l a  |«-  •  —  ; RequestEvent || Requestlnfo || RequestRejectedException j com.ibm.wbi.piutocol.http M  aH Documentlnfo  U  =H  coni.ibiii.wui.piutucol.hltp.buan8 FileGeneralor  java.io< 3°j PrirrtStream jauaJang  71 Exception |' Object 11 StringBuffer 11 System  Figure 5-1: ServerFileGenerator design  The CctXslProcEditor is responsible for performing the XSLT transformation on the CCT document to generate adaptive content. In our design, we use DOM and JAXP for realizing such functions. The CctXslProcEditor applies the DOM API in the org.w3c.dom package for online modification of the XSLT style sheet template and the JAXP API in the javax.xml.transform package for XSLT transformation on the input CCT document. Figure 5-2 shows the UML diagram of the CctXslProcEditor.  67  Luiti.itMii.wbi.|)rirtiiculJrtt|i  Http Editor  ca.ubc.ece.cct u| CctXslProcEditor -N CctWebServerPlugin!! ^ buMDocQ: Document ^ handleRequestQ: void  com.ibm.wbi |  IL  I _ i MagReader 11 MegWrrter j| tiequestf**itt \\ Requeatlnfo j[ RBque^RejecterJException j com.ibm.wbi.protocol.http I ____  Documentlnfo II HttpResponseHeader MnioJo  J,  J_  . FileHotFoundExceptlon~| |; IQEKceptlon!! | |~PrintStream~l |: Rtatttr] [ Writer^ I  "i Exceptional! Object 1|; String i | | StringBuffer 1 System 1 ' jav8.utll H  M^EaSmir»lpSi\ : jauax.xmLparseiit:  ParserConfiguratiortException \ • | DocamtatBaitdtr | j DocumtntBuildtrFactory \\ jauax.xnU Jrartsfor m  ~ J •] £ | |ltW*jMifftf| |BS<MJVC*11 I Transformer: j j Tran«formcrConfiguration£xceptton 11 Tran«fprmerExcep*iuii j j traastotiavtf actoiy,JL  N  ifavax.xrnl.transform.dom s M DOMSource | <,^Hiax.xrnUransfoi m stream  ;  _j  I:!! 
StroamResult 11 StrearoSourco 11 iOTg.w3c.donv j  W  Figure 5-2:  iM SAXE»ception |  CctXslProcEditor design  The XSLT style sheet template contains two sub templates, template ' A ' for generating the index pages, and template ' B ' for generating the content pages. For easy online modification, each template defines two parameters, the 'source' and the 'xpath'.  68  The source refers to the CCT file name along with its relative path under the web server root, which is needed to create href values of links. The xpath refers to the XPath expression for the interested container node, which is needed to extract proper content from the source file. The value of these two parameters will be modified by the CctXslProcEditor on the fly to set to the run time values. The default values in the stylesheet are for initial request with no queries. The XSLT style sheet source code is listed below. stylesheet.xsl <?xml v e r s i o n = " l . 0" e n c o d i n g = " I S O - 8 8 5 9-1"? > <xsl:stylesheet v e r s i o n = " l . 0 " xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <xsl:copy-of s e l e c t = " / / h e a d / c t / * " / > </head> <body> < x s l : c a l l - t e m p l a t e name="A"> <xsl:with-param name="source">url o f the source file</xsl:with-param> <xsl:with-param name="xpath" select="//body"/> </xsl:call-template> </body> </html> </xsl:template> <xsl:template name="A"> <xsl: pa ram name="source"x/xsl: param> <xsl:param name="xpath" s e l e c t = " / / b o d y " /> <h3><xsl:value-of select="//head/ct/@label"/></h3> <xsl:if test="name($xpath)='ct'"> <em><a href="{$source}?container=body">Go t o Main Index</a></em> <h4xxsl: v a l u e - o f select="$xpath/@label"/></h4> </xsl:if> <ul> <xsl:for-each s e l e c t = " $ x p a t h / c t " > <xsl:variable name="item" s e l e c t = " @ i d " /> < l i x a h r e f ="{$source} ?content=ct (@id={$item}) "> <xsl:value-of select="@ l a b e l "/x/a> <xsl:if t e s t = " c o u n t ( c t ) > 0"> <xsl:text> </xsl:text> <a href="{$source}?container=ct(@id={$item})">+</a> </xsl:if> </li> </xsl:for-each> </ul>  69  </xsl:template> <xsl:template name="B"> <xsl:param n a m e = " s o u r c e " > < / x s l : p a r a m > <xsl:param n a m e = " x p a t h " s e l e c t = " / / c t [ @ i d = 0 ' ] " / > < p > < e m X a h r e f =" { $ s o u r c e } ? c o n t a i n e r = b o d y " > G o t o M a i n Index</a></emx/p> <xsl:copy-of select="$xpath"/> </xsl:template> 1  </xsl:stylesheet>  stylesheet.xsl Figure 5-3: XSLT style sheet template  To correctly set the parameters for the XSLT style sheet, it is necessary to get the information such as the interested container id and what kind of operation on it. This information can be passed with the HTTP GET method. The GET method can pass key values pairs in UML requests. For example, the key is either "container" or "content", and the value is an identification expression for the container node of interest. When the key is "container", it means an index page of the children containers is requested; when the key is "content", it means a content page of current container is requested. The CCT identification expression is in the format "ct(@id = #)". The reason we use the round bracket instead of the bracket in an XPath expression is because the bracket will be misinterpreted by the Nokia WAP gateway simulator for other function. 
An example of such an HTTP request is: http://com-pc03:8088/web/ubchomepage.xml?container=ct(@id=2)

5.2 Extending the Adaptive Web Server for Automatic Content Adaptation

Comparing the CCT web server application architecture with the proposed CCT content adaptation framework, the ServerFileGenerator corresponds to the web server in the framework. The CctXslProcEditor corresponds to the adaptive transcoding module in the framework, although it lacks the markup language transcoding and image conversion functions. These functions can easily be accomplished by adding new editors to the system to further manipulate the output of the CctXslProcEditor; this is one of the advantages of applying WBI. The function of the proxy module in the framework is carried out by a default proxy MEG provided by WBI. Since the CCT documents are manually generated rather than automatically transformed from plain HTML documents, there is no MEG in this adaptive web server application corresponding to the content reformation module of the CCT content adaptation framework. However, this function can be realized by adding new editors that automatically transform an HTML web page into a CCT document with vision-based page segmentation algorithms, to achieve dynamic automatic content adaptation.

5.3 Test Bed Setup

To demonstrate the effectiveness of the proposed CCT data model, we set up a test bed for accessing the adaptive web server from heterogeneous terminal devices. The test bed applies a server-client architecture. The server is a PC running the adaptive web server application. The client can be either the Nokia mobile browser simulator or Microsoft's IE web browser running on another PC on the same Local Area Network (LAN) as the server. The IE web browser is typical web browsing software on a PC. According to the definition in table 4-1, we set the designated granularity for the IE client to '1'. The Nokia mobile browser simulator simulates the work of the web browser on a smart mobile phone. We set the designated granularity for the mobile browser simulator client to '3'. To enable communication between the web server and the mobile browser simulator, a Nokia WAP gateway simulator is also needed on the same machine as the mobile browser simulator. Figure 5-4 shows the configuration of the test bed.

Figure 5-4: Test bed configuration

Software used in the test bed includes:

• WINDOWS 2000
• WBI 4.5
• Internet Explorer (IE) 6.0
• Nokia Mobile Browser Simulator 4.0
• Nokia WAP Gateway Simulator 4.0
• Java SDK 1.5 beta
• Java Runtime Environment 1.5 beta

5.4 Experiment Results

We created two CCT documents for testing. The first one is the UBC homepage as shown in figure 3-2. The second one, which is more complex, is the YAHOO homepage. Some necessary modifications to the original HTML pages were made to comply with the CCT data model. First, as CCT documents are essentially XHTML documents, we cleaned the original HTML file with TIDY [18] to make it well-formed. Then, we embedded the <ct> tags into the proper places in the documents to reveal the hierarchical structure of the content. Finally, we changed the file type of the document to CCT.

5.4.1 Experiment with UBC Homepage

Figure 5-5 shows the result of browsing the UBC homepage from both the Nokia mobile browser simulator and the IE browser.
5.4.1 Experiment with the UBC Homepage

Figure 5-5 shows the result of browsing the UBC homepage from both the Nokia mobile browser simulator and the IE browser. Panel (a) shows the main index displayed on the mobile browser when the web page is requested for the first time; the main index page displayed on the IE browser is shown in panel (b). Comparing the two index pages, we can see that the "Main Content" label and the "Right Column" label are not hyperlinked in the page for the mobile browser, which means they are not 'openable' for this terminal device. This is because the granularity levels of the "Main Content" container and the "Right Column" container are both '2', while the designated granularity level for the smart phone is '3', which is higher than the granularity levels of the two containers. Thus, according to our definition of granularity levels in table 4-1, the "Main Content" container and the "Right Column" container are not 'openable' for this smart phone mobile browser. On the other hand, because the designated granularity level for the PC is '1', which is lower than the two containers' granularity levels, they are 'openable' and hyperlinked in the page for the IE browser. There is one more "Full Page" link in the main index page for IE that is not present on the mobile browser. This link corresponds to the root container, which contains the whole web page and has granularity '1'; since the designated granularity level of the PC is also '1', the whole page can be displayed by the IE browser on the PC. The granularity levels of the other three containers, "Header Bar", "Left Column" and "Footer Bar", are all '3', so they are 'openable' for both the IE browser and the mobile browser.

Each label in the index represents a CCT container. Some labels are followed by a plus symbol '+', which means that container is 'expandable'. Tapping on the plus symbol expands the container and generates an index page of its sub-containers. For example, panel (c) of figure 5-5 is displayed on the mobile browser if the plus symbol after the "Main Content" label is tapped; panel (d) shows the index page of the sub-containers of the "Main Content" container on the IE browser. Because the sub-containers of the "Main Content" container are all at granularity level '3', they are all 'openable' on both the mobile browser and the IE browser. Tapping on a hyperlinked label retrieves the container's respective content segment. For example, the content segment in panel (e) is displayed on the mobile browser by tapping on the "More UBC News" hyperlink in the index page of panel (c); panel (f) shows the same content page as panel (e) but in the IE browser.

Figure 5-5: The UBC homepage on the CCT adaptive web server, panels (a)-(f)
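The 'openable' and 'expandable' decisions described above come down to two simple comparisons. The following minimal Java sketch illustrates the rule; the Container class and its fields are hypothetical stand-ins for whatever representation the adaptation code actually uses.

import java.util.ArrayList;
import java.util.List;

public class GranularityRules {

    /** Hypothetical stand-in for a <ct> container node. */
    static class Container {
        int granularity;                         // e.g. 1 = whole page, 3 = finest segments
        List<Container> children = new ArrayList<Container>();
    }

    /** A container is 'openable' for a device when its granularity level is
        equal to or higher than the device's designated granularity level. */
    static boolean isOpenable(Container c, int designatedGranularity) {
        return c.granularity >= designatedGranularity;
    }

    /** A container is 'expandable' when it has sub-containers
        (the count(ct) > 0 test in template 'A'). */
    static boolean isExpandable(Container c) {
        return !c.children.isEmpty();
    }

    public static void main(String[] args) {
        Container mainContent = new Container();
        mainContent.granularity = 2;
        mainContent.children.add(new Container());     // e.g. "Welcome to UBC"

        System.out.println(isOpenable(mainContent, 3)); // false: not openable on the smart phone
        System.out.println(isOpenable(mainContent, 1)); // true: openable on the PC
        System.out.println(isExpandable(mainContent));  // true: it has sub-containers
    }
}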
One significant advantage of the CCT adaptive web server is that it gives users the freedom to choose a content segment at any granularity level supported by their terminal device, that is, any segment whose granularity level is equal to or higher than the designated granularity level of the device. For users with large-screen devices, it is not necessary to go through all the levels of sub-indexes to reach the desired content; the user can tap on the label hyperlink rather than the plus symbol at the appropriate level to get the whole content segment at once. For example, a user with a PDA at granularity level '2' can tap the "Right Column" label rather than the following plus symbol to get all the content of the right column in one page, as shown in panel (a) of figure 5-6.

Figure 5-6: Granularity hierarchy of the content in the right column container, panels (a)-(e)

Figure 5-6 panel (b) shows the container index of the right column, and panels (c), (d) and (e) show the respective content pages of the containers listed in the index of panel (b). In some cases, even though the user has a large-screen device, the user may still want to follow the multilevel indexes down to a content segment at a higher granularity level than the device requires, in order to quickly locate the target information.
This is particularly useful when the connection speed is slow, for instance over a dial-up connection with a notebook computer or PDA, because large amounts of irrelevant content can be filtered out by following the indexes straight to the target content.

5.4.2 Experiment with the Yahoo Homepage

Many of today's web pages rely on the <table> element to organize their layout, and the Yahoo homepage is a typical example of extensive use of tables for visual organization. Such use of tables introduces complexity into a web page; however, the adaptive Yahoo homepage example shows that the CCT data model can handle this complexity successfully. In this example, all the page layout structural information of the original page is kept. Thus, 'almost' the same web page as the original can be generated from the CCT document. We say 'almost' because some page information, such as the HTML header information, may be modified to suit the needs of the terminal device, but the main body of the page remains the same. The "full page" link in the main index page corresponds to the root container and allows the user to view the whole page from a PC or an equally powerful device. The drawback of keeping all the layout structures, such as tables, spacers and background images, is that they may cause trouble for less powerful devices trying to render the content correctly. However, this can be solved by adding more transcoding entities that further filter or modify such elements in the result according to the terminal device's profile, as sketched below.
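One such transcoding entity could be a further XSLT pass applied to the generated page. The sketch below is only an illustration of the idea, not part of the thesis implementation: it copies everything through unchanged, except that layout tables are unwrapped and a couple of presentational attributes are dropped. Like the style sheet in figure 5-3, it assumes the source elements are addressed without a namespace prefix.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy every node and attribute through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap layout tables: drop the table markup but keep its contents. -->
  <xsl:template match="table | tbody | tr | td | th">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Drop presentational attributes such as background images and colours. -->
  <xsl:template match="@background | @bgcolor"/>

</xsl:stylesheet>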
Figure 5-7 shows the test results of accessing the Yahoo homepage on the CCT adaptive web server from the IE web browser.

Figure 5-7: The Yahoo homepage on the CCT adaptive web server, panels (a)-(d)

Figure 5-7 panel (a) shows the main container index of the adaptive Yahoo homepage. The full page in panel (b) is displayed by tapping the "Full page" link in the main index page. Panel (c) shows the sub-container index of the right column, and the content of the "Marketplace" container is displayed in panel (d).

5.4.3 Conclusions on the Experiment Results

These experiments demonstrate the feasibility and effectiveness of the proposed CCT data model and the CCT content adaptation framework. They also demonstrate the following advantages of applying the CCT data model to web content adaptation.

1. Easy content generation: a CCT document complies with XHTML syntax, so it is as easy to create as a normal XHTML document.

2. Easy application implementation: the CCT application uses the classical web architecture; no changes are needed to the client or server architecture, and adaptive results are generated on the fly.

3. User awareness: the user is involved in the adaptation process by deciding the target content segments and the granularity level, so only data useful to the end user is delivered. The hierarchical index makes it easy for the user to navigate through the content, and the flexible opening and expanding operations allow the user to adjust the granularity.

4. Speed: due to the smaller size of each sub-page, transmission is less error-prone and pages load faster in the browser.

CHAPTER 6  CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

In this thesis, we have proposed a novel data model named Content Container Tree (CCT) for web content adaptation. The CCT data model organizes the semantics and the content data with an embedded tree structure. By "embedded tree", we mean that both the containers and the contents can be organized as independent tree structures, and that the container tree can be embedded into the content tree to indicate the mapping relations between the containers and the content segments. The container tree can be processed independently to generate multilevel indexes for navigating through the web content segments at various granularity levels.
A content segment at the proper granularity level can then be retrieved according to the terminal device's characteristics and the user's input.

We have also defined a layered view of the CCT data model, which consists of a semantic layer and a data layer. The semantic layer carries the semantic and logical information about the web content with the container tree, while the data layer carries the content data with the content tree. The layered view of the data model enables the architectural separation of high level adaptation and low level adaptation in the web content adaptation framework. We have defined high level adaptation as adaptation of the structure, navigation and functions of the web content, which should be processed in the semantic layer, and low level adaptation as adaptation of the syntax, format and encoding of the content, which should be processed in the data layer. The explicit separation of high level and low level adaptation results in the flexible and extensible modular web content adaptation framework that we also introduced in this thesis. This framework conducts the high level adaptation and the low level adaptation with different modules and organizes the intermediate adaptation result with the CCT data model.

In conclusion, the major differences between the CCT approach and other approaches to web content adaptation are the following.

1. The CCT approach is data oriented while other existing approaches are process oriented. Existing web content adaptation approaches mostly focus on finding and applying particular techniques to transform a web page based on the terminal device's characteristics. The CCT approach, on the other hand, focuses on defining a data model that carries the necessary information and data for adaptively generating web pages according to the user's preferences and the terminal device's characteristics. By using the CCT as the intermediate data structure, the content adaptation process can be divided into two steps: the first is to transform the original web page to conform to the CCT data model; the second is to retrieve the content segments at the desired granularity levels from the intermediate CCT document and perform further transcoding if necessary.

2. The CCT approach supports multilevel granularity of the web content, while other approaches have only one level of content granularity. Existing approaches define a single granularity level for the content according to the terminal device's screen size. For instance, in Microsoft's thumbnail view browsing system, the dimension of the sub-pages is pre-determined by the screen size of the supported PDA, but these sub-pages may not be suitable for viewing on other devices with larger or smaller screens. The CCT approach, on the other hand, supports multilevel granularity; the highest granularity level is determined by the smallest screen size supported by the system. The hierarchy of the segments at different granularities can be used to generate multilevel indexes that let the user navigate around. A user with a small-screen device can choose segments at a higher granularity level to display, while a user with a large-screen device can choose segments at a lower granularity level. The user with a large-screen device also has the freedom to follow the hierarchical index to a higher granularity level to quickly locate the target content and save downloading time.
3. The CCT approach is dynamic while other approaches are static. Existing web content adaptation approaches usually pre-generate a series of sub-pages from the original web page and wait for the user to choose among them. For instance, the Digestor system [5] re-authors web documents through a series of transformations according to the type of the terminal device and links the resulting individual sub-pages. Because the user will not, in most cases, access all of these sub-pages, processing power is wasted generating sub-pages that are never accessed. With the CCT approach, by contrast, a result page is generated on the fly only when it is requested. After transforming the original web page into the intermediate CCT document, the CCT adaptation system generates an index page to deliver to the terminal device and waits for instructions from the user. If the user requests to expand a container, the CCT system queries the CCT document to generate an index page of the sub-containers of the container concerned. If the user requests to open a container, the CCT system retrieves the respective content segment from the intermediate CCT document to generate a content page. Because a result page is generated only when it is requested, no processing power is wasted on sub-pages that are never accessed, so the CCT approach is more efficient.

6.2 Future Work

Our work in this thesis was mainly focused on the design of the CCT data model and the CCT adaptation framework. To achieve automatic web content adaptation, we also need to devise algorithms that can automatically analyze a web page and transform it into a CCT document. There are two main challenges in this work. First, the algorithm should be able to perform iterative web page segmentation based on visual cues as well as logical cues. Second, the algorithm should be able to summarize the content segments to generate meaningful and descriptive labels for the containers. There is already a body of related work [11-14] on the first challenge that we can refer to. The second challenge is harder and needs more in-depth study. Text summarization techniques such as those described in [4] can be applied to small, text-intensive content segments, but for a large segment with several sub-segments and multimedia content, new techniques must be developed to produce a summary that best describes the content of that segment.

In this thesis, we took the embedded tree approach to implement the CCT data model. In future work, we should also try other approaches to CCT implementation,
Other possible future works of this thesis include developing authoring tools to assist manually creation of CCT documents; comparing the effectiveness and efficiency of the CCT adaptation framework with other web content adaptation approaches; and analyzing the performance of the CCT content adaptation system in the terms of speed, usability and content fidelity.  85  B I B L I O G R A P H Y  [1] A. Pashtan, S. Kollipara, M . Pearce, Adapting content for wireless web services. Internet Computing. IEEE, Volume 7, Issue 5, Sept.-Oct. 2003, page: 79- 85. [2] B. Bederson, and J. Hollan, Pad++: A zooming graphical interface for exploring alternate interface physics. In Proc. A C M User Interface Software and Technology '94, pp. 17-26. A C M Press [3] J. Hsu, W. Johnston, and J. McCarthy, Active outlining for HTML documents: An Xmosaic implementation. In Proceedings of the 2nd International conference on World Wide Web. 1994 Chicago, IL. [4] O. Buyukkokten, O. Kaljuvee, H. Garcia-Molina, Efficient web browsing on handheld devices using page and form summarization. A C M transactions on information systems, Volume 20, Issue 1, A C M Press, January 2002. [5] T. W. Bickmore, B. N . Schilit, Digestor: Device-independent access to the World Wide Web. In Proceedings of the 6th International World-Wide Web Conference, 1997 [6] Yu Chen, Weiying Ma, Hongjiang Zhang, Detecting web page structure for adaptive viewing on small form factor devices, In Proceedings of the twelfth international conference on World Wide Web, May 2003, Budapest, Hungary [7] Anita W. Huang, Neel Sundaresan, Aurora: a conceptual model for web-content adaptation to support the universal usability of web-based services. In Proceedings on the 2000 A C M conference on Universal Usability Arlington, Virginia, United States  86  [8] Jinlin Chen, Baoyao Zhou, Jin Shi, Function-Based Object Model Towards Website Adaptation, In Proceedings of the tenth international conference on World Wide Web, May 2001, Budapest, Hungary [9] S.J. Lim, Y.K. Ng, An Automated Approach for Retrieving Hierarchical Data from HTML Table. In ProcCIKM'99, pp466-474. [10] D.W. Embley, Y. Jiang, Y.K. Ng, Record-Boundary Discovery in Web Documents. In Proceedings of SIGMOD'99, 1999, pp467-478. [11] Y.D. Yang, and H.J. Zhang, HTML Page Analysis Based on Visual Cues. In: 6th International Conference on Document Analysis and Recognition, Seattle, The United States, Sept. 10-13,2001. [12] D. Cai, S. Yu, J. R. Wen, W. Y. Ma, VIPS: a vision based page segmentation algorithm. Microsoft Technical Report, MSR-TR-2003-19, 2003 [13] D. Cai, S. Yu, J. R. Wen, W. Y. Ma, Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In the Proceedings of twelfth World Wide Web conference (WWW 2003), Budapest, Hungary, May 2003. [14] Y. Y. Tang, M . Cheriet, J. Liu, J. N . Said, and C. Y. Suen, Document Analysis and Recognition by Computers, Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing Company, 1999. [15] World Wide Web Consortium, Document Object Model (DOM), Level 2 HTML Specification, version 1.0, January 2003; http://www.w3.org/TR/DOM-Level-2-HTML/ [16] World Wide Web Consortium, The Extensible Hyper Text Markup Language, version 1.0, January, 2000; http://www.w3.org/TR/xhtml 1 /. 
[17] IBM Research, Web Intermediaries (WBI), http://www.almaden.ibm.com/cs/wbi/
[18] Dave Raggett, HTML TIDY, http://www.w3.org/People/Raggett/tidy/
[19] W3C, Composite Capability/Preference Profiles (CC/PP): A user side framework for content negotiation, 16 July 2003, http://www.w3.org/TR/NOTE-CCPP/
[20] Sun Microsystems, Mobile Information Device Profile (MIDP), July 2003, http://java.sun.com/products/midp
[21] W3C Recommendation, Extensible Markup Language (XML), February 2004, http://www.w3.org/TR/2004/REC-xml-20040204/
[22] W3C Recommendation, XSL Transformations (XSLT), November 1999, http://www.w3.org/TR/xslt
[23] W3C Recommendation, XML Path Language (XPath), November 1999, http://www.w3.org/TR/xpath
[24] W3C Recommendation, The Extensible HyperText Markup Language (XHTML), January 2000, http://www.w3.org/TR/xhtml1
[25] Sun Microsystems, Java API for XML Processing (JAXP), http://java.sun.com/xml/jaxp/
[26] WAP Forum, Wireless Application Protocol (WAP), www.wapforum.org
