Abstract (eng)
Over the past few years, numerous attempts to categorize webpages automatically have been described. Such categories are, for example, university websites, company websites, online shops, web search engines, and so on. Many different approaches have been proposed to achieve this, such as analysing the hyperlink structure within a website and the links to and from a webpage, or the information contained in a webpage's meta tags. This kind of categorization is mainly intended to help search engine operators on the World Wide Web (WWW) present categorized search results. In addition, considerable effort has gone into automatically summarizing webpages in order to describe them as compactly and meaningfully as possible.
The present work also deals with the contents of webpages, but not with the goal of summarizing or understanding them; instead, it produces a so-called "fingerprint" of each page. A fingerprint is an abstract representation of a webpage based on selected components of its underlying Hypertext Markup Language (HTML) code. Using the fingerprint of a given webpage, a computer shall be able to recognize that webpage automatically, as well as webpages with comparable contents that are to be regarded as alike, simply by comparing the fingerprints of the webpages in question.

The application scenario of this approach is automated navigation through webpages to extract data. There is a great number of webcrawlers, often called wrappers, whose task is to navigate automatically through different webpages and extract information of interest. The proposed application is to be viewed as an extension to such a wrapper that gives it the ability to store representations of the webpages it has to deal with as fingerprints. Generally, a wrapper has to handle several different webpages in a strict order and has well-defined navigation or extraction rules for each of these webpages. If the sequence of webpages encountered at execution time does not correspond to the scheduled order, the intended navigation or extraction mechanisms cannot be executed at that point. To keep the wrapper running, a fingerprint of the "wrong" webpage is created and compared with the fingerprints of those webpages previously stored as "known webpages". This "pool of known webpages" is created by the developer while building the wrapper and contains examples of the webpages the wrapper has to deal with.
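To make the fingerprint idea concrete, the following is a minimal sketch in Java (the language of the implemented application). It represents a fingerprint simply as frequency counts of a few selected HTML tags; the actual tag set and representation used in this work may differ, and the class and tag names here are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: a fingerprint as frequency counts of selected HTML components.
public class Fingerprint {
    // Hypothetical selection of tags; the thesis chooses its own set of HTML components.
    static final String[] TAGS = {"form", "input", "table", "a", "img"};

    // Count opening occurrences of each selected tag in the raw HTML.
    public static Map<String, Integer> of(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tag : TAGS) {
            Matcher m = Pattern.compile("<" + tag + "\\b",
                    Pattern.CASE_INSENSITIVE).matcher(html);
            int n = 0;
            while (m.find()) n++;
            counts.put(tag, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        String page = "<html><body><form><input type='text'>"
                    + "<input type='submit'></form>"
                    + "<a href='/book'>Book</a></body></html>";
        System.out.println(Fingerprint.of(page));
        // prints {form=1, input=2, table=0, a=1, img=0}
    }
}
```

Because such a representation ignores textual content, two pages of the same kind (e.g. two search-result pages with different hotels listed) yield similar fingerprints.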
A scoring system then determines which of the webpages contained in this pool has the most similarities to the webpage being compared, so that the navigation and extraction mechanisms intended for that kind of webpage can be activated.
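The best-match step can be sketched as follows, again assuming the tag-count representation above. Here the score is simply the summed absolute difference of tag counts, and the pool entry with the lowest score wins; the actual scoring system of this work is more elaborate, so this is an illustrative assumption only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: pick the "known webpage" whose fingerprint is closest to the unknown one.
public class FingerprintScorer {
    // Lower score = more similar; sum of absolute tag-count differences.
    public static int score(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> tags = new HashSet<>(a.keySet());
        tags.addAll(b.keySet());
        int diff = 0;
        for (String t : tags) {
            diff += Math.abs(a.getOrDefault(t, 0) - b.getOrDefault(t, 0));
        }
        return diff;
    }

    // Return the name of the pool entry with the best (lowest) score.
    public static String bestMatch(Map<String, Map<String, Integer>> pool,
                                   Map<String, Integer> unknown) {
        String best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : pool.entrySet()) {
            int s = score(e.getValue(), unknown);
            if (s < bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical pool: fingerprints of two known page types.
        Map<String, Map<String, Integer>> pool = new HashMap<>();
        pool.put("searchForm", Map.of("form", 1, "input", 4, "table", 0));
        pool.put("resultList", Map.of("form", 0, "input", 0, "table", 3));
        // An unknown page that resembles the search form.
        Map<String, Integer> unknown = Map.of("form", 1, "input", 3, "table", 0);
        System.out.println(bestMatch(pool, unknown)); // prints "searchForm"
    }
}
```

The wrapper would then invoke the navigation and extraction rules registered for the winning page type.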
Creation and comparison of fingerprints were implemented as a Java application, and several tests have shown that the desired functionality has been achieved: every webpage comparison produced the correct result. Nevertheless, the application can be improved in many ways, mainly by considering more HTML components of webpages and by optimizing the scoring system.
All analyses and tests were based on eight websites that enable their users to book hotel rooms online.
The developed application is attached to this work.