1. SITE-SPECIFIC WRAPPERS: They extract information from a specific web page or family of web pages.
2. GENERIC WRAPPERS: They can be applied to almost any page, regardless of its specific structure. We refer to the data to be extracted by the wrapper with the term "Structural Tokens".
Semiautomatic wrappers use several techniques based on query languages, labeling tools, grammar rules, NLP, and HTML tree processing.
An extractor uses user-specified wrappers to convert web pages into database objects, which can then be queried. Specifically, hypertext pages are treated as text, from which site-specific information is extracted in the form of a database object. Another human-assisted approach, combining a query language with labeling tools, is a component called a provider, on which the user performs a sample query and then marks the important elements in the resulting pages.
Lixto uses a declarative extraction language called Elog. The extractor can use Elog files to create a "Pattern Instance Base" for the different attributes contained in an HTML page. It extracts information from the HTML page and transforms the pattern instances into XML documents.
Another category of semiautomatic wrappers is based on a tree representation of the HTML page. One such system allows access to semistructured data exposed through HTML interfaces: a tree representation of the HTML document is used to direct the extraction process, and simple extractors apply extraction rules written in the "Qualified Path Expression Extractor Language." W4F (World Wide Web Wrapper Factory) uses different sets of rules for wrapper generation; its retrieval rules define the loading of web pages from the Internet and their transformation into a tree representation. XWRAP Elite provides a graphical user interface for the semiautomatic generation of wrappers for Web data sources, again based on a tree representation. The toolkit includes components for object and element extraction, filter interface extraction, and code generation. It performs automatic normalization and removal of possible errors contained in the document. Region extraction is based on wizards that create rules for the discovery and extraction of objects of interest and their elements; the user can choose among several sets of object extraction and element extraction heuristics.
All the systems presented so far are semiautomatic, meaning that human assistance is required at some point in their operation. Other systems proposed in the literature automate the process of information extraction. According to the record-boundary discovery approach for Web documents, the structure of a document is captured as a tree of nested HTML tags.
OMINI is a fully automated extraction system. It parses web pages into tree structures and performs object extraction in two stages. In the first stage, the smallest subtree that contains all the objects of potential interest is located. In the second, the correct object separator tags that can effectively separate the objects are found.
MDR proposes a technique based on two observations about data records on the web. The first observation is that a group of data records containing descriptions of a set of similar objects is typically presented in a particular region of a page and formatted using similar HTML tags. The HTML tags of a page are regarded as a string; therefore, a string matching algorithm can be used to find those similar HTML tags. The second observation is that a group of similar data records placed in a specific region is reflected in the tag tree by the fact that the records fall under one parent node, which must be found. The proposed technique is able to mine both contiguous and noncontiguous data records in a web page.
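As a toy illustration of the first observation only, and not of MDR's actual algorithm, the following Python sketch scores the similarity of two tag sequences with a normalized matching ratio, standing in for the string matching MDR employs:

    # Similar data records tend to be rendered with similar HTML tag
    # sequences; difflib's ratio stands in for an edit-distance match.
    from difflib import SequenceMatcher

    def tag_similarity(tags_a, tags_b):
        """Return a similarity score in [0, 1] between two tag sequences."""
        return SequenceMatcher(None, tags_a, tags_b).ratio()

    # Two records from the same result list usually share a tag skeleton.
    record1 = ["tr", "td", "a", "td", "span"]
    record2 = ["tr", "td", "a", "td", "b", "span"]
    print(tag_similarity(record1, record2))  # high score, ~0.91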
SYSTEM DESCRIPTION
The proposed system is composed of two modules, which we call the transformation module and the extraction module. These are subdivided into components, each responsible for a different task. The overall procedure that extracts the information from the data source is performed in three distinct phases. During the first phase, the HTML document is fed into the transformation module, whose components generate a tree. The second phase aims at discovering and segmenting the region of the tree in which the structural tokens are located. The third phase concludes the operation by mapping the selected tree nodes to elements in the initial HTML web document.
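The Python skeleton below sketches how the three phases hand data to one another; every function name is an illustrative placeholder, not part of the system's actual interface:

    def transform_to_tree(html):             # phase 1: parse into a tag tree
        ...

    def select_terminal_nodes(tree):         # phase 1: collect the leaf nodes
        ...

    def segment_terminal_nodes(terminals):   # phase 2: cluster into tokens
        ...

    def map_segments_to_elements(html, segments):  # phase 3: back to content
        ...

    def run_wrapper(html):
        tree = transform_to_tree(html)
        terminals = select_terminal_nodes(tree)
        segments = segment_terminal_nodes(terminals)
        return map_segments_to_elements(html, segments)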
PREPARATION PHASE:
The first component is the "Validation, Correction, and XHTML Generation Component". It performs a syntactical correction of the source HTML by transforming it into XHTML. This is necessary because, owing to the leniency of HTML parsing in modern web browsers, a major portion of web pages are not well formed: data sources often contain invalid tags, or their tags are placed in the wrong manner. Therefore, using this component to clean and normalize the HTML page is imperative. The cleaned and normalized page is then fed into the "Tree Transformation and Terminal Node Selection Component", which generates a tree representation of the page. The root of this tree corresponds to the whole web document, the intermediate nodes represent HTML tags that determine the layout of the page, and the terminal nodes correspond to visual elements on the web page, namely images, links, and text. Once the tree construction is completed, the terminal nodes are selected.
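As a minimal sketch of this component, assuming well-formed input (i.e., the output of the correction step) and using only the Python standard library, a parser can build such a tree and collect the terminal nodes:

    from html.parser import HTMLParser

    class Node:
        """One tree node; terminal nodes carry the visual content."""
        def __init__(self, tag, parent=None):
            self.tag, self.parent, self.children = tag, parent, []

    class TreeBuilder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.root = Node("document")
            self.current = self.root
            self.terminals = []                 # leaves: text, images, links

        def handle_starttag(self, tag, attrs):
            node = Node(tag, self.current)
            self.current.children.append(node)
            if tag == "img":                    # void element: keep as a leaf
                self.terminals.append(node)
            else:
                self.current = node

        def handle_endtag(self, tag):
            if self.current.tag == tag:         # guard against stray end tags
                self.current = self.current.parent

        def handle_data(self, data):
            if data.strip():                    # skip whitespace-only text
                leaf = Node("#text", self.current)
                leaf.text = data.strip()
                self.current.children.append(leaf)
                self.terminals.append(leaf)

    builder = TreeBuilder()
    builder.feed("<html><body><p>Hi <a href='#'>link</a></p></body></html>")
    print([n.tag for n in builder.terminals])   # ['#text', '#text']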
SEGMENTATION PHASE:
This phase accomplishes the most important task of the system: the extraction of the structural tokens. We perform a segmentation of the selected terminal nodes such that there is a one-to-one correspondence between each extracted segment and the set of nodes representing elements of the same structural token.
The Nodes Comparison Component is responsible for calculating the N x N terminal-node similarity matrix S that will be used for clustering these nodes. The element S[i, j] expresses the similarity between the terminal nodes n_i and n_j. For the purposes of clustering, only the similarity of adjacent nodes needs to be considered; it can be derived from S through the elements S[i, i+1], i ∈ [1, N-1]. The output of this component is therefore a list s = (s_1, s_2, …, s_{N-1}).
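The exact similarity measure is derived from the nodes' positions in the tag tree; as a hedged stand-in, the sketch below scores two terminal nodes by the fraction of their root-to-leaf tag paths they share, reusing the Node objects from the earlier sketch:

    def tag_path(node):
        """Sequence of ancestor tags from the root down to the node."""
        path = []
        while node is not None:
            path.append(node.tag)
            node = node.parent
        return list(reversed(path))

    def similarity(n_i, n_j):
        """Score in [0, 1]: length of the common path prefix, normalized."""
        a, b = tag_path(n_i), tag_path(n_j)
        common = 0
        for x, y in zip(a, b):
            if x != y:
                break
            common += 1
        return 2.0 * common / (len(a) + len(b))

    # Only adjacent pairs are needed for the clustering step:
    # s = [similarity(terminals[i], terminals[i + 1])
    #      for i in range(len(terminals) - 1)]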
The Hierarchical Clustering Component (HCC) performs a one-dimensional hierarchical clustering. Its purpose is to select a subset of the terminal nodes generated in phase one or, equivalently, to locate a subtree of the initial tree. This subtree is selected to correspond to the region of the web document containing the structural tokens. The subset T_N = {1, …, N} of the natural numbers is hierarchically divided into clusters, where each index i ∈ T_N represents the terminal node n_i. The HCC calculates a hierarchy of clusters in T_N. Let us denote by T_i^l the ith cluster at level l of this clustering. Every such cluster is a subset of T_N consisting of consecutive natural numbers, which may itself be subdivided into similar subsets of consecutive natural numbers at the next level of the hierarchy. This hierarchical clustering of T_N corresponds to a clustering of the terminal nodes through the node indices; thus, the segment T_i^l corresponds to the set of nodes {n_j : j ∈ T_i^l}.
Groups of adjacent nodes form clusters according to the similarity measure calculated by the previous component. The rule required for two nodes n_i, n_j to belong to the same cluster C or, equivalently, for the respective indices i, j to belong to the same cluster of T_N at level l, is:
n_i, n_j ∈ C ⇔ s(n_k, n_{k+1}) ≥ l for all k = i, i+1, …, j-1.    (1)
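Rule (1) translates directly into code: at level l, the node sequence is cut wherever an adjacent-pair similarity drops below l. A minimal sketch, with 0-based indices for simplicity:

    def clusters_at_level(s, l):
        """s: adjacent similarities, s[k] = s(n_k, n_k+1); returns the
        clusters at level l as lists of consecutive node indices."""
        clusters, current = [], [0]
        for k, sim in enumerate(s):
            if sim >= l:
                current.append(k + 1)       # node k+1 joins node k's cluster
            else:
                clusters.append(current)    # cut between nodes k and k+1
                current = [k + 1]
        clusters.append(current)
        return clusters

    print(clusters_at_level([0.9, 0.9, 0.2, 0.8], l=0.5))
    # [[0, 1, 2], [3, 4]]

Sweeping l over the distinct similarity values yields the full hierarchy of clusterings.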
The Cluster Evaluation and Target Area Discovery Component discovers which level in the hierarchical clustering of indices is the "cut-off" level. To discover it, we select the level that maximizes a separation criterion while requiring that its largest cluster contain more than 20 percent of the total number of indices. The separation criterion is constructed using the variance, the cluster compactness, and the cluster separation.
The condition that at least 20 percent of the total number of terminal nodes belong to the largest cluster of a given level stems from the hypothesis that Web sources contain at least this percentage of useful information. Once a level has been selected as the cut-off level, the cluster with the maximum number of elements is selected, since it corresponds to the target area.
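A hedged sketch of this selection follows; separation is a placeholder for the variance-based criterion described above, and clusters_at_level is the function from the previous sketch:

    def choose_cutoff(s, levels, separation, min_fraction=0.2):
        """Pick the level maximizing the separation criterion among the
        levels whose largest cluster holds >= 20 percent of all nodes."""
        n = len(s) + 1                          # number of terminal nodes
        best_level, best_score = None, float("-inf")
        for l in levels:
            clusters = clusters_at_level(s, l)
            if max(len(c) for c in clusters) < min_fraction * n:
                continue                        # violates the 20% hypothesis
            score = separation(clusters, s)
            if score > best_score:
                best_level, best_score = l, score
        return best_level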
The Boundary Selection Component further separates the structural tokens in the region C_o. The heart of this procedure is to locate a new level l' inside the interesting region C_o. Once l' is located, a clustering similar to the one performed by the HCC is carried out only for the nodes in C_o, using l' as the new cut-off level substituted in (1).
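In code, this amounts to re-applying rule (1) to the adjacent similarities inside C_o only; here l_prime is simply a parameter, its choice following the procedure just described:

    def split_target(s, target_indices, l_prime):
        """Split the target cluster C_o (given as a list of consecutive
        node indices) into structural tokens at the finer level l_prime."""
        lo, hi = target_indices[0], target_indices[-1]
        inner = s[lo:hi]                    # similarities within C_o only
        tokens = clusters_at_level(inner, l_prime)
        return [[i + lo for i in token] for token in tokens]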
INFORMATION RETRIEVAL PHASE:
Mapping the extracted nodes to the corresponding elements in the web page, and thereby retrieving the desired information, is the final step, performed here by the "Information Extraction Component". This component, after parsing the initial web page, extracts the desired information according to the segmentation achieved in the previous phase.
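As a final sketch: given the segments of terminal-node indices produced by the previous phase, extraction reduces to collecting the content each node carries (here the text attribute set by the tree-building sketch; image and link nodes would contribute their attributes instead):

    def extract(segments, terminals):
        """Return one record per structural token: the texts of its nodes."""
        return [[getattr(terminals[i], "text", "") for i in seg]
                for seg in segments]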
CONCLUSION
A novel, fully automated Web wrapper was presented. The main characteristic of the proposed wrapper is that, in contrast with most related work, it does not require any human assistance or training phase. The main innovation and contribution of this system consist in introducing a signal-wise treatment of the tag structural hierarchy and using hierarchical clustering techniques to segment web pages into structural tokens. The importance of such a treatment is significant, since it permits abstracting away from the raw tag-manipulating approach the other systems use. A thorough comparison with two state-of-the-art fully automated systems, namely OMINI and MDR, revealed the higher performance of our system. The comparison was performed on a large set of real web pages covering a variety of layout structures and templates, and the evaluation used two widely accepted criteria, recall and precision; STAVIES outperformed the other two systems in both. STAVIES executes in real time, requiring no more than 0.4 seconds even for web pages of high complexity on a machine of typical configuration. STAVIES was further tested successfully on more than 63,000 HTML pages from 50 different Web data sources, achieving excellent performance.