STAVIES: A System for Information Extraction from unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques


ABSTRACT

        In this paper a fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need to go beyond the concept of "human browsing." The World Wide Web is today the main repository of all kinds of information and has so far been very successful in disseminating information to humans. Automating the process of information retrieval enables further utilization by targeted applications. This paper presents a fully automated scheme for creating generic wrappers for structured Web data sources. The key idea in this system is to exploit the format of a Web page in order to discover its underlying structure and, from that structure, infer and extract the pieces of information it contains. The system first identifies the section of the Web page that contains the information to be extracted and then extracts it using clustering techniques and other tools of statistical origin. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system is the introduction of a signal-wise treatment of the tag structural hierarchy and the use of hierarchical clustering techniques to segment the Web pages. Such a treatment is significant because it abstracts away from the raw tag-manipulating approach. The system's performance is evaluated in real-life scenarios by comparing it with two state-of-the-art existing systems, OMINI and MDR.
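To make the signal-wise treatment concrete, the following Python sketch illustrates the general idea under stated assumptions; it is not the authors' implementation. It records the nesting depth of every text-bearing leaf in the HTML tag tree as a one-dimensional signal and groups the leaves with agglomerative (hierarchical) clustering, here using SciPy's Ward linkage. The class and function names, and the fixed cluster count, are illustrative choices.

    from html.parser import HTMLParser

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # Void elements never receive an end tag, so they must not affect depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}

    class LeafDepthCollector(HTMLParser):
        """Record (depth, text) for every non-empty text node in the page."""

        def __init__(self):
            super().__init__()
            self.depth = 0
            self.leaves = []  # list of (depth, text) pairs

        def handle_starttag(self, tag, attrs):
            if tag not in VOID_TAGS:
                self.depth += 1

        def handle_endtag(self, tag):
            if tag not in VOID_TAGS:
                self.depth = max(0, self.depth - 1)

        def handle_data(self, data):
            text = data.strip()
            if text:
                self.leaves.append((self.depth, text))

    def segment_page(html_source, max_segments=5):
        """Group the page's text leaves into at most max_segments clusters."""
        collector = LeafDepthCollector()
        collector.feed(html_source)
        depths = np.array([[float(d)] for d, _ in collector.leaves])
        # Agglomerative clustering on the depth signal: leaves at similar
        # nesting levels fall into the same segment, so repeated records in a
        # list- or table-like region tend to share a cluster.
        labels = fcluster(linkage(depths, method="ward"),
                          t=max_segments, criterion="maxclust")
        segments = {}
        for label, (_, text) in zip(labels, collector.leaves):
            segments.setdefault(int(label), []).append(text)
        return segments

Applied to a product-listing page, such a segmentation would typically place the repeated record fields (e.g., titles and prices) in one cluster and navigation or boilerplate text in others, which is the separation the actual system relies on before extraction.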

INTRODUCTION

        It is easy for humans to navigate through a web site and retrieve useful information. An emerging need, however, is to go beyond the concept of human browsing by automating the process of information retrieval. A common scenario that benefits from a robotic approach is retrieving online information on a daily basis. For a human this routine is time-consuming and tiresome; having the information delivered ready for use (by e-mail or SMS) saves precious time and effort. Another scenario involves activities such as data mining, which require a vast amount of information for statistical and training purposes. Gathering this information manually is very difficult, if not impossible, without an automatic mechanism able to retrieve the required data. In both cases we deal with a repeated process, a routine: visit a list of sites and retrieve pieces of information from each of them. Once a program able to locate and extract the desired information has been developed, this process can be performed as often and for as long as we want.

        Sophisticated mechanisms are needed to automatically discover and extract structured Web data. The main difficulty in designing a system for web information extraction is the lack of homogeneity in the structure of the source data found in web sites. A web site's structure may change from time to time, or even from session to session. Hence, a dedicated piece of software is required for each web site to exploit the correspondence between its pages' structures. Such pieces of software are called data source wrappers, and their purpose is to extract the useful information from web data sources. Wrappers are divided into two main categories:


1. SITE-SPECIFIC WRAPPERS: They extract information from a specific web page or a family of web pages (a minimal sketch of such a wrapper is given after this list).

2. GENERIC WRAPPERS: They can be applied to almost any page, regardless of its specific structure. We refer to the data to be extracted by the wrapper with the term "structural tokens."
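For contrast, the sketch below shows what a hand-written, site-specific wrapper might look like. The CSS selectors and field names are hypothetical, and the requests/BeautifulSoup libraries are assumed to be available; such a wrapper works only for one page layout and breaks whenever that layout changes, which is precisely the brittleness that generic, automatically generated wrappers aim to avoid.

    import requests
    from bs4 import BeautifulSoup

    def extract_products(url):
        """Return the structural tokens of one specific page layout."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        records = []
        # The selectors below are tied to one hypothetical layout.
        for item in soup.select("div.product"):
            records.append({
                "title": item.select_one("h2.title").get_text(strip=True),
                "price": item.select_one("span.price").get_text(strip=True),
            })
        return records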

        Semiautomatic wrappers use several techniques based on query languages, labeling tools, grammar rules, natural language processing (NLP), and HTML tree processing.
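As an illustration of the query-language technique, the sketch below replays a set of user-specified XPath expressions on each new page, here using the lxml library; the field names and expressions are purely illustrative, and a human must still write and maintain them, which is what makes the approach only semiautomatic.

    from lxml import html

    # User-specified queries: written once by a human for a given site.
    FIELD_QUERIES = {
        "headline": "//div[@class='article']/h1/text()",
        "author":   "//span[@class='byline']/text()",
    }

    def apply_queries(page_source):
        """Evaluate each stored XPath query against the page and collect results."""
        tree = html.fromstring(page_source)
        return {field: tree.xpath(query) for field, query in FIELD_QUERIES.items()}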

        An extractor uses user-specified wrappers to convert web pages into database objects, which can then be queried. Specifically, hypertext pages are treated as text, from which site-specific information is extracted in the form ...
