Exhibits, Demos & Posters
OGSA–DWC: A Middleware for Deep Web Crawling Using the Grid
Authors
- Jihwan Song, Department of Electrical Engineering and Computer Science, KAIST
- Dong-Hoon Choi, e-Science Division, KISTI
- Yoon-Joon Lee, Department of Electrical Engineering and Computer Science, KAIST
Abstract
Early Web sites presented and managed information in static pages; this part of the Web is called the Surface Web. To manage and retrieve vast amounts of information efficiently, however, current Web sites generally store information in back-end databases. Because this information comes from deep or hidden databases, this part of the Web is called the Deep Web. Most current search engines can hardly retrieve information from the Deep Web, and so they sometimes fail to create indexes for Deep Web pages. Recently, considerable effort has gone into retrieving information from the Deep Web.
One of the challenges in crawling the Deep Web is that it requires huge computing resources. Grid computing may be a feasible way to implement a Deep Web crawler system. In this paper, we propose the design of OGSA–DWC (Open Grid Services Architecture–Deep Web Crawling), a middleware for grid-based Deep Web crawling.
OGSA–DWC is the first OGSA-based middleware for crawling the Deep Web.
This middleware supports the fundamental functions needed to easily use idle computing resources (volunteer Grid nodes) around the world and to crawl the Deep Web. Developers will thus be able to implement grid-based Deep Web crawling applications with OGSA–DWC without much effort. Moreover, companies that need Deep Web crawling will be able to reduce the cost of building a large number of distributed servers.