Introduction
Problems Addressed: Resource Discovery (RD) & Information System Interoperability

    Current RD tools: WWW, Web Robots, FTP, Archie, Gopher, Veronica, NetNews, WAIS
    Each tool works with only a single protocol
    Unstructured, low-quality data
    Cannot find relevant information
    Server and network bottlenecks
    No community/topical focus
    "Hard-wired" search algorithms
    Poor scaling characteristics

Harvest Approach: Highlights

    Efficient distributed Gathering architecture
    Topic/community-focused Brokers
    Customizable content extraction in each Gatherer
    Plug-and-play index/search engine in each Broker
    Network-aware caching and replication for scalable access
    Structured content summary interchange protocol

Harvest Approach: Overview

    Gatherer: collects info from resources available at Provider sites
    Broker: retrieves indexing info from Gatherers, suppresses duplicate info, indexes collected info, and provides a WWW interface to it
    Replicator: replicates Brokers around the Internet
    Cache: users retrieve located info through the Cache

SubSystems: Gatherer

    Collection specified as enumerated URLs with stop lists and other parameters

SubSystems: Gatherer - Actions performed

    Enumeration
    Retrieval
    Unnesting (compression, tar, etc.)
    Type-specific summarizing
    Content summary serving

SubSystems: Customized Content Extraction (Essence)

SubSystems: Summary Object Interchange Format (SOIF)

    Gatherer-Broker protocol
    Structured object summary stream
    Arbitrary binary data & extensible fields
    Type + URL + set of attribute-value pairs
    Easy interoperation with other standards for communication/data exchange

SubSystems: SOIF Example

SubSystems: Broker

    Fewer than a dozen routines define the index/search API
    Current engines: freeWAIS, Glimpse, Nebula, Verity, WAIS Inc.
    Current query interface: WWW
    Customizable result formatting
    MD5-based duplicate removal
    Provide data for indexing from other Brokers as well as themselves

SubSystems: Distributed Gatherer-Broker Arrangement

SubSystems: Index & Search

    General Broker-Indexer interface to accommodate a variety of index/search engines
    Can use many backends (WAIS, Ingres, etc.)
    Implemented using two specialized backends:
        Glimpse for space efficiency & flexible interactive queries
        Nebula for time efficiency & complex standing queries

SubSystems: Cache

    Alleviates bottlenecks from popular objects
    Uses the Mosaic proxy interface

SubSystems: Replicator

    Alleviates bottlenecks from popular servers
    Minimizes network traffic and server workload

Performance: Gatherer

    Server load efficiency:
        Scans objects periodically & builds a cache
        Allows info to be retrieved in a single stream
        Local Gatherers vs. uncoordinated gathering: a 6,600-fold saving in server load
    Network traffic efficiency:
        Supports incremental updates
        Local vs. uncoordinated gathering: 59 times more efficient

Performance: Glimpse

    Booleans, regular expressions, approximate searches
    Incremental indexing
    Searches take several seconds
    Indexing ~6000 files (~70 MB) on a SPARCstation IPC:
        Incremental indexing in 2-4 min.
        Complete indexing in 20 min.

Implementation: Implementing other systems

    Standalone Archie, Veronica, WWW, etc.: Gatherer configuration + Essence extraction script
    Content Router: WAIS enumerator + Essence extraction script
    WAIS: Essence full-text extraction + ranking search engine
    WebCrawler: Gatherer configuration + Essence full-text extraction
    WHOIS++: Gatherer configuration + Essence extraction script for centroids; query front-end to search SOIF records
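
Illustration: MD5-based duplicate removal

The Broker slides mention MD5-based duplicate removal, and the SOIF slide describes each summary as a type plus a URL plus a set of attribute-value pairs. The sketch below shows one way those two ideas could fit together. It is not Harvest's code: the Summary class, its field names, and the choice to hash only the attribute-value pairs (ignoring the URL) are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Hypothetical content summary: type + URL + attribute-value pairs."""
    template_type: str
    url: str
    attributes: dict = field(default_factory=dict)

    def content_digest(self) -> str:
        # Assumption: two summaries are duplicates when their
        # attribute-value pairs match, regardless of the URL.
        h = hashlib.md5()
        for name in sorted(self.attributes):
            value = self.attributes[name]
            # Values may be text or arbitrary binary data.
            data = value if isinstance(value, bytes) else str(value).encode("utf-8")
            # Length-prefix each value so field boundaries stay unambiguous.
            h.update(name.encode("utf-8") + b"\0")
            h.update(str(len(data)).encode("ascii") + b"\0")
            h.update(data)
        return h.hexdigest()


def suppress_duplicates(summaries):
    """Keep the first summary seen for each distinct MD5 digest."""
    seen = set()
    unique = []
    for summary in summaries:
        digest = summary.content_digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(summary)
    return unique


if __name__ == "__main__":
    # Example URLs are placeholders, not real resources.
    records = [
        Summary("FILE", "http://a.example/paper.ps", {"Title": "Harvest"}),
        Summary("FILE", "ftp://b.example/pub/paper.ps", {"Title": "Harvest"}),
        Summary("FILE", "http://a.example/other.txt", {"Title": "Other"}),
    ]
    for record in suppress_duplicates(records):
        print(record.url)
```

Hashing the sorted attribute-value pairs means the same object gathered from two Provider sites collapses to one entry, at the cost of treating mirrors with identical summaries as interchangeable. A real Broker might instead hash the retrieved object itself; the slides do not say which.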