Friday, August 24, 2007

Crawler Basics & Challenges

The first step in indexing HTML pages is to fetch all the pages to be indexed using a crawler. One way to collect URLs is to scan fetched pages for hyperlinks (outlinks) to pages that have not yet been indexed; indeed, crawling in this fashion might never halt. It is quite simple to write a basic crawler, but a great deal of engineering goes into an industrial-strength one. The central function of an industrial-strength crawler is to fetch many pages at the same time, so as to overlap the delays involved in:
  1. Resolving the hostname in the URL to an IP address using DNS
  2. Connecting a socket to the server and sending the request
  3. Receiving the requested page
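These three steps can be sketched with Python's standard socket module. This is a hypothetical, bare-bones HTTP/1.0 fetcher for illustration only; a real crawler would use non-blocking sockets and handle redirects, errors, chunked responses, and HTTPS:

```python
import socket
from urllib.parse import urlsplit

def build_request(host, path):
    """Build a minimal HTTP/1.0 GET request."""
    return (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")

def fetch(url, timeout=10):
    """Fetch one page over plain HTTP, making the three steps explicit."""
    parts = urlsplit(url)
    host, path = parts.hostname, parts.path or "/"

    # Step 1: resolve the hostname in the URL to an IP address using DNS.
    ip = socket.gethostbyname(host)

    # Step 2: connect a socket to the server and send the request.
    with socket.create_connection((ip, parts.port or 80), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))

        # Step 3: receive the requested page (headers + body).
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Because each of the three steps blocks on the network, a single such fetch can take seconds, which is exactly why a serious crawler overlaps many of them.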
Here are a few concerns we will face while engineering a large-scale crawler:
  1. Since a single page fetch may involve several seconds of network latency, it is essential to fetch many pages at the same time to utilize the available network bandwidth
  2. Many simultaneous fetches are possible only if DNS lookup is streamlined to be highly concurrent, possibly by replicating it over a few DNS servers
  3. Multiprocessing and multithreading incur overheads. These can be avoided by explicitly encoding the state of each fetch context in a data structure and by using asynchronous sockets, which do not block the process or thread using them.
  4. Redundant URL fetches must be eliminated, and spider traps avoided.
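As a rough illustration of points 1 and 4, here is a hypothetical crawler skeleton: a thread pool overlaps many fetches, while a single coordinator thread maintains the visited set (eliminating redundant fetches) and enforces a page budget as a crude spider-trap guard. The in-memory FAKE_WEB and the fetch_outlinks callback are stand-ins for real page fetching and link extraction:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> outlinks, standing in for real fetches.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],  # cycle back to "a" -- must not be fetched twice
    "c": ["d"],
    "d": [],
}

def crawl(seed, fetch_outlinks, max_pages=100, workers=4):
    """Breadth-first crawl: a thread pool overlaps fetches, while the
    coordinator thread keeps a visited set (no redundant fetches) and a
    page budget (a crude guard against spider traps)."""
    visited = {seed}
    frontier = [seed]
    crawled = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(crawled) < max_pages:
            # Take at most the remaining page budget from the frontier.
            batch = frontier[: max_pages - len(crawled)]
            frontier = frontier[len(batch):]
            # Fetch the whole batch concurrently.
            for url, links in pool.map(lambda u: (u, fetch_outlinks(u)), batch):
                crawled.append(url)
                for link in links:
                    if link not in visited:  # dedup in the coordinator thread
                        visited.add(link)
                        frontier.append(link)
    return crawled
```

For example, crawl("a", lambda u: FAKE_WEB[u]) visits each of the four pages exactly once despite the cycle between a and b. Keeping the visited set and frontier in one coordinator thread sidesteps locking, mirroring the point above about explicit fetch-context state rather than per-fetch threads.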

Tuesday, August 21, 2007

Creating a Search Engine

Hi! I am doing my final year in Computer Science Engineering. My friends and I thought creating a search engine might be a really kewl project for our final year and all... So we started going through the net for resources, only to find that much of the technology was industry-oriented and not research-oriented.

After some digging we were able to find some ACM papers on very specific parts of a search engine, but we quickly understood that to make any new contribution to them we needed a good grasp of the basics.

The first and most complete work published in this area was the paper by Sergey Brin and Lawrence Page on the first-generation Google search engine. Honestly, it wasn't very inspiring; it made us realize that the whole thing was one complex piece of code. It seemed quite impossible, actually, to build a full-scale search engine that was efficient and all. Well, we'll see about that later. Wish us the best of luck ... we'll really need it ....