Implement a Web Crawler
The Web crawler is a standalone program that crawls the web and retrieves web pages starting from a "seed" URL. It parses the seed webpage, extracts any embedded URLs, then retrieves each of those pages, recursively, but limiting its exploration to a given "depth".
Initial Setup
Make sure you have the libcurl library installed, including its headers. You can install it with sudo apt install libcurl4 libcurl4-gnutls-dev. For compiling your code, make sure you link the library properly by adding -lcurl to your linking command.
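To confirm that libcurl is installed and linking correctly, a minimal standalone sketch like the one below (the file name and test URL are illustrative) fetches a single page and writes its body to stdout; the crawler's pagefetcher can be built around the same calls.

/* fetch_test.c -- illustrative only: verify that libcurl builds and links.
 * Compile with: gcc fetch_test.c -o fetch_test -lcurl */
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>

/* libcurl calls this with each chunk of the response body. */
static size_t write_chunk(char* data, size_t size, size_t nmemb, void* userdata)
{
  fwrite(data, size, nmemb, (FILE*)userdata);
  return size * nmemb;              // tell libcurl we handled all the bytes
}

int main(void)
{
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (curl == NULL) {
    fprintf(stderr, "curl_easy_init failed\n");
    return 1;
  }
  curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_chunk);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, stdout);
  CURLcode rc = curl_easy_perform(curl);
  if (rc != CURLE_OK) {
    fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(rc));
  }
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return (rc == CURLE_OK) ? 0 : 1;
}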
Implementation
The crawler implementation must:
1. execute from a command line with usage syntax
./crawler seedURL pageDirectory maxDepth
■ where seedURL is to be used as the initial URL,
■ where pageDirectory is the (existing) directory in which to write downloaded web pages,
■ where maxDepth is an integer in range [0..10] indicating the maximum crawl depth.
2. mark the pageDirectory as a 'directory produced by the Crawler' by creating a file named .crawler in that directory (see the pagedir_init sketch after this list).
3. crawl all "internal" pages (i.e., pages from the same domain name) reachable from seedURL, following links to a maximum depth of maxDepth; where maxDepth=0 means that the crawler explores only the page at seedURL, maxDepth=1 means that the crawler explores only the page at seedURL and those pages to which seedURL links, and so forth inductively. It shall not crawl "external" pages (pages from different domain names). By "internal," we mean that a crawler running on foo.example.com should not fetch pages from bar.example.com or from example.org.
4. print nothing to stdout, other than logging its progress; see an example format in the crawler output.
5. write each explored page to the pageDirectory with a unique document ID, wherein
■ the document ID starts at 1 and increments by 1 for each new page,
■ and the filename is of the form pageDirectory/ID,
■ and the first line of the file is the URL,
■ and the second line of the file is the depth,
■ and the rest of the file is the page content (the HTML, unchanged).
6. exit zero if successful; exit with an error message to stderr and non-zero exit status if it encounters an unrecoverable error, including
■ out of memory
■ invalid number of command-line arguments
■ seedURL is invalid
■ maxDepth is invalid or out of range
■ unable to create a file of form pageDirectory/.crawler
■ unable to create or write to a file of form pageDirectory/id
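For requirement 2, a minimal sketch of pagedir_init, matching the prototype given later in this spec, might create the .crawler marker file and thereby also confirm the directory is writable:

/* pagedir_init: sketch only -- create pageDirectory/.crawler as a marker file.
 * Returns false if the file cannot be created (e.g., the directory does not exist). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

bool pagedir_init(const char* pageDirectory)
{
  if (pageDirectory == NULL) {
    return false;
  }
  size_t len = strlen(pageDirectory) + strlen("/.crawler") + 1;
  char* path = malloc(len);                  // build "pageDirectory/.crawler"
  if (path == NULL) {
    return false;
  }
  snprintf(path, len, "%s/.crawler", pageDirectory);
  FILE* fp = fopen(path, "w");
  free(path);
  if (fp == NULL) {
    return false;                            // caller reports the error and exits non-zero
  }
  fclose(fp);
  return true;
}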
Assumption:
The pageDirectory does not already contain any files whose name is an integer (i.e., 1, 2, …).
Limitation:
The Crawler shall pause at least one second between page fetches, and shall ignore non-internal and non-normalizable URLs.
Inputs and outputs
Input: there are no file inputs; the only inputs are the command-line parameters described above.
Output: The crawler saves each explored webpage to a file, one webpage per file, using a unique documentID as the file name; a sketch of such a pagedir_save appears below. For example, the top page of the website would have documentID 1, the next webpage accessed from a link on that top page would have documentID 2, and so on. Within each of these files, the crawler writes:
● the full page URL on the first line,
● the depth of the page (where the seedURL is considered to be depth 0) on the second line,
● the page contents (i.e., the HTML code), beginning on the third line.
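A minimal sketch of pagedir_save writing that format, assuming an opaque webpage_t with illustrative accessor functions (these names are placeholders, not a required API):

/* pagedir_save: sketch only -- write one fetched page as pageDirectory/docID. */
#include <stdio.h>
#include <stdlib.h>

/* Assumed (illustrative) interface from a webpage module: */
typedef struct webpage webpage_t;
const char* webpage_getURL(const webpage_t* page);
int         webpage_getDepth(const webpage_t* page);
const char* webpage_getHTML(const webpage_t* page);

void pagedir_save(const webpage_t* page, const char* pageDirectory, const int docID)
{
  char path[256];                                   // pageDirectory/ID
  snprintf(path, sizeof(path), "%s/%d", pageDirectory, docID);

  FILE* fp = fopen(path, "w");
  if (fp == NULL) {
    fprintf(stderr, "pagedir_save: cannot write %s\n", path);
    exit(1);                                        // unrecoverable per the spec
  }
  fprintf(fp, "%s\n", webpage_getURL(page));        // line 1: the URL
  fprintf(fp, "%d\n", webpage_getDepth(page));      // line 2: the depth
  fprintf(fp, "%s", webpage_getHTML(page));         // rest: the HTML, unchanged
  fclose(fp);
}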
Functional decomposition into modules
We anticipate the following modules or functions:
1. main, which parses arguments and initializes other modules
2. crawler, which loops over pages to explore, until the list is exhausted
3. pagefetcher, which fetches a page from a URL
4. pagescanner, which extracts URLs from a page and processes each one
5. pagesaver, which outputs a page to the appropriate file
And some helper modules that provide data structures:
1. bag of pages
2. hashtable of URLs
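The precise interfaces of these helper modules are yours to choose (or to reuse from past assignments). For concreteness, the C sketches later in this document assume interfaces roughly like the following; they are illustrative, not required:

/* Illustrative data-structure interfaces assumed by later sketches -- adapt to your own modules. */
#include <stdbool.h>

typedef struct bag bag_t;                 // unordered collection of pointers
bag_t* bag_new(void);
void   bag_insert(bag_t* bag, void* item);
void*  bag_extract(bag_t* bag);           // returns NULL when the bag is empty
void   bag_delete(bag_t* bag, void (*itemdelete)(void* item));

typedef struct hashtable hashtable_t;     // set of string keys (URLs seen so far)
hashtable_t* hashtable_new(const int num_slots);
bool hashtable_insert(hashtable_t* ht, const char* key, void* item);  // false if key already present
void hashtable_delete(hashtable_t* ht, void (*itemdelete)(void* item));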
One way of solving this problem is:
Pseudo code for logic/algorithmic flow
The crawler will run as follows:
parse the command line, validate parameters, initialize other modules
add seedURL to the bag of webpages to crawl, marked with depth=0
add seedURL to the hashtable of URLs seen so far
while there are more webpages in the bag:
    extract a webpage (URL, depth) item from the bag
    pause for one second
    use pagefetcher to retrieve a webpage for that URL
    use pagesaver to write the webpage to the pageDirectory with a unique document ID
    if the webpage depth is < maxDepth, explore the webpage to find the links it contains:
        use pagescanner to parse the webpage to extract all its embedded URLs
        for each extracted URL:
            normalize the URL (per requirements spec)
            if that URL is internal (per requirements spec):
                try to insert that URL into the hashtable of URLs seen; if it was already in the table, do nothing
                if it was added to the table:
                    create a new webpage for that URL, marked with depth+1
                    add that new webpage to the bag of webpages to be crawled
Control flow
The Crawler is implemented in one file crawler.c, with four functions.
1. main
The main function simply calls parseArgs and crawl, then exits zero.
2. parseArgs
Given arguments from the command line, extract them into the function parameters; return only if successful.
● for pageDirectory, call pagedir_init()
● for maxDepth, ensure it is an integer in the specified range [0 ... 10]
● if any trouble is found, print an error to stderr and exit non-zero.
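A sketch of parseArgs under those rules, using the prototype given later in this spec and the pagedir_init sketched earlier (seedURL validation is left out of this sketch):

/* parseArgs: sketch only -- validate the command line or exit non-zero.
 * Assumes <stdio.h>, <stdlib.h>, <stdbool.h> and the pagedir_init sketched above. */
static void parseArgs(const int argc, char* argv[],
                      char** seedURL, char** pageDirectory, int* maxDepth)
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s seedURL pageDirectory maxDepth\n", argv[0]);
    exit(1);
  }
  *seedURL = argv[1];                  // seedURL validity checking omitted in this sketch
  *pageDirectory = argv[2];
  if (!pagedir_init(*pageDirectory)) {
    fprintf(stderr, "error: cannot create %s/.crawler\n", *pageDirectory);
    exit(2);
  }
  char extra;                          // rejects trailing junk such as "3x"
  if (sscanf(argv[3], "%d%c", maxDepth, &extra) != 1
      || *maxDepth < 0 || *maxDepth > 10) {
    fprintf(stderr, "error: maxDepth must be an integer in [0..10]\n");
    exit(3);
  }
}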
3. crawl
Do the real work of crawling from seedURL to maxDepth and saving pages in pageDirectory. Pseudocode:
initialize the hashtable and add the seedURL
initialize the bag and add a webpage representing the seedURL at depth 0
while bag is not empty:
    pull a webpage from the bag
    fetch the HTML for that webpage
    if fetch was successful:
        save the webpage to pageDirectory
        if the webpage is not at maxDepth:
            pageScan that HTML
    delete that webpage
delete the hashtable
delete the bag
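Following that pseudocode, and assuming the illustrative bag/hashtable interfaces above plus hypothetical webpage helpers (webpage_new, webpage_fetch, webpage_getDepth, webpage_delete -- placeholder names, not a required API), crawl might be sketched as:

/* crawl: sketch only -- assumes <unistd.h> for sleep() plus the illustrative
 * pagedir, webpage, bag, and hashtable interfaces sketched earlier. */
static void crawl(char* seedURL, char* pageDirectory, const int maxDepth)
{
  hashtable_t* pagesSeen = hashtable_new(200);       // slot count is arbitrary
  bag_t* pagesToCrawl = bag_new();
  if (pagesSeen == NULL || pagesToCrawl == NULL) {
    fprintf(stderr, "error: out of memory\n");
    exit(1);
  }

  hashtable_insert(pagesSeen, seedURL, "");          // mark the seed as seen
  bag_insert(pagesToCrawl, webpage_new(seedURL, 0)); // seed webpage at depth 0

  int docID = 1;
  webpage_t* page;
  while ((page = bag_extract(pagesToCrawl)) != NULL) {
    sleep(1);                                        // pause one second between fetches
    if (webpage_fetch(page)) {                       // fills in the page's HTML
      pagedir_save(page, pageDirectory, docID++);
      if (webpage_getDepth(page) < maxDepth) {
        pageScan(page, pagesToCrawl, pagesSeen);     // may add new pages to the bag
      }
    }
    webpage_delete(page);
  }

  hashtable_delete(pagesSeen, NULL);
  bag_delete(pagesToCrawl, NULL);
}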
4. pageScan
This function implements the pagescanner mentioned in the design. Given a webpage, scan the given page to extract any links (URLs), ignoring non-internal URLs; for any URL not already seen before (i.e., not in the hashtable), add the URL to both the hashtable pages_seen and to the bag pages_to_crawl. Pseudocode:
while there is another URL in the page:
    if that URL is internal:
        insert the webpage into the hashtable
        if that succeeded:
            create a webpage_t for it
            insert the webpage into the bag
    free the URL
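A matching sketch of pageScan, assuming a hypothetical webpage_getNextURL that returns each embedded URL as a newly allocated absolute URL (NULL when no more remain), an isInternalURL predicate, and a webpage_new that copies the URL it is given:

/* pageScan: sketch only -- queue each unseen internal URL found on the page. */
static void pageScan(webpage_t* page, bag_t* pagesToCrawl, hashtable_t* pagesSeen)
{
  int pos = 0;                                   // scanner position in the page's HTML
  char* url;
  while ((url = webpage_getNextURL(page, &pos)) != NULL) {
    if (isInternalURL(url)) {
      if (hashtable_insert(pagesSeen, url, "")) {              // false if already seen
        webpage_t* next = webpage_new(url, webpage_getDepth(page) + 1);
        if (next != NULL) {
          bag_insert(pagesToCrawl, next);
        }
      }
    }
    free(url);                                   // safe because webpage_new copied it
  }
}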
Your pagescanner should extract URLs from links in the webpages' HTML. It's sufficient for this assignment to assume all links are of the format <a href="url">, where url varies. That is, your code should scan for occurrences of <a href="url"> (again, the url part varies) and extract the contained URLs so they can be crawled. Note that link targets come in multiple types: absolute, domain-relative, and page-relative. https://en.wikipedia.org/wiki/Dog is an absolute URL, /wiki/Dog is a domain-relative URL, and Dog is a page-relative URL. Your page scanner will have to normalize domain-relative and page-relative URLs by expanding them to a full absolute URL. For example, if the current page is http://example.com/foo/bar, /baz should be expanded to http://example.com/baz and quux should be expanded to http://example.com/foo/quux. We have provided code that does this for you in url.c.
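As a starting point, a deliberately naive way to find those href attributes, shown here as an illustrative sketch using plain strstr rather than any provided parser (scan_hrefs and its callback are hypothetical names):

/* Sketch only: naive scan for href="..." attributes in an HTML buffer.
 * Real pages are messier (single quotes, whitespace, case), but this matches
 * the simplified link format assumed by this assignment. */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Calls the caller-supplied handler once per extracted (possibly relative) URL. */
static void scan_hrefs(const char* html, void (*handle)(const char* url))
{
  const char* p = html;
  while ((p = strstr(p, "href=\"")) != NULL) {
    p += strlen("href=\"");                 // start of the URL text
    const char* end = strchr(p, '"');       // closing quote
    if (end == NULL) {
      break;                                // malformed tag; stop scanning
    }
    size_t len = end - p;
    char* url = malloc(len + 1);
    if (url != NULL) {
      memcpy(url, p, len);
      url[len] = '\0';
      handle(url);                          // caller normalizes and filters it
      free(url);
    }
    p = end + 1;                            // continue after this attribute
  }
}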
Modules from past assignments can be used
Examples: hashtable, set.
Function prototypes
See crawler.c for explanations.
int main(const int argc, char* argv[]);
static void parseArgs(const int argc, char* argv[], char** seedURL, char** pageDirectory, int* maxDepth);
static void crawl(char* seedURL, char* pageDirectory, const int maxDepth);
static void pageScan(webpage_t* page, bag_t* pagesToCrawl, hashtable_t* pagesSeen);
See pagedir.h for explanations.
bool pagedir_init(const char* pageDirectory);
void pagedir_save(const webpage_t* page, const char* pageDirectory, const int docID);
Error handling and recovery
All the command-line arguments should be checked before any data structures are allocated or work begins; problems result in a message printed to stderr and a non-zero exit status.
Out-of-memory errors should be handled properly (a message printed to stderr and a non-zero exit status). We anticipate out-of-memory errors to be rare and thus allow the program to crash (cleanly) in this way.
All code should use defensive-programming tactics to catch and exit, e.g., if a function receives bad parameters.
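For example, one common defensive pattern (purely illustrative; checked_strdup is a hypothetical helper) is to validate parameters and allocations at the top of a function and exit cleanly on failure:

/* Illustrative defensive checks: validate parameters and allocations, then exit cleanly. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char* checked_strdup(const char* s, const char* caller)
{
  if (s == NULL) {
    fprintf(stderr, "%s: received NULL parameter\n", caller);
    exit(99);
  }
  char* copy = malloc(strlen(s) + 1);
  if (copy == NULL) {
    fprintf(stderr, "%s: out of memory\n", caller);
    exit(99);
  }
  strcpy(copy, s);
  return copy;
}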