A Pre-emptive Multithreading Web Spider

The Win32 API supports pre-emptively multithreaded applications, a powerful feature for writing MFC Internet spiders. The SPIDER project is an example of how to use pre-emptive multithreading to gather information on the Web with a spider/robot built on the MFC WinInet classes.

This project produces a spidering program that checks Web sites for broken URL links. Link verification is done only on href links. It displays a continuously updated list of URLs in a CListView, reporting the status of each href link. The project could be used as a template for gathering and indexing information to be stored in a database file for queries.

Search engines gather information on the Web using programs called Robots. Robots (also called Web Crawlers, Spiders, Worms, Web Wanderers, and Scooters) automatically gather and index information from around the Web, and then put that information into databases. (Note that a Robot will index a page and then follow the links on that page as a source of new URLs to index.) Users can then construct queries to search these databases to find the information they want.

By using pre-emptive multithreading, you can index a Web page of URL links and start a new thread to follow each new URL link as a new source of URLs to index.

The project uses an MDI CDocument with a custom MDI child frame to display a CEditView when downloading Web pages and a CListView when checking URL links. The project also uses the CObArray, CInternetSession, CHttpConnection, CHttpFile, and CWinThread MFC classes. The CWinThread class is used to produce multiple threads instead of the asynchronous mode in CInternetSession, which is really a holdover from the 16-bit Windows Winsock platform.

The SPIDER project uses simple worker threads to check URL links or download a Web page. The CSpiderThread class is derived from CWinThread, so each CSpiderThread object can use the CWinThread MESSAGE_MAP() functions. By declaring a DECLARE_MESSAGE_MAP() in the CSpiderThread class, the user interface stays responsive to user input. This means you can check the URL links on one Web server and at the same time download and open a Web page from another Web server. The only time the user interface becomes unresponsive to user input is when the thread count exceeds MAXIMUM_WAIT_OBJECTS, which is defined as 64.
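Here is a minimal sketch of what such a class declaration might look like (members beyond those shown elsewhere in this article are assumptions):

// Sketch of the worker-thread class. It derives from CWinThread so it
// can declare its own message map, as described above. ThreadParams is
// the parameter structure shown later in this article.
class CSpiderThread : public CWinThread
{
public:
	CSpiderThread(AFX_THREADPROC pfnThreadProc, LPVOID pParam);
	static UINT ThreadFunc(LPVOID pParam);  // the worker thread proc
	void ThreadRun(ThreadParams* pParams);  // checks a link or downloads a page

protected:
	DECLARE_MESSAGE_MAP()
};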

In the constructor for each new CSpiderThread object, we supply the ThreadProc function and the thread parameters to be passed to the ThreadProc function.


	CSpiderThread* pThread = NULL;
	pThread = new CSpiderThread(CSpiderThread::ThreadFunc, pThreadParams); // create a new CSpiderThread object


In the CSpiderThread constructor, we set the CWinThread* m_pThread pointer in the thread parameters structure so we can point to the correct instance of this thread:

	pThreadParams->m_pThread = this;
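A plausible form of that constructor, assuming the thread proc and its parameter block are simply forwarded to the CWinThread base class:

// Sketch of the constructor: forward the thread proc and parameter
// block to CWinThread, then record this instance in the block so the
// static ThreadFunc can find the object it belongs to.
CSpiderThread::CSpiderThread(AFX_THREADPROC pfnThreadProc, LPVOID pParam)
	: CWinThread(pfnThreadProc, pParam)
{
	((ThreadParams*)pParam)->m_pThread = this;
}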
The CSpiderThread ThreadProc Function

// simple worker thread Proc function
UINT CSpiderThread::ThreadFunc(LPVOID pParam)
{
	ThreadParams * lpThreadParams = (ThreadParams*) pParam;
	CSpiderThread* lpThread = (CSpiderThread*) lpThreadParams->m_pThread;
	
	lpThread->ThreadRun(lpThreadParams);

	// Use SendMessage instead of PostMessage here to keep the current
	// thread count synchronized. If the number of threads is greater
	// than MAXIMUM_WAIT_OBJECTS (64), the program will become
	// unresponsive to user input.

	::SendMessage(lpThreadParams->m_hwndNotifyProgress,
		WM_USER_THREAD_DONE, 0, (LPARAM)lpThreadParams);  // deletes lpThreadParams and decrements the thread count

	return 0;
}
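On the receiving end, the window named by m_hwndNotifyProgress handles WM_USER_THREAD_DONE. A minimal sketch of that handler, wired up with an ON_MESSAGE entry in the message map (the class and member names here are assumptions):

// Sketch of the WM_USER_THREAD_DONE handler. Because the worker uses
// SendMessage, this runs before ThreadFunc returns, so the thread
// count stays synchronized.
LRESULT CSpiderProgressWnd::OnThreadDone(WPARAM /*wParam*/, LPARAM lParam)
{
	ThreadParams* pParams = (ThreadParams*)lParam;
	m_nThreadCount--;   // one less active worker thread
	delete pParams;     // the worker is finished with its parameter block
	return 0;
}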
The structure passed to the CSpiderThread ThreadProc function:
typedef struct tagThreadParams
{
	HWND m_hwndNotifyProgress;  // receives WM_USER_THREAD_DONE when the thread finishes
	HWND m_hwndNotifyView;      // receives WM_USER_CHECK_DONE with the URL status
	CWinThread* m_pThread;      // back pointer to the owning CSpiderThread
	CString m_pszURL;
	CString m_Contents;
	CString m_strServerName;
	CString m_strObject;
	CString m_checkURLName;
	CString m_string;
	DWORD m_dwServiceType;
	DWORD m_threadID;           // used as the dwContext ID for session callbacks
	DWORD m_Status;
	URLStatus m_pStatus;        // status structure sent to the CListView
	INTERNET_PORT m_nPort;
	int m_type;
	BOOL m_RootLinks;
} ThreadParams;

After the CSpiderThread object has been created, we use the CreateThread function to start the execution of the new thread object.

	if (!pThread->CreateThread())   //  Starts execution of a CWinThread object
	{
		AfxMessageBox("Cannot Start New Thread");
		delete pThread;
		pThread = NULL;
		delete pThreadParams;
		return FALSE;
	}    
Once the new thread is running, we use the ::SendMessage function to send messages to the CDocument's CListView with the status structure of the URL link.

	if (pThreadParams->m_hwndNotifyView != NULL)
		::SendMessage(pThreadParams->m_hwndNotifyView,
			WM_USER_CHECK_DONE, 0, (LPARAM)&pThreadParams->m_pStatus);
Structure used for URL status:

typedef struct tagURLStatus
{
	CString m_URL;
	CString m_URLPage;
	CString m_StatusString;
	CString m_LastModified;
	CString m_ContentType;
	CString m_ContentLength;
	DWORD	m_Status;
}URLStatus, * PURLStatus;
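A minimal sketch of how the CListView side might handle WM_USER_CHECK_DONE and fill in its report columns (the handler name and column order are assumptions):

// Sketch of the WM_USER_CHECK_DONE handler in the CListView-derived
// class. The LPARAM carries a PURLStatus pointer, which stays valid
// for the duration of the SendMessage call.
LRESULT CUrlListView::OnCheckDone(WPARAM /*wParam*/, LPARAM lParam)
{
	PURLStatus pStatus = (PURLStatus)lParam;
	CListCtrl& list = GetListCtrl();
	int nItem = list.InsertItem(list.GetItemCount(), pStatus->m_URL);
	list.SetItemText(nItem, 1, pStatus->m_StatusString);
	list.SetItemText(nItem, 2, pStatus->m_LastModified);
	list.SetItemText(nItem, 3, pStatus->m_ContentType);
	list.SetItemText(nItem, 4, pStatus->m_ContentLength);
	return 0;
}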
Each new thread creates a new CMyInternetSession object (derived from CInternetSession) with EnableStatusCallback set to TRUE, so we can check the status of all Internet session callbacks. The dwContext ID for callbacks is set to the thread ID.

BOOL CInetThread::InitServer()
{
	
	try
	{
		m_pSession = new CMyInternetSession(AgentName,m_nThreadID);
		int ntimeOut = 30;  // very important; can cause a server time-out if set too low,
							// or hang the thread if set too high.
		/*
		The time-out value in milliseconds to use for Internet connection requests. 
		If a connection request takes longer than this timeout, the request is canceled.
		The default timeout is infinite. */
		m_pSession->SetOption(INTERNET_OPTION_CONNECT_TIMEOUT,1000* ntimeOut);
		
		/* The delay value in milliseconds to wait between connection retries.*/
		m_pSession->SetOption(INTERNET_OPTION_CONNECT_BACKOFF,1000);
		
		/* The retry count to use for Internet connection requests. If a connection 
		attempt still fails after the specified number of tries, the request is canceled.
		The default is five. */
		m_pSession->SetOption(INTERNET_OPTION_CONNECT_RETRIES,1);
		m_pSession->EnableStatusCallback(TRUE);

	}
	catch (CInternetException* pEx)
	{
		// catch errors from WinINet
		//pEx->ReportError();
		m_pSession = NULL;
		pEx->Delete();
		return FALSE ;
	}

	return TRUE;
}
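With the callback enabled, the session object receives status notifications through the OnStatusCallback override. A minimal sketch of such an override (the body here is an assumption; the real project may track more states):

// Sketch of the status callback. dwContext carries the thread ID that
// was passed to the session constructor, so each notification can be
// matched to the thread that issued the request.
void CMyInternetSession::OnStatusCallback(DWORD dwContext,
	DWORD dwInternetStatus, LPVOID /*lpvStatusInformation*/,
	DWORD /*dwStatusInformationLength*/)
{
	switch (dwInternetStatus)
	{
	case INTERNET_STATUS_CONNECTING_TO_SERVER:
	case INTERNET_STATUS_REQUEST_COMPLETE:
		TRACE(_T("Thread %lu: status %lu\n"), dwContext, dwInternetStatus);
		break;
	default:
		break;
	}
}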

The key to using the MFC WinInet classes in a single-threaded or multithreaded program is to surround all MFC WinInet class functions with try/catch blocks. The Internet is very unstable at times, and the Web page you are requesting may no longer exist; either condition is guaranteed to throw a CInternetException.


	try
	{
		// some MFC WinInet class function
	}
	catch (CInternetException* pEx)
	{
		// catch errors from WinINet
		//pEx->ReportError();
		pEx->Delete();
		return FALSE ;
	}
 

The maximum thread count is initially set to 64, but you can configure it to any number between 1 and 100. Setting it too high will result in failed connections, which means you will have to recheck the URL links.
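One simple way to enforce that limit is to check the active count before launching another worker (the member names here are assumptions):

	// Sketch: refuse to launch a new worker past the configured limit.
	if (m_nThreadCount >= m_nMaxThreads) // m_nMaxThreads: 1..100
		return FALSE;   // too many active threads; retry this URL later
	m_nThreadCount++;   // count the worker we are about to start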

A rapid-fire succession of HTTP requests in a /cgi-bin/ directory could bring a server to its knees. The SPIDER program sends out about 4 HTTP requests a second; 4 * 60 = 240 a minute, which can also bring a server to its knees. Be careful about which server you are checking. Each server keeps a log of the IP address of every agent that requests a file, so you might get some nasty email from an angry Web server administrator.
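A crude way to pace requests is to sleep between them in the worker loop (the interval below is an assumption derived from the roughly 4 requests per second figure above):

	// Sketch: pause between HTTP requests so the server is not flooded.
	const DWORD REQUEST_INTERVAL_MS = 250; // roughly 4 requests per second
	::Sleep(REQUEST_INTERVAL_MS);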

You can prevent a directory from being indexed by disallowing it in the server's robots.txt file. This mechanism is usually used to protect /cgi-bin/ directories, because CGI scripts take more server resources to retrieve.

When the SPIDER program checks URL links, its goal is not to request too many documents too quickly. The SPIDER program adheres somewhat to the standard for robot exclusion. This standard is a joint agreement between robot developers that allows WWW sites to limit which URLs a robot requests. By using the standard to limit access, the robot will not retrieve any documents that Web servers wish to disallow.

Before checking the root URL, the program checks for a robots.txt file in the main directory. If the SPIDER program finds a robots.txt file, it aborts the search. The program also checks the META tags in every Web page; if it finds a META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW" tag, it does not index the URLs on that page.
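For reference, a robots.txt file that protects a /cgi-bin/ directory, and the page-level META tag the program looks for, look like this (the comments are illustrative):

	# robots.txt, placed in the server's root directory
	User-agent: *       # these rules apply to every robot
	Disallow: /cgi-bin/ # do not request anything under /cgi-bin/

	<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">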

Build:
Windows 95
MFC/VC++ 5.0
WinInet.h dated 9/25/97
WinInet.lib dated 9/16/97
WinInet.dll dated 9/18/97

Problems:
Can't seem to keep the thread count below 64 at all times.
Limit of 32,767 URL links in the CListView.
Doesn't parse all URLs correctly; the program will occasionally crash when using CString functions with complex URLs.

Resources:
Internet tools - Fred Forester
Multithreading Applications in Win32
Win32 Multithreaded Programming

Download Source Code and Example (65 KB)

Last updated: 21 June 1998



Comments

  • Works In VC++ 2005 Express

    Posted by japreja on 05/11/2006 01:57am

You need to have the Platform SDK as well as the Windows Driver Development Kit installed, and have the paths set for the lib and include files in the VC++ 2005 Express options. Other than setting the proper paths, nothing needs to be modified. The source compiles fine.

    Reply
  • Great tool

    Posted by Mirsad on 08/29/2005 11:12am

    Hi, this is a great tool. My question is: is it possible to change this tool into a dialog-based application? Regards, mh

    Reply
  • Thanks for such a good search engine kernel spider program

    Posted by everestsun on 05/17/2005 04:52am

    I have revised it to search all the links in the web file, including link, img, flash, and other embedded things, and I can get the keywords to index for a database. The next step is to develop a website to test the search results. Thanks very much!

    Reply
  • Same problem...

    Posted by Legacy on 03/03/2000 12:00am

    Originally posted by: Dong Won

    When I open the source with VC++ 5.0 and compile it,
    it gives me some errors such as:
    INTERNET_CONNECTION_LAN |
    INTERNET_CONNECTION_MODEM |
    INTERNET_CONNECTION_PROXY;

    InternetGetConnectedState
    ......

    as undeclared identifiers.
    Am I missing something? Do I need to install Winsock?

    Thanks in advance.

    Moonly


    I have the same problem....
    Please help me!!!
    Thank you so much for good code.

    Reply
  • Cannot compile the source ....

    Posted by Legacy on 07/27/1999 12:00am

    Originally posted by: Moonly

    When I open the source with VC++ 5.0 and compile it,

    it gives me some errors such as:
    INTERNET_CONNECTION_LAN |
    INTERNET_CONNECTION_MODEM |
    INTERNET_CONNECTION_PROXY;

    InternetGetConnectedState
    ......

    as undeclared identifiers.
    Am I missing something? Do I need to install Winsock?

    Thanks in advance.

    Moonly

    Reply
  • Local Web Site

    Posted by Legacy on 06/25/1999 12:00am

    Originally posted by: Thomas Younsi

    I am currently working on a set of thread functions to handle
    local web sites.

    In source file Thread.cpp, in function

    BOOL CSpiderThread::CheckAllURLs(LPCTSTR ServerName, ThreadParams *pThreadParams)

    change the following to handle frames:

    if (!GetHref(lpszText, _T("href"), list))
    {
        if (!GetHref(lpszText, _T("src"), list))
        {
            return FALSE;
        }
    }

    Great Code !

    Other Nice Code if you want to learn more on Web Spider:

    Description: WWW Site grabber
    Name: GetWeb
    Version: 2.7
    Release: 1
    Source: http://nemesis.ee.teiath.gr/~stelios/GetWeb-2.7.tar.gz
    Copyright: GPL
    Group: Networking/Utilities


    Title: getwww
    Version: 1.4
    Entered-date: 12/09/96
    Description: Download an entire html source tree from a remote URL;
    recursively chases image and hypertext links.
    Change links in html files to relative links between
    local html files.
    It can treat cgi virtual text recursively.
    It is c++ version.
    The README file is written in Korean.
    I found someone who translated it to English.
    If you can read Korean and English, please help me.
    Keywords: www http html getwww webget snag wget webcopy www-robot
    Author: kisskiss@soback.kornet.nm.kr ( In-sung Kim)
    Maintained-by: kisskiss@soback.kornet.nm.kr ( In-sung Kim)
    Primary-site: ftp.kaist.ac.kr /incoming/www
    27kB getwww-1.4.tar.gz
    Alternate-site: sunsite.unc.edu /pub/Linux/system/Network/info-systems/www
    27kB getwww-1.4.tar.gz
    Platform: Linux-2.0.24 gcc 2.7.2.1 Irix-5.3 sunOS 4.1.1 HP-UX 10.01
    Copying-policy: GPL

    Reply