Crawling Using WINHTTP 5
Prabhdeep Singh
July 8, 2003

Environment: VC6 SP4, W2K SP2, WINHTTP 5, ATL

This article pertains to simple data extraction from a Web URL using the WINHTTP library.


The WINHTTP library follows the HTTP 1.1 model, which is based on persistent connections: we first connect to a Web server and then request documents from it. Subsequent requests to the same Web server (hostname, in our case) do not involve making and breaking the connection. I wrote this article because the main problem I ran into when crawling is that you are usually given one long URL, and it has to be broken up into the hostname and the rest of the URL path.

For example, if the URL to be traversed was:

http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq

WINHTTP expects to connect to

news.yahoo.com

and then put in a request for

/fc?tmpl=fc&cid=34&in=world&cat=iraq
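To make the split concrete, here is a minimal sketch in standard C++ that does not use WINHTTP at all. The names `UrlParts` and `splitUrl` are hypothetical and exist only for this illustration, and the sketch assumes a simple URL with no port or user information and a path that starts with "/".

```cpp
#include <string>

// Hypothetical helper type: the two pieces WINHTTP needs.
struct UrlParts {
    std::string host;   // e.g. "news.yahoo.com"
    std::string path;   // e.g. "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
};

// Split "scheme://host/path?query" into the host and the path plus query.
// Illustration only: assumes no port, no user info, and a "/" before any query.
UrlParts splitUrl(const std::string& url)
{
    UrlParts parts;

    // Skip past "scheme://" if it is present.
    std::string::size_type start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;

    // The host ends at the first "/" after the scheme.
    std::string::size_type slash = url.find('/', start);
    if (slash == std::string::npos) {
        parts.host = url.substr(start);
        parts.path = "/";               // no path given: request the root
    } else {
        parts.host = url.substr(start, slash - start);
        parts.path = url.substr(slash);
    }
    return parts;
}
```

Calling `splitUrl` on the example URL above yields the host `news.yahoo.com` and the path `/fc?tmpl=fc&cid=34&in=world&cat=iraq`.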

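Given below is the complete description of how to take the URL and split it (using WinHttpCrackUrl). Then, you need to adjust the result, because the cracking does not give us the strings that we want: the pointers it returns point into the original URL string and are not null-terminated on their own, so the hostname pointer, for instance, runs on through the rest of the URL. Finally, you feed this data to the WINHTTP calls.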
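When only the component lengths are requested and no buffers are supplied, WinHttpCrackUrl fills in a pointer into the original URL plus a character count for each component, so each piece has to be copied out by its length. The sketch below shows that pattern in portable C++; the `Component` struct and `toString` function are hypothetical stand-ins for a pointer and length pair such as lpszHostName and dwHostNameLength, not part of WINHTTP.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for one cracked component: a pointer into the
// original URL plus a character count (compare lpszHostName and
// dwHostNameLength in URL_COMPONENTS).
struct Component {
    const wchar_t* ptr;   // points into the original URL, not a copy
    std::size_t    len;   // number of characters that belong to the component
};

// Copy exactly len characters so the result is a separate, properly
// terminated string.
std::wstring toString(const Component& c)
{
    return std::wstring(c.ptr, c.len);
}
```

Reading `ptr` as an ordinary null-terminated string would return everything up to the end of the URL, which is the effect seen in the message box comments in the code further down.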

Once the URL has been split, we can extract the data. To do this, first we connect to the server and request the URL path. Then, we get the size of the data currently available for that request by using WinHttpQueryDataAvailable.

The catch is that we don't get all the data of a Web page in one shot, so we initialize a buffer and keep appending to it the data returned by WinHttpReadData. The Web page is complete once all the data has been read, which is indicated by the available data size being equal to zero.

This is much the same approach that an equivalent URLReader class in Java takes.
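The read loop described above can be tried out without a network connection. The following is only a sketch in portable C++: `FakeResponse` and `readAll` are hypothetical names, and the class merely imitates a response that arrives in pieces, with `available` playing the role of WinHttpQueryDataAvailable and `read` the role of WinHttpReadData.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an open request: serves a fixed body in
// pieces of at most chunkSize bytes, the way a network read might.
class FakeResponse {
public:
    FakeResponse(const std::string& body, std::size_t chunkSize)
        : body_(body), pos_(0), chunkSize_(chunkSize) {}

    // Number of bytes that can be read right now (0 means finished).
    std::size_t available() const {
        return std::min(chunkSize_, body_.size() - pos_);
    }

    // Copy up to n bytes into buf and return how many were copied.
    std::size_t read(char* buf, std::size_t n) {
        n = std::min(n, body_.size() - pos_);
        body_.copy(buf, n, pos_);
        pos_ += n;
        return n;
    }

private:
    std::string body_;
    std::size_t pos_;
    std::size_t chunkSize_;
};

// Keep reading until no more data is reported, appending each chunk.
std::string readAll(FakeResponse& response)
{
    std::string page;
    std::size_t size;
    while ((size = response.available()) > 0) {
        std::vector<char> buffer(size);
        std::size_t got = response.read(&buffer[0], size);
        page.append(&buffer[0], got);   // append only the bytes actually read
    }
    return page;
}
```

For example, `FakeResponse r("<html><body>hello</body></html>", 8); readAll(r);` returns the whole body even though it arrives eight bytes at a time. The sketch appends by byte count instead of relying on a terminating zero, which is one possible design choice; it keeps any zero bytes in the data intact.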

Given below is the complete code to do such a feat, with explicit comments at each step.

[
    USES_CONVERSION;

    // First, split up the URL
    URL_COMPONENTS urlComp;    // a structure that would contain the
                               // individual components of the URL
    LPCWSTR varURL;            // ***** varURL is the URL to be
                               // traversed
    DWORD dwUrlLen = 0;
    LPCWSTR hostname, optional;

    // Initialize the URL_COMPONENTS structure.
    ZeroMemory(&urlComp, sizeof(urlComp));
    urlComp.dwStructSize = sizeof(urlComp);

    //MessageBox(NULL,OLE2T(varURL),"the url to be traversed", 1);

    // Set required component lengths to non-zero so that they
    // are cracked.
    urlComp.dwSchemeLength    = -1;
    urlComp.dwHostNameLength  = -1;
    urlComp.dwUrlPathLength   = -1;
    urlComp.dwExtraInfoLength = -1;

    // Split the URL (varURL) into hostname and URL path
    if (!WinHttpCrackUrl( varURL, (DWORD)wcslen(varURL), 0, &urlComp))
    {
        printf("Error %u in WinHttpCrackUrl.\n", GetLastError());
    }

    // You can inspect the cracked URL here
    // For our example of varURL =
    // http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq
    // MessageBox(NULL,W2T(urlComp.lpszHostName),
    //            "INTERPRETER-> hostname",MB_OK);
    // We get the hostname as
    // "news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    //  MessageBox(NULL,W2T(urlComp.lpszUrlPath),
    //             "INTERPRETER-> urlpath",MB_OK);
    // We get the urlPath as "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    // MessageBox(NULL,W2T(urlComp.lpszExtraInfo),
    //            "INTERPRETER->extrainfo",MB_OK);
    // We get the extrainfo as ""
    // MessageBox(NULL,W2T(urlComp.lpszScheme),
    //            "INTERPRETER->Scheme",MB_OK);
    // We get the scheme as
    // "http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq"

    // The cracked pointers point into varURL and are not
    // null-terminated, so cut the hostname out by its length.
    String myhostname(W2T(urlComp.lpszHostName));
    String newhostname(myhostname.SubString(0,
                           (int)urlComp.dwHostNameLength));


    DWORD dwSize        = 0;
    DWORD dwDownloaded  = 0;
    LPSTR pszOutBuffer;
    BOOL  bResults      = FALSE;
    HINTERNET  hSession = NULL,
               hConnect = NULL,
               hRequest = NULL;

    // Use WinHttpOpen to obtain a session handle.
    hSession = WinHttpOpen( L"WinHTTP Example/1.0",
                            WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                            WINHTTP_NO_PROXY_NAME,
                            WINHTTP_NO_PROXY_BYPASS, 0);

    // Specify an HTTP server.
    // In our examples, it expects just "news.yahoo.com"
    if (hSession)
        hConnect = WinHttpConnect( hSession, T2W(newhostname),
                                   INTERNET_DEFAULT_HTTP_PORT, 0);

    // Create an HTTP request handle.
    // In our example, it expects
    // "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    if (hConnect)
        hRequest = WinHttpOpenRequest( hConnect, L"GET",
                                       urlComp.lpszUrlPath,
                                       NULL, WINHTTP_NO_REFERER,
                                       WINHTTP_DEFAULT_ACCEPT_TYPES,
                                       WINHTTP_FLAG_REFRESH);
    // Send a request.
    if (hRequest)
        bResults = WinHttpSendRequest( hRequest,
                                       WINHTTP_NO_ADDITIONAL_HEADERS,
                                       0,
                                       WINHTTP_NO_REQUEST_DATA, 0,
                                       0, 0);

    // Wait for the response to the request.
    if (bResults)
        bResults = WinHttpReceiveResponse( hRequest, NULL);

    String respage="";    // The buffer that'll contain the
                          // extracted Web page data

    // Keep checking for data until there is nothing left.
    if (bResults)
        do
        {

            // Check for available data.
            dwSize = 0;
            if (!WinHttpQueryDataAvailable( hRequest, &dwSize))
            {
                printf("Error %u in WinHttpQueryDataAvailable.\n",
                        GetLastError());
                break;    // dwSize is not valid; stop reading
            }

            // Allocate space for the buffer.
            pszOutBuffer = new char[dwSize+1];
            if (!pszOutBuffer)
            {
                printf("Out of memory\n");
                dwSize=0;
            }
            else
            {
                // Read the Data.
                ZeroMemory(pszOutBuffer, dwSize+1);

                if (!WinHttpReadData( hRequest,
                                      (LPVOID)pszOutBuffer,
                                      dwSize, &dwDownloaded))
                {
                    printf("Error %u in WinHttpReadData.\n",
                            GetLastError());
                    dwSize = 0;    // end the loop on a read error
                }
                else
                    respage.Append(pszOutBuffer);

                // Free the memory allocated to the buffer.
                delete [] pszOutBuffer;
            }

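        } while (dwSize>0);
        // MessageBox(NULL,respage,"fetched page from interpreter",1);

    // Close any open handles.
    if (hRequest) WinHttpCloseHandle(hRequest);
    if (hConnect) WinHttpCloseHandle(hConnect);
    if (hSession) WinHttpCloseHandle(hSession);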

]
