Crawling Using WINHTTP 5

Environment: VC6 SP4, W2K SP2, WINHTTP 5, ATL

This article pertains to simple data extraction from a Web URL using the WINHTTP library.

The WINHTTP library follows the HTTP 1.1 model, which is based on persistent connections: we first connect to a Web server and then request documents from it. Subsequent requests to the same Web server (the same hostname, in our case) do not involve making and breaking the connection again. I wrote this article because the main problem I ran into when crawling is that you are usually given one long URL, which has to be broken up into the hostname and the rest of the URL path.

For example, if the URL to be traversed was:

http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq

WINHTTP expects to connect to

news.yahoo.com

and then put in a request for

/fc?tmpl=fc&cid=34&in=world&cat=iraq
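
The split described above can be sketched in portable C++ without any WINHTTP calls. This is only an illustration of the idea, not what WinHttpCrackUrl does internally; the SplitUrl type and splitUrl function are hypothetical names introduced here, and the sketch assumes a plain "scheme://host/path?query" URL (a port, if present, stays attached to the host).

```cpp
#include <string>

// Hypothetical helper type for this sketch (not part of WINHTTP).
struct SplitUrl {
    std::string host;  // e.g. "news.yahoo.com"
    std::string path;  // URL path plus query string, e.g. "/fc?tmpl=fc"
};

// Splits an absolute URL into the hostname and the rest of the URL.
// Assumption: the URL looks like "scheme://host[/path][?query]".
SplitUrl splitUrl(const std::string& url) {
    SplitUrl out;

    // Skip past "scheme://" if it is present.
    std::string::size_type start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;

    // The hostname ends at the first '/' or '?' after the scheme.
    std::string::size_type end = url.find_first_of("/?", start);
    if (end == std::string::npos) {
        out.host = url.substr(start);
        out.path = "/";                 // no path given: request the root
    } else {
        out.host = url.substr(start, end - start);
        out.path = url.substr(end);
        if (out.path[0] == '?')
            out.path.insert(0, "/");    // "host?x=1" becomes path "/?x=1"
    }
    return out;
}
```

Calling splitUrl on the example URL yields the host "news.yahoo.com" and the path "/fc?tmpl=fc&cid=34&in=world&cat=iraq", which are the two pieces WINHTTP wants separately.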

Given below is the complete description of how to take the URL and split it (using WinHttpCrackUrl). The cracked components then need some adjustment, because the cracking does not give us the results that we want as-is. Finally, you feed this data to the WINHTTP calls.

After we are done with this, the data extraction from the URL comes into the picture. To do this, we first connect to the server and send the request. Then, we get the size of the data currently available by using WinHttpQueryDataAvailable.

The catch is that we don't get all the data of a Web page in one shot, so we initialize a buffer to which we keep appending the data returned by WinHttpReadData. We have the whole Web page when all the data has been read (indicated by the available data size being equal to zero).
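
The shape of that loop can be sketched without a network connection. In the sketch below, FakeResponse is a hypothetical stand-in for an open request handle: its queryDataAvailable and readData members only play the roles of WinHttpQueryDataAvailable and WinHttpReadData and are not WINHTTP functions. The point is just the pattern: ask how much is available, read that much, append, and stop at zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a request handle: it hands out a fixed
// body in pieces of at most chunkSize bytes.
class FakeResponse {
public:
    FakeResponse(const std::string& body, std::size_t chunkSize)
        : body_(body), pos_(0), chunk_(chunkSize) {}

    // Role of WinHttpQueryDataAvailable: bytes that can be read now.
    std::size_t queryDataAvailable() const {
        return std::min(chunk_, body_.size() - pos_);
    }

    // Role of WinHttpReadData: copies up to `size` bytes into `buf`
    // and returns how many bytes were actually copied.
    std::size_t readData(char* buf, std::size_t size) {
        std::size_t n = std::min(size, body_.size() - pos_);
        body_.copy(buf, n, pos_);
        pos_ += n;
        return n;
    }

private:
    std::string body_;
    std::size_t pos_;
    std::size_t chunk_;
};

// Keeps appending chunks to one string until nothing is left.
std::string readAll(FakeResponse& resp) {
    std::string page;
    for (;;) {
        std::size_t avail = resp.queryDataAvailable();
        if (avail == 0)
            break;                       // zero available: page is complete
        std::vector<char> buf(avail);
        std::size_t got = resp.readData(&buf[0], avail);
        page.append(&buf[0], got);       // append by length, not as a C string
    }
    return page;
}
```

Appending by byte count rather than relying on a terminating zero is a design choice in this sketch; it keeps the page intact even if the data happens to contain zero bytes.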

This is exactly how an equivalent URLReader class in Java works.

Given below is the complete code to do this, with explicit comments at each step.

[
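    // Needs <windows.h>, <winhttp.h>, <atlbase.h> (for the ATL
    // conversion macros), and <stdio.h>; link with winhttp.lib.
    USES_CONVERSION;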

    // First, split up the URL
    URL_COMPONENTS urlComp;    // a structure that would contain the
                               // individual components of the URL
    LPCWSTR varURL =           // ***** varURL is the URL to be
                               // traversed (our example URL here)
        L"http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq";
    DWORD dwUrlLen = 0;
    LPCWSTR hostname, optional;

    // Initialize the URL_COMPONENTS structure.
    ZeroMemory(&urlComp, sizeof(urlComp));
    urlComp.dwStructSize = sizeof(urlComp);

    //MessageBox(NULL,OLE2T(varURL),"the url to be traversed", 1);

    // Set required component lengths to non-zero so that they
    // are cracked.
    urlComp.dwSchemeLength    = (DWORD)-1;
    urlComp.dwHostNameLength  = (DWORD)-1;
    urlComp.dwUrlPathLength   = (DWORD)-1;
    urlComp.dwExtraInfoLength = (DWORD)-1;

    // Split the URL (varURL) into hostname and URL path
    if (!WinHttpCrackUrl( varURL, (DWORD)wcslen(varURL), 0, &urlComp))
    {
        printf("Error %u in WinHttpCrackUrl.\n", GetLastError());
    }
    
    // You can inspect the cracked URL here
    // For our example of varURL =
    // http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq
    // MessageBox(NULL,W2T(urlComp.lpszHostName),
    //            "INTERPRETER-> hostname",MB_OK);
    // We get the hostname as
    // "news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    //  MessageBox(NULL,W2T(urlComp.lpszUrlPath),
    //             "INTERPRETER-> urlpath",MB_OK);
    // We get the urlPath as "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    // MessageBox(NULL,W2T(urlComp.lpszExtraInfo),
    //            "INTERPRETER->extrainfo",MB_OK);
    // We get the extrainfo as ""
    // MessageBox(NULL,W2T(urlComp.lpszScheme),
    //            "INTERPRETER->Scheme",MB_OK);
    // We get the scheme as
    // "http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq"

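    // lpszHostName points into the original URL, so it still has the
    // URL path attached. Cut the hostname off where the path begins.
    String myhostname(W2T(urlComp.lpszHostName));
    String myurlpath(W2T(urlComp.lpszUrlPath));
    int strindex = myhostname.IndexOf(myurlpath);
    String newhostname(myhostname.SubString(0,strindex));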


    DWORD dwSize        = 0;
    DWORD dwDownloaded  = 0;
    LPSTR pszOutBuffer;
    BOOL  bResults      = FALSE;
    HINTERNET  hSession = NULL,
               hConnect = NULL,
               hRequest = NULL;

    // Use WinHttpOpen to obtain a session handle.
    hSession = WinHttpOpen( L"WinHTTP Example/1.0",
                            WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                            WINHTTP_NO_PROXY_NAME,
                            WINHTTP_NO_PROXY_BYPASS, 0);

    // Specify an HTTP server.
    // In our examples, it expects just "news.yahoo.com"
    if (hSession)
        hConnect = WinHttpConnect( hSession, T2W(newhostname),
                                   INTERNET_DEFAULT_HTTP_PORT, 0);

    // Create an HTTP request handle.
    // In our example, it expects
    // "/fc?tmpl=fc&cid=34&in=world&cat=iraq"
    if (hConnect)
        hRequest = WinHttpOpenRequest( hConnect, L"GET",
                                       urlComp.lpszUrlPath,
                                       NULL, WINHTTP_NO_REFERER,
                                       WINHTTP_DEFAULT_ACCEPT_TYPES,
                                       WINHTTP_FLAG_REFRESH);
    // Send a request.
    if (hRequest)
        bResults = WinHttpSendRequest( hRequest,
                                       WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                                       WINHTTP_NO_REQUEST_DATA, 0,
                                       0, 0);

    // Wait for the response to the request.
    if (bResults)
        bResults = WinHttpReceiveResponse( hRequest, NULL);

    String respage="";    // The buffer that'll contain the
                          // extracted Web page data

    // Keep checking for data until there is nothing left.
    if (bResults)
        do
        {

            // Check for available data.
            dwSize = 0;
            if (!WinHttpQueryDataAvailable( hRequest, &dwSize))
            {
                printf("Error %u in WinHttpQueryDataAvailable.\n",
                        GetLastError());
                break;    // stop reading; nothing more can be fetched
            }

            // Allocate space for the buffer.
            pszOutBuffer = new char[dwSize+1];
            if (!pszOutBuffer)
            {
                printf("Out of memory\n");
                dwSize=0;
            }
            else
            {
                // Read the Data.
                ZeroMemory(pszOutBuffer, dwSize+1);

                if (!WinHttpReadData( hRequest,
                                      (LPVOID)pszOutBuffer,
                                      dwSize, &dwDownloaded))
                    printf("Error %u in WinHttpReadData.\n",
                            GetLastError());
                else
                    respage.Append(pszOutBuffer);

                // Free the memory allocated to the buffer.
                delete [] pszOutBuffer;
            }

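        } while (dwSize>0);
    // MessageBox(NULL,respage,"fetched page from interpreter",1);

    // Close any open handles.
    if (hRequest) WinHttpCloseHandle(hRequest);
    if (hConnect) WinHttpCloseHandle(hConnect);
    if (hSession) WinHttpCloseHandle(hSession);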

]
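
A note on why the hostname comes back with the path still attached: when the pointer members of URL_COMPONENTS are left NULL and the length members are non-zero, WinHttpCrackUrl does not copy anything. It returns pointers into the original URL string and stores each component's length in the matching length member. Reading such a pointer as a zero-terminated string therefore runs on to the end of the URL. The sketch below mimics this in portable C++; CrackedPart and findHost are hypothetical stand-ins for the (lpszHostName, dwHostNameLength) pair and for the cracking itself, not WINHTTP code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for one (pointer, length) pair of
// URL_COMPONENTS, such as lpszHostName and dwHostNameLength.
struct CrackedPart {
    const wchar_t* ptr;  // points into the original URL, not a copy
    std::size_t    len;  // number of characters in this component
};

// Locates the host inside `url` and returns a pointer into the input
// plus a length. Assumption: the URL looks like "scheme://host...".
CrackedPart findHost(const wchar_t* url) {
    std::wstring s(url);
    std::size_t start = s.find(L"://");
    start = (start == std::wstring::npos) ? 0 : start + 3;
    std::size_t end = s.find_first_of(L"/?", start);
    if (end == std::wstring::npos)
        end = s.size();
    CrackedPart part = { url + start, end - start };
    return part;
}

// Ignores the length: the result runs to the end of the URL.
std::wstring hostIgnoringLength(const CrackedPart& p) {
    return std::wstring(p.ptr);
}

// Uses the length: the result is exactly the component.
std::wstring hostUsingLength(const CrackedPart& p) {
    return std::wstring(p.ptr, p.len);
}
```

For the example URL, hostIgnoringLength gives "news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq" while hostUsingLength gives "news.yahoo.com". In the listing above, the corresponding alternative to the IndexOf/SubString step would be to copy dwHostNameLength characters starting at lpszHostName; I have not tested that variant against WINHTTP 5.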


Comments

  • Suggest winhttp equivalent on Windows phone 8.1

    Posted by Saurabh on 08/06/2014 06:07am

    Suggest winhttp equivalent on Windows phone 8.1

  • where to get WinHTTP library ?

    Posted by hspc on 04/17/2004 11:34am

    Hi. where to get WinHTTP library ? I searched MS Downloads for WinHTTP but no good !!

    • reply: where to get WinHTTP library?

      Posted by keedo60 on 04/25/2004 05:32pm

      It's included in the Platform SDK from Microsoft.
      
      http://www.microsoft.com/msdownload/platformsdk/sdkupdate/update.htm

  • Quite Informative

    Posted by Legacy on 08/19/2003 12:00am

    Originally posted by: Kapil Sharma

    Hey,

    Great Article. Keep up the good work.

    ~ Kapil

  • Thanks for sharing it. This saved all my efforts.

    Posted by Legacy on 07/25/2003 12:00am

    Originally posted by: Ravneet Kaur

    Thanks for sharing it. This saved all my efforts.
    

  • This is what I have been looking for

    Posted by Legacy on 07/09/2003 12:00am

    Originally posted by: Bhupesh Gupta

    Ya this is great, this is exactly what I have been looking for. Thanks PD.
