Environment: VC6 SP4, W2K SP2, WINHTTP 5, ATL
This article pertains to simple data extraction from a Web URL using the WINHTTP library.

The WINHTTP library follows the HTTP 1.1 persistent-connection model: we first connect to a Web server and then request documents from it. Subsequent requests to the same Web server (the same hostname, in our case) do not involve making and breaking the connection. I wrote this article because of the main problem I ran into with this model when crawling: you are usually handed one long URL, and it has to be broken up into the hostname and the rest of the URL path.
For example, if the URL to be traversed was:
http:
WINHTTP expects to connect to
news.yahoo.com
and then put in a request for
/fc?tmpl=fc&cid=34&in=world&cat=iraq
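As a rough illustration of that split, independent of WinHTTP, here is a minimal sketch in portable C++. The names UrlParts and splitUrl are hypothetical (they do not come from any library), and the sketch assumes a URL of the form scheme://host/path with no port or user information.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for the two pieces WinHTTP wants separately.
struct UrlParts {
    std::string host;  // what the connect call needs
    std::string path;  // what the request call needs, query string included
};

// Split "scheme://host/path?query" into host and path.
// Assumption: no port and no user information in the URL.
// A URL with no path at all yields "/" as the path.
UrlParts splitUrl(const std::string& url) {
    assert(!url.empty());  // the sketch does not handle an empty URL

    UrlParts parts;
    const std::string::size_type schemeEnd = url.find("://");
    const std::string::size_type hostStart =
        (schemeEnd == std::string::npos) ? 0 : schemeEnd + 3;

    // The first '/' after the scheme marks the end of the hostname.
    const std::string::size_type pathStart = url.find('/', hostStart);
    if (pathStart == std::string::npos) {
        parts.host = url.substr(hostStart);
        parts.path = "/";
    } else {
        parts.host = url.substr(hostStart, pathStart - hostStart);
        parts.path = url.substr(pathStart);
    }
    return parts;
}
```

For a URL whose host is news.yahoo.com and whose remainder is /fc?tmpl=fc&cid=34&in=world&cat=iraq, splitUrl returns exactly those two pieces. The query string stays attached to the path because the request call expects it there.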
Given below is the complete description of how to take the URL and split it (using WinHttpCrackUrl). The cracked components then need some changes, because the cracking on its own does not give us the results that we want. Finally, you feed this data to the WINHTTP calls.
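The cracked result needs changes because, when the string pointers in URL_COMPONENTS are left NULL and the length members are non-zero, WinHttpCrackUrl hands back pointers into the original URL rather than separate copies, so the text at the host pointer runs on through the path. The following sketch imitates that situation in portable C++ without calling WinHTTP. CrackedView, crackInPlace, hostByPathSearch, and hostByLength are hypothetical names; only the idea of a pointer paired with a length is taken from URL_COMPONENTS.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the relevant URL_COMPONENTS fields:
// pointers that aim into the original URL, plus the host length.
struct CrackedView {
    const wchar_t* hostName;        // starts at the host, runs to the end of the URL
    std::size_t    hostNameLength;  // number of characters that belong to the host
    const wchar_t* urlPath;         // starts at the path, runs to the end of the URL
};

// Imitates a pointer-into-source crack of "scheme://host/path".
// Assumption: the URL is absolute and has no port or user information.
CrackedView crackInPlace(const wchar_t* url) {
    const std::wstring text(url);
    const std::wstring::size_type schemeEnd = text.find(L"://");
    assert(schemeEnd != std::wstring::npos);
    const std::wstring::size_type hostStart = schemeEnd + 3;
    std::wstring::size_type pathStart = text.find(L'/', hostStart);
    if (pathStart == std::wstring::npos)
        pathStart = text.size();  // no path: the path view is empty

    CrackedView view;
    view.hostName = url + hostStart;
    view.hostNameLength = pathStart - hostStart;
    view.urlPath = url + pathStart;
    return view;
}

// Trim the host by locating the path text inside the host text.
std::wstring hostByPathSearch(const CrackedView& view) {
    const std::wstring hostTail(view.hostName);
    const std::wstring pathTail(view.urlPath);
    if (pathTail.empty())
        return hostTail;  // nothing follows the host, so nothing to trim
    return hostTail.substr(0, hostTail.find(pathTail));
}

// Trim the host by trusting the reported length instead.
std::wstring hostByLength(const CrackedView& view) {
    return std::wstring(view.hostName, view.hostNameLength);
}
```

hostByPathSearch mirrors the IndexOf and SubString approach used in the code below. hostByLength is an alternative worth considering: in the real structure the corresponding field is dwHostNameLength, and using it avoids the search and the special case for a URL with no path.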
Once the URL has been split, the data extraction itself comes into the picture. First we connect to the host and request the page. Then, we get the size of the data currently available by using WinHttpQueryDataAvailable.
The catch is that we don't get all the data of a Web page in one shot, so we initialize a buffer to which we keep appending the data returned by WinHttpReadData. We have the complete Web page once all the data has been read (indicated by the available data size being equal to zero).
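The shape of that loop can be sketched in portable C++ without any network code. ChunkedSource is a hypothetical stand-in for the response: its available() plays the role of WinHttpQueryDataAvailable and its read() the role of WinHttpReadData. readAll is likewise a made-up name for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical response stream that releases its body a few bytes at a time.
class ChunkedSource {
public:
    ChunkedSource(const std::string& body, std::size_t chunkSize)
        : body_(body), pos_(0), chunkSize_(chunkSize) {
        assert(chunkSize > 0);
    }

    // Bytes ready right now; zero means the whole body has been delivered.
    std::size_t available() const {
        return std::min(chunkSize_, body_.size() - pos_);
    }

    // Copy up to `want` bytes into `out` and return how many were copied.
    std::size_t read(char* out, std::size_t want) {
        const std::size_t n = std::min(want, body_.size() - pos_);
        std::memcpy(out, body_.data() + pos_, n);
        pos_ += n;
        return n;
    }

private:
    std::string body_;
    std::size_t pos_;
    std::size_t chunkSize_;
};

// Ask how much is ready, read that much, append it, and repeat
// until the source reports that nothing is left.
std::string readAll(ChunkedSource& source) {
    std::string page;
    for (;;) {
        const std::size_t ready = source.available();
        if (ready == 0)
            break;
        std::vector<char> buffer(ready);
        const std::size_t got = source.read(&buffer[0], ready);
        page.append(&buffer[0], got);  // append exactly the bytes that were read
    }
    return page;
}
```

One design choice here differs from appending a zero-terminated buffer: the sketch appends by byte count, so a short read or a zero byte inside the page does not cut the text short.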
This is essentially how an equivalent URLReader class in Java works.
Given below is the complete code to do such a feat, with explicit comments at each step.
[
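USES_CONVERSION;    // ATL macro required by W2T and T2W

// varURL holds the full, null-terminated URL to crawl. This fragment
// does not show where it comes from; assign it before the
// WinHttpCrackUrl call below.
LPCWSTR varURL;
URL_COMPONENTS urlComp;

// Initialize the URL_COMPONENTS structure. Leaving the string pointers
// NULL and setting the lengths to non-zero asks WinHttpCrackUrl to
// return pointers into varURL instead of copying each component.
ZeroMemory(&urlComp, sizeof(urlComp));
urlComp.dwStructSize      = sizeof(urlComp);
urlComp.dwSchemeLength    = (DWORD)-1;
urlComp.dwHostNameLength  = (DWORD)-1;
urlComp.dwUrlPathLength   = (DWORD)-1;
urlComp.dwExtraInfoLength = (DWORD)-1;

// Crack the URL. The length must be that of the string being cracked.
if (!WinHttpCrackUrl( varURL, (DWORD)wcslen(varURL), 0, &urlComp))
{
    printf("Error %u in WinHttpCrackUrl.\n", GetLastError());
}

// lpszHostName points into the URL, so it still carries the path after
// the hostname. Find where the path starts and keep only what precedes it.
String myhostname(W2T(urlComp.lpszHostName));
String myurlpath(W2T(urlComp.lpszUrlPath));
int strindex = myhostname.IndexOf(myurlpath);
String newhostname(myhostname.SubString(0, strindex));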
DWORD dwSize = 0;
DWORD dwDownloaded = 0;
LPSTR pszOutBuffer;
BOOL bResults = FALSE;
HINTERNET hSession = NULL,
          hConnect = NULL,
          hRequest = NULL;

// Use WinHttpOpen to obtain a session handle.
hSession = WinHttpOpen( L"WinHTTP Example/1.0",
                        WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                        WINHTTP_NO_PROXY_NAME,
                        WINHTTP_NO_PROXY_BYPASS, 0);

// Connect to the server named by the trimmed hostname.
if (hSession)
    hConnect = WinHttpConnect( hSession, T2W(newhostname),
                               INTERNET_DEFAULT_HTTP_PORT, 0);

// Create a GET request for the path (and query string) part of the URL.
if (hConnect)
    hRequest = WinHttpOpenRequest( hConnect, L"GET",
                                   urlComp.lpszUrlPath,
                                   NULL, WINHTTP_NO_REFERER,
                                   WINHTTP_DEFAULT_ACCEPT_TYPES,
                                   WINHTTP_FLAG_REFRESH);

// Send the request.
if (hRequest)
    bResults = WinHttpSendRequest( hRequest,
                                   WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                                   WINHTTP_NO_REQUEST_DATA, 0,
                                   0, 0);

// Wait for the response to arrive.
if (bResults)
    bResults = WinHttpReceiveResponse( hRequest, NULL);
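// Keep reading until no more data is available, appending each
// chunk to respage.
String respage = "";
if (bResults)
    do
    {
        // Check how much data is available right now.
        dwSize = 0;
        if (!WinHttpQueryDataAvailable( hRequest, &dwSize))
            printf("Error %u in WinHttpQueryDataAvailable.\n",
                   GetLastError());

        // Allocate a buffer for this chunk, plus a terminating zero.
        pszOutBuffer = new char[dwSize+1];
        if (!pszOutBuffer)
        {
            printf("Out of memory\n");
            dwSize = 0;
        }
        else
        {
            // Read the chunk and append it to the page text.
            ZeroMemory(pszOutBuffer, dwSize+1);
            if (!WinHttpReadData( hRequest,
                                  (LPVOID)pszOutBuffer,
                                  dwSize, &dwDownloaded))
                printf("Error %u in WinHttpReadData.\n",
                       GetLastError());
            else
                respage.Append(pszOutBuffer);

            // Free the memory allocated to the buffer.
            delete [] pszOutBuffer;
        }
    } while (dwSize > 0);

// Report a failure in any of the earlier calls, then release the handles.
if (!bResults)
    printf("Error %u has occurred.\n", GetLastError());
if (hRequest) WinHttpCloseHandle(hRequest);
if (hConnect) WinHttpCloseHandle(hConnect);
if (hSession) WinHttpCloseHandle(hSession);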
]