URL Encoding

Environment: VC++, MFC

Introduction

The purpose of this article is to design a C++ class that performs URL encoding. The motivation was a previous project in which I needed to post data from a VC++ 6.0 application, and that data had to be URL encoded. I searched MSDN for a class or API that returns the URL-encoded value of a given input string, but I did not find one. So I had to come up with my own URLEncode C++ class.

URLEncoder.exe is an MFC dialog-based application that uses the URLEncode class.

Process

URL encoding is a special process that makes sure that all the characters are "safe" to transmit across the Internet. Some characters have special meaning to various programs involved in sending the data across the Internet.

For example, a carriage return has an ASCII value of 13. Programs involved in sending form data may interpret it as the end of a line of data.

Traditionally, all Web applications transfer data between the client and server by using the HTTP or HTTPS protocols. There are basically two ways in which a server receives input from a client:

  1. Data can be passed in the HTTP request itself (either via cookies in the headers or via a posted form in the body), or
  2. It can be included in the query portion of the requested URL.

When data is included in a URL, it must be specially encoded to conform to proper URL syntax. On the Web server side, the data is automatically decoded. Consider the following URL, where data is posted as a query string parameter.

Example: http://WebSite/ResourceName?Data=Data

Where:
WebSite is the URL name of the site.
ResourceName is the name of the ASP page or servlet.
Data is the data to be posted to the Web server. It must be encoded if the MIME type is Content-Type: application/x-www-form-urlencoded.

RFC 1738

The RFC 1738 specification, which defines Uniform Resource Locators (URLs), restricts the characters allowed in a URL to a subset of the US-ASCII character set. This is a limitation because HTML, on the other hand, allows the entire range of the ISO-8859-1 (ISO-Latin) character set to be used in documents. As a result, if HTML data is uploaded through a form post (or as part of a query string), all of that data must be encoded.

ISO-8859-1 (ISO-Latin) Character Set

The ISO-8859-1 (ISO-Latin) character set contains 256 entries. The ISO-8859-1 table lists, for each character, its ISO-8859-1 position (its decimal code), description, entity number, hexadecimal value, and HTML result. Broadly, the range can be divided into safe and unsafe characters as follows.

Character range (decimal)  Type                          Values                              Safe/Unsafe
0-31                       ASCII control characters      These characters are not printable  Unsafe
32-47                      Reserved characters           ' ' ! " # $ % & ' ( ) * + , - . /   Unsafe
48-57                      ASCII characters and numbers  0-9                                 Safe
58-64                      Reserved characters           : ; < = > ? @                       Unsafe
65-90                      ASCII characters              A-Z                                 Safe
91-96                      Reserved characters           [ \ ] ^ _ `                         Unsafe
97-122                     ASCII characters              a-z                                 Safe
123-126                    Reserved characters           { | } ~                             Unsafe
127                        Control character             DEL (not printable)                 Unsafe
128-255                    Non-ASCII characters          (various)                           Unsafe

All unsafe ASCII characters must be encoded; for example, those in the ranges 32-47, 58-64, 91-96, and 123-126.
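As a rough illustration of the ranges in the table above, the check can be sketched in a few lines of standard C++. The function name isSafeByRange is hypothetical and is not part of the article's class; it relies only on the decimal ranges listed in the table.

```cpp
// Hypothetical helper (not part of CURLEncode): classifies a byte using
// only the decimal ranges from the table above. Digits and letters are
// safe; every other code from 0 to 255 is treated as unsafe.
bool isSafeByRange(unsigned char c)
{
    return (c >= 48 && c <= 57)    // 0-9
        || (c >= 65 && c <= 90)    // A-Z
        || (c >= 97 && c <= 122);  // a-z
}
```

Taking an unsigned char keeps the codes 128-255 positive, so they fall outside all three safe ranges without any special handling.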

Below is the table that describes why these characters are not safe.

Character  Encoding  Unsafe reason
<          %3C       Delimiter around URLs in free text
>          %3E       Delimiter around URLs in free text
"          %22       Delimits URLs in some systems
#          %23       Used in the World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it
{          %7B       Gateways and other transport agents are known to sometimes modify such characters
}          %7D       Gateways and other transport agents are known to sometimes modify such characters
|          %7C       Gateways and other transport agents are known to sometimes modify such characters
\          %5C       Gateways and other transport agents are known to sometimes modify such characters
^          %5E       Gateways and other transport agents are known to sometimes modify such characters
~          %7E       Gateways and other transport agents are known to sometimes modify such characters
[          %5B       Gateways and other transport agents are known to sometimes modify such characters
]          %5D       Gateways and other transport agents are known to sometimes modify such characters
`          %60       Gateways and other transport agents are known to sometimes modify such characters
+          %2B       Indicates a space (spaces cannot be used in a URL)
/          %2F       Separates directories and subdirectories
?          %3F       Separates the actual URL and the parameters
&          %26       Separator between parameters specified in the URL

How It Is Done

URL encoding of a character is done by taking the character's 8-bit hexadecimal code and prefixing it with a percent sign ("%"). For example, the US-ASCII character set represents a space with decimal code 32, or hexadecimal 20. Thus, its URL-encoded representation is %20.
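The rule above can be tried out on a single byte. The following is a minimal sketch in standard C++ (C++11 or later, for std::snprintf) rather than MFC; the helper name percentEncodeByte is an assumption made for illustration and does not appear in the article's class.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the article's class): returns the
// percent-encoded form of one byte, that is, '%' followed by the
// byte's two-digit uppercase hexadecimal code.
std::string percentEncodeByte(unsigned char byte)
{
    char buf[4];  // '%' + two hex digits + terminating NUL
    std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned int>(byte));
    return std::string(buf);
}
```

For a space (decimal 32) this returns "%20", and for a carriage return (decimal 13) it returns "%0D".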

URLEncode: CURLEncode is a C++ class that performs URL encoding on a given string of data. The CURLEncode class has the following member functions.

  • isUnsafe
  • decToHex
  • convert
  • URLEncode

The URLEncode() method performs the encoding. It checks each character in the string to see whether it is safe or unsafe (isUnsafe). A safe character is copied to the result as is; an unsafe character is replaced with a percent sign (%) followed by its hexadecimal value (convert) before being appended to the result string.
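The flow just described can be sketched in portable C++ with std::string in place of CString. This is not the article's implementation: the names urlEncodeSketch and isSafeChar are hypothetical, and the safe set is assumed to be only ASCII letters and digits, whereas the article's class consults a csUnsafeString list whose contents are not shown.

```cpp
#include <string>

// Assumption: only ASCII letters and digits are treated as safe here.
// The article's class instead checks a csUnsafeString list.
static bool isSafeChar(unsigned char c)
{
    return (c >= '0' && c <= '9')
        || (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z');
}

// Hypothetical free function mirroring the described URLEncode() flow:
// safe characters are copied, unsafe ones become '%' plus two hex digits.
std::string urlEncodeSketch(const std::string& input)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string output;
    output.reserve(input.size() * 3);  // worst case: every byte is encoded

    for (std::string::size_type i = 0; i < input.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (isSafeChar(c))
        {
            output += static_cast<char>(c);
        }
        else
        {
            output += '%';
            output += hexDigits[c >> 4];    // high nibble
            output += hexDigits[c & 0x0F];  // low nibble
        }
    }
    return output;
}
```

With these assumptions, urlEncodeSketch("a b&c") yields "a%20b%26c". Working on unsigned char values avoids the sign handling that the char-based version needs for codes 128-255.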

Code Snippet

class CURLEncode
{
private:
  static CString csUnsafeString;
  CString decToHex(char num, int radix);
  bool isUnsafe(char compareChar);
  CString convert(char val);

public:
  CURLEncode() { };
  virtual ~CURLEncode() { };
  CString URLEncode(CString vData);
};

bool CURLEncode::isUnsafe(char compareChar)
{
  // Look for the character in the list of known unsafe characters.
  bool bcharfound = false;
  int m_strLen = csUnsafeString.GetLength();
  for (int ichar_pos = 0; ichar_pos < m_strLen; ichar_pos++)
  {
    if (csUnsafeString.GetAt(ichar_pos) == compareChar)
    {
      bcharfound = true;
      break;
    }
  }

  int char_ascii_value = (int) compareChar;

  // The character is safe only if it is not in the unsafe list and its
  // code lies strictly between 32 and 123; everything else is unsafe.
  if (bcharfound == false && char_ascii_value > 32 &&
                             char_ascii_value < 123)
  {
    return false;
  }
  return true;
}

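// Assumed lookup table of hexadecimal digits; its definition is not
// shown in the original listing.
static const char hexVals[] = "0123456789ABCDEF";

CString CURLEncode::decToHex(char num, int radix)
{
  int temp = 0;
  CString csTmp;
  int num_char = (int) num;

  // Map negative (signed) char values onto the range 128-255.
  if (num_char < 0)
    num_char = 256 + num_char;

  // Collect the digits from least to most significant.
  while (num_char >= radix)
  {
    temp = num_char % radix;
    num_char = num_char / radix;  // integer division; floor() is not needed
    csTmp += hexVals[temp];
  }

  csTmp += hexVals[num_char];

  // Pad to two digits; the padding becomes a leading zero after the reverse.
  if (csTmp.GetLength() < 2)
  {
    csTmp += '0';
  }

  // Reverse the string so the most significant digit comes first.
  csTmp.MakeReverse();

  return csTmp;
}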

CString CURLEncode::convert(char val)
{
  CString csRet;
  csRet += "%";
  csRet += decToHex(val, 16);
  return  csRet;
}

References

URL Encoding: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm.

RFC 1866: The HTML 2.0 specification (plain text). The appendix contains the Character Entity table: http://www.rfc-editor.org/rfc/rfc1866.txt.

The Web version of the HTML 2.0 (RFC 1866) Character Entity table: http://www.w3.org/MarkUp/html-spec/html-spec_13.html.

The HTML 3.2 (Wilbur) recommendation [This includes all character entities listed in HTML 2.0, plus new named entities covering the ISO 8859-1 120-191 range.]: http://www.w3.org/MarkUp/Wilbur/.

The HTML 4.0 Recommendation [Includes new Unicode character entities]: http://www.w3.org/TR/REC-html40/.

The W3C HTML Internationalization area: http://www.w3.org/International/O-HTML.html.

Downloads

URLEncoder Source Code - 42 Kb


Comments

  • Online Tools

    Posted by davitz38 on 01/13/2010 03:53am

    Hi guys,
    Try this Online Url Encoder
    Pretty cool!!!!
    David

    Reply
  • Unicode builds

    Posted by Syslock on 11/19/2009 06:13pm

    Is this code safe for Unicode builds?

    Reply
  • A very simple MFC class to Encode and Decode an url string

    Posted by serhardt on 03/03/2006 04:29am

    This is my contribution for encoding and decoding a URL string. My objective was to simplify the source code by using existing CString functions...

    /*****************************************************************************
    Module :     UrlString.h
    Notices:     Written 2006 by Stephane Erhardt
    Description: H URL Encoder/Decoder
    *****************************************************************************/
    #ifndef __CURLSTRING_H_
    #define __CURLSTRING_H_
    
    class CUrlString
    {
    private:
    	CString m_csUnsafe;
    
    public:
    	CUrlString();
    	virtual ~CUrlString() { };
    	CString Encode(CString csDecoded);
    	CString Decode(CString csEncoded);
    };
    
    #endif //__CURLSTRING_H_
    
    /*****************************************************************************
    Module :     UrlString.cpp
    Notices:     Written 2006 by Stephane Erhardt
    Description: CPP URL Encoder/Decoder
    *****************************************************************************/
    #include "stdafx.h"
    #include "UrlString.h"
    
    /*****************************************************************************/
    CUrlString::CUrlString()
    {
    	m_csUnsafe = _T("%=\"<>\\^[]`+$,@:;/!#?&'");
    	for(int iChar = 1; iChar < 33; iChar++)
    		m_csUnsafe += (char)iChar;
    	for(int iChar = 124; iChar < 256; iChar++)
    		m_csUnsafe += (char)iChar;
    }
    
    /*****************************************************************************/
    CString CUrlString::Encode(CString csDecoded)
    {
    	CString csCharEncoded, csCharDecoded;
    	CString csEncoded = csDecoded;
    
    	for(int iPos = 0; iPos < m_csUnsafe.GetLength(); iPos++)
    	{
    		// Cast to unsigned char so that values above 127 format as two hex digits.
    		csCharEncoded.Format(_T("%%%02X"), (unsigned char)m_csUnsafe[iPos]);
    		csCharDecoded = m_csUnsafe[iPos];
    		csEncoded.Replace(csCharDecoded, csCharEncoded);
    	}
    	return csEncoded;
    }
    
    /*****************************************************************************/
    CString CUrlString::Decode(CString csEncoded)
    {
    	CString csUnsafeEncoded = Encode(m_csUnsafe);
    	CString csDecoded = csEncoded;
    	CString csCharEncoded, csCharDecoded;
    
    	for(int iPos = 0; iPos < csUnsafeEncoded.GetLength(); iPos += 3)
    	{
    		csCharEncoded = csUnsafeEncoded.Mid(iPos, 3);
    		csCharDecoded = (char)strtol(csUnsafeEncoded.Mid(iPos + 1, 2), NULL, 16);
    		csDecoded.Replace(csCharEncoded, csCharDecoded);
    	}
    	return csDecoded;
    }

    Reply
  • how about url decoding?

    Posted by Legacy on 06/26/2003 12:00am

    Originally posted by: william hwang

    Anyone has any idea?

    • Online Url Decoder

      Posted by davitz38 on 01/13/2010 03:56am

      Hey,
      Use this url decoder and get details too
      David

      Reply
    Reply
  • URL Encoding

    Posted by Legacy on 06/14/2003 12:00am

    Originally posted by: Elisha Tiyagnet

    This article is great!

    It saves a lot of manual URL scripting. Give yourself a pat on the back.

    St June 14th 2003.

    Reply
  • http://www.alrojo.com

    Posted by Legacy on 06/04/2003 12:00am

    Originally posted by: fermin

    good job
    

    Reply
  • URL Encoding in C with Win32 code sample.

    Posted by Legacy on 11/15/2002 12:00am

    Originally posted by: AmbientHex


    #include "stdafx.h"
    #include "windows.h"
    #include <string.h>
    #include <ctype.h>
    #include <stdio.h>
    char * UrlEncode(char *szText, char* szDst, int bufsize) ;

    int APIENTRY WinMain(HINSTANCE hInstance,
    HINSTANCE hPrevInstance,
    LPSTR lpCmdLine,
    int nCmdShow)
    {
    // TODO: Place code here.

    char szBigger[2048];

    UrlEncode("WOO#$%$%^567567567567HOO!t",szBigger,sizeof(szBigger));
    MessageBox(0,szBigger,"",0);

    return 0;
    }

    char * UrlEncode(char *szText, char* szDst, int bufsize) {
        char ch;
        char szHex[5];
        int iMax,i,j;

        iMax = bufsize-2;
        szDst[0]='\0';
        for (i = 0,j=0; szText[i] && j <= iMax; i++) {
            ch = szText[i];
            /* Cast to unsigned char: a negative char in isalnum() is undefined. */
            if (isalnum((unsigned char)ch))
                szDst[j++]=ch;
            else if (ch == ' ')
                szDst[j++]='+';
            else {
                if (j+2 > iMax) break;
                szDst[j++]='%';
                /* Cast so bytes above 127 print as two hex digits and fit in szHex. */
                sprintf(szHex, "%-2.2X", (unsigned char)ch);
                strncpy(szDst+j,szHex,2);
                j += 2;
            }
        }
        szDst[j]='\0';
        return szDst;
    }


    Reply
  • Space Conversion

    Posted by Legacy on 10/01/2002 12:00am

    Originally posted by: Robert Rehrl

    Hi,
    
    

    I think the Space-Character ' ' must be converted to the Plus-Sign '+'.

    cu
    Robert

    Reply