URL Encoding

Environment: VC++, MFC

Introduction

The purpose of this article is to design a C++ class that performs URL encoding. The motivation was a previous project in which I needed to post data from a VC++ 6.0 application, and that data had to be URL encoded. I searched MSDN for a class or API that returns the URL-encoded value of a given input string, but did not find one, so I had to come up with my own URLEncode C++ class.

URLEncoder.exe is an MFC dialog-based application that uses the CURLEncode class.

Process

URL encoding is a special process that makes sure that all the characters are “safe” to transmit across the Internet. Some characters have special meaning to various programs involved in sending the data across the Internet.

For example, a carriage return has an ASCII value of 13. Programs that handle form data may interpret it as the end of a line of data.

Traditionally, all Web applications transfer data between the client and server by using the HTTP or HTTPS protocols. There are basically two ways in which a server receives input from a client:

  1. Data can be passed in the HTTP request itself (in headers such as cookies, or in the body of a posted form), or
  2. It can be included in the query portion of the requested URL.

When data is included in a URL, it must be specially encoded to conform to proper URL syntax. On the Web server side, the data is automatically decoded. Consider the following URL, where data is posted as a query string parameter.

Example: http://WebSite/ResourceName?Data=Data

Here, WebSite is the host name of the Web site.

ResourceName is the name of the ASP page or servlet.

Data is the value to be sent to the Web server. It must be encoded if the request's Content-Type is application/x-www-form-urlencoded.
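The automatic decoding on the server side can be pictured as the reverse transformation. The following is a minimal sketch in standard C++ (not part of the CURLEncode class, and not how any particular Web server implements it); the names hexValue and urlDecode are hypothetical. It assumes form-style data, where "+" stands for a space, and it leaves malformed "%" sequences untouched.

```cpp
#include <string>

// Convert one hex digit to its numeric value, or -1 if it is not a hex digit.
int hexValue(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    return -1;
}

// Decode "%XX" sequences and turn '+' back into a space.
std::string urlDecode(const std::string& input)
{
    std::string result;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (input[i] == '%' && i + 2 < input.size())
        {
            int hi = hexValue(input[i + 1]);
            int lo = hexValue(input[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                result += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        // Not a valid escape: copy the character, mapping '+' to a space.
        result += (input[i] == '+') ? ' ' : input[i];
    }
    return result;
}
```

With this sketch, urlDecode("a%20b") and urlDecode("a+b") both yield "a b".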

RFC 1738

The RFC 1738 specification defining Uniform Resource Locators (URLs) restricts the characters allowed in a URL to a subset of the US-ASCII character set. This is a limitation because HTML, on the other hand, allows the entire range of the ISO-8859-1 (ISO-Latin) character set to be used in documents. As a result, if HTML data is uploaded through a form post (or as part of a query string), all of it must be encoded.

ISO-8859-1 (ISO-Latin) Character Set

The ISO-8859-1 table covers the complete ISO-8859-1 (ISO-Latin) character set, that is, all 256 entries. For each character it gives the ISO-8859-1 position (its decimal code), description, entity number, hexadecimal value, and HTML result. Broadly, the range can be divided into safe and unsafe characters as follows.

Range (decimal)  Type                      Values                    Safe/Unsafe
0-31             ASCII control characters  (not printable)           Unsafe
32-47            Reserved characters       (space) !"#$%&'()*+,-./   Unsafe
48-57            ASCII digits              0-9                       Safe
58-64            Reserved characters       :;<=>?@                   Unsafe
65-90            ASCII letters             A-Z                       Safe
91-96            Reserved characters       [\]^_`                    Unsafe
97-122           ASCII letters             a-z                       Safe
123-126          Reserved characters       {|}~                      Unsafe
127              Control character         (DEL, not printable)      Unsafe
128-255          Non-ASCII characters      (not in US-ASCII)         Unsafe

All unsafe ASCII characters must be encoded, for example those in the ranges 32-47, 58-64, 91-96, and 123-126.
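Read strictly, the range table above reduces to a simple test: only digits and ASCII letters are safe. A minimal sketch of that test follows; the function name isSafeByRange is hypothetical. Note that this is stricter than the CURLEncode class shown later, which consults an explicit list of unsafe characters instead.

```cpp
// Classify one byte using the decimal ranges from the table above.
// Only digits and ASCII letters count as safe; everything else is unsafe.
bool isSafeByRange(unsigned char c)
{
    return (c >= 48 && c <= 57)    // 0-9
        || (c >= 65 && c <= 90)    // A-Z
        || (c >= 97 && c <= 122);  // a-z
}
```

Taking the argument as unsigned char keeps bytes in the 128-255 range from turning negative on platforms where plain char is signed.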

The table below describes why some of these characters are not safe.

Character  Encoding  Unsafe reason
<          %3C       Delimiter around URLs in free text
>          %3E       Delimiter around URLs in free text
"          %22       Delimits URLs in some systems
#          %23       Used in the World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it
{          %7B       Gateways and other transport agents are known to sometimes modify such characters
}          %7D       Gateways and other transport agents are known to sometimes modify such characters
|          %7C       Gateways and other transport agents are known to sometimes modify such characters
\          %5C       Gateways and other transport agents are known to sometimes modify such characters
^          %5E       Gateways and other transport agents are known to sometimes modify such characters
~          %7E       Gateways and other transport agents are known to sometimes modify such characters
[          %5B       Gateways and other transport agents are known to sometimes modify such characters
]          %5D       Gateways and other transport agents are known to sometimes modify such characters
`          %60       Gateways and other transport agents are known to sometimes modify such characters
+          %2B       Indicates a space (spaces cannot be used in a URL; a literal space is encoded as %20)
/          %2F       Separates directories and subdirectories
?          %3F       Separates the actual URL and the parameters
&          %26       Separator between parameters specified in the URL

How It Is Done

URL encoding of a character is done by taking the character’s 8-bit hexadecimal code and prefixing it with a percent sign (“%”). For example, the US-ASCII character set represents a space with decimal code 32, or hexadecimal 20. Thus, its URL-encoded representation is %20.
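The rule above can be sketched for a single byte in standard C++. This is an illustration only, not the article's decToHex/convert pair; the name percentEncodeByte is hypothetical.

```cpp
#include <string>

// Return "%XX" for one byte, where XX is its two-digit uppercase hex code.
std::string percentEncodeByte(unsigned char byte)
{
    static const char kHexDigits[] = "0123456789ABCDEF";
    std::string out = "%";
    out += kHexDigits[byte >> 4];    // high nibble
    out += kHexDigits[byte & 0x0F];  // low nibble
    return out;
}
```

For example, percentEncodeByte(' ') gives "%20" and percentEncodeByte('<') gives "%3C". A value below 16, such as the carriage return (13), keeps its leading zero and becomes "%0D".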

CURLEncode: CURLEncode is a C++ class that performs URL encoding on a given string of data. It has the following member functions.

  • isUnsafe
  • decToHex
  • convert
  • URLEncode

The URLEncode() method does the encoding. It checks each character in the string to see whether it is safe or unsafe (isUnsafe). An unsafe character is replaced with “%” followed by its hexadecimal value (convert) and appended to the result string; safe characters are appended unchanged.
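Since the body of URLEncode() is not listed in the snippet below, here is a minimal sketch of the same loop in standard C++, using std::string instead of CString. The function names and the contents of kUnsafeChars are assumptions made for illustration; the actual value of csUnsafeString is not shown in this article.

```cpp
#include <string>

// Hypothetical unsafe list for illustration only; the article does not show
// the real contents of csUnsafeString.
static const std::string kUnsafeChars = "\"<>#%{}|\\^~[]`+/?&";

// A character is unsafe if it is a control character, a space, at or above
// '{' (which includes all non-ASCII bytes), or present in the unsafe list.
bool isUnsafeChar(unsigned char c)
{
    if (c <= 32 || c >= 123)
        return true;
    return kUnsafeChars.find(static_cast<char>(c)) != std::string::npos;
}

// Copy safe characters as-is and replace unsafe ones with "%XX".
std::string urlEncode(const std::string& input)
{
    static const char kHexDigits[] = "0123456789ABCDEF";
    std::string result;
    result.reserve(input.size() * 3);  // worst case: every byte is encoded
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (isUnsafeChar(c))
        {
            result += '%';
            result += kHexDigits[c >> 4];
            result += kHexDigits[c & 0x0F];
        }
        else
        {
            result += static_cast<char>(c);
        }
    }
    return result;
}
```

Under these assumptions, urlEncode("a b") returns "a%20b" and urlEncode("x<y>&z") returns "x%3Cy%3E%26z". Including "%" itself in the unsafe list matters: without it, encoded output could not be told apart from literal percent signs in the input.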

Code Snippet

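class CURLEncode
{
private:
  static CString csUnsafeString;          // characters that are always encoded
  CString decToHex(char num, int radix);  // character code -> hex digits
  bool isUnsafe(char compareChar);
  CString convert(char val);              // "%" followed by the hex digits

public:
  CURLEncode() { };
  virtual ~CURLEncode() { };
  CString URLEncode(CString vData);
};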

bool CURLEncode::isUnsafe(char compareChar)
{
  // Look for the character in the list of unsafe characters.
  bool bcharfound = false;
  int m_strLen = csUnsafeString.GetLength();
  for(int ichar_pos = 0; ichar_pos < m_strLen; ichar_pos++)
  {
    if(csUnsafeString.GetAt(ichar_pos) == compareChar)
    {
      bcharfound = true;
      break;
    }
  }

  int char_ascii_value = (int) compareChar;

  // Safe only if not in the unsafe list and within the range 33-122.
  if(bcharfound == false && char_ascii_value > 32 &&
                            char_ascii_value < 123)
  {
    return false;
  }

  // Listed as unsafe, or outside the safe range.
  return true;
}

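CString CURLEncode::decToHex(char num, int radix)
{
  // hexVals is the lookup table of hex digit characters (not shown in this snippet).
  int temp = 0;
  CString csTmp;

  // Map a negative (signed) char onto the range 128-255.
  int num_char = (int) num;
  if (num_char < 0)
    num_char = 256 + num_char;

  // Collect the digits, least significant first.
  while (num_char >= radix)
  {
    temp = num_char % radix;
    num_char = num_char / radix;
    csTmp += hexVals[temp];
  }

  csTmp += hexVals[num_char];

  // Pad to two digits; the pad becomes the leading zero after the reversal.
  if(csTmp.GetLength() < 2)
  {
    csTmp += '0';
  }

  // Reverse the string so that the most significant digit comes first.
  CString strdecToHex(csTmp);
  strdecToHex.MakeReverse();

  return strdecToHex;
}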

CString CURLEncode::convert(char val)
{
  CString csRet;
  csRet += "%";
  csRet += decToHex(val, 16);
  return  csRet;
}

References

URL Encoding: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm.

RFC 1866: The HTML 2.0 specification (plain text). The appendix contains the Character Entity table: http://www.rfc-editor.org/rfc/rfc1866.txt.

The Web version of the HTML 2.0 (RFC 1866) Character Entity table: http://www.w3.org/MarkUp/html-spec/html-spec_13.html.

The HTML 3.2 (Wilbur) recommendation [This includes all character entities listed in HTML 2.0, plus new named entities covering the ISO 8859-1 160-191 range.]: http://www.w3.org/MarkUp/Wilbur/.

The HTML 4.0 Recommendation [Includes new Unicode character entities]: http://www.w3.org/TR/REC-html40/.

The W3C HTML Internationalization area: http://www.w3.org/International/O-HTML.html.

Downloads


URLEncoder Source Code – 42 Kb
