Step by Step Developing a SOHO HTTP Filter


Most available web filters work inline; this means that all outgoing and incoming packets are passed through a filter driver. This approach has its benefits but also a big flaw: the filtering process reduces data transfer throughput. This project presents an experimental remedy to this issue by putting the filter engine in sniffer mode. This way, the filtering process and data transfer act independently.

2. Requirements

  • The article expects the reader to be familiar with C++ and TCP/IP concepts.
  • The source code uses the following libraries:
    • Winsock
    • Ethereal: A packet capture and network analyzer
  • The code was compiled and built with VC7 on Windows Server 2003.

3. Introduction

The main goal of this article is to explain the practical details of low-level network programming. There are many commercial and open source firewalls and HTTP filters available on both Linux and Windows. But internally, most of them follow the same approach to find their targets. The reference section of this article provides you with handy books related to this topic.

Specifically, a web filter could act in two modes to inspect outgoing packets for blacklist keywords: Inline mode and Sniffer mode. I explain both modes of operation and compare them.

4. Background

This section explains the TCP/IP protocol stack, HTTP protocol behavior, and the Boyer-Moore algorithm for fast pattern matching. If you think that you have enough experience and knowledge and want to get your hands dirty, please skip to Section 7 (Implementation).

A mission critical system is totally different from an office application. Imagine your team plans to develop a firewall, web filter, or even a secure proxy server that processes tons of packets in a few seconds. What are the main characteristics of these systems?

Being fail-safe, high-performance, full-featured, and GREEN are a few. Green means the system should not eat up the CPU cycles and memory of the host platform. Meanwhile, in such an expensive project, there is no room for logical mistakes due to a lack of technical knowledge! In a typical environment, your final product must be deployed in the model below:

4.1 TCP/IP protocol stack

As the name suggests, TCP/IP refers to a number of protocols, each of which was developed for a specific purpose. To understand how an HTTP session is established, you should know at least the following protocols: Ethernet II, ARP, IP, TCP, UDP, HTTP, and DNS.

4.2 Ethernet II

The Ethernet II frame format was defined by the Ethernet specification created by Digital, Intel, and Xerox before the IEEE 802.3 Specification. The Ethernet II frame format is also known as the DIX frame format.

Ethernet II consists of the following fields (26 bytes in total, plus a payload of 46 to 1500 bytes):

  • The Preamble (8 bytes) consists of 7 bytes of alternating 1s and 0s (each byte is the bit sequence 10101010) to synchronize a receiving station and a 1-byte 10101011 sequence that indicates the start of a frame. The Preamble provides receiver synchronization and frame delimitation services.
  • The Destination Address (6 bytes) indicates the destination’s address. The destination can be a unicast, a multicast, or the Ethernet broadcast address. The unicast address is also known as an individual, physical, hardware, or MAC address. For the Ethernet broadcast address, all 48 bits are set to 1 to create the address 0xFF-FF-FF-FF-FF-FF.
  • The Source Address (6 bytes) indicates the sending node’s unicast address.
  • The EtherType (2 bytes) indicates the upper layer protocol contained within the Ethernet frame. For an IP datagram, the field is set to 0x0800. For an ARP message, the EtherType field is set to 0x0806. For a complete list, refer to the references.
  • The Payload field for an Ethernet II frame consists of a protocol data unit (PDU) of an upper-layer protocol. Ethernet II can send a maximum-sized payload of 1500 bytes. Because of Ethernet’s collision detection facility, Ethernet II frames must send a minimum payload size of 46 bytes.
  • The FCS (4 bytes) provides bit-level integrity verification on the bits in the Ethernet II frame using the CRC algorithm.

A typical capture shows you the following fields: (FCS and preamble are excluded)

  | destination   | source       | protocol |                   |
  | mac address   | mac address  | type     | IP DATAGRAM       |
  | (6 bytes)     | (6 bytes)    | 0x0800   |                   |
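
The 14-byte layout above (two MAC addresses followed by the EtherType) can be read from a captured frame roughly as follows. This is a minimal sketch, not the article's own source code: the names `EthernetHeader`, `parseEthernetHeader`, and `isBroadcast` are hypothetical, and it assumes the capture buffer starts at the destination address, with the preamble and FCS already stripped.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration only.
const std::size_t   kEthHeaderLen  = 14;      // 6 + 6 + 2 bytes (no preamble, no FCS)
const std::uint16_t kEtherTypeIPv4 = 0x0800;  // payload is an IP datagram
const std::uint16_t kEtherTypeARP  = 0x0806;  // payload is an ARP message

struct EthernetHeader {
    std::uint8_t  destination[6];
    std::uint8_t  source[6];
    std::uint16_t etherType;  // converted to host byte order by the parser
};

// Copies the Ethernet II header out of a captured frame.
// Returns false if the buffer cannot hold a complete header.
bool parseEthernetHeader(const std::uint8_t* frame, std::size_t length,
                         EthernetHeader& out)
{
    if (frame == nullptr || length < kEthHeaderLen) {
        return false;
    }
    for (int i = 0; i < 6; ++i) {
        out.destination[i] = frame[i];
        out.source[i]      = frame[6 + i];
    }
    // The EtherType travels in network (big-endian) byte order.
    out.etherType = static_cast<std::uint16_t>((frame[12] << 8) | frame[13]);
    return true;
}

// True when all 48 destination bits are set (FF-FF-FF-FF-FF-FF).
bool isBroadcast(const EthernetHeader& header)
{
    for (int i = 0; i < 6; ++i) {
        if (header.destination[i] != 0xFF) {
            return false;
        }
    }
    return true;
}
```

A filter would typically call `parseEthernetHeader` on every captured frame and continue only when `etherType` equals `kEtherTypeIPv4`; the IP datagram then starts at offset `kEthHeaderLen`.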

4.3 ARP

ARP is a broadcast-based, request-reply protocol that provides a dynamic address resolution facility to map next-hop IP addresses to their corresponding MAC addresses.

Two facts about the data link layer show why you need ARP:

  1. When an Ethernet frame is sent from one host on a LAN to another, it is the 48-bit Ethernet address that determines for which interface the frame is destined. The device driver software never looks at the destination IP address in the IP datagram.
  2. The next-hop IP address is not necessarily the same as the destination IP address of the IP datagram. The result of the route determination process for every outgoing IP datagram is a next-hop interface and a next-hop IP address.

For direct deliveries to destinations on the same subnet, the next-hop IP address is the datagram’s destination IP address. For indirect deliveries to remote destinations, the next-hop IP address is the IP address of a router on the same subnet as the forwarding host. To hand the packet to that next hop, the sender needs the next hop’s hardware address, and that is what ARP resolves.
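
The direct versus indirect decision can be sketched in a few lines. This is a simplified illustration under stated assumptions: a host with a single interface and a single default gateway, rather than a full routing table. `RouteConfig`, `ipv4`, and `nextHopFor` are hypothetical names, and addresses are held in host byte order.

```cpp
#include <cstdint>

// Builds a host-order IPv4 address from its four dotted-decimal octets.
inline std::uint32_t ipv4(std::uint8_t a, std::uint8_t b,
                          std::uint8_t c, std::uint8_t d)
{
    return (static_cast<std::uint32_t>(a) << 24) |
           (static_cast<std::uint32_t>(b) << 16) |
           (static_cast<std::uint32_t>(c) << 8)  |
            static_cast<std::uint32_t>(d);
}

// Hypothetical, simplified view of a host's routing information.
struct RouteConfig {
    std::uint32_t localAddress;
    std::uint32_t subnetMask;
    std::uint32_t defaultGateway;
};

// Returns the IP address whose MAC address ARP has to resolve:
// the destination itself for direct delivery, the gateway otherwise.
std::uint32_t nextHopFor(const RouteConfig& config, std::uint32_t destination)
{
    const bool sameSubnet =
        (destination & config.subnetMask) ==
        (config.localAddress & config.subnetMask);
    return sameSubnet ? destination : config.defaultGateway;
}
```

For a host at 192.168.1.10/24 with gateway 192.168.1.1, a datagram for 192.168.1.20 is delivered directly, while a datagram for 8.8.8.8 is sent to the gateway's MAC address even though its IP destination field still says 8.8.8.8.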

4.4 IP

IP, the heart of the TCP/IP protocol suite, provides a connectionless, unreliable delivery of data. By unreliable, I mean that there is no guarantee that a datagram successfully gets to its destination. By connectionless, I mean that IP doesn’t maintain any information regarding successive datagrams. In other words, each datagram is handled independently. IP makes a best effort to deliver packets to the next hop or the final destination. End-to-end reliability is the responsibility of upper-layer protocols such as TCP.

The IP header contains the following fields that you need to know for later packet processing:

  • Version (4 bits): Indicates the format of the Internet header.
  • IHL (4 bits): Is the length of the IP header in 32-bit words.
  • Type of service or TOS (1 byte): As the name indicates, it specifies how important the IP packet is. Some intermediate devices evaluate this field under high load to prioritize the datagram. In RFC 791 (Internet Protocol), this field is structured as follows:

           0     1     2     3     4     5     6     7
        |                 |     |     |     |     |     |
        |   PRECEDENCE    |  D  |  T  |  R  |  0  |  0  |
        |                 |     |     |     |     |     |
        D                >> Delay
        T                >> Throughput
        R                >> Reliability
        Bit6 and Bit7    >> reserved
  • Total length (2 bytes): Total size of header + payload. The total length field conceptually allows the length of a datagram to be up to 65,535 bytes, although such a long packet is impractical for most hosts and network devices.

    Later, you will learn what an MTU is and how it limits the datagram size in practice. For now, remember that the IP header alone is 20 bytes and that, if there are any options, its length can go as high as 60 bytes. No more!

  • ID (2 bytes): Assigned by the sender so that the receiver can reassemble IP packets that were fragmented because of the MTU value.

    This is the second time I’ve mentioned the MTU, so here is what it means. Simply put, most of the data you generate while, for example, surfing the web, is bulk data; that is, the data is large. The underlying media access protocol splits the bulk into smaller parts so that it can be sent seamlessly over the network infrastructure. In the case of IEEE 802.3 Ethernet, the maximum size of a datagram is 1500 bytes. This number is the MTU; it stands for Maximum Transmission Unit.

    If you want to transmit a 15000-byte data stream to your mate, the protocol stack splits your message into roughly ten pieces of at most 1500 bytes each and transmits them one by one. With 20 bytes taken by the IP header, 1480 bytes of each piece remain for the transport layer header and payload. This is where the ID comes into the picture. The protocol stack writes the same ID, unique to that message, into the IP header of every piece. When the receiver has collected all the pieces that carry this ID, it can reassemble them and process the whole message.

  • Flags (3 bits): Indicate whether the datagram may be fragmented and whether it is part of fragmented data.

          0   1   2
        |   | D | M |
        | 0 | F | F |
        Bit0    >> reserved, must be zero
        DF      >> Don't Fragment
        MF      >> More Fragments
  • Fragment Offset (13 bits): An 8-byte chunk of data is called a fragment block. The number in the Fragment Offset field gives the offset in fragment blocks. The field is 13 bits long, so offsets can range from 0 to 8191 fragment blocks, corresponding to byte offsets from 0 to 65,528.
  • TTL (1 byte): This field says how long a datagram may remain alive in the network. RFC 791 defines the unit as seconds, but every router that forwards the datagram decrements the field by at least one, so in practice it acts as a hop count.
  • Protocol (1 byte): Indicates the upper layer protocol type. For example:
       1 >> ICMP
       2 >> IGMP
       4 >> IP in IP encapsulation
       6 >> TCP
      17 >> UDP
      41 >> IPv6
      47 >> Generic Routing Encapsulation (GRE)
      50 >> IPsec Encapsulating Security Payload (ESP)
      51 >> IPsec Authentication Header (AH)
      89 >> OSPF
  • Header checksum (2 bytes): To verify the integrity of the header, the protocol stack handler computes a 16-bit one's complement checksum over the header (not a CRC) and compares it with the value in this field. It is a kind of sanity check.
  • Source IP Address and Destination IP Address (each 4 bytes).
  • Options (variable length): Maintains a list of optional information for the datagram.
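
The MTU arithmetic and the 8-byte unit of the Fragment Offset field can be checked with a small calculation. The sketch below is illustrative only: `Fragment` and `planFragments` are hypothetical names, it assumes a fixed header size without options, and it only plans the fragments rather than building real packets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One planned fragment of a larger IP payload (hypothetical type).
struct Fragment {
    std::uint16_t offsetBlocks;   // value for the Fragment Offset field (8-byte blocks)
    std::uint16_t payloadBytes;   // data carried after the IP header
    bool          moreFragments;  // value for the MF flag
};

// Splits payloadBytes of data into fragments that fit the given MTU.
// Returns an empty plan if the input cannot be expressed in an IPv4 datagram.
std::vector<Fragment> planFragments(std::size_t payloadBytes,
                                    std::size_t mtu,
                                    std::size_t headerBytes = 20)
{
    std::vector<Fragment> plan;
    if (mtu <= headerBytes || payloadBytes + headerBytes > 65535) {
        return plan;
    }
    // Every fragment except the last must carry a multiple of 8 bytes,
    // because the offset field counts 8-byte blocks.
    const std::size_t maxChunk = ((mtu - headerBytes) / 8) * 8;
    if (maxChunk == 0) {
        return plan;
    }
    std::size_t offset = 0;
    while (offset < payloadBytes) {
        const std::size_t remaining = payloadBytes - offset;
        const std::size_t chunk = remaining < maxChunk ? remaining : maxChunk;
        Fragment fragment;
        fragment.offsetBlocks  = static_cast<std::uint16_t>(offset / 8);
        fragment.payloadBytes  = static_cast<std::uint16_t>(chunk);
        fragment.moreFragments = (offset + chunk) < payloadBytes;
        plan.push_back(fragment);
        offset += chunk;
    }
    return plan;
}
```

Under these assumptions, `planFragments(15000, 1500)` yields 11 fragments rather than 10: each fragment spends 20 of its 1500 bytes on its own IP header, so ten fragments carry 14,800 bytes and an eleventh carries the remaining 200 bytes at offset 1850 blocks.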

    |Version|  IHL  |Type of Service|          Total Length       |
    |         Identification        |Flags|      Fragment Offset  |
    |  Time to Live |    Protocol   |         Header Checksum     |
    |                       Source Address                        |
    |                    Destination Address                      |
    |                    Options                    |    Padding  |
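
For packet processing, the layout above maps onto a parser along these lines. Treat it as a sketch rather than the article's implementation: `IPv4Header`, `parseIPv4Header`, and `internetChecksum` are hypothetical names, options are skipped rather than decoded, and the buffer is assumed to start at the first byte of the IP header.

```cpp
#include <cstddef>
#include <cstdint>

// Decoded fixed part of an IPv4 header, in host byte order (hypothetical type).
struct IPv4Header {
    std::uint8_t  version;
    std::uint8_t  headerBytes;           // IHL converted from 32-bit words to bytes
    std::uint8_t  typeOfService;
    std::uint16_t totalLength;
    std::uint16_t id;
    bool          dontFragment;          // DF flag
    bool          moreFragments;         // MF flag
    std::uint16_t fragmentOffsetBlocks;  // in 8-byte fragment blocks
    std::uint8_t  ttl;
    std::uint8_t  protocol;              // 6 = TCP, 17 = UDP, ...
    std::uint16_t checksum;
    std::uint32_t source;
    std::uint32_t destination;
};

inline std::uint16_t readBE16(const std::uint8_t* p)
{
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

inline std::uint32_t readBE32(const std::uint8_t* p)
{
    return (static_cast<std::uint32_t>(readBE16(p)) << 16) | readBE16(p + 2);
}

// One's complement of the one's complement sum of all 16-bit words.
// Run over a header that includes a correct checksum, the result is 0.
std::uint16_t internetChecksum(const std::uint8_t* data, std::size_t length)
{
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i + 1 < length; i += 2) {
        sum += readBE16(data + i);
    }
    if (length & 1) {
        sum += static_cast<std::uint32_t>(data[length - 1]) << 8;
    }
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);  // fold the carries back in
    }
    return static_cast<std::uint16_t>(~sum & 0xFFFF);
}

// Returns true only if the header is well formed and its checksum verifies.
bool parseIPv4Header(const std::uint8_t* packet, std::size_t length,
                     IPv4Header& out)
{
    if (packet == nullptr || length < 20) {
        return false;
    }
    out.version     = static_cast<std::uint8_t>(packet[0] >> 4);
    out.headerBytes = static_cast<std::uint8_t>((packet[0] & 0x0F) * 4);
    if (out.version != 4 || out.headerBytes < 20 || out.headerBytes > length) {
        return false;
    }
    out.typeOfService = packet[1];
    out.totalLength   = readBE16(packet + 2);
    out.id            = readBE16(packet + 4);
    const std::uint16_t flagsAndOffset = readBE16(packet + 6);
    out.dontFragment         = (flagsAndOffset & 0x4000) != 0;
    out.moreFragments        = (flagsAndOffset & 0x2000) != 0;
    out.fragmentOffsetBlocks = static_cast<std::uint16_t>(flagsAndOffset & 0x1FFF);
    out.ttl         = packet[8];
    out.protocol    = packet[9];
    out.checksum    = readBE16(packet + 10);
    out.source      = readBE32(packet + 12);
    out.destination = readBE32(packet + 16);
    return internetChecksum(packet, out.headerBytes) == 0;
}
```

Reading the fields byte by byte avoids relying on compiler-specific struct packing and on the host's byte order. The upper-layer header (TCP or UDP) begins `headerBytes` bytes into the packet, which is why the IHL must be honored instead of assuming a fixed 20 bytes.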
