Most available web filters work inline; this means that all outgoing and incoming packets are passed through a filter driver. This approach, along with its own benefits, has a big flaw; the filtering process affects data transfer throughput. This project represents an experimental remedy to this issue by putting the filter engine in sniffer mode. This way, the filtering process and data transfer act independently.
- The article expects the reader to be familiar with C++ and TCP/IP concepts.
- The source code uses the following libraries:
- Ethereal: A packet capture and network analyzer
- The code was compiled and built with VC7 on Windows server 2003.
The main goal of this article is to explain the practical details of low-level network programming. There are many commercial and open source firewalls and HTTP filters available on both Linux and Windows. But internally, most of them follow the same approach to find their targets. The reference section of this article provides you with handy books related to this topic.
Specifically, a web filter could act in two modes to inspect outgoing packets for blacklist keywords: Inline mode and Sniffer mode. I explain both modes of operation and compare them.
This section explains the TCP/IP protocol stack, HTTP protocol behavior, and a Boyer-Moore algorithm to perform fast pattern matching. If you think that you have enough experience and knowledge and want to get your hands dirty, please continue to Section 7 (Implementation).
A mission critical system is totally different from an office application. Imagine your team plans to develop a firewall, web filter, or even a secure proxy server that processes tons of packets in a few seconds. What are the main characteristics of these systems?
Being fail-safe, high-performance, full-feature, and GREEN are a few. Green means the system should not eat up CPU cycles and the memory of the hosted platform. Meanwhile, in such an expensive project, there is no room for logical mistakes due to a lack of technical knowledge! In a typical environment, your last result must be deployed in the model below:
4.1 TCP/IP protocol stack
As the name describes, TCP/IP refers to a number of protocols, each of which was developed for a purpose. To understand a HTTP session establishment process, you should know at least the following protocols: Ethernet II, ARP, IP, TCP, UDP, HTTP, and DNS.
4.2 Ethernet II
The Ethernet II frame format was defined by the Ethernet specification created by Digital, Intel, and Xerox before the IEEE 802.3 Specification. The Ethernet II frame format is also known as the DIX frame format.
Ethernet II consists of the following fields: (Totally 26 bytes + payload from 46 bytes to 1500 bytes)
- The Preamble (8 bytes) consists of 7 bytes of alternating 1s and 0s (each byte is the bit sequence 10101010) to synchronize a receiving station and a 1-byte 10101011 sequence that indicates the start of a frame. The Preamble provides receiver Synchronization and frame delimitation services.
- The Destination Address (6 bytes) indicates the destination's address. The destination can be a unicast, a multicast, or the Ethernet broadcast address. The unicast address is also known as an individual, physical, hardware, or MAC address. For the Ethernet broadcast address, all 48 bits are set to 1 to create the address 0xFF-FF-FF-FF-FF-FF.
- The Source Address (6 bytes) indicates the sending node's unicast address.
- The EtherType (2 bytes) indicates the upper layer protocol contained within the Ethernet frame. For an IP datagram, the field is set to 0x0800. For an ARP message, the EtherType field is set to 0x0806. For a complete list, refer to the references.
- The Payload field for an Ethernet II frame consists of a protocol data unit (PDU) of an upper-layer protocol. Ethernet II can send a maximum-sized payload of 1500 bytes. Because of Ethernet's collision detection facility, Ethernet II frames must send a minimum payload size of 46 bytes.
- The FCS (4 bytes) provides bit-level integrity verification on the bits in the Ethernet II frame using the CRC algorithm.
A typical capture shows you the following fields: (FCS and preamble are excluded)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | destination | source | protocol | | | mac address | mac address | type | IP DATAGRAM | | (6 bytes) | (6 bytes) | 0X0800 | | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ARP is a broadcast-based, request-reply protocol that provides a dynamic address resolution facility to map next-hop IP addresses to their corresponding MAC addresses.
There are two facts regarding the datalink layer that show you need ARP:
- When an Ethernet frame is sent from one host on a LAN to another, it is the 48-bit Ethernet address that determines for which interface the frame is destined. The device driver software never looks at the destination IP address in the IP datagram.
- The next-hop IP address is not necessarily the same as the destination IP address of the IP datagram, the result of the route. The determination process for every outgoing IP datagram is a next-hop interface and a next-hop IP address.
For direct deliveries to destinations on the same subnet, the next-hop IP address is the datagram's destination IP address. For indirect deliveries to remote destinations, the next-hop IP address is the IP address of a router on the same Subnet as the forwarding host. To get that device as a next hop, the packet needs its hardware address.
IP, the heart of a TCP/IP protocol suite, provides a connectionless, unreliable delivery of data. By unreliable, I mean that there is no guarantee that a datagram successfully gets to its destination. By connectionless, I mean that the IP doesn't maintain any information regarding successive datagrams. On the other hand, each datagram is handled independently. The IP makes a best effort to deliver packets to the next hop or the final destination. End-to-end reliability is the responsibility of upper-layer protocols such as TCP.
The IP header contains the following fields that you need to know for later packet processing:
- Version (4 bytes): Indicates the format of Internet header.
- IHL (4 bytes): Is the length of IP header in 32-bit words.
- Type of service or TOS (1 byte): As the name indicates, it specifies how important the IP packet is for you. Some intermediate devices evaluate this field in the case of high load and prioritize the datagram. In RFC 791 (Internet protocol), this field is structured as follows:
0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | | | | | | | | PRECEDENCE | D | T | R | 0 | 0 | | | | | | | | +-----+-----+-----+-----+-----+-----+-----+-----+ D >> Delay T >> Throughput R >> Reliability Bit6 and Bit7 >> reserved
Later, you will learn what an MTU is and how it may help you put this in reality. Anyway, remember that the only IP header is 20 bytes and if there are any options, the length can go as high as 60 bytes. No more!
For the second time, I've mentioned "MTU;" see what the MTU is: Simply put, most of the data you generate while, for example, surfing the web, are bulk data. It means that the size of the data is big. The underlying media access protocol splits the bulk into smaller parts so that it can send seamlessly over the network infrastructure. In case of HTTP 802.3 Ethernet Protocol, the maximum size of a datagram is 1500 bytes. This number is the MTU; it stands for Maximum Transmission Unit.
In case you want to transmit a 15000-byte data stream to your mate, the protocol stack splits your message to 10 * 1500 bytes and transmits them one by one. If you put 20 bytes for the IP header, 1480 byes remain for the transport layer, header, and payload. There is where ID comes into the picture. The protocol stack splits the message into 10 smaller messages and assigns a unique ID in the IP header. When the receiver takes all the pieces, it can do further processing over the whole message.
0 1 2 +---+---+---+ | | D | M | | 0 | F | F | +---+---+---+
1 >> ICMP
2 >> GIMP
4 >> IP in IP encapsulation
6 >> TCP
17 >> ADP
41 >> IPV6
47 >> Generic routing encapsulation (ARE)
50 >> IP security encapsulation security payload (ESP)
51 >> IP security authentication header (AH)
89 >> ASP
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Version| IHL |Type of Service| Total Length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Identification |Flags| Fragment Offset | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Time to Live | Protocol | Header Checksum | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Source Address | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Destination Address | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Options | Padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+