Of the basic matrix transforms in any 3D graphics programmer's toolkit, projection matrices are among the more complicated. Translation and scaling can be understood at a glance, and a rotation matrix can be conjured up by anyone with a basic understanding of trigonometry, but projection is a bit tricky. If you've ever looked up the formula for such a matrix, you know that common sense isn't enough to tell you where it came from. And yet, I haven't seen many resources online that will describe just how one derives a projection matrix. That is the topic I will address in this article.
For those of you just starting out in 3D graphics, I should mention that understanding where a projection matrix comes from may be a matter of curiosity to the mathematically inclined among us, but it's not a necessity. You can make do with just the formula; and if you're using a graphics API like Direct3D that will build a projection matrix for you, you don't even need that. So, if the details of this article seem a little overwhelming, fear not. As long as you understand what projection does, you needn't concern yourself with how it works if you don't want to. This article is for programmers who like to know a bit more detail than is strictly necessary.
Overview: What Is Projection?
A computer monitor is a two-dimensional surface, so if you want to display three-dimensional images, you need a way to transform 3D geometry into a form that can be rendered as a 2D image. And that's exactly what projection does. To use a very simple example, one way to project a 3D object onto a 2D surface would be to simply throw away the z-coordinate of each point. For a cube, that might look something like Figure 1.
Figure 1: Projection onto the xy plane by discarding z-coordinates.
Of course, this is overly simple and not particularly useful in most cases. For starters, you will not be projecting onto a plane at all; rather, your projection formulae will transform your geometry into a new volume, called the canonical view volume. The exact coordinates of the canonical view volume may vary from one graphics API to another, but for the purposes of this discussion, consider it to be the box that extends from (–1, –1, 0) to (1, 1, 1), which is the convention used by Direct3D. Once all your vertices have been mapped into the canonical view volume, only their x- and y-coordinates are used to map them to the screen. The z-coordinate is not useless, however; it's typically used by a depth buffer for visibility determination. This is the reason you transform into a new volume, rather than project onto a plane.
Note that Figure 1 also depicts a left-handed coordinate system, where the camera is looking down the positive z-axis, with the y-axis pointing up and the x-axis pointing to the right. This is again a convention used by Direct3D, and one I'll use throughout the article. None of the calculations are significantly different for a right-handed coordinate system, or for a slightly different canonical view volume, so everything discussed will still apply even if your API of choice uses different conventions than those used by Direct3D.
With that, you can get into the actual projection transforms. There are quite a few different methods of projection out there, and I will cover two of the most common: orthographic and perspective.
Orthographic Projection
Orthographic projection, so called because all the lines of projection are perpendicular to the eventual drawing surface, is a relatively simple projection technique. The view volume—that is, the region of eye space that contains all the geometry you want to display—is an axis-aligned box that you transform into the canonical view volume, as shown in Figure 2.
Figure 2: Orthographic projection.
As you can see, the view volume is defined by six planes:
Left: x = l
Right: x = r
Bottom: y = b
Top: y = t
Near: z = n
Far: z = f
Because the view volume and the canonical view volume are both axis-aligned boxes, there is no correction for distance in this type of projection. The end result is, in fact, a lot like the result in Figure 1 where you just dropped the z-coordinate of every point. Objects of the same size in 3D space appear the same size in the projection, even if one is much further from the camera than the other. Lines that are parallel in 3D space remain parallel in the final image. Using this kind of projection would be out of the question for something like a first-person shooter—imagine trying to play one of those without being able to tell how far away anything is!—but it does have its uses. You might use it in a tile-based game, for instance, especially one where the camera is positioned at a fixed angle. Figure 3 shows a simple example.
Figure 3: A simple example of orthographic projection.
So without further ado, start to figure out how this is going to work. The easiest approach may be to consider each of your three axes separately, and compute how to map points along that axis from the original view volume into the canonical view volume. You begin with the x-coordinate. A point within your view volume will have an x-coordinate on the range [l, r], and you want to transform it to the range [–1, 1].
Written as an inequality, that starting condition is:
$$l \le x \le r$$
Now, in preparation to scale the range down to the size you want, you subtract l from all terms to produce a zero on the left-hand side. Another approach you could take here would be to translate the range so that it centers on zero, rather than having one of its endpoints at zero, but the algebra is a bit neater this way, so I'll do it like this for the sake of readability.
$$0 \le x - l \le r - l$$
Now that one end of your range is positioned at zero, you can scale it down to the size you want. You want the range of x-values to be two units wide, from –1 to 1, so you multiply through by 2/(r – l). Note that r – l is the width of your view volume and is thus always a positive number, so you don't have to worry about the inequalities changing directions.
$$0 \le \frac{2(x - l)}{r - l} \le 2$$
Next, you subtract one from all terms to produce your desired range of [–1, 1].
$$-1 \le \frac{2(x - l)}{r - l} - 1 \le 1$$
A bit of basic algebra allows you to write the center term as a single fraction:
$$-1 \le \frac{2x - l - r}{r - l} \le 1$$
Finally, you split the center term into two fractions so that it takes the form px + q; you need to group your terms this way so that the equations you derive can be easily translated into matrix form.
$$-1 \le \frac{2x}{r - l} - \frac{r + l}{r - l} \le 1$$
The center term of this inequality now gives you the equation you need to transform x into the canonical view volume.
$$x' = \frac{2x}{r - l} - \frac{r + l}{r - l}$$
The steps required to obtain a formula for y are exactly the same—just substitute y for x, t for r, and b for l—so rather than repeat them here, I'll just show the result:
$$y' = \frac{2y}{t - b} - \frac{t + b}{t - b}$$
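Because the x and y mappings have exactly the same shape, they can be expressed as one small helper. The following is only a minimal sketch; the name mapToSignedUnit is hypothetical and is not part of Direct3D.

```cpp
#include <cassert>

// Hypothetical helper: maps v from the range [lo, hi] to [-1, 1]
// using the p*v + q form derived in the text.
double mapToSignedUnit(double v, double lo, double hi) {
    assert(hi > lo);  // the view volume must have positive extent
    const double p = 2.0 / (hi - lo);          // scale term
    const double q = -(hi + lo) / (hi - lo);   // offset term
    return p * v + q;
}
```

Calling mapToSignedUnit(x, l, r) gives x', and mapToSignedUnit(y, b, t) gives y'. Keeping p and q separate mirrors the form that goes into the matrix later.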
Finally, you need to derive a formula for z. It's a little different in this case because you're mapping z to the range [0, 1] rather than [–1, 1], but this should look very familiar. Here's your starting condition, a z-coordinate on the range [n, f]:
$$n \le z \le f$$
You subtract n from all terms so the lower end of the range is positioned at zero:
$$0 \le z - n \le f - n$$
And now, all that's left is to divide through by f – n to produce a final range of [0, 1]. As before, note that f – n is the depth of your view volume and is thus always positive, so the inequalities keep their direction.
$$0 \le \frac{z - n}{f - n} \le 1$$
Finally, you split this into two fractions so it takes the form pz + q:
$$0 \le \frac{z}{f - n} - \frac{n}{f - n} \le 1$$
This gives you your formula for transforming z:
$$z' = \frac{z}{f - n} - \frac{n}{f - n}$$
Now, you're ready to write your orthographic projection matrix. To recap your work thus far, here are the three projection equations you've derived:
$$x' = \frac{2x}{r - l} - \frac{r + l}{r - l}$$
$$y' = \frac{2y}{t - b} - \frac{t + b}{t - b}$$
$$z' = \frac{z}{f - n} - \frac{n}{f - n}$$
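If you write this in matrix form, treating points as column vectors, you get:
$$
\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} =
\begin{bmatrix}
\frac{2}{r - l} & 0 & 0 & -\frac{r + l}{r - l} \\
0 & \frac{2}{t - b} & 0 & -\frac{t + b}{t - b} \\
0 & 0 & \frac{1}{f - n} & -\frac{n}{f - n} \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
$$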
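To see the matrix at work, here is a minimal C++ sketch that builds it and applies it to a point. It assumes the column-vector layout (offset terms in the last column, bottom row [0 0 0 1]). The names Vec4, Mat4, orthoOffCenter, and transform are hypothetical helpers for illustration and are not the D3DX API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;  // indexed as m[row][col]

// Builds the off-center orthographic matrix derived in the text.
Mat4 orthoOffCenter(double l, double r, double b, double t, double n, double f) {
    assert(r > l && t > b && f > n);
    Mat4 m{};  // all entries start at zero
    m[0][0] = 2.0 / (r - l);  m[0][3] = -(r + l) / (r - l);
    m[1][1] = 2.0 / (t - b);  m[1][3] = -(t + b) / (t - b);
    m[2][2] = 1.0 / (f - n);  m[2][3] = -n / (f - n);
    m[3][3] = 1.0;
    return m;
}

// Computes m * v, treating v as a column vector.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            out[row] += m[row][col] * v[col];
    return out;
}
```

For example, with l = –1, r = 3, b = 0, t = 2, n = 1, and f = 9, the corner (–1, 0, 1, 1) of the view volume maps to (–1, –1, 0, 1), and the opposite corner (3, 2, 9, 1) maps to (1, 1, 1, 1).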
That's it! Direct3D provides a function called D3DXMatrixOrthoOffCenterLH() (what a mouthful!) that constructs an orthographic projection matrix based on this same formula; you can find it in the DirectX documentation. The "LH" in that unwieldy function name refers to the fact that you're using a left-handed coordinate system. But, what exactly does "OffCenter" mean?
The answer to that question leads you to a simplified form of the orthographic projection matrix. Consider a few points: First, in eye space, your camera is positioned at the origin and looking directly down the z-axis. And second, you usually want your field of view to extend equally far to the left as it does to the right, and equally far above the z-axis as below. If that is the case, the z-axis passes directly through the center of your view volume, and so you have r = –l and t = –b. In other words, you can forget about r, l, t, and b altogether, and simply define your view volume in terms of a width w, and a height h, along with your other clipping planes f and n. If you make those substitutions into the orthographic projection matrix above, you get this rather simplified version:
$$
\begin{bmatrix}
\frac{2}{w} & 0 & 0 & 0 \\
0 & \frac{2}{h} & 0 & 0 \\
0 & 0 & \frac{1}{f - n} & -\frac{n}{f - n} \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
This equation is implemented by the Direct3D function D3DXMatrixOrthoLH(). You can almost always use this matrix instead of the more general, "off center" version that you derived above, unless you're doing something strange with your projection.
One further point before you finish this section. It's instructive to note that this matrix can be represented as the concatenation of two simpler transforms: a translation followed by a scale. This should make sense to you if you think about it geometrically because all you're doing in an orthographic projection is shifting points from one axis-aligned box to another; the viewing volume doesn't change its shape, only its position and its size. Specifically, you have:
$$
\begin{bmatrix}
\frac{2}{w} & 0 & 0 & 0 \\
0 & \frac{2}{h} & 0 & 0 \\
0 & 0 & \frac{1}{f - n} & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & -n \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
This product form of your projection is perhaps a bit more intuitive because it lets you more easily visualize what's happening. First, the viewing volume is translated along the z-axis so that its near plane coincides with the origin; then, a scale is applied to bring it down to the dimensions of the canonical view volume. That's easy enough to understand, right? The matrix for an off-center orthographic projection also can be represented as the product of a translation and a scale, but it's similar enough to the result shown above that I won't list it here.
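If you'd like to check the factorization numerically, the following C++ sketch multiplies a scale by a translation and lets you compare the result with the centered matrix built directly. It assumes column vectors, so the translation (applied first) sits on the right of the product. All function names here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Mat4 = std::array<std::array<double, 4>, 4>;  // indexed as m[row][col]

Mat4 identity() {
    Mat4 m{};
    for (std::size_t i = 0; i < 4; ++i) m[i][i] = 1.0;
    return m;
}

// Standard 4x4 matrix product a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                out[i][j] += a[i][k] * b[k][j];
    return out;
}

// Translation that moves the near plane z = n onto z = 0.
Mat4 translateNearToOrigin(double n) {
    Mat4 m = identity();
    m[2][3] = -n;
    return m;
}

// Scale that shrinks a w x h x (f - n) box to 2 x 2 x 1.
Mat4 scaleToCanonical(double w, double h, double n, double f) {
    assert(w > 0.0 && h > 0.0 && f > n);
    Mat4 m = identity();
    m[0][0] = 2.0 / w;
    m[1][1] = 2.0 / h;
    m[2][2] = 1.0 / (f - n);
    return m;
}

// The simplified (centered) orthographic matrix, written out directly.
Mat4 orthoCentered(double w, double h, double n, double f) {
    assert(w > 0.0 && h > 0.0 && f > n);
    Mat4 m = identity();
    m[0][0] = 2.0 / w;
    m[1][1] = 2.0 / h;
    m[2][2] = 1.0 / (f - n);
    m[2][3] = -n / (f - n);
    return m;
}
```

Here, multiply(scaleToCanonical(w, h, n, f), translateNearToOrigin(n)) should match orthoCentered(w, h, n, f) entry for entry, up to floating-point rounding.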
That about wraps it up for orthographic projections, so now you can move on to something a little more challenging.
Perspective Projection
Perspective projection is a slightly more complicated method of projection, and more frequently used because it creates the illusion of distance and thus produces a more realistic image. Geometrically speaking, the difference between this method and orthographic projection is that in perspective projection, the view volume is a frustum—that is, a truncated pyramid—rather than an axis-aligned box. You can see this in Figure 4.
Figure 4: Perspective projection.
As you can see, the near plane of the view frustum extends from (l, b, n) to (r, t, n). The extents of the far plane are found by tracing a line from the origin through each of the four corner points on the near plane until it intersects the plane z = f. Because the view frustum gets increasingly wider as it extends further from the origin, and because you're transforming that shape into the canonical view volume, which is a box, the far end of the view frustum is compressed to a greater extent than the near end. Thus, objects further back in the view frustum are made to appear smaller, and this gives you the illusion of distance.
Because the shape of the volume changes in this transformation, perspective projection can't be expressed as a simple translation and scale like orthographic projection. You'll have to work out something a bit different. But, that doesn't mean that the work you did on orthographic projection is useless. A handy problem-solving technique in mathematics is to reduce a problem to one that you already know how to solve. So, that's what you can do here. The last time, you examined one coordinate at a time, but this time you'll do the x- and y-coordinates together, and then worry about z later on. Your plan of attack for x and y can be broken down into two steps:
Step 1: Given a point (x, y, z) within the view frustum, project it onto the near plane z = n. Because the projected point is on the near plane, its x-coordinate will be on the range [l, r], and its y-coordinate will be on the range [b, t].
Step 2: Using the formulae you derived in your study of orthographic projection, map the new x-coordinate from [l, r] to [–1, 1], and the new y-coordinate from [b, t] to [–1, 1].
Sound good? Then, have a look at Figure 5.
Figure 5: Projection of a point onto z = n using similar triangles.
In this diagram, you've drawn a line from a point (x, y, z) to the origin, and noted the point at which the line intersects the plane z = n (it's the one marked in black). From those two points, you drop perpendiculars to the z-axis, and suddenly you have a pair of similar triangles. If you've suppressed your memories of high school geometry, similar triangles are triangles that have the same shape but are not necessarily the same size. To show that two triangles are similar, it is sufficient to show that their corresponding angles are equal, and it's not hard to do that in this case. Angle 1 is shared by both triangles, and obviously it's equal to itself. Angles 2 and 3 are corresponding angles made by a transversal intersecting two parallel lines, so they are equal. And, right angles are of course equal to each other, so your two triangles are similar.
The property of similar triangles that you're interested in is that their pairs of corresponding sides all exist in the same proportion. You know the lengths of the sides that lie along the z-axis; they're n and z. That means that the other pairs of sides also exist in the ratio n / z. So, consider what you know. By the Pythagorean theorem, the perpendicular from (x, y, z) down to the z-axis has the following length:
$$L = \sqrt{x^2 + y^2}$$
If you knew the length of the perpendicular from your projected point to the z-axis, you could figure out the x- and y-coordinates of that point. But, that's easy! Because you have similar triangles, the length is simply L multiplied by n / z:
$$L' = L \cdot \frac{n}{z} = \sqrt{x^2 + y^2} \cdot \frac{n}{z}$$
Thus, your new x-coordinate is x * n / z, and your new y-coordinate is y * n / z. Thus concludes Step 1. Step 2 simply called for you to perform the same mapping you did in the last section, and so it's time to revisit the formulae you derived in your study of orthographic projection. Recall that you mapped your x- and y-coordinates into the canonical view volume like so:
$$x' = \frac{2x}{r - l} - \frac{r + l}{r - l}$$
$$y' = \frac{2y}{t - b} - \frac{t + b}{t - b}$$
You now can invoke these same formulae again, except you need to take your projection into account; so, you replace x with x * n / z, and y with y * n / z:
$$x' = \frac{2nx}{(r - l)z} - \frac{r + l}{r - l}$$
$$y' = \frac{2ny}{(t - b)z} - \frac{t + b}{t - b}$$
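Now, you multiply through by z:
$$x'z = \frac{2n}{r - l}x - \frac{r + l}{r - l}z$$
$$y'z = \frac{2n}{t - b}y - \frac{t + b}{t - b}z$$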
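The two steps for x can be sketched in C++ as follows, alongside the multiplied-through form. This is only an illustration; the function names are hypothetical, and y works the same way with b and t in place of l and r.

```cpp
#include <cassert>

// Step 1: project a coordinate onto the near plane z = n (similar triangles).
double projectToNearPlane(double coord, double z, double n) {
    assert(z > 0.0);  // the point must be in front of the camera
    return coord * n / z;
}

// Step 2: the orthographic mapping from [lo, hi] to [-1, 1].
double mapToSignedUnit(double v, double lo, double hi) {
    assert(hi > lo);
    return 2.0 * v / (hi - lo) - (hi + lo) / (hi - lo);
}

// x' obtained by chaining the two steps.
double perspectiveX(double x, double z, double l, double r, double n) {
    return mapToSignedUnit(projectToNearPlane(x, z, n), l, r);
}

// x'z in the multiplied-through form: 2n/(r - l) * x - (r + l)/(r - l) * z.
double perspectiveXTimesZ(double x, double z, double l, double r, double n) {
    assert(r > l);
    return (2.0 * n / (r - l)) * x - ((r + l) / (r - l)) * z;
}
```

For any point in front of the camera, perspectiveX(x, z, l, r, n) * z should agree with perspectiveXTimesZ(x, z, l, r, n) up to rounding. Note that the second function contains no division by z, which is what will let it fit into a matrix.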
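These results are a bit odd. To write these equations directly into a matrix, you need them to be written in this form, where each c is a constant:
$$x' = c_1 x + c_2 y + c_3 z + c_4$$
$$y' = c_5 x + c_6 y + c_7 z + c_8$$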
But clearly, that's not going to happen right now, so it looks like you're at a bit of an impasse here. What to do? Well, if you can find a way to get a formula for z'z like you have for x'z and y'z, you can write a matrix transform that maps (x, y, z) to (x'z, y'z, z'z). Then, you'll just divide the components of that point by z, and you'll end up with (x', y', z'), which is what you wanted.
Because you know that the transformation of z into z' does not depend on x or y in any way, you know that you want a formula of the form z'z = pz + q, where p and q are constants. And, you can find those constants pretty easily because you know how to get z' in two special cases: Because you're mapping [n, f] to [0, 1], you know that z' = 0 when z = n, and z' = 1 when z = f. When you plug the first set of values into z'z = pz + q, you're able to solve for q in terms of p:
$$0 \cdot n = pn + q \quad\Longrightarrow\quad q = -pn$$
Now, you plug in the second set of values, and get:
$$1 \cdot f = pf + q$$
Substitute your value for q into that equation, and you can easily solve for p:
$$f = pf - pn = p(f - n) \quad\Longrightarrow\quad p = \frac{f}{f - n}$$
Now that you have a value for p, and you found earlier that q = –pn, you can solve for q:
$$q = -pn = -\frac{fn}{f - n}$$
Finally, if you substitute these expressions for p and q back into your original formula, you get:
$$z'z = \frac{f}{f - n}z - \frac{fn}{f - n}$$
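As a quick sanity check on p and q, here is a small C++ sketch of the resulting depth mapping. The function name perspectiveDepth is hypothetical, and the sketch assumes n > 0 and a point in front of the camera.

```cpp
#include <cassert>

// z' as a function of eye-space z, from z'z = p*z + q
// with p = f/(f - n) and q = -f*n/(f - n).
double perspectiveDepth(double z, double n, double f) {
    assert(n > 0.0 && f > n && z > 0.0);
    const double p = f / (f - n);
    const double q = -f * n / (f - n);
    return (p * z + q) / z;  // divide by z to recover z' itself
}
```

With n = 1 and f = 5, this returns 0 at z = 1 and 1 at z = 5, as required. It's worth noticing that the mapping is not linear in z: the midpoint z = 3 maps to 5/6 (about 0.83) rather than 0.5.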
You're nearly finished now, but the unusual nature of your approach to this problem requires that you do something with the homogeneous coordinate w. Normally, you're simply content to set w' = 1—you've probably noticed that the bottom row in a basic transform is almost always [0 0 0 1]—but now you're writing a transform to the point (x'z, y'z, z'z, w'z), and so instead of writing w' = 1, you write w'z = z. Thus, the final set of equations you'll use for perspective projection are:
$$x'z = \frac{2n}{r - l}x - \frac{r + l}{r - l}z$$
$$y'z = \frac{2n}{t - b}y - \frac{t + b}{t - b}z$$
$$z'z = \frac{f}{f - n}z - \frac{fn}{f - n}$$
$$w'z = z$$
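And, when you write this set of equations in matrix form, you get:
$$
\begin{bmatrix}
\frac{2n}{r - l} & 0 & -\frac{r + l}{r - l} & 0 \\
0 & \frac{2n}{t - b} & -\frac{t + b}{t - b} & 0 \\
0 & 0 & \frac{f}{f - n} & -\frac{fn}{f - n} \\
0 & 0 & 1 & 0
\end{bmatrix}
$$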
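Here is a minimal C++ sketch of this matrix in use, assuming column vectors. It includes the division by the fourth component that turns (x'z, y'z, z'z, z) into (x', y', z', 1). The names perspectiveOffCenter, transform, and project are hypothetical and are not the D3DX API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;  // indexed as m[row][col]

// Builds the off-center perspective matrix from the equations above.
Mat4 perspectiveOffCenter(double l, double r, double b, double t, double n, double f) {
    assert(r > l && t > b && n > 0.0 && f > n);
    Mat4 m{};  // all entries start at zero
    m[0][0] = 2.0 * n / (r - l);  m[0][2] = -(r + l) / (r - l);
    m[1][1] = 2.0 * n / (t - b);  m[1][2] = -(t + b) / (t - b);
    m[2][2] = f / (f - n);        m[2][3] = -f * n / (f - n);
    m[3][2] = 1.0;                // copies z into the w component
    return m;
}

// Computes m * v, treating v as a column vector.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            out[row] += m[row][col] * v[col];
    return out;
}

// Applies the matrix to (x, y, z, 1), then divides through by w.
Vec4 project(const Mat4& m, double x, double y, double z) {
    const Vec4 clip = transform(m, Vec4{x, y, z, 1.0});
    assert(clip[3] != 0.0);
    return Vec4{clip[0] / clip[3], clip[1] / clip[3], clip[2] / clip[3], 1.0};
}
```

With l = –1, r = 3, b = 0, t = 2, n = 1, and f = 5, the near-plane corner (–1, 0, 1) projects to (–1, –1, 0, 1), and the far-plane corner (15, 10, 5) projects to (1, 1, 1, 1).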
When you apply this to the point (x, y, z, 1), it yields (x'z, y'z, z'z, z). But then, you apply the usual step of dividing through by the homogeneous coordinate, and so you end up with (x', y', z', 1). And that's perspective projection. Direct3D implements the above formula in the function D3DXMatrixPerspectiveOffCenterLH(). As with orthographic projection, if you assume that the view frustum is symmetrical and centered on the z-axis (meaning that r = –l and t = –b), you can simplify things considerably by writing the matrix in terms of the view frustum's width w and its height h:
$$
\begin{bmatrix}
\frac{2n}{w} & 0 & 0 & 0 \\
0 & \frac{2n}{h} & 0 & 0 \\
0 & 0 & \frac{f}{f - n} & -\frac{fn}{f - n} \\
0 & 0 & 1 & 0
\end{bmatrix}
$$
Direct3D has a function for this matrix as well, called D3DXMatrixPerspectiveLH().
Finally, there's one more representation for perspective projection that often comes in handy. In this form, rather than worrying strictly about the dimensions of the view frustum, you define it based on the camera's field of view. See Figure 6 for an illustration of this concept.
Figure 6: The view frustum's height defined in terms of the vertical field of view angle a.
The vertical field of view angle is a. This angle is bisected by the z-axis, so with a bit of basic trigonometry, you can write the following equation that relates a to the near plane n and the height of the screen h:
$$\tan\frac{a}{2} = \frac{h/2}{n} \quad\Longrightarrow\quad \frac{2n}{h} = \cot\frac{a}{2}$$
This expression allows you to replace the height in your projection matrix. Furthermore, you replace the width with the aspect ratio r, defined as the ratio of the width of the display area to its height. So, you have:
$$r = \frac{w}{h} \quad\Longrightarrow\quad \frac{2n}{w} = \frac{2n}{rh} = \frac{1}{r}\cot\frac{a}{2}$$
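Thus, you have a perspective projection matrix in terms of the vertical field of view angle a and the aspect ratio r:
$$
\begin{bmatrix}
\frac{1}{r}\cot\frac{a}{2} & 0 & 0 & 0 \\
0 & \cot\frac{a}{2} & 0 & 0 \\
0 & 0 & \frac{f}{f - n} & -\frac{fn}{f - n} \\
0 & 0 & 1 & 0
\end{bmatrix}
$$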
In Direct3D, you can get a matrix of this form by calling D3DXMatrixPerspectiveFovLH(). This form is particularly useful because you can just set r to the aspect ratio of the window you're rendering in, and a field of view angle of π/4 is usually fine. So, the only things you really need to worry about defining are the extents of the view frustum along the z-axis.
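For readers who want to build this form by hand, here is a minimal C++ sketch. It assumes the column-vector layout and an angle in radians. The name perspectiveFov and its parameter names are hypothetical; it is not a reimplementation of D3DXMatrixPerspectiveFovLH().

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;  // indexed as m[row][col]

// fovY is the vertical field of view angle a (radians);
// aspect is the aspect ratio r (width divided by height).
Mat4 perspectiveFov(double fovY, double aspect, double n, double f) {
    assert(fovY > 0.0 && aspect > 0.0 && n > 0.0 && f > n);
    const double cotHalf = 1.0 / std::tan(fovY / 2.0);  // cot(a/2) = 2n/h
    Mat4 m{};  // all entries start at zero
    m[0][0] = cotHalf / aspect;  // 2n/w
    m[1][1] = cotHalf;           // 2n/h
    m[2][2] = f / (f - n);
    m[2][3] = -f * n / (f - n);
    m[3][2] = 1.0;               // copies z into the w component
    return m;
}
```

A typical call might look like perspectiveFov(3.14159265358979 / 4.0, 16.0 / 9.0, 0.1, 100.0), where the near and far distances are placeholder values you would choose to suit your scene.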
That's about all you need to know about the mathematics behind projection transforms. There are some other, lesser-used projection methods out there, and of course things are slightly different if you use a right-handed coordinate system or a different canonical view volume, but you should be able to figure out those formulae easily by using the results in this article as a base. If you want more information on projection and other transforms, take a look at Real-Time Rendering by Tomas Möller and Eric Haines, or Computer Graphics: Principles and Practice by James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. These are two excellent books on computer graphics that I referred to in writing this article.
If you have any questions about this article, or spot any corrections that need to be made, you can contact me via PM on the CodeGuru forums, where I go by the name Smasher/Devourer. Happy coding!