Monday, March 22, 2010

Derivation of the perspective matrix, Part 2

In part 2 of Derivation of a Perspective Matrix we look at the actual Matrix part.

In part 1 we learned how to map points inside our viewing frustum to points on our screen. From here we'd like to see how this becomes a perspective matrix.

To move on I'd like to introduce the concept of the canonical view volume. The canonical view volume is the view volume (the visible area in front of the virtual camera) scaled to fit inside a volume where all x and y values lie in [-1, 1], and the z values lie in [0, 1]. Once this scaling has been applied to points within the camera view volume, it becomes trivial to test whether a point lies within the camera view.
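To make that "trivial test" concrete, here is a minimal sketch in Python. The function name is made up for illustration, and it assumes the point has already been scaled into the canonical view volume.

```python
def in_canonical_view_volume(x, y, z):
    """Return True if the point lies inside x, y in [-1, 1] and z in [0, 1].

    Hypothetical helper: assumes (x, y, z) is already in canonical
    view volume coordinates.
    """
    return -1.0 <= x <= 1.0 and -1.0 <= y <= 1.0 and 0.0 <= z <= 1.0


print(in_canonical_view_volume(0.5, -0.25, 0.9))  # inside on every axis
print(in_canonical_view_volume(1.5, 0.0, 0.5))    # x is outside [-1, 1]
```

Three range comparisons are all it takes, which is the whole appeal of scaling into this volume first.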

The reason we do a mapping from view volume to canonical view volume rather than a straight map to a plane is that we'd ideally like to be able to keep the Z value to be able to test for depth of a point within a scene. We can easily compare a point in canonical view volume space to another point in canonical view volume space to determine if one is potentially part of geometry that obscures other geometry. In modern computer graphics the process happens in the graphics driver, or indeed the graphics hardware, but it does explain the reasoning for the representation.

The view volume for a camera which demonstrates perspective looks like a pyramid with the top chopped off. You can see this shape easily in the original figure depicting the first part of the derivation. The canonical view volume is that same volume after we've scaled all the values so that it is bounded in x and y by [-1, 1] and in the z direction by [0, 1]. This space is called clipping space.

At the end of Part 1 we derived the formulas to map eye space X to screen space X and eye space Y to screen space Y. In clipping space we also keep the Z value.

To go on we need to be familiar with the concept of homogeneous co-ordinates. Homogeneous just means all of the same type/all-together/all the same. All the same of what? You might well ask. Let's start with the basics and refresh our memory about Euclidean space. Euclidean space is the maths we are familiar with when dealing with points, vectors and lines. For computer graphics we normally deal with the 2 dimensional Cartesian plane, or the 3 dimensional “real coordinate space”. So Euclidean space co-ordinates are basically the mundane 2D and 3D co-ordinate systems we should all be familiar with by now. In 2D, we generally define the space using linearly independent axes denoted by x and y, and in 3D linearly independent axes x, y and z.

Homogeneous coordinates refer to points in what is known as projective space. The mathematics of projective space is such that points and lines in projective space have corresponding points and lines in Euclidean space. So the two spaces, Euclidean and projective, are connected by a relationship. Thus points in projective space and Euclidean space can be converted from one to the other easily, and each point in one space has its equivalent in the other. The word homogeneous in this case refers to that equivalence.

One particularly nice aspect of working in projective space is that, if we are dealing with transformations using matrix mathematics, we can create a single 4x4 matrix that in practice is equivalent to a 3 dimensional Euclidean space rotation matrix applied to a point, followed by a translation applied to the same point.

The other nicety of projective space for those working in computer graphics is that it's ideally suited to working with projections! Exactly what we're working on deriving here.

Projective space has an additional coordinate. So a 2 dimensional Euclidean point is represented by a 3 dimensional projective point, and a 3 dimensional Euclidean point has 4 dimensions in projective space. As we live in 3 dimensional meat space, the 4 dimensional part is impossible to visualize. It's probably better not to try. Suffice it to say the extra dimension just provides an additional reference for identifying a point.

There are infinitely many projective space points that map to each point in the paired Euclidean space, but the most basic and obvious representation of a Euclidean point in projective space is the one where the projective coordinate (the additional co-ordinate) is 1. A point (x, y) in 2D Euclidean space becomes (x, y, 1) in projective space, and a point (x, y, z) becomes the point (x, y, z, 1) in projective space. The projective coordinate is typically represented by the letter w. When w = 1, the Euclidean coordinate is plain to see.
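With w = 1 in place, the rotation-plus-translation trick mentioned earlier can be sketched in a few lines of Python. This is only an illustration: the helper names are invented, the rotation is about the z axis to keep it short, and the column-vector convention (matrix times point) is an assumption.

```python
import math


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(row[c] * v[c] for c in range(len(v))) for row in m]


def rotate_z_then_translate(angle, tx, ty, tz):
    """Hypothetical helper: one 4x4 matrix that rotates about the z axis
    by `angle` radians and then translates by (tx, ty, tz)."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c,   -s,  0.0, tx],   # upper-left 3x3 block is the rotation,
        [s,    c,  0.0, ty],   # the last column is the translation
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],  # keeps w = 1
    ]


m = rotate_z_then_translate(math.pi / 2, 10.0, 0.0, 0.0)
point = [1.0, 0.0, 0.0, 1.0]  # the Euclidean point (1, 0, 0) with w = 1
print([round(v, 6) for v in mat_vec(m, point)])
```

The point (1, 0, 0) is turned a quarter of the way round the z axis to (0, 1, 0) and then shifted 10 units along x, all with a single matrix multiply. Without the extra w coordinate the translation could not be folded into the matrix.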

As w is the projective coordinate, the general rule for converting from homogeneous coordinates to Euclidean coordinates is to divide the other coordinates by the projective coordinate. (x, y, z, w) in projective space is (x/w, y/w, z/w) in Euclidean space. Knowing this, it is possible to see that the projective space points (4, 2, 2, 1) and (8, 4, 4, 2) are the same point (4, 2, 2) in Euclidean space.
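The divide-by-w rule is short enough to check directly. A minimal Python sketch (the function name is just for illustration) using the two points from the paragraph above:

```python
def to_euclidean(p):
    """Convert a homogeneous point (..., w) to Euclidean by dividing by w."""
    *coords, w = p
    if w == 0:
        # w = 0 has no Euclidean equivalent, so refuse rather than divide
        raise ValueError("w = 0 does not correspond to a Euclidean point")
    return tuple(c / w for c in coords)


print(to_euclidean((4, 2, 2, 1)))  # (4.0, 2.0, 2.0)
print(to_euclidean((8, 4, 4, 2)))  # (4.0, 2.0, 2.0)
```

Both homogeneous points land on the same Euclidean point, as expected.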

Let's put down our equations for a conversion to clipping space from 3D eye space. For x and y they're pretty much the same as the formulas for screen space.





We don't really have anything for the z part yet. We do know that we want to remove the dependence on z from the right hand side of our equations, to create linear equations we can place into a matrix, so we'll multiply the equations through by z to leave simple linear equations on the right hand side. We arrive at





This might not look useful just yet, but bear with me. We've got these two formulas mapping into some odd space that's scaled by a factor of z. Now we'd like to hang on to the z co-ordinate. So we posit a point represented by

(Xclip*z, Yclip*z, Zclip*z)

So we're trying to find a matrix which maps the point (x, y, z) to (Xclip*z, Yclip*z, Zclip*z). Since each term of (Xclip*z, Yclip*z, Zclip*z) carries a factor of z, we can divide everything by z and end up with (Xclip, Yclip, Zclip), which is exactly what we want to find. So what we need first is the matrix which maps between the two spaces. We've got the Xclip*z and Yclip*z components and are currently looking for the Zclip*z component. We know the formula will not be in any way dependent on x or y, as the z axis is orthogonal to the plane of projection. Thus the most complicated it will be is a scalar multiplication of z with the possible addition of a constant. So we're looking at something like

Zclip*z = p*z + q

where p and q are constants.

We have a chance of working out what these constants will be because we know that our camera frustum is bounded by the near plane and the far plane. We'd also like our Zclip result to be scaled between 0 and 1. This is a nice way to do it, and it's the depth range Direct3D uses (OpenGL's default clip volume runs z from -1 to 1 instead). So we've got some basic facts to work with.

So we say that Zclip = 0 when z = D (near plane) and that Zclip = 1 when z = F (far plane).

We've got Zclip = 0 when z = D, so Zclip*z is zero and we can solve for q.

0 = p*D + q

q = -p*D

So let's do the same thing for when z = F, where Zclip = 1 and so Zclip*z = F

F = p*F + q

We know what q is from the earlier step

F = p*F - p*D

F = p*(F - D)

p = F / (F - D)

Now we have values for the constants p and q that actually mean something (q = -p*D = -F*D / (F - D)). Returning to our original equation and substituting yields

Zclip*z = (F / (F - D))*z - (F*D) / (F - D)

So we're left with the equations







Now if we move this calculation into projective space using homogeneous coordinates we can say we're writing a transform from (x, y, z, 1) to (Xclip*z, Yclip*z, Zclip*z, Ws*z). Normally we'd write Ws = 1 for the simplest equivalence in projective space. OK, so if Ws equals 1 then we can see that

Ws*z = z

Now we've got four equations we can put into a matrix, yielding



So this matrix maps a Euclidean point represented as the homogeneous coordinate (x, y, z, 1) to the homogeneous coordinate (Xclip*z, Yclip*z, Zclip*z, z). Dividing through by z yields (Xclip, Yclip, Zclip, 1), which is the Euclidean point (Xclip, Yclip, Zclip): our desired clipping space coordinate.
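To see the matrix and the divide working together, here is a rough Python sketch. The z row uses p = F / (F - D) and q = -p*D from the derivation above. The x and y scale factors come from Part 1, which isn't repeated here, so they are left as the hypothetical parameters sx and sy; the function names and the column-vector convention are assumptions too.

```python
def perspective_matrix(sx, sy, near, far):
    """Build a 4x4 matrix mapping (x, y, z, 1) to
    (Xclip*z, Yclip*z, Zclip*z, z).

    sx, sy: placeholder x/y scale factors (assumed, standing in for Part 1).
    near, far: D and F in the text.
    """
    p = far / (far - near)
    q = -p * near
    return [
        [sx,  0.0, 0.0, 0.0],
        [0.0, sy,  0.0, 0.0],
        [0.0, 0.0, p,   q],    # Zclip*z = p*z + q
        [0.0, 0.0, 1.0, 0.0],  # fourth component is just z
    ]


def project(m, x, y, z):
    """Apply the matrix to (x, y, z, 1), then divide through by w."""
    v = [x, y, z, 1.0]
    xc, yc, zc, w = [sum(row[i] * v[i] for i in range(4)) for row in m]
    return (xc / w, yc / w, zc / w)


m = perspective_matrix(1.0, 1.0, near=1.0, far=5.0)
print(project(m, 0.0, 0.0, 1.0))  # a point on the near plane
print(project(m, 0.0, 0.0, 5.0))  # a point on the far plane
print(project(m, 2.0, 1.0, 2.0))  # a point in between
```

With D = 1 and F = 5, the near plane point comes out with Zclip = 0 and the far plane point with Zclip = 1, matching the two conditions used to solve for p and q. The in-between point at z = 2 gets Zclip = 0.625 rather than 0.25, which shows that depth is not spread linearly between the two planes.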

This current set of equations assumes a completely square view screen. To take different aspect ratios into account we add a term for the aspect ratio, defined as the viewport width divided by its height. That is, the width of the near plane (which we assume is our projection screen) divided by the height of the near plane.



We introduce this term into the equation for the Xclip coordinate, and follow through to yield the matrix



And here we have one common form of the projection matrix.