In this post, we will explain image formation from a geometrical point of view.

Specifically, we will cover the math behind how a point in 3D gets projected onto the image plane.

This post is written with beginners in mind but it is mathematical in nature. That said, all you need to know is matrix multiplication.

The Setup

To understand the problem easily, let’s say you have a camera deployed in a room.

Given a 3D point P in this room, we want to find the pixel coordinates (u, v) of this 3D point in the image taken by the camera.

There are three coordinate systems in play in this setup. Let’s go over them.

1. World Coordinate System

Figure 1: The World Coordinate System and the Camera Coordinate System are related by a rotation and a translation. These six parameters (3 for rotation and 3 for translation) are called the extrinsic parameters of a camera.

To define the locations of points in the room, we first need to define a coordinate system for this room. It requires two things:

Origin: We can arbitrarily fix a corner of the room as the origin.

X, Y, Z axes: We can define the X and Y axes of the room along the two dimensions of the floor, and the Z axis along the vertical wall.

Using the above, we can find the 3D coordinates of any point in this room by measuring its distance from the origin along the X, Y, and Z axes.

This coordinate system attached to the room is referred to as the World Coordinate System. In Figure 1, it is shown using orange colored axes. We will use bold font (e.g. $\mathbf{X_w}$) to show an axis, and regular font to show a coordinate of a point (e.g. $X_w$).

Let us consider a point P in this room. In the world coordinate system, the coordinates of P are given by $(X_w, Y_w, Z_w)$. You can find the $X_w$, $Y_w$, and $Z_w$ coordinates of this point by simply measuring its distance from the origin along the three axes.

2. Camera Coordinate System

Now, let’s put a camera in this room.

The image of the room will be captured using this camera, and therefore, we are interested in a 3D coordinate system attached to this camera.

If we had put the camera at the origin of the room and aligned it such that its X, Y, and Z axes coincided with the $\mathbf{X_w}$, $\mathbf{Y_w}$, and $\mathbf{Z_w}$ axes of the room, the two coordinate systems would be the same.

However, that is an absurd restriction. We would want to put the camera anywhere in the room and it should be able to look anywhere. In such a case, we need to find the relationship between the 3D room (i.e. world) coordinates and the 3D camera coordinates.

Let’s say our camera is located at some arbitrary location in the room. In technical jargon, we say the camera coordinate system is translated by a vector $\mathbf{t}$ with respect to the world coordinate system.

The camera may also be looking in some arbitrary direction. In other words, we can say the camera is rotated with respect to the world coordinate system.

Rotation in 3D is captured using three parameters: you can think of them as yaw, pitch, and roll. You can also think of a rotation as an axis in 3D (two parameters) and an angle of rotation about that axis (one parameter).

However, it is often convenient for mathematical manipulation to encode rotation as a 3×3 matrix. Now, you may be thinking that a 3×3 matrix has 9 elements and therefore 9 parameters, but rotation has only 3 parameters. That’s true, and that is exactly why not every 3×3 matrix is a rotation matrix. Without going into the details, let us for now just note that a rotation matrix has only three degrees of freedom even though it has 9 elements.
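To make the "9 elements but only 3 degrees of freedom" idea concrete, here is a minimal sketch that builds a rotation matrix from yaw, pitch, and roll and checks the two constraints every rotation matrix satisfies: its transpose times itself is the identity, and its determinant is 1. The helper names (`matmul`, `rotation_from_ypr`, and so on) are made up for this illustration, and the Z-Y-X order of composition is an assumption; other conventions are equally common.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def rotation_from_ypr(yaw, pitch, roll):
    """Assumed convention: R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    Rz = [[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]]
    Ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    Rx = [[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]]
    return matmul(Rz, matmul(Ry, Rx))

# Three numbers in, nine matrix elements out.
R = rotation_from_ypr(math.radians(30), math.radians(-10), math.radians(5))

# The nine elements are not free: R^T R must be the identity and det(R) must be 1.
RtR = matmul(transpose(R), R)
is_orthogonal = all(
    abs(RtR[i][j] - (1.0 if i == j else 0.0)) < 1e-9
    for i in range(3) for j in range(3)
)
has_unit_det = abs(det3(R) - 1.0) < 1e-9
```

These constraints are what reduce the 9 matrix elements to 3 free parameters. In practice you would likely use a numerical library for the matrix products; plain lists are used here only to keep the sketch dependency-free.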

Back to our original problem. The world coordinates and the camera coordinates are related by a 3×3 rotation matrix $\mathbf{R}$ and a 3-element translation vector $\mathbf{t}$.

What does that mean?

It means that point P, which had coordinate values $(X_w, Y_w, Z_w)$ in the world coordinate system, will have different coordinate values $(X_c, Y_c, Z_c)$ in the camera coordinate system. The camera coordinate system is shown in red.

The two coordinate values are related by the following equation.

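$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \mathbf{R} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + \mathbf{t} \qquad (1)$$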
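As a quick sanity check of equation (1), the sketch below rotates a world point and then adds the translation. The function name `world_to_camera` and the numbers are made up for illustration: the rotation is a 90 degree turn about the Z axis, chosen so that every value stays an integer.

```python
def world_to_camera(R, t, P_w):
    """Apply equation (1): rotate the world point by R, then add t."""
    return [sum(r * x for r, x in zip(row, P_w)) + t_i for row, t_i in zip(R, t)]

# Hypothetical example values, not taken from any real camera.
R = [[0, -1, 0],
     [1,  0, 0],
     [0,  0, 1]]          # 90 degree rotation about the Z axis
t = [1, 2, 3]             # translation vector
P_w = [1, 0, 0]           # point P in world coordinates

P_c = world_to_camera(R, t, P_w)
print(P_c)  # [1, 3, 3]
```

Each camera coordinate is one row of the rotation matrix multiplied with the world point, plus the matching element of the translation vector.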

Notice that representing rotation as a matrix allowed us to do rotation with a simple matrix multiplication instead of tedious symbol manipulation required in other representations like yaw, pitch, roll. I hope this helps you appreciate why we represent rotations as a matrix.

Sometimes the expression above is written in a more compact form. The 3×1 translation vector is appended as a column at the end of the 3×3 rotation matrix to obtain a 3×4 matrix called the Extrinsic Matrix.

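$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \left[\, \mathbf{R} \mid \mathbf{t} \,\right] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (2)$$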
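The compact form works because we append a 1 to the world point, so the last column of the 3×4 matrix (the translation) gets added automatically during the multiplication. Here is a small sketch of that idea; the helper names `extrinsic_matrix` and `apply_extrinsic` and the example values are hypothetical.

```python
def extrinsic_matrix(R, t):
    """Append t as a fourth column to the 3x3 rotation R, giving a 3x4 matrix."""
    return [list(row) + [t_i] for row, t_i in zip(R, t)]

def apply_extrinsic(E, P_w):
    """Multiply the 3x4 extrinsic matrix with the point written as [X, Y, Z, 1]."""
    P_h = list(P_w) + [1]  # append 1 to the world point
    return [sum(e * p for e, p in zip(row, P_h)) for row in E]

# Hypothetical example values: 90 degree rotation about Z plus a translation.
R = [[0, -1, 0],
     [1,  0, 0],
     [0,  0, 1]]
t = [1, 2, 3]
P_w = [1, 0, 0]

E = extrinsic_matrix(R, t)
P_c = apply_extrinsic(E, P_w)
print(E)    # [[0, -1, 0, 1], [1, 0, 0, 2], [0, 0, 1, 3]]
print(P_c)  # [1, 3, 3]
```

A single matrix multiplication now gives the same camera coordinates as rotating the point and then adding the translation separately.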