The reason for this is to handle rotation, scaling and translation in a common way. While you can handle rotation and scaling using 3x3 matrices and vectors with 3 components, translation cannot be processed - unfortunately this is normally not further explained although it is quite easy to see:

Suppose we have a point p with coordinate (x, y, z) and we want to translate it by a distance defined by the translation vector t=(tx, ty, tz).

Of course, what you need to do to get new position p' of point p after translation is to add t to p:

p' = p + t, which means

p'.x = p.x + tx

p'.y = p.y + ty

p'.z = p.z + tz

If you try to create a 3x3 matrix M to perform this operation by p' = M * p, you get into trouble. What we require is:

x' x + tx |a b c| |x|

y' = y + ty = |d e f| * |y|

z' z + tz |g h i| |z|

z' z + tz |g h i| |z|

x + tx = ax + by + cz (1)

y + ty = dx + ey + fz

z + tz = gx + hy + iz

Looking at equation (1) we directly see that a = 1 and we are left with tx = by + cz. But the translation in x-direction must be indepent of y and z coordinates, so b and c must be zero.

x + tx = 1*x + 0*y + c*z = x => tx = 0

So only the primitive translation by zero is possible. The same applies for the equations for y and z. Using homogeneous coordinates, that is to add a forth component w = 1 to each point, we can derive a 4x4 matrix for translation without any problems.

x + tx |a b c d| |x|

y + ty = |e f g h| * |y|

z + tz |i j k l| |z|

z + tz |i j k l| |z|

1 |m n o p| |1|

we get e.g. for x-coordinate:

x + tx = ax + by + cz + d,

so a = 1 and d = tx which leads us to the final translation matrix:

|1 0 0 tx|

M = |0 1 0 ty|

|0 0 1 tz|

|0 0 0 1 |