The reason for this is to handle rotation, scaling and translation in a common way. While you can handle rotation and scaling using 3x3 matrices and vectors with 3 components, translation cannot be processed - unfortunately this is normally not further explained although it is quite easy to see:
Suppose we have a point p with coordinate (x, y, z) and we want to translate it by a distance defined by the translation vector t=(tx, ty, tz).
Of course, what you need to do to get new position p' of point p after translation is to add t to p:
p' = p + t, which means
p'.x = p.x + tx
p'.y = p.y + ty
p'.z = p.z + tz
If you try to create a 3x3 matrix M to perform this operation by p' = M * p, you get into trouble. What we require is:
x' x + tx |a b c| |x|
y' = y + ty = |d e f| * |y|
z' z + tz |g h i| |z|
z' z + tz |g h i| |z|
x + tx = ax + by + cz (1)
y + ty = dx + ey + fz
z + tz = gx + hy + iz
Looking at equation (1) we directly see that a = 1 and we are left with tx = by + cz. But the translation in x-direction must be indepent of y and z coordinates, so b and c must be zero.
x + tx = 1*x + 0*y + c*z = x => tx = 0
So only the primitive translation by zero is possible. The same applies for the equations for y and z. Using homogeneous coordinates, that is to add a forth component w = 1 to each point, we can derive a 4x4 matrix for translation without any problems.
x + tx |a b c d| |x|
y + ty = |e f g h| * |y|
z + tz |i j k l| |z|
z + tz |i j k l| |z|
1 |m n o p| |1|
we get e.g. for x-coordinate:
x + tx = ax + by + cz + d,
so a = 1 and d = tx which leads us to the final translation matrix:
|1 0 0 tx|
M = |0 1 0 ty|
|0 0 1 tz|
|0 0 0 1 |