Inverse Perspective Mapping Without OpenCV

Introduction

As autonomous driving technology matures, more people encounter it first-hand and become curious about how the software behind it actually works, down to the individual features built into these systems. This article explains and implements the algorithm behind one of those features: the "360° reverse camera."

Inverse Perspective Mapping

To capture the scene, the vehicle reads images from several cameras and stitches them together into a "360° panoramic photo."

Producing the top-down view of the "360° reverse camera" requires a mathematical operation called Inverse Perspective Mapping, or IPM for short.

There are several ways to perform IPM, such as the "corresponding point pair homography transformation" and the "simplified camera model inverse perspective transformation," but all of them are built on matrix transformations.

Corresponding Point Pair Homography Transformation

This method is relatively simple, so we only outline it.

The input is at least four corresponding point pairs, no three of which are collinear. No camera parameters or information about the plane's position are needed. The point pairs determine the perspective transformation matrix, a $3 \times 3$ matrix defined only up to scale, so it has eight unknowns. Each point pair contributes two linear equations, so four pairs give a linear system that can be solved directly. If there are more than four pairs, RANSAC can be used to obtain a robust estimate. The points are usually selected manually, often using vanishing points.

\begin{bmatrix} t_i x'_i \\ t_i y'_i \\ t_i \end{bmatrix} = map\_matrix \cdot \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}

dst(i) = (x'_i, y'_i), src(i) = (x_i, y_i), i = 0, 1, 2, 3

This transformation is relatively simple to implement in code and is an easy way to achieve IPM, so we do not go into further detail here.
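For readers who want to see the linear system concretely, here is a minimal sketch for the four-pair case. It is an illustration under stated assumptions, not the article's own implementation: the names `Pt`, `Mat3`, `homographyFrom4`, and `applyHomography` are hypothetical, and the last matrix entry is fixed to 1, which assumes that entry is nonzero for the configuration at hand.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical helper types for this sketch (not part of the article's code).
struct Pt { double x, y; };
using Mat3 = std::array<double, 9>;   // row-major 3x3; h[8] is fixed to 1

// Solve for the matrix that maps src[i] -> dst[i], i = 0..3.
// Returns false if the 8x8 system is numerically singular.
bool homographyFrom4(const Pt src[4], const Pt dst[4], Mat3& h)
{
    double a[8][9] = {};   // augmented matrix [A | b]
    for (int i = 0; i < 4; ++i) {
        const double x = src[i].x, y = src[i].y;
        const double u = dst[i].x, v = dst[i].y;
        // u * (h6 x + h7 y + 1) = h0 x + h1 y + h2
        const double r0[9] = { x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y, u };
        // v * (h6 x + h7 y + 1) = h3 x + h4 y + h5
        const double r1[9] = { 0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y, v };
        for (int j = 0; j < 9; ++j) {
            a[2 * i][j] = r0[j];
            a[2 * i + 1][j] = r1[j];
        }
    }

    // Gauss-Jordan elimination with partial pivoting.
    for (int col = 0; col < 8; ++col) {
        int piv = col;
        for (int r = col + 1; r < 8; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[piv][col])) piv = r;
        if (std::fabs(a[piv][col]) < 1e-12) return false;
        for (int j = 0; j < 9; ++j) std::swap(a[col][j], a[piv][j]);
        for (int r = 0; r < 8; ++r) {
            if (r == col) continue;
            const double f = a[r][col] / a[col][col];
            for (int j = col; j < 9; ++j) a[r][j] -= f * a[col][j];
        }
    }

    for (int i = 0; i < 8; ++i) h[i] = a[i][8] / a[i][i];
    h[8] = 1.0;
    return true;
}

// Apply the matrix to one point and divide by the scale factor t_i.
Pt applyHomography(const Mat3& h, Pt p)
{
    const double t = h[6] * p.x + h[7] * p.y + h[8];
    assert(std::fabs(t) > 1e-12);   // points mapped to infinity are not handled
    return { (h[0] * p.x + h[1] * p.y + h[2]) / t,
             (h[3] * p.x + h[4] * p.y + h[5]) / t };
}
```

A typical use would pass the four corners of a trapezoid on the road as `src` and the corners of a rectangle as `dst`, then call `applyHomography` for each pixel to be mapped. Handling more than four pairs, for example with RANSAC, is outside the scope of this sketch.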

Simplified Camera Model IPM Method

This is the method we focus on in this article. Its essence is to take the conversions between the coordinate systems involved in camera imaging, abstract and simplify them, and use them to recover the world coordinates.

The correspondence between world coordinates and image coordinates is then established and used to perform the transformation.

Rather than relying on long and complicated formulas, we again work with coordinate operations. This IPM method first requires measuring the camera's actual parameters.

Here, the elevation angle $\theta$ is $23^\circ$, the center height $H$ is $37\,\text{cm}$, the distance from the viewpoint to the view plane $d$ is $87\,\text{cm}$, and we need to find the world coordinate $\boldsymbol{P_W}$.

Let the camera image coordinate be $\boldsymbol{P_G} = (x, y, z, 1)$, and establish the matrix equation based on the relationship between world coordinates and image coordinates:

\boldsymbol{P_G} = \boldsymbol{P_W} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -H & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ \sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 & 0 & \frac{1}{d} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \tag{1}
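To make the row-vector convention of equation $(1)$ concrete, the following sketch builds the three matrices exactly as written and multiplies a world point through them from left to right. The type aliases and the function name `worldToImageHomogeneous` are hypothetical, chosen only for this illustration, and the result is returned in homogeneous form, before division by its last component.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Row vector times matrix, matching the P_W * M ordering of equation (1).
Vec4 mul(const Vec4& v, const Mat4& m)
{
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r[j] += v[i] * m[i][j];
    return r;
}

// Apply translation, rotation and perspective in the order of equation (1).
// theta is in radians; H and d share one length unit (cm in the article).
Vec4 worldToImageHomogeneous(const Vec4& pw, double H, double theta, double d)
{
    assert(d != 0.0);
    const double c = std::cos(theta);
    const double s = std::sin(theta);

    const Mat4 T = {{ {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, -H, 1} }};
    const Mat4 R = {{ {c, 0, -s, 0}, {0, 1, 0, 0}, {s, 0, c, 0}, {0, 0, 0, 1} }};
    const Mat4 P = {{ {1, 0, 0, 1.0 / d}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }};

    return mul(mul(mul(pw, T), R), P);
}
```

Dividing the first three components of the result by the fourth gives Cartesian coordinates, which is the usual last step when working with homogeneous vectors.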

Substituting the image coordinates into equation $(1)$ and solving for $\boldsymbol{P_W}$ gives the world coordinates as a row vector:

\boldsymbol{P_W} = \left[ \frac{x\cos\theta - z\sin\theta}{1 - \frac{x}{d}} \quad \frac{y}{1 - \frac{x}{d}} \quad \frac{x\sin\theta + z\cos\theta - \frac{H}{d}}{1 - \frac{x}{d}} \quad 1 \right]. \tag{2}

Let $A = H\cos\theta$, $B = -d$, $C = d\sin\theta - \frac{H}{d}$, $D = \cos\theta$, and $E = d\sin\theta$. From the geometric relationship, we know $x\sin\theta + z\cos\theta - \frac{H}{d} = 0$, that is, the third component of $(2)$ vanishes. The simplified form of $\boldsymbol{P_W}$ is therefore:

\boldsymbol{P_W} = \left[ \frac{A + Bz}{C + Dz} \quad \frac{Ey}{C + Dz} \quad 0 \quad 1 \right]. \tag{3}
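As a small worked example, equation $(3)$ can be evaluated directly once the constants $A$ through $E$ are computed from the measured parameters. The sketch below follows the formula as written above. The names `Eq3Constants`, `makeEq3Constants`, `Ground2D`, and `evalEq3` are hypothetical and do not appear in the listing later in this article, which uses a different parameterization.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the constants A..E of equation (3).
struct Eq3Constants { double A, B, C, D, E; };

// Build A..E from the measured parameters.
// theta is in radians; H and d are in the same length unit (cm in the article).
Eq3Constants makeEq3Constants(double theta, double H, double d)
{
    return { H * std::cos(theta),               // A
             -d,                                // B
             d * std::sin(theta) - H / d,       // C
             std::cos(theta),                   // D
             d * std::sin(theta) };             // E
}

// First two components of P_W in equation (3).
struct Ground2D { double X, Y; };

// Evaluate equation (3) for image coordinates (y, z).
Ground2D evalEq3(const Eq3Constants& k, double y, double z)
{
    const double den = k.C + k.D * z;
    assert(std::fabs(den) > 1e-12);   // the formula is undefined when C + Dz = 0
    return { (k.A + k.B * z) / den, k.E * y / den };
}
```

With the values measured above, the constants would be built once as `makeEq3Constants(23.0 * pi / 180.0, 37.0, 87.0)`, where `pi` stands for a constant the caller supplies, and `evalEq3` would then be called per image point.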

Finally, process the image. Since the image being processed is a two-dimensional plane, its depth is always 0. According to $(3)$, substituting the horizontal and vertical coordinates of each array element yields the corresponding world coordinates, and the result is the top-down view after IPM.

C++ Implementation

#include <cmath>
#include <cstdint>
#include <vector>
#include <algorithm>

namespace ipm
{
    // =========================
    // Basic Data Structures
    // =========================

    struct Vec3
    {
        double x;
        double y;
        double z;
    };

    struct GroundPoint
    {
        double X;     // World coordinate X (left-right)
        double Y;     // World coordinate Y (front-back)
        bool valid;   // Whether there is a valid intersection with the ground
    };

    struct CameraParam
    {
        // Focal length (in pixel units)
        // If you only have a single focal distance d, express it in pixels
        // (not cm) and set fx = fy = d
        double fx;
        double fy;

        // Principal point (usually the image center)
        double cx;
        double cy;

        // Camera height above ground, e.g., in cm
        double H;

        // Camera downward pitch angle (in radians)
        double pitch;
    };

    struct IPMParam
    {
        // Output bird's-eye view dimensions
        int outWidth;
        int outHeight;

        // World coordinate range (same unit as H, e.g., cm)
        // X: left-right range
        // Y: front-back range
        double minX;
        double maxX;
        double minY;
        double maxY;
    };

    // =========================
    // Utility Functions
    // =========================

    inline double clampDouble(double v, double lo, double hi)
    {
        return (v < lo) ? lo : ((v > hi) ? hi : v);
    }

    inline uint8_t clampToByte(double v)
    {
        if (v < 0.0) return 0;
        if (v > 255.0) return 255;
        return static_cast<uint8_t>(v + 0.5);
    }

    // Transform a direction from the camera coordinate system to the world coordinate system
    // Assumptions:
    // - World Z axis points upward
    // - With zero pitch, the camera optical axis points along world +Y
    // - pitch > 0 means the camera is looking downward
    //
    // To match the image coordinate (v downward), construct a commonly used mapping in engineering:
    //
    // Camera ray rc = [x, y, 1]
    // First map to the world direction "without pitch":
    //   x -> Xw
    //   y -> -Zw
    //   z -> Yw
    //
    // Then rotate around the world X axis by -pitch, which tilts the
    // forward direction (+Y) toward the ground (-Z) when pitch > 0
    //
    inline Vec3 cameraRayToWorldRay(const Vec3& rc, double pitch)
    {
        // World direction without pitch
        // Camera right -> World right
        // Camera down -> World negative up
        // Camera forward -> World forward
        const double X0 = rc.x;
        const double Y0 = rc.z;
        const double Z0 = -rc.y;

        const double c = std::cos(pitch);
        const double s = std::sin(pitch);

        // Rotate around X axis by -pitch (downward tilt)
        // Check: rc = (0, 0, 1) gives rw = (0, cos(pitch), -sin(pitch))
        Vec3 rw;
        rw.x = X0;
        rw.y = c * Y0 + s * Z0;
        rw.z = -s * Y0 + c * Z0;

        return rw;
    }

    // =========================
    // Pixel Point -> Ground World Coordinates
    // =========================
    //
    // Given a pixel point (u, v), calculate its corresponding world point (X, Y) on the ground Z=0
    //
    // Note:
    // 1. If this ray points upward or is parallel to the ground, it is invalid
    // 2. fx, fy are in pixel units
    // 3. The unit of H determines the unit of the output world coordinates
    //
    inline GroundPoint imagePixelToGround(
        double u,
        double v,
        const CameraParam& cam)
    {
        // 1) Pixel coordinates -> Camera normalized coordinates
        Vec3 rc;
        rc.x = (u - cam.cx) / cam.fx;
        rc.y = (v - cam.cy) / cam.fy;
        rc.z = 1.0;

        // 2) Camera ray -> World ray
        Vec3 rw = cameraRayToWorldRay(rc, cam.pitch);

        // 3) Camera center position in world coordinates
        // Cw = (0, 0, H)
        // Ray equation: P(t) = Cw + t * rw
        //
        // Intersection with ground Zw = 0:
        // H + t * rw.z = 0  =>  t = -H / rw.z
        //
        GroundPoint gp{};
        gp.valid = false;

        // Ray does not point to the ground, or is almost parallel to the ground
        if (std::abs(rw.z) < 1e-12)
            return gp;

        const double t = -cam.H / rw.z;

        // Only accept "forward" intersections
        if (t <= 0.0)
            return gp;

        gp.X = t * rw.x;
        gp.Y = t * rw.y;
        gp.valid = true;
        return gp;
    }

    // =========================
    // World Coordinates -> Bird's-Eye View Pixel
    // =========================
    //
    // Map the ground point (X, Y) to the output bird's-eye view pixel (bx, by)
    //
    // Output image convention:
    // - Left is minX, right is maxX
    // - Top is maxY (farther away)
    // - Bottom is minY (closer)
    //
    inline bool groundToBirdPixel(
        double X, double Y,
        const IPMParam& ipmParam,
        double& bx, double& by)
    {
        if (X < ipmParam.minX || X > ipmParam.maxX ||
            Y < ipmParam.minY || Y > ipmParam.maxY)
        {
            return false;
        }

        const double xRatio =
            (X - ipmParam.minX) / (ipmParam.maxX - ipmParam.minX);

        const double yRatio =
            (Y - ipmParam.minY) / (ipmParam.maxY - ipmParam.minY);

        // X from left to right
        bx = xRatio * (ipmParam.outWidth - 1);

        // Want "far away at the top of the image"
        by = (1.0 - yRatio) * (ipmParam.outHeight - 1);

        return true;
    }

    // =========================
    // Bilinear Sampling (Grayscale)
    // =========================
    inline uint8_t bilinearSampleGray(
        const uint8_t* src,
        int width,
        int height,
        int stride,
        double u,
        double v)
    {
        if (u < 0.0 || v < 0.0 || u > width - 1.0 || v > height - 1.0)
            return 0;

        const int x0 = static_cast<int>(std::floor(u));
        const int y0 = static_cast<int>(std::floor(v));
        const int x1 = std::min(x0 + 1, width - 1);
        const int y1 = std::min(y0 + 1, height - 1);

        const double dx = u - x0;
        const double dy = v - y0;

        const double p00 = src[y0 * stride + x0];
        const double p10 = src[y0 * stride + x1];
        const double p01 = src[y1 * stride + x0];
        const double p11 = src[y1 * stride + x1];

        const double v0 = p00 * (1.0 - dx) + p10 * dx;
        const double v1 = p01 * (1.0 - dx) + p11 * dx;
        const double val = v0 * (1.0 - dy) + v1 * dy;

        return clampToByte(val);
    }

    // =========================
    // Bilinear Sampling (RGB Three-Channel)
    // Each pixel is 3 bytes, RGBRGB...
    // =========================
    inline void bilinearSampleRGB(
        const uint8_t* src,
        int width,
        int height,
        int stride,
        double u,
        double v,
        uint8_t outRGB[3])
    {
        if (u < 0.0 || v < 0.0 || u > width - 1.0 || v > height - 1.0)
        {
            outRGB[0] = outRGB[1] = outRGB[2] = 0;
            return;
        }

        const int x0 = static_cast<int>(std::floor(u));
        const int y0 = static_cast<int>(std::floor(v));
        const int x1 = std::min(x0 + 1, width - 1);
        const int y1 = std::min(y0 + 1, height - 1);

        const double dx = u - x0;
        const double dy = v - y0;

        const uint8_t* p00 = src + y0 * stride + x0 * 3;
        const uint8_t* p10 = src + y0 * stride + x1 * 3;
        const uint8_t* p01 = src + y1 * stride + x0 * 3;
        const uint8_t* p11 = src + y1 * stride + x1 * 3;

        for (int c = 0; c < 3; ++c)
        {
            const double v0 = p00[c] * (1.0 - dx) + p10[c] * dx;
            const double v1 = p01[c] * (1.0 - dx) + p11[c] * dx;
            const double val = v0 * (1.0 - dy) + v1 * dy;
            outRGB[c] = clampToByte(val);
        }
    }

    // =========================
    // Bird's-Eye View Pixel -> World Coordinates
    // =========================
    //
    // This is the key to "inverse mapping":
    // For each pixel of the output bird's-eye view, first find its point on the world ground,
    // then back-calculate its position in the original image, and finally sample from the original image.
    //
    inline void birdPixelToGround(
        double bx,
        double by,
        const IPMParam& ipmParam,
        double& X,
        double& Y)
    {
        const double xRatio = bx / (ipmParam.outWidth - 1);
        const double yRatio = 1.0 - by / (ipmParam.outHeight - 1);

        X = ipmParam.minX + xRatio * (ipmParam.maxX - ipmParam.minX);
        Y = ipmParam.minY + yRatio * (ipmParam.maxY - ipmParam.minY);
    }

    // =========================
    // World Ground Point -> Original Image Pixel
    // =========================
    //
    // Given a world point (X, Y, 0), back-project it to the input image for inverse mapping sampling.
    //
    inline bool groundToImagePixel(
        double X,
        double Y,
        const CameraParam& cam,
        double& u,
        double& v)
    {
        // World point Pw = (X, Y, 0)
        // Camera center Cw = (0, 0, H)
        // World direction vector d_w = Pw - Cw = (X, Y, -H)
        const double dwx = X;
        const double dwy = Y;
        const double dwz = -cam.H;

        // Need to transform the world direction back to the camera direction
        // The camera is tilted downward by pitch (a rotation about world X by -pitch),
        // so the inverse is a rotation about X by +pitch
        const double c = std::cos(cam.pitch);
        const double s = std::sin(cam.pitch);

        // First, inverse rotate to the unpitched state
        // Check: the optical-axis direction (0, cos(pitch), -sin(pitch)) maps to (0, 1, 0)
        const double X0 = dwx;
        const double Y0 = c * dwy - s * dwz;
        const double Z0 = s * dwy + c * dwz;

        // Then map back to camera coordinates
        // base: [X0, Y0, Z0] = [xc, zc, -yc]
        const double xc = X0;
        const double yc = -Z0;
        const double zc = Y0;

        // Behind the camera, invalid
        if (zc <= 1e-12)
            return false;

        u = cam.fx * (xc / zc) + cam.cx;
        v = cam.fy * (yc / zc) + cam.cy;
        return true;
    }

    // =========================
    // Grayscale IPM
    // =========================
    //
    // src: Input grayscale image
    // dst: Output grayscale image, must be allocated externally with outHeight * dstStride bytes
    //
    inline void warpIPMGray(
        const uint8_t* src,
        int srcWidth,
        int srcHeight,
        int srcStride,
        uint8_t* dst,
        int dstStride,
        const CameraParam& cam,
        const IPMParam& ipmParam)
    {
        for (int by = 0; by < ipmParam.outHeight; ++by)
        {
            uint8_t* dstRow = dst + by * dstStride;

            for (int bx = 0; bx < ipmParam.outWidth; ++bx)
            {
                // 1) Output bird's-eye view pixel -> World ground point
                double X, Y;
                birdPixelToGround(static_cast<double>(bx),
                                  static_cast<double>(by),
                                  ipmParam, X, Y);

                // 2) World ground point -> Original image pixel
                double u, v;
                if (!groundToImagePixel(X, Y, cam, u, v))
                {
                    dstRow[bx] = 0;
                    continue;
                }

                // 3) Bilinear sampling
                dstRow[bx] = bilinearSampleGray(src, srcWidth, srcHeight, srcStride, u, v);
            }
        }
    }

    // =========================
    // RGB IPM
    // =========================
    //
    // src: Input RGB image, arranged as RGBRGB...
    // dst: Output RGB image, arranged as RGBRGB...
    //
    inline void warpIPMRGB(
        const uint8_t* src,
        int srcWidth,
        int srcHeight,
        int srcStride,
        uint8_t* dst,
        int dstStride,
        const CameraParam& cam,
        const IPMParam& ipmParam)
    {
        for (int by = 0; by < ipmParam.outHeight; ++by)
        {
            uint8_t* dstRow = dst + by * dstStride;

            for (int bx = 0; bx < ipmParam.outWidth; ++bx)
            {
                double X, Y;
                birdPixelToGround(static_cast<double>(bx),
                                  static_cast<double>(by),
                                  ipmParam, X, Y);

                double u, v;
                if (!groundToImagePixel(X, Y, cam, u, v))
                {
                    uint8_t* p = dstRow + bx * 3;
                    p[0] = p[1] = p[2] = 0;
                    continue;
                }

                uint8_t rgb[3];
                bilinearSampleRGB(src, srcWidth, srcHeight, srcStride, u, v, rgb);

                uint8_t* p = dstRow + bx * 3;
                p[0] = rgb[0];
                p[1] = rgb[1];
                p[2] = rgb[2];
            }
        }
    }

}
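The ray-ground intersection at the heart of the listing can be checked by hand under its documented convention (world Z up, optical axis along world Y, pitch > 0 looking downward). The standalone sketch below re-derives the rotation inline instead of calling the functions above, and the name `groundDistanceForRow` is hypothetical. For the image center, the ray should meet the ground at $Y = H / \tan(\text{pitch})$, which is roughly 87 cm for $H = 37$ cm and a pitch of $23^\circ$.

```cpp
#include <cassert>
#include <cmath>

// For a pixel on the vertical center line of the image, with normalized row
// offset yn = (v - cy) / fy, compute the forward distance Y at which its ray
// meets the ground plane Z = 0. The camera sits at height H and is pitched
// downward by `pitch` radians. Returns false if the ray is parallel to the
// ground or points upward.
bool groundDistanceForRow(double yn, double H, double pitch, double& Y)
{
    assert(H > 0.0);
    const double c = std::cos(pitch);
    const double s = std::sin(pitch);

    // Camera ray (0, yn, 1) -> unpitched world direction (0, 1, -yn),
    // then tilt downward by `pitch` about the world X axis.
    const double dirY = c * 1.0 + s * (-yn);
    const double dirZ = -s * 1.0 + c * (-yn);

    if (dirZ >= -1e-12)
        return false;

    // Ray: (0, 0, H) + t * dir. Solve H + t * dirZ = 0 for t.
    const double t = -H / dirZ;
    Y = t * dirY;
    return true;
}
```

Rows below the image center (yn > 0) give smaller distances, and rows far enough above it never reach the ground, which is why the listing carries validity checks alongside the coordinates.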