Image Inverse Perspective Without OpenCV
Introduction
As autonomous driving technology spreads, more people come into contact with it and grow curious about how a computer actually drives a car, and about how the supporting functions of an autonomous driving system are implemented. In this article we explain, and then implement, the algorithm behind the "360° reverse camera" found in such systems.
Inverse Perspective Transformation
To build the image, the vehicle reads several cameras and stitches their frames into a "360° panoramic photo".
To form the overhead view, the "360° reverse camera" first applies a mathematical operation to each frame: the Inverse Perspective Transformation, commonly abbreviated as IPM (Inverse Perspective Mapping).
There are many IPM methods, such as the "corresponding point pair homography transformation method" and the "simplified camera model inverse perspective transformation", but all of them rest on matrix transformations.
Corresponding Point Pair Homography Transformation Method
This method is relatively simple, so we only outline it.
The input is at least four corresponding point pairs, no three of which are collinear. Neither the camera parameters nor the position of the plane needs to be known. The point pairs are used to solve for the perspective transformation matrix. This is a 3x3 matrix defined only up to scale, so it has eight unknowns; each point pair contributes two linear equations, so four pairs are enough to build a solvable linear system. With more than four pairs the system is over-determined and is solved with a fitting method such as least squares. The points are usually selected by hand, generally using vanishing points as a guide.

Because this transformation is straightforward to implement and achieves IPM with little effort, we do not cover it further here.
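Although this method is not the focus of the article, a small sketch may make the linear system concrete. The C++ below is an illustration under stated assumptions, not a reference implementation: the names `Point2`, `Mat3`, `homographyFrom4`, and `applyHomography` are hypothetical, the matrix is normalized by fixing its last entry to 1 (which fails for the rare homography whose last entry is 0), and the 8x8 system is solved by Gauss-Jordan elimination with partial pivoting.

```cpp
#include <array>
#include <cmath>
#include <utility>

struct Point2 { double x; double y; };
using Mat3 = std::array<double, 9>; // row-major 3x3 matrix

// Solve for the homography that maps src[i] to dst[i] for exactly four pairs.
// Assumption: the last entry of the matrix is fixed to 1.
// Returns false if the linear system is (numerically) singular.
inline bool homographyFrom4(const std::array<Point2, 4>& src,
                            const std::array<Point2, 4>& dst,
                            Mat3& Hm)
{
    double A[8][9] = {};
    for (int i = 0; i < 4; ++i)
    {
        const double x = src[i].x, y = src[i].y;
        const double u = dst[i].x, v = dst[i].y;
        // u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), same for v with h3..h5
        const double r0[9] = { x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y, u };
        const double r1[9] = { 0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y, v };
        for (int j = 0; j < 9; ++j)
        {
            A[2 * i][j] = r0[j];
            A[2 * i + 1][j] = r1[j];
        }
    }
    // Gauss-Jordan elimination with partial pivoting on the augmented matrix
    for (int col = 0; col < 8; ++col)
    {
        int piv = col;
        for (int r = col + 1; r < 8; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[piv][col]))
                piv = r;
        if (std::fabs(A[piv][col]) < 1e-12)
            return false;
        for (int j = 0; j < 9; ++j)
            std::swap(A[col][j], A[piv][j]);
        for (int r = 0; r < 8; ++r)
        {
            if (r == col)
                continue;
            const double f = A[r][col] / A[col][col];
            for (int j = col; j < 9; ++j)
                A[r][j] -= f * A[col][j];
        }
    }
    for (int i = 0; i < 8; ++i)
        Hm[i] = A[i][8] / A[i][i];
    Hm[8] = 1.0;
    return true;
}

// Apply the homography to one point (homogeneous divide included).
inline Point2 applyHomography(const Mat3& Hm, Point2 p)
{
    const double w = Hm[6] * p.x + Hm[7] * p.y + Hm[8];
    return { (Hm[0] * p.x + Hm[1] * p.y + Hm[2]) / w,
             (Hm[3] * p.x + Hm[4] * p.y + Hm[5]) / w };
}
```

For example, passing the four corners of a road trapezoid in the image as `src` and the corners of a rectangle as `dst` yields a matrix that `applyHomography` can then apply to any other pixel.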
Simplified Camera Model IPM Method
This is the method we focus on. In essence, the algorithm takes the chain of coordinate conversions that occur during camera imaging, abstracts and simplifies it, and ends up with world coordinates.
It then establishes the correspondence between world coordinates and image coordinates and uses that relationship to transform the image.

Rather than relying on long and complex formulas, we work directly with coordinates. This IPM method requires the actual parameters of the camera to be measured first.
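Here, the elevation (pitch) angle is $\theta$ (`pitch` in the code below), the height of the optical center above the ground is $H$, and the distance from the viewpoint to the view plane is $d$ (the focal length in pixels, `fx = fy = d`). The goal is to find the world coordinate $(X, Y, Z)$ of the point seen at each pixel. With the rotation used in the code, a camera tilted down toward the ground has $\theta < 0$.
Let the camera image coordinate be $(u, v)$ and the principal point be $(c_x, c_y)$. With the camera center at $(0, 0, H)$, the relationship between world coordinates and image coordinates gives the matrix equation

$$
z_c \begin{bmatrix} u - c_x \\ v - c_y \\ d \end{bmatrix}
= d \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sin\theta & \cos\theta \\ 0 & \cos\theta & -\sin\theta \end{bmatrix}
\begin{bmatrix} X \\ Y \\ H - Z \end{bmatrix}
$$

where $z_c$ is the depth of the point along the optical axis.
Substituting the image coordinates into this equation and solving for the point on the ground gives the world coordinates.
Let $x = u - c_x$, $y = v - c_y$, $c = \cos\theta$, $s = \sin\theta$, and $D = y c - d s$. From the geometric relationship, we know $c^2 + s^2 = 1$. The simplified form for a ground point is

$$
X = \frac{H x}{D}, \qquad Y = \frac{H (d c + y s)}{D}, \qquad Z = 0,
$$

which is valid only for $D > 0$, that is, for pixels whose ray hits the ground in front of the camera.
Finally, process the image. Because the ground is a plane, every ground point has $Z = 0$ and only $X$ and $Y$ remain. Substituting the horizontal and vertical coordinates of each pixel into the simplified form gives its coordinates in the world, i.e., the top view after IPM.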
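Before the full implementation, a compact closed-form version of the two mappings may help check the geometry by hand. This is a minimal sketch under assumptions taken from the code that follows: the camera sits at height `H` above the ground plane, a single focal distance `d` is used for both axes, and the pitch follows the same rotation convention as the full code (a negative angle tilts the camera down toward the ground). The names `SimpleCamera`, `groundToPixel`, and `pixelToGround` are hypothetical.

```cpp
#include <cmath>

struct SimpleCamera
{
    double d;     // distance from viewpoint to view plane (pixels)
    double cx;    // principal point, horizontal
    double cy;    // principal point, vertical
    double H;     // camera height above the ground
    double theta; // pitch in radians; negative = tilted down (assumed convention)
};

// Ground point (X, Y, 0) -> pixel (u, v).
// Returns false if the point is not in front of the camera.
inline bool groundToPixel(const SimpleCamera& cam, double X, double Y,
                          double& u, double& v)
{
    const double c = std::cos(cam.theta);
    const double s = std::sin(cam.theta);
    const double zc = Y * c - cam.H * s; // depth along the optical axis
    if (zc <= 1e-12)
        return false;
    u = cam.cx + cam.d * X / zc;
    v = cam.cy + cam.d * (Y * s + cam.H * c) / zc;
    return true;
}

// Pixel (u, v) -> ground point (X, Y, 0).
// Returns false for pixels at or above the horizon.
inline bool pixelToGround(const SimpleCamera& cam, double u, double v,
                          double& X, double& Y)
{
    const double c = std::cos(cam.theta);
    const double s = std::sin(cam.theta);
    const double x = u - cam.cx;
    const double y = v - cam.cy;
    const double D = y * c - cam.d * s;
    if (D <= 1e-12)
        return false;
    X = cam.H * x / D;
    Y = cam.H * (cam.d * c + y * s) / D;
    return true;
}
```

With zero pitch the second function reduces to the familiar `Y = H * d / (v - cy)`: rows closer to the horizon map to points farther away, and the horizon row itself has no ground intersection.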

#include <cmath>
#include <cstdint>
#include <vector>
#include <algorithm>
namespace ipm
{
// =========================
// Basic data structures
// =========================
struct Vec3
{
    double x;
    double y;
    double z;
};

struct GroundPoint
{
    double X;   // World coordinate X (left-right)
    double Y;   // World coordinate Y (front-back)
    bool valid; // Whether there is a valid intersection with the ground
};
struct CameraParam
{
    // Focal length (pixel units)
    // If you only have one d, you can set fx = fy = d
    double fx;
    double fy;
    // Principal point (usually the image center)
    double cx;
    double cy;
    // Camera height from ground, unit e.g., cm
    double H;
    // Camera pitch angle (radians). With the rotation in cameraRayToWorldRay,
    // pitch > 0 tilts the optical axis upward, so a camera looking down at
    // the ground needs pitch < 0.
    double pitch;
};
struct IPMParam
{
    // Output top view size
    int outWidth;
    int outHeight;
    // World coordinate range (units consistent with H, e.g., cm)
    // X: left-right range
    // Y: front-back range
    double minX;
    double maxX;
    double minY;
    double maxY;
};
// =========================
// Utility functions
// =========================
inline double clampDouble(double v, double lo, double hi)
{
    return (v < lo) ? lo : ((v > hi) ? hi : v);
}

inline uint8_t clampToByte(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return static_cast<uint8_t>(v + 0.5);
}
// Rotation around X axis: convert direction from camera coordinate system to world coordinate system
// Assumptions:
// - World Z axis points upward
// - Default camera optical axis points towards world Y positive direction
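// - pitch is a right-handed rotation about the world X axis: pitch > 0 tilts
//   the optical axis upward (toward +Z), so a camera looking down at the
//   ground needs pitch < 0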
//
// To match image coordinates (v downward), construct a commonly used mapping in engineering:
//
// Camera ray rc = [x, y, 1]
// First map to world direction "when not pitched":
// x -> Xw
// y -> -Zw
// z -> Yw
//
// Then rotate around world X axis by pitch
//
inline Vec3 cameraRayToWorldRay(const Vec3& rc, double pitch)
{
    // World direction when not pitched
    // Camera right -> World right
    // Camera down -> World negative up
    // Camera front -> World front
    const double X0 = rc.x;
    const double Y0 = rc.z;
    const double Z0 = -rc.y;
    const double c = std::cos(pitch);
    const double s = std::sin(pitch);
    // Rotate around X axis
    Vec3 rw;
    rw.x = X0;
    rw.y = c * Y0 - s * Z0;
    rw.z = s * Y0 + c * Z0;
    return rw;
}
// =========================
// Pixel point -> ground world coordinate
// =========================
//
// Input pixel point (u, v), calculate the corresponding world point (X, Y) on ground Z=0
//
// Note:
// 1. If the ray points upward or is parallel to the ground, it is invalid
// 2. fx, fy use pixel units
// 3. The unit of H determines the output world coordinate unit
//
inline GroundPoint imagePixelToGround(
    double u,
    double v,
    const CameraParam& cam)
{
    // 1) Pixel coordinates -> camera normalized coordinates
    Vec3 rc;
    rc.x = (u - cam.cx) / cam.fx;
    rc.y = (v - cam.cy) / cam.fy;
    rc.z = 1.0;
    // 2) Camera ray -> world ray
    Vec3 rw = cameraRayToWorldRay(rc, cam.pitch);
    // 3) Camera center position in world coordinates
    // Cw = (0, 0, H)
    // Ray equation: P(t) = Cw + t * rw
    //
    // Intersection with ground Zw = 0:
    // H + t * rw.z = 0 => t = -H / rw.z
    //
    GroundPoint gp{};
    gp.valid = false;
    // Ray does not point to the ground, or is almost parallel to the ground
    if (std::abs(rw.z) < 1e-12)
        return gp;
    const double t = -cam.H / rw.z;
    // Only accept "forward" intersections
    if (t <= 0.0)
        return gp;
    gp.X = t * rw.x;
    gp.Y = t * rw.y;
    gp.valid = true;
    return gp;
}
// =========================
// World coordinate -> bird's-eye view pixel
// =========================
//
// Map the ground point (X, Y) to the output bird's-eye view pixel (bx, by)
//
// Output image convention:
// - Left is minX, right is maxX
// - Top is maxY (farther)
// - Bottom is minY (closer)
//
inline bool groundToBirdPixel(
    double X, double Y,
    const IPMParam& ipmParam,
    double& bx, double& by)
{
    if (X < ipmParam.minX || X > ipmParam.maxX ||
        Y < ipmParam.minY || Y > ipmParam.maxY)
    {
        return false;
    }
    const double xRatio =
        (X - ipmParam.minX) / (ipmParam.maxX - ipmParam.minX);
    const double yRatio =
        (Y - ipmParam.minY) / (ipmParam.maxY - ipmParam.minY);
    // X from left to right
    bx = xRatio * (ipmParam.outWidth - 1);
    // Want "far away is at the top of the image"
    by = (1.0 - yRatio) * (ipmParam.outHeight - 1);
    return true;
}
// =========================
// Bilinear sampling (grayscale)
// =========================
inline uint8_t bilinearSampleGray(
    const uint8_t* src,
    int width,
    int height,
    int stride,
    double u,
    double v)
{
    // Samples outside the source image are returned as black
    if (u < 0.0 || v < 0.0 || u > width - 1.0 || v > height - 1.0)
        return 0;
    const int x0 = static_cast<int>(std::floor(u));
    const int y0 = static_cast<int>(std::floor(v));
    const int x1 = std::min(x0 + 1, width - 1);
    const int y1 = std::min(y0 + 1, height - 1);
    const double dx = u - x0;
    const double dy = v - y0;
    const double p00 = src[y0 * stride + x0];
    const double p10 = src[y0 * stride + x1];
    const double p01 = src[y1 * stride + x0];
    const double p11 = src[y1 * stride + x1];
    // Interpolate along x on both rows, then along y
    const double v0 = p00 * (1.0 - dx) + p10 * dx;
    const double v1 = p01 * (1.0 - dx) + p11 * dx;
    const double val = v0 * (1.0 - dy) + v1 * dy;
    return clampToByte(val);
}
// =========================
// Bilinear sampling (RGB three channels)
// 3 bytes per pixel, RGBRGB...
// =========================
inline void bilinearSampleRGB(
    const uint8_t* src,
    int width,
    int height,
    int stride,
    double u,
    double v,
    uint8_t outRGB[3])
{
    // Samples outside the source image are returned as black
    if (u < 0.0 || v < 0.0 || u > width - 1.0 || v > height - 1.0)
    {
        outRGB[0] = outRGB[1] = outRGB[2] = 0;
        return;
    }
    const int x0 = static_cast<int>(std::floor(u));
    const int y0 = static_cast<int>(std::floor(v));
    const int x1 = std::min(x0 + 1, width - 1);
    const int y1 = std::min(y0 + 1, height - 1);
    const double dx = u - x0;
    const double dy = v - y0;
    const uint8_t* p00 = src + y0 * stride + x0 * 3;
    const uint8_t* p10 = src + y0 * stride + x1 * 3;
    const uint8_t* p01 = src + y1 * stride + x0 * 3;
    const uint8_t* p11 = src + y1 * stride + x1 * 3;
    for (int c = 0; c < 3; ++c)
    {
        const double v0 = p00[c] * (1.0 - dx) + p10[c] * dx;
        const double v1 = p01[c] * (1.0 - dx) + p11[c] * dx;
        const double val = v0 * (1.0 - dy) + v1 * dy;
        outRGB[c] = clampToByte(val);
    }
}
// =========================
// Bird's-eye view pixel -> world coordinate
// =========================
//
// This is the key to "inverse mapping":
// For each pixel of the output bird's-eye view, first find its corresponding world ground point,
// then back-calculate its position in the original image, and finally sample from the original image.
//
inline void birdPixelToGround(
    double bx,
    double by,
    const IPMParam& ipmParam,
    double& X,
    double& Y)
{
    const double xRatio = bx / (ipmParam.outWidth - 1);
    const double yRatio = 1.0 - by / (ipmParam.outHeight - 1);
    X = ipmParam.minX + xRatio * (ipmParam.maxX - ipmParam.minX);
    Y = ipmParam.minY + yRatio * (ipmParam.maxY - ipmParam.minY);
}
// =========================
// World ground point -> original image pixel
// =========================
//
// Given world point (X, Y, 0), back-project to input image for inverse mapping sampling.
//
inline bool groundToImagePixel(
    double X,
    double Y,
    const CameraParam& cam,
    double& u,
    double& v)
{
    // World point Pw = (X, Y, 0)
    // Camera center Cw = (0, 0, H)
    // World direction vector d_w = Pw - Cw = (X, Y, -H)
    const double dwx = X;
    const double dwy = Y;
    const double dwz = -cam.H;
    // Need to convert world direction back to camera direction
    // In cameraRayToWorldRay, it uses: Rw = Rx(pitch) * base
    // So here we do inverse rotation: Rx(-pitch)
    const double c = std::cos(cam.pitch);
    const double s = std::sin(cam.pitch);
    // First rotate inversely to non-pitched state
    const double X0 = dwx;
    const double Y0 = c * dwy + s * dwz;
    const double Z0 = -s * dwy + c * dwz;
    // Then map back to camera coordinates
    // base: [X0, Y0, Z0] = [xc, zc, -yc]
    const double xc = X0;
    const double yc = -Z0;
    const double zc = Y0;
    // Behind the camera, invalid
    if (zc <= 1e-12)
        return false;
    u = cam.fx * (xc / zc) + cam.cx;
    v = cam.fy * (yc / zc) + cam.cy;
    return true;
}
// =========================
// Grayscale IPM
// =========================
//
// src: input grayscale image
// dst: output grayscale image, must be allocated externally with outHeight * dstStride bytes
//
inline void warpIPMGray(
    const uint8_t* src,
    int srcWidth,
    int srcHeight,
    int srcStride,
    uint8_t* dst,
    int dstStride,
    const CameraParam& cam,
    const IPMParam& ipmParam)
{
    for (int by = 0; by < ipmParam.outHeight; ++by)
    {
        uint8_t* dstRow = dst + by * dstStride;
        for (int bx = 0; bx < ipmParam.outWidth; ++bx)
        {
            // 1) Output bird's-eye view pixel -> world ground point
            double X, Y;
            birdPixelToGround(static_cast<double>(bx),
                              static_cast<double>(by),
                              ipmParam, X, Y);
            // 2) World ground point -> original image pixel
            double u, v;
            if (!groundToImagePixel(X, Y, cam, u, v))
            {
                dstRow[bx] = 0;
                continue;
            }
            // 3) Bilinear sampling
            dstRow[bx] = bilinearSampleGray(src, srcWidth, srcHeight, srcStride, u, v);
        }
    }
}
// =========================
// RGB IPM
// =========================
//
// src: input RGB image, arranged as RGBRGB...
// dst: output RGB image, arranged as RGBRGB...
//
inline void warpIPMRGB(
    const uint8_t* src,
    int srcWidth,
    int srcHeight,
    int srcStride,
    uint8_t* dst,
    int dstStride,
    const CameraParam& cam,
    const IPMParam& ipmParam)
{
    for (int by = 0; by < ipmParam.outHeight; ++by)
    {
        uint8_t* dstRow = dst + by * dstStride;
        for (int bx = 0; bx < ipmParam.outWidth; ++bx)
        {
            // 1) Output bird's-eye view pixel -> world ground point
            double X, Y;
            birdPixelToGround(static_cast<double>(bx),
                              static_cast<double>(by),
                              ipmParam, X, Y);
            // 2) World ground point -> original image pixel
            double u, v;
            if (!groundToImagePixel(X, Y, cam, u, v))
            {
                uint8_t* p = dstRow + bx * 3;
                p[0] = p[1] = p[2] = 0;
                continue;
            }
            // 3) Bilinear sampling, three channels
            uint8_t rgb[3];
            bilinearSampleRGB(src, srcWidth, srcHeight, srcStride, u, v, rgb);
            uint8_t* p = dstRow + bx * 3;
            p[0] = rgb[0];
            p[1] = rgb[1];
            p[2] = rgb[2];
        }
    }
}
}
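The listing leaves open how to pick the world window (`minX`, `maxX`, `minY`, `maxY`) in `IPMParam`. One possible way, sketched below, is to project the corners of the usable part of the input image onto the ground and take their bounding box. This is only an illustration: `GroundWindow` and `groundFootprint` are hypothetical names, a single focal distance `d` is assumed for both axes, the pitch follows the same sign convention as the code above (negative = tilted down), and `vTop` is the first image row you want to keep, which must lie below the horizon.

```cpp
#include <algorithm>
#include <cmath>

struct GroundWindow
{
    double minX;
    double maxX;
    double minY;
    double maxY;
    bool valid; // false if any corner ray misses the ground
};

// Bounding box, on the ground plane, of the four corner pixels of the
// image region covering rows vTop .. height-1 (hypothetical helper).
inline GroundWindow groundFootprint(double d, double cx, double cy,
                                    double H, double pitch,
                                    int width, int height, double vTop)
{
    const double c = std::cos(pitch);
    const double s = std::sin(pitch);
    const double us[2] = { 0.0, width - 1.0 };
    const double vs[2] = { vTop, height - 1.0 };
    GroundWindow w{ 0.0, 0.0, 0.0, 0.0, true };
    bool first = true;
    for (double v : vs)
    {
        for (double u : us)
        {
            const double x = u - cx;
            const double y = v - cy;
            const double D = y * c - d * s;
            if (D <= 1e-12) // corner at or above the horizon
            {
                w.valid = false;
                return w;
            }
            const double X = H * x / D;
            const double Y = H * (d * c + y * s) / D;
            if (first)
            {
                w.minX = w.maxX = X;
                w.minY = w.maxY = Y;
                first = false;
            }
            else
            {
                w.minX = std::min(w.minX, X);
                w.maxX = std::max(w.maxX, X);
                w.minY = std::min(w.minY, Y);
                w.maxY = std::max(w.maxY, Y);
            }
        }
    }
    return w;
}
```

The resulting box is usually much wider at the far end than the ground actually covered near the camera, so in practice you may want to shrink it; pixels of the top view that fall outside the source image simply come out black in `warpIPMGray` and `warpIPMRGB`.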