Perception for autonomous cars

data processing

  • goal: process data from sensors such as cameras, lidar, and radar, and extract features and patterns that can be used later to identify objects and track their movements
  • statistical signal processing
    • Kalman filter[1]:
      • (Bayesian inference) estimate the state of a dynamic system (linear stochastic difference equation) based on noisy measurements
        • \[x_k = Ax_{k-1} + Bu_{k-1} + w_{k-1}\]
        • \[z_k= Hx_k +v_k\]
        • where $w_k$ and $v_k$ are random variables that represent the process and measurement noise, respectively
          • assumption: independent (of each other), white, and with normal probability distributions
          • \[p(w) ∼ N(0, Q)\]
          • \[p(v) ∼ N(0, R)\]
          • $Q$ and $R$ can vary at each time step, but are assumed to be constant here
      • weight the contributions of the predicted state and the measurements from sensors based on their relative uncertainties
        • if the predicted state has high uncertainty, the filter will give more weight to the measurements
        • if the predicted state has low uncertainty, the filter will give more weight to the predicted state
      • method
        • initialize
          • set the current state estimate and uncertainty estimate (prior quantities are denoted with a superscript "-")
        • predict → predict the state of the system based on its model
          • \[\hat{x}^-_k = A\hat{x}_{k-1} + Bu_{k-1}\]
          • \[P^-_k = AP_{k-1}A^T + Q\]
        • update → combine the measurements and the predicted states
          • (posterior distribution)
          • \[K_k=P^-_k H^T(HP^-_k H^T + R)^{-1}\]
          • \[\hat{x}_k = \hat{x}^-_k + K_k (z_k - H\hat{x}^-_k)\]
          • \[P_k = (I - K_kH)P^-_k\]
        • repeat the predict and update steps: the posterior distribution is used as the prior distribution for the next iteration
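A minimal numpy sketch of one predict/update cycle, assuming a toy 1D constant-velocity model with no control input (all matrix values below are illustrative, not from the source):

```python
import numpy as np

# Toy constant-velocity model: state x = [position, velocity], position-only
# measurements. The control term Bu is omitted (B = 0 here).
dt = 0.1
A = np.array([[1, dt], [0, 1]])    # state transition
H = np.array([[1.0, 0.0]])         # measurement model
Q = 1e-3 * np.eye(2)               # process noise covariance (held constant)
R = np.array([[0.25]])             # measurement noise covariance

def kalman_step(x, P, z):
    # predict (prior, the "-" quantities)
    x_prior = A @ x
    P_prior = A @ P @ A.T + Q
    # update (posterior): blend prediction and measurement via the Kalman gain
    K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + R)
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(2) - K @ H) @ P_prior
    return x_post, P_post

x, P = np.array([[0.0], [1.0]]), np.eye(2)     # initial estimate and uncertainty
for z in [0.11, 0.23, 0.29, 0.42]:             # noisy position measurements
    x, P = kalman_step(x, P, np.array([[z]]))  # posterior becomes the next prior
print(x.ravel())                               # estimated [position, velocity]
```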
    • Fourier transform
      • analyze signals in the frequency domain, enabling the extraction of features and patterns from the signal
      • applications
        • image processing[2]:
          • edge detection
            • high-frequency components represent edges and fine details in the image
            • low-frequency components represent smooth variations in the image
          • image denoising
            • by filtering out the high-frequency components in the Fourier domain, we can remove noise from the image, as sketched below
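A numpy sketch of the low-pass idea; the image here is random stand-in data and `keep_fraction` is an illustrative cutoff:

```python
import numpy as np

def fourier_denoise(image, keep_fraction=0.1):
    # move to the Fourier domain, with low frequencies centred
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    ch, cw = h // 2, w // 2
    rh, rw = int(h * keep_fraction), int(w * keep_fraction)
    mask = np.zeros_like(f)
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 1  # keep only a low-frequency window
    # back to image space; high-frequency noise is gone (and so are sharp edges)
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

noisy = np.random.rand(64, 64)    # stand-in for a noisy grayscale image
smooth = fourier_denoise(noisy)
```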

state estimation

  • goal: estimate the vehicle's position, velocity, and orientation in real-time, based on data from GPS, accelerometers, gyroscopes, and other sensors
    • state estimation
      • Particle filters (nonlinear systems, non-Gaussian noise)[3]
        • each particle represents a possible value of the state
        • method
          • sample: particles are randomly sampled from a probability distribution that represents the prior belief about the state of the system
          • update particles: propagate each particle forward in time using a motion model that describes how the state evolves over time
          • compute the likelihood of the observed data using a measurement model → compute the weight of each particle
          • resample particles based on their weights → converge to the regions of the state space that are more likely to represent the true state of the system
          • estimate state: a weighted average of the particles
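A numpy sketch of the sample/propagate/weight/resample loop, assuming a 1D random-walk state with Gaussian noise (particle count, noise levels, and measurements are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, z, motion_std=0.1, meas_std=0.5):
    # propagate each particle through the (random-walk) motion model
    particles = particles + rng.normal(0.0, motion_std, particles.size)
    # weight each particle by the measurement likelihood p(z | particle)
    weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
    weights /= weights.sum()
    # resample in proportion to the weights
    return particles[rng.choice(particles.size, particles.size, p=weights)]

particles = rng.normal(0.0, 1.0, 1000)   # sample from the prior belief
for z in [0.2, 0.3, 0.5]:                # incoming noisy measurements
    particles = pf_step(particles, z)
print(particles.mean())                  # state estimate (weights are uniform after resampling)
```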
      • Regression: predict the state of the system based on its past behavior and other available data
      • Kalman filtering with neural networks:
        • learn the dynamics of the system and to estimate the measurement noise and process noise covariance matrices ($Q$ and $R$)
        • the learned parameters can then be used in a Kalman filter to perform state estimation.
      • Hidden Markov models: model the dynamics of a system with hidden states
        • example:
          • a robot that has two states: moving or stationary
          • we can only observe the robot's position
          • goal: estimate the state of the robot at each time step
        • method
          • supervised learning:
            • input: sequence of observations
            • output: most likely state sequence
            • train by providing a dataset of labeled observation sequences and their corresponding state sequences
          • unsupervised learning[4]:
            • learn the transition probabilities and observation probabilities
              • Baum-Welch algorithm: compute the maximum likelihood estimate of the transition and observation probabilities
            • state estimation
              • Viterbi algorithm: find the most likely sequence of hidden states given the observations once the HMM is trained
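A numpy sketch of Viterbi decoding for the two-state robot example; the prior, transition, and observation probabilities are made-up numbers standing in for the Baum-Welch output:

```python
import numpy as np

states = ["stationary", "moving"]
pi = np.array([0.6, 0.4])           # initial state probabilities
T = np.array([[0.8, 0.2],           # transition probabilities T[i, j] = P(j | i)
              [0.3, 0.7]])
E = np.array([[0.9, 0.1],           # observation probabilities E[i, o] = P(o | i)
              [0.2, 0.8]])          # o = 0: "same position", o = 1: "new position"

def viterbi(obs):
    logp = np.log(pi) + np.log(E[:, obs[0]])
    backptr = []
    for o in obs[1:]:
        scores = logp[:, None] + np.log(T)   # best score arriving at each state
        backptr.append(scores.argmax(axis=0))
        logp = scores.max(axis=0) + np.log(E[:, o])
    path = [int(logp.argmax())]              # best final state, then walk back
    for b in reversed(backptr):
        path.append(int(b[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([0, 1, 1, 0]))   # most likely hidden state sequence
```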
      • Particle filtering with neural networks:
        • learn the motion model and measurement model of the system
        • use a particle filter to perform state estimation
      • Gaussian process regression:
        • model the underlying distribution of the state of the system based on noisy measurements
          • the prior distribution over functions is a Gaussian distribution
          • find the posterior distribution (also a Gaussian distribution)
            • The mean of the posterior distribution represents the predicted function (that's why we want to find this)
            • the covariance matrix represents the uncertainty in the prediction
            • updated based on the observed data
              • \[p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}\]
              • where $x$ is the state, $y$ is the measurement, $p(x)$ is the prior distribution over the state, $p(y|x)$ is the likelihood function, and $p(y)$ is the normalization constant. The posterior distribution represents the updated belief about the state given the measurement
        • predict the state at each time step
          • input: the previous state of the system
          • output: the current state of the system
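A numpy sketch of the posterior computation, assuming an RBF kernel and illustrative noisy 1D measurements:

```python
import numpy as np

def rbf(a, b, length=1.0):
    # squared-exponential (RBF) covariance between two sets of 1D inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

X = np.array([0.0, 1.0, 2.0, 3.0])   # measurement times
y = np.array([0.1, 0.9, 2.1, 2.9])   # noisy state measurements
Xs = np.linspace(0.0, 3.0, 7)        # where we want predictions
noise = 0.1

K = rbf(X, X) + noise ** 2 * np.eye(len(X))
Ks = rbf(Xs, X)
mean = Ks @ np.linalg.solve(K, y)                  # posterior mean: the predicted function
cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)  # posterior covariance: the uncertainty
std = np.sqrt(np.diag(cov))                        # per-point predictive uncertainty
```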

object detection

  • detect and classify objects in the car's environment, such as other vehicles and pedestrians
    • computer vision and machine learning algorithms
      • CNN[5]
        • extract features from the raw input data
        • architecture
          • convolutional layers:
            • apply a set of learned filters to the input image, sliding the filter over the image and computing a dot product at each position
                • the number of weights in a CNN can be smaller than in a feedforward network because the same set of weights is shared across different spatial locations
            • learn the weights of the filters
              • a convolutional layer typically uses multiple filters
              • input image shape (h, w, 3) where h is the height, w is the width, and 3 is the number of input channels
                • apply 16 filters of size 3x3 to the input → the kernel would have dimensions of 3x3x3x16
            • shape changes
              • \[(N - F)/S + 1\]
              • where
                • N: the size of the input data
                • F: the size of the filter/kernel
                • S: the stride
              • a 28×28 image with a 3×3 filter and stride 1 → (28 − 3)/1 + 1 = 26, i.e., 26×26 (verified in the sketch after this list)
          • pooling layers
            • reduce the spatial dimensions (width and height) by downsampling the feature maps
            • max pooling, average pooling
          • fully connected layers: output the desired results for the given task
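A quick PyTorch shape check of the 28×28 example above (note that PyTorch stores the kernel as out_channels × in_channels × height × width):

```python
import torch
import torch.nn as nn

# 16 filters of size 3x3 over a 3-channel 28x28 image, stride 1, no padding
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1)
x = torch.randn(1, 3, 28, 28)   # (batch, channels, height, width)
print(conv.weight.shape)        # torch.Size([16, 3, 3, 3]): the 3x3x3x16 kernel
print(conv(x).shape)            # torch.Size([1, 16, 26, 26]): (28 - 3)/1 + 1 = 26
```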
        • algorithms
          • Faster R-CNN[6]: two-stage
            • a region proposal network (RPN): propose regions in the image that are likely to contain an object
            • Fast R-CNN detector: classifies each proposed region into object categories (or background) and refines its bounding box
          • You Only Look Once (YOLO)[7]:
            • directly predicts bounding boxes and class probabilities from a single pass of the network
            • divides the image into a grid and predicts bounding boxes, confidence scores, and class probabilities for each grid cell
          • Single Shot Detector (SSD)[8]:
            • generate a set of default boxes at different scales and aspect ratios
            • use these default boxes to predict the location and class of the objects in the image
      • Fourier-Mellin transform[9]:
        • invariant to scale and rotation, making it useful for object detection in images
        • method
          • transform an image from Cartesian coordinates to log-polar coordinates
          • take the Fourier transform of the log-polar transformed image → a set of Fourier coefficients that are invariant to translation, rotation, and scale
          • analyze and compare the Fourier coefficients with those of the reference image
          • result is inverted back to image space
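A hedged OpenCV/numpy sketch of this pipeline: it builds a translation/rotation/scale-invariant signature and correlates signatures rather than inverting back to image space (the test pattern and similarity metric are illustrative choices):

```python
import cv2
import numpy as np

def fm_signature(img):
    # 1) magnitude spectrum: invariant to translation
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img))).astype(np.float32)
    # 2) log-polar remap: rotation and scaling become plain shifts
    h, w = mag.shape
    lp = cv2.warpPolar(mag, (w, h), (w / 2, h / 2), h / 2,
                       cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)
    # 3) a second magnitude spectrum removes those shifts as well
    return np.abs(np.fft.fft2(lp))

ref = np.zeros((128, 128), np.float32)
ref[40:80, 50:90] = 1.0                          # reference patch
rot = cv2.rotate(ref, cv2.ROTATE_90_CLOCKWISE)   # rotated copy of it
sig_ref, sig_rot = fm_signature(ref), fm_signature(rot)
print(np.corrcoef(sig_ref.ravel(), sig_rot.ravel())[0, 1])  # near 1 despite rotation
```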

image segmentation

  • identify and label different regions of an image, such as road, lane markings, obstacles, pedestrians, and other vehicles
    • thresholding: select a threshold value that separates one group of pixels from another (sketched after this list)
    • clustering: K-means
    • region-based segmentation: Watershed algorithm (interesting example[10])
    • edge detection [11]: Canny edge detection algorithm
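An OpenCV sketch of two of these methods, Otsu thresholding and Canny edge detection, on a synthetic image (a bright stripe standing in for a lane marking):

```python
import cv2
import numpy as np

img = np.full((100, 100), 60, np.uint8)   # dark "road" background
img[:, 45:55] = 200                       # bright vertical stripe ("lane marking")

# thresholding: Otsu automatically picks the value separating the two groups
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# edge detection: Canny keeps strong intensity gradients (the stripe borders)
edges = cv2.Canny(img, 100, 200)
```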
  • generate a detailed map of the surrounding environment → make safe and informed decisions about their actions, such as staying on course and avoiding collisions
    • 2d to 3d
      • depth estimation: estimate the depth information of each pixel in the segmented image
        • stereo matching:
          • use two cameras placed at a distance from each other and capture images of the same scene
          • find the correspondences between points in the two images (stereo images)
          • triangulate each pair of corresponding points to estimate depth
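          • for a rectified stereo pair, the triangulation reduces to a standard disparity-to-depth relation
            • \[Z = \frac{fB}{d}\]
            • where $f$ is the focal length, $B$ is the baseline between the two cameras, and $d$ is the disparity between corresponding points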
        • structure from motion:
          • use a sequence of images captured from different viewpoints to estimate the 3D structure of the scene
          • estimate the camera motion and the 3D structure simultaneously using a bundle adjustment algorithm
        • monocular depth estimation:
          • use a single image
          • typically done with CNN with large training dataset and ground truth depth information
        • direct linear transformation (DLT)
          • for solving a system of linear equations
          • commonly used in computer vision and photogrammetry for estimating camera parameters
          • can be used for 3D reconstruction from 2D image points
          • sensitive to noise and outliers, which limits its accuracy and robustness
        • or simply use lidar, which measures depth directly
    • construct map
      • occupancy grid mapping [12]
        • environment is divided into a grid of cells
        • use the segmentation result to mark each cell as occupied or free
        • update over time using new sensor measurements
        • used for path planning, obstacle avoidance, and localization
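A numpy sketch of the update using log-odds (grid size and sensor confidence values are illustrative assumptions):

```python
import numpy as np

grid = np.zeros((100, 100))   # log-odds per cell; 0 means unknown (p = 0.5)
L_OCC = np.log(0.7 / 0.3)     # added when a cell is observed occupied
L_FREE = np.log(0.3 / 0.7)    # added when a cell is observed free

def update_cell(grid, i, j, occupied):
    # accumulate evidence over time; repeated observations sharpen the estimate
    grid[i, j] += L_OCC if occupied else L_FREE

update_cell(grid, 50, 50, occupied=True)  # e.g. segmentation found an obstacle here
prob = 1.0 / (1.0 + np.exp(-grid))        # convert log-odds back to probabilities
```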

object tracking

  • track objects over time to support motion planning
    • predict the movements of other objects and plan its own movements accordingly
      • use radar data to estimate the speed and position of other vehicles on the road
    • algorithms
      • optical flow
        • track the motion of objects in a video stream
        • method
          • measure the movement of pixels between frames by analyzing the changes in intensity between consecutive frames of the video
          • estimate the motion of objects in the scene based on this movement (sketched below)
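A dense optical flow sketch using OpenCV's Farneback method on two synthetic frames (the moving square and the parameter values are illustrative):

```python
import cv2
import numpy as np

prev = np.zeros((64, 64), np.uint8); prev[20:30, 20:30] = 255  # bright square
curr = np.zeros((64, 64), np.uint8); curr[22:32, 24:34] = 255  # shifted by (2, 4)

# per-pixel (dx, dy) motion estimated from intensity changes between frames
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
print(flow.shape)    # (64, 64, 2): horizontal and vertical displacement per pixel
print(flow[25, 25])  # roughly (4, 2) inside the moving square
```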

path planning, obstacle avoidance

  • search-based algo
    • A*: uses a heuristic to guide the search toward the goal; with an admissible heuristic it still finds the shortest path (sketched at the end of this section)
    • Dijkstra's: explores nodes in order of increasing cost from the start until the goal is reached, with no heuristic
  • sampling-based algo
    • Probabilistic Roadmaps (PRM): generate a roadmap of feasible paths through the configuration space and connects them to form a path
    • Rapidly Exploring Random Trees (RRT): build a tree of feasible configurations by randomly sampling the space and connecting new samples to the nearest node in the tree
    • differences
      • Exploration: PRM explores the configuration space globally. RRT focuses on local exploration. RRT is particularly good at finding narrow passages or gaps in obstacles
      • Connectivity: PRM builds a graph, so it can find multiple solutions to a given problem (useful when the optimal solution matters). RRT is generally faster and can find a single solution quickly
      • Complexity: PRM can be more complex than RRT because it involves building a graph of the configuration space. RRT is simpler because it only involves building a tree
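A compact A* sketch on a 4-connected grid; the grid, unit step cost, and Manhattan heuristic are illustrative (replacing the heuristic with 0 recovers Dijkstra's behavior):

```python
import heapq

def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    frontier = [(h(start), 0, start, [start])]               # (f = g + h, g, node, path)
    visited = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                heapq.heappush(frontier,
                               (g + 1 + h((r, c)), g + 1, (r, c), path + [(r, c)]))
    return None  # no path exists

grid = [[0, 0, 0],   # 0 = free, 1 = obstacle
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # routes around the obstacle row
```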

motion planning

  • potential fields: generate a force field to guide the robot towards the goal while avoiding obstacles
  • model predictive control (MPC): use a model of the system dynamics to predict the future trajectory and optimize a cost function
  • trajectory planning:
    • generate a continuous, time-parameterized sequence of robot states (positions, velocities, and accelerations) that smoothly and safely moves the robot along the desired path
    • the smooth trajectory between the start and end points can be generated using polynomial interpolation or splines (see the quintic sketch after this list)
  • Feedback Control: use feedback from sensors to adjust the motion of the robot in real-time to follow a desired trajectory
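A numpy sketch of trajectory generation with a quintic polynomial, solving for coefficients that satisfy assumed position/velocity/acceleration boundary conditions at the start and end:

```python
import numpy as np

def quintic_rows(t):
    # rows evaluating position, velocity, and acceleration of a quintic at time t
    return np.array([
        [1, t, t**2,      t**3,      t**4,       t**5],       # position
        [0, 1, 2 * t,     3 * t**2,  4 * t**3,   5 * t**4],   # velocity
        [0, 0, 2,         6 * t,     12 * t**2,  20 * t**3],  # acceleration
    ])

t0, tf = 0.0, 2.0
# boundary conditions: [pos, vel, acc] at t0 followed by [pos, vel, acc] at tf
b = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # move 1 m, start and end at rest
coeffs = np.linalg.solve(np.vstack([quintic_rows(t0), quintic_rows(tf)]), b)

ts = np.linspace(t0, tf, 50)
positions = np.polyval(coeffs[::-1], ts)  # smooth, time-parameterized positions
```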

sensor fusion

  • integrate sensor data from multiple sources, such as lidar and cameras, to create a comprehensive view of the car's environment
    • sensor fusion
      • multi-object tracker (MOT)
      • Fourier transform of the signals from different sensors
        • analyze their frequency components and combine the information from different sensors to obtain a more accurate estimate of the state of the system
          • one way to combine the sensors' data in Fourier domain is simply multiplying the Fourier transforms of the data pointwise
          • identify noise or interference in the combined Fourier transform
          • techniques
            • apply filtering: remove the unwanted noise or interference (high-pass filter, low-pass filter, Wiener filter)
            • apply a weighting function: recombine the Fourier transforms to give more weight to the frequencies that are more reliable, and less weight to the frequencies that are affected by noise or interference (sketched after this list)
            • modify the way to get data to reduce the impact of the noise or interference: adjust the sensor settings, or use different data acquisition techniques
            • estimate missing data points in the combined Fourier transform
              • interpolation
              • Kalman filter: treat the Fourier coefficients as the measurements of the state variables and estimate the current state from the prediction and the measurements
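A numpy sketch of the weighting approach above: two synthetic sensors observe the same signal with different noise, and a frequency-dependent weight (an illustrative choice) blends their Fourier transforms:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 256, endpoint=False)
truth = np.sin(2 * np.pi * 5 * t)                 # the underlying signal
sensor_a = truth + 0.5 * rng.normal(size=t.size)  # noisier sensor
sensor_b = truth + 0.1 * rng.normal(size=t.size)  # cleaner sensor

Fa, Fb = np.fft.rfft(sensor_a), np.fft.rfft(sensor_b)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# trust sensor B more overall, and sensor A only at low frequencies
w = np.where(freqs < 10.0, 0.4, 0.1)
fused = np.fft.irfft(w * Fa + (1.0 - w) * Fb)     # back to the time domain
```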
        • examples
          • Radar and Lidar Fusion:
            • radar provides relatively accurate velocity measurements
            • lidar provides relatively accurate distance measurements
          • Inertial Navigation:
            • inertial sensors such as accelerometers and gyroscopes suffer from drift and noise and can give inaccurate estimates of position and orientation
            • combine with GPS sensor
          • Magnetic and Inertial Sensor Fusion
            • magnetic sensors (orientation estimation) suffer from interference from external magnetic fields
            • combine with inertial sensors such as accelerometers and gyroscopes

Simultaneous Localization and Mapping (SLAM)

  • goal: construct a map of the environment and simultaneously localize the robot within that map
    • use sensors to collect information about the environment
    • update its estimate of its location in the environment
  • combine different methods
    • image segmentation
    • depth estimation
    • optical flow:
      • track the position of landmarks as they move in the camera's field of view
      • combine this information with other sensor data, such as IMU measurements or depth information, to estimate the position and orientation of the camera with respect to the environment
      • build a map of the environment

[1] https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf

[2] https://akshaysin.github.io/fourier_lpf.html

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7826670/

[4] https://stats.stackexchange.com/questions/581/what-are-the-differences-between-the-baum-welch-algorithm-and-viterbi-training

[5] https://towardsdatascience.com/understanding-convolutions-and-pooling-in-neural-networks-a-simple-explanation-885a2d78f211

[6] Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[7] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[8] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." Computer Vision – ECCV 2016. Springer International Publishing, 2016.

[9] https://sthoduka.github.io/imreg_fmt/docs/fourier-mellin-transform/

[10] https://docs.opencv.org/4.x/d3/db4/tutorial_py_watershed.html

[11] https://www.ijert.org/study-of-image-segmentation-by-using-edge-detection-techniques

[12] https://www.cs.cmu.edu/~16831-f14/notes/F14/16831_lecture06_agiri_dmcconac_kumarsha_nbhakta.pdf