Learning point clouds

A brief intro to several families of methods

  • multi-view CNN (projections)
    • cnn → view pooling → cnn (sketched in code below)
      • each camera view goes through the first-stage cnn separately to get per-view features
      • view pooling: max pooling across all views
      • another set of cnn layers for classification
    • pros:
      • good performance
      • easy to use pretrained models
    • cons:
      • needs a careful camera-array setup for the projections
      • hard to handle noisy or incomplete input
      • not invariant to object rotation or translation
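
A minimal PyTorch sketch of the cnn → view pooling → cnn pipeline; the layer sizes, 12 views, and 40 classes are illustrative assumptions, not the original multi-view CNN configuration:

```python
import torch
import torch.nn as nn

class MultiViewCNN(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # stage 1: a small per-view CNN, shared across all views
        self.view_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B * V, 64)
        )
        # stage 2: classifier applied after view pooling
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, views):                        # views: (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.view_cnn(views.reshape(b * v, c, h, w))
        feats = feats.reshape(b, v, -1)
        pooled = feats.max(dim=1).values             # view pooling: max across views
        return self.classifier(pooled)

logits = MultiViewCNN()(torch.randn(2, 12, 3, 64, 64))  # -> (2, 40)
```

The max over the view axis is what lets the prediction ignore which camera produced which image.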
  • volumetric CNN (grids)
    • directly learn on 3d data without projection
    • methods
      • voxel cnn
        • voxelization: convert the point cloud / mesh to a rasterized discrete grid
        • simply expand 2d convolution (a 3d kernel sliding in 2 directions) to 3d convolution (a 4d kernel sliding in 3 directions), as sketched below
        • cons
          • input size scales up quickly (height × width × depth)
          • information is lost during voxelization
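
The 2d → 3d expansion is visible in one line of PyTorch; the channel counts and 32³ grid are toy assumptions:

```python
import torch
import torch.nn as nn

# 2d: the (out, in, kH, kW) kernel slides along height and width
conv2d = nn.Conv2d(1, 16, kernel_size=3, padding=1)

# 3d: the (out, in, kD, kH, kW) kernel also slides along depth,
# so cost grows with height * width * depth
conv3d = nn.Conv3d(1, 16, kernel_size=3, padding=1)

voxels = torch.zeros(1, 1, 32, 32, 32)   # occupancy grid from voxelization
print(conv3d(voxels).shape)              # torch.Size([1, 16, 32, 32, 32])
```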
      • octree CNN
        • only stores the surface, which is sparse compared to the whole volume
        • tree structure
          • recursively partition the cube into 8 smaller cubes (4 top, 4 bottom)
          • if a cube contains surface, keep dividing (sketched in code after the cons below)
        • cons
          • complicated structure → difficult optimization process
          • not very flexible
          • still need voxelization
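
A minimal NumPy sketch of the recursive partitioning rule; the point-in-cube occupancy test and the fixed depth cutoff are simplifications of a real octree builder:

```python
import numpy as np

def build_octree(points, center, half, depth, max_depth=4):
    """Recursively split a cube into 8 octants, keeping only occupied ones."""
    inside = np.all(np.abs(points - center) <= half, axis=1)
    if not inside.any():
        return None                      # empty cube: prune this branch
    if depth == max_depth:
        return {"leaf": True}            # occupied leaf stores the surface
    children = []
    for dx in (-0.5, 0.5):               # 4 top + 4 bottom octants
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                child_center = center + half * np.array([dx, dy, dz])
                children.append(build_octree(points[inside], child_center,
                                             half / 2, depth + 1, max_depth))
    return {"leaf": False, "children": children}

pts = np.random.rand(1000, 3)            # toy "surface" samples in the unit cube
tree = build_octree(pts, center=np.array([0.5, 0.5, 0.5]), half=0.5, depth=0)
```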
  • point network (coordinates)
    • learns directly on coordinates, without voxelization
    • point data is unordered → the model needs to be permutation invariant
      • achieved with a symmetric function (like sum or max)
    • method
      • point-net
        • structure
          • process each point independently with a shared MLP (same weights for every point)
          • max pooling to combine the per-point encodings
          • another MLP for classification (a minimal sketch follows the cons below)
        • cons
          • no local patterns (neighbor information is ignored)
            • PointNet++: apply PointNet recursively on a nested partitioning of the point cloud
          • features depend on absolute coordinates → hard to generalize to complex scenes
          • big model, because it uses T-Net alignment networks (mini spatial transformers) to make the input and features transformation invariant
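
A minimal PointNet-style sketch of the shared-MLP + symmetric-max idea; the layer sizes are illustrative and the T-Net alignment blocks are omitted:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # shared MLP: Conv1d with kernel size 1 applies the same
        # weights to every point independently
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, xyz):                    # xyz: (B, 3, N) coordinates
        feats = self.point_mlp(xyz)            # per-point features: (B, 256, N)
        global_feat = feats.max(dim=2).values  # symmetric max pool over points
        return self.head(global_feat)          # order of the N points never mattered

logits = TinyPointNet()(torch.randn(2, 3, 1024))  # -> (2, 40)
```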
      • graph-based method
        • dynamic graph CNN (DGCNN): considers local geometric information
          • points to graph
            • pick the k nearest neighbors of each point
            • compute an edge feature for each neighbor: simply use an MLP to get an embedding
          • apply graph convolution over these “edges” (sketched in code below)
            • like a cnn uses a kernel to gather neighbor info, the gcn uses edges to aggregate info
        • can be applied recursively to learn semantic relationships between groups of points (regardless of distance)
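
A sketch of the graph construction and edge aggregation in the DGCNN/EdgeConv spirit; k and the MLP sizes are assumptions:

```python
import torch
import torch.nn as nn

def knn(xyz, k):
    """Indices of the k nearest neighbors of every point. xyz: (B, N, 3)."""
    dist = torch.cdist(xyz, xyz)                        # pairwise distances (B, N, N)
    return dist.topk(k + 1, largest=False).indices[..., 1:]  # drop self

def edge_conv(xyz, feats, mlp, k=8):
    """One EdgeConv-style layer: aggregate MLP(edge) over each point's neighbors."""
    b, n, c = feats.shape
    idx = knn(xyz, k)                                   # (B, N, k)
    batch = torch.arange(b).view(b, 1, 1)
    neighbors = feats[batch, idx]                       # (B, N, k, C)
    center = feats.unsqueeze(2).expand(b, n, k, c)      # (B, N, k, C)
    # edge feature: (center, neighbor - center) captures local geometry
    edges = torch.cat([center, neighbors - center], dim=-1)
    return mlp(edges).max(dim=2).values                 # max-aggregate over neighbors

xyz = torch.randn(2, 512, 3)
mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 64))
out = edge_conv(xyz, xyz, mlp)                          # -> (2, 512, 64)
```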

Processing point cloud data

  • load dataset
    • 3dvision.princeton.edu
    • open .off files with trimesh → visualization and point sampling
  • point cloud data augmentation (sketched in code below)
    • jitter points: add random noise to each coordinate
    • shuffle the point order (order carries no information)
  • apply network (e.g. Point cloud classification with PointNet)
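
A sketch of the loading and augmentation steps with trimesh and NumPy; the file path and noise scale are placeholders:

```python
import numpy as np
import trimesh

# .off meshes open directly in trimesh; mesh.show() opens a viewer
mesh = trimesh.load("ModelNet10/chair/train/chair_0001.off")  # placeholder path
points = np.asarray(mesh.sample(2048))     # sample a (2048, 3) point cloud

def augment(points, rng=np.random.default_rng()):
    # jitter: small Gaussian noise on every coordinate
    jittered = points + rng.normal(0.0, 0.02, size=points.shape)
    # shuffle: point order carries no information, so permute freely
    return rng.permutation(jittered)
```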

PointPillars: fast encoders for object detection from point clouds

  • point cloud → predictions
    • Pillar feature net: extract features with a PointNet (sketched in code at the end of this section)
      • represent the point cloud as a dense tensor with dims (D, P, N): per-point feature dim, # of non-empty pillars, # of points per pillar
        • leverages the sparsity of the point cloud: only non-empty pillars are stored
        • D is the per-point feature dim (each lidar point is decorated with pillar offsets)
      • point net: a shared layer + max pooling over N gives per-pillar features with 2d dims (feature dim C, # of non-empty pillars P)
        • these are scattered back to their grid cells to form a pseudo image
    • Backbone (2d CNN): applied to the pseudo image
      • processes the pseudo image into a high-level representation
    • detection head
      • apply a Single Shot Detector (SSD) head, pretrained, for one-stage bounding-box detection
        • generates 2d bounding boxes on the features from the backbone
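
A toy sketch of the pillar feature net: a shared 1×1 conv over the dense (D, P, N) tensor, max pooling over N, then scattering the per-pillar features into a pseudo image for the 2d backbone; all sizes and the random pillar coordinates are assumptions:

```python
import torch
import torch.nn as nn

D, P, N, C = 9, 100, 32, 64     # point feature dim, pillars, points per pillar, channels
H, W = 50, 50                    # pseudo-image grid (toy size)

pillars = torch.randn(1, D, P, N)           # dense (D, P, N) tensor per sample
coords = torch.randint(0, H * W, (P,))      # flat grid cell of each pillar (toy)

# PointNet-style shared layer: 1x1 conv = same weights for every point and pillar
encoder = nn.Sequential(nn.Conv2d(D, C, 1), nn.BatchNorm2d(C), nn.ReLU())
feats = encoder(pillars).max(dim=3).values  # max over N -> (1, C, P)

# scatter pillar features back to their grid cells -> (1, C, H, W) pseudo image
canvas = torch.zeros(1, C, H * W)
canvas[0, :, coords] = feats[0]
pseudo_image = canvas.reshape(1, C, H, W)   # ready for the 2d CNN backbone
```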