Learning point clouds

A brief intro to several families of methods

  • multi-view CNN (projections)
    • cnn → view pooling → cnn (sketched in code below)
      • each camera view goes through the first-stage cnn separately to get per-view features
      • view pooling: max pooling across all views
      • another set of cnn layers for classification
    • pros:
      • good performance
      • easy to use pretrained models
    • cons:
      • needs a careful camera-array setup for the projections
      • hard to handle noisy or incomplete input
      • not invariant to object rotation or translation
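
A minimal PyTorch sketch of the cnn → view pooling → cnn pipeline; the layer sizes, 12 views, and 40 classes are illustrative assumptions, not the original multi-view CNN configuration:

```python
import torch
import torch.nn as nn

class MultiViewCNN(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # stage 1: a small per-view CNN, shared across all views
        self.view_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B * V, 64)
        )
        # stage 2: classifier applied after view pooling
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, views):                        # views: (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.view_cnn(views.reshape(b * v, c, h, w))
        feats = feats.reshape(b, v, -1)
        pooled = feats.max(dim=1).values             # view pooling: max across views
        return self.classifier(pooled)

logits = MultiViewCNN()(torch.randn(2, 12, 3, 64, 64))  # -> (2, 40)
```

The max over the view axis is what lets the prediction ignore which camera produced which image.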
  • volumetric CNN (grids)
    • directly learn on 3d data without projection
    • methods
      • voxel cnn
        • voxelization: convert the point cloud / mesh to a rasterized discrete grid
        • simply expand 2d convolution (a 3d kernel sliding in 2 directions) to 3d convolution (a 4d kernel sliding in 3 directions), as sketched below
        • cons
          • input size scales up quickly (height × width × depth)
          • information is lost during voxelization
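
The 2d → 3d expansion is visible in one line of PyTorch; the channel counts and 32³ grid are toy assumptions:

```python
import torch
import torch.nn as nn

# 2d: the (out, in, kH, kW) kernel slides along height and width
conv2d = nn.Conv2d(1, 16, kernel_size=3, padding=1)

# 3d: the (out, in, kD, kH, kW) kernel also slides along depth,
# so cost grows with height * width * depth
conv3d = nn.Conv3d(1, 16, kernel_size=3, padding=1)

voxels = torch.zeros(1, 1, 32, 32, 32)   # occupancy grid from voxelization
print(conv3d(voxels).shape)              # torch.Size([1, 16, 32, 32, 32])
```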
      • octree CNN
        • only stores the surface, which is sparse compared to the whole volume
        • tree structure
          • recursively partition the cube into 8 smaller cubes (4 top, 4 bottom)
          • if a cube contains surface, keep dividing (sketched in code after the cons below)
        • cons
          • complicated structure → difficult optimization process
          • not very flexible
          • still need voxelization
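
A minimal NumPy sketch of the recursive partitioning rule; the point-in-cube occupancy test and the fixed depth cutoff are simplifications of a real octree builder:

```python
import numpy as np

def build_octree(points, center, half, depth, max_depth=4):
    """Recursively split a cube into 8 octants, keeping only occupied ones."""
    inside = np.all(np.abs(points - center) <= half, axis=1)
    if not inside.any():
        return None                      # empty cube: prune this branch
    if depth == max_depth:
        return {"leaf": True}            # occupied leaf stores the surface
    children = []
    for dx in (-0.5, 0.5):               # 4 top + 4 bottom octants
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                child_center = center + half * np.array([dx, dy, dz])
                children.append(build_octree(points[inside], child_center,
                                             half / 2, depth + 1, max_depth))
    return {"leaf": False, "children": children}

pts = np.random.rand(1000, 3)            # toy "surface" samples in the unit cube
tree = build_octree(pts, center=np.array([0.5, 0.5, 0.5]), half=0.5, depth=0)
```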
  • point network (coordinates)
    • learns directly on coordinates, without voxelization
    • point data is unordered → the model needs to be permutation invariant
      • achieved with a symmetric function (like sum or max)
    • method
      • point-net
        • structure
          • process each point independently with a shared MLP (same weights for every point)
          • max pooling to combine the per-point encodings
          • another MLP for classification (a minimal sketch follows the cons below)
        • cons
          • no local patterns (neighbor information is ignored)
            • PointNet++: apply PointNet recursively on a nested partitioning of the point cloud
          • features depend on absolute coordinates → hard to generalize to complex scenes
          • big model, because it uses T-Net alignment networks (mini spatial transformers) to make the input and features transformation invariant
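
A minimal PointNet-style sketch of the shared-MLP + symmetric-max idea; the layer sizes are illustrative and the T-Net alignment blocks are omitted:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # shared MLP: Conv1d with kernel size 1 applies the same
        # weights to every point independently
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, xyz):                    # xyz: (B, 3, N) coordinates
        feats = self.point_mlp(xyz)            # per-point features: (B, 256, N)
        global_feat = feats.max(dim=2).values  # symmetric max pool over points
        return self.head(global_feat)          # order of the N points never mattered

logits = TinyPointNet()(torch.randn(2, 3, 1024))  # -> (2, 40)
```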
      • graph-based method
        • dynamic graph CNN (DGCNN): considers local geometric information
          • points to graph
            • pick the k nearest neighbors of each point
            • compute an edge feature for each neighbor: simply use an MLP to get an embedding
          • apply graph convolution over these “edges” (sketched in code below)
            • like a cnn uses a kernel to gather neighbor info, the gcn uses edges to aggregate info
        • can be applied recursively to learn semantic relationships between groups of points (regardless of distance)
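
A sketch of the graph construction and edge aggregation in the DGCNN/EdgeConv spirit; k and the MLP sizes are assumptions:

```python
import torch
import torch.nn as nn

def knn(xyz, k):
    """Indices of the k nearest neighbors of every point. xyz: (B, N, 3)."""
    dist = torch.cdist(xyz, xyz)                        # pairwise distances (B, N, N)
    return dist.topk(k + 1, largest=False).indices[..., 1:]  # drop self

def edge_conv(xyz, feats, mlp, k=8):
    """One EdgeConv-style layer: aggregate MLP(edge) over each point's neighbors."""
    b, n, c = feats.shape
    idx = knn(xyz, k)                                   # (B, N, k)
    batch = torch.arange(b).view(b, 1, 1)
    neighbors = feats[batch, idx]                       # (B, N, k, C)
    center = feats.unsqueeze(2).expand(b, n, k, c)      # (B, N, k, C)
    # edge feature: (center, neighbor - center) captures local geometry
    edges = torch.cat([center, neighbors - center], dim=-1)
    return mlp(edges).max(dim=2).values                 # max-aggregate over neighbors

xyz = torch.randn(2, 512, 3)
mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 64))
out = edge_conv(xyz, xyz, mlp)                          # -> (2, 512, 64)
```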

Processing point cloud data

  • load dataset
    • 3dvision.princeton.edu
    • open .off files with trimesh → visualization and point sampling
  • point cloud data augmentation (sketched in code below)
    • jitter points: add random noise to each coordinate
    • shuffle the point order (order carries no information)
  • apply network (e.g. Point cloud classification with PointNet)
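
A sketch of the loading and augmentation steps with trimesh and NumPy; the file path and noise scale are placeholders:

```python
import numpy as np
import trimesh

# .off meshes open directly in trimesh; mesh.show() opens a viewer
mesh = trimesh.load("ModelNet10/chair/train/chair_0001.off")  # placeholder path
points = np.asarray(mesh.sample(2048))     # sample a (2048, 3) point cloud

def augment(points, rng=np.random.default_rng()):
    # jitter: small Gaussian noise on every coordinate
    jittered = points + rng.normal(0.0, 0.02, size=points.shape)
    # shuffle: point order carries no information, so permute freely
    return rng.permutation(jittered)
```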

PointPillars: fast encoders for object detection from point clouds

  • point cloud → predictions
    • Pillar feature net: extract features with a PointNet (sketched in code at the end of this section)
      • represent the point cloud as a dense tensor with dims (D, P, N): per-point feature dim, # of non-empty pillars, # of points per pillar
        • leverages the sparsity of the point cloud: only non-empty pillars are stored
        • D is the per-point feature dim (each lidar point is decorated with pillar offsets)
      • point net: a shared layer + max pooling over N gives per-pillar features with 2d dims (feature dim C, # of non-empty pillars P)
        • these are scattered back to their grid cells to form a pseudo image
    • Backbone (2d CNN): applied to the pseudo image
      • processes the pseudo image into a high-level representation
    • detection head
      • apply a Single Shot Detector (SSD) head, pretrained, for one-stage bounding-box detection
        • generates 2d bounding boxes on the features from the backbone
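
A toy sketch of the pillar feature net: a shared 1×1 conv over the dense (D, P, N) tensor, max pooling over N, then scattering the per-pillar features into a pseudo image for the 2d backbone; all sizes and the random pillar coordinates are assumptions:

```python
import torch
import torch.nn as nn

D, P, N, C = 9, 100, 32, 64     # point feature dim, pillars, points per pillar, channels
H, W = 50, 50                    # pseudo-image grid (toy size)

pillars = torch.randn(1, D, P, N)           # dense (D, P, N) tensor per sample
coords = torch.randint(0, H * W, (P,))      # flat grid cell of each pillar (toy)

# PointNet-style shared layer: 1x1 conv = same weights for every point and pillar
encoder = nn.Sequential(nn.Conv2d(D, C, 1), nn.BatchNorm2d(C), nn.ReLU())
feats = encoder(pillars).max(dim=3).values  # max over N -> (1, C, P)

# scatter pillar features back to their grid cells -> (1, C, H, W) pseudo image
canvas = torch.zeros(1, C, H * W)
canvas[0, :, coords] = feats[0]
pseudo_image = canvas.reshape(1, C, H, W)   # ready for the 2d CNN backbone
```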