- multi-view CNN (projections)
- cnn → view pooling → cnn
- renderings from multiple camera views go through the cnn separately to get per-view features
- view pooling: max pooling across all views (see the sketch after this list)
- another set of cnn layers for classification
- pros:
- good performance
- easy to use pretrained models
- cons:
- need careful camera-array setup (projection)
- hard to handle noisy input or incomplete data
- can’t handle object rotation or translation (no invariance)
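A minimal PyTorch sketch of the cnn → view pooling → cnn pipeline; the layer sizes, the 12-view input, and the module names are my own assumptions, not taken from the MVCNN paper:

```python
import torch
import torch.nn as nn

class MultiViewCNN(nn.Module):
    """Shared per-view CNN -> max view pooling -> classifier (sketch)."""
    def __init__(self, num_classes=40):
        super().__init__()
        # one CNN applied to every view (could be swapped for a pretrained backbone)
        self.view_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (B*V, 64)
        )
        self.classifier = nn.Linear(64, num_classes)    # "another set of layers"

    def forward(self, views):                           # views: (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.view_cnn(views.reshape(b * v, c, h, w))
        feats = feats.reshape(b, v, -1)
        pooled, _ = feats.max(dim=1)                    # view pooling: max over V views
        return self.classifier(pooled)

logits = MultiViewCNN()(torch.randn(2, 12, 3, 64, 64))  # 12 rendered views per object
```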
- volumetric CNN (grids)
- directly learn on 3d data without projection
- methods
- voxel cnn
- voxelization: convert point cloud / mesh into a rasterized volume (discrete 3d grid)
- simply expand 2d convolution (3d kernel moves in 2 directions) to 3d convolution (4d kernel moves in 3 directions); see the one-liner after this list
- cons
- input size scales up cubically (height * width * depth)
- information loss during voxelization
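The 2d → 3d expansion is essentially one call in PyTorch; a tiny illustration with assumed shapes:

```python
import torch
import torch.nn as nn

voxels = torch.randn(1, 1, 32, 32, 32)               # (batch, channel, D, H, W) occupancy grid
conv3d = nn.Conv3d(1, 16, kernel_size=3, padding=1)  # 3x3x3 kernel slides in 3 directions
print(conv3d(voxels).shape)                          # torch.Size([1, 16, 32, 32, 32])
```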
- octree CNN
- only store surface, which is sparse compared to the whole volume
- tree structure
- recursively partition the cube into 8 smaller cubes (4 top, 4 bottom); sketched after this list
- if the cube contains surface, keep dividing
- cons
- complicated structure → difficult optimization process
- not very flexible
- still need voxelization
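A toy sketch of the recursive 8-way split, assuming "contains surface" is approximated by "contains any points"; the function name and max depth are made up:

```python
import numpy as np

def build_octree(points, center, half, depth, max_depth=4):
    """Recursively split a cube into 8 octants, keeping only occupied ones (sketch)."""
    if len(points) == 0:
        return None                              # empty cube: nothing stored
    if depth == max_depth:
        return {"leaf": True, "n": len(points)}  # occupied leaf cell
    children = []
    for dx in (-1, 1):                           # 8 octants: 4 top, 4 bottom
        for dy in (-1, 1):
            for dz in (-1, 1):
                c = center + half / 2 * np.array([dx, dy, dz])
                mask = np.all(np.abs(points - c) <= half / 2, axis=1)
                children.append(build_octree(points[mask], c, half / 2,
                                             depth + 1, max_depth))
    return {"leaf": False, "children": children}

pts = np.random.rand(1000, 3) * 2 - 1            # random stand-in for surface samples
tree = build_octree(pts, np.array([0.0, 0.0, 0.0]), 1.0, 0)
```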
- point network (coordinates)
- learn with coordinates without doing voxelization
- point data is unordered → model needs to be permutation invariant
- use a symmetric function (like sum or max); demonstrated below
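A quick check that max is symmetric, so shuffling the points cannot change the pooled output:

```python
import torch

points = torch.randn(100, 64)                    # 100 point embeddings
shuffled = points[torch.randperm(100)]           # same points, different order
assert torch.equal(points.max(dim=0).values, shuffled.max(dim=0).values)
```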
- method
- point-net
- structure
- process each point with an MLP (weights shared across points); see the sketch after this list
- max pooling to combine the point encodings
- another MLP for classification
- cons
- no local patterns captured (each point is encoded without its neighbors)
- pointnet++: apply pointnet recursively on a nested partitioning of the point cloud
- features depend on absolute coordinates → hard to generalize to complex scenes
- big model, because it uses spatial transformer networks (T-Net) to make the input and features transformation invariant
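A stripped-down PointNet classifier matching the structure above (shared per-point MLP → max pool → classification MLP); the T-Net is omitted and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP -> max pool -> classifier MLP (no T-Net, sketch)."""
    def __init__(self, num_classes=40):
        super().__init__()
        # Conv1d with kernel size 1 == the same MLP applied to every point
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1024, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, xyz):                     # xyz: (B, 3, N)
        feats = self.point_mlp(xyz)             # (B, 1024, N) per-point encodings
        global_feat = feats.max(dim=2).values   # symmetric: point order doesn't matter
        return self.head(global_feat)

logits = TinyPointNet()(torch.randn(8, 3, 1024))  # 8 clouds of 1024 points each
```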
- graph-based method
- dynamic graph CNN (DGCNN): considers local geometric information
- points to graph
- pick the nearest k neighbors
- compute an edge feature for each neighbor: simply use an MLP to get an embedding
- apply graph convolution given these “edges” (sketched after this list)
- like a cnn uses its kernel to gather neighbor info, a gcn uses edges to aggregate info
- can be applied recursively to learn semantic relationships between groups of points (regardless of distance)
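A sketch of one EdgeConv-style layer as described above (kNN graph → per-edge MLP → max aggregation); `edge_conv` and all sizes are hypothetical, not the DGCNN reference code:

```python
import torch
import torch.nn as nn

def edge_conv(x, mlp, k=8):
    """One EdgeConv-style layer (sketch). x: (N, C) point features."""
    dist = torch.cdist(x, x)                                  # (N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]      # k nearest neighbors, skip self
    neighbors = x[idx]                                        # (N, k, C)
    center = x.unsqueeze(1).expand_as(neighbors)              # (N, k, C)
    edges = torch.cat([center, neighbors - center], dim=-1)   # (N, k, 2C) edge features
    return mlp(edges).max(dim=1).values                       # aggregate info over neighbors

x = torch.randn(512, 3)                                       # 512 points, xyz as features
mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU())              # per-edge MLP
out = edge_conv(x, mlp)                                       # (512, 64) local features
```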
Processing point cloud data
- load dataset
- 3dvision.princeton.edu
- open .off files with trimesh → visualization and sampling (see the sketch below)
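A minimal trimesh loading/sampling sketch; the .off path is a placeholder:

```python
import trimesh

# placeholder path; ModelNet meshes from 3dvision.princeton.edu come as .off files
mesh = trimesh.load("ModelNet10/chair/train/chair_0001.off")
mesh.show()                    # quick visualization (needs a viewer backend installed)
points = mesh.sample(2048)     # uniformly sample 2048 points on the mesh surface
print(points.shape)            # (2048, 3) point cloud, ready for the network
```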
- point cloud data augmentation
- jitter points: add random noise
- shuffling
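Both augmentations in a few lines of NumPy (the 0.02 noise scale is an assumption):

```python
import numpy as np

def augment(points, sigma=0.02):
    """Jitter with Gaussian noise, then shuffle point order (sketch)."""
    points = points + sigma * np.random.randn(*points.shape)  # jitter: add random noise
    np.random.shuffle(points)                                 # shuffle rows in place
    return points
```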
- apply network (e.g. Point cloud classification with PointNet)
- point cloud → predictions
- Pillar feature net: extract features using pointnet
- represent the point cloud as a dense tensor with dims (D, P, N): per-point feature depth, # of non-empty pillars per sample, # of points per pillar
- leveraging sparsity of point cloud
- D is the per-point feature depth; the non-empty pillars are stacked along the P dimension
- point net: produces features with 2d dims (feature dim, # of non-empty pillars), which are scattered back to the grid to form a pseudo image
- the # of points per pillar (N) is dropped by max pooling (see the sketch after this list)
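A shape-level sketch of the pillar feature net, assuming D=9 decorated point features and a random pillar-to-cell mapping as a stand-in for the real scatter indices:

```python
import torch
import torch.nn as nn

D, P, N, C = 9, 1200, 32, 64          # per-point depth, non-empty pillars, points/pillar, feature dim
pillars = torch.randn(1, D, P, N)     # dense tensor built from the sparse point cloud

# simplified pointnet: shared linear layer per point, then max pool away the N axis
pfn = nn.Sequential(nn.Conv2d(D, C, 1), nn.ReLU())
feats = pfn(pillars).max(dim=3).values           # (1, C, P): one feature per pillar

# scatter pillar features back to their (x, y) grid cells -> pseudo image
H = W = 100
pseudo_image = torch.zeros(1, C, H * W)
cell_idx = torch.randint(0, H * W, (P,))         # placeholder pillar -> cell mapping
pseudo_image[:, :, cell_idx] = feats
pseudo_image = pseudo_image.reshape(1, C, H, W)  # ready for the 2d backbone
```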
- Backbone 2d CNN: apply to the 2d features (pseudo image)
- decode the pseudo images to high-level representation
- detection head
- apply a (pretrained) single shot detector (SSD) head for one-shot bounding box detection
- generate 2d bounding boxes on the features from the backbone
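Putting the last two stages together: a toy 2d backbone over the pseudo image plus an SSD-style 1×1 conv head; the anchor, class, and box-parameter counts are assumptions:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                        # 2d CNN over the (C, H, W) pseudo image
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
)
# SSD-style head: per-cell class scores and box parameters in one pass
num_anchors, num_classes, box_dim = 2, 4, 7      # assumed counts (7 = x,y,z,w,l,h,theta)
head = nn.Conv2d(128, num_anchors * (num_classes + box_dim), 1)

pseudo_image = torch.randn(1, 64, 100, 100)
detections = head(backbone(pseudo_image))        # (1, anchors*(cls+box), 50, 50)
```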