Problem: This paper addresses the problem of finding the position and orientation from which a given picture was taken.
Approach: Given a database of scene images, Structure from Motion (SfM) is used to reconstruct a 3D point cloud of the scene. Correspondences between the 2D features of a query image and the large number of 3D scene points are then used to estimate the full camera pose, i.e. position and orientation. The main problem to solve is finding correct correspondences between 2D features and 3D points, both accurately and efficiently.
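To make concrete how 2D-3D correspondences determine a camera, the following is a minimal sketch of the Direct Linear Transform (DLT) on six noise-free correspondences. It is an illustration under stated assumptions, not the paper's implementation: the camera `P_true`, the 3D points, and the function names are hypothetical, and the RANSAC loop that would handle wrong matches is omitted.

```python
import numpy as np

def dlt_projection_matrix(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) from >= 6 correspondences."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        Xh = [X, Y, Z, 1.0]
        # Each correspondence gives two linear equations in the 12 entries of P:
        #   p1.X - u * p3.X = 0   and   p2.X - v * p3.X = 0
        rows.append([*Xh, 0, 0, 0, 0, *(-u * c for c in Xh)])
        rows.append([0, 0, 0, 0, *Xh, *(-v * c for c in Xh)])
    A = np.array(rows, dtype=float)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

def project(P, points_3d):
    """Project Nx3 points with a 3x4 matrix and return Nx2 image coordinates."""
    Xh = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

# Hypothetical ground-truth camera: rotation about the y axis plus a translation.
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([[0.2], [-0.1], [0.5]])
K = np.array([[1.2, 0.0, 0.1], [0.0, 1.2, -0.05], [0.0, 0.0, 1.0]])
P_true = K @ np.hstack([R, t])

# Six non-coplanar 3D points (made up for the example) and their projections.
pts = np.array([[-1.0, -1.0, 4.0], [1.0, -1.0, 5.0], [1.0, 1.0, 7.0],
                [-1.0, 1.0, 5.0], [0.0, 0.5, 3.0], [0.5, -0.5, 6.0]])
obs = project(P_true, pts)

P_est = dlt_projection_matrix(pts, obs)
max_err = np.abs(project(P_est, pts) - obs).max()  # reprojection error
```

The estimated matrix is only defined up to scale, so the sketch compares reprojections rather than matrix entries. With real, noisy matches the same solver would typically be run inside RANSAC on random 6-point samples.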
Contribution: Combining 2D-to-3D and 3D-to-2D matching, and adding a visibility filtering step to offset the additional effort of active search.
- A visual word vocabulary is built from SIFT descriptors (100k words in the implementation), with a vocabulary tree on top of it.
- All 3D points are assigned to visual words in an offline stage. A point is stored in a word as the mean of all of its descriptors assigned to that word (the entries of the mean are rounded to the nearest integer for memory efficiency).
- A kd-tree is built from the 3D points to quickly find nearest neighbors in 3D space for 3D-2D matching.
- Given a query image, its SIFT descriptors are likewise mapped to the vocabulary, which yields a list of candidate points for every feature. The features are processed in ascending order of the length of their lists; this is the prioritization scheme (PS). For every feature f, all points in its word are searched to find its two nearest neighbors, and a correspondence is established if the SIFT ratio test is passed (2D-3D matching).
- Once a 2D-3D correspondence between a feature f and a point p has been found, the 3D points closest in space to p are retrieved. For every such point p’, the 2D features with descriptors similar to p’ are looked up using a coarser level of the visual vocabulary; the number of descriptor comparisons this requires serves as its predicted search cost. The point p’ is then inserted into the PS according to this search cost. The PS thus holds both features (for 2D-3D matching) and points (for 3D-2D matching), ordered by cost (see the original paper for details). Note that 2D-3D matching requires a rather fine vocabulary to limit the search space, while 3D-2D matching needs a coarser vocabulary (a lower level of the same vocabulary tree) to guarantee that enough features are considered.
- Notably, a filtering step based on point visibility removes some of the spatially close 3D points before 3D-2D matching. i) A minimal set of cameras is selected, each standing in for the nearby cameras whose viewing direction differs from its own by at most 60 degrees. ii) Only the 3D points visible in this subset are kept, and of these only the points two edges away from p in the visibility graph are picked for 3D-2D matching.
- The search stops once enough correspondences have been found (N = 200 in the implementation).
- The camera pose is estimated with a RANSAC variant using the 6-point DLT algorithm.
- Trade-off: 3D-2D matching is more time-efficient, whereas 2D-3D matching yields higher-quality matches.
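The prioritized 2D-3D matching step from the list above can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes one descriptor per 3D point (rather than one mean descriptor per point and word), uses tiny 2-dimensional "descriptors" for readability, and the ratio threshold of 0.7, the function name, and all data are assumptions. The active 3D-2D search and the visibility filter are left out.

```python
import numpy as np

def match_2d_to_3d(query_desc, query_words, word_to_points, point_desc,
                   ratio=0.7, max_matches=200):
    """Return (feature_index, point_index) pairs found by prioritized search."""
    # Search cost of a feature = number of 3D points stored in its visual word.
    cost = [len(word_to_points.get(w, ())) for w in query_words]
    # Cheapest features first (the prioritization scheme).
    order = sorted(range(len(query_words)), key=lambda i: cost[i])

    matches = []
    for i in order:
        candidates = np.asarray(word_to_points.get(query_words[i], ()), dtype=int)
        if len(candidates) < 2:
            continue  # the ratio test needs two neighbors
        dists = np.linalg.norm(point_desc[candidates] - query_desc[i], axis=1)
        nearest, second = np.argsort(dists)[:2]
        # SIFT ratio test: accept only if the best match is clearly better.
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(candidates[nearest])))
            if len(matches) >= max_matches:
                break  # stop early once enough correspondences are found
    return matches

# Toy data: five 3D points with 2-D descriptors, grouped into two visual words.
point_desc = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0],
                       [10.0, 10.0], [5.0, 5.0]])
word_to_points = {0: [0, 1], 1: [2, 3, 4]}

# Three query features and the visual word each one falls into.
query_desc = np.array([[9.5, 9.5], [0.5, 0.2], [5.0, 0.0]])
query_words = [1, 0, 0]

matches = match_2d_to_3d(query_desc, query_words, word_to_points, point_desc)
```

In this toy run the two features in the smaller word are examined first; the feature at (5, 0) is equally far from both of its candidates, so it fails the ratio test and produces no correspondence.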