Problem: This paper addresses the problem of estimating the position and orientation (i.e. the camera pose) from which a given picture was taken.
Approach: Given a database of scene images, Structure from Motion (SfM) is used to reconstruct a 3D point cloud of the scene, and correspondences between 2D image features and a large number of 3D scene points are found to estimate the full camera pose (position and orientation). The main problem to solve: finding correct correspondences between 2D features and 3D points, both accurately and time-efficiently.
Contribution: Combining 2D-to-3D and 3D-to-2D matching methods, with a visibility filtering step to balance the additional effort of active search.
- Build a visual word vocabulary of SIFT descriptors (100k words in the implementation) and a vocabulary tree on top of it.
- All 3D points are assigned to visual words in an offline stage. A point is stored in a word as the mean of all of its descriptors assigned to that word (the entries of the mean are rounded to the nearest integer values for memory efficiency).
- A kd-tree is built over the 3D points to quickly determine 3D nearest neighbors for 3D-to-2D matching.
- Given a query image, its SIFT descriptors are again mapped to the vocabulary, yielding a list of candidate points for every feature; the features are then considered in ascending order of the length of their lists as a prioritization scheme (denoted PS). For every feature f, all points in its word are searched to find the two nearest neighbors, and a correspondence is established if the SIFT ratio test is passed (this is the 2D-to-3D matching).
- Once a 2D-3D correspondence between a feature f and a point p has been found, the 3D points closest in space to p are retrieved. For every such point p', the closest 2D features with descriptors similar to p' are sought using the visual vocabulary at a coarser level, and the length of the resulting candidate list indicates its predicted search cost. The point p' is inserted into the prioritization scheme PS according to this search cost. PS now holds a prioritized mix of features and points to be processed by 2D-to-3D or 3D-to-2D matching, respectively (see the original paper for details). Note that 2D-to-3D matching requires a rather fine vocabulary to limit the search space, while 3D-to-2D matching needs a coarser vocabulary (a lower level of the same vocabulary tree) to guarantee that enough features are considered.
- It is noteworthy that a filtering step based on point visibility is used to remove nearby 3D points before the 3D-to-2D matching: i) a minimal set of cameras is selected such that every camera observing p is covered by a selected camera whose viewing direction differs from it by at most 60 degrees; ii) only 3D points visible in this camera subset are kept, and among those only points at most two edges away from p (in the visibility graph) are used for 3D-to-2D matching.
- Matching stops when enough correspondences have been found (N = 200 in the implementation).
- The camera pose is estimated by a RANSAC variant using the 6-point DLT algorithm.
- Trade-off: 3D-to-2D matching is more time-efficient, while 2D-to-3D matching leads to higher-quality matches.
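The prioritized 2D-to-3D matching described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `features`, `points_in_word`, and the plain Euclidean nearest-neighbor search are simplified stand-ins for the quantized SIFT descriptors and per-word point lists, and the 3D-to-2D direction and visibility filtering are omitted.

```python
import heapq
import numpy as np

def ratio_test_match(query_desc, cand_descs, cand_ids, ratio=0.7):
    """Return the best candidate point id if the SIFT ratio test passes, else None."""
    if len(cand_ids) < 2:
        return None
    d = np.linalg.norm(cand_descs - query_desc, axis=1)
    order = np.argsort(d)
    best, second = order[0], order[1]
    if d[best] < ratio * d[second]:
        return cand_ids[best]
    return None

def prioritized_2d_3d_matching(features, points_in_word, n_target=200, ratio=0.7):
    """features: list of (word_id, descriptor).
    points_in_word: word_id -> (point_ids, descriptor_array).
    Features are processed cheapest-first (shortest candidate list),
    stopping once n_target correspondences are found."""
    heap = []
    for i, (word, _) in enumerate(features):
        ids, _descs = points_in_word.get(word, ([], None))
        heapq.heappush(heap, (len(ids), i))   # search cost = candidate count
    matches = []
    while heap and len(matches) < n_target:
        _cost, i = heapq.heappop(heap)
        word, desc = features[i]
        ids, descs = points_in_word.get(word, ([], None))
        if descs is None:
            continue
        hit = ratio_test_match(desc, descs, ids, ratio)
        if hit is not None:
            matches.append((i, hit))          # (feature index, 3D point id)
    return matches
```

In the full method, each accepted match would additionally trigger the 3D-to-2D active search around the matched point, with the new candidates pushed into the same priority queue under their predicted search costs.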
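For the final pose estimation step, a bare 6-point DLT can be sketched as below. This is only the inner model-fitting step, assuming noise-free correspondences; the RANSAC loop around it, and the decomposition of P into rotation and translation, are omitted.

```python
import numpy as np

def dlt_pose(pts3d, pts2d):
    """Estimate a 3x4 projection matrix P from >= 6 2D-3D correspondences
    with the Direct Linear Transform: stack two linear constraints per
    correspondence and take the null-space direction of the system."""
    assert len(pts3d) >= 6 and len(pts3d) == len(pts2d)
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        Xh = [X, Y, Z, 1.0]
        rows.append([0, 0, 0, 0] + [-x for x in Xh] + [v * x for x in Xh])
        rows.append(Xh + [0, 0, 0, 0] + [-u * x for x in Xh])
    A = np.asarray(rows)
    # Solution = right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)

def project(P, X):
    """Project a 3D point with P and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```

P is recovered only up to scale, which is irrelevant after the homogeneous division in `project`; inside RANSAC one would draw random 6-point subsets, fit P, and count correspondences with small reprojection error as inliers.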
Three-dimensional (3D) human face tracking is a generic problem that has received considerable attention in the computer vision community. The main goal of 3D face tracking is to estimate parameters of human faces from video frames: i) the 6 Degrees of Freedom (DOF), consisting of the 3D translation and the three axial rotations of a person's head relative to the camera view. As is common in the literature, we adopt the terms Yaw (or Pan), Pitch (or Tilt), and Roll for the three axial rotations. Yaw describes, for example, rotating the head from right to left; Pitch relates to moving the head forward and backward; Roll corresponds to bending the head from side to side. The 6 DOF are considered rigid parameters. ii) The non-rigid parameters describe the facial muscle movements or facial animation, which are usually an early step toward recognizing facial expressions such as happy, sad, angry, disgusted, surprised, and fearful. In practice, the non-rigid parameters are often represented through detecting and tracking facial points, known as fiducial points, feature points, or landmarks in the face processing community. The word "indexing" in our report means that rigid and non-rigid parameters are estimated from video frames. Our aim is to read a video (or a webcam stream) capturing a single face (in the case of multiple faces, the largest face is selected) and to output the rigid and non-rigid parameters for each frame. The non-rigid parameters are difficult to represent directly because they depend on the application; in our study, they are represented indirectly by localizing or detecting feature points on the face.
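The three axial rotations above can be made concrete as a composed rotation matrix. The sketch below is illustrative only: the axis assignments (yaw about the vertical axis, pitch about the horizontal left-right axis, roll about the viewing axis) and the Z·X·Y composition order are one common convention, not something the report prescribes.

```python
import numpy as np

def yaw_pitch_roll_to_R(yaw, pitch, roll):
    """Compose a head rotation matrix from yaw, pitch, and roll angles
    (radians). Convention assumed here: yaw about y (vertical), pitch
    about x (left-right), roll about z (viewing axis), applied as
    R = Rz(roll) @ Rx(pitch) @ Ry(yaw)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw / pan
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch / tilt
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    return Rz @ Rx @ Ry
```

Together with a translation vector t, this R gives the full 6-DOF rigid head pose [R | t] relative to the camera.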
There are several potential applications of face tracking across many domains. The most popular is recognizing facial behaviors to support an automatic system for understanding human communication. In this context, a person's visual focus of attention is a very important cue to recognize: it is a form of nonverbal communication and an indicative signal in a conversation. For this problem, we first have to analyze the head pose to determine the direction in which people are likely looking in video sequences; it stands to reason that people may be focusing on someone or something while talking. Furthermore, head movements carry important meaning as a form of gesturing in conversation. For example, nodding the head indicates understanding or agreement, while shaking it indicates misunderstanding or disagreement with what is being said, and emphasized head movements are a conventional way of directing someone to observe a particular object or location. In addition, head pose is intrinsically linked with gaze: it provides a coarse estimate of gaze direction in situations where the eyes are invisible, such as low-resolution imagery, very low-bit-rate video recordings, or eye occlusion due to wearing sunglasses. Even when the eyes are visible, the head pose helps predict the gaze direction more accurately. Other gestures can indicate dissent, confusion, consideration, and so on. Facial animation analysis is also necessary to read what kind of expression people are showing. Facial expression occurs naturally in human communication and is one of the most cogent means for human beings to infer the attitudes and emotions of other persons in the vicinity. Expression analysis, which requires facial animation detection, is a crucial topic not only in machine vision but also in psychology.
to be continued…..