I have read that it’s possible to create a depth image from a stereo camera setup (where two cameras of identical focal length/aperture/other camera settings take photographs of an object from an angle).
Would it be possible to take two snapshots almost immediately after each other(on the iPhone for example) and use the differences between the two pictures to develop a depth image?
Small amounts of hand-movement and shaking will obviously rock the camera creating some angular displacement, and perhaps that displacement can be calculated by looking at the general angle of displacement of features detected in both photographs.
Another way to look at this problem is as structure-from-motion, a nice review of which can be found here.
Generally speaking, resolving spatial correspondence can also be factored as a temporal correspondence problem. If the scene doesn’t change, then taking two images simultaneously from different viewpoints – as in stereo – is effectively the same as taking two images using the same camera but moved over time between the viewpoints.
I recently came upon a nice toy example of this in practice – implemented using OpenCV. The article includes some links to other, more robust, implementations.
For a deeper understanding I would recommend you get hold of an actual copy of Hartley and Zisserman’s “Multiple View Geometry in Computer Vision” book.