Given an image such as the CakePHP logo, how can it be converted back into a PSD with its layers? As a human, I can easily work out how to do this. I can tell that the background is a circular shape with star-like edges, so the circular star is at the back, the cake image is on top of it, and the word "CakePHP" sits above both.
I can use Photoshop/GIMP tools to separate the image into these three parts and fill in the areas in between. Then I have three layers.

As a human, it is easy to work out the layering of most logos and images. Many images have multiple layers; the CakePHP logo is just one example. Images of the real world also have layering: there may be a tree layer on top of a background of grass. I need a general way to convert an image back to its layered representation, ideally a software solution.
In the absence of a programmed solution, are there any papers or research that solve this problem or are related to it? I am mostly interested in converting human-constructed images, such as logos or website titles, back to a layered representation.
I want to point out some benefits of doing this. If you can convert an image to a layered representation automatically, it becomes easier to modify. For example, if you want to make the cake smaller and the computer has already placed the cake on a layer above the red background, you can just scale the cake layer. This allows layer-level adjustment of images on websites that do not already have layer information.
As already mentioned, this is a non-trivial task. Ultimately, it can be most
simply phrased as: given an image (or a scene, if it is a real photo) composed
of N pixels, how can those pixels be assigned to M layers?
For segmentation, it’s all about the prior knowledge you can bring to bear on
this: which properties of pixels, and of groups of pixels, give “hints” (and
I use the word advisedly!) as to the layer they belong to.
Consider even the simplest case of using just the colour in your image. I can
generate these 5 “layers” (for hue values 0, 24, 90, 117 and 118):
With this code (in Python/OpenCV)
But even here we have to decide what is “significant” in terms of the
number of pixels that belong to a mask (to the extent that we can miss some
colours). We could start to cluster similar colours instead, but at what
density does a cluster become significant? And if a region wasn’t pure colour
but textured instead, how could we describe it? Or what about inferring that
one layer is part of another, or in front of it? Or, ultimately, that some of
the layers seem to be what we humans call “letters” and so should probably
all be related…
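The clustering idea can be sketched very simply on the hue values listed earlier, where 117 and 118 are almost certainly the same colour. This is a minimal sketch under stated assumptions: `group_hues` and its `tolerance` parameter are hypothetical, hue wrap-around (0 being next to 179 in OpenCV) is ignored, and the tolerance is exactly the kind of arbitrary choice discussed above:

```python
def group_hues(hues, tolerance=2):
    """Group hue values whose gap to the previous (sorted) value is <= tolerance."""
    groups = []
    for h in sorted(hues):
        if groups and h - groups[-1][-1] <= tolerance:
            groups[-1].append(h)   # close enough: same cluster as the previous hue
        else:
            groups.append([h])     # gap too large: start a new cluster
    return groups

groups = group_hues([0, 24, 90, 117, 118])
print(groups)
```

With a tolerance of 2 the five hue values collapse to four groups, but a different tolerance gives a different number of layers, which is the prior-knowledge problem again.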
A lot of the segmentation research in Computer Vision tries to tackle this
problem within a framework that can encode and apply this prior knowledge
effectively…