Whilst the constructive theories may have some limited role in our perception, it appears that their view is too simplistic and unable to cope with all the observed phenomena. We turn, then, to an idea which contrasts with those of the constructivists: the direct perception theory. Direct theorists argue in favour of 'bottom-up' processing, where little world knowledge is required in order to make sense of what is being seen. They suggest that general constraints are used when analysing the visual input, with inference playing only a minor role when perceptual ambiguities occur. One of the most influential and complete theories of vision, from both a psychological and a physiological viewpoint, comes from an advocate of the 'bottom-up' processing model, David Marr. In formulating his computational theory of vision, Marr (1976) made use of such constraints, drawing especially on the organisational rules suggested by the Gestaltists. Their fundamental principle was the law of Prägnanz, summarised by Koffka (1935) thus: "psychological organisation will always be as 'good' as possible." In this sense, 'good' refers to the simplest or most uniform of the available alternatives, hence their use of such laws as the 'law of proximity' and the 'law of similarity', which group stimuli together according to the simplest relationship that can explain them.
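By way of illustration, the law of proximity can be caricatured in a few lines of code. This is a toy sketch of my own, not anything from the Gestalt literature: elements in a one-dimensional array separated by less than some gap are assigned to the same perceptual group.

```python
def group_by_proximity(positions, gap):
    """Gestalt-style proximity grouping: sorted elements closer than
    `gap` to their neighbour are placed in the same perceptual group."""
    groups = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev <= gap:
            groups[-1].append(cur)   # close enough: same group
        else:
            groups.append([cur])     # large gap: start a new group
    return groups

print(group_by_proximity([1, 2, 3, 10, 11, 12], gap=2))
# [[1, 2, 3], [10, 11, 12]] -- two clusters, the 'simplest' organisation
```

The law of similarity could be caricatured in the same way, with a predicate on feature type replacing the distance test.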
Marr suggested several levels of representation of the visual input, offering physiological and psychological evidence in support of his theories, and we will now consider these in turn.
Perhaps the starting point for Marr's theory lies in his explanation of vision at several levels. Firstly, we have the computational level, in which he examines not simply what the visual system does but what exactly vision is for, i.e. what the host's requirements are. Secondly, we have the algorithmic level - the link between the requirements of the system and the third level, the hardware, in this case the brain. Marr proposed that each level creates a representation, and that these provide increasingly detailed descriptions of the visual input. In fact, Marr splits the first level into two representations: the raw primal sketch and the full primal sketch.
At this first level, Marr claims that the visual perceiver forms a representation of light intensity values which can then be used to pick out features - edges, bars, blobs etc., which he calls 'tokens' and which make up the raw primal sketch - and descriptors - length, orientation etc. - which serve to give some indication of the relationships between these tokens. It is at this point that Marr introduces evidence from the field of neurophysiology to back up his claims. Cortical cells have been discovered that respond to changes in light intensity across them (an edge can be thought of as a change in light intensity), and hence these could serve as a basis for edge detection. Marr and Hildreth (1980) implemented a computer algorithm which could detect edges in an image. Using the zero crossings of the second derivative of the light intensity function, they were able to pick out edges to a degree of resolution determined by a blurring function incorporated in the system, and they proposed that this is exactly what is done in mammalian visual systems in building up the raw primal sketch. This rather jumbled collection of blobs and edges can now be refined to obtain the full primal sketch, and it is here that Marr suggests the use of the hard-wired Gestaltist knowledge of organisation discussed earlier to group this series of tokens and descriptors according to their similar properties. In his attempts to write a computer program to move from the raw primal sketch to the full primal sketch, Marr found that these organisational rules yielded two particularly useful techniques: 'the principle of explicit naming' and 'the principle of least commitment'. The first of these describes the naming of a small set of grouped elements, which can then be used alongside other named groups to form a larger grouping, whilst the latter refers to the resolution of ambiguities only when conclusive proof of the correct resolution is found.
This rule is useful in avoiding early mistakes which can lead to more errors at a later level of processing.
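The zero-crossing idea can be sketched concretely. The following is a minimal one-dimensional illustration of my own, not Marr & Hildreth's actual scheme (which convolved 2-D images with the Laplacian of a Gaussian): a step in light intensity is blurred with a Gaussian, the second derivative is approximated by finite differences, and sign changes in it mark candidate edges.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian blurring kernel, normalised to sum to 1."""
    vals = [math.exp(-(x * x) / (2 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def convolve(signal, kernel):
    """Same-length convolution, replicating the border samples."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), len(signal) - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def second_derivative(signal):
    """Central finite-difference approximation to the second derivative."""
    return [signal[max(i - 1, 0)] - 2 * signal[i]
            + signal[min(i + 1, len(signal) - 1)]
            for i in range(len(signal))]

def zero_crossings(signal, threshold=1e-6):
    """Indices where the sign changes -- the candidate edge locations."""
    return [i for i in range(1, len(signal))
            if signal[i - 1] * signal[i] < -threshold]

# A step in light intensity: a dark region followed by a light one.
profile = [10.0] * 20 + [200.0] * 20
blurred = convolve(profile, gaussian_kernel(sigma=2.0, radius=6))
edges = zero_crossings(second_derivative(blurred))
print(edges)  # a single zero crossing, at the intensity step
```

Widening `sigma` reproduces the role of Marr & Hildreth's blurring function: coarser blurring suppresses fine detail, so only large-scale intensity changes survive as zero crossings.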
To progress to the next level in his theory, Marr proposes the 2½-D sketch. This level takes the detail of tokens and descriptors from the full primal sketch and adds an impression of depth to the representation through such cues as binocular disparity, texture cues, occlusion and the analysis of structure from motion. Marr & Poggio (1976) formulated several rules to ensure correct alignment of the input from the two eyes when assessing disparity - the so-called 'correspondence problem'. To gauge disparity accurately it is important that the two images correspond, and the rules they outline ensure that the correct elements are matched. However, as Marr points out, this level of representation is not truly a 3-D one: what is actually being given is an idea of the orientation of surfaces. He observes that we can compare the surface orientations of two different parts of the visual field rather better than we can determine their relative depths, since depth cues are largely obtained via occlusion - if one object occludes another, we know only that it is in front of it.
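A toy version of the correspondence problem can make the flavour of such rules concrete. This sketch is my own construction, not Marr & Poggio's algorithm: features match only if they are of the same type (a compatibility constraint), each right-image feature is used at most once (uniqueness), and the smallest disparity is preferred (a crude stand-in for their continuity constraint).

```python
def match_features(left, right):
    """left/right: lists of (position, feature_type) from the two eyes.
    Returns (left_position, disparity) pairs for each matched feature."""
    used = set()       # uniqueness: right features matched at most once
    matches = []
    for pos_l, kind_l in left:
        best = None
        for j, (pos_r, kind_r) in enumerate(right):
            if j in used or kind_r != kind_l:
                continue           # compatibility: same feature type only
            d = pos_l - pos_r      # disparity between the two images
            if best is None or abs(d) < abs(best[1]):
                best = (j, d)      # prefer the smallest disparity
        if best is not None:
            used.add(best[0])
            matches.append((pos_l, best[1]))
    return matches

left = [(10, "edge"), (25, "bar"), (40, "edge")]
right = [(8, "edge"), (23, "bar"), (38, "edge")]
print(match_features(left, right))  # [(10, 2), (25, 2), (40, 2)]
```

A uniform disparity of 2, as here, would indicate a surface at constant depth; variation in disparity across matches is what carries the depth information in the 2½-D sketch.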
The final level in Marr's theory is the 3-D model. This level is a necessary advance on the 2½-D sketch, since the latter has limitations which must be overcome if the perceiver is to gain a true understanding of the world he is viewing. Whilst the 2½-D sketch does give some depth information, it operates primarily on surface cues, which necessarily means that the observer knows nothing about those surfaces hidden from view. Secondly, it is 'observer centred': the perceived world will look vastly different from different angles, rendering almost impossible any matching of objects in the 2½-D sketch with those stored in memory. To take account of these requirements, Marr & Nishihara (1978) identified three criteria desirable for a 3-D representation: accessibility - the ease of construction of the representation; scope - the extent to which the representation is applicable to all members of a category; and stability/sensitivity - the incorporation of both the similarities between members of a category and their individual differences. Whilst Marr & Nishihara do not go into much depth in describing how the 3-D model is represented, they do suggest that the primitive units for constructing objects in it should be cylinders, and they discuss the importance of the axes of these 'building blocks', describing how a human form could be thought of at various levels of detail: as a single cylinder with a vertical axis; at a greater level of detail as a cylinder for the body with further cylinders for the limbs at different axes to it; and so on.
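The coarse-to-fine cylinder scheme lends itself naturally to a recursive data structure. The sketch below is my own rendering of the idea, with names of my own choosing: each cylinder has an axis and may be refined into component cylinders, so asking for a deeper level of detail decomposes the figure further.

```python
class Cylinder:
    """One 'building block' in a hierarchical 3-D model: a named
    cylinder with an axis, optionally refinable into finer parts."""
    def __init__(self, name, axis, parts=None):
        self.name = name
        self.axis = axis          # e.g. "vertical", "oblique"
        self.parts = parts or []  # component cylinders at the next level

    def describe(self, level):
        """Names of the cylinders visible at the given level of detail."""
        if level == 0 or not self.parts:
            return [self.name]
        names = []
        for part in self.parts:
            names.extend(part.describe(level - 1))
        return names

human = Cylinder("human", "vertical", [
    Cylinder("torso", "vertical"),
    Cylinder("head", "vertical"),
    Cylinder("arm", "oblique", [Cylinder("upper arm", "oblique"),
                                Cylinder("forearm", "oblique")]),
    Cylinder("leg", "vertical"),
])
print(human.describe(0))  # ['human']
print(human.describe(1))  # ['torso', 'head', 'arm', 'leg']
print(human.describe(2))  # ['torso', 'head', 'upper arm', 'forearm', 'leg']
```

Because each description is organised around the object's own axes rather than the observer's viewpoint, the same structure would be recovered from any viewing angle - exactly the object-centred property the 2½-D sketch lacks.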
In the field of cognitive psychology, one way often used to determine whether a mental process occurs in the way we think it does is to study patients who have suffered brain damage and may thus have lost certain mental functions. If Marr's theories are correct, it seems reasonable to expect to find patients who lack the use of one or more of the representations described in his model. Further, it would seem to follow that impairment of the ability to construct the 2½-D sketch would lead to impairment of the ability to form the 3-D model. Benson & Greenberg (1969) describe the case of patient S, who did indeed seem to possess residual perceptual skills enabling him to distinguish between small intensity differences - suggesting some ability to form the primal sketch - but who did not seem able to perceive objects or copy simple figures. However, Campion & Latto (1985) have reported a very similar case in which they suggest these problems are in fact due to small blind areas in the visual field, though the incidence of reported cases similar to S does not correlate highly with the incidence of such sensory deficits, and it is thus unlikely that this is the explanation in all cases. Similarly, Warrington & Taylor (1978) have reported patients who seem to have difficulty with unusual orientations of objects, suggesting impairment of the 3-D model representation and again seemingly lending support to Marr's theories.
To look now at a very different approach to the problem of visual perception, we consider the work of J.J. Gibson (1950). Unlike the other theorists we have looked at, Gibson rejected the idea of a passive retinal image which is analysed and broken down, postulating instead that the changes in the visual field as a whole are what matter - what he terms 'optic flow'. This theory argues for direct perception in a different way from Marr's: whilst it too requires little inference from past experience, no detailed processing of the image is necessary either, since it is the change in features with which it is primarily concerned. Rather than suggesting that we extract tokens such as blobs and bars, as does Marr, Gibson says that it is the relationships of the elements in the picture as a whole that give us the cues we need to understand our environment, thus linking perception and action together in a way that none of the other theorists have attempted. Gibson argues that certain characteristics exist in the visual field that can inform us of depth, size etc. Referring to texture gradients, he argues that, for example, observers know that as stimuli get further away their details change - parallel lines seem to converge, equally sized objects seem to get smaller etc. - whilst some aspects of the optic flow are invariant: the ratio at which the horizon intersects objects of the same height is the same regardless of their distance from the observer. Gibson says that these cues can be picked up directly and combined with information about the pole - the point towards which one is moving and from which elements in the optic array radiate - to give locomotive information about oneself and one's environment.
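The horizon-ratio invariant can be checked numerically. The calculation below is my own construction, not Gibson's: for an observer whose eye is at height h above a flat ground plane, an object of height H standing at distance d projects to an image whose fraction below the horizon is always h/H, whatever the value of d.

```python
def horizon_ratio(object_height, eye_height, distance):
    """Fraction of the object's projected image lying below the horizon,
    for an upright object on a flat ground plane (pinhole projection)."""
    top = (object_height - eye_height) / distance   # projected top edge
    bottom = -eye_height / distance                 # projected bottom edge
    return -bottom / (top - bottom)                 # horizon sits at zero

near = horizon_ratio(object_height=4.0, eye_height=1.6, distance=5.0)
far = horizon_ratio(object_height=4.0, eye_height=1.6, distance=50.0)
print(near, far)  # both 0.4, i.e. eye_height / object_height
```

The object's projected size shrinks tenfold between the two distances, yet the horizon cuts it at the same proportion - the kind of invariant Gibson claims can be picked up directly, without inference.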
Whilst Gibson's early ideas do not seem unreasonable, and describe phenomena which we can all experience, his later work (1966, 1979) is far more radical, seeking to cope with the problem of applying meaning to our surroundings through the notion of affordance. An affordance, Gibson says, is how a stimulus appears to the perceiver in terms of what meaning it has for them at that particular time. Hence, that which on one occasion may afford to someone the idea of 'reading material' may, on a different occasion, afford the idea of 'weapon'. Clearly this is a very dubious area, as objects may have myriad different associations and meanings with which affordance simply could not cope.
Having considered the various standpoints in explaining human visual perception, let us now examine them with reference to the question of whether it is possible to model perception on a computer. Looking at the arguments of exponents of the constructivist school of thought, for example Gregory and Neisser, it seems that such an approach would be an unsuitable basis for a digital system. Whilst it operates on a static, passive visual input which can be sampled and analysed, the mechanisms by which any information is to be recognised are far too abstract, assuming a vast wealth of knowledge gained from experience of the world and relying on an ability to adapt that knowledge to the visual environment presented. Indeed, as a theory it offers little suggestion of the analytical processes involved in extracting information from the visual field, nor does it give much useful information as to how the inferencing is done once the information has been obtained.
The approach of David Marr has been seen to have possibilities for modelling on a computer; indeed, Marr himself implemented many of his ideas as algorithms with some success, though these operate primarily at a simple level, concerned with breaking down the details of an image and picking out certain forms. The failures have come in trying to get a machine to recognise what any of these forms are and what they actually mean to us. It seems that more understanding of the processes by which humans store and retrieve information is required before developments can be made in this area.
Finally, to turn to the theories of J.J. Gibson: the ideas of optic flow and of the relationships between elements in the visual field do seem a worthwhile contribution to the topic. It is certainly the case that we are usually in motion, and to think of the information we are processing as a series of static images is therefore perhaps flawed. The ideas of a central point of expansion as a reference for motion and of the use of texture gradients to gain direct information about our environment seem sound, and would not be hard to implement on a computer. Perhaps if these ideas could be incorporated into a system with the feature detection techniques of Marr & Hildreth, together with an intelligent database for reference in deciding what exactly was being seen, a moderately useful facsimile might yet be attained. However, with regard to Gibson's other ideas concerning 'affordances' - concepts which seem at best abstract and at worst almost random in nature - it seems to this author ludicrous to try to implement them on a machine, and since there seems little support for them either empirically or intuitively, it would also seem fairly silly to want to try.
Bibliography
Bruce, V. & Green, P. (1985). Visual Perception: physiology, psychology and ecology. Lawrence Erlbaum Associates Ltd.
Eysenck, M.W. & Keane, M.T. (1991). Cognitive Psychology. Lawrence Erlbaum Associates Ltd.
Gordon, I.E. Theories of Visual Perception. Wiley.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. Freeman.