Whilst the constructive theories may have some limited role in our perception, it appears that their view is too simplistic and unable to cope with all the observed phenomena. We turn, then, to an idea which contrasts with those of the constructivists: the direct perception theory. Direct theorists argue in favour of 'bottom-up' processing, where little world knowledge is required in order to make sense of what is being seen. They suggest that general constraints are used when analysing the visual input, with inference playing only a minor role when perceptual ambiguities occur. One of the most influential and complete theories of vision, from both a psychological and a physiological viewpoint, comes from an advocate of the 'bottom-up' processing model, David Marr. In formulating his computational theory of vision, Marr (1976) made use of such constraints, drawing especially on the organisational rules suggested by the Gestaltists. Their fundamental principle was the law of Prägnanz, summarised by Koffka (1935) thus: "psychological organisation will always be as 'good' as possible." In this sense, 'good' refers to the simplest or most uniform of the available alternatives, hence their use of such laws as the 'law of proximity' and the 'law of similarity', which group stimuli together according to the simplest relationship that can explain them.
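By way of illustration, the law of proximity can be caricatured in a few lines of code. This is a toy sketch of my own, not anything from the Gestalt literature: elements in a one-dimensional array separated by less than some gap are assigned to the same perceptual group.

```python
def group_by_proximity(positions, gap):
    """Gestalt-style proximity grouping: sorted elements closer than
    `gap` to their neighbour are placed in the same perceptual group."""
    groups = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev <= gap:
            groups[-1].append(cur)   # close enough: same group
        else:
            groups.append([cur])     # large gap: start a new group
    return groups

print(group_by_proximity([1, 2, 3, 10, 11, 12], gap=2))
# [[1, 2, 3], [10, 11, 12]] -- two clusters, the 'simplest' organisation
```

The law of similarity could be caricatured in the same way, with a predicate on feature type replacing the distance test.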
Marr suggested several levels of representation of the visual input, offering physiological and psychological evidence in support of his theories, and we will now consider these in turn.
Perhaps the starting point for Marr's theory lies in his explanation of vision at several levels. Firstly, we have the computational level, in which he examines not simply what the visual system does but what exactly vision is for, i.e. what the host's requirements are. Secondly, we have the algorithmic level - the link between the requirements of the system and the third level, the hardware, in this case the brain. Marr proposed that each level creates a representation, and that these provide increasingly detailed descriptions of the visual input. In fact, Marr splits the first level into two representations: the raw primal sketch and the full primal sketch.
At this first level, Marr claims that the visual perceiver forms a representation of light intensity values which can then be used to pick out features - edges, bars, blobs etc., which he calls 'tokens' and which make up the raw primal sketch - and descriptors - length, orientation etc. - which serve to give some indication of the relationships between these tokens. It is at this point that Marr introduces evidence from the field of neurophysiology to back up his claims. Cortical cells have been discovered that respond to changes in light intensity across them (an edge can be thought of as a change in light intensity), and hence these could serve as a basis for edge detection. Marr and Hildreth (1980) implemented a computer algorithm which could detect edges in an image. Using the zero crossings of the second derivative of the light intensity function, they were able to pick out edges to a degree of resolution determined by a blurring function incorporated in the system, and they proposed that this is exactly what is done in mammalian visual systems in building up the raw primal sketch. This rather jumbled collection of blobs and edges can now be refined to obtain the full primal sketch, and it is here that Marr suggests the use of the hard-wired Gestaltist knowledge of organisation discussed earlier to group this series of tokens and descriptors according to their similar properties. In his attempts to write a computer program to move from the raw primal sketch to the full primal sketch, Marr found that these organisational rules yielded two particularly useful techniques: 'the principle of explicit naming' and 'the principle of least commitment'. The first of these describes the naming of a small set of grouped elements, which can then be used alongside other named groups to form a larger grouping, whilst the latter refers to the resolution of ambiguities only when conclusive proof of the correct resolution is found.
This rule is useful in avoiding early mistakes which can lead to more errors at a later level of processing.
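The zero-crossing idea can be sketched concretely. The following is a minimal one-dimensional illustration of my own, not Marr & Hildreth's actual scheme (which convolved 2-D images with the Laplacian of a Gaussian): a step in light intensity is blurred with a Gaussian, the second derivative is approximated by finite differences, and sign changes in it mark candidate edges.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian blurring kernel, normalised to sum to 1."""
    vals = [math.exp(-(x * x) / (2 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def convolve(signal, kernel):
    """Same-length convolution, replicating the border samples."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), len(signal) - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def second_derivative(signal):
    """Central finite-difference approximation to the second derivative."""
    return [signal[max(i - 1, 0)] - 2 * signal[i]
            + signal[min(i + 1, len(signal) - 1)]
            for i in range(len(signal))]

def zero_crossings(signal, threshold=1e-6):
    """Indices where the sign changes -- the candidate edge locations."""
    return [i for i in range(1, len(signal))
            if signal[i - 1] * signal[i] < -threshold]

# A step in light intensity: a dark region followed by a light one.
profile = [10.0] * 20 + [200.0] * 20
blurred = convolve(profile, gaussian_kernel(sigma=2.0, radius=6))
edges = zero_crossings(second_derivative(blurred))
print(edges)  # a single zero crossing, at the intensity step
```

Widening `sigma` reproduces the role of Marr & Hildreth's blurring function: coarser blurring suppresses fine detail, so only large-scale intensity changes survive as zero crossings.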
To progress to the next level in his theory, Marr proposes the 2½-D sketch. This level takes the detail of tokens and descriptors from the full primal sketch and adds an impression of depth to the representation through such cues as binocular disparity, texture cues, occlusion and the analysis of structure from motion. Marr & Poggio (1976) formulated several rules to ensure correct alignment of the input from the two eyes when assessing disparity - the so-called 'correspondence problem'. To gauge disparity accurately it is important that the two images correspond, and the rules they outline ensure that the correct elements are matched. However, as Marr points out, this level of representation is not truly a 3-D one: what is actually being given is an idea of the orientation of surfaces. He observes that we can compare the surface orientations of two different parts of the visual field rather better than we can determine their relative depths, since depth cues are largely obtained via occlusion - if one object occludes another, we know only that it is in front of it.
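A toy version of the correspondence problem can make the flavour of such rules concrete. This sketch is my own construction, not Marr & Poggio's algorithm: features match only if they are of the same type (a compatibility constraint), each right-image feature is used at most once (uniqueness), and the smallest disparity is preferred (a crude stand-in for their continuity constraint).

```python
def match_features(left, right):
    """left/right: lists of (position, feature_type) from the two eyes.
    Returns (left_position, disparity) pairs for each matched feature."""
    used = set()       # uniqueness: right features matched at most once
    matches = []
    for pos_l, kind_l in left:
        best = None
        for j, (pos_r, kind_r) in enumerate(right):
            if j in used or kind_r != kind_l:
                continue           # compatibility: same feature type only
            d = pos_l - pos_r      # disparity between the two images
            if best is None or abs(d) < abs(best[1]):
                best = (j, d)      # prefer the smallest disparity
        if best is not None:
            used.add(best[0])
            matches.append((pos_l, best[1]))
    return matches

left = [(10, "edge"), (25, "bar"), (40, "edge")]
right = [(8, "edge"), (23, "bar"), (38, "edge")]
print(match_features(left, right))  # [(10, 2), (25, 2), (40, 2)]
```

A uniform disparity of 2, as here, would indicate a surface at constant depth; variation in disparity across matches is what carries the depth information in the 2½-D sketch.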
The final level in Marr's theory is the 3-D model. This level is a necessary advance on the 2½-D sketch, since the latter has limitations which must be overcome if the perceiver is to gain a true understanding of the world he is viewing. Whilst the 2½-D sketch does give some depth information, it operates primarily on surface cues, which necessarily means that the observer knows nothing about those surfaces hidden from view. Secondly, it is 'observer centred': the perceived world will look vastly different from different angles, rendering almost impossible any matching of objects in the 2½-D sketch with those stored in memory. To take account of these requirements, Marr & Nishihara (1978) identified three criteria desirable for a 3-D representation: accessibility - the ease of construction of the representation; scope - the extent to which the representation is applicable to all members of a category; and stability/sensitivity - the incorporation of both the similarities between members of a category and their individual differences. Whilst Marr & Nishihara do not go into much depth in describing how the 3-D model is represented, they do suggest that the primitive units for constructing objects in it should be cylinders, and they discuss the importance of the axes of these 'building blocks', describing how a human form could be thought of at various levels of detail: as a single cylinder with a vertical axis; at a greater level of detail as a cylinder for the body with further cylinders for the limbs at different axes to it; and so on.
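The coarse-to-fine cylinder scheme lends itself naturally to a recursive data structure. The sketch below is my own rendering of the idea, with names of my own choosing: each cylinder has an axis and may be refined into component cylinders, so asking for a deeper level of detail decomposes the figure further.

```python
class Cylinder:
    """One 'building block' in a hierarchical 3-D model: a named
    cylinder with an axis, optionally refinable into finer parts."""
    def __init__(self, name, axis, parts=None):
        self.name = name
        self.axis = axis          # e.g. "vertical", "oblique"
        self.parts = parts or []  # component cylinders at the next level

    def describe(self, level):
        """Names of the cylinders visible at the given level of detail."""
        if level == 0 or not self.parts:
            return [self.name]
        names = []
        for part in self.parts:
            names.extend(part.describe(level - 1))
        return names

human = Cylinder("human", "vertical", [
    Cylinder("torso", "vertical"),
    Cylinder("head", "vertical"),
    Cylinder("arm", "oblique", [Cylinder("upper arm", "oblique"),
                                Cylinder("forearm", "oblique")]),
    Cylinder("leg", "vertical"),
])
print(human.describe(0))  # ['human']
print(human.describe(1))  # ['torso', 'head', 'arm', 'leg']
print(human.describe(2))  # ['torso', 'head', 'upper arm', 'forearm', 'leg']
```

Because each description is organised around the object's own axes rather than the observer's viewpoint, the same structure would be recovered from any viewing angle - exactly the object-centred property the 2½-D sketch lacks.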
In the field of cognitive psychology, one way often used to determine whether a mental process occurs in the way we think it does is to study patients who have suffered brain damage and may thus have lost certain mental functions. If Marr's theories are correct, it seems reasonable to expect to find patients who lack the use of one or more of the representations described in his model. Further, it would seem to follow that impairment of the ability to construct the 2½-D sketch would lead to impairment of the ability to form the 3-D model. Benson & Greenberg (1969) describe the case of patient S, who did indeed seem to possess residual perceptual skills enabling him to distinguish between small intensity differences - suggesting some ability to form the primal sketch - but who did not seem able to perceive objects or copy simple figures. However, Campion & Latto (1985) have reported a very similar case in which they suggest these problems are in fact due to small blind areas in the visual field, though the incidence of reported cases similar to S does not correlate highly with the incidence of such sensory deficits, and it is thus unlikely that this is the explanation in all cases. Similarly, Warrington & Taylor (1978) have reported patients who seem to have difficulty with unusual orientations of objects, suggesting impairment of the 3-D model representation and again seemingly lending support to Marr's theories.
To look now at a very different approach to the problem of visual perception, we consider the work of J.J. Gibson (1950). Unlike the other theorists we have looked at, Gibson rejected the idea of a passive retinal image which is analysed and broken down, postulating instead that the changes in the visual field as a whole are what matter - what he terms 'optic flow'. This theory argues for direct perception in a different way from Marr's: whilst it too requires little inference from past experience, no detailed processing of the image is necessary either, since it is the change in features with which it is primarily concerned. Rather than suggesting that we extract tokens such as blobs and bars, as does Marr, Gibson says that it is the relationships of the elements in the picture as a whole that give us the cues we need to understand our environment, thus linking perception and action together in a way that none of the other theorists have attempted. Gibson argues that certain characteristics exist in the visual field that can inform us of depth, size etc. Referring to texture gradients, he argues that, for example, observers know that as stimuli get further away their details change - parallel lines seem to converge, equally sized objects seem to get smaller etc. - whilst some aspects of the optic flow are invariant: the ratio at which the horizon intersects objects of the same height is the same regardless of their distance from the observer. Gibson says that these cues can be picked up directly and combined with information about the pole - the point towards which one is moving and from which elements in the optic array radiate - to give locomotive information about oneself and one's environment.
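The horizon-ratio invariant can be checked numerically. The calculation below is my own construction, not Gibson's: for an observer whose eye is at height h above a flat ground plane, an object of height H standing at distance d projects to an image whose fraction below the horizon is always h/H, whatever the value of d.

```python
def horizon_ratio(object_height, eye_height, distance):
    """Fraction of the object's projected image lying below the horizon,
    for an upright object on a flat ground plane (pinhole projection)."""
    top = (object_height - eye_height) / distance   # projected top edge
    bottom = -eye_height / distance                 # projected bottom edge
    return -bottom / (top - bottom)                 # horizon sits at zero

near = horizon_ratio(object_height=4.0, eye_height=1.6, distance=5.0)
far = horizon_ratio(object_height=4.0, eye_height=1.6, distance=50.0)
print(near, far)  # both 0.4, i.e. eye_height / object_height
```

The object's projected size shrinks tenfold between the two distances, yet the horizon cuts it at the same proportion - the kind of invariant Gibson claims can be picked up directly, without inference.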
Whilst Gibson's early ideas do not seem unreasonable, and describe phenomena which we can all experience, his later work (1966, 1979) is far more radical, seeking to cope with the problem of applying meaning to our surroundings through the notion of affordance. An affordance, Gibson says, is how a stimulus appears to the perceiver in terms of what meaning it has for them at that particular time. Hence, that which on one occasion may afford to someone the idea of 'reading material' may, on a different occasion, afford the idea of 'weapon'. Clearly this is a very dubious area, as objects may have myriad different associations and meanings with which affordance simply could not cope.
Having considered the various standpoints in explaining human visual perception, let us now examine them with reference to the question of whether it is possible to model perception on a computer. Looking at the arguments of exponents of the constructivist school of thought, for example Gregory and Neisser, it seems that such an approach would be an unsuitable basis for a digital system. Whilst it operates on a static, passive visual input which can be sampled and analysed, the mechanisms by which any information is to be recognised are far too abstract, assuming a vast wealth of knowledge gained from experience of the world and relying on an ability to adapt that knowledge to the visual environment presented. Indeed, as a theory it offers little suggestion of the analytical processes involved in extracting information from the visual field, nor does it give much useful information as to how the inferencing is done once the information has been obtained.
The approach of David Marr has been seen to have possibilities for modelling on a computer; indeed, Marr himself implemented many of his ideas as algorithms with some success, though these operate primarily at a simple level, concerned with breaking down the details of an image and picking out certain forms. The failures have come in trying to get a machine to recognise what any of these forms are and what they actually mean to us. It seems that more understanding of the processes by which humans store and retrieve information is required before developments can be made in this area.
Finally, to turn to the theories of J.J. Gibson: the ideas of optic flow and of the relationships between elements in the visual field do seem a worthwhile contribution to the topic. It is certainly the case that we are usually in motion, and to think of the information we are processing as a series of static images is therefore perhaps flawed. The ideas of a central point of expansion as a reference for motion and of the use of texture gradients to gain direct information about our environment seem sound, and would not be hard to implement on a computer. Perhaps if these ideas could be incorporated into a system with the feature detection techniques of Marr & Hildreth, together with an intelligent database for reference in deciding what exactly was being seen, a moderately useful facsimile might yet be attained. However, with regard to Gibson's other ideas concerning 'affordances' - concepts which seem at best abstract and at worst almost random in nature - it seems to this author ludicrous to try to implement them on a machine, and since there seems little support for them either empirically or intuitively, it would also seem fairly silly to want to try.
Bibliography
Bruce, V. & Green, P. (1985). Visual Perception: physiology, psychology and ecology. Lawrence Erlbaum Associates Ltd.
Eysenck, M.W. & Keane, M.T. (1991). Cognitive Psychology. Lawrence Erlbaum Associates Ltd.
Gordon, I.E. Theories of Visual Perception. Wiley.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. Freeman.