I’ve been involved in R&D of virtual reality and augmented reality applied to visual entertainment for almost twenty years. In the last two decades I’ve had the opportunity to see and test many prototypes and brilliant experimental models, getting much inspiration from others’ work across the world.
However, the one I remember most vividly is the opening keynote of SIGGRAPH 2009, when an inspired Will Wright, the designer of The Sims, delighted the audience with the concept of “playing with perception”. He did it in his unique style, through a very long presentation that unfolded more like a clever series of puzzles for those present than a tedious series of slides.
Among the many micro-experiments that jazzed up that wonderful speech, one in particular impressed me and kept me thinking for several weeks about the inner rules of perception. It eventually led me to publish a few articles and had a direct impact on the digital products I was in charge of at FOX International Channels. After calculating the average age of the audience and collecting a small set of socio-demographic data by show of hands, Wright was able to anticipate the result of a subsequent survey about the “cuteness rating” of a particular photo of his cat Argon: 65/100. Every other photo he showed got a lower score, while that one received a rating of exactly 65/100.
Why was that picture liked the most? How could he predict the result so precisely without resorting to mentalism tricks? What relationship existed between demographic data, a short survey on the response to certain visual stimuli (some pictures), and the ability to forecast such a precise score? What emerged, unbeknown to television and film directors as well as to great photographers, is that composition and framing have a large impact on the quality of perception of any content and on its significance. The more these factors conform to recognizable stylistic patterns, the more quickly and easily they are assimilated at the cognitive level: our mind works by visual comparison of simplified structures of reality.
In that picture, Argon was resting with his legs closer to the camera than his head, so the head itself appeared smaller. The head, apparently shrunken by perspective, led viewers to judge a chubby cat weighing 5 kg as smaller than it actually was, and thus made it look more “cute.” Indeed, if you think about it, the physical size of an object has a direct relationship with the meaning of its linguistic diminutive, and only the context determines whether the diminutive works as an affectionate nickname or as a derogatory term. A cat lying handsomely like a feline version of Pauline Bonaparte is obviously cute.
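The perspective effect described here can be made concrete with a little geometry. What follows is a minimal sketch, not something taken from Wright’s talk: it assumes a simple model in which an object seen head-on subtends a visual angle of 2·atan(size / (2·distance)). The function name and all the numbers (a head 10 cm wide, seen from 0.5 m and from 1.0 m) are hypothetical and chosen only for illustration.

```python
import math


def visual_angle_deg(size_m: float, distance_m: float) -> float:
    """Visual angle (in degrees) subtended by an object seen head-on.

    Assumes a simple geometric model: angle = 2 * atan(size / (2 * distance)).
    """
    if size_m <= 0 or distance_m <= 0:
        raise ValueError("size and distance must be positive")
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))


# Hypothetical numbers, for illustration only.
HEAD_WIDTH_M = 0.10

near_angle = visual_angle_deg(HEAD_WIDTH_M, 0.5)  # head close to the camera
far_angle = visual_angle_deg(HEAD_WIDTH_M, 1.0)   # head twice as far away

print(f"near: {near_angle:.2f} deg, far: {far_angle:.2f} deg")
```

Under these assumptions, doubling the distance roughly halves the visual angle, so a head placed further from the lens than the legs takes up noticeably less of the frame even though the cat itself has not changed.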
The lesson to be learned from the inner workings of Wright’s little experiment is that the intake capacity of our visual perception system greatly exceeds that of our cognitive processing system. As a consequence, our mind is constantly racing to analyze every mental scenario and assess the possible responses to a given situation, well before we act consciously. In order to operate at an adequate speed (so that we do not hit the wall before realizing that the wall is there), the mind simplifies reality by modelling it into a simpler form, a sort of “meaning silhouette” of its original complexity. At any given moment, the image we see is compared with the silhouettes available “on file” and, should a match be found, an appropriate response among those pre-built (by genetics or experience) is triggered.
Canonical mental scenarios, even those “genetically” inherited, prevail over new ones and over those that cannot be properly framed: first we compare reality with the most common models, then with less common ones; among the former, scenarios that relate to preserving our own safety and security take precedence over those of discovery and innovation. Thanks to this prioritization mechanism, should we find ourselves in front of a tiger waiting for us just around the corner, we would have an adrenaline rush, our arrector pili muscles would contract, causing goose bumps, and we would almost instantly jump back, well before our consciousness realized that we were face to face with a… stuffed feline!
However, if we happened to find ourselves in front of a painting of a tiger, would we have the same kind of response? Obviously not! So there must be a relationship between the amount of input information we get via the visual system and our “primitive” recognition mechanism operating at the cognitive processing level. If the amount of information is insufficient to reach the matching threshold of the mental model, the visceral and atavistic reaction is inhibited in some way: our emotional participation in the world that unfolds before our eyes is slower because we have to make multiple comparisons with the whole dataset of our mental models, and by the time the task is completed conscious, analytical perception has taken over. This doesn’t mean that our emotional response and the actions we take will be ice cold and aseptic, but rather that a different kind of emotional involvement is triggered, different in both scope and strength. One can see a reflection of this in the way newspapers and TV shows represent reality: in order to shake our consciousness at levels comparable to direct, deep, live experiences, they need to push much more “potent”, “violent” and “crude” stimuli. To some degree, this also explains why the language of mass media has become so barbaric.
Apparently, there is a minimum information threshold that triggers a certain mental scenario, thus causing the proper responses at a pre-conscious stage. Having more information than the minimum required leads us to evaluate the scenario in a different way, as if we had multiple versions of a scenario layered by “data resolution” and the amount of information determined whether we trigger an instinctive or an analytical model. Nonetheless, this doesn’t imply that by increasing the amount of information ad libitum we enhance our ability to map reality onto the relevant responses. First of all, even a small increase in data input might shift the perceived reality towards a different mental model. For example, a black background with a few hundred sparkling dots may be interpreted as a starry night, but if the dots number in the thousands (only an order of magnitude more) the mental model being triggered could be that of old-fashioned television static, with the aggravating circumstance that the threshold is very subjective: adding a few tens of dots at a time, we could get similar but unpredictably different responses from person to person. Secondly, an excessive amount of information could exceed our analytical capacity and be entirely ignored (this is known as “cognitive saturation”), wasting bandwidth and time (a cornfield seen through frosted glass increases the variance and amount of information available but reduces its recognizability). Thirdly, the representation of an uncommon reality (e.g. a non-ordinary landscape) doesn’t allow the mind to map what we see nimbly onto pre-existing mental models; it may therefore not need a very large amount of information to be credible or, conversely, a larger amount of information is not a sufficient condition for making it appear credible. As an example, scroll up to the very beginning of this article and take a look at the cover image: that is a REAL photo of a REAL place.
Many of you read the title, with its hints of “virtual reality” and “impossible worlds”, had a quick look at the image and unconsciously decided it was computer-generated or created by an artist. Now that you are looking at it again, you need much more time to convince yourself that it is a real photo of a real place: no matter how detailed it is, there is something that “doesn’t map the right way”.
As I have had the chance to experience myself over the last two decades, and can now verify every day with real users thanks to the availability of mass-market VR and AR devices, the context I have described so far is not quite correct when applied to audiovisual entertainment, at least not without appropriate “customizations” of a psycho-cognitive nature. There is a “quality” of the information coming from the environment around us which reveals itself as a sort of background noise, hard to measure and identify; nonetheless, it is this quality that determines the context of the mental framing at the subconscious level, and it is therefore the master trigger in selecting which thresholds we apply to interpret the portion of reality we are currently focused on. Playing with perception in a virtual, synthetic environment cannot rely on this information noise, as we aren’t able to re-create it and feed it to the user. As a consequence, we have to leverage “mechanisms of truthfulness” of the reality we are presenting that go well beyond the mere topics of speed and bandwidth (the amount of information and how fast it changes). Those mechanisms are instead defined and driven by the way we assign meaning to the audiovisual stimuli and their constituent components. If we put together a cat, a girl and a tree, most likely a large portion of the intended audience will be prompted, at a very low level, to build a new scenario out of many inner mental scenarios and will almost instantaneously recognize the scene as a representation of “Alice in Wonderland”, regardless of whether the image is photorealistic or identical to a frame taken from the movie or to an illustration from the original book. More than that, we would expect the cat to talk like a human and move like a biped.
The quest for extreme photo-realism, a phenomenon we are already experiencing in these first months of life of mass-market virtual reality, comes at the expense of a proper classification of the types of mental models applicable to immersive audiovisual contexts and to the augmentation of reality. This course of action leads us to miss much of the narrative and educational value that an increased capability for selecting, integrating and comprehending information would bring to both users and businesses.
What we are currently lacking is a solid analysis of the cognitive composition of the virtual scene and a derivation of “rules”. Aside from videogames, which treat VR merely as a different type of display and do not go much beyond the search for “sensation”, we’re missing the opportunity to go beyond “feeling the presence” and to seek a “being present” setting. I am not talking about making the audiovisual content interactive as a means of enhancing the suspension of disbelief in favor of a functional approach (a feature that would be nice to have but is not essential to a movie or a documentary, at least not without a clear and mature narrative language that fits the way the audience receives the content and its meaning). On the contrary, I am referring to the fact that the environment we create in virtual space must “feel” our presence in the scene and adapt itself accordingly without violating its narrative strategy.
The canonical shot with the camera placed on the center line of a busy highway while cars pass “above” the camera itself shouldn’t be possible and shouldn’t be allowed. Since in VR the camera is literally the audience, having a car on top of us or passing through us would completely break the suspension of disbelief (unless in the story we are playing a ghost, and we support this “deus ex machina” technique by applying special effects when a truck passes through us). In the very same manner, if we stand in the middle of a sidewalk looking at two characters walking towards us while talking to each other, the visual storyline must somehow conform to the scenario and react to the fact that we are right there in front of them: as we are there, and not just in front of a screen, we expect the characters to move around us using the many strategies they would have at hand to avoid an obstacle in a real situation.
In traditional audiovisual works, choosing the point of view (the position, orientation, optics and kind of shot) is the tool directors use to convey a precise feeling and induce an emotional reaction in the viewer. Shooting the subject from a low angle under oblique light generates a sense of physical dominance of the character over the viewer, and a related psychological subjection, by making the character look more evil and threatening; it is a very effective way of delivering that feeling. In immersive virtual environments, though, that very same camera setup corresponds to our actual point of view, a point of view as “real” as we feel it. The shot therefore requires that we are physically kneeling or crouching, with our neck bent and strained to look at the evil character from below: the “sensation” that would be sufficient in a movie at the cinema must in this case match a real, physical feeling, one that cannot remain outside ourselves, simply projected on a TV or cinema screen. If the frame does not coincide with the actual posture of the viewer, there is a great sense of dissociation that makes the director’s intention crumble, regardless of the amount of information and the photo-realism of the scene. If the scene were real and we were really at the feet of a big evil man, we would feel menaced by the intrinsic qualities of the character himself: by what he says and how he says it, by what he does and the way he stands above us. If we are “virtually” on the scene, the canonical two-dimensional mechanisms for delivering emotional triggers, which are the tools of cinema and television, do not fit at all: the story and its narrative mechanics must be able to offer us a psychological participation in the events that is direct and unmediated.
To stress the point, I invite you to try a sofa experiment. If you have a Gear VR or an Oculus Rift, look for a short movie named “Catatonic”. Frankly speaking, this is not a great piece of cinema; the technical quality of the imagery, in terms of pixel resolution, is very low, and if you watch it in 2D (here: https://www.youtube.com/watch?v=hTxJArHZeV0) you won’t feel anything in particular except perhaps a vague sense of madness, more contempt than discomfort. But if you watch it in a VR setting, this short film gains a unique and disturbing connotation, power and narrative effectiveness. You will really feel helpless, catatonic and locked in your own body, surrounded by an altered reality that gets worse as the video unfolds, up to a point of no return where you feel the pain and madness of your virtual twin as you sink into the darkest places of the human mind: his folly will be your folly, throwing you into a deep valley of discomfort. Even physical discomfort. A higher resolution would probably provide too many realistic elements that would help you keep hold of the reality outside the visor, the comfortable awareness that you are probably on your sofa at home, thereby destroying the profound identification you might have with the catatonic patient.
All the rules we know about visual storytelling must be amended and adapted, and new ones will surely arise specifically for this new entertainment industry. Even the old, archetypal animation principles set out by Disney’s Ollie Johnston and Frank Thomas, dating back to the 1930s, will have to be reshaped. A low-fidelity representation of a city street, with buildings made of single-color, untextured cubes whizzing past us very quickly, can be as effective for immersion as, or more effective than, a photo-realistic reconstruction with a lower frame rate and buildings that go by more slowly or less fluidly. Double the size of the cubes in the low-fidelity version and you can convey the feeling of a slower-moving vehicle (while leaving the speed untouched), or make the buildings seem larger than they are while maintaining the feeling of speed (just by slightly bending the buildings backward), depending on whether we want to give priority to the frequency of updates or to the speed of movement. Immersion and suspension of disbelief are unaffected, despite the scene being made of mere cubes. In a photo-realistic environment at very high resolution, on the contrary, the size of the buildings moves “cognitively” into the background, overwhelmed by an amount of detail too high to be perceived analytically, and we have an increased feeling of moving slowly compared to the scale of the scene reproduced: we would find it hard to concentrate on anything specific and, even if we could, all that detail would be useless and a waste of bandwidth (and time). To deliver a feeling of speed we would have to sacrifice part of that photorealism by introducing texture-level motion blur, thus reducing the amount of information in favor of a different quality of information.
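One way to reason about the cube example is to count how many buildings go past the viewer each second. The sketch below is only an illustrative proxy and not a validated perceptual model: it assumes that the cubes stand edge to edge along the street, that the rate at which they pass is one of the cues for perceived speed, and that a lower frame rate means a bigger jump of the scene between two consecutive frames. The function names and all the numbers are hypothetical.

```python
def passing_rate_hz(speed_m_s: float, cube_size_m: float) -> float:
    """Cubes passing the viewer per second, assuming cubes placed edge to edge."""
    if cube_size_m <= 0:
        raise ValueError("cube size must be positive")
    return speed_m_s / cube_size_m


def displacement_per_frame_m(speed_m_s: float, fps: float) -> float:
    """Distance the scene moves between two consecutive frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return speed_m_s / fps


# Hypothetical numbers, for illustration only.
SPEED_M_S = 20.0  # the vehicle speed is left untouched

small_cubes_rate = passing_rate_hz(SPEED_M_S, 10.0)  # 10 m cubes
large_cubes_rate = passing_rate_hz(SPEED_M_S, 20.0)  # cubes doubled in size

print(f"10 m cubes: {small_cubes_rate} cubes per second")
print(f"20 m cubes: {large_cubes_rate} cubes per second")
print(f"jump per frame at 90 fps: {displacement_per_frame_m(SPEED_M_S, 90.0):.3f} m")
print(f"jump per frame at 30 fps: {displacement_per_frame_m(SPEED_M_S, 30.0):.3f} m")
```

Under these assumptions, doubling the cube size halves the passing rate while the actual speed stays the same, which is consistent with the impression of a slower vehicle. Similarly, a lower frame rate makes the scene jump further between frames, which fits the observation that a heavier, slower-rendering scene feels less fluid.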
Unfortunately, this newborn industry is concentrating its main efforts on a quest for technical resources: you need very powerful PCs, graphics cards at the bleeding edge of consumer electronics, extremely detailed 3D models and digital displays with impressively high resolutions, all of which conspire to require additional computing power and feed a self-perpetuating loop. There are multiple, independent reasons for this: the need for actual computing power to generate high-resolution images fast enough not to induce motion sickness; the market’s enduring reluctance to adopt newer machines, which hurts the market share of new operating systems versus legacy ones; and a stagnant PC market that needs new opportunities before most companies leave the competition. All of these are valid arguments for the tech race, but they are not strictly necessary for the quality of VR and AR in the entertainment industry. In particular, they have no relevance at all to establishing an analytical debate and research on the rules of the game.
Rather than concentrating on the amount of visual information that can be sustained in credible VR and AR scenarios, which is merely a technological problem, we should first address the problem of increasing the quality and quantity of valuable and relevant information in the scene (and consequently how to create it and deliver it efficiently to viewers). What role do we need to assign to mental models of content and information? What role to user-created content originating from a different psychological framing? What is the most efficient way to present this information contextually while protecting the narrative goals, that is, without exceeding the activation thresholds of incongruous mental scenarios? Which mental scenarios are relevant for the entertainment industry as it brings content onto the brand new stage of virtual and augmented reality? How much of the user’s cognitive capacity is absorbed in the mere acquisition of this information, and which behaviors are altered compared to those of people who do not have this information at their disposal? How would the behaviors of these two classes of users conflict, and how do we minimize those cognitive conflicts in order to avoid creating new knowledge gaps or, even worse, endangering people?
These are the issues that really deserve to be on the agenda of professionals like us, who produce content and define the interaction and storytelling models for the next decade. As the striking worldwide success of Pokémon Go shows us, there is huge potential for integrating immersive narrative within physical spaces. At the same time, as the accidents and the criminal exploitation of this very same game demonstrate, we cannot limit ourselves to riding the VR and AR opportunities as a mere “new screen”, a business whose rules consist in being more creative than the competition. Rather, we have a responsibility to define the new rules of the game, so as to channel this creativity into a new holistic audiovisual grammar.
* Cover image by Diego Delso – Sunset from behind Seljalandsfoss falls, Suðurland, Iceland. The river Seljalandsá drops 60 meters over the cliffs of the former coastline.