Scale and the Gaze of a Machine


Scale suffuses the work we do and, recently, has us considering an aspect of scale best suited to those with ethnographic training. We've been asked to help with scaling up one of the latest blockbusters in high tech – deep learning. Advances in deep learning have enabled machines to be programmed not only to see who we are with facial ID systems and hear what we say with natural language systems, but even to recognize what we do with vision-based activity recognition. However, machines often define the objects of their gaze at the wrong scale. Rather than “look for” people or objects, with deep learning, machines typically look for patterns at the smallest scale possible. In multiple projects, we've found that insights from anthropology are needed to inform both the scale and uses of these systems.

Keywords Deep Learning, Human Scale, Ethnographic Insights

Article citation: 2020 EPIC Proceedings pp 48–60, ISSN 1559-8918, https://www.epicpeople.org/epic


PEOPLE THINK AT A HUMAN SCALE

When we talk about “human scale”, we refer to the sizes of objects and spans of time that people tend to think about. We humans don't have to think on the human scale. We can think on the scale of the universe or the atom. However, thinking at the human scale is natural; it is what allows us to collaborate; it allows us to see the reasons in another's acts; it supports our sociality. Although we can argue with an imposition of “rationality” on broad swaths of human thought (e.g., Malinowski 1922/1984), we also must admit that it is typically rather easy to attribute a rationale to what a person has done. We naturally “see” what other people are doing; machines do not.

Why don't machines just see like humans? Humans program the machines after all. The reason is that machines would need to be programmed to see at a human scale and, at this point in time, that hasn't been the case. It's quite hard and there are alternatives. Machines have been programmed to a surprising level of accuracy, to be sure, but that's not enough. You can be accurate and yet not correct. The human ability to see what others are doing – this “vision” – is not the same as being able to describe the outward behavior that people have engaged in. The social sciences became convinced of that disconnect following the fall of behaviorism. Now, the social sciences rarely provide an “objective” description of the “behaviors” of others; rather, we offer what might be called a “preferred description” (Searle, 1983). Someone might describe another's behavior as alternating movement of the legs across a floor, but this would likely not match how the person would describe it themselves. An observer might say that a subject has walked to the north, which may be true, but the walker may not even have known the direction. It's more likely that the person being observed had thought that they were walking to the exit. “Walking to the exit”, then, is the preferred description and these descriptions are easy for humans to generate about each other. It seems fairly obvious that a person watching that walker would say the same thing, and perhaps this is what Malinowski may have had in mind – that he could look at Trobriand Islanders and their culture and imagine why they would travel great distances to bring some long-held possession to be held by another. That attribution is thinking at a human scale (e.g., Dennett on the “intentional stance”, 1978). It's so much easier to collaborate with, to trust, another whom you can at least convince yourself that you can understand. So, it can be a real problem when “thinking machines” don't think like us.

Machines Don't Have to Think in Human Ways

One of the reasons that our technology company hires social science types is to help to design technologies such that they are better partners. It used to be that we were asked to help make purely responsive computers that would fit with people's lives. Now, the computer can take initiative (Console, et al. 2013) and fitting in is so much more significant. New technologies promise to be more connected to their environment and better able to understand and interact with people in more natural ways. That promise is where the problems start. It's frequently the case that “high technology” is designed in a decidedly non-human way and we're here to tell the choir that these machines can be harder to collaborate with and harder to trust than people. In many ways, what we are trying to do in our work is to help to create technology that can truly participate at the human scale or to point out when machines are incapable of working with people in that way.

We'll detail some examples from the technology literature and briefly describe some cases that we're working on, but before that, we'll lay out a technology domain to which we will restrict our focus, one that is not only salient these days but which also highlights the value of the social sciences for technology development, namely artificial neural networks or, more simply, “neural nets”.

NEURAL NETS

Neural nets are the “iron horse” of the 21st century. OK, maybe “neural net” is just a similarly inapt metaphor. Iron horses weren't remotely horses and neural nets aren't remotely brain-like. Despite not being horses, railroads have been remarkably useful as a means of transport. They deliver goods, simplify travel, and can be quite reliable. Neural nets can be remarkably useful, too. As many people know, neural nets are terrific at finding pictures of cats (Le, et al., 2012). Moreover, neural nets are driving significant innovation in the computing industry. They have enabled improved multimodal sense-making and understanding (Owens and Efros, 2018), automated speech recognition (Chan et al., 2016), and natural language processing (Vaswani et al., 2017); and then there's that near magic we see with computer vision, which goes well beyond cats (Krizhevsky et al., 2012).

AlphaGo

To take a famous example, AlphaGo, which debuted in 2016, was built on a neural net that was programmed to play the game Go (Silver et al., 2016). Go is a two-player game where players capture space by putting colored playing pieces on a game board. At its debut, AlphaGo beat a world champion Go player in four out of five games. This was a surprise to nearly everyone, including the AI community, because Go is considered much harder than chess for a computer, and computer scientists had worked for decades on computer chess before a program could beat a human champion.

While there are lots of different aspects to the game and the program, we want to focus on just one aspect here – the Go board, its moves, and how AlphaGo sees them. First, let's consider how humans see Go. The Go board is a grid of 19 horizontal and 19 vertical lines, which cross to form 361 intersections (361 = 19 × 19). These intersections are where a player puts the playing pieces – “stones” – which are black for one player and white for the other. Players take turns placing one of their stones on an empty intersection. The goal of the game is for a player to build continuous walls of their stones around sections of the board such that their walls enclose more space than their opponent's. When a player puts down a stone, it is either to build a wall of their own pieces or to block their opponent from building a larger enclosure. Any surrounded stones of an opponent are taken as prisoners. A completed game takes about 250 turns. It will be relevant two paragraphs from now to have noted here that reading the current paragraph once or twice would allow a person unfamiliar with Go not only to play the game but also to create a functional board and playing pieces.

Humans see a Go board as a 19×19 grid on which walls are built with stones. That's not how AlphaGo sees the game. AlphaGo sees the Go board as one long vector with a separate element for each of the 361 intersections. The training data for AlphaGo consist of game-length sequences of these vectors, with each consecutive vector in the sequence representing each subsequent move in a game. As AlphaGo sees it, each move in the game is represented by a new vector that differs from the previous vector by one element (i.e., the new stone) or by more if an opponent has been surrounded and their pieces taken as “prisoners”. The bottom line is that people playing Go see the building of walls around sections of the playing surface; AlphaGo sees patterns in a series of vectors.
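
To make the contrast concrete, here is a minimal sketch in Python of the kind of representation we are describing. It is illustrative only, not AlphaGo's actual input encoding (which uses several feature planes per position).

```python
import numpy as np

# A minimal sketch: a 19x19 Go board as one 361-element vector.
EMPTY, BLACK, WHITE = 0, 1, 2

board = np.zeros((19, 19), dtype=np.int8)    # how a person sees it: a grid
board[3, 3] = BLACK                          # one black stone placed

vector = board.flatten()                     # how the network sees it: one long line
print(vector.shape)                          # (361,)

# The next move produces a new vector differing from the previous one by a
# single element (more, if captured stones are also removed).
next_board = board.copy()
next_board[15, 15] = WHITE
print(np.count_nonzero(next_board.flatten() != vector))   # 1
```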

Before playing a game against a person, AlphaGo will look at, literally, millions of games to see what patterns emerge in the vectors. Once AlphaGo has seen millions of played games, it can figure out how to win. More specifically, AlphaGo can figure out which next step (i.e., which change in one element) is most likely to lead to a win and, with each step, chooses the move it believes will get it closer to a win. In order to learn to play at the level it played, AlphaGo needed to see millions of games that had been played. Interestingly, in order to play at all, AlphaGo would likely have needed access to nearly as many completed games. This requirement of seeing millions of games, it must be noted, is simply not true of humans, who can learn the game quickly (as noted in a previous paragraph); people are unlikely ever to encounter a million games in a lifetime, let alone by the time they've played their first opponent.
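
The “choose the move most likely to lead to a win” logic can be sketched, in hypothetical and much-simplified form, as a greedy loop over legal moves. AlphaGo itself combines policy and value networks with tree search, which is far more involved than this; `estimate_win_probability` below is a stand-in of our own invention for a trained value model.

```python
# A hypothetical sketch of greedy move selection by estimated win probability.
def choose_move(board, legal_moves, estimate_win_probability, player):
    best_move, best_value = None, -1.0
    for move in legal_moves:                      # e.g., (row, col) tuples
        candidate = board.copy()
        candidate[move] = player                  # place the stone
        value = estimate_win_probability(candidate.flatten())
        if value > best_value:
            best_move, best_value = move, value
    return best_move
```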

Feature Engineering

To be perfectly honest, almost none of that is central to the argument we want to make. What we care about most is that the two-dimensional 19×19 grid on which a person sees walls, AlphaGo sees as a simple line with a piece of data about the state of each intersection (black, white, or empty), a line that forms patterns with the states of the board in nearby lines. That AlphaGo sees the state of the board as linear is quite significant since a line can have no walls. AlphaGo simply finds patterns in the sequence of changes between the lines within a game.

One can imagine that engineers didn't have to spend much time figuring out that a vectorized representation of a two-dimensional board was going to be good enough. They still had a single variable for each intersection and only three possible states for each of those 361 intersections. Noticing patterns across elements isn't likely to be outside the ken of an artificial neural network and, frankly, there isn't much else going on in the training data that the machine would need to notice or would be distracted by. The system only needs to know the possible next steps and the likelihood that a change in the arrangement on the board will lead to a winner. So, the feature engineering for AlphaGo would have been fairly simple. Nevertheless, feature engineering is an important part of any neural network or machine vision system and is nearly always much more complex than what we've seen with Go.

In fact, deciding which features to include in training a neural net can be quite difficult, especially in areas like vision or language, which so often seem magical. Because of this difficulty, engineers have discovered ways to allow a program to find its own features. This is called “automatic feature engineering”. Despite the fact that automatic feature engineering has some fairly significant issues, it is in many ways the magic of vision and language neural networks and underlies the ability to find so many cats. Yet, it can lead to a particularly pernicious type of problem – inferences based on spurious correlations.
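
As a rough illustration of what “automatic feature engineering” looks like in practice, a small convolutional network such as the sketch below takes raw pixels and learns its own intermediate features during training, rather than being handed hand-crafted ones. The layer sizes here are illustrative and not taken from any system discussed in this paper.

```python
import torch.nn as nn

# A toy convolutional classifier: the filters (features) are learned from
# pixels during training, not designed by an engineer.
feature_learner = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learned filters, not designed ones
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),                            # e.g., ten object classes
)
```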

Spurious correlation errors are one of the more significant side effects of automatic feature engineering. Obviously, spurious correlations are not just a problem for deep learning. People fall prey to spurious correlations, too. Consider, for example, the recent conspiracy theory holding that 5G radio towers cause Covid-19. The best evidence proponents have for this theory is a set of geographic heatmaps showing that, in February and March of 2020, Covid hotspots and the then-current 5G deployments lined up quite well. The correlation between the maps looked compelling and, without a more sensible explanation, 5G could seem like a reasonable-enough theory. The reason the maps lined up, according to experts, was that Covid was hitting urban areas hard, and urban areas are also where 5G rolled out first. The correlations between Covid and 5G were spurious. What is important to note here is that we can see the sense in people's mistaken explanations – “the maps lined up so well”; there is a transparency to the error.
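
A toy calculation shows how easily a confounder produces this kind of correlation. The numbers below are invented purely for illustration: “urban density” drives both a simulated 5G deployment and a simulated case count, so the two correlate strongly even though neither causes the other.

```python
import numpy as np

# Invented data: a confounder (urban density) drives both variables.
rng = np.random.default_rng(0)
urban_density = rng.uniform(0, 1, size=500)
five_g_towers = 40 * urban_density + rng.normal(0, 4, size=500)
covid_cases = 1000 * urban_density + rng.normal(0, 100, size=500)

print(np.corrcoef(five_g_towers, covid_cases)[0, 1])   # high, but not causal
```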

Often, that transparency of errors isn't present with deep learning. In fact, sometimes the errors generated with deep learning seem inexplicable. Research on attacks against deep learning systems demonstrates how opaque the reasons for an error can be. For example, researchers have created patterned eyeglass frames that will fool a state-of-the-art facial recognition system built with automatic feature engineering (Sharif, et al. 2016). This system was trained to recognize different celebrities. The automatic feature engineering had the system look for pixel-level differences among photos labeled with different celebrity names. As with the Go board, the system looked at each picture as a long vector. That is, photos were seen as a long line of pixels. In these digitized photos, the pixels are row after row of dots, each of which is one color, not unlike the Go board with its 19 rows of 19 columns and three states per element. Photos are just more complex than a Go board: more rows, more columns, and more states per element. Instead of Go's three states, each element of a photo can take hundreds of values or more. So, an image is, like the Go board, seen as a vector, but a much longer vector with much more varied contents.
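
The difference in scale between the two kinds of vectors is easy to make concrete. The sketch below flattens a hypothetical 224×224 RGB photo the same way the Go board was flattened; the dimensions are ours and purely illustrative.

```python
import numpy as np

# The same flattening as the Go board, applied to a photo: a far longer vector
# with far more possible values per element (256 levels per color channel
# rather than Go's three states).
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
pixel_vector = image.reshape(-1)
print(pixel_vector.shape)    # (150528,) versus the Go board's (361,)
```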

The complexity of digitized images means that there is a greater chance of spurious correlations. The photos of celebrities offered spurious correlations aplenty. The researchers in this study found that they could design a set of colorful eyeglass frames, each of which appeared to have a random design, but whose design matched a pixel pattern associated with a particular celebrity. The researchers discovered patterns that would fool the vision system into believing that one person was another. For example, despite the fact that the system was excellent at recognizing photos of Reese Witherspoon, a picture of her was mistaken for Brad Pitt when she was pictured wearing the Brad Pitt glasses (or for other celebrities when other glasses were used). [We suppose we should mention that, to most people, these two celebrities don't look much alike.] Any person wearing the Brad Pitt glasses would look like Brad Pitt as far as the system was concerned. Brad Pitt was identified by the pattern of pixels in the eyeglass frames (there were certainly other “random” patterns of pixels that happened to be associated with Brad Pitt, but those on the glasses were sufficient for identifying him). Despite being state-of-the-art, the facial recognition system fell for a spurious correlation. However, unlike the similarity of the maps of 5G and Covid, the correlation that the system found between name and pixel pattern was not something that a person could ever see. The patterned glasses don't even remotely look like Brad Pitt or any of his features. The errors would make more sense if the researchers had deployed prosthetic chiseled chins to make someone look like Brad Pitt. People simply don't hypothesize the identity of others based on random patterns of pixels.
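
For readers curious about the mechanics, the general idea behind such an attack can be sketched as an optimization over only the pixels inside a fixed “frame” mask, nudging the classifier's output toward a chosen target identity. This is a hedged simplification rather than the procedure Sharif et al. actually used (they optimized for printable, physically robust frames), and `model`, `face`, `frame_mask`, and `target_id` are assumed stand-ins, not their code.

```python
import torch
import torch.nn.functional as F

# Sketch of a masked, targeted attack: only pixels under frame_mask change.
def frame_attack(model, face, frame_mask, target_id, steps=100, lr=0.05):
    perturbation = torch.zeros_like(face, requires_grad=True)
    target = torch.tensor([target_id])
    for _ in range(steps):
        adversarial = torch.clamp(face + perturbation * frame_mask, 0, 1)
        loss = F.cross_entropy(model(adversarial.unsqueeze(0)), target)
        loss.backward()
        with torch.no_grad():
            perturbation -= lr * perturbation.grad.sign()   # step toward the target class
            perturbation.grad.zero_()
    return torch.clamp(face + perturbation * frame_mask, 0, 1).detach()
```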

PROBLEMS FOR ETHNOGRAPHERS

So, now that we've covered neural nets, feature engineering, and the problem with spurious correlations, we can turn to projects we've worked on to highlight some of the issues that ethnographers are best able to deal with.

Communication versus “Natural Language” Networks

One of the projects that we are now working on is a system that will use deep learning to translate between American Sign Language (ASL) and English. The idea is to find patterns in videos of people signing and relate those patterns to simultaneous English translations. The videos we are using sometimes have ASL translated to English and, other times, English translated to ASL. In all cases, these videos include ASL and English that are intended to express the same content. The goal is to have an “end-to-end” system that learns from videos of signing and an associated translated text of the spoken language used as a “label” for the signed content. Tens of thousands of these labeled videos are required for the system to begin to learn to translate.

Given that the system's input streams include raw video, it will not be surprising to hear that the system looks at the video as a sequence of vectorized images, with the top-left pixel of each frame being the first element in the vector and the bottom-right pixel being the last. The features that the system will discover are like those of the celebrity ID system in that they are patterns of pixels associated with some label.
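
Concretely, the input representation being described looks something like the sketch below, where the frame count, frame size, and example label are ours and purely illustrative.

```python
import numpy as np

# A signing video as a sequence of frames, each flattened into one long pixel
# vector from the top-left pixel to the bottom-right.
video = np.random.randint(0, 256, size=(120, 224, 224, 3), dtype=np.uint8)   # 120 frames
frame_vectors = video.reshape(video.shape[0], -1)                            # (120, 150528)

# Training pairs the whole sequence with its translation used as a "label".
training_example = {"frames": frame_vectors, "label": "where is the exit"}
```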

A knowledgeable signer of ASL would look at the video and see a sequence of meaningful tokens (i.e., morphemes) composed of a set of language-specific building blocks (i.e., phonemes), but the deep learning system, with its automatic feature engineering, learns without seeing or knowing anything about phonemes or morphemes and, further, is not being programmed to acquire them. By focusing on pixels and patterns of pixels, the system is far simpler to program. But because it focuses on these language-independent features (i.e., pixels), problems with spurious correlations are rife. Systems in the future will be able to learn morphemes and phonemes first and acquire the language with that “knowledge”. This is the only way to avoid the problem of “spuriosity”. But this is only the beginning; the problem with pixels goes further than that.

Ethnographers trained in microanalysis can say more about what a knowledgeable user of ASL sees or how a fluent signer would construct and understand meanings. What microanalytic techniques brought to the study of communication was to show where relevant data had been ignored in trying to assign meaning: the weight of conversation is not carried by syntactically arranged words alone; there are non-linguistic gestures, postures, and eye gaze (Birdwhistell 1970, Kendon 1967, Schegloff 1998). There's intonation, pitch excursion, and volume. Conversation even moves forward with what is not said (Watzlawick, 1967). These all fly under the banner of “microanalysis”. What microanalysis brought to the more strictly behavioral concerns of the time was a research program that asked what needed to be considered in the way that people construct and understand meaning when they communicate. This methodology is associated with anthropology as much as with communication theory; both areas study meaning and the technologies and techniques with which meaning is shared. There can be no question that a fluid and facile interpreter will need to consider these cues, nor that a system meant to interpret must also consider them. A job for the ethnographer working with deep learning is discovering both the right level of analysis and an ontology that makes sense…and then advocating for them.

The Interpretive Stance and Machine Vision Networks

Part of the magic of these deep learning systems is not only that they can work at all but also how well they work once they do (remember all those cat photos). Part of the problem is that when they make an error, it will not be an error that a person is likely to be able to understand. That is, the system won't fail in a human way, and a person working with it is unlikely to be able to determine what data it considered and how those data were analyzed while making an inference. When the system offers a solution, a user may find it difficult to know that it has failed. Simply put, when the errors are not on a human scale, it is difficult for a person to correct the system, to work with it.

How Do You Work with Failure?

Arguably, effective translation is crucial, and errors could be life threatening. However, it is also the case that, in an operational system, an ethnographer will have ensured that conversational methods of correction are in place. Perhaps an example where the system performs as an autonomous tool will help to highlight the potential risk of our misunderstanding how a machine sees. Here's another example from the tech literature.

The boffins have taken deep learning's most common machine vision training set (i.e., ImageNet (Deng, et al., 2009)), played with something quite like the Brad Pitt eyeglasses noted above, and come up with something diabolical (Athalye, et al., 2018). While it doesn't include celebrity photos, ImageNet is a database of more than a million images of many different classes of objects. This database is used by many deep learning practitioners to build systems that identify new images of the object types included in ImageNet (like turtles and rifles). In this case, the boffins trained up a network so that it had world-class performance in identifying the object categories.

One of the object categories that is relevant for this story is “rifle”. Rifle plays the role of Brad Pitt here. What's interesting is that these researchers used the seemingly random pattern of dots/pixels associated with “rifle” and did something akin to what the other researchers did with the Brad Pitt pixels. Instead of eyeglasses, they applied the random dot pattern to the surface of a 3D toy turtle, placing the dots such that, from every angle as the turtle was rotated, the toy looked like a rifle to the network. To the human viewer, the coloring wound up looking a bit like turtle camouflage. So, instead of a person wearing colored frames on a pair of glasses and then looking like Brad Pitt, a toy turtle was misidentified as a rifle. One can imagine negative consequences that could follow from having a child bring such a toy into a protected area…very negative consequences, and the reason for the error would not be at all obvious to those protecting that area. Because of the way pixel-based systems work, one would hope that a security detail would never rely on one. (Of course, police departments do use deep learning-based machine vision already (Harris, 2019).) Clearly, there's more for the ethnographer to do.
