Scale and the Gaze of a Machine

Collaborating with a Deep Learning System

People expect that others, whether a person or a system, will see things as they do. We can learn a lot about how we see by considering how we live and work with others. Collaboration depends on our communicating and agreeing about what things are. If we see something, we expect that an intelligent other will see the same thing and call it by the same name; if someone calls something by a name we know, we expect that thing to be what we would call by that name.

People in communication don't necessarily agree on everything but, at least where collaboration is concerned, we usually mean the same things by our words. Formally speaking, ontologies do not have to be identical, merely sufficiently overlapping, with a method for finding and resolving differences if need be (Ludwig, 2016). Some remaining differences are fine as long as we understand what they are.

For example, the Kaluli of Papua New Guinea consider the male and female birds of paradise to be different species, and this is entirely consistent with their knowledge that the two come together for breeding (Feld, 1982). This isn't likely to be of much consequence when dealing with a Kaluli person. If you want to see a male bird of paradise, you simply ask to see one. It doesn't matter that it isn't considered the same type of bird as the female. In fact, this is much of what ethnography has always been about: How should we understand others outside our group? Classic ethnography is rife with examples of ontologies that aren't shared. Ethnographers can work with that and explain how to understand each other.

In this way, ethnography, like anthropology more generally, assumes a rejection of radical incommensurability. That rejection means that the concepts deployed by one entity (individual or collective) can be understood by another. When someone discusses their family, say, they may have in mind a different set of people from what another might assume but, if the two share sufficient beliefs, they can discuss the boundaries of the concept of “family”. Foucault would have called this an episteme (1971); Kuhn, a paradigm (1962). What's important is that our categories are fluid and we can work within and, to a great degree, between them. Ethnography assumes a level of commensurability sufficient that one party can be explained to another in terms that are understood.

Practically Incommensurable and Practically Inscrutable

Unlike the subjects of ethnographic work, systems created using deep learning are practically incommensurable because they are practically inscrutable. That is, in practice, such systems work with very different concepts from the people who work with them and it will take a lot of work to get to a point where differences can be discovered and resolved.

Incommensurability

If you imagine that a system is observing as you would and “describing” those observations in terms that you would use, a deep learning system could easily be seen as an unreliable observer; however, nothing is further from the truth. The system is quite a reliable observer; under the same conditions, it will come to the same conclusions. It is the expectations of the naïve user that are the problem, because the borders of the system's concepts are considerably different from our own (consider Brad Pitt or the rifle). Still, how can you rely on someone who tells you things that you know are simply wrong? How can you work with someone you don't understand and with whom you cannot negotiate a shared meaning? Only by coming to understand the other's constraints. An example from our work might help.

We work with another project that uses machine vision. This one watches factory workers. The goal of this system is to improve safety while, at the same time, facilitating training, automating record keeping, and increasing efficiency. The video cameras constantly observe and record. The system has been trained to recognize the steps in procedures undertaken by skilled technicians on the factory floor. These technicians are taught a particular plan, composed of a set of steps done in a particular order, and the system learns to recognize them. If you think this sounds like it could lead to Taylorism run amok, you won't be the first. Watching people and how they do what they do is important for safety and practical training, but it could also be seen as providing management with an unwelcome gaze over the worker.

When we keep in mind the difference between a plan and a situated action (Suchman, 1987), we know that as good as a plan may be, a person may need to veer from that plan to account for local conditions. So, when the local situation requires it, an intelligent being will find a way to reach the appropriate end state despite having to change some part of the plan. This is not a situation that a typical deep learning-based system can account for. One thing that a machine vision system cannot do is recognize something new. It does not recognize novel actions for what they are; it simply recognizes that they are not the expected step.
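A minimal sketch may make this concrete. It is not the factory system itself; the step labels, the model.predict interface, and the confidence threshold below are all hypothetical. The point is structural: a classifier trained on a closed set of steps can only answer with one of those steps or with "not the expected step", so a novel but perfectly sensible action has no name of its own.

```python
# Hypothetical sketch: how a closed-set step recognizer "sees" a deviation.
# KNOWN_STEPS, model.predict(), and the threshold are invented for illustration.

KNOWN_STEPS = ["remove_cover", "swap_filter", "torque_bolts", "replace_cover"]

def classify_segment(segment, expected_step, model, threshold=0.8):
    """Return the recognized step, or flag the segment as unexpected.

    `model.predict` is assumed to return (label, confidence) over the
    closed set KNOWN_STEPS -- the system has no label for anything else.
    """
    label, confidence = model.predict(segment)
    if label == expected_step and confidence >= threshold:
        return {"status": "on_plan", "step": label}
    # A novel action -- an interruption, helping a colleague, a workaround --
    # collapses into the same undifferentiated category.
    return {"status": "not_expected_step", "expected": expected_step}
```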

Because of this, one of our roles here was to explain to management why they shouldn't always have access to what the machine “sees”. An example came up in our work. During an observation, we saw an expert “going through the steps” when someone came up to them with a problem. This was standard protocol: someone with a problem should go to someone more senior for assistance. The new problem was solved and the expert returned to his task. This diversion would, of course, have lengthened the duration of that interrupted step, not to mention of the overall process. Some members of management wanted to know what was happening every time the system didn't see what it expected, but this is the sort of naïve error that would cause disruption in the work being done.

Practically Inscrutable

Developers often say that one simply can't understand how a deep learning system works. It is difficult, to be sure, but the workings of the system could be understood. Stephen José Hanson (Hanson and Burr, 1990) argued years ago that because neural nets are implemented on state machines, we know that they can be understood: one state leads to the next by virtue of an explicit command on a known set of input data, and each can be clearly seen. It just takes a lot of time to analyze, a whole lot of time. It took Google's DeepMind weeks to figure out how AlphaGo came up with one of its moves and to explain why that move let it beat the world champion Go player. The important point, though, is that they could explain it. It was possible. It was just ridiculously hard. A non-expert could not be expected to interrogate a system in any kind of reasonable time; even experts can't do this quickly. So, how do we interact with such a machine?
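A toy example (ours, not Hanson and Burr's) makes the point: in a small feed-forward network every intermediate state can be printed and inspected, so explanation is possible in principle. The difficulty is only one of scale; a production model has millions of such values per input and no human-scale names for any of them.

```python
import numpy as np

# Toy two-layer network: every intermediate "state" is open to inspection.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1 weights and biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2 weights and biases

x = np.array([0.2, -1.0, 0.5])                  # one input observation

h_pre = W1 @ x + b1                             # state after layer 1
h = np.maximum(h_pre, 0.0)                      # state after the ReLU
y = W2 @ h + b2                                 # the output "decision"

# Nothing is hidden in principle -- we can print every value...
for name, state in [("h_pre", h_pre), ("h", h), ("y", y)]:
    print(name, state)
# ...but a production model has millions of such numbers per input,
# and none of them carries a label a person would recognize.
```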

With inexplicable ontologies derived from patterns in pixels, understanding is surrendered to a mostly “well-performing system” built in a way that eases machine processing.

Human Scale: Description and Explication

Another way of looking at the previous examples is as a “failure of description”. The system in the factory setting had an incomplete description of the technician's job. Going to help another technician is actually a prescribed part of the job; it's merely infrequent. But as far as the system was concerned, a prolonged absence from the process it knows is a problem like any other. A problem is seen where none actually exists because the system hadn't been trained to recognize this option (or myriad others). It isn't possible to train the system to recognize every action a technician might correctly undertake, because there are countless correct things to do and ways to do them. Instead, what the system can do is learn a limited set of actions and watch to see when they are done correctly. There are many valuable services such a program can provide, but 24/7 understanding of everything it sees is not one of them.

We also see this failure of description in the case of ASL. The system had not been trained to recognize data relative to the categories of human perception: the kinds of units, such as phonemes and morphemes, that make human language possible even when we're not explicitly aware of them. And there are other types of data that the system won't see. Research in ethnomethodology, conversation analysis, and embodied interaction has demonstrated that we signal each other in many ways, often unconsciously, and that those signals are nonetheless important for the interpretation of meaning. They include such factors as subtle body positioning, the direction, timing, and coordination of gaze, and a host of other cues that happen too quickly or too subtly to be easily described but which nonetheless affect communication. The problem is that, outside such micro-analytic work, those signals are rarely even acknowledged and, to our knowledge, have never been included in a deep learning-based natural language system.

While (at least heuristics for) each of these communicative categories could be learned by the system, that could only happen by resisting the emerging standard for such deep learning systems. Seeing ASL as a set of vectors of pixels simply doesn't bode well for bringing the system up to a human scale. Pixels are too fine a scale. Humans think of and see things in ways that are difficult to find in sets of pixels.

The challenge that failures of description present for deep learning systems, then, is that such systems will always be hamstrung by what they were never trained to see.

A way out: changing how we design DL systems. Rather than designing a system as though it completely describes a process (e.g., servicing a tool or translating ASL), we should be developing systems that watch for events in the environment and provide further information in ways that acknowledge their potential insufficiency and are always compatible with the possibility of error. This is how we can provide a reasonable user experience in the face of deep learning's benefits and limitations. Spurious correlations will still happen or, perhaps more correctly, “meaningless” event detection will still happen. Consider Geertz's discussion of the meaning of a wink (1973). Sometimes a wink will just be dust in someone's eye.
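One way to read that design stance as code is sketched below. The names, thresholds, and report format are ours, invented for illustration rather than taken from any system described in this paper: the detector reports events with their evidence, abstains when that evidence is weak, and leaves every report open to human correction.

```python
# Hypothetical sketch of the design stance above: report events with their
# evidence, abstain when the evidence is weak, and leave every report open
# to correction by a person ("sometimes a wink is just dust in the eye").

def report_event(detection, abstain_below=0.6):
    """Turn a raw detection into a hedged report rather than a verdict.

    `detection` is assumed to be a dict such as
    {"event": "wink", "confidence": 0.55, "clip": "cam3_14:02:10"}.
    """
    if detection["confidence"] < abstain_below:
        return {
            "event": None,
            "note": "possible event; evidence insufficient",
            "clip": detection["clip"],      # hand the raw evidence to the person
        }
    return {
        "event": detection["event"],
        "confidence": detection["confidence"],
        "note": "machine-detected; open to correction",
        "clip": detection["clip"],
    }
```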

The implication here is that our DL systems must be bounded and targeted at the kinds of recognition tasks that make their level of activity more commensurate with human understanding and assessment. That means any individual DL system should perform a task bounded enough that, if it produces a result that is not meaningful or useful, the human user doesn't need hours of analysis to figure out what happened but can instead disposition the result quickly, and in a way that the system can learn from.
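One plausible shape for that quick disposition, again sketched with invented names and building on the hypothetical report above, is a feedback step that takes the person seconds and leaves behind a labeled example the system can later be retrained on.

```python
import json
import time

def disposition(report, reviewer_label, log_path="feedback.jsonl"):
    """Record a person's quick judgment of a machine report.

    `report` is a detection report (as in the sketch above); `reviewer_label`
    is whatever the person says actually happened, including "nothing".
    """
    record = {
        "clip": report["clip"],
        "machine_said": report.get("event"),
        "human_said": reviewer_label,
        "timestamp": time.time(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")   # becomes training data later
    return record

# Usage: the worker glances at the clip and answers in seconds, e.g.
# disposition(report, reviewer_label="helping a colleague")
```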

SUMMARY

The intent of this paper was to argue that one of the most significant recent directions in technology – deep learning – has flaws that are best addressed by those trained in ethnographic methods. Who better than ethnographers to advance the cause of human scale?

A generation (or two!) ago, ethnographers were brought into technology development in order to help people make products that fit people so that businesses could “scale up” their offerings and make them relevant for the whole world. However, once they were inside the corporation, so many more problems were revealed to be within the ethnographer's domain.

Atomic units are used to simplify programming: pixels for images, spectrographic-style frequency analyses for speech sounds. And they do simplify programming. They are just the wrong level of abstraction for dealing with people.

The work we presented here was meant to show, in part, how we might create machine learning that works well but, beyond that, it is also about developing AI systems that can be more easily understood by people. Much of today's deep learning consists of the type of system that Latour could point to as being particularly rife with blackboxing (1999), because it is practically impossible to know how such systems work. Successful scaling-up of the technology should not mean that no one will have access to the methods behind the madness.

By getting the scale right for human understanding, we can hope to have more control over the gaze of the machine. This may slow down both processing and system creation; it could even mean that a given system would not be as broadly applicable. But it would be a better system, working at a more human scale, and it would enable more fundamental interaction with the system itself.

Richard Beckwith is a Research Psychologist at Intel Corporation's Intel Labs. He is a psychologist who studies the impact that emerging technologies have on those upon whom they emerge and helps to ensure that technology designs can support people in the way that they should.

John Sherry is the director of the User Experience Innovation Lab in Intel Labs. This organization focuses on the human dimension of machine learning technologies from diverse perspectives, to better imagine and prototype new technological possibilities, and anticipate the alignments necessary for those to become reality.

2020 EPIC Proceedings, ISSN 1559-8918, https://www.epicpeople.org/epic

REFERENCES CITED

Athalye, Anish, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018. “Synthesizing Robust Adversarial Examples”. Accessed [24 Aug 2020]. https://arxiv.org/pdf/1707.07397.pdf.

Birdwhistell, Raymond. 1970. Kinesics and Context: Essays on Body Motion Communication. Philadelphia: University of Pennsylvania Press.

Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 4960-4964. doi: 10.1109/ICASSP.2016.7472621.

Console, Luca, Fabrizio Antonelli, Giulia Biamino, Francesca Carmagnola, Federica Cena, Elisa Chiabrando, Vincenzo Cuciti, Matteo Demichelis, Franco Fassio, Fabrizio Franceschi, Roberto Furnari, Cristina Gena, Marina Geymonat, Piercarlo Grimaldi, Pierluige Grillo, Silvia Likavec, Ilaria Lombardi, Dario Mana, Alessandro Marcengo, Michele Mioli, Mario Mirabelli, Monica Perrero, Claudia Picardi, Federica Protti, Amon Rapp, Rossana Simeoni, Daniele Theseider Dupré, Ilaria Torre, Andrea Toso, Fabio Torta, and Fabiana Vernero. 2013. Interacting with social networks of intelligent things and people in the world of gastronomy. ACM Trans. Interact. Intell. Syst. 3, 1, Article 4 (April 2013), 38 pages. DOI:https://doi.org/10.1145/2448116.2448120
