OSD09-H03

“How — what — what kind of foods do they have?”

Four independent subroutines went to work analyzing the phrase uttered by the four-year-old: expression context, voice recognition, tone analysis, body language. Tone analysis needed to be the fastest, and luckily it was also the simplest. No quavering or whining detected. Had it been, the other subroutines would have been directed to stop, and control would have been handed over to an array of prewritten comfort dialogues.

Expression context came next. Eye contact from the child was only occasional. The image analysis package, in concert with the body language and expression routines, determined that the child, a fair-haired boy, was occupied by something below frame. The RFID scan identified it as a toy train, one of twelve toys in the room. The dialogue routine was updated with the name of the object, potentially to be used later if the child remained silent for a specified amount of time (“Hey, is that a toy train you’ve got there?”).

Voice recognition had been dissecting the phrase all this time. Tone analysis supported the conclusion that the child had asked a question.

ɦɑw ? ʍɜt ? ʍɜt käɪ̯dɜ fudz’ du ðe hæv 

“Food” triggered a subarray of typical questions, and once the substrings “kind of” and “they” had been identified and routed through the context and grammar parsers, it was a simple matter to locate the most likely question being asked.

The response set, indexed by question, was accessed and syllabically divided for the vocal synthesis package. Then, poring over a hash table of pre-identified lingual structures of the child’s father, the synthesizer generated an audio file by conflating the two data streams. The file was equalized to include a bassy low-frequency component at 180 Hz, creating a comforting, warm “in-room” effect that mimicked the tone heard by the child with his head upon his father’s chest.

Meanwhile, a 1280×700 image of the father, taken years ago when he was first deployed, was overlaid onto a digital model (from the neck up only; the Department of Defense had originally planned to include hands so the model could gesture, but this was abandoned early as overly complex). The resulting hybrid was passed through a series of basic lingual configurations (augmented with syllable-stress-driven head movements) and converted into a sequence of keyframes.

These individual frames can be presented directly on the viewing screen, synchronized to the audio file. A series of static-simulating filters creates “webcam believability” and reduces the Morian “uncanny valley” effect, to which children have been shown to be particularly sensitive. Once it was understood that they want to believe, the goal became to give them less visual fidelity, not more.

“They give us all kinds of foods here to keep us healthy. Lots of things like vegetables, steak, chicken. Even some of your favorites like pizza. You like pizza, huh, buddy?”

The microphone registers no audio response, but expression context identifies upturned corners of the mouth and squinting eyes.

“I miss you, daddy.”

A timer, preset to five minutes plus or minus anywhere from zero to thirty seconds, reaches zero. A half-dozen randomly selected dialogue trees are deallocated from memory.

“I miss you too, Josh. I’m coming home real soon, okay? Daddy has to go now. Be a good boy, okay? I love you. I love you.”

Somewhere in the room, a hard drive whirs.
