The Intelligence Community wants to develop a kind of universal translator that will search documents across a full range of media and make sense of them for English-speaking analysts.
The Intelligence Advanced Research Projects Activity (IARPA) has launched a multi-year effort to create language processing software under a program called Machine Translation for English Retrieval of Information in Any Language, or “MATERIAL.” IARPA, which works for the Office of the Director of National Intelligence, has awarded contracts to four organizations for the program.
MATERIAL would scour sources as varied as social media, newswires, and radio and TV broadcasts in search of foreign-language documents of interest to U.S. intelligence agencies. According to IARPA’s plans, the cross-language information retrieval systems would be able to summarize the information they come across and be adapted quickly to new areas of interest.
“The collection and analysis of information required to accomplish a specific intelligence task has increasingly become a multilingual venture,” said Carl Rubino, IARPA program manager. “For most languages, there are very few or no automated tools available for cross-lingual data mining and analysis. The MATERIAL Program aims to investigate how current language processing technologies can most efficiently be developed and integrated to respond to specific information needs against multilingual speech and text data.”
The idea of a universal translator has been around at least since Murray Leinster’s 1945 novelette “First Contact,” and is probably most familiar to people through the “Star Trek” television series. On TV, it worked like magic, translating in both directions across a galaxy of alien languages. Researchers working on MATERIAL, by contrast, face some daunting real-world challenges.
Although language processing software has made tremendous strides in recent years, anyone who has trained a speech-to-text app or conversed with Alexa knows that even English-to-English translation can still encounter hiccups. Foreign languages present a different set of problems, including differences in syntax and usage within individual languages, compounded by the sheer number of languages in use.
At the far end of the spectrum, Ethnologue currently counts 7,099 known living languages, although many of them are spoken by small groups of people. By comparison, Google’s Cloud Natural Language API, which gives developers access to sentiment analysis, entity recognition, and syntax analysis, recognizes 80 languages. That number is likely closer to IARPA’s intentions, but it is still a lot of languages.
And while most language research concentrates on verbal translation, IARPA’s project aims to automate the retrieval of information from a variety of formats.
IARPA has awarded contracts to teams led by Johns Hopkins University, Raytheon BBN Technologies, Columbia University, and the University of Southern California Information Sciences Institute for work on the project. Another four organizations (MIT Lincoln Laboratory, the University of Maryland Center for Advanced Study of Language, the National Institute of Standards and Technology, and Tarragon Consulting) will act as the program’s test and evaluation team, assessing the performance of what IARPA expects to be complex end-to-end solutions.
MATERIAL is the latest of IARPA’s wide-ranging efforts toward Anticipatory Intelligence, which employ new technologies in fields including cybersecurity, signals intelligence, and big data analytics.