Hearing the word on the street doesn’t do you much good if you don’t understand the words to begin with.
The Intelligence Community (IC) knows that feeling well: more than 7,000 languages are spoken worldwide, some by only small groups of people. A human translator isn’t always at hand–and the same goes for machine translation when the IC scours websites, social media, broadcasts, and other sources for clues about what’s happening and what may happen.
The IC’s top research arm, the Intelligence Advanced Research Projects Activity (IARPA), wants to be able to get the gist anywhere, anytime. It is inviting researchers from industry and academia around the world to use machine learning to develop cross-lingual information retrieval algorithms capable of extracting answers to questions posed in English from material in little-known foreign languages. IARPA’s Open Cross-Language Information Retrieval (OpenCLIR) Challenge invites participants to compete for prizes while working on a fast-working translator of sorts–one that can interpret other languages using a minimal amount of training data, focusing particularly on what IARPA calls “computationally underserved languages.”
English is, of course, the dominant language on the internet, accounting for about 55 percent of all content, according to the Unbabel Blog. That leaves other languages making up the remaining 45 percent, and while the rest of the top 10 or so most-used languages are familiar, their share of web content doesn’t approach English’s (German is second, at 7.7 percent). Lesser-known languages may have a tiny share by comparison, but the information they hold could be significant to the IC.
IARPA isn’t looking for a comprehensive, automated language database, but for a system that pushes the envelope of natural language processing and machine learning to glean information from lesser-known languages. In fact, a comprehensive tool can’t be built for some of these languages, because the data needed to train a natural language processing system for them isn’t available in sufficient quantity, IARPA said. OpenCLIR seems to be aiming for a tool that can intuit meaning from a limited linguistic foundation–an interpretive interpreter, so to speak.
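One common starting point for cross-lingual retrieval of this kind is query translation: map the English query into whatever target-language terms a (possibly very sparse) bilingual lexicon provides, then rank documents by how many of those terms they contain. The sketch below illustrates that idea only–the tiny lexicon, sample documents, and function names are invented for illustration and are not drawn from any IARPA or OpenCLIR material.

```python
from collections import Counter

# Hypothetical, deliberately sparse English -> target-language lexicon.
# In a low-resource setting, most query words would have no entry at all.
LEXICON = {
    "water": ["maji"],
    "river": ["mto"],
    "flood": ["mafuriko"],
}

# Toy "foreign-language" document collection (illustrative only).
DOCS = {
    "doc1": "mafuriko makubwa kwenye mto",
    "doc2": "bei ya chakula sokoni",
}

def translate_query(english_query):
    """Map each English query word to its known target-language terms."""
    terms = []
    for word in english_query.lower().split():
        terms.extend(LEXICON.get(word, []))
    return terms

def retrieve(english_query, docs):
    """Rank documents by how often translated query terms appear in them."""
    terms = translate_query(english_query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        scores[doc_id] = sum(counts[t] for t in terms)
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(retrieve("river flood", DOCS))  # doc1 matches both terms, doc2 neither
```

The hard part OpenCLIR targets is exactly what this sketch assumes away: for a computationally underserved language, the lexicon and training text needed to do even this crude matching may barely exist.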
The research arm already has a more inclusive “universal translator” project in the works, the Machine Translation for English Retrieval of Information in Any Language (MATERIAL), a multi-year effort to create an automated tool that can search documents across a range of media and summarize their content for analysts. But MATERIAL is a bigger project, covering more languages and query types, as well as more tasks, such as domain classification and summarization. OpenCLIR is an offshoot of MATERIAL, focusing on a smaller–if no less daunting–set of goals.
The OpenCLIR challenge is to pull off a translation even when the automated tool is only slightly familiar (or perhaps not familiar at all) with the language, and IARPA seems to want participants to start at a disadvantage. The program invites domestic and international researchers with backgrounds in natural language processing and data science, but it excludes performers on the MATERIAL program, as well as other U.S. government agencies, federally funded research and development centers, university research centers, and any other government-allied organizations with access to privileged or proprietary information. And, significantly, anyone who speaks the languages being evaluated may not join a participating team or give one any assistance.
Participants in OpenCLIR will start with a modest amount of machine learning and natural language processing materials and build from there. Whether the program produces the kind of tool it’s looking for is an open question at this point, but, as with other government-sponsored research challenges, the aim is to push the technology forward. In this case, IARPA said, the goal is “to advance the research and development of human language technologies for lower resourced and computationally underserved languages.”