According to the organisers, the KMi team, consisting of Petr Knoth, Drahomira Herrmannova and Zdenek Zdrahal, achieved the overall best results in the English to Chinese, Japanese and Korean (English to CJK) task of the NTCIR-10 CrossLink evaluation competition. The team was also a top performer in the CJK to English task, where it was consistently among the three best and mostly second best. Ten international teams took part in the evaluation. This is the second time the KMi team has participated in this competition.
NTCIR is a major forum (similar to TREC) of evaluation workshops designed to enhance research in Information Access (IA) technologies, including information retrieval, question answering, text summarization and extraction. The NTCIR-10 conference will take place, as usual, in Tokyo, Japan, this June.
The CrossLink task addresses Cross-Lingual Link Discovery (CLLD), a way of automatically finding potential links between documents in different languages. CLLD differs from traditional cross-lingual information retrieval (CLIR). CLIR can be viewed as a process of creating a virtual link between a provided cross-lingual query and the retrieved documents, whereas CLLD actively recommends a set of meaningful anchors in the source document and uses them as queries, together with contextual information from the text, to establish links with documents in other languages.
Wikipedia is an online multilingual encyclopaedia that contains a very large number of articles covering most written languages. It includes extensive hypertext links between documents of the same language for easy reading and referencing. However, pages in different languages are rarely linked, except for the cross-lingual link between pages about the same subject. This can pose serious difficulties for users who seek information or knowledge from sources in different languages. Cross-lingual link discovery therefore tries to break the language barrier in knowledge sharing. With CLLD, users are able to discover documents in languages which they either are familiar with, or which have a richer set of documents than their language of choice.
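The anchor-based linking idea described above can be illustrated with a minimal Python sketch. It is not the KMi team's method or the official CrossLink setup: it assumes a small hand-made table (`CROSS_LINGUAL_TITLES`, hypothetical) that stands in for Wikipedia's existing links between pages about the same subject, and it selects anchors by plain title matching, ignoring the contextual information a real CLLD system would use. The function name `discover_links` is likewise an assumption made for illustration.

```python
import re

# Hypothetical toy data: English article titles mapped to the title of the
# article on the same subject in another language. A real system would draw
# on a full Wikipedia dump rather than a hand-written table.
CROSS_LINGUAL_TITLES = {
    "information retrieval": {"zh": "信息检索", "ja": "情報検索"},
    "question answering": {"zh": "问答系统", "ja": "質問応答システム"},
}


def discover_links(text, target_lang, titles=CROSS_LINGUAL_TITLES):
    """Return (anchor_text, start_offset, target_title) tuples for `text`.

    An anchor is the first whole-word, case-insensitive occurrence of a
    known English title; its link target is the title in `target_lang`.
    """
    links = []
    for title, translations in titles.items():
        target = translations.get(target_lang)
        if target is None:
            # No known article in the requested language: nothing to link.
            continue
        pattern = r"\b" + re.escape(title) + r"\b"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            links.append((match.group(0), match.start(), target))
    # Report anchors in the order they appear in the source document.
    return sorted(links, key=lambda link: link[1])


if __name__ == "__main__":
    source = "NTCIR covers Information Retrieval and question answering."
    for anchor, offset, target in discover_links(source, "ja"):
        print(f"{anchor!r} at offset {offset} -> {target}")
```

In this simplification each anchor maps to exactly one target; a system of the kind evaluated in CrossLink would instead rank several candidate target documents per anchor.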