Why Swahili Wordnet?

A Wordnet is a lexical database that records the words of a language and the semantic relationships between them. The first Wordnet was an English one, developed at Princeton University in the United States. It consists of nouns, verbs, adjectives, and adverbs grouped into synsets (short for synonym sets), which are groups of words that are synonyms or share the same meaning. Each synset represents a distinct concept or idea and is associated with a unique identifier and a definition. Synsets are also linked to other synsets through semantic relations such as antonymy (opposite meaning). Homonyms (words with the same spelling and pronunciation but different meanings) end up in separate synsets. Other semantic relations include hyponyms, hypernyms, holonyms and meronyms.
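
To make the idea of a synset concrete, here is a minimal Python sketch of how one could be stored. It is not how Princeton Wordnet is implemented; the `Synset` class, the `sw-n-…` identifiers and the `related` helper are all hypothetical names chosen for illustration, and the four entries are toy data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """One concept: a unique id, the words that express it, and a definition."""
    synset_id: str            # hypothetical id scheme, e.g. "sw-n-0002"
    lemmas: List[str]         # words that share this meaning
    definition: str
    # relation name -> ids of the related synsets (hypernym, meronym, ...)
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Toy entries; ids and relation choices are illustrative assumptions.
mmea = Synset("sw-n-0001", ["mmea"], "plant")
mti = Synset(
    "sw-n-0002",
    ["mti"],
    "mmea wenye shina gumu",
    relations={"hypernym": ["sw-n-0001"], "meronym": ["sw-n-0003", "sw-n-0004"]},
)
mzizi = Synset("sw-n-0003", ["mzizi"], "root")
jani = Synset("sw-n-0004", ["jani"], "leaf")

wordnet = {s.synset_id: s for s in (mmea, mti, mzizi, jani)}


def related(synset_id: str, relation: str) -> List[str]:
    """Return the first lemma of every synset linked by `relation`."""
    targets = wordnet[synset_id].relations.get(relation, [])
    return [wordnet[t].lemmas[0] for t in targets]
```

With this toy data, `related("sw-n-0002", "hypernym")` follows the link from mti up to mmea, and `related("sw-n-0002", "meronym")` lists the parts recorded for mti.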

Since the creation of the English Wordnet, many Wordnets for other languages have been developed. According to the EuroWordNet Tools and Resources Report (Vossen et al. 1998), two approaches are commonly used to construct a Wordnet: the 'expand' and 'merge' models. They are defined as follows:

  1. Expand model: translate the Princeton English Wordnet synsets into the target language using bilingual dictionaries, take over the relations, and revise them. This method was used for the Spanish Wordnet.

  2. Merge model: define synsets and relations in the target language first, then align the resulting Wordnet with the Princeton Wordnet using equivalence relations. This approach was used for the Dutch and Italian Wordnets.
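
The expand approach can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the EuroWordNet tooling: the `english_synsets` slice, the `en_to_sw` dictionary and the `expand` function are hypothetical, and "oak" is deliberately left untranslated to show where manual revision would be needed.

```python
# Toy slice of an English wordnet: synset id -> (lemmas, hypernym id or None).
english_synsets = {
    "en-plant": (["plant"], None),
    "en-tree": (["tree"], "en-plant"),
    "en-oak": (["oak"], "en-tree"),
}

# Toy bilingual dictionary (English -> Swahili); "oak" is missing on purpose.
en_to_sw = {"plant": ["mmea"], "tree": ["mti"]}


def expand(english, bilingual):
    """Translate lemmas, take over the English relations, and flag gaps."""
    target, needs_review = {}, []
    for synset_id, (lemmas, hypernym) in english.items():
        translated = [sw for en in lemmas for sw in bilingual.get(en, [])]
        if not translated:
            # No dictionary entry: a lexicographer has to revise this synset.
            needs_review.append(synset_id)
        target[synset_id] = {"lemmas": translated, "hypernym": hypernym}
    return target, needs_review


swahili, gaps = expand(english_synsets, en_to_sw)
```

The relation structure is copied unchanged from English, which is what makes the expand model quick to apply. It is also why the revision step matters: concepts that exist only in the target language never show up in the English source.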

The first African Wordnet was created in South Africa by the University of South Africa and the South African Center for Digital Language Resources, Pretoria. Known as the African Wordnet (AfWN), it is a multilingual Wordnet covering several South African languages: isiZulu, isiXhosa, Setswana, Sesotho sa Leboa, Tshivenḓa, Sesotho, isiNdebele, Siswati and Xitsonga. This was a good starting point for other African countries to follow suit in developing Wordnets, but not much work has been done so far. This can be attributed to most of the languages being low-resource languages that lack lexical resources.

If you haven't read my last article on Swahili, I highly recommend checking it out. Swahili, though widely spoken in East Africa and gaining interest from other countries, lacks its own Wordnet or, if one exists, it is not open source. A quick search on Google returns few results. What is available to the general public is bilingual translations and online Swahili dictionaries, with little information on synonyms, definitions, usage and antonyms. Caution must be taken when using bilingual translation, as most tools are biased against the low-resource language. Google, for instance, is prone to mistranslating words from English to Swahili because it gives more prominence to the high-resource language.

Creating or open-sourcing a Swahili Wordnet comes with its merits. For instance, it can help reduce linguistic bias in online dictionaries and better reflect linguistic diversity. According to Linguistic Diversity and Bias in Online Dictionaries by Gábor Bella (University of Trento), Khuyagbaatar Batsuren (National University of Mongolia), Temuulen Khishigsuren (National University of Mongolia) and Fausto Giunchiglia (University of Trento), linguistic bias is the preference towards the semantic space of the dominant culture, which prevents the definition of locally specific words or fails to provide the means to connect them meaningfully to other languages. This is evident in the earlier example about Google Translate.

Research has always been central to the development of any language, and Swahili is no exception. A Wordnet plays a crucial role in:

  • Machine translation to other local languages

  • Text analysis and classification, especially sentiment analysis, where there has been an explosion of research papers on hybrid sentiment analysis (rule-based + machine learning), most of them for high-resource languages. At present, Swahili sentiment analysis is done using machine learning alone because there is no Swahili rule-based system that assigns polarity to words. As Swahili data continues to grow and its adoption increases in different parts of Africa, such a system will be necessary if companies want to understand more about their customers.

  • Word sense disambiguation: determining the meaning of a word in context when the word has multiple meanings.
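
As a rough illustration of what a rule-based polarity system for Swahili could look like, the sketch below scores a sentence by summing per-word polarity values. The `POLARITY` lexicon and both functions are hypothetical: the words are ordinary Swahili, but the scores are invented for this example and do not come from any published resource.

```python
# Hypothetical polarity lexicon; the scores are made up for illustration.
POLARITY = {
    "nzuri": 1.0,    # good
    "furaha": 1.0,   # happiness
    "mbaya": -1.0,   # bad
    "huzuni": -1.0,  # sadness
}


def polarity_score(text: str) -> float:
    """Sum the polarity of every known word; unknown words count as 0."""
    tokens = text.lower().split()
    return sum(POLARITY.get(tok.strip(".,!?"), 0.0) for tok in tokens)


def label(text: str) -> str:
    """Map the summed score to a coarse sentiment label."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A real system would need far more than this. Swahili adjectives change form with the noun class, so an exact-match lookup like the one above would miss most inflected forms. A Wordnet helps here because polarity can be attached to synsets rather than to individual spellings.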

As we slowly move into the technological space and adopt online learning in our institutions, relying solely on traditional methods of looking up Swahili words in local dictionaries such as Kamusi is becoming time-consuming for both learners and tutors. The minimum duration for a Swahili lesson in Kenya is around 40 minutes. Lookup speed varies among students; it might take me 30 seconds to 1 minute to locate a word in the Swahili dictionary, depending on where the word falls in the alphabet. With a Wordnet, that time can be reduced to a few seconds. Time is also saved when you want to look at the semantic relations surrounding the word, for instance its synonyms, antonyms, and hypernyms, which are not included in the traditional Kamusi.

Local traditional dictionaries assume that the reader is educated and already has some knowledge of the words. The points raised in the Five Papers on Wordnet by George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller are applicable (using the same word used in the paper for elaboration):

  • The definition gives only the superordinate term and omits essential information. For instance, the word mti (tree in English) is defined as 'mmea wenye shina gumu'. As stated in the paper, such a definition names the superordinate 'mmea' but leaves out a lot: it says nothing about what a tree (mti) has, for instance roots and leaves.

  • They contain no information about coordinate terms. A coordinate term is a word or noun phrase that shares a hypernym with another. For instance, oak and pine are coordinate terms because both have tree as their hypernym.

  • A reader interested in the different kinds of mti (tree) faces a challenge. Suppose the reader wants to know which varieties of mti (tree) are hardwood or softwood: they have no choice but to look for them in other resources, e.g. a Google search, and then go back to the dictionary to look up the words. Only determined readers will do this.
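
The last two gaps, coordinate terms and the different kinds of mti, are exactly the queries a Wordnet answers directly. The following Python sketch shows both lookups over a toy hypernym table. The `HYPERNYM` table and the two functions are hypothetical, and the tree names are only illustrative examples rather than entries from an existing Swahili Wordnet.

```python
# Toy hypernym links (child -> parent); entries are illustrative assumptions.
HYPERNYM = {
    "mti": "mmea",     # tree -> plant
    "mwembe": "mti",   # mango tree -> tree
    "mnazi": "mti",    # coconut palm -> tree
    "mbuyu": "mti",    # baobab -> tree
}


def hyponyms(word):
    """All words that sit below `word` in the hierarchy, at any depth."""
    found = []
    for child, parent in HYPERNYM.items():
        if parent == word:
            found.append(child)
            found.extend(hyponyms(child))
    return found


def coordinate_terms(word):
    """Words that share a direct hypernym with `word`."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return [w for w, p in HYPERNYM.items() if p == parent and w != word]
```

Here `hyponyms("mti")` lists the kinds of tree in one call, and `coordinate_terms("mwembe")` returns the other trees recorded under the same hypernym, with no second resource needed.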

A Swahili Wordnet can be used as a baseline for building Wordnets for other East African languages, especially Bantu languages, which are similar in structure. For instance, "Njoo" in Swahili is equivalent to "Nzoo" in Giriama. A multilingual African Wordnet will help preserve African languages and cultures. Furthermore, it will allow African languages to build their online presence.

This article has just scratched the surface of the importance of a Swahili Wordnet. For a comprehensive understanding of the importance of Wordnet and how it relates to Swahili, I suggest you read 'Wordnet An Electronic Lexical Database'. Although Swahili is still struggling to catch up with other languages in terms of tools, lexical resources and an adequate corpus, small strides matter a lot, such as the availability of bilingual translations to other languages.

References

  1. Wordnet An Electronic Lexical Database

  2. EuroWordNet Tools and Resources Report (Vossen et al 1998)

  3. Linguistic Diversity and Bias in Online Dictionaries by Gábor Bella (University of Trento), Khuyagbaatar Batsuren (National University of Mongolia), Temuulen Khishigsuren (National University of Mongolia) and Fausto Giunchiglia (University of Trento)

  4. Five papers on Wordnet

  5. EuroWordNet Top Ontology (the site from which the cover image was taken)