In Natural Language Processing, stopwords are the most common words in a language, words that add little meaning to a sentence. Removing them improves the ratio of informative words to uninformative ones, which can significantly affect the performance of the whole system. It also shrinks the data, which means faster computation and a more efficient pipeline.
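As a quick illustration, here is a minimal sketch of that filtering step in Python; the short Swahili word list is only a sample for the example, not a complete or standard stopword list.

```python
# Minimal stopword-removal sketch. The set below is an illustrative sample of
# common Swahili function words, not a complete or standard list.
swahili_stopwords = {"na", "ya", "wa", "kwa", "ni", "za", "katika"}

def remove_stopwords(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens found in the stopword set."""
    return [tok for tok in text.lower().split() if tok not in swahili_stopwords]

print(remove_stopwords("Rais wa Tanzania alizungumza na wananchi katika mkutano"))
# ['rais', 'tanzania', 'alizungumza', 'wananchi', 'mkutano']
```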
Removing stopwords is a common pre-processing step in Natural Language Processing, but it must be done with care: not every NLP task requires stopword removal, and some tasks depend on keeping certain stopwords. Unlike high-resource languages such as English, where which stopwords to use under which circumstances is well documented, Swahili lacks such resources.
There are various Swahili stopword lists online and in the literature. PySpark has its own version of Swahili stopwords, built for WorldBrain's use case of stripping words from web pages to make search indexing faster; the same list is also available on Kaggle. A paper published in 2020 by Bernard Masua and Noel Masasi, "Enhancing text pre-processing for the Swahili language", developed a stopword list specifically for pre-processing.
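To make the mechanics concrete, below is a hedged sketch of plugging an explicit Swahili stopword list (for example, the Kaggle/Spark list or the one from the paper) into PySpark's StopWordsRemover; the three-word list is only a placeholder for whichever list you adopt.

```python
# Sketch: apply a Swahili stopword list with PySpark's StopWordsRemover.
# The list is passed in explicitly; the three words shown are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.appName("swahili-stopwords").getOrCreate()

swahili_stopwords = ["na", "ya", "katika"]  # replace with the full list you choose

df = spark.createDataFrame(
    [(["rais", "wa", "tanzania", "alizungumza", "na", "wananchi"],)],
    ["tokens"],
)

remover = StopWordsRemover(
    inputCol="tokens", outputCol="filtered", stopWords=swahili_stopwords
)
remover.transform(df).show(truncate=False)

spark.stop()
```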
The Venn diagram below shows the different types of stopwords.
The diagram above shows that most of the words are those classified in the Swahili literature, which go well beyond the most common words. Relying solely on the Kaggle/Spark list to remove stopwords for a specific NLP task may cause the model to generalize poorly. For instance, for words such as alisema and akasema, the root word can be extracted using lemmatization or stemming rather than discarding the words entirely. Another example is the word sauti in the Kaggle/Spark list; eliminating it from a sentence may change the meaning of the sentence or raise further questions.
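As a toy illustration of the stemming alternative (not a real Swahili stemmer), the sketch below strips the subject prefix and tense marker from the two verb forms mentioned above while leaving other words, such as sauti, untouched.

```python
import re

# Toy illustration only: recover the root of a few Swahili verb forms instead
# of deleting the words. A real system would use a proper Swahili stemmer or
# lemmatizer; this single rule covers only the a-li-sema / a-ka-sema pattern.
def toy_stem(word: str) -> str:
    return re.sub(r"^(a)(li|ka)(?=\w)", "", word)

for w in ["alisema", "akasema", "sauti"]:
    print(w, "->", toy_stem(w))
# alisema -> sema
# akasema -> sema
# sauti -> sauti
```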
The stopwords from the literature should be adopted as the standard for NLP tasks, for the reasons below:
A good collection of Swahili stopwords.
Developed with NLP tasks in mind, unlike the PySpark stopwords.
Most Natural Language Processing libraries have no Swahili stopwords, partly because there are few published resources to justify a proposed list; below is an issue raised in NLTK, referencing the issue Jordan Kalebu opened in 2021.
Though Swahili is a low-resource language, having a standard stopword list available in major NLP libraries is a step towards properly supporting Swahili.
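Until such a list ships with the major libraries, one practical workaround is to keep the chosen Swahili stopword list in a local file and fall back to it when NLTK has nothing to offer; the file name below is hypothetical.

```python
# NLTK does not currently ship Swahili stopwords, so fall back to a local file.
# "swahili_stopwords.txt" is a hypothetical path holding whichever list
# (Kaggle/Spark or the literature) you adopt.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

def load_swahili_stopwords(path: str = "swahili_stopwords.txt") -> set[str]:
    if "swahili" in stopwords.fileids():  # not available today, but future-proof
        return set(stopwords.words("swahili"))
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}
```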