| Title: | Text Mining for Bahasa Malaysia |
|---|---|
| Description: | This package is designed to work with text written in Bahasa Malaysia. It provides functions and data sets that make working with Bahasa Malaysia text easier. For word stemming in particular, it looks up Malay words in a dictionary and then removes the "extra suffix" as explained in Khan, Rehman Ullah, Fitri Suraya Mohamad, Muh Inam UlHaq, Shahren Ahmad Zadi Adruce, Philip Nuli Anding, Sajjad Nawaz Khan, and Abdulrazak Yahya Saleh Al-Hababi (2017) <https://ijrest.net/vol-4-issue-12.html>. The package includes a dictionary of Malay words that may be used to perform word stemming, a dataset of Malay stop words, a dataset of sentiment words and a dataset of normalized words. |
| Authors: | Zahier Nasrudin [aut, cre] (ORCID: <https://orcid.org/0000-0002-7060-776X>) |
| Maintainer: | Zahier Nasrudin <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.3 |
| Built: | 2026-05-13 07:19:49 UTC |
| Source: | https://github.com/zahiernasrudin/malaytextr |
Data of Malay root words
malayrootwords
A tibble with 4310 rows and 2 variables:
Col Word (dbl): Malay word
Root Word (dbl): Malay root word
Malaysia Politic Tweets Sentiment Dataset (Positive, Negative or Neutral)
malaysia_politic_sentiment
A tibble with 71 rows and 3 variables:
id (dbl): Unique identifier assigned to each tweet
text (dbl): Tweet related to Malaysian politics
Sentiment (dbl): Sentiment classification assigned to each tweet
Data of Malay stop words
malaystopwords
A tibble with 512 rows and 1 variable:
stopwords (dbl): Malay stop words
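As an illustration of how a one-column stop word table like this is typically used, the sketch below filters tokens against a small stand-in table. `toy_stopwords` and its three rows are hypothetical and are not taken from `malaystopwords`; only the column name `stopwords` follows the description above.

```r
# Hypothetical stand-in with the same column name as malaystopwords.
# The rows are illustrative, not copied from the package data.
toy_stopwords <- data.frame(
  stopwords = c("yang", "dan", "di"),
  stringsAsFactors = FALSE
)

text <- "saya dan kamu di rumah yang besar"

# Split on whitespace, then drop every token found in the stop word column.
tokens <- strsplit(tolower(text), "\\s+")[[1]]
kept <- tokens[!tokens %in% toy_stopwords$stopwords]
cleaned <- paste(kept, collapse = " ")
cleaned
```

With the package installed, replacing `toy_stopwords` with `malaystopwords` should work the same way, assuming the column is named as documented.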
Data of Malay normalized words
normalized
A tibble with 153 rows and 2 variables:
Col Word (dbl): Word
Normalized Word (dbl): Normalized word
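A two-column table of this shape can serve as a lookup for replacing informal spellings. The sketch below is one possible approach using base R `match()`. `toy_normalized`, its rows, and the helper `normalize_tokens()` are hypothetical and are not part of the package; only the column names follow the description above.

```r
# Hypothetical stand-in with the same column names as the normalized dataset.
# The two rows are illustrative, not copied from the package data.
toy_normalized <- data.frame(
  `Col Word` = c("yg", "tak"),
  `Normalized Word` = c("yang", "tidak"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Replace each token that appears in the lookup; leave other tokens unchanged.
normalize_tokens <- function(tokens, lookup) {
  idx <- match(tokens, lookup[["Col Word"]])
  ifelse(is.na(idx), tokens, lookup[["Normalized Word"]][idx])
}

normalize_tokens(c("saya", "tak", "tahu", "yg", "mana"), toy_normalized)
```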
Remove URL links
remove_url(string)

| Argument | Description |
|---|---|
| string | String to change |

remove_url() removes URL links from a string.
Returns a string with the URL links removed.
    x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
    remove_url(x)
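For readers curious how URL removal of this kind can be done, here is a minimal base R sketch using a regular expression. It is not the package's implementation: `strip_urls_sketch()` is a hypothetical helper, and the pattern (anything starting with `http://` or `https://` up to the next whitespace) is an assumption that may differ from what `remove_url()` matches.

```r
# Hypothetical helper, not the package's remove_url().
strip_urls_sketch <- function(string) {
  # Drop anything that looks like an http(s) URL, up to the next whitespace.
  out <- gsub("https?://\\S+", "", string, perl = TRUE)
  # Collapse the whitespace left behind and trim both ends.
  out <- gsub("\\s+", " ", out, perl = TRUE)
  trimws(out)
}

x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
strip_urls_sketch(x)
```

The whitespace clean-up is a design choice of this sketch; the source does not say whether `remove_url()` trims the result.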
Data of Sentiment Words (Positive or Negative)
sentiment_general
A tibble with 1428 rows and 2 variables:
Word (dbl): Sentiment word
Sentiment (dbl): Sentiment
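A word-level lexicon like this is commonly used for simple lexicon-based scoring. The sketch below counts positive minus negative matches for a set of tokens. `toy_lexicon`, its rows, and `score_tokens()` are hypothetical, and the label values `"Positive"` and `"Negative"` are assumed from the dataset title; the actual values stored in `sentiment_general` may be spelled differently.

```r
# Hypothetical stand-in with the same column names as sentiment_general.
# Rows and label spellings are assumptions, not copied from the package data.
toy_lexicon <- data.frame(
  Word = c("baik", "hebat", "buruk"),
  Sentiment = c("Positive", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

# Score = number of positive tokens minus number of negative tokens.
# Tokens missing from the lexicon contribute nothing.
score_tokens <- function(tokens, lexicon) {
  label <- lexicon$Sentiment[match(tokens, lexicon$Word)]
  sum(label == "Positive", na.rm = TRUE) - sum(label == "Negative", na.rm = TRUE)
}

score_tokens(c("servis", "baik", "tapi", "makanan", "buruk"), toy_lexicon)
score_tokens(c("hebat", "baik"), toy_lexicon)
```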
Malaytextr function to stem Malay words
stem_malay(word, dictionary, col_feature1, col_dict1, col_dict2, Word)

| Argument | Description |
|---|---|
| word | A data frame or a character vector |
| dictionary | A data frame with a column of words to be stemmed and a column of root words |
| col_feature1 | Column that contains the words to be stemmed |
| col_dict1 | Column that will be used for matching |
| col_dict2 | Column that contains the root words |
| Word | Deprecated. Please use |

An object of class function of length 1.
stem_malay() looks up Malay words in a dictionary, then removes the "extra suffix" as explained by Khan et al. (2017), then the prefix and, lastly, the suffix.
Returns a data frame with the following properties:
Col Word: Renamed input from word
Root Word: An additional column which contains the word(s) after being stemmed.
Khan, Rehman Ullah, Fitri Suraya Mohamad, Muh Inam UlHaq, Shahren Ahmad Zadi Adruce, Philip Nuli Anding, Sajjad Nawaz Khan, and Abdulrazak Yahya Saleh Al-Hababi. 2017. "Malay Language Stemmer."
    # Specifying a character vector &
    # use a dictionary from malaytextr package
    stem_malay(word = "banyaknya", dictionary = malayrootwords)

    # A data frame,
    # use a dictionary from malaytextr package.
    # With a data frame, you will need to specify the column to be stemmed
    x <- data.frame(text = c("banyaknya", "sangat", "terkedu", "pengetahuan"))
    stem_malay(word = x, dictionary = malayrootwords, col_feature1 = "text")
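The dictionary-then-affix order described above can be sketched in base R as follows. This is a toy illustration and not the package's algorithm. `stem_sketch()` and `toy_dict` are hypothetical, and the affix lists are assumptions chosen for the example rather than the rules from Khan et al. (2017). The sketch also assumes that affix stripping applies only to words not found in the dictionary, which the source does not state.

```r
# Hypothetical dictionary with the same column names as malayrootwords.
# The rows are illustrative, not copied from the package data.
toy_dict <- data.frame(
  `Col Word` = c("banyaknya", "pengetahuan"),
  `Root Word` = c("banyak", "tahu"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

stem_sketch <- function(words, dict) {
  # Step 1: dictionary lookup.
  hit <- match(words, dict[["Col Word"]])
  found <- !is.na(hit)
  root <- words
  root[found] <- dict[["Root Word"]][hit[found]]

  # Steps 2 to 4 (assumed to apply only to words missing from the dictionary):
  # strip an "extra suffix", then a prefix, then a suffix.
  # All three affix lists are assumptions made for this sketch.
  miss <- !found
  root[miss] <- sub("(lah|kah|nya)$", "", root[miss])
  root[miss] <- sub("^(ber|ter|me|di)", "", root[miss])
  root[miss] <- sub("(kan|an|i)$", "", root[miss])

  data.frame(
    `Col Word` = words,
    `Root Word` = root,
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

stem_sketch(c("banyaknya", "terkedu", "sangat"), toy_dict)
```

Blind affix stripping like this over-stems many real words (for example, any root that happens to end in "kan"), which is one reason a dictionary lookup comes first.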