Title: | Text Mining for Bahasa Malaysia |
---|---|
Description: | Tools for working with text written in Bahasa Malaysia. The package provides functions and data sets that make Bahasa Malaysia text easier to process. For word stemming in particular, it looks up the Malay words in a dictionary and then removes the "extra suffix" as explained in Khan, Rehman Ullah, Fitri Suraya Mohamad, Muh Inam UlHaq, Shahren Ahmad Zadi Adruce, Philip Nuli Anding, Sajjad Nawaz Khan, and Abdulrazak Yahya Saleh Al-Hababi (2017) <https://ijrest.net/vol-4-issue-12.html>. The package includes a dictionary of Malay words that may be used for word stemming, a dataset of Malay stop words, a dataset of sentiment words, and a dataset of normalized words. |
Authors: | Zahier Nasrudin [aut, cre] |
Maintainer: | Zahier Nasrudin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.3 |
Built: | 2025-02-13 04:58:26 UTC |
Source: | https://github.com/zahiernasrudin/malaytextr |
Data of Malay root words
malayrootwords
A tibble with 4310 rows and 2 variables:
Col Word
chr Malay word
Root Word
chr Malay root word
Malaysia Politic Tweets Sentiment Dataset (Positive, Negative or Neutral)
malaysia_politic_sentiment
A tibble with 71 rows and 3 variables:
id
dbl Represents a unique identifier assigned to each tweet
text
chr Tweet related to Malaysian politics
Sentiment
chr The sentiment classification assigned to each tweet (positive, negative or neutral)
Data of Malay stop words
malaystopwords
A tibble with 512 rows and 1 variable:
stopwords
chr Malay stop words
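A stop-word table like this is typically used to filter tokens before counting or modelling. The following is a minimal base R sketch of that idea. The three-row `stop_tbl` is a hypothetical stand-in that only mirrors the single `stopwords` column of `malaystopwords`; its entries are illustrative and not copied from the dataset.

```r
# Hypothetical stand-in for malaystopwords: one column named `stopwords`.
stop_tbl <- data.frame(stopwords = c("yang", "dan", "di"), stringsAsFactors = FALSE)

# Lower-case the sentence and split it into tokens on whitespace.
sentence <- "Saya dan kawan makan di kedai yang baru"
tokens <- strsplit(tolower(sentence), "\\s+")[[1]]

# Keep only the tokens that do not appear in the stop-word column.
kept <- tokens[!tokens %in% stop_tbl$stopwords]
print(kept)
```

With the package loaded, `malaystopwords$stopwords` could presumably be used in place of `stop_tbl$stopwords`.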
Data of Malay normalized words
normalized
A tibble with 153 rows and 2 variables:
Col Word
chr Word
Normalized Word
chr Normalized word
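A two-column table of this shape lends itself to a simple lookup-and-replace step. The sketch below uses base R only and assumes that `Col Word` holds the informal spelling and `Normalized Word` the standard form. The `norm_tbl` rows are hypothetical examples, not entries taken from `normalized`.

```r
# Hypothetical stand-in for `normalized`, reusing its two column names.
norm_tbl <- data.frame(
  `Col Word` = c("x", "yg", "sy"),
  `Normalized Word` = c("tidak", "yang", "saya"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

tokens <- c("sy", "x", "tahu", "yg", "mana")

# match() returns the table row for each token, or NA when there is no entry.
idx <- match(tokens, norm_tbl[["Col Word"]])

# Replace matched tokens with their normalized form; leave the rest unchanged.
normalized_tokens <- ifelse(is.na(idx), tokens, norm_tbl[["Normalized Word"]][idx])
print(normalized_tokens)
```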
Remove URL links
remove_url(string)
string | String to change |
remove_url() removes URL link(s) from a string.
Returns a string with URL links removed
x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
remove_url(x)
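For readers curious how URL stripping of this kind can work, here is a minimal base R sketch. It is not the implementation of `remove_url()`; `strip_urls()` is a hypothetical helper, and the whitespace clean-up it performs is an assumption, so its output may differ from what the package returns.

```r
# Illustrative helper, not part of malaytextr.
strip_urls <- function(string) {
  # Drop every run of non-space characters that starts with http:// or https://
  no_url <- gsub("https?://\\S+", "", string, perl = TRUE)
  # Collapse the whitespace left behind and trim both ends
  trimws(gsub("\\s+", " ", no_url, perl = TRUE))
}

x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
print(strip_urls(x))
```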
Data of Sentiment Words (Positive or Negative)
sentiment_general
A tibble with 1428 rows and 2 variables:
Word
chr Sentiment word
Sentiment
chr Sentiment (positive or negative)
Malaytextr function to stem Malay words
stem_malay(word, dictionary, col_feature1, col_dict1, col_dict2, Word)
word | A data frame, or a character vector |
dictionary | A data frame with a column of words to be stemmed and a column of root words |
col_feature1 | Column in word that contains the words to be stemmed |
col_dict1 | Column in dictionary to match the words against |
col_dict2 | Column in dictionary that contains the root words |
Word | Deprecated. Please use |
An object of class function of length 1.
stem_malay() first looks up the Malay words in a dictionary, then removes the "extra suffix" as explained by Khan et al. (2017), then the "prefix", and lastly the "suffix".
Returns a data frame with the following properties:
Col Word
: Renamed input from word
Root Word
: An additional column which contains the word(s) after being stemmed.
Khan, Rehman Ullah, Fitri Suraya Mohamad, Muh Inam UlHaq, Shahren Ahmad Zadi Adruce, Philip Nuli Anding, Sajjad Nawaz Khan, and Abdulrazak Yahya Saleh Al-Hababi. 2017. "Malay Language Stemmer."
# Character vector input, using the dictionary from the malaytextr package
stem_malay(word = "banyaknya", dictionary = malayrootwords)

# Data frame input, using the dictionary from the malaytextr package.
# With a data frame, you need to specify the column to be stemmed.
x <- data.frame(text = c("banyaknya", "sangat", "terkedu", "pengetahuan"))
stem_malay(word = x, dictionary = malayrootwords, col_feature1 = "text")
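The order described for stem_malay() (dictionary lookup, then "extra suffix", prefix, suffix) can be illustrated with a rough base R sketch. This is not the package's code: `stem_sketch()`, the one-row `dict`, and the three affix patterns are all hypothetical and far from complete. It also assumes that only words missing from the dictionary go through affix stripping.

```r
# Hypothetical stand-in dictionary with the malayrootwords column names.
dict <- data.frame(
  `Col Word` = "pengetahuan",
  `Root Word` = "tahu",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Illustrative affix patterns (assumptions, not the rules of Khan et al.).
extra_suffix <- "(nya|lah|kah)$"
prefix <- "^(ter|ber|me|di)"
suffix <- "(kan|an|i)$"

stem_sketch <- function(words, dict) {
  idx <- match(words, dict[["Col Word"]])
  found <- !is.na(idx)
  root <- words
  # Step 1: words found in the dictionary take their listed root.
  root[found] <- dict[["Root Word"]][idx[found]]
  # Steps 2 to 4: for the remaining words, strip the extra suffix,
  # then the prefix, then the suffix, in that order.
  rest <- root[!found]
  rest <- sub(extra_suffix, "", rest)
  rest <- sub(prefix, "", rest)
  rest <- sub(suffix, "", rest)
  root[!found] <- rest
  data.frame(`Col Word` = words, `Root Word` = root,
             check.names = FALSE, stringsAsFactors = FALSE)
}

out <- stem_sketch(c("banyaknya", "sangat", "terkedu", "pengetahuan"), dict)
print(out)
```

Blind pattern stripping like this over-stems easily (a root such as "makan" would lose its final "an"), which is one reason a dictionary lookup comes first.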