---
title: "malaytextr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{malaytextr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(malaytextr)
```

## Examples

### Malay root words

There is a data frame of Malay root words that can be used as a dictionary:

```{r}

head(malayrootwords)

```


### Stem Malay words

`stem_malay()` first looks up root words in a dictionary, for which the `malayrootwords` data frame can be used. It then removes the "extra suffix", the "prefix" and lastly the "suffix".

To stem a single word such as "banyaknya", pass it as `word`. The function returns a data frame with the word "banyaknya" and the stemmed word "banyak":

```{r}

stem_malay(word = "banyaknya", dictionary = malayrootwords)

```
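
The order of operations described above (dictionary lookup, then "extra suffix", "prefix" and "suffix" removal) can be illustrated with a small base R sketch. This is not the implementation used by `stem_malay()`: the function `toy_stem()`, the tiny dictionary and the affix patterns are all hypothetical and exist only to show the sequence of steps.

```{r}
# Toy illustration only: the dictionary and affix patterns below are
# hypothetical and are NOT taken from malaytextr.
toy_dictionary   <- c("banyak", "tahu", "kedu")
toy_extra_suffix <- "(nya|lah|kah)$"
toy_prefix       <- "^(penge|ter|ber)"
toy_suffix       <- "(an|kan|i)$"

toy_stem <- function(word) {
  # Step 1: a word that is already a root in the dictionary is returned as is
  if (word %in% toy_dictionary) return(word)
  # Step 2: strip the extra suffix, then the prefix, then the suffix
  word <- sub(toy_extra_suffix, "", word)
  word <- sub(toy_prefix, "", word)
  word <- sub(toy_suffix, "", word)
  word
}

toy_stem("banyaknya")
toy_stem("pengetahuan")
```

A real stemmer needs far larger affix lists and safeguards against over-stripping, which is why using `stem_malay()` with a full dictionary is preferable to a hand-rolled version like this.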

To stem words in a data frame:

1. Specify the data frame
2. Specify the dictionary
3. Specify the column that needs to be stemmed

```{r}

x <- data.frame(text = c("banyaknya","sangat","terkedu", "pengetahuan"))

stem_malay(word = x, 
          dictionary = malayrootwords, 
          col_feature1 = "text")


```

### Remove URLs

`remove_url()` removes all URLs found in a string:

```{r}

x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")

remove_url(x)


```

### Malay stop words

There is a data frame of Malay stop words:

```{r}

head(malaystopwords)

```
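
A typical use of a stop word list is to drop those words from tokenised text. The following is a minimal sketch that assumes the stop words are stored in the first column of `malaystopwords`; check `names(malaystopwords)` if the layout differs. The example sentence is made up for illustration.

```{r}
library(malaytextr)  # already attached above; repeated so the chunk stands alone

# Assumption: the stop words are in the first column of malaystopwords
stop_words <- as.character(malaystopwords[[1]])

# Split an example sentence into lower-case tokens on whitespace
sentence <- "saya sangat suka makan nasi dan ayam"
tokens <- strsplit(tolower(sentence), "\\s+")[[1]]

# Keep only the tokens that do not appear in the stop word list
tokens[!tokens %in% stop_words]
```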

### Sentiment lexicon

This lexicon contains words that have been labelled as positive or negative. It is useful for tasks such as sentiment analysis, which involves determining the overall sentiment expressed in a piece of text. To use the lexicon, process the text and check each word against the lexicon to determine its sentiment. Note that this sentiment lexicon was built from a general corpus sourced from news articles.

```{r}

head(sentiment_general)

```
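
Checking each word against the lexicon can be sketched in base R as follows. To stay independent of the real column names, the example uses a small made-up lexicon: `toy_lexicon` and its columns `word` and `sentiment` are hypothetical, so adapt them to the columns shown by `head(sentiment_general)` before applying the same idea to the packaged data.

```{r}
# Toy lexicon: the column names `word` and `sentiment` are hypothetical
# and may differ from those in sentiment_general.
toy_lexicon <- data.frame(
  word = c("baik", "gembira", "buruk", "sedih"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Split an example sentence into lower-case tokens on whitespace
sentence <- "filem itu baik tetapi pengakhirannya sedih dan buruk"
tokens <- strsplit(tolower(sentence), "\\s+")[[1]]

# Look up each token; tokens missing from the lexicon become NA
labels <- toy_lexicon$sentiment[match(tokens, toy_lexicon$word)]

# Score the sentence as (number of positive words) - (number of negative words)
n_positive <- sum(labels == "positive", na.rm = TRUE)
n_negative <- sum(labels == "negative", na.rm = TRUE)
n_positive - n_negative
```

A simple count like this ignores negation and word order, so it is best treated as a rough first pass. Stemming the tokens with `stem_malay()` before the lookup may increase the number of matches.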