Diachronic Corpora as a Tool for Tracing Etymological Information of Indonesian-Malay Lexicon

Kamal Yusuf, Dewi Puspita

The Indonesian lexicon comprises numerous loanwords, some of which have existed since the 7th century. The large number of loanwords is the reason why many dictionaries of Indonesian etymology available today contain merely the origin of the words. Yet there are several other aspects of a word's etymology that can be studied and presented in a dictionary, such as changes in a word's form and in its meaning. This article seeks to demonstrate the use of diachronic corpora in identifying the etymological information of Malay words and to determine the semantic changes that Malay words undergo over time until they become part of the Indonesian lexicon. More specifically, two selected Malay words were examined: bersiram and peraduan. Using data from the Malay Concordance Project and the Leipzig Corpora, this corpus-based study collects etymological information on Indonesian lexical items that originated from Malay. The findings show that the examined words have changed in meaning through generalization and metaphor. However, unlike the word bersiram, the change that the word peraduan has undergone occurs only at the semantic level. This information can ultimately be used as data for a more comprehensive Indonesian etymological dictionary. Drawing on corpus analysis, this paper addresses the importance of using diachronic corpora in tracing word origins.


INTRODUCTION
For most language users, etymological information is perceived merely as evidence of where a word originated, especially when a language has absorbed many loanwords; Indonesian is one example (Russel et al., 2007; Tadmor, 2009). Kridalaksana (2001) pointed out that the Indonesian etymology dictionaries compiled and available today are merely inventories of word origins, which need to be followed up with research and interpretation from various aspects. This is in line with Durkin (2009) and Liberman (2009), who stated that the study of etymology relates to the history of a word, the history of its meaning, its formal history, or the history of its spread from one language to another, or from one group to another.
Accordingly, it is important to point out that there are at least six kinds of etymological information that can be traced for a word: (1) the year of usage, (2) the initial form (morphology) and the initial sound (phonology), (3) the donor language (for loanwords), (4) the person who first coined the word, (5) the initial meaning, and (6) the change of meaning. Therefore, an etymological dictionary should not only contain information on a word's origin but also give a more clear-cut description of the word.
Another shortcoming of the Indonesian etymology dictionaries available today is the scarcity of etymological information on words that originated from Malay. Malay is the root of Indonesian (Teeuw, 1967; Andaya, 2001; Hoogervorst, 2015). In the early centuries, the language spoken in some parts of the Indonesian archipelago and the Malay Peninsula might have been the same. Over time, many social and political events affected the regions and caused the language to change and diverge. Information on the changes, whether phonological, morphological, semantic, or syntactic, that have occurred in Malay words that are now part of the Indonesian vocabulary is part of the etymological information (Mohamed & Yusoff, 2014).
A number of studies have previously been conducted on etymology and semantic change in various languages, among them Wijaya & Yeniterzi (2011), Yurrivna (2014), Jatowt & Duh (2014), Hasan (2015), and Altakhaineh (2018). However, to date, few works have examined how etymological information can be approached using corpora, especially with regard to Malay and Indonesian. Wijaya & Yeniterzi (2011) identified the semantic change of words over centuries using computational linguistic methods. They applied Topics-Over-Time (TOT) and k-means clustering to the Google Books Ngram dataset. They showed how clustering the words that co-occur with an entity of interest in 5-grams can shed light on the nature of the change that the entity undergoes and identify the period in which the change occurs. Yurrivna (2014) only classified the changes in meaning that occur in English medical terms. The classes of change in question are specialization, generalization, pejoration or amelioration, and metaphor and metonymy.
Jatowt & Duh (2014) explored digitized historical texts, as we do in this study. The difference is that Jatowt & Duh used NLP (Natural Language Processing) methods, whereas we used a corpus-based method. Another study, by Hasan (2015), dealt with the semantic change of borrowed words, especially Arabic words in Bengali. Research of this kind on Indonesian is plentiful; in fact, most etymological research on Indonesian is about borrowed words. Altakhaineh (2018) examined the semantic change of positive vs. negative adjectives in Modern English. He compared the meanings of those adjectives in dictionaries and then looked up their frequency of use in a corpus. He wanted to see whether an adjective had been negative or positive from the beginning or had turned negative or positive for some reason. The research object of those studies is English vocabulary.
Until now, no theory has specifically addressed the search for etymological information through diachronic corpora. Existing theories still treat etymology and corpus linguistics separately. Of the two, the theory referred to in this study is that of Collins (2003).
This paper offers a tool that can be used to trace etymological information, especially changes in meaning. The tool that can provide a large collection of texts from past centuries for examination is the diachronic corpus (de Melo, 2014). According to Allan & Robinson (2012), the use of corpora is the state of the art in the study of historical semantics, which is part of the study of etymology. Malay is fortunate to have the Malay Concordance Project (MCP), developed by the Australian National University (Proudfoot, 1991; Gallop, 2013). It consists of classical Malay manuscripts from the 14th to the 20th centuries that can be used to examine how a Malay word was used during that time (Johary & Rahim, 2014). The present study therefore seeks to explore the etymological information of Malay words that have become part of the Indonesian lexicon and are still used today, by comparing the MCP with a more recent corpus from the 21st century.
There are thousands of original Malay-Indonesian vocabulary items, and it would take a very long time to analyze all of them. For this reason, as a preliminary study, the current research was conducted using data samples. Two samples were chosen to be presented in this paper: bersiram and peraduan. The sample selection process is explained in the research method section.
Thus, the aims of the current study are threefold: (1) to identify the etymological information of the Malay words bersiram and peraduan from diachronic corpora, (2) to investigate what kinds of changes those Malay words have undergone over time until they became part of the Indonesian lexicon, and (3) to demonstrate the use of diachronic corpora as a tool for examining the etymology of the Malay-Indonesian lexicon.

RESEARCH METHOD
This research is corpus-based. To show that etymological information can be collected from diachronic corpora, this study employed two corpora arranged in chronological order. We began with the methodological issue of selecting suitable corpora available online and found two major collections for Malay and Indonesian. The first corpus is the MCP, which comprises 5.8 million words (including 140,000 verses) from more than 165 sources of pre-modern Malay written text. The oldest text dates from 1302 and the most recent from 1950 (Gallop, 2013; Bakar, 2020). The second is the Indonesian corpus of the Leipzig Corpora (Richter et al., 2006; Biemann et al., 2007). Both corpora are available online, and together they show Malay words used in context from the 14th to the 21st century.
The search results for the investigated words from the two corpora were then analyzed qualitatively. The changes that each word has undergone were examined through the concordance lines and the word's collocations. Regarding the data, we selected two samples from a number of Malay-Indonesian words to be investigated further as a model study in this paper, i.e. bersiram and peraduan. Those words are taken from the list of honored words in Kamus Besar Bahasa Indonesia (KBBI).
Honored words are words that are used in formal situations and only for selected and respected people. There are 26 words in that list (Table 1). However, not all of them originate from Malay; some originate from Sanskrit and Old Javanese. Most importantly, not all of them have undergone changes in meaning. Among the few Malay words that have, we found bersiram and peraduan.
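The examination of concordance lines and collocations described in this section can be illustrated with a minimal Python sketch. It is not the tooling used by the MCP or the Leipzig Corpora; the function names (tokenize, kwic, collocates), the window sizes, and the simple tokenizer are assumptions made for illustration. The three sample sentences are examples (9), (26), and (28) quoted in this paper.

```python
import re
from collections import Counter

# Sample sentences (9), (26) and (28) as quoted in this paper.
SAMPLE = [
    "maka bagindapun pergilah bersiram ke kolam itu.",
    "Orang-orang yang dekat di hati saya, satu persatu mulai beranjak ke peraduan.",
    "Ketika matahari telah kembali ke peraduan, malam pun tiba.",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphenated forms kept whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def kwic(sentences, keyword, width=4):
    """Return keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines

def collocates(sentences, keyword, window=4, stopwords=frozenset()):
    """Count the words found within `window` tokens on either side of the keyword."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == keyword:
                span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in span if t not in stopwords)
    return counts

if __name__ == "__main__":
    for left, key, right in kwic(SAMPLE, "peraduan"):
        print(f"{left:>30} [{key}] {right}")
    print(collocates(SAMPLE, "peraduan", stopwords={"ke"}).most_common(5))
```

On a real corpus the collocate counts would still have to be read qualitatively, as in this study: the counts only point to candidate contexts (for example, royal versus non-royal subjects), and the interpretation of meaning is left to the analyst.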

RESULTS & DISCUSSION
Given the above description, we present a model for using diachronic corpora to discover etymological information on the Malay-Indonesian lexicon. We selected bersiram and peraduan and traced their use in sentences retrieved from the two corpora, as presented below.

The Semantic Change of ''bersiram''
After its independence in 1945, Indonesia became a republic, and the royal system is no longer used. For that reason, the frequency of use of the word bersiram might also have decreased. However, in a more recent corpus such as the Indonesian corpus in the Leipzig Corpora, we can still find the word bersiram used in many different contexts (see Figure 1).
'Chicken covered with Monterey Jack-Cheddar cheese sauce.'
We can see from the two diachronic corpora that the meaning of the word bersiram has changed. The word originally had only one meaning and was used only within a certain circle; after the twentieth century, its meaning widened to include a figurative sense and moved from the specific to the more general.
The graph in Figure 1 is auto-generated from the frequency of co-occurrences. The words darah and cahaya do not appear on the graph because they occur less frequently than the other words.
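Why low-frequency collocates can be absent from a frequency-based co-occurrence graph can be shown with a small Python sketch. This is an illustration only: the procedure actually used to generate the Leipzig Corpora graph is not described in this paper, so the raw-count threshold (min_freq) is an assumption, and the tokenized sentences below are hypothetical toy data, not lines taken from either corpus.

```python
from collections import Counter

# Hypothetical toy data: each inner list stands for one tokenized sentence.
TOY_SENTENCES = [
    ["ayam", "bersiram", "saus", "keju"],
    ["roti", "bersiram", "susu"],
    ["nasi", "bersiram", "saus"],
    ["tanah", "bersiram", "darah"],
    ["kota", "bersiram", "cahaya"],
    ["roti", "bersiram", "saus", "susu"],
]

def cooccurrence_counts(tokenized_sentences, node):
    """Count, per word, the number of sentences in which it co-occurs with `node`."""
    counts = Counter()
    for tokens in tokenized_sentences:
        if node in tokens:
            counts.update(set(tokens) - {node})
    return counts

def graph_neighbours(counts, min_freq):
    """Keep only the co-occurring words that reach the assumed frequency threshold."""
    return sorted(word for word, freq in counts.items() if freq >= min_freq)

if __name__ == "__main__":
    counts = cooccurrence_counts(TOY_SENTENCES, "bersiram")
    print(graph_neighbours(counts, min_freq=2))
```

With this toy input, darah and cahaya each co-occur with bersiram only once, so they fall below the threshold and would not be drawn, even though they are attested. A reader of such a graph therefore still needs the concordance lines to find rarer figurative uses.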
Furthermore, it is not only the semantic aspect of the word bersiram that has changed over time. Another linguistic aspect that has changed is the syntactic one, specifically the word class. Bersiram is by nature an intransitive verb; in Indonesian grammar, the prefix ber- forms intransitive verbs, as can be seen in sentence (9):
(9) ... maka bagindapun pergilah bersiram ke kolam itu.
The phrase ke kolam itu in sentence (9) is not an object but an adverbial of place. No object is needed after the word bersiram in that sentence.
However, in its figurative meaning, the verb bersiram has become transitive. Below is a concordance line of the verb bersiram in its figurative meaning followed by its object (in upright letters).
'And, as a dessert, please order Roti Cane Gula or Roti Cane Susu, watered with condensed milk.'
Objects in the above sentences are mandatory because without them the sentences would be incomplete and meaningless.

Diachronic Use of ''peraduan''
Another example that we would like to present for tracing semantic change and etymological information using the MCP is the word peraduan. Like bersiram, it is a classical high Malay word used strictly for the royal family. It means 'bed' or 'bedroom'.
Compared to bersiram, peraduan appears more frequently in the MCP. It appeared 357 times in 31 old manuscripts dated from the 1370s to the 1950s. The word can be found 32 times in the manuscript Syair Siti Zubaidah Perang Cina, while in sixteen manuscripts it occurs only once (see Table 3).
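Frequency figures of this kind (total hits, number of manuscripts, manuscripts with a single hit, date range) can be derived mechanically once each concordance line is tagged with its source and date. The following Python sketch assumes a hypothetical input format of (source title, year) pairs; the function name summarize_hits and the toy records are invented for illustration and do not reproduce the MCP data or its export format.

```python
from collections import Counter

def summarize_hits(hits):
    """Summarize concordance hits given as (source_title, year) pairs, one per hit."""
    hits = list(hits)
    if not hits:
        raise ValueError("no hits to summarize")
    per_source = Counter(title for title, _ in hits)
    years = [year for _, year in hits]
    # Century number: 1301-1400 -> 14, 1401-1500 -> 15, and so on.
    per_century = Counter((year - 1) // 100 + 1 for year in years)
    return {
        "total": len(hits),
        "sources": len(per_source),
        "single_hit_sources": sum(1 for n in per_source.values() if n == 1),
        "earliest": min(years),
        "latest": max(years),
        "most_frequent_source": per_source.most_common(1)[0],
        "per_century": dict(per_century),
    }

if __name__ == "__main__":
    # Hypothetical toy records, not actual MCP figures.
    toy_hits = [
        ("Manuscript A", 1380), ("Manuscript A", 1380), ("Manuscript A", 1380),
        ("Manuscript B", 1650),
        ("Manuscript C", 1900),
    ]
    print(summarize_hits(toy_hits))
```

Binning hits by century is one simple way to place a word's attestations on a timeline; it gives only an approximate era, since manuscript dates are themselves often estimates.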

The Semantic Change of ''peraduan''
The first type has the same meaning and usage as in the previous corpus, namely the bed or bedroom of the royal family. In this first type, as shown in sentences (21), (22), and (23), the word peraduan collocates with raja (king) and kerajaan (royal).
'Meanwhile, the King had slept in the royal bed.'
(22)
In the second type of usage, the word peraduan, as found in the Leipzig Corpora, carries the same meaning but is used for common people.
'That morning, heavy rain was pouring over Surabaya and its surrounding area, which made me too lazy to get out of bed.'
(26) Orang-orang yang dekat di hati saya, satu persatu mulai beranjak ke peraduan.
'The people I love, one by one, began to go to bed.'
The common word for 'bed' in Indonesian is tempat tidur or ranjang. However, in sentences (24), (25), and (26), whose contexts are not related to the royal family, the word peraduan is used instead of tempat tidur or ranjang. This usage shows that the meaning of peraduan has been generalized. Since there is no longer a king or royal family in Indonesia, the word has become available to everyone.
The third type is the occurrence of the word in a figurative meaning. In this type of usage, the word peraduan mainly collocates with matahari 'sun' (see Figure 2), as in sentences (27), (28), and (29), and with sang surya, which also means 'sun', in (30). In those sentences, the sun is depicted as if it goes to bed to rest, so that day turns into night, or gets out of bed and starts to shine.
'The sun goes down to its resting place and the night begins to climb the earth.'
(28) Ketika matahari telah kembali ke peraduan, malam pun tiba.
'One of the reasons is to see the sun directly as it rises from its bed on the eastern horizon.'
In Indonesian, there is a metaphor that equates the sun with the king of the day and the moon with the goddess of the night (matahari = raja siang; bulan = dewi malam). Without the sun there would be no daylight. In some cultures, there are also tribes who regard the sun as a god or as the giver of life, just like a king. Because of this metaphor and belief, some of the vocabulary reserved for kings is also applied to the sun.
Finally, the different types of usage of peraduan found in the Leipzig Corpora show that the word has changed in meaning through generalization and metaphor. However, unlike the word bersiram, the change that the word peraduan has undergone occurs only at the semantic level. The other linguistic aspects of the word are not affected.

CONCLUSION
Using corpora, this paper identified the etymological information of the example words bersiram and peraduan to determine to what extent these words have changed over time. Drawing on data obtained from the analysis, the findings showed information as follows.
The etymological information in the dictionary can also be presented in narrative form, so that the reader gets a clearer picture of the semantic change (Bochkarev et al., 2020). This paper has demonstrated that diachronic corpora can be a useful tool in the investigation of etymological information, especially for finding changes in meaning. Corpora arranged chronologically can also tell us the approximate time of change. Although the precise year of change remains unknown, they can at least reveal in which era the change happened. Collecting etymological information from diachronic corpora, however, can only be done for lexical items in written texts. Furthermore, information about the use of the words in spoken form, whether or not they are used in the same register with the same meaning, remains undisclosed. We found, however, that this does not lessen the effectiveness of diachronic corpora as a tool for collecting etymological information. Finally, this paper could contribute to a model for developing a more comprehensive Indonesian etymological dictionary.