FindStem: Analysis and Evaluation of a Turkish Stemming Algorithm
Hayri Sever and Yıltan Bitirim

Department of Computer Engineering, Başkent University, Ankara, 06530 Turkey
sever@baskent.edu.tr

Department of Computer Engineering, Eastern Mediterranean University, Famagusta, T.R.N.C. (via Mersin 10, Turkey)
yiltan.bitirim@emu.edu.tr

Abstract. In this paper, we evaluate the effectiveness of a new stemming algorithm, FINDSTEM, for use with Turkish documents and queries, and compare it with two previously defined Turkish stemmers, namely the "A-F" and "L-M" algorithms. Of them, the FINDSTEM and A-F algorithms employ inflectional and derivational stemmers, whereas the L-M one handles only inflectional rules. The stemming algorithms were compared manually using 5,000 distinct words, of which FINDSTEM, A-F, and L-M failed on 49, 270, and 559, respectively. To measure the effectiveness of FINDSTEM, we used a medium-size collection comprising 2,468 law records with 280K document words, 15 natural-language queries with an average length of 17 search words, and complete relevance information for each query. We localized the SMART retrieval system with a stop list, support for Turkish characters (the ISO 8859-9, Latin-5, code set), a stemming algorithm (FINDSTEM), and a Turkish translation at the message level. Our results, based on average precision values at 11-point recall levels, show that indexing document terms as well as search terms with FINDSTEM is clearly and consistently more effective than indexing the terms as they are (that is, with no stemming at all).

1 Introduction

No matter what retrieval model is used, information retrieval (IR) systems are typically built around three basic objects: documents, terms, and user queries. The aim of information retrieval is to extract relevant documents from a collection of documents in response to queries.
Terms are used to represent the contents of documents and queries. Furthermore, document terms are matched with search terms to determine the relevance of documents to a user query. Since it is not realistic to assume that the authors and the users of documents share a common vocabulary, enlarging the overlap between the vocabularies of these two agents is a sound effort. Hence, a conflation procedure that reduces the variants of a word to a single form follows naturally from the rationale that similar words generally have similar meanings. The most common conflation procedure is the use of a stemming algorithm, which, in Turkish³, simply removes inflectional suffixes from word endings while keeping derivational affixes untouched. For example, 'gözlüğüm' (my eyeglasses) and 'gözlüklüyü' (one who wears eyeglasses) both conflate into the stem 'gözlük' (eyeglasses), not into their root form, which is 'göz' (eye). Similarly, the words 'göz' (eye), 'gözde' (favorite), 'gözlem' (observation), 'gözcü' (observer), 'gözlükçü' (optician), and 'gözetim' (supervision) are some of the stems derived from the same root 'göz'⁴; that is, all should be kept as they are since they have different meanings. Stemming, in other words, can be envisioned as a form of language processing [1] that consistently improves system effectiveness [2], though there are conflicting views for English text in the literature [3, 4]. Just as the stemming process may increase the effectiveness of IR systems, especially for morphologically complex languages, it also improves their efficiency, because stemming reduces the size of the index term set.
In Turkish, the need for stemming is more dramatic, since approximately 23,000 stems and 350-400 roots are actively used [5]; when inflected forms are included, however, the number of words runs into the millions [6], although a typical Turkish dictionary has only about 55K entries. Furthermore, the index of synthesis⁵ for Turkish is found to be 2.86. There are a number of past works on Turkish stemming, mostly published locally or left as unpublished manuscripts [8-11], with the exception of the work by Ekmekçioğlu and Willett [12]. Hence, in this article, we hold a comparative discussion of our stemming approach with the previous ones.

The organization of the paper is as follows: In section 2, Turkish stemming algorithms in the literature are discussed. Section 3 presents a stemming algorithm for Turkish, "FindStem", in detail. Section 4 considers the methods and configuration of the experiments. In section 5, experimental results for the Turkish stemming algorithms are discussed. Finally, section 6 presents conclusions.

2 Stemming Algorithms for Turkish

We explore two stemming algorithms⁶ for Turkish in this section. The first algorithm, developed by Kut et al. [8] and called Longest-Match (L-M), is based on a word search over a lexicon/dictionary that covers Turkish word stems and their possible variants (Figure 1). The authors used L-M for indexing document terms and constructing a stop list, which has been in use since then and consists of 316 words.

Footnote 3: Turkish, as a member of the south-western or Oghuz group of the Turkic family of languages, is an agglutinative language with word structures formed by productive affixation of derivational and inflectional suffixes to root words.
Footnote 4: There are roughly 150 stems or compound words emerging from the root 'göz' (eye).
Footnote 5: The index of synthesis refers to the amount of affixation in a language, i.e., it shows the average number of morphemes per word in a language [7].
Footnote 6: Truncation of words has long been considered a straightforward alternative to stemming. Hence, it may be worth stating that a truncation length of 5 characters yields the best performance when compared with lengths of 4, 6, 7, and 9 characters [13].

1. Remove suffixes that are added with punctuation marks from the word.
2. Search the word in the dictionary.
3. If a matched root is found, goto step 5.
4. If the word remained as a single letter, goto step 6. Otherwise, remove the last letter from the word and goto step 2.
5. Choose the found root as a stem and goto step 7.
6. Add the searched word into unfounded records.
7. Exit.

Fig. 1: The L-M Algorithm

The second algorithm was developed by Solak and Can [10] and is referred to as the A-F algorithm. The algorithm works over a dictionary of actively used Turkish stems in which each record is annotated with 64 tags showing how to generate surface forms. A given word is looked up in the dictionary iteratively, pruning one letter from the right at each step. If the word matches any of the root words, then morphological analysis is done for that root, i.e., affixation rules are applied to the root word (the lexicon form) to obtain its surface forms. If any of the surface forms corresponds to the word at hand, then the root word⁷ is assumed to be an eligible stem for that word. The process is repeated until the word is reduced to a single letter. The steps of the algorithm are shown in Figure 2. The algorithm outputs all possible stems for a word. Solak and Can [10] reported the average number of stems per Turkish word to be 1.22⁸.
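The dictionary-pruning idea shared by the two algorithms described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of [8] or [10]: the lexicon, the suffix list, and the function names (`longest_match_stem`, `af_stems`, `toy_can_derive`) are hypothetical, and the toy suffix check merely stands in for real morphological analysis. Input words are assumed to be lowercased already.

```python
from typing import Callable, List, Optional, Set

# Hypothetical toy data, not the lexicons used in [8] or [10].
TOY_LEXICON: Set[str] = {"benze", "benzer", "benzerlik"}
TOY_SUFFIXES = ("r", "lik", "ten")


def strip_punctuation_suffix(word: str) -> str:
    """First step of both algorithms: drop whatever follows a punctuation mark."""
    for i, ch in enumerate(word):
        if not ch.isalpha():
            return word[:i]
    return word


def longest_match_stem(word: str, lexicon: Set[str]) -> Optional[str]:
    """L-M: prune letters from the right until a lexicon entry matches."""
    candidate = strip_punctuation_suffix(word)
    while candidate:
        if candidate in lexicon:
            return candidate        # the longest matching entry is the stem
        candidate = candidate[:-1]  # remove the last letter and search again
    return None                     # nothing matched: an "unfound" word


def toy_can_derive(root: str, word: str) -> bool:
    """Stand-in for morphological analysis: can toy suffixes rebuild `word`?"""
    if root == word:
        return True
    rest = word[len(root):]
    return any(
        rest.startswith(s) and toy_can_derive(root + s, word)
        for s in TOY_SUFFIXES
    )


def af_stems(
    word: str, lexicon: Set[str], can_derive: Callable[[str, str], bool]
) -> List[str]:
    """A-F: collect every matching root, keep those that can derive the word."""
    target = strip_punctuation_suffix(word)
    candidate = target
    roots: List[str] = []
    while candidate:                # continue down to a single letter
        if candidate in lexicon:
            roots.append(candidate)
        candidate = candidate[:-1]
    return [root for root in roots if can_derive(root, target)]


if __name__ == "__main__":
    print(longest_match_stem("benzerlikten", TOY_LEXICON))
    print(af_stems("benzerlikten", TOY_LEXICON, toy_can_derive))
```

The contrast to note is that the first function stops at the first (longest) match, whereas the second keeps pruning and may return several stems for one word.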
Ekmekçioğlu and Willett, in their study of the effectiveness of stemming for Turkish text retrieval, chose to stem only query words in their experiments, on the grounds that Turkish word roots are generally unaffected when a suffix is added to its right-hand end. Accordingly, there is no need for the recoding procedures that are required in many other languages, and the use of a simple truncation search thus ensures that a stemmed query word is able to retrieve all of the variants in the database that are derived from it. For example, the word enflasyonla (with/by inflation), in one of the queries was stemmed to enflasyon, this resulting in matches with words such as enflasyonu, enflasyonunu, enflasyonun, and enflasyonist, inter alia.

Footnote 7: Note that Solak and Can did not distinguish a root word from a stem. This may be because root words may be viewed as special cases of stems, in the sense that a root is a stem that neither contains any morpheme nor is a compound word. From now on, we share this view as well, unless otherwise specified.
Footnote 8: In the experimental work over the text of 533 Turkish news items, the A-F algorithm enumerated 111,062 stems out of 90,912 words.

1. Remove suffixes that are added with punctuation marks from the word.
2. Search the word in dictionary.
3. If a matched root found, add the word into root words list.
4. If the word remained as a single letter, the root words list is empty then goto step 6, if root words list has at least one element then goto step 7.
5. Remove the last letter from the word and goto step 2.
6. Add the searched word into unfounded record and exit.
7. Get the root word from the root words list.
8. Apply morphological analysis to the root word.
9. If the result of morphological analysis is positive then add the root word to the stems list.
10. If there is any element(s) in root words list then goto step 7.
11. Choose the all stems in the stems list as a word stem.

Fig. 2: The A-F Algorithm

The approach described above is problematic because a Turkish root etymologically gives rise to many other stems; e.g., the root 'göz' (eye) is the source of roughly 150 stems that have entirely different meanings. The stem enflasyon, given as an example in the quotation, is a noun of foreign origin, and Turkish words of foreign origin are usually kept in noun forms, which may be inflected but do not give rise to other stems. Hence, we strongly believe that stemming only search terms is a serious mistake, given that the number of feasible stems per word is between 1.2 and 1.5 and the affixation length per word is 2.82 on average [7, 10, 11]. We fully agree, however, with the statement that Turkish grammar⁹ makes stemming algorithms simple. One reason behind their decision to stem only query terms may be the slowness of the tool they used, a two-level morphological analyzer for Turkish [15]. This analyzer, called PC-KIMMO [16], has been designed to generate and/or recognize words using a two-level model of word structure in which a word is represented as a correspondence between its lexical-level form and its surface-level form. The generator component of PC-KIMMO accepts a lexicon form as input, applies the appropriate rules, and returns the corresponding surface form(s). The recognizer component accepts a surface form as input, applies the appropriate rules, consults the lexicon, and returns the corresponding lexical form with its gloss. This way of stemming is rather slow: it can analyze only about two forms per second and generate about 50 forms per second on Sun SparcStations [17]. When more than one stem is obtained for a given search word, their stemming algorithm picks the shortest one. This choice is reported to have an error rate bounded by 17% over the data set used in the experiment.
Given the way foreign words are adopted into Turkish, we would argue the other way around: selecting the longest stem would be the appropriate choice. A simpler version of the two-level morphological analyzer for Turkish (using the same lexicon) has been adapted for stemming by Solak and Can, as described above for the A-F algorithm.

Footnote 9: For detailed information on Turkish grammar, we refer the reader to [14].

3 The FindStem Algorithm

The FindStem algorithm contains a pre-processing step that simply converts all letters of the word to lower case and separates out the letters that follow a punctuation mark in the word. It has three components, namely "Find the Root", "Morphological Analysis", and "Choose the Stem", which are explained in the remainder of this section.

3.1 Finding the root words in Turkish

The first step in a stemming algorithm is to find all possible roots of an examined word. Then, these roots and the production rules are used to derive the examined word. Stemming algorithms without a lexicon ignore word meaning and lead to a number of stemming errors [18]. As in all stemming algorithms for Turkish, the lexicon is used as an auxiliary structure for the stemming process. In the lexicon, the type information¹⁰ for every root word and the possible root changes (when a root word combines with a suffix) are coded for use in morphological analysis. When a root combines with a suffix in Turkish, two alterations of the root word structure may occur: (1) a change of the last vowel (e.g., ara-arıyor) or the last consonant (e.g., kitap-kitabı) of the root word and (2) a drop of the middle vowel (e.g., oğul-oğlum) [6]. The selection of possible root words from the lexicon is performed by a search algorithm that uses the coded information in the lexicon. The algorithm starts with the first character of the examined word and searches the lexicon for this item. Then the next character is appended to the item and the lexicon search is repeated.
This operation continues until the item becomes equal to the examined word or until the system determines that there are no more relevant roots for the examined word in the lexicon.

Footnote 10: Root words in the lexicon are divided into two main groups: nouns and verbs. The nouns are further subdivided into four groups, which are adjective, adverb, noun, and pronoun. This information is then coded by numerical values as the type information.

3.2 Morphological analysis

The Turkish language uses the Latin alphabet consisting of 29 letters, of which 8 are vowels and 21 are consonants, and is an agglutinative language, i.e., one in which words contain a basic root, with one or more suffixes being combined with this root in order to extend its meaning or to create other classes of words [19]. In Turkish there are a number of rules, explained in the appendix, that determine the form and order of suffixation. Suffixes are divided into two main classes: derivational and inflectional. Of them, the derivational suffixes are used for changing word meanings. Whether a derivational suffix can be added to the end of a word is determined by the word type (this information is coded into the lexicon for every word). The derivation rules are gathered under two main titles: (a) advirable, that is, derivation of tense origin words, and (b) de-nominal, that is, derivation of noun origin words. Note that it is possible to derive an advirable word from a noun origin root word or to derive a de-nominal word from a verb origin root word. For example, the advirable word "baba-y-dı" (he was a father) can be derived from the noun origin root word "baba" (father). To make a derivation, all suffixes are grouped and each one is coded as a standard method corresponding to a rule defined in the appendix. A morphological analyzer is usually required if high-quality stemming is to be achieved [12].
To show the importance of the morphological analysis step in our stemming algorithm, let us consider the word "edebilecek" as an examined word. The longest possible root words retrieved from the lexicon are "edebi", "edep", and "ede". According to the algorithms [8, 13] that assign a stem by matching the examined word with the longest root words, these root words will be selected as output. But it is not possible to produce the examined word, "edebilecek", from these root words; the correct result can be achieved only through the morphological analysis procedure.

3.3 Selection of a stem

In Turkish, a surface form can be generated from more than one root. For example, the word küçücükken (once one is very small) may conflate into either of the roots küçük (small) or küçücük (very small). If one asks which of them truly represents the stem of küçücükken, the answer should be küçücük. The number of suffixes applied in the derivation and their types¹¹ form the basis for the selection operation. It is worth mentioning, however, that the ambiguity in conflating Turkish words into single terms (or, the other way around, in generating them) is another issue on which our hands are tied for stemming. For example, assume that a word has more than one sense: the word "başlar" can be either the plural of "baş" (head) or a present-tense inflection of the verb "başlamak" (to start). To find the actual stem of such a word, semantic reasoning about the context should be carried out¹². If no possible root word is found in the lexicon for the examined word, or the production rules fail to derive that word from any root in the list, the word is kept as it is and/or saved in a log file for examination, without passing to the next steps. It is highly probable that such a word is either a foreign word or one adopted into Turkish but not yet present in the lexicon.
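The three components described in this section (root search, morphological analysis, stem selection) can be put together in a small Python sketch. Everything named here is hypothetical: the toy lexicon maps root variants (such as "kitab" for "kitap") to canonical roots, the suffix list is a placeholder for the coded rules of the appendix, and choosing the derivation with the fewest suffix applications is an assumption, since the text says only that the number and types of suffixes form the basis of the selection.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical toy lexicon: surface variant of a root -> canonical root.
TOY_LEXICON: Dict[str, str] = {
    "küçük": "küçük",
    "küçü": "küçük",       # final consonant dropped before "-cük"
    "küçücük": "küçücük",
    "kitap": "kitap",
    "kitab": "kitap",      # p -> b before a vowel-initial suffix
}
TOY_SUFFIXES: Sequence[str] = ("cük", "ken", "ı")


def find_roots(word: str, lexicon: Dict[str, str]) -> List[str]:
    """Grow a prefix one character at a time, as described in Section 3.1."""
    roots: List[str] = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if not any(entry.startswith(prefix) for entry in lexicon):
            break                      # no lexicon entry can match any more
        if prefix in lexicon:
            roots.append(prefix)
    return roots


def count_suffixes(rest: str, suffixes: Sequence[str]) -> Optional[int]:
    """Toy morphological analysis: fewest suffixes that spell `rest`, if any."""
    if not rest:
        return 0
    best: Optional[int] = None
    for s in suffixes:
        if rest.startswith(s):
            tail = count_suffixes(rest[len(s):], suffixes)
            if tail is not None and (best is None or tail + 1 < best):
                best = tail + 1
    return best


def find_stem(
    word: str,
    lexicon: Dict[str, str] = TOY_LEXICON,
    suffixes: Sequence[str] = TOY_SUFFIXES,
) -> Optional[str]:
    """Return a stem, or None when the word should be logged as unfound."""
    derivations: List[Tuple[int, str]] = []
    for variant in find_roots(word, lexicon):
        n = count_suffixes(word[len(variant):], suffixes)
        if n is not None:              # the word is derivable from this root
            derivations.append((n, lexicon[variant]))
    if not derivations:
        return None
    # Assumed selection rule: fewest suffixes, then the longer root.
    return min(derivations, key=lambda d: (d[0], -len(d[1])))[1]


if __name__ == "__main__":
    for w in ("küçücükken", "kitabı", "xyz"):
        print(w, "->", find_stem(w))
```

With this toy data, "küçücükken" is derivable both from "küçük" (two suffixes) and from "küçücük" (one suffix), and the assumed rule picks the latter, in line with the example above.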
Putting it all together: The FindStem algorithm is shown in Figure 3.

1. Remove suffixes that are added with punctuation marks from the word.
2. Find all possible roots of the word in a lexicon and add them into root words list.
3. If root words list is empty, add the word into unfounded records and exit.
4. Get the root word from root words list.
5. Apply morphological analysis to the root word.
6. After morphological analysis, add the formed derivations into derivations list.
7. If there is any element(s) in root words list then goto step 4.
8. Choose the word stem by a selection between derivations in the derivations list.

Fig. 3: The FindStem Algorithm

Footnote 11: Suffixation is divided into two main types, "derivational suffix" and "inflectional suffix", which may be further divided into subtypes.
Footnote 12: Context should not automatically be taken to mean a sentence-level analysis; it may sometimes require going to the paragraph or even the whole text.

4 The Experimental Method

For the experiments, we used a localized version of the SMART system. To localize the SMART system into Turkish, the Turkish characters (ğ, Ğ, ü, Ü, ş, Ş, ı, İ, ö, Ö, ç, Ç) and a Turkish stopword list were introduced to the system, and the English stemming algorithm in the system was replaced with the FindStem stemming algorithm.

The experiment is divided into two parts. In the first part, the A-F algorithm [10] and the L-M algorithm [8] are compared with the FindStem algorithm and their effectiveness is investigated. In the second part, the FindStem algorithm is integrated into the localized SMART system and its performance is measured using precision and recall, which are defined for a user query as the proportion of documents that are both retrieved and relevant to the retrieval output and to the set of relevant documents, respectively.
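As a small illustration of the two measures just defined, the following Python sketch computes precision and recall for a single query. The function name and the document identifiers are hypothetical and serve only to make the definitions concrete.

```python
from typing import List, Set, Tuple


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = hits / |retrieved|, recall = hits / |relevant|."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical retrieval output and relevance judgments for one query.
    output = ["d1", "d2", "d3", "d4"]
    judged_relevant = {"d1", "d3", "d7"}
    print(precision_recall(output, judged_relevant))
```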
4.1 Evaluation of the Turkish stemming algorithms

The A-F algorithm was used in its entirety along with its lexicon, as obtained through private communication, and the L-M algorithm was re-coded by us following the principles laid out in [8]. The algorithms were tested on a data set consisting of 5,000 words. The true stems of each word were tagged manually. Note that there can be more than one true stem for a given word, as shown in Figure 4.

Searching Word    Result Stems
benzerlikten      1. benzerlik  2. benzer  3. benze(mek)

Fig. 4: Stems, not accepted as an error

The accuracies of the three stemming algorithms over that data set were computed.

4.2 Performance Measurement

"The effectiveness of stemming algorithms has usually been measured in terms of their effect on retrieval performance with test collections" [20]. Our test collection is a set of Turkish documents formed from a total of 2,468 law entries. The number of rows in these documents ranges between 10 and 20; the total numbers of rows and words are 59,941 and 279,904, respectively. This collection was indexed twice by the localized SMART system, once without stemming and once with the FindStem algorithm. As a result, two different index sets were formed in the system. 15 queries¹³ have been defined with complete information about the relevance of documents. In correspondence with the document terms, the search (or query) terms were stemmed (or not stemmed). We ran each query separately and used the retrieval outputs to determine whether each retrieved document was relevant or not. We used non-interpolated precision values at 8-point recall levels (from 0.3 to 1) to measure the effectiveness of the system with and without stemmed terms. To calculate the recall, the total number of relevant documents in the collection should be known. But this is not feasible when the collection holds approximately 2,500 documents.
Nevertheless, the cut-off point is set at 100 documents, and the assumption is made that all the documents relevant to a query are among the first 100 documents retrieved. The recall is taken as 1 at the position (in the retrieval output) where the last relevant document is displayed, and precision values for every query are calculated for various recall parameters (1, 0.9, 0.8, ..., 0.3). The formula used to calculate the precision value for a given recall parameter is

$$P_R = \frac{X}{p_X}, \qquad X = N \times R,$$

where "N" is the number of relevant documents retrieved, "R" is the recall parameter (1, 0.9, ..., 0.3), and $p_X$ is the position of the Xth ($X = N \times R$) relevant document in the retrieval output. For example, when the number of relevant documents is 10 out of 100 documents retrieved and the last relevant document is the 85th document in the retrieval output, the precision value for recall parameter 1 will be $10/85 \approx 0.118$, and for recall parameter 0.9 it will be $9/52 \approx 0.173$ (52 is the position of the 9th ($10 \times 0.9$) relevant document in the retrieval output), etc.

Footnote 13: Queries are accessible online, http://cmpe.emu.edu.tr/bitirim/stemming.

5 Experimental Results

5.1 Comparison and effectiveness of Turkish stemming algorithms

While the FindStem and the L-M algorithms select only one root as the stem for each word, the A-F algorithm can, under certain conditions, select more than one root for a word (Figure 5). The L-M algorithm failed to return the manually entered stem for 559 words, but many of the stems it found are semantically related words (for example, "öğreti" is found as the stem of "öğretilecek" instead of "öğret(mek)"). Because of this, only 138 of them are counted as wrong. The FindStem algorithm found stems different from the manually entered ones for 49 words. In fact, the roots found by the algorithm can still be regarded as stems of the words (e.g., the manually entered stem for the word "gözden" is "göz", but the algorithm finds the root word "gözde" as the stem).
Word            FindStem   A-F                  L-M        Manually Entered Stem
alanında        alan       al*                  alan       alan
anlatılmak      anlat      an*                  anlat      anlat
aşılmıştır      aş         aş                   aşı*       aş
birleşmiş       birleş     bir                  birleş     birleş
belirtilmeyen   belirt     1.be* 2.belirt       belirti*   belirt
çekleştirdiği   çek        çekleştirdiği*       çek        çek
aksamaması      aksa       aksa                 aksam*     aksa
aralıkta        ara*       aralık               aralık     aralık
eklemek         ekle       ekle                 eklem*     ekle
daha            daha       1.da* 2.daha         daha       daha

Fig. 5: The stems found for some words by the algorithms ("*" means the stem is counted as incorrect)

The A-F algorithm found completely wrong stems for 59 words. Furthermore, the algorithm found unsuitable roots as stems for 270 words (for example, both "göre" and "görev" were selected as stems of the word "görevini"). Some samples of the stems found by the employed algorithms are shown in Figure 5.

5.2 The Number of Documents Retrieved

The number of zero retrievals (i.e., no documents retrieved) or of retrievals that contain no relevant documents (i.e., the precision ratio is zero) can be used to evaluate retrieval performance with and without stemming. The number of relevant documents retrieved for each query is given in Table 1. The first number in the row labelled "Average" shows the average number of relevant documents retrieved, and the second one (in parentheses) shows the total number of relevant documents retrieved. As Table 1 shows, the average numbers of relevant documents retrieved for the 15 queries over the stemmed and unstemmed indexes are 28.4 and 23.3, respectively. The total number of relevant documents retrieved is 426 over the stemmed index and 350 over the unstemmed index. Through stemming, the total number of relevant documents retrieved is increased by approximately 22%. Except for 3 of the 15 queries (i.e., queries 4, 10, and 14), stemming increased the retrieval effectiveness of the system in all queries. This indicates the success of the FindStem stemming algorithm.
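The totals and the 22% figure above, together with the worked precision example of Section 4.2 (10 relevant documents, the 9th at position 52 and the last at position 85), can be re-derived with a short Python sketch. The per-query counts are those of Table 1; the function names are hypothetical, the ranks other than 52 and 85 are invented for illustration, and rounding N x R up to the next integer is an assumption, since the text does not say how non-integer values are handled.

```python
from typing import List

# Relevant documents retrieved per query (Table 1).
STEMMED = [66, 41, 68, 43, 28, 17, 46, 18, 8, 40, 9, 17, 13, 6, 6]
UNSTEMMED = [55, 32, 51, 43, 17, 12, 38, 16, 7, 40, 8, 11, 10, 6, 4]

# Hypothetical ranks of 10 relevant documents; only 52 and 85 come from the text.
EXAMPLE_RANKS = [2, 5, 9, 14, 20, 27, 35, 44, 52, 85]


def relative_gain(new_total: int, old_total: int) -> float:
    """Relative increase of new_total over old_total."""
    return (new_total - old_total) / old_total


def precision_at_recall(positions: List[int], recall_tenths: int) -> float:
    """Precision X / p_X with X = N * R; R is given in tenths (9 means 0.9)."""
    n = len(positions)
    x = -(-n * recall_tenths // 10)    # integer ceiling of N * R (assumption)
    if x == 0:
        return 0.0
    return x / positions[x - 1]        # positions are 1-based ranks


if __name__ == "__main__":
    print(sum(STEMMED), sum(UNSTEMMED))
    print(f"{relative_gain(sum(STEMMED), sum(UNSTEMMED)):.1%}")
    for tenths in (10, 9):
        print(tenths / 10, round(precision_at_recall(EXAMPLE_RANKS, tenths), 3))
```

Passing the recall level in tenths keeps the computation in integers and avoids floating-point surprises when multiplying by values such as 0.7.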
          Number of relevant documents
Query     Stemmed index   Unstemmed index
1         66              55
2         41              32
3         68              51
4         43              43
5         28              17
6         17              12
7         46              38
8         18              16
9         8               7
10        40              40
11        9               8
12        17              11
13        13              10
14        6               6
15        6               4
Average   28.4 (426)      23.3 (350)

Table 1: Number of relevant documents retrieved

          Average Precision of 15 queries
Recall    w/ Stemming   w/out Stemming
1         0.415         0.283
0.9       0.508         0.316
0.8       0.534         0.415
0.7       0.586         0.490
0.6       0.625         0.547
0.5       0.681         0.589
0.4       0.781         0.644
0.3       0.818         0.664
Average   0.619         0.494

Table 2: Average precision values when stemming is used and not used

5.3 Precision values for various recall parameters

The average precision of the 15 queries for various recall parameters over the stemmed and unstemmed indexes is given in Table 2¹⁴. The overall averages of these averages are also given in Table 2. Table 2 shows that precision increases as recall decreases and that better precision values are obtained when word stemming is used. Thus the effect of stemming on retrieval performance becomes apparent, as shown in Figure 6.

[Figure: precision (vertical axis, 0 to 0.9) plotted against recall (horizontal axis, 1 down to 0.3), with one curve for "w/ Stemming" and one for "w/out Stemming"]

Fig. 6: Precision values for various recall parameters

The use of stemming increased the retrieval effectiveness of the system by approximately 25% in terms of precision.

6 Conclusions

In this article, the necessity of morphological analysis in Turkish stemming algorithms is described. The main idea of the algorithms that do not use morphological analysis is to select the root¹⁵ from a lexicon as the stem, which increases the number of errors in word stemming. Another point is that it is possible to find more than one stem for some words, but it is not possible to decide on the real stem without examining the context in which the word appears; if one of these is selected as the stem, there is a possibility of selecting the wrong one and causing incorrect stemming.
However, assigning as stems all the roots whose morphological analysis is positive sometimes yields the origin of the word besides its stem. Hence, an elimination among the words assigned as stems is necessary to find the real stem.

Footnote 14: The precision values of every query for every recall parameter (1, 0.9, ..., 0.3) are accessible online, http://cmpe.emu.edu.tr/bitirim/stemming.
Footnote 15: The root will be the longest set of characters that is compatible with the word.

A Turkish Word Formation

In Turkish, root words and words derived from roots follow well-defined rules such as the major and minor laws of vowel harmony. These rules determine the changes that are made both to root words and to affixes in the formation of words.

A.1 Turkish alphabet

In Turkish word formation, mainly the vowels and the unvoiced consonants are subject to change. The vowels and their characteristics, and the voiced/unvoiced consonants, are shown in Table 3 and Table 4.

          Unrounded       Rounded
          Low    High     Low    High
Back      A      I        O      U
Front     E      İ        Ö      Ü

Table 3: Vowels and their characteristics

Unvoiced Consonants   f, p, ş, ç, h, s, t, k
Voiced Consonants     b, c, d, g

Table 4: Voiced and unvoiced consonants

A.2 The possible changes on roots and suffixes in word formation

The two important changes observed in word formation are assimilation and dropping. In addition, character changes, such as unvoiced consonants becoming voiced, are observed in Turkish.

Assimilation

During the word formation process in Turkish, if the last character of a root is a vowel and the first character of a suffix is also a vowel, then consonants such as "n", "s", "y", "ş" are used as assimilators. Some examples are:

bahçe (garden)     + ı    = bahçesi (his/her/its garden)
komşu (neighbor)   + in   = komşunun (neighbor's)
pencere (window)   + ı    = pencereyi (the window)
iki (two)          + er   = ikişer (two each)

Dropping

During Turkish word formation, the loss of a letter is possible in either the root word or the suffix. This loss can be a vowel or a consonant.
The drop of middle vowel: The vowel of the second syllable is lost when a suffix beginning with a vowel is added. For example:

oğul (son)         + u    = oğlu (his/her/its son)
burun (nose)       + u    = burnu (his/her/its nose)
karın (stomach)    + im   = karnım (my stomach)

The drop of vowel at the end of a root: When some words such as koku (smell), sızı (pain), yumurta (egg) combine with the suffix "-le" to form a derived verb, the last vowel of these words is dropped. For example:

koku (smell)       + le   = kokla (smell it)

The drop of consonant at the end of a root: There are some situations in which the last consonant is dropped when root and suffix combine. For example:

küçük (small)      + cik  = küçücük (very small)
yüksek (high)      + (e)l = yükselmek (to rise)

Character change

When a vowel is added to some nouns of one syllable and most nouns of more than one syllable ending in "p", "ç", "t", "k", the final consonant changes to "b", "c", "d", or "ğ", respectively. For example, with the addition of the third-person suffix "-i":

kitap (book)       + ı    = kitabı (his/her/its book)
ağaç (tree)        + ı    = ağacı (his/her/its tree)
armut (pear)       + ı    = armudu (his/her/its pear)
ayak (foot)        + ı    = ayağı (his/her/its foot)

There are, however, nouns whose final consonants are not subject to this change. When the letter "n" comes before the letter "k" at the end of a word, the letter "k" becomes "g" instead of "ğ". For example:

renk (color)       + ı    = rengi (his/her/its color)

In words of foreign origin, if the last letter of the word is "g" and the word combines with a suffix whose first letter is a vowel, the letter "g" changes to the letter "ğ", as in the example:

monolog (monologue) + ı   = monoloğu (his/her/its monologue)

The change of "g" to "ğ" is not observed for one-syllable words or for words whose second-to-last letter is "n". For example:

şezlong (chaise longue) + ı = şezlongu (his/her/its chaise longue)

References

1. Irene Diaz, Jorge Morato, and Juan Lloréns. An algorithm for term conflation based on tree structures. Journal of The American Society for Information Science and Technology, 53(3):199-208, 2002.
2. R. Krovetz. Viewing morphology as an inference process. Proceeding 16th International Conference Research and Development in Information Retrieval, ACM, pages 191-202, New York, 1993.
3. Donna Harman. How effective is suffixing? JASIS, 42(1):7-15, 1991.
4. Mirko Popovic and Peter Willett. The effectiveness of stemming for natural language access to Slovene textual data. Journal of the American Society for Information Science, 43:384-390, 1992.
5. A. B. Ercilasun et al. İmla Klavuzu, volume 525. Atatürk Kültür ve Tarih Yüksek Kurumu, Türk Dili Kurumu Yayınları, Ankara, Turkey, 1996.
6. T. Banguoğlu. Türkçenin Grameri. Atatürk Kültür ve Tarih Yüksek Kurumu, Türk Dili Kurumu Yayınları:528, Ankara, Turkey, 1995.
7. Ari Pirkola. Morphological typology of languages for IR. Journal of Documentation, 57(3):330-348, May 2001.
8. A. Kut, A. Alpkoçak, and E. Özkarahan. Bilgi bulma sistemleri için otomatik türkçe dizinleme yöntemi. In Bilişim Bildirileri, Dokuz Eylül University, İzmir, Turkey, 1995.
9. A. Köksal. Tümüyle özdevimli deneysel bir belge dizinleme ve erişim dizgesi. TURDER, TBD 3. Ulusal Bilişim Kurultayı, pages 37-44, 6-8 April 1981, Ankara, Turkey.
10. A. Solak and F. Can. Effects of stemming on Turkish text retrieval. Technical report BUCEIS-94-20, Bilkent University, Ankara, Turkey, 1994.
11. Gökmen Duran and Hayri Sever. Türkçe gövdeleme algoritmalarının analizi. In Ulusal Bilişim Kurultayı Bildiri Kitabı, pages 235-242, İstanbul, Turkey, September 1996.
12. F. Çuna Ekmekçioğlu and Peter Willett. Effectiveness of stemming for Turkish text retrieval. Program, 34(2):195-200, April 2000.
13. A. Köksal. Automatic Morphological Analysis of Turkish. PhD thesis, Hacettepe University, 1975.
14. G. L. Lewis. Teach Yourself Turkish. Sevenoaks, second edition, 1989.
15. Kemal Oflazer. Two-level description of Turkish morphology. Literary and Linguistic Computing, 1994.
16. E. L. Antworth. Glossing text with the PC-KIMMO morphological parser. Computers and the Humanities, 1993.
17. K. Oflazer and C. Guzey. Spelling correction in agglutinative languages. In Proceedings of 4th ACL Conference on Applied Natural Language Processing, pages 194-195, Stuttgart, Germany, October 1994.
18. D. Hull. Stemming algorithms: A case study for detailed evaluation. Journal of The American Society for Information Science, 47(1):70-84, 1996.
19. Richard Sproat. Morphology and Computation. Cambridge MA: MIT Press, 1992.
20. Chris D. Paice. An evaluation method for stemming algorithms. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 42-50, 3-6 July 1994.