FindStem: Analysis and Evaluation of a Turkish Stemming Algorithm

Hayri Sever and Yıltan Bitirim
Department of Computer Engineering
Başkent University
Ankara, 06530 Turkey
sever@baskent.edu.tr
Department of Computer Engineering
Eastern Mediterranean University
Famagusta, T.R.N.C. (via Mersin 10, Turkey)
yiltan.bitirim@emu.edu.tr
Abstract. In this paper, we evaluate the effectiveness of a new stemming algorithm, FINDSTEM,
for use with Turkish documents and queries, and compare it with two previously defined
Turkish stemmers, namely the "A-F" and "L-M" algorithms. Of these, the FINDSTEM and A-F
algorithms employ inflectional and derivational stemmers, whereas the L-M algorithm handles
only inflectional rules. The stemming algorithms were compared manually using 5,000 distinct
words, on which FINDSTEM, A-F, and L-M failed in 49, 270, and 559 cases, respectively. A
medium-size collection, comprising 2,468 law records with 280K document words, 15
natural-language queries with an average length of 17 search words, and complete relevance
information for each query, was used to measure the effectiveness of FINDSTEM. We localized
the SMART retrieval system by means of a stop list, support for Turkish characters (i.e., the
ISO 8859-9 (Latin-5) code set), a stemming algorithm (FINDSTEM), and a Turkish translation at
the message level. Our results, based on average precision values at 11-point recall levels,
show that indexing documents as well as search terms with FINDSTEM is clearly and
consistently more effective than indexing the terms as they are (that is, with no stemming
at all).
1 Introduction
No matter what retrieval model is used, information retrieval (IR) systems are typically
built around three basic objects: documents, terms, and user queries. The aim of information
retrieval is to extract relevant documents from a collection of documents in response to
queries. Terms are used to represent the contents of documents and queries, and document
terms are matched with search terms to determine the relevance of documents to a user query.
Since it is not realistic to assume that authors and users of documents share a common
vocabulary, enlarging the overlap between the vocabularies of these two agents is a sound
effort. Hence, a conflation procedure that reduces variants of a word to a single form
follows naturally from the rationale that similar words generally have similar meanings. The
most common conflation procedure is the use of a stemming algorithm, which, in Turkish³,
simply removes inflectional variants from the word endings while keeping derivational affixes
untouched. For example, 'gözlüğüm' (my eyeglasses) and 'gözlüklüyü' (one who wears
eyeglasses) both conflate into the stem 'gözlük' (eyeglasses), not into their root form,
which is 'göz' (eye). Similarly, the words 'göz' (eye), 'gözde' (favorite), 'gözlem'
(observation), 'gözcü' (observer), 'gözlükçü' (optician), and 'gözetim' (supervision) are
some of the stems derived from the same root 'göz'⁴; all of them should be kept as they are,
since they have different meanings. Stemming, in other words, can be envisioned as a form of
language processing [1] that consistently improves system effectiveness [2], though there are
conflicting views for English text in the literature [3, 4].
As much as stemming may increase the effectiveness of IR systems, especially for
morphologically complex languages, it also improves their efficiency, because the size of the
index term set decreases as a result of stemming. In Turkish, the need for stemming is more
dramatic: approximately 23,000 stems and 350-400 roots are in active use [5], but when the
inflections of the words are included the number runs into millions [6], even though a
typical Turkish dictionary has roughly 55K entries. Furthermore, the index of synthesis⁵ for
Turkish is found to be 2.86. There are a number of past works on Turkish stemming, mostly
published locally or left as unpublished manuscripts [8–11], with the exception of the work
by Ekmekçioğlu and Willett [12]. Hence, in this article, we compare our stemming approach
with the previous ones.
The paper is organized as follows. Section 2 discusses the Turkish stemming algorithms in
the literature. Section 3 presents a stemming algorithm for Turkish, "FindStem", in detail.
Section 4 describes the methods and configuration of the experiments. Section 5 discusses the
experimental results for the Turkish stemming algorithms. Finally, Section 6 presents
conclusions.
2 Stemming Algorithms for Turkish
We explore two stemming algorithms⁶ for Turkish in this section.
The first algorithm, developed by Kut et al. [8] and called Longest-Match (L-M), is based on
word search logic over a lexicon/dictionary that covers Turkish word stems and their possible
variants (Figure 1). The authors used L-M for indexing document terms and constructing a stop
list, which has been in use since then and consists of 316 words.

³ Turkish, as a member of the south-western or Oghuz group of the Turkic family of languages,
is an agglutinative language with word structures formed by productive affixation of
derivational and inflectional suffixes to the root words.
⁴ There are roughly 150 stems or compound words emerging from the root 'göz' (eye).
⁵ Index of synthesis refers to the amount of affixation in a language, i.e., it shows the
average number of morphemes per word in a language [7].
⁶ Truncation of words has long been considered a straightforward alternative to stemming.
Hence, it may be worth stating that a truncation length of 5 characters yields the best
performance when compared with lengths of 4, 6, 7 and 9 characters [13].

1. Remove suffixes that are attached to the word with punctuation marks.
2. Search for the word in the dictionary.
3. If a matching root is found, go to step 5.
4. If the word has been reduced to a single letter, go to step 6. Otherwise, remove the last
letter from the word and go to step 2.
5. Choose the found root as the stem and go to step 7.
6. Add the searched word to the not-found records.
7. Exit.
Fig. 1: The L-M Algorithm
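
The steps of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Kut et al.: the function names, the toy lexicon, and the choice of the apostrophe as the punctuation mark in step 1 are all hypothetical.

```python
def strip_punctuated_suffix(word, marks="'’"):
    """Step 1 (assumed form): drop everything from the first punctuation mark on."""
    for i, ch in enumerate(word):
        if ch in marks:
            return word[:i]
    return word


def longest_match_stem(word, lexicon):
    """Steps 2-7: return the longest lexicon entry that is a prefix of `word`."""
    candidate = strip_punctuated_suffix(word)
    while candidate:
        if candidate in lexicon:      # steps 2, 3, 5: the first hit is the longest match
            return candidate
        candidate = candidate[:-1]    # step 4: remove the last letter and search again
    return None                       # step 6: the caller records the word as not found


toy_lexicon = {"aş", "aşı", "öğret", "öğreti"}   # hypothetical entries
print(longest_match_stem("aşılmıştır", toy_lexicon))    # aşı
print(longest_match_stem("öğretilecek", toy_lexicon))   # öğreti
```

With this toy lexicon the sketch returns "aşı" and "öğreti", the same over-long matches that the paper reports for L-M on these two words.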
The second algorithm was developed by Solak and Can [10] and is referred to as the A-F
algorithm. It works over a dictionary of actively used Turkish stems in which each record is
annotated with 64 tags showing how to generate surface forms. A given word is looked up in
the dictionary iteratively, from right to left, by pruning one letter at each step. If the
word matches any of the root words, morphological analysis is carried out for that word,
i.e., the affixation rules are applied to the root word (or lexicon form) to get its surface
forms. If any of the surface forms corresponds to the word at hand, the root word⁷ is assumed
to be an eligible stem for that word. The process is repeated until the word is reduced to a
single letter. The steps of the algorithm are shown in Figure 2. The algorithm outputs all
possible stems for a word. Solak and Can [10] reported the average number of stems per
Turkish word to be 1.22⁸.
Ekmekçioğlu and Willett, in their study of the effectiveness of stemming for Turkish text
retrieval [12], chose to stem only the query words in their experiments, on the grounds that

    Turkish word roots are generally unaffected when a suffix is added to its
    right-hand end. Accordingly, there is no need for the recoding procedures that
    are required in many other languages, and the use of a simple truncation search
    thus ensures that a stemmed query word is able to retrieve all of the variants
    in the database that are derived from it. For example, the word enflasyonla
    (with/by inflation), in one of the queries was stemmed to enflasyon, this
    resulting in matches with words such as enflasyonu, enflasyonunu, enflasyonun,
    and enflasyonist, inter alia.

⁷ Note that Solak and Can did not distinguish a root word from a stem. This may be because
root words may be viewed as special cases of stems, in the sense that a root is a stem that
neither contains any morpheme nor is a compound word. From now on, we share this view as
well, unless otherwise specified.
⁸ In the experimental work over the text of 533 Turkish news items, the A-F algorithm
enumerated 111,062 stems for 90,912 words.

2. Search for the word in the dictionary.
3. If a matching root is found, add the word to the root words list.
4. If the word has been reduced to a single letter: if the root words list is empty, go to
step 6; if it has at least one element, go to step 7.
5. Remove the last letter from the word and go to step 2.
6. Add the searched word to the not-found records and exit.
7. Get a root word from the root words list.
8. Apply morphological analysis to the root word.
9. If the result of the morphological analysis is positive, add the root word to the stems
list.
10. If any elements remain in the root words list, go to step 7.
11. Choose all the stems in the stems list as the word's stems.
Fig. 2: The A-F Algorithm
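
The control flow of Figure 2 can be sketched as follows. The morphological analysis of Solak and Can relies on a tagged lexicon that is not reproduced here, so this sketch substitutes a deliberately naive stand-in: a root is accepted if the rest of the word can be split into entries of a small suffix list. The names `segmentable` and `af_stems`, the toy lexicon, and the suffix list are all hypothetical.

```python
def segmentable(rest, suffixes):
    """Toy stand-in for morphological analysis: can `rest` be split into known suffixes?"""
    if not rest:
        return True
    return any(rest.startswith(s) and segmentable(rest[len(s):], suffixes)
               for s in suffixes)


def af_stems(word, lexicon, suffixes):
    """Return every lexicon root from which `word` can be generated (possibly none)."""
    # Steps 2-5: prune letters from the right, collecting every lexicon hit.
    roots = [word[:n] for n in range(len(word), 0, -1) if word[:n] in lexicon]
    # Steps 7-10: keep only the roots whose (toy) analysis is positive.
    return [root for root in roots if segmentable(word[len(root):], suffixes)]


toy_lexicon = {"göz", "gözlük"}          # hypothetical entries
toy_suffixes = ["lük", "ler", "im"]      # hypothetical, non-empty suffix strings
print(af_stems("gözlüklerim", toy_lexicon, toy_suffixes))   # ['gözlük', 'göz']
print(af_stems("gözyaşı", toy_lexicon, toy_suffixes))       # []
```

Unlike the L-M sketch, every validated root is returned, which is how a word can end up with more than one stem.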
The approach described above is problematic in nature, because a Turkish root etymologically
gives rise to many other stems; e.g., the root 'göz' (eye) is the source of roughly 150
stems, which have entirely different meanings. The stem enflasyon, given as an example in the
quotation, is a noun of foreign origin, and Turkish words of foreign origin are usually kept
in noun forms, which may be inflected but do not give rise to other stems. Hence, we strongly
believe that stemming only the search terms is a serious mistake, given that the number of
feasible stems per word is between 1.2 and 1.5 and the affixation length per word is 2.82 on
average [7, 10, 11]. We fully agree, however, with the statement that Turkish grammar⁹ makes
stemming algorithms simple. One reason behind their decision to stem only the query terms may
be the slowness of the tool they used, a two-level morphological analyzer for Turkish [15].
This analyzer, called PC-KIMMO [16], has been designed to generate and/or recognize words
using a two-level model of word structure in which a word is represented as a correspondence
between its lexical-level form and its surface-level form. The generator component of
PC-KIMMO accepts a lexicon form as input, applies the appropriate rules, and returns the
corresponding surface form(s). The recognizer component accepts a surface form as input,
applies the appropriate rules, consults the lexicon, and returns the corresponding lexical
form with its gloss. This way of stemming is rather slow: it can analyze only about two forms
per second and generate about 50 forms per second on Sun SparcStations [17]. When more than
one stem is obtained for a given search word, the shortest one is picked by the stemming
algorithm. It is reported that this choice had an error rate bounded by 17% over the data set
used in the experiment. Given the way foreign words are adopted into Turkish, we would argue
the other way around: selecting the longest stem would be the appropriate choice. A simpler
version of the two-level morphological analyzer for Turkish (using the same lexicon) was
adapted for stemming by Solak and Can, as described above for the A-F algorithm.
⁹ For detailed information on Turkish grammar, we refer the reader to [14].
3 The FindStem Algorithm
The FindStem algorithm contains a pre-processing step that simply converts all letters of the
word to lowercase and separates off the letters that follow a punctuation mark in the word.
It has three components, namely "Find the Root", "Morphological Analysis", and "Choose the
Stem", which are explained in the remainder of this section.
3.1 Finding the root words in Turkish
The first step in a stemming algorithm is to find all possible roots of the examined word.
These roots and the production rules are then used to derive the examined word.
Stemming algorithms without a lexicon ignore word meaning and make a number of stemming
errors [18]. As in all stemming algorithms for Turkish, a lexicon is used as an auxiliary
structure for the stemming process. In the lexicon, the type information¹⁰ for every root
word and its possible root changes (when the root word combines with a suffix) are coded for
use in morphological analysis. When a root combines with a suffix in Turkish, two alterations
of the root word structure may occur: (1) a change of the last vowel (e.g., ara-arıyor) or
consonant (e.g., kitap-kitabı) of the root word, and (2) the drop of a middle vowel (e.g.,
oğul-oğlum) [6].
The selection of possible root words from the lexicon is performed by a search algorithm that
uses the coded information in the lexicon. The algorithm starts with the first character of
the examined word and searches the lexicon for this item. Then the next character is appended
to the item and the lexicon search is repeated. This operation continues until the item
becomes equal to the examined word or until the system determines that there are no more
relevant roots for the examined word in the lexicon.
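
A minimal sketch of this left-to-right search is given below, assuming the lexicon is available as a sorted list of plain strings. The paper's lexicon also encodes root changes such as kitap-kitab, which this sketch leaves out. The name `candidate_roots` and the toy entries are hypothetical; the stopping test uses binary search to check whether any lexicon entry still begins with the current item.

```python
from bisect import bisect_left


def candidate_roots(word, sorted_lexicon):
    """Grow a prefix of `word` one letter at a time and collect the lexicon hits."""
    roots = []
    for n in range(1, len(word) + 1):
        item = word[:n]
        i = bisect_left(sorted_lexicon, item)
        # No entry starts with the current item: no more relevant roots, so stop.
        if i == len(sorted_lexicon) or not sorted_lexicon[i].startswith(item):
            break
        if sorted_lexicon[i] == item:
            roots.append(item)
    return roots


toy_lexicon = sorted(["göz", "gözde", "gözlem", "gözlük", "gül"])   # hypothetical entries
print(candidate_roots("gözlüklerim", toy_lexicon))   # ['göz', 'gözlük']
```

The early stop matters for long inflected words: the loop ends as soon as the prefix runs past every lexicon entry, rather than scanning the whole word.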
3.2 Morphological analysis
The Turkish language uses a Latin alphabet consisting of 29 letters, of which 8 are vowels
and 21 are consonants, and is an agglutinative language, i.e., one in which words contain a
basic root, with one or more suffixes being combined with this root in order to extend its
meaning or to create other classes of words [19]. In Turkish there are a number of rules,
which are explained in the appendix, that determine the form and order of suffixation.
Suffixes are divided into two main classes: derivational and inflectional. Of these, the
derivational suffixes are used for changing word meanings. Whether a derivational suffix can
be added to the end of a word is determined by the word type (this information is coded into
the lexicon for every word). The derivation rules are gathered under two main titles: (a)
advirable, that is, derivation of tense-origin words, and (b) de-nominal, that is, derivation
of noun-origin words. Note that it is possible to derive an advirable word from a noun-origin
root word or to derive a de-nominal word from a verb-origin root word. For example, the
advirable word "baba-y-dı" (he was a father) can be derived from the noun-origin root word
"baba" (father). To make a derivation, all suffixes are grouped, and each one is coded as a
standard method corresponding to a rule defined in the appendix.

¹⁰ Root words in the lexicon are divided into two main groups: nouns and verbs. The nouns are
further subdivided into four groups, which are adjective, adverb, noun and pronoun. This
information is then coded by numerical values as the type information.

A morphological analyzer is usually required if high-quality stemming is to be achieved [12].
To show the importance of the morphological analysis step in our stemming algorithm, let us
consider the word "edebilecek" as the examined word. The longest possible root words
retrieved from the lexicon are "edebi", "edep", and "ede". Algorithms [8, 13] that assign a
stem by matching the examined word with the longest root words would select these root words
as output. But the examined word, "edebilecek", cannot be produced from any of these root
words; this can only be established through the morphological analysis procedure.
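
The "edebilecek" example can be made concrete with a toy validator. Everything in this sketch is an assumption made for illustration: the lexicon is modelled as a mapping from each lexical root to the surface shapes it may take (so that "edep" also appears as "edeb"), the verb root "et" (surface "ed") is added as the intended analysis even though the paper does not name it, and morphological analysis is reduced to splitting the remainder into a two-entry suffix list.

```python
def derivable(rest, suffixes):
    """Toy analysis: can `rest` be split into a sequence of known suffixes?"""
    if not rest:
        return True
    return any(rest.startswith(s) and derivable(rest[len(s):], suffixes)
               for s in suffixes)


def validated_roots(word, surface_forms, suffixes):
    """Return the lexical roots from which `word` can be produced."""
    hits = []
    for root, shapes in surface_forms.items():
        if any(word.startswith(shape) and derivable(word[len(shape):], suffixes)
               for shape in shapes):
            hits.append(root)
    return hits


surface_forms = {               # hypothetical lexicon: lexical root -> surface shapes
    "edebi": ["edebi"],
    "edep": ["edep", "edeb"],
    "ede": ["ede"],
    "et": ["et", "ed"],
}
toy_suffixes = ["ebil", "ecek"]  # hypothetical suffix inventory
print(validated_roots("edebilecek", surface_forms, toy_suffixes))   # ['et']
```

The three longest candidates are all rejected because the letters left over after each of them ("lecek", "ilecek", "bilecek") cannot be read as suffixes, whereas "ed" + "ebil" + "ecek" can.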
3.3 Selection of a stem
In Turkish, a surface form can be generated from more than one root. For example, the word
küçücükken (once one is very small) may conflate into either of the roots küçük (small) or
küçücük (very small). If one asks which of them truly represents the stem of küçücükken, the
answer is küçücük.
The number of suffixes applied in the derivation, and their types¹¹, form the basis for the
selection operation. It is worth mentioning, however, that ambiguity in conflating Turkish
words into single terms (or, the other way around, in generation) is another issue, on which
our hands are tied as far as stemming is concerned. For example, assume that a word has more
than one sense: the word "başlar" can be either the plural of "baş" (head) or a present-tense
inflection of the verb "başlamak" (to start). To find the actual stem of such a word,
semantic reasoning about the context must be carried out¹².
If no possible root word is found in the lexicon for the examined word, or the production
rules fail to derive that word from any root in the list, the word is kept as it is and/or
saved to a log file for examination, without passing to the next steps. It is highly probable
that such a word is either a foreign word or a word adopted into Turkish but not yet present
in the lexicon.
Putting it all together: The FindStem algorithm is shown in Figure 3.
4 The Experimental Method
For the experiments, we used a localized version of the SMART system. To localize SMART for
Turkish, the Turkish characters (ğ, Ğ, ü, Ü, ş, Ş, ı, İ, ö, Ö, ç, Ç) and a Turkish stopword
list were introduced to the system, and the English stemming algorithm in the system was
replaced with the FindStem stemming algorithm.

¹¹ Suffixation is divided into two main types, "derivational suffix" and "inflectional
suffix", which may be further divided into subtypes.
¹² Context here should not automatically be taken to mean sentence-level analysis; it may
sometimes be necessary to consider a paragraph or even the whole text.

2. Find all possible roots of the word in the lexicon and add them to the root words list.
3. If the root words list is empty, add the word to the not-found records and exit.
4. Get a root word from the root words list.
5. Apply morphological analysis to the root word.
6. After the morphological analysis, add the derivations formed to the derivations list.
7. If any elements remain in the root words list, go to step 4.
8. Choose the word stem by a selection among the derivations in the derivations list.
Fig. 3: The FindStem Algorithm
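
The steps of Figure 3 can be tied together in a short sketch. It is not the authors' implementation: pre-processing is omitted, the lexicon is a plain set without root-change information, morphological analysis is again a toy suffix split, and the selection rule (prefer the derivation that needs the fewest suffix applications) is an assumption, since the paper only states that the number and types of suffixes form the basis of the selection. All names are hypothetical.

```python
def count_suffixes(rest, suffixes):
    """Fewest toy suffixes that spell `rest`, or None if it cannot be spelled."""
    if not rest:
        return 0
    best = None
    for s in suffixes:
        if rest.startswith(s):
            tail = count_suffixes(rest[len(s):], suffixes)
            if tail is not None and (best is None or tail + 1 < best):
                best = tail + 1
    return best


def find_stem(word, lexicon, suffixes):
    """Steps 2-8 of Figure 3 under the assumptions stated above."""
    roots = [word[:n] for n in range(1, len(word) + 1) if word[:n] in lexicon]  # step 2
    derivations = []
    for root in roots:                                                         # steps 4-7
        k = count_suffixes(word[len(root):], suffixes)
        if k is not None:
            derivations.append((k, root))
    if not derivations:
        return None          # step 3 or failed analysis: keep the word as is and log it
    return min(derivations)[1]   # step 8 (assumed rule: fewest suffix applications)


toy_lexicon = {"göz", "gözlük"}        # hypothetical entries
toy_suffixes = ["lük", "ler", "im"]    # hypothetical, non-empty suffix strings
print(find_stem("gözlüklerim", toy_lexicon, toy_suffixes))   # gözlük
```

Here both "göz" and "gözlük" survive the analysis, and the selection step keeps "gözlük", in line with the requirement stated in the introduction that derivational affixes remain part of the stem.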
The experiment is divided into two parts. In the first part, the A-F algorithm [10] and the
L-M algorithm [8] are compared with the FindStem algorithm and their effectiveness is
investigated. In the second part, the FindStem algorithm is integrated into the localized
SMART system and its performance is measured using precision and recall, which are defined
for a user query as the proportion of retrieved relevant documents to the retrieval output
and to the relevant documents, respectively.
4.1 Evaluation of the Turkish stemming algorithms
The A-F algorithm was used in its entirety along with its lexicon, as obtained through
private communication, and the L-M algorithm was re-coded by us following the principles laid
out in [8].
The algorithms were tested on a data set consisting of 5,000 words. The true stems of each
word were tagged manually. Note that there can be more than one true stem for a given word,
as shown in Figure 4.

Searching Word    Result Stems
benzerlikten      1. benzerlik
                  2. benzer
                  3. benze(mek)
Fig. 4: Stems, not accepted as an error

The accuracies of the three stemming algorithms over that data set were computed.
4.2 Performance Measurement
"The effectiveness of stemming algorithms has usually been measured in terms of their effect
on retrieval performance with test collections" [20]. Our test collection is a set of Turkish
documents formed from a total of 2,468 law entries. The number of rows in these documents
ranges between 10 and 20; the total numbers of rows and words are 59,941 and 279,904,
respectively. This collection was indexed twice by the localized SMART system: once without
stemming and once with the FindStem algorithm. As a result, two different index sets were
formed in the system.
Fifteen queries¹³ were defined with complete information about the relevance of documents. In
correspondence with the document terms, the search (or query) terms were stemmed (or not
stemmed). We ran each query separately and used the retrieval outputs to determine whether
each retrieved document was relevant or not.
We used non-interpolated precision values at 8-point recall levels (from 0.3 to 1) to measure
the effectiveness of the system with and without stemmed terms. To calculate recall, the
total number of relevant documents in the collection should be known, but this is not
feasible when the collection contains approximately 2,500 documents. Instead, the cut-off
point is set at 100 documents, and it is assumed that all the documents relevant to a query
are among the first 100 documents retrieved. Recall is taken to be 1 at the position (in the
retrieval output) of the last relevant document displayed, and precision values are
calculated for every query at the various recall parameters (1, 0.9, 0.8, ..., 0.3). The
formula used to calculate the precision values for the various recall parameters is:

$$\text{Precision}(R) = \frac{N \times R}{P_X}$$

where "N" is the number of relevant documents retrieved, "R" is the recall parameter
(1, 0.9, ..., 0.3) and $P_X$ is the position of the Xth ($X = N \times R$) relevant document
in the retrieval output. For example, when the number of relevant documents is 10 out of 100
documents retrieved and the last relevant document is the 85th document in the retrieval
output, for recall parameter 1 the precision value will be $(10 \times 1)/85 \approx 0.118$,
and for recall parameter 0.9 the precision value will be $(10 \times 0.9)/52 \approx 0.173$
(52 is the position of the 9th ($X = 10 \times 0.9$) relevant document in the retrieval
output), etc.
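
The calculation can be sketched as below. The function name and the list of ranks are hypothetical (only the 9th and 10th positions, 52 and 85, come from the example in the text), recall levels are passed in tenths to avoid floating-point surprises, and rounding X up when N × R is not a whole number is an assumption, since the text does not say how that case is handled.

```python
def precision_at_recall(relevant_positions, tenths):
    """Precision N*R / P_X for recall R = tenths / 10.

    relevant_positions: 1-based ranks of the relevant documents, in ascending order.
    """
    n = len(relevant_positions)
    if n == 0:
        return 0.0
    x = -(-n * tenths // 10)            # ceil(N * R); the rounding rule is an assumption
    return (n * tenths / 10) / relevant_positions[x - 1]


# Hypothetical ranks of 10 relevant documents; the 9th is at 52 and the 10th at 85.
ranks = [3, 8, 12, 20, 27, 33, 41, 47, 52, 85]
for tenths in (10, 9):
    print(tenths / 10, round(precision_at_recall(ranks, tenths), 3))
```

For this ranking the two printed values are 0.118 and 0.173, matching the worked example.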
5 Experimental Results
5.1 Comparison and effectiveness of Turkish stemming algorithms
While the FindStem and L-M algorithms select only one root as the stem for each word, the A-F
algorithm can, under certain conditions, select more than one root for a word (Figure 5).
For 559 words the L-M algorithm did not find the manually entered stem, but many of the stems
it found are semantically related words (for example, the word "öğreti" is found as the stem
of the word "öğretilecek", instead of "öğret(mek)"). Because of this, only 138 of them are
counted as wrong.
The FindStem algorithm found, for 49 words, stems different from the manually entered ones.
In fact, the roots found by the algorithm can still be regarded as stems of the word (e.g.,
the manually entered stem is "göz" for the word "gözden", but the algorithm finds the root
word "gözde" as the stem).

¹³ Queries are accessible online, http://cmpe.emu.edu.tr/bitirim/stemming.

Word            FindStem   A-F               L-M        Manually entered stem
alanında        alan       al*               alan       alan
anlatılmak      anlat      an*               anlat      anlat
aşılmıştır      aş         aş                aşı*       aş
birleşmiş       birleş     bir               birleş     birleş
belirtilmeyen   belirt     1.be* 2.belirt    belirti*   belirt
çekleştirdiği   çek        çekleştirdiği*    çek        çek
aksamaması      aksa       aksa              aksam*     aksa
aralıkta        ara*       aralık            aralık     aralık
eklemek         ekle       ekle              eklem*     ekle
daha            daha       1.da* 2.daha      daha       daha
Fig. 5: The stems found by the algorithms for some words ("*" means the stem is accepted as
incorrect)
The A-F algorithm found completely wrong stems for 59 words. Furthermore, the algorithm
selected unsuitable roots as stems for 270 words (for example, the words "göre" and "görev"
were selected as stems of the word "görevini").
Some sample stems found by the algorithms are shown in Figure 5.
5.2 The Number of Documents Retrieved
The number of zero retrievals (i.e., no documents retrieved) or of retrievals that contain no
relevant documents (i.e., the precision ratio is zero) can be used to evaluate retrieval
performance with and without stemming.
The number of relevant documents retrieved for each query is given in Table 1. The first
number in the row labelled "Average" is the average number of relevant documents retrieved,
and the second one (in parentheses) is the total number of relevant documents retrieved.
As Table 1 shows, the average numbers of relevant documents retrieved for the 15 queries over
the stemmed and unstemmed indexes are 28.4 and 23.3, respectively. The total number of
relevant documents retrieved is 426 over the stemmed index and 350 over the unstemmed index.
Stemming thus increases the total number of relevant documents retrieved by approximately
22%.
In all but 3 of the 15 queries (i.e., queries 4, 10 and 14, where the counts are equal),
stemming increased the retrieval effectiveness of the system. This indicates the success of
the FindStem stemming algorithm.
          Number of relevant documents
Query     Stemmed index   Unstemmed index
1         66              55
2         41              32
3         68              51
4         43              43
5         28              17
6         17              12
7         46              38
8         18              16
9         8               7
10        40              40
11        9               8
12        17              11
13        13              10
14        6               6
15        6               4
Average   28.4 (426)      23.3 (350)
Table 1: Number of relevant documents retrieved
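
The summary figures quoted for Table 1 can be re-derived directly from its rows. The snippet below is only an arithmetic check; the variable names are ours.

```python
stemmed = [66, 41, 68, 43, 28, 17, 46, 18, 8, 40, 9, 17, 13, 6, 6]
unstemmed = [55, 32, 51, 43, 17, 12, 38, 16, 7, 40, 8, 11, 10, 6, 4]

total_s, total_u = sum(stemmed), sum(unstemmed)
avg_s, avg_u = total_s / len(stemmed), total_u / len(unstemmed)
gain_pct = 100 * (total_s - total_u) / total_u
# Queries (numbered from 1) for which stemming made no difference.
unchanged = [q for q, (s, u) in enumerate(zip(stemmed, unstemmed), start=1) if s == u]

print(total_s, total_u)   # 426 350
print(unchanged)          # [4, 10, 14]
print(f"{avg_s:.1f} {avg_u:.1f} {gain_pct:.1f}%")
```

The totals are 426 and 350, the averages are 28.4 and about 23.3, and the relative gain is about 21.7%, which the text rounds to 22%.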
          Average Precision of 15 queries
Recall    w/ Stemming   w/out Stemming
1         0.415         0.283
0.9       0.508         0.316
0.8       0.534         0.415
0.7       0.586         0.490
0.6       0.625         0.547
0.5       0.681         0.589
0.4       0.781         0.644
0.3       0.818         0.664
Average   0.619         0.494
Table 2: Average precision values when stemming is used and not used
5.3 Precision values for various recall parameters
The average precision values of the 15 queries at the various recall parameters over the
stemmed and unstemmed indexes are given in Table 2¹⁴, along with the overall average of these
averages.
Table 2 shows that precision increases as recall decreases and that better precision values
are obtained when word stemming is used. The effect of stemming on retrieval performance is
thus apparent, as shown in Figure 6.
[Figure: precision (vertical axis, 0 to 0.9) plotted against recall (horizontal axis, 1 down
to 0.3), with one curve for "w/ Stemming" and one for "w/out Stemming"]
Fig. 6: Precision values for various recall parameters
The usage of stemming has increased the retrieval effectiveness of the system by
approximately 25% in terms of precision.
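
The 25% figure follows from the two "Average" entries of Table 2, as the short check below illustrates; the variable names are ours and the input values are copied from the table.

```python
with_stemming = [0.415, 0.508, 0.534, 0.586, 0.625, 0.681, 0.781, 0.818]
without_stemming = [0.283, 0.316, 0.415, 0.490, 0.547, 0.589, 0.644, 0.664]

mean_with = sum(with_stemming) / len(with_stemming)
mean_without = sum(without_stemming) / len(without_stemming)
relative_gain_pct = 100 * (mean_with - mean_without) / mean_without

print(f"{relative_gain_pct:.1f}%")
```

The column means work out to 0.6185 and 0.4935 (reported in Table 2 as 0.619 and 0.494), a relative gain of roughly 25.3%.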
6 Conclusions
In this article, the necessity of morphological analysis in Turkish stemming algorithms has
been described. The main idea of the algorithms that do not use morphological analysis is to
select the root¹⁵ from a lexicon as the stem, which increases the number of errors in word
stemming. Another point is that more than one stem may be found for some words, and it is not
possible to decide on the real stem without examining the context in which the word appears;
if one of them is selected as the stem, it may be the wrong one and cause incorrect stemming.
However, assigning as stems all the roots whose morphological analysis is positive sometimes
yields the original root alongside the stem. An elimination among the words assigned as stems
is therefore necessary to find the real stem.

¹⁴ The precision values of every query for every recall parameter (1, 0.9, ..., 0.3) are
accessible online, http://cmpe.emu.edu.tr/bitirim/stemming.
¹⁵ The root will be the longest set of characters that has compatibility with the word.

A Turkish Word Formation
In Turkish, root words and words derived from roots follow well-defined rules, such as the
major and minor laws of vowel harmony. These rules determine the changes that are made to
both root words and affixes in the formation of words.
A.1 Turkish alphabet
In Turkish word formation, mainly the vowels and unvoiced consonants are subject to change.
The vowels and voiced/unvoiced consonants, and the characteristics of vowels, are shown in
Table 3 and Table 4.

         Unrounded      Rounded
         Low   High     Low   High
Back     A     I        O     U
Front    E     İ        Ö     Ü
Table 3: Vowels and their characteristics

Unvoiced Consonants   f, p, ş, ç, h, s, t, k
Voiced Consonants     b, c, d, g
Table 4: Voiced and unvoiced consonants
A.2 The possible changes on roots and suffixes in word formation
The two important changes observed in word formation are assimilation and dropping. In
addition, character changes, such as unvoiced consonants becoming voiced, are observed in
Turkish.
Assimilation. During the word formation process in Turkish, if the last character of a root
is a vowel and the first character of a suffix is also a vowel, then consonants such as "n",
"s", "y", "ş" are used as assimilators. Some examples are:
bahçe (garden)       + ı     bahçesi (his/her/its garden)
komşu (neighbor)     + in    komşunun (neighbor's)
pencere (window)     + ı     pencereyi (the window)
iki (two)            + er    ikişer (two each)
Dropping. During Turkish word formation, a letter may be lost from either the root word or
the suffix. The lost letter can be a vowel or a consonant.
The drop of a middle vowel: The vowel of the second syllable is lost when a suffix beginning
with a vowel is added. For example:
oğul (son)           + u     oğlu (his/her/its son)
burun (nose)         + u     burnu (his/her/its nose)
karın (stomach)      + im    karnım (my stomach)
The drop of a vowel at the end of a root: When words such as koku (smell), sızı (pain), and
yumurta (egg) combine with the suffix "-le" to form a derived verb, the last vowel of the
word is dropped. For example:
koku (smell)         + le    kokla (smell it)
The drop of a consonant at the end of a root: In some situations the last consonant is
dropped when root and suffix combine. For example:
küçük (small)        + cik     küçücük (very small)
yüksek (high)        + (e)l    yükselmek (to rise)
Character change. When a vowel is added to some nouns of one syllable and most nouns of more
than one syllable ending in "p", "ç", "t", "k", the final consonant changes to "b", "c", "d",
or "ğ", respectively. This happens, for instance, with the addition of the third-person
suffix "-i", as shown below:
kitap (book)         + ı     kitabı (his/her/its book)
ağaç (tree)          + ı     ağacı (his/her/its tree)
armut (pear)         + ı     armudu (his/her/its pear)
ayak (foot)          + ı     ayağı (his/her/its foot)
There are, however, nouns whose final consonants are not subject to this change. In addition,
when the letter "n" comes before the letter "k" at the end of a word, the "k" becomes "g"
instead of "ğ". For example:
renk (color)         + ı     rengi (his/her/its color)
In words of foreign origin, if the last letter of the word is "g" and the word combines with
a suffix whose first letter is a vowel, the "g" changes to "ğ", as in the example:
monolog (monologue)  + ı     monoloğu (his/her/its monologue)
The change from "g" to "ğ" is not observed in one-syllable words or in words whose
second-to-last letter is "n". For example:
şezlong (chaise longue)  + ı     şezlongu (his/her/its chaise longue)
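
The character-change rules above, combined with the vowel harmony summarized in Table 3, can be sketched as a small function. It is an illustration only: it covers consonant-final nouns that do undergo the change, ignores the exceptions mentioned in the text (for example one-syllable nouns that keep their final consonant), and does not handle vowel-final nouns, which need an assimilating consonant. The function and table names are hypothetical.

```python
VOWELS = "aeıioöuü"
# Four-way harmony for the high-vowel suffix: it copies the backness and
# rounding of the last vowel of the noun.
HARMONY = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
           "o": "u", "u": "u", "ö": "ü", "ü": "ü"}
SOFTEN = {"p": "b", "ç": "c", "t": "d", "k": "ğ", "g": "ğ"}


def third_person_possessive(noun):
    """Attach the third-person suffix to a consonant-final noun that softens."""
    last_vowel = [c for c in noun if c in VOWELS][-1]
    final = noun[-1]
    if final in SOFTEN:
        if final in "kg" and noun[-2] == "n":
            final = "g"               # renk -> rengi, şezlong -> şezlongu
        else:
            final = SOFTEN[final]     # kitap -> kitabı, monolog -> monoloğu
    return noun[:-1] + final + HARMONY[last_vowel]


for noun in ["kitap", "ağaç", "armut", "ayak", "renk", "monolog", "şezlong"]:
    print(noun, "->", third_person_possessive(noun))
```

Run on the appendix examples, the sketch reproduces kitabı, ağacı, armudu, ayağı, rengi, monoloğu and şezlongu. A stemmer has to undo these alternations when it looks a surface form up in the lexicon, which is why the lexicon stores the possible root changes.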
References
1. Irene Diaz, Jorge Morato, and Juan Lloréns. An algorithm for term conflation based on
tree structures. Journal of The American Society for Information Science and Technology,
53(3):199–208, 2002.
2. R. Krovetz. Viewing morphology as an inference process. In Proceedings of the 16th
International Conference on Research and Development in Information Retrieval, ACM, pages
191–202, New York, 1993.
3. Donna Harman. How effective is suffixing? JASIS, 42(1):7–15, 1991.
4. Mirko Popovic and Peter Willett. The effectiveness of stemming for natural language access
to Slovene textual data. Journal of the American Society for Information Science, 43:384–
390, 1992.
5. A. B. Ercilasun et al. İmla Klavuzu, volume 525. Atatürk Kültür ve Tarih Yüksek Kurumu,
Türk Dili Kurumu Yayınları, Ankara, Turkey, 1996.
6. T. Banguoğlu. Türkçenin Grameri. Atatürk Kültür ve Tarih Yüksek Kurumu, Türk Dili
Kurumu Yayınları:528, Ankara, Turkey, 1995.
7. Ari Pirkola. Morphological typology of languages for IR. Journal of Documentation,
57(3):330–348, May 2001.
8. A. Kut, A. Alpkoçak, and E. Özkarahan. Bilgi bulma sistemleri için otomatik türkçe dizinleme yöntemi. In Bilişim Bildirileri, Dokuz Eylül University, İzmir, Turkey, 1995.
9. A. Köksal. Tümüyle özdevimli deneysel bir belge dizinleme ve erişim dizgesi. TURDER,
TBD 3. Ulusal Bilişim Kurultayı, pages 37–44, 6-8 April 1981. Ankara, Turkey.
10. A. Solak and F. Can. Effects of stemming on Turkish text retrieval. Technical report BUCEIS-94-20, Bilkent University, Ankara, Turkey, 1994.
11. Gökmen Duran and Hayri Sever. Türkçe gövdeleme algoritmalarının analizi. In Ulusal
Bilişim Kurultayı Bildiri Kitabı, pages 235–242, İstanbul, Turkey, September 1996.
12. F. Çuna Ekmekçioğlu and Peter Willett. Effectiveness of stemming for Turkish text
retrieval. Program, 34(2):195–200, April 2000.
13. A. Köksal. Automatic Morphological Analysis of Turkish. PhD thesis, Hacettepe University,
1975.
14. G. L. Lewis. Teach Yourself Turkish. Sevenoaks, second edition, 1989.
15. Kemal Oflazer. Two-level description of Turkish morphology. Literary and Linguistic
Computing, 1994.
16. E. L. Antworth. Glossing text with the PC-KIMMO morphological parser. Computers and the
Humanities, 1993.
17. K. Oflazer and C. Guzey. Spelling correction in agglutinative languages. In Proceedings
of 4th ACL Conference on Applied Natural Language Processing, pages 194–195, Stuttgart,
Germany, October 1994.
18. D. Hull. Stemming algorithms: A case study for detailed evaluation. Journal of The
American Society for Information Science, 47(1):70–84, 1996.
19. Richard Sproat. Morphology and Computation. Cambridge MA: MIT Press, 1992.
20. Chris D. Paice. An evaluation method for stemming algorithms. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in
Information Retrieval, pages 42–50, 3-6 July 1994.
