A COMPARISON BETWEEN SINGLE LINKAGE AND COMPLETE LINKAGE IN AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS FOR IDENTIFYING TOURIST SEGMENTS

Cluster Analysis is a multivariate statistical method, and Agglomerative Hierarchical Cluster Analysis is one of its approaches. Two of the linkage methods used in Agglomerative Hierarchical Cluster Analysis are Single Linkage and Complete Linkage. The purpose of this study is to compare Single Linkage and Complete Linkage in Agglomerative Hierarchical Cluster Analysis. The performance of the two linkage methods was compared using the Kruskal-Wallis test, and the result of the comparison was used to segment tourists of Kapas Island. The data were analyzed with the statistical software SPSS. The result of the Kruskal-Wallis test shows that Complete Linkage is more useful for identifying tourist segments.


INTRODUCTION
In statistics, several methods have been developed to divide a sample of observations into smaller groups. One of these methods is Cluster Analysis. This method sorts observations into different groups based on their similarity. Cluster Analysis also refers to a collection of statistical methods that identify groups of samples showing similar characteristics.
There are many approaches in Cluster Analysis. One of them is Agglomerative Hierarchical Cluster Analysis. The first step in this approach is to compute the similarity among cases or observations. In Agglomerative Hierarchical Cluster Analysis, the similarity among cases is expressed as a distance. In this study, the Euclidean distance measure is applied to compute the distance among cases. Cases that are similar are placed in the same cluster or group. The distance among clusters can be computed using the Single Linkage or Complete Linkage method. Single Linkage focuses on the minimum distance, or nearest neighbor, between clusters, whereas Complete Linkage concentrates on the maximum distance, or furthest neighbor, between clusters. This research compares the efficiency of Single Linkage and Complete Linkage in Agglomerative Hierarchical Cluster Analysis. The comparison is based on an evaluation of the output of both linkage methods. The Kruskal-Wallis test is applied in this research to contrast the performance of Single Linkage and Complete Linkage. The Kruskal-Wallis test is a nonparametric test used to compare independent groups of sampled data. The objectives of this research are:
a. To compare the performance of Single Linkage and Complete Linkage in Agglomerative Hierarchical Cluster Analysis.
b. To assign groups or clusters of tourists who visit Kapas Island, Terengganu.
Cluster Analysis is a multivariate data analysis method that groups similar objects together. Agglomerative Hierarchical Cluster Analysis is one method of Cluster Analysis. The method first measures the similarity between different points using the Euclidean distance measure. The similarities between different clusters are then calculated using the Single Linkage and Complete Linkage methods. The two linkage methods are therefore compared using the Kruskal-Wallis test in determining the clusters of Kapas Island tourists. It is difficult to assign these tourists to groups since they come from various backgrounds. This problem is addressed using Agglomerative Hierarchical Cluster Analysis.
Cluster Analysis is a widely used family of multivariate techniques for grouping individuals, objects or behaviors into similar clusters [1]. The flexibility of cluster analysis to accommodate a wide range of applications makes it one of the most useful tools for understanding the natural structures among observations [1]. In tourism research, for example, cluster analysis is often used to identify market segments in order to improve the effectiveness of marketing efforts. These segments may be based on a variety of variables, including demographic characteristics (such as age, income, gender and location) and trip characteristics (such as trip length, purpose, group size and benefits) [1]. Reference [2] stated that hierarchical cluster analysis is a set of statistical techniques that is particularly useful for separating a set of objects into constituent groups or clusters which minimize variation between members of the same group, without making assumptions about the number of groups or the group structure.

Research Site and Instrument
The site selected for this research is Kapas Island, which is located at Marang, Terengganu. The preferred sample size for Hierarchical Cluster Analysis is not more than 200 samples [3]. Reference [4] mentioned that large data sets can be a problem for Agglomerative Hierarchical Cluster Analysis. For more than 200 data points, various forms of nonhierarchical Cluster Analysis offer an alternative to Agglomerative Hierarchical Cluster Analysis [4]. The sample of this research was 200 respondents, comprising local and international tourists who visited Kapas Island from July to September 2009. They were chosen using the snowball sampling technique, which is one of the non-probability sampling techniques. Both the local and the international tourists in this research were selected using this technique.
A questionnaire was distributed to the sample of this research. Ten separate visitor surveys were carried out at Kapas Island. The mode of survey delivery was a self-administered questionnaire. The surveys were based on a 7-page questionnaire with three sections: Section A, Section B and Section C. Section A included questions about the respondents' demographic profiles. Section B included 10 questions about the details of the visit, such as the frequency of visits to Kapas Island and the purpose of visiting Kapas Island. Section C contained items on visitor satisfaction with Kapas Island. The respondents answered 24 questions about the characteristics of their visit to Kapas Island. A Likert scale was used in Section C.

Agglomerative Hierarchical Cluster Analysis
In this method, each observation or object begins in a separate cluster. Next, the clusters that are closest together are merged to create one larger cluster. The general formula for Agglomerative Hierarchical Cluster Analysis is as follows [5]:

$$d_{k \to (r,s)} = \alpha_r d_{k \to r} + \alpha_s d_{k \to s} + \beta d_{r \to s} + \gamma \left| d_{k \to r} - d_{k \to s} \right| \quad (1)$$

where
$\alpha_r$ = system parameter corresponding to cluster $r$
$\alpha_s$ = system parameter corresponding to cluster $s$
$\beta$ = system parameter
$\gamma$ = system parameter
$d_{k \to (r,s)}$ = distance between cluster $k$ and the merged cluster $(r,s)$
$d_{k \to r}$ = distance between cluster $k$ and cluster $r$
$d_{k \to s}$ = distance between cluster $k$ and cluster $s$
$d_{r \to s}$ = distance between cluster $r$ and cluster $s$

The parameter values in Table 1 will be used to simplify (1). Reference [3] has recommended the following constraints on the parameter values to simplify (1).
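When $\alpha_r = \alpha_s$, the constraint $\alpha_r + \alpha_s + \beta = 1$ gives $2\alpha_r + \beta = 1$. Here, $\alpha_r = (1 - \beta)/2$. Next, a value of $\beta$ needs to be selected subject to $\beta < 1$. It is suggested that $\beta = 0$, since $0 < 1$. If a small nonzero value of $\beta$ is used instead, such as $\beta = -0.5$ or $\beta = 0.5$, it becomes $\alpha_r = 0.75$ or $\alpha_r = 0.25$.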

Parameter     Complete Linkage     Single Linkage
$\alpha_r$    1/2                  1/2
$\alpha_s$    1/2                  1/2
$\beta$       0                    0
$\gamma$      1/2                  -1/2

Complete Linkage
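The Complete Linkage formula is obtained from the general model of Agglomerative Hierarchical Cluster Analysis in several steps. Using the parameter values for Complete Linkage in Table 1 gives

$$d_{k \to (r,s)} = \tfrac{1}{2} d_{k \to r} + \tfrac{1}{2} d_{k \to s} + \tfrac{1}{2} \left| d_{k \to r} - d_{k \to s} \right|$$

Subsequently, (3) needs to be substituted into (2). If $d_{k \to r} > d_{k \to s}$, then $\left| d_{k \to r} - d_{k \to s} \right| = d_{k \to r} - d_{k \to s}$, and (2) reduces to

$$d_{k \to (r,s)} = d_{k \to r}$$

On the other hand, if $d_{k \to r} < d_{k \to s}$, then $\left| d_{k \to r} - d_{k \to s} \right| = d_{k \to s} - d_{k \to r}$. Using the parameter values for Complete Linkage in Table 1 again gives

$$d_{k \to (r,s)} = d_{k \to s}$$

Since the formula is symmetric in $d_{k \to r}$ and $d_{k \to s}$, the Complete Linkage model can be written as follows:

$$d_{k \to (r,s)} = \max\left[ d_{k \to r},\, d_{k \to s} \right]$$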

Single Linkage
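The Single Linkage formula is obtained in several steps. The first step is to substitute the parameter values for Single Linkage in Table 1 into (2):

$$d_{k \to (r,s)} = \tfrac{1}{2} d_{k \to r} + \tfrac{1}{2} d_{k \to s} - \tfrac{1}{2} \left| d_{k \to r} - d_{k \to s} \right| \quad (5)$$

Since the condition in (3) must hold for (5), (3) is substituted into (5). If $d_{k \to r} > d_{k \to s}$, then $\left| d_{k \to r} - d_{k \to s} \right| = d_{k \to r} - d_{k \to s}$. Hence,

$$d_{k \to (r,s)} = d_{k \to s}$$

When (4) is substituted into (5), that is, when $d_{k \to r} < d_{k \to s}$, the following equation results:

$$d_{k \to (r,s)} = d_{k \to r}$$

Since the formula is symmetric in $d_{k \to r}$ and $d_{k \to s}$, the Single Linkage model can be written as follows:

$$d_{k \to (r,s)} = \min\left[ d_{k \to r},\, d_{k \to s} \right]$$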
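The reduction of the general formula (1) to a minimum or a maximum can be checked numerically. The sketch below is only an illustration: it assumes the standard parameter values $\alpha_r = \alpha_s = 1/2$, $\beta = 0$, with $\gamma = -1/2$ for Single Linkage and $\gamma = 1/2$ for Complete Linkage, and the function names are hypothetical.

```python
def general_update(d_kr, d_ks, d_rs, alpha_r, alpha_s, beta, gamma):
    """General formula (1): distance from cluster k to the merged cluster (r, s)."""
    return (alpha_r * d_kr + alpha_s * d_ks + beta * d_rs
            + gamma * abs(d_kr - d_ks))


# Assumed parameter values (standard settings, not read from Table 1).
SINGLE = {"alpha_r": 0.5, "alpha_s": 0.5, "beta": 0.0, "gamma": -0.5}
COMPLETE = {"alpha_r": 0.5, "alpha_s": 0.5, "beta": 0.0, "gamma": 0.5}


def single_linkage(d_kr, d_ks, d_rs=0.0):
    """Formula (1) with the assumed Single Linkage parameters."""
    return general_update(d_kr, d_ks, d_rs, **SINGLE)


def complete_linkage(d_kr, d_ks, d_rs=0.0):
    """Formula (1) with the assumed Complete Linkage parameters."""
    return general_update(d_kr, d_ks, d_rs, **COMPLETE)


if __name__ == "__main__":
    # Both orderings of the two distances, plus the tied case.
    for d_kr, d_ks in [(3.0, 5.0), (5.0, 3.0), (4.0, 4.0)]:
        print(d_kr, d_ks, single_linkage(d_kr, d_ks), complete_linkage(d_kr, d_ks))
```

Because $\beta = 0$ under these assumptions, the distance between the two merged clusters does not affect the result, which is why `d_rs` defaults to zero here.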

Kruskal-Wallis Test
The Kruskal-Wallis test is one of the tests in nonparametric statistics. Comparative studies frequently involve the simultaneous comparison not just of two but of three or more treatments or conditions [7]. In this research, the Kruskal-Wallis test is used to compare Single Linkage and Complete Linkage.
The first procedure in the Kruskal-Wallis test is to rank all the observations in the combined sample. Next, the sum of the ranks is computed for each cluster. The sum of the ranks for the $i$th treatment, $\sum r_i$, is given as follows:

$$\sum r_i = r_{i1} + r_{i2} + \cdots + r_{i n_i}$$

where
$n_i$ = number of subjects in the $i$th treatment
$r_{i1}$ = rank of the first observation in the $i$th treatment group

The Kruskal-Wallis test is applied after the sums of the ranks have been computed. The assumptions for this test are that all samples are random samples from their respective populations and that the measurement scale is at least ordinal. The Kruskal-Wallis test statistic is given by:

$$KW = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{\left( \sum r_i \right)^2}{n_i} - 3(N+1)$$

where $N$ is the total number of observations and $k$ is the number of treatments. The null hypothesis of the Kruskal-Wallis test is that the populations do not differ in location. This hypothesis can be written in terms of the respective treatment effects as:
$H_0$: $\theta_1 = \theta_2 = \cdots = \theta_k$
$H_1$: at least two $\theta$s differ
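The ranking and rank-sum steps can be sketched in a few lines of Python. This is a minimal illustration rather than the SPSS procedure used in the study: it assumes the usual form of the statistic, $KW = \frac{12}{N(N+1)} \sum R_i^2 / n_i - 3(N+1)$ with $R_i$ the rank sum of group $i$, assigns tied observations the average of their ranks, and applies no tie correction. The function names and the toy scores are hypothetical.

```python
def midranks(values):
    """Rank values in ascending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (KW, rank sums) for a list of groups of scores."""
    pooled = [score for group in groups for score in group]
    ranks = midranks(pooled)
    n_total = len(pooled)
    rank_sums, start = [], 0
    for group in groups:
        rank_sums.append(sum(ranks[start:start + len(group)]))
        start += len(group)
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    kw = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return kw, rank_sums


if __name__ == "__main__":
    toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # hypothetical scores
    print(kruskal_wallis(toy_groups))
```

The statistic is then compared with a Chi Square critical value on $k - 1$ degrees of freedom, where $k$ is the number of groups.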

RESULTS AND DISCUSSION
Ordinal data gathered from research respondents are usually not normally distributed and therefore need to be analyzed using nonparametric tests [8]. The purpose of the normality test is to confirm that the variables to be used are not normally distributed, since the Kruskal-Wallis test is a nonparametric approach.

Normality Test for Ordinal Data
The assumption is that all the variables for ordinal data are qualitative. The hypotheses for this test are as follows:
H0: The sample comes from a normal distribution
H1: The sample does not come from a normal distribution
Based on Table 2, the Kolmogorov-Smirnov test gives a significance value (p-value) of 0.000 for every variable, which is less than 0.05. Reference [8] stated that for the Kolmogorov-Smirnov and Shapiro-Wilk tests, data are normally distributed if both tests are not significant, that is, Sig. > 0.05. Hence, there is enough evidence at the 5% level of significance to reject H0, and it can be concluded that none of the variables comes from a normal distribution.

Determination Number of Clusters
The Rule of Thumb formula has been used to determine the number of clusters. The formula is as follows:

$$k \approx \sqrt{n/2}$$

where $k$ = number of clusters and $n$ = number of objects. With $n = 200$ respondents, this gives $k = 10$ clusters.
Table 3 shows the number of members in each cluster when Single Linkage is applied in Agglomerative Hierarchical Cluster Analysis. Cluster 1 holds the large majority with 191 members, while Clusters 2, 3, 4, 5, 6, 7, 8, 9 and 10 each have only one member.
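As a quick check of the cluster count, the sketch below assumes the common form of the rule of thumb, $k \approx \sqrt{n/2}$, which reproduces the ten clusters used in this study for 200 respondents. The function name is hypothetical.

```python
import math


def rule_of_thumb_clusters(n_objects):
    """Approximate number of clusters, assuming k = sqrt(n / 2)."""
    return round(math.sqrt(n_objects / 2))


if __name__ == "__main__":
    print(rule_of_thumb_clusters(200))  # 200 respondents, as in this study
```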

Data Analysis of Agglomerative Hierarchical Cluster Analysis Using Complete Linkage
When Complete Linkage is used in Agglomerative Hierarchical Cluster Analysis, the members of the ten clusters are as follows: Table 4 shows the number of members in each cluster when Complete Linkage is applied in Agglomerative Hierarchical Cluster Analysis. Cluster 1 is the largest cluster with 61 members, while Cluster 10 is the smallest with only two members.
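The contrast between one dominant cluster under Single Linkage and more balanced clusters under Complete Linkage can be reproduced on a small example. The sketch below is a naive agglomerative procedure on hypothetical toy points, not the survey data or the SPSS routine; the function names and coordinates are assumptions for illustration. Passing `min` gives Single Linkage and passing `max` gives Complete Linkage.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def agglomerate(points, n_clusters, linkage):
    """Merge the two closest clusters until n_clusters remain.

    linkage is min for Single Linkage or max for Complete Linkage.
    Returns the clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(euclidean(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: a chain of slowly spreading points plus two outliers.
TOY_POINTS = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.3, 0.0),
              (4.6, 0.0), (6.0, 0.0), (9.0, 0.0), (12.5, 0.0)]

if __name__ == "__main__":
    print("single:  ", agglomerate(TOY_POINTS, 3, min))
    print("complete:", agglomerate(TOY_POINTS, 3, max))
```

On these toy points, Single Linkage chains the first six points into one cluster and leaves the two outliers as singletons, while Complete Linkage produces clusters of sizes 4, 2 and 2.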

Calculation of Kruskal Wallis test from application of Single Linkage in Agglomerative Hierarchical Cluster Analysis
The responses of each respondent to the Likert scale questions in this research have been totaled. The totals are analyzed using the Kruskal-Wallis test. The hypotheses of the research are as follows:
H0: The ten clusters of tourists have the same satisfaction value for Kapas Island
H1: The ten clusters of tourists have different satisfaction values for Kapas Island
Step 1: Arrangement of positions and ranks for the ordinal scale scores
The respondents were arranged in ascending order, in positions 1 to 200, because there were 200 respondents. The ranks of the respondents were then identified from these positions. Some respondents have the same satisfaction value as other respondents, e.g. respondents 16, 139 and 183.
Step 2: Calculation of the total of ranks for each cluster
The total of ranks for each cluster is shown as follows:
Table 5: Total of rank for ten clusters (Single Linkage).

Cluster     Total of Rank

According to Table 5, Cluster 1 has the highest total of rank (19714.5), while Cluster 9 has the lowest total of rank (16.5).
The estimated value of KW is determined using the Kruskal-Wallis test statistic.
Step 5: Making the decision for the Kruskal-Wallis test
Comparing the estimated value of KW with the critical value of KW shows that the estimated value (12.7923) is lower than the critical value (16.92). Therefore the null hypothesis, H0, is not rejected: the ten clusters of respondents or tourists have the same satisfaction value for Kapas Island.

Calculation of Kruskal Wallis test from Application of Complete Linkage in Agglomerative Hierarchical Cluster Analysis
Step 1 and Step 2 of the Kruskal-Wallis calculation for the clusters obtained when applying Complete Linkage in Agglomerative Hierarchical Cluster Analysis are the same as in the analysis for Single Linkage. According to Table 6, Cluster 1 has the highest total of rank (10037), while Cluster 10 has the lowest total of rank (5).
Step 3: Calculation of the estimated value of Kruskal-Wallis, KW
The estimated value of KW when Complete Linkage is applied in Agglomerative Hierarchical Cluster Analysis is calculated using the Kruskal-Wallis test statistic.
Step 4: Finding the critical value of Kruskal-Wallis, KW
The degrees of freedom are the same as in the case of Single Linkage. The critical value of KW can be found by referring to the table of critical values for Chi Square.

CONCLUSION
This study shows that the Complete Linkage approach in Agglomerative Hierarchical Cluster Analysis is more useful than the Single Linkage approach for segmenting tourists of Kapas Island. This is because the result of applying Complete Linkage in Agglomerative Hierarchical Cluster Analysis shows a difference in satisfaction value among the ten clusters of tourists. If the clusters had the same satisfaction value for Kapas Island, there would be no difference among them, and no distinct clusters would exist among the tourists.

Table 1: Values of parameters.

Table 4: Total of members of ten clusters when applying Complete Linkage in Agglomerative Hierarchical Cluster Analysis
From the table of critical values for Chi Square, at df = 9 and significance level p = 0.05, the critical value of KW is 16.92.

Table 6: Total of rank and mean of rank for ten clusters (Complete Linkage)
From the table of critical values for Chi Square, at df = 9 and p = 0.05, the critical value of KW is 16.92.
Step 5: Making the decision for the Kruskal-Wallis test
From Step 3 and Step 4 for this case, the estimated value of KW (154.7401) is higher than the critical value of KW (16.92). Therefore the null hypothesis was rejected. It can be concluded that the ten clusters of respondents obtained after applying Complete Linkage in Agglomerative Hierarchical Cluster Analysis have different satisfaction values for Kapas Island. Cluster 1 has the highest satisfaction value for Kapas Island compared with the other clusters (mean of rank for Cluster 1 = 164.5410, Cluster 2 = 79.5114, Cluster 3 = 46.0, Cluster 4 = 17.1667, Cluster 5 = 92.2778, Cluster 6 = 10.9583, Cluster 7 = 113.3333, Cluster 8 = 46.20, Cluster 9 = 43.4375, Cluster 10 = 2.5).
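The decision step for both linkage methods can be summarized in a short sketch. It uses only the values reported in this paper (the two estimated KW values and the critical value 16.92 at df = 9 and p = 0.05); the function and variable names are hypothetical.

```python
N_CLUSTERS = 10
DEGREES_OF_FREEDOM = N_CLUSTERS - 1  # df = k - 1 = 9
CRITICAL_KW = 16.92  # Chi Square critical value at df = 9, p = 0.05 (from the text)

# Estimated KW values reported in the text for each linkage method.
ESTIMATED_KW = {"Single Linkage": 12.7923, "Complete Linkage": 154.7401}


def decide(kw, critical=CRITICAL_KW):
    """Reject H0 only when the estimated KW exceeds the critical value."""
    return "reject H0" if kw > critical else "fail to reject H0"


if __name__ == "__main__":
    for method, kw in ESTIMATED_KW.items():
        print(method, kw, decide(kw))
```

Only the Complete Linkage clusters lead to rejection of H0, which is the basis for preferring Complete Linkage in this study.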