Session 1 Descriptive Analysis
STRUCTURE
OBJECTIVES
In the previous chapter, you studied some useful commands in SPSS. Now, let us move forward
and study five important aspects of descriptive statistics: frequency distribution, cross tabulation,
measures of central tendency, measures of variability or dispersion, and distribution analysis.
Data plays a pivotal role in the research process. However, the resources and effort spent on
collecting and storing data are wasted if the data is not analyzed efficiently. Data can be analyzed
with the help of appropriate statistical tools. For example, frequency distribution analysis shows
how often each category of a single categorical variable occurs. Cross tabulation is used to analyze
the joint occurrence of the sub categories of two or more categorical variables. Similarly, the
measures of central tendency provide a single representative value for a large quantity of data, and
the measures of dispersion indicate the degree of difference among the values in a dataset.
Distribution analysis is used to study the nature of the distribution of the data and to select the
right statistical method for the analysis. For example, parametric tests are used for normally
distributed variables. In this chapter, we will cover these statistical tools in detail.
[Figure: bar chart of Frequency and Percentage by ownership type (Govt Owned, Private Ltd Company, Single Owner)]
For metric variables (interval and ratio scale), frequency polygons and histograms are
more suitable.
Now, let us see how to use frequency distributions in SPSS. Consider the dataset of workers
working in small- and medium-scale enterprises in a city of India, as shown in Table 5.2:
[Table 5.2: worker dataset with the variables S No., Gender, Religion, Education, and Age Group]
9 1.00 2.00 2.00 1.00 34 1.00 5.00 2.00 1.00
The coding details of different variables in the dataset are shown in Table 5.3:
Variable     Coding
Age group    2 = 26 to 35 years old
             3 = 36 to 45 years old
             4 = 46 to 55 years old
             5 = 56 and above
Religion     1 = Hindu
             2 = Muslim
             3 = Other religion
Suppose we want to estimate the frequency distribution of education profiles of the workers.
For this, the required procedure is as follows:
Step 1: Click ‘Analyze’ → ‘Descriptive Statistics’ → ‘Frequencies,’ as shown in Figure 5.2:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.2: SPSS Command for Descriptive Statistics (1)
Step 2: Next, transfer the variable ‘Education’ to the ‘Variable(s)’ window and click ‘Charts,’
as shown in Figure 5.3:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.3: SPSS Command for Descriptive Statistics (2)
Step 3: Next, select the type of chart (e.g., ‘Bar charts’), as shown in Figure 5.4:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.4: SPSS Command for Descriptive Statistics (3)
Step 4: Finally, click ‘Continue’ and then ‘OK.’ The final SPSS output in the tabular form is
shown in Table 5.4:
Education
As you can see in Table 5.4 and Figure 5.5, the output represents the frequency distribution
in tabular as well as graphical form. In addition to bar charts, other types of charts (pie
chart and histogram) can also be selected from the available options. The bar chart and pie
chart are used to represent the frequencies of the different sub categories of a nominal variable.
The histogram, however, is used for metric variables such as age and income,
whose values follow a natural order.
In SPSS, flexibility in terms of charts is limited; therefore, you can use MS Excel graphs
for drawing good-quality charts of frequency distribution.
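The frequency table that SPSS produces can also be reproduced by hand, which may help in checking the output. The following is a minimal Python sketch, assuming a small hypothetical list of education codes (these values are made up and are not taken from Table 5.2); the function name `frequency_table` is illustrative only.

```python
from collections import Counter

# Hypothetical education codes for ten workers (NOT the data of Table 5.2).
education = [1, 2, 2, 3, 1, 2, 4, 2, 3, 1]


def frequency_table(values):
    """Return (value, count, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        percent = 100.0 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], round(percent, 1), round(cumulative, 1)))
    return rows


table = frequency_table(education)
print("Value  Count  Percent  Cum. percent")
for value, count, percent, cum_percent in table:
    print(f"{value:>5}  {count:>5}  {percent:>7.1f}  {cum_percent:>12.1f}")
```

Each row mirrors the columns of a typical frequency output: the category, its count, its share of all cases, and the running total of the shares.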
Cross tabulation is one of the popular methods of representing the joint frequency distribution of the
cases of two or more nominal variables in a dataset. For example, for the dataset given in the previous
section, the cross tabulation of the variables “Gender” and “Religion” can be obtained as given below:
Step 1: Click ‘Analyze’ → ‘Descriptive Statistics’ → ‘Crosstabs,’ as shown in Figure 5.6:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.6: SPSS Output
Step 2: Next, transfer the variable ‘Gender’ to the Row(s) and ‘Religion’ to the Column(s) window and
click ‘OK,’ as shown in Figure 5.7:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.7: SPSS Output
The SPSS output presents the cross tabulation results. The results show that there are 35
males and 15 females in the dataset. Out of the 35 males, 11 are Hindu, 15 are Muslim, and the rest belong to
other religions. Similar interpretations can be made for females.
In SPSS, cross tabulations can be done with additional layers. You can also use the ‘split file’
command to design a multi-layer cross tabulation.
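A cross tabulation is simply a count of how often each combination of categories occurs. The following Python sketch illustrates the idea, assuming a small hypothetical set of (gender, religion) pairs; the records and the helper name `crosstab` are made up for illustration and do not reproduce the chapter's dataset.

```python
from collections import Counter

# Hypothetical (gender, religion) pairs; the values are invented for illustration.
records = [
    ("Male", "Hindu"), ("Male", "Muslim"), ("Female", "Hindu"),
    ("Male", "Muslim"), ("Female", "Other"), ("Male", "Hindu"),
    ("Female", "Muslim"), ("Male", "Other"),
]


def crosstab(pairs):
    """Count joint occurrences; return (cell counts, row totals, column totals)."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    return cells, row_totals, col_totals


cells, row_totals, col_totals = crosstab(records)
columns = sorted(col_totals)

# Print the table with gender in the rows and religion in the columns.
print("Gender".ljust(8) + "".join(c.rjust(8) for c in columns) + "Total".rjust(8))
for row in sorted(row_totals):
    counts = "".join(str(cells[(row, c)]).rjust(8) for c in columns)
    print(row.ljust(8) + counts + str(row_totals[row]).rjust(8))
```

The row totals correspond to the number of males and females, and each cell gives the number of cases that fall into one gender and one religion at the same time.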
Arithmetic Mean
The mean of a variable represents its average value. It can be calculated by using the
following formula:
$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
Where $\bar{X}$ represents the mean, $X_i$ represents the ith observation of the variable, and
$f_i$ represents the frequency of the ith observation.
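The formula can be checked with a short calculation. The following Python sketch computes the frequency-weighted mean for a small hypothetical set of values and frequencies; the function name `frequency_weighted_mean` and the data are assumptions made for illustration.

```python
def frequency_weighted_mean(values, frequencies):
    """Mean of grouped data: sum(f_i * X_i) / sum(f_i)."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total_frequency = sum(frequencies)
    if total_frequency == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(values, frequencies)) / total_frequency


# Hypothetical grouped data: the value X_i is observed f_i times.
values = [10, 20, 30]
frequencies = [2, 5, 3]

mean = frequency_weighted_mean(values, frequencies)
print(mean)  # (2*10 + 5*20 + 3*30) / (2 + 5 + 3) = 21.0
```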
One of the problems with arithmetic mean is that it is highly sensitive to the presence of
outliers in the data of the related variable. To avoid this problem, the trimmed mean of the
variable can be estimated. Trimmed mean is the value of the mean of a variable after
removing some extreme observations (e.g., 2.5 percent from both the tails of the
distribution) from the frequency distribution.
Mean is the hypothetical value of a variable. It may or may not exist in the dataset.
Median
Median is known as the ‘positional average’ of a variable. If we arrange the observations of a
variable in an ascending or descending order, the value of the observation that lies in the
middle of the series is known as median. The value of the median divides the observations of
a variable into two equal halves. Half of the observations of the variable are higher than the
median value and the other half observations are lower than the median value. The
extensions of median are quartiles, deciles, and percentiles.
Mode
The mode of a variable is the observation with the highest frequency or highest
concentration of frequencies.
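The positional measures described above can be illustrated with Python's standard `statistics` module. This is a minimal sketch on hypothetical figures (not the values of the chapter's sales table). Note that `statistics.quantiles` uses its default 'exclusive' interpolation rule here; other software may use a different rule, so quartiles can differ slightly between packages.

```python
import statistics

# Hypothetical sales figures, already sorted for readability.
sales = [12, 15, 25, 34, 40, 45, 45, 54, 56, 70, 90]

median = statistics.median(sales)              # middle value of the ordered data
modes = statistics.multimode(sales)            # all values with the highest frequency
q1, q2, q3 = statistics.quantiles(sales, n=4)  # three cut points: the quartiles

print("Median:", median)
print("Mode(s):", modes)
print("Quartiles:", q1, q2, q3)
```

The second quartile equals the median, which shows why quartiles are described as an extension of the median.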
Let us take an example to better understand the concept of mean, median, and mode. The
monthly sales figures (in crores) of an enterprise for 50 consecutive months are given in
Table 5.6:
Month  Sales    Month  Sales    Month  Sales    Month  Sales
    2     70       15     12       28     32       41     34
    3     45       16      8       29     54       42     56
    4     90       17     15       30     34       43     97
    5    110       18     40       31     45       44     34
    6     40       19     54       32     49       45     54
    7     90       20     56       33     68       46     70
    8     50       21     25       34     65       47     98
    9     70       22     43       35     70       48     45
   10     65       23     56       36     60       49     85
   12     72       25    120       38     40
   13     45       26    130       39    110
The required procedure for estimating the various measures of central tendency in SPSS is as
follows:
Step 1: Click ‘Analyze’ → ‘Descriptive Statistics’ → ‘Frequencies’
The same is shown in Figure 5.8:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.8: SPSS Command for Illustration (1)
Step 2: Next, transfer the variable to the ‘Variable(s)’ window and click ‘Statistics,’ as shown
in Figure 5.9:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.9: SPSS Command for Illustration (2)
Step 3: Next, select the options: ‘Mean,’ ‘Median,’ ‘Mode,’ and ‘Quartiles.’ Next, click (➔)
‘Continue’ and then ‘OK,’ as shown in Figure 5.10:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.10: SPSS Command for Illustration (3)
Monthly Sales
N             Valid       50
              Missing      0
Mean                   61.34
Median                 55.00
Mode                     45*
Percentiles   25       40.00
              50       55.00
              75       75.25
The result indicates that the average monthly sales is Rs. 61.34 crores, the median is Rs. 55
crores, and the mode is Rs. 45 crores. The quartiles indicate that the three estimated values
in the result (40, 55, and 75.25) divide the data into four equal groups. In addition to this, the
trimmed mean value can be estimated by using the following command:
Step 1: Click ‘Analyze’ → ‘Descriptive Statistics’ → ‘Explore’
Step 2: Next, transfer the variable to the ‘Dependent List’ window and click ‘OK.’
SPSS output of the command is shown in Table 5.8:
Variance 1021.290
Minimum 8
Range 142
Interquartile range 35
The result indicates that the value of the 5 percent trimmed mean is 59.99. It means that if
we remove the most extreme values from the distribution (SPSS drops the largest 5 percent and
the smallest 5 percent of the cases), the value of the mean becomes 59.99. As the trimmed mean is
less than the ordinary mean, it suggests the presence of outliers on the higher side.
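The effect of trimming can be seen in a small calculation. The following Python sketch is a simplified illustration on hypothetical data: the helper `trimmed_mean` and its parameter `proportion_per_tail` are assumptions, and it drops only whole observations from each tail, so it may not match the exact rule a statistical package applies.

```python
import statistics


def trimmed_mean(values, proportion_per_tail):
    """Mean after dropping the given proportion of observations from each tail."""
    if not 0 <= proportion_per_tail < 0.5:
        raise ValueError("proportion_per_tail must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion_per_tail)  # whole observations cut per tail
    return statistics.mean(ordered[k:len(ordered) - k])


# Hypothetical data with one large value on the higher side.
data = [8, 12, 15, 25, 34, 40, 45, 45, 54, 56,
        60, 65, 68, 70, 70, 72, 85, 90, 98, 150]

ordinary = statistics.mean(data)
trimmed = trimmed_mean(data, 0.05)  # illustrative choice: one value cut from each tail

print("Ordinary mean:", ordinary)
print("Trimmed mean: ", round(trimmed, 2))
```

Because the large value on the higher side is removed, the trimmed mean falls below the ordinary mean, which is the same pattern observed in the SPSS output.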
Range
Range is the difference of the extreme values (minimum and maximum) of a variable. The
range can be expressed as follows:
Range = Max value – Min value
As the calculation of the range depends only on the extreme values of the observations, it can be
highly misleading. For example, in an enterprise, the annual package of a person at a top
position may be in crores, whereas the package of a person at a lower position may be a few
thousand. In this case, the range cannot give a true picture of the variability in the salaries of
employees. Only a few executives in the enterprise receive a very high package,
and very few employees receive the lowest package. The salaries of most employees lie
somewhere between these two extremes.
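The sensitivity of the range to extreme values can be shown numerically. The following Python sketch uses hypothetical salary figures and compares the range with the interquartile range, which ignores the extremes; the quartiles come from `statistics.quantiles` with its default interpolation rule, which may differ slightly from other software.

```python
import statistics

# Hypothetical monthly salaries (in thousand rupees); one executive earns far more.
salaries = [18, 20, 22, 25, 25, 28, 30, 32, 35, 900]

value_range = max(salaries) - min(salaries)      # Range = Max value - Min value
q1, _, q3 = statistics.quantiles(salaries, n=4)  # first and third quartiles
iqr = q3 - q1                                    # spread of the middle half of the data

print("Range:", value_range)
print("Interquartile range:", iqr)
```

A single very high salary inflates the range, while the interquartile range still reflects the spread among most employees.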
Standard Deviation
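Standard deviation measures the typical deviation of the observations from the mean. It is the
square root of the average squared deviation and can be calculated by using the following formula:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
Where,
σ represents standard deviation
n is the number of observations
$x_i$ represents the ith observation of the variable
$\bar{x}$ represents the mean of the variable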
Variance
Variance is the square of standard deviation.
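The calculation can be traced step by step. The following Python sketch works on a small hypothetical dataset and computes the standard deviation both with the divisor n, as in the formula above, and with the divisor n − 1, which is the sample version that statistical packages such as SPSS typically report.

```python
import math
import statistics

# Hypothetical observations.
data = [40, 45, 54, 56, 70]

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population version: divide the sum of squared deviations by n.
population_variance = sum(squared_deviations) / len(data)
population_sd = math.sqrt(population_variance)

# Sample version: divide by n - 1 instead.
sample_variance = sum(squared_deviations) / (len(data) - 1)
sample_sd = math.sqrt(sample_variance)

print("Population variance and SD:", population_variance, round(population_sd, 3))
print("Sample variance and SD:    ", sample_variance, round(sample_sd, 3))
```

For large datasets the two versions are nearly identical; the difference matters mainly for small samples.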
For the monthly sales data given in Table 5.6, the procedure to estimate the measures of
variation in SPSS is as follows:
Step 1: Click ‘Analyze’ → ‘Descriptive Statistics’ → ‘Frequencies’
Step 2: Next, transfer the variable to the ‘Variable(s)’ window and then click (➔) ‘Statistics.’
Step 3: Next, select the options: ‘Range,’ ‘Standard Deviation,’ and ‘Variance.’
Step 4: Next, click (➔) ‘Continue’ and then ‘OK.’
SPSS output is shown in Table 5.9:
Missing 0
Mean 61.34
Median 55.00
Mode 45*
Variance 1021.290
Range 142
The above output indicates that the average monthly sales is Rs. 61.34 crores and the
standard deviation is Rs. 31.95 crores. The standard deviation
is about half of the mean, which indicates a high level of variation in the monthly
sales of the enterprise. In addition, the variance is the square of the standard deviation,
and the range is the difference between the highest and lowest values in the
data.
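The comparison of the standard deviation with the mean can be expressed as a single ratio, often called the coefficient of variation. The following short Python sketch computes it from the two figures reported in the interpretation above; the variable names are illustrative.

```python
# Figures reported in the SPSS output discussed above (Rs. crores).
mean_sales = 61.34
sd_sales = 31.95

# Coefficient of variation: standard deviation relative to the mean.
cv = sd_sales / mean_sales
print(f"Coefficient of variation: {cv:.2f} ({cv:.0%} of the mean)")
```

A ratio of roughly one half means that monthly sales typically deviate from the average by about half of the average itself.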
Step 1: Click ‘Analyze’ → ‘Descriptive Statistics’ → ‘Explore,’ as shown in Figure 5.11:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.11: SPSS Command for Distribution Analysis (1)
Step 2: Next, transfer the variable to the ‘Dependent List’ window, as shown in Figure 5.12:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Figure 5.12: SPSS Command for Distribution Analysis (2)
Step 3: Next, click ‘Plots’ and select ‘Normality plots with tests.’ Click ‘Continue’ and then
‘OK,’ as shown in Figure 5.13:
(Copyright: IBM Corp. IBM SPSS Statistics for Windows, Version 21.0.)
Median 55.00
Variance 1021.290
Minimum 8
Maximum 150
Range 142
Interquartile range 35
In Table 5.10, the results indicate that the distribution of the data is positively skewed
(skewness = 0.761) and leptokurtic (kurtosis = 0.240). Now, it needs to be checked whether the distribution
of the variable is normal or not. In order to check the normality of the distribution, the
Kolmogorov–Smirnov and Shapiro–Wilk tests can be applied. The null hypothesis of both
tests is that ‘the distribution of the data is normal.’ As the p-values of both tests are less
than the 5 percent level of significance, the null hypothesis of a
normal distribution of the data is rejected. Thus, it can be concluded that the
distribution of the observations of the variable is not normal. The results of the tests of
normality are shown in Table 5.11:
Kolmogorov–Smirnov* Shapiro–Wilk
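The skewness and kurtosis values discussed above are measures of shape that can be computed from the moments of the data. The following Python sketch uses the simple moment-based definitions on hypothetical data; the function name is illustrative, and statistical packages may apply small-sample corrections, so their reported values can differ slightly from these.

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 and 0 for a normal distribution)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical datasets: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)

print("Symmetric data:    skewness", round(skew_sym, 3), "kurtosis", round(kurt_sym, 3))
print("Right-skewed data: skewness", round(skew_right, 3), "kurtosis", round(kurt_right, 3))
```

A positive skewness indicates a longer tail on the higher side, and a positive excess kurtosis indicates a distribution that is more peaked, with heavier tails, than the normal distribution.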
1. Approx 68 percent of the data lie within the range of mean ± one SD
2. Approx 95 percent of the data lie within the range of mean ± two SD
3. Approx 99.7 percent of the data lie within the range of mean ± three SD
For example, suppose the daily sales of a passenger car manufactured by a company are found to be
normally distributed with a mean of 100 and a standard deviation of 10. Then, according to the empirical
rule (the 68–95–99.7 rule for normal distributions), it can be expected that
1. About 68 percent of the daily sales of the car in general lie between 90 and 110,
2. About 95 percent of the daily sales of the car in general lie between 80 and 120,
3. About 99.7 percent of the daily sales of the car in general lie between 70 and 130.
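The three percentages in this example can be verified directly from the normal distribution. The following Python sketch uses `NormalDist` from the standard `statistics` module with the mean of 100 and standard deviation of 10 given in the example; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Mean and standard deviation taken from the daily sales example above.
daily_sales = NormalDist(mu=100, sigma=10)


def share_within(dist, k):
    """Return (lower, upper, probability) for the interval mean ± k standard deviations."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return lower, upper, dist.cdf(upper) - dist.cdf(lower)


results = {k: share_within(daily_sales, k) for k in (1, 2, 3)}
for k, (lower, upper, share) in results.items():
    print(f"mean ± {k} SD: {lower:.0f} to {upper:.0f}, share = {share:.4f}")
```

The computed shares are about 0.68, 0.95, and 0.997, which agree with the intervals 90 to 110, 80 to 120, and 70 to 130 listed above.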
5.7 Summary
In this chapter, we discussed major data analysis tools, namely, frequency distribution,
cross tabulation, measures of central tendency, measures of variability or dispersion, and
distribution analysis. In addition, we explained all these tools with the help of SPSS.
5.8 Exercises
Multiple-Choice Questions
Q1. __________ represent(s) the counts of all outcomes of a variable in a sample.
a. Frequency distribution b. Measures of central tendency
c. Measures of dispersion d. Distribution analysis
Long-Answer Questions
Q1. Define frequency distribution. Also, explain its role in data analysis.
Q2. Explain the measures of central tendency. Also, discuss the different measures.
Q3. Elaborate on the measures of dispersion as a tool for data analysis.
Q4. Define the concept of distribution analysis in data analysis.
Q5. Differentiate between the measures of central tendency and measures of
dispersion.
Q6. Daily data of a commodity index, an agri index, and an energy index were collected for
the period from 2005 to 2015. The results of the descriptive and distribution analysis
of the three indexes are given below:
                                        Commodity Index    Agri Index    Energy Index
95% Confidence Interval   Lower Bound         2341.0532     1797.3058       2519.3571
for Mean                  Upper Bound         2383.3271     1835.1247       2571.3256
5% Trimmed Mean                               2348.1923     1790.6482       2507.2315
Median                                        2290.1400     1709.0500       2472.2200
Variance                                     208713.262    167041.968      315418.443
Std. Deviation                                456.85147     408.70768       561.62126
Minimum                                            0.00          0.00            0.00
Maximum                                         3595.27       2990.06         4749.95
Range                                           3595.27       2990.06         4749.95
Interquartile Range                              593.67        643.44          526.16
Skewness                                           .348          .854           1.251
Kurtosis                                          -.049          .343           2.904
Tests of Normality
                    Kolmogorov-Smirnov(a)          Shapiro-Wilk
                    Statistic   df     Sig.        Statistic   df     Sig.
Commodity Index     .070        1797   .000        .974        1797   .000
Agri Index          .158        1797   .000        .917        1797   .000
Energy Index        .131        1797   .000        .902        1797   .000
a. Lilliefors Significance Correction
1. Compare the dispersion to assess the risk associated with each index.
2. What is the null hypothesis of the Kolmogorov-Smirnov and Shapiro-Wilk tests?
3. Compare the measures of shape of the given indexes.
4. Why is the “trimmed mean” better than the “mean” of a variable?