Statistics in R - Part-2 : Descriptive Statistics in R

This post is part 2 of the series Statistics in R. In the first part, we learned some basics of R. In this post, we will learn the concepts of descriptive statistics through hands-on activities in R.

Descriptive Statistics?

Let's start with a simple example. Tony works in a milk plant, where he is responsible for noting the weight of every one-liter milk packet. The packaging machine is not highly accurate, so some packets are produced with less milk and some with more. He collected this data over a period of 10 days, and it looks as follows (weight in ml):

1022.8537 1046.8098 1025.9010 1011.9844  991.4790  999.9071 1066.9561 1006.6503 1057.1011
 959.8809  951.8257 1039.4336  947.3567  941.6532  998.9110  960.5717  920.5042 1034.3791
1008.4992  918.3915 1126.4705 1015.3833 1060.7138 1025.4741 1011.1905  958.6402 1001.0104
1056.7863 1014.9348 1020.6914 1039.5814  977.8077  970.7610  960.8286  995.4237  972.5953
1031.0542 1040.9421  996.3655 1022.1383  998.0796  888.3711 1008.7041  981.4500  943.6081
1064.3009  983.0746 1009.9346  946.4760  978.4714  997.2577 1041.7493  981.6758 1038.7610
 934.0396 1015.9674  909.2195  975.8332 1022.4339 1028.9248  986.3328  949.9608 1027.4207
1068.0071 1062.7421  991.4067  972.2559 1054.5471  938.9627 1022.3407 1047.2894 1028.4642
1008.2233 1048.7007  987.0517  989.8248 1006.4934  998.8702 1026.4783  893.4586 1015.7011
1011.0602  993.3329 1055.6904 1008.7458 1010.6308 1000.0748  991.0321 1057.5244 1030.3993
1040.4531  992.6614 1035.3172  994.2273  999.1923 1005.3602 1007.5996  969.8353 1005.5389
1087.2964

He needs to report this data to his seniors. Showing the entire dataset would be a mistake (the data above is quite small, but in real cases it could be extremely large).

How should he present this data to his seniors?

He started working on it and prepared a small summary sheet as follows:

Data count: 100

The purpose of the above sheet is to describe the data, because showing the entire dataset would not convey much information and would be difficult to understand. This is where descriptive statistics comes in.
As the name implies, descriptive statistics is used to describe data. The question is how to describe it. The summary above is only a very basic form of descriptive statistics. Let's delve deeper and explore the following questions:

  • How is the data distributed?
  • Where is the center of the data?
  • How spread out is the data?
  • Quartiles

1. How is the data distributed?

The answer to this question tells us which values occur in the dataset and how often each one occurs. Let's start with a simple example: we have some data and want to see its distribution.

> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> table(d)
d
10 20 30 40 50 60 
 3  7  5  1  2  1 

In the above example, we have 19 data values. The frequency table shows the distribution of the data (e.g. 20 occurs more often than any other value). However, telling someone that 20 appeared seven times is not informative unless you also tell them the size of the dataset.
To address this, the frequency count is replaced by the relative frequency, i.e. the proportion of times a particular value appears in the dataset, which can be read as an empirical probability. The following R script converts the frequency counts to probabilities.

> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> freq <- table(d)
> # divide each count by the total number of values
> for (i in seq_along(freq))
+ {
+ freq[[i]] <- freq[[i]]/length(d)
+ }
> freq
d
        10         20         30         40         50         60 
0.15789474 0.36842105 0.26315789 0.05263158 0.10526316 0.05263158 

Now you only need to show the probabilities, which communicate how often each value occurs relative to the entire dataset. Let's plot the distribution by running the plot(freq) command.
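The loop above can also be written without iterating. The following is a minimal sketch using base R's prop.table(), which divides every count by the total in one step. The variable names rel_freq, total and pct are only illustrative.

```r
d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)

# prop.table() divides each count by the sum of all counts
rel_freq <- prop.table(table(d))

# The proportions of a complete frequency table sum to 1
total <- sum(rel_freq)

# Percentages are often easier to read in a report
pct <- round(100 * rel_freq, 1)
print(pct)

# For a one-dimensional table, plot() draws one vertical line per value
plot(rel_freq, xlab = "Value", ylab = "Relative frequency")
```

Checking that the proportions sum to 1 is a quick sanity test that no value was dropped along the way.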

Tony used the same strategy to plot the distribution of his data. However, his dataset contains too many distinct values. Therefore, he divided the range of the data into bins (e.g. 850-900, 900-950, and so on) and counted the number of values in each bin. Then he plotted these counts. Such a plot is known as a histogram (figure given below).

Tip: When you have discrete data (only a few values that appear again and again), you can use a frequency distribution. If you have continuous data (too many distinct values in your dataset), you can plot a histogram instead.
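The binning step can be sketched in R as follows. To keep the example short, it uses only the first 18 of Tony's 100 readings, and the bin edges (850 to 1150 in steps of 50) are an assumption based on the bins mentioned above, not necessarily the ones Tony used.

```r
# First 18 of Tony's readings (ml); a subset, not the full dataset
w <- c(1022.8537, 1046.8098, 1025.9010, 1011.9844,  991.4790,  999.9071,
       1066.9561, 1006.6503, 1057.1011,  959.8809,  951.8257, 1039.4336,
        947.3567,  941.6532,  998.9110,  960.5717,  920.5042, 1034.3791)

# Assumed bin edges: 850, 900, ..., 1150
breaks <- seq(850, 1150, by = 50)

# cut() assigns each value to a bin; table() counts the values per bin
bin_counts <- table(cut(w, breaks = breaks))
print(bin_counts)

# hist() does the binning and the plotting in one call
h <- hist(w, breaks = breaks,
          main = "Weight of milk packets", xlab = "Weight (ml)")
```

If you leave out the breaks argument, hist() chooses the bins automatically, which is usually a reasonable starting point.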

The final plot generated for Tony's data is shown below. If the plot has a bell shape like this one, your data approximately follows a normal distribution. We will discuss this in detail in a later part of this series. For the time being, just keep in mind that bell-shaped data is likely to be normally distributed.

You can use the densityplot() function from the lattice package to plot the distribution of your data.

> library(lattice)
> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> densityplot(d)

2. Measures of Centrality

After an initial plot of the data distribution, the next question is where the center of the data lies. You might think of the average, but that is just one measure of centrality. There are others as well, which we will discuss in this section.

2.1 Mode

The mode is simply the value with the highest frequency in your data. For instance, in the following dataset, the value 20 has the highest frequency, so 20 is the mode. In R, there is no built-in function for computing the statistical mode (the base mode() function returns the storage mode of an object instead).

> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> table(d)
d
10 20 30 40 50 60 
 3  7  5  1  2  1 

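You can use the following script to compute the mode of your data.

> a <- c(20,30,35,30,40)
> uniqv <- unique(a)   # distinct values
> # count how often each distinct value occurs and pick the most frequent
> uniqv[which.max(tabulate(match(a,uniqv)))]
[1] 30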
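If you need the mode often, the script above can be wrapped in a function. The following is one possible sketch. get_modes is a hypothetical helper name, not part of base R, and returning all tied values is a design choice made here rather than something the script above does (which.max() keeps only the first).

```r
# Hypothetical helper: returns every value that reaches the highest
# frequency, so ties are reported instead of silently dropped
get_modes <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  uniqv <- unique(x)                     # distinct values, in order of appearance
  counts <- tabulate(match(x, uniqv))    # frequency of each distinct value
  uniqv[counts == max(counts)]
}

d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
mode_d <- get_modes(d)                   # a single mode
mode_tie <- get_modes(c(1, 1, 2, 2, 3))  # two values share the top count

print(mode_d)
print(mode_tie)
```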

2.2 Median

The median is the middle value of your ordered dataset. Let's say we have 20,30,35,30,40 and need to find the median. First, sort the data in increasing order (20,30,30,35,40) and then take the middle value, 30, as the median. If the number of values is even, take the average of the two middle values.
In R, you can simply use the median() function.

> a <- c(20,30,35,30,40)
> median(a)
[1] 30
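To make the odd and even cases concrete, here is a small sketch that computes the median by hand and compares it with median(). The function manual_median and the vector b (the data above with one extra made-up value, 50) are introduced only for illustration.

```r
a <- c(20,30,35,30,40)        # odd number of values (from the example above)
b <- c(20,30,35,30,40,50)     # even number of values (extra value is made up)

manual_median <- function(x) {
  s <- sort(x)                # order the data first
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                     # odd: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2      # even: average of the two middle values
  }
}

med_a <- manual_median(a)
med_b <- manual_median(b)
print(c(med_a, median(a)))
print(c(med_b, median(b)))
```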

2.3 Mean

The mean is the arithmetic average of the data: the sum of the data values divided by the number of values. In R, you can use the mean() function to compute the mean of your data.

> a <- c(20,30,35,30,40)
> mean(a)
[1] 31

Missing values: In R, missing values are represented by NA. If your data has missing values, specify na.rm=TRUE when calling mean() or median(). An example is given below.

> a <- c(20,30,35,NA,40)
> mean(a,na.rm=TRUE)
[1] 31.25
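The sketch below shows why the argument matters: without it, a single NA makes the whole result NA. The variable names are illustrative only.

```r
a <- c(20,30,35,NA,40)

mean_with_na <- mean(a)                    # NA: the missing value propagates
mean_clean   <- mean(a, na.rm = TRUE)      # NA is dropped before averaging
median_clean <- median(a, na.rm = TRUE)    # same argument works for median()
n_missing    <- sum(is.na(a))              # how many values are missing

print(mean_with_na)
print(mean_clean)
print(median_clean)
print(n_missing)
```

It is worth reporting the number of missing values alongside the summary, since dropping them silently can hide data quality problems.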

3. Variability of Data

Let's take an example. A person needs to cross a river. He is 173 cm tall, and the only information he has about the river is its average depth (170 cm). If he decides to cross on the basis of the average alone, it could be life-threatening (let's assume he cannot swim), because the average says nothing about how deep the deepest parts are. This is where variability comes in.

3.1 Standard Deviation

This measure describes the spread of the data around its mean. For the data $(X_1,X_2,X_3,\ldots,X_N)$, it is computed using the following formula:

$$ SD = \sqrt{\frac{\sum_{i=1}^{N}(\mu - X_i)^2}{N}} $$ here, $\mu$ is the mean of the data.

If your data is a sample drawn from a larger population, use the following formula instead, where $\mu$ is the mean of the sample: $$ SD = \sqrt{\frac{\sum_{i=1}^{N}(\mu - X_i)^2}{N-1}} $$

In R, you can simply use the sd() function to compute the standard deviation. Note that sd() uses the sample formula (the $N-1$ denominator).
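As a worked example, the sketch below computes both versions by hand for the small dataset used earlier and compares the sample version with sd(). The variable names are illustrative.

```r
a  <- c(20,30,35,30,40)
n  <- length(a)
mu <- mean(a)                              # 31

sq_dev <- (mu - a)^2                       # squared deviations: 121 1 16 1 81

pop_sd    <- sqrt(sum(sq_dev) / n)         # divide by N
sample_sd <- sqrt(sum(sq_dev) / (n - 1))   # divide by N - 1

print(pop_sd)
print(sample_sd)
print(sd(a))                               # matches sample_sd
```

The two values differ noticeably here because the dataset is tiny. For large N, the difference between dividing by N and by N - 1 becomes negligible.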

Example: We have two datasets, A and B. Both have the same average, but their spreads are different.

R code for generating the above diagram:

> library(lattice)
> A <- seq(30,50,.25)
> A <- c(A,20,60)
> B <- seq(22,58,.25)
> B <- c(B,20,60)
> stripplot(A)
> stripplot(B)
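The strip plots show the difference visually. A short sketch like the following confirms it numerically, using the same A and B as above.

```r
A <- c(seq(30, 50, 0.25), 20, 60)
B <- c(seq(22, 58, 0.25), 20, 60)

mean_A <- mean(A)
mean_B <- mean(B)
sd_A   <- sd(A)
sd_B   <- sd(B)

# Same center, different spread
print(c(mean_A = mean_A, mean_B = mean_B))
print(c(sd_A = sd_A, sd_B = sd_B))
```

Both means come out the same, while B has the larger standard deviation because its values are spread over a wider range.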

Variance: It is the square of the standard deviation and is also used to show data variability. R provides the var() function for computing it.
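A quick check of this relationship, as a sketch on the small dataset used earlier. Like sd(), var() uses the $N-1$ denominator.

```r
a <- c(20,30,35,30,40)

variance      <- var(a)      # sample variance
sd_squared    <- sd(a)^2     # square of the sample standard deviation

print(variance)
print(sd_squared)
```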

4. Quartiles

If you remember the median from the previous section, it divides the data into two halves. If you divide each of these halves further, you end up with four parts (quarters). Consider the following example, in which 30 is the median that divides the data into two halves. The medians of those halves are known as Quartile-1 and Quartile-3, and the median itself is Quartile-2.

You can use the summary() function in R to show the quartiles of your data.

> a <- c(10,14,17,20,25,30,34,37,40,42,44)
> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.00   18.50   30.00   28.45   38.50   44.00 

One important measure is the Inter-Quartile Range (IR), which describes the spread of the middle 50% of the data. $$IR = Q_3 - Q_1$$
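In R, the quartiles can be obtained with quantile() and the inter-quartile range with IQR(). The sketch below computes the range both ways for the data used above. The variable names are illustrative.

```r
a <- c(10,14,17,20,25,30,34,37,40,42,44)

# Quartile-1, Quartile-2 (median) and Quartile-3
q <- quantile(a, probs = c(0.25, 0.5, 0.75))
print(q)

ir_manual  <- unname(q["75%"] - q["25%"])   # Q3 - Q1 by hand
ir_builtin <- IQR(a)                        # built-in equivalent

print(ir_manual)
print(ir_builtin)
```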

A box plot is a very useful plot for showing quartile-related information. It looks like the one given below. It shows the minimum, maximum, median, quartiles, and inter-quartile range at a glance.
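A box plot can be drawn with the base R boxplot() function. The sketch below uses the same data as the quartile example. The title and the horizontal layout are arbitrary choices.

```r
a <- c(10,14,17,20,25,30,34,37,40,42,44)

# boxplot() draws the plot and invisibly returns the numbers behind it
bp <- boxplot(a, horizontal = TRUE, main = "Box plot of a")

# Lower whisker, lower hinge, median, upper hinge, upper whisker
five <- as.vector(bp$stats)
print(five)

# Points beyond the whiskers are listed separately as outliers
print(bp$out)
```

Note that the whiskers coincide with the minimum and maximum only when there are no outliers, as is the case here. By default, points further than 1.5 times the inter-quartile range from the box are drawn individually.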

To conclude, this post covered the basics of descriptive statistics, in particular frequency/probability distribution plotting, measures of centrality, data spread, and quartiles. In the next post, we will learn about the central limit theorem and hypothesis testing using the t-test.

Pankaj Chejara

Research Scholar, YouTuber, Blogger, Traveller

Tallinn, Estonia
