This post is **part 2** of the series **Statistics in R**. In the first part, we learned some basics of R. In this post, we will learn the concepts of **Descriptive Statistics** through a hands-on activity in R.

#### Descriptive Statistics?

Let's start with a simple example. Tony works in a milk plant, where he is responsible for recording the weight of every one-liter milk packet. The **packaging plant is not highly accurate**, so some packets are produced with a little less milk and some with a little more. He collected this data over a period of 10 days, and it looks like the following (weight in ml):

```
1022.8537 1046.8098 1025.9010 1011.9844  991.4790  999.9071 1066.9561 1006.6503 1057.1011
 959.8809  951.8257 1039.4336  947.3567  941.6532  998.9110  960.5717  920.5042 1034.3791
1008.4992  918.3915 1126.4705 1015.3833 1060.7138 1025.4741 1011.1905  958.6402 1001.0104
1056.7863 1014.9348 1020.6914 1039.5814  977.8077  970.7610  960.8286  995.4237  972.5953
1031.0542 1040.9421  996.3655 1022.1383  998.0796  888.3711 1008.7041  981.4500  943.6081
1064.3009  983.0746 1009.9346  946.4760  978.4714  997.2577 1041.7493  981.6758 1038.7610
 934.0396 1015.9674  909.2195  975.8332 1022.4339 1028.9248  986.3328  949.9608 1027.4207
1068.0071 1062.7421  991.4067  972.2559 1054.5471  938.9627 1022.3407 1047.2894 1028.4642
1008.2233 1048.7007  987.0517  989.8248 1006.4934  998.8702 1026.4783  893.4586 1015.7011
1011.0602  993.3329 1055.6904 1008.7458 1010.6308 1000.0748  991.0321 1057.5244 1030.3993
1040.4531  992.6614 1035.3172  994.2273  999.1923 1005.3602 1007.5996  969.8353 1005.5389
1087.2964
```

He needs to report this data to his seniors. Showing the raw data would be a mistake: even this small dataset is hard to read at a glance, and real datasets can be far larger.

So how should he present this data to senior management?

He started working on it and prepared a small summary sheet, as follows:

`Data count: 100`

`Minimum:888.3711`

`Maximum:1126.471`

The purpose of the above sheet is to describe the data, because showing the entire dataset would not convey much information and would be difficult to understand. This is where **Descriptive Statistics** comes in.
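A summary sheet like this can be produced with a few base R functions. The sketch below uses only the first nine of Tony's readings (the name `weights` is just an illustrative choice), so its numbers differ from the full-data sheet above.

```r
# First nine readings from Tony's data (weight in ml); an illustrative subset only
weights <- c(1022.8537, 1046.8098, 1025.9010, 1011.9844, 991.4790,
             999.9071, 1066.9561, 1006.6503, 1057.1011)

n_obs   <- length(weights)  # data count
lowest  <- min(weights)     # minimum
highest <- max(weights)     # maximum

cat("Data count:", n_obs, "\n")
cat("Minimum:", lowest, "\n")
cat("Maximum:", highest, "\n")
```

Running the same three functions on the full vector of 100 readings should reproduce the sheet above.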

As the name implies, descriptive statistics are used to describe data. The question is how to describe it. The summary above is only a very basic form of descriptive statistics. Let's delve deeper and explore the following questions.

- **How data is distributed?**
- **What is central data?**
- **Data spread**
- **Quartiles**

#### 1. How data is distributed?

The answer to this question provides information regarding values in the dataset and about their occurrence. Let's start with a simple example. We have some data and we want to see the distribution.

```
> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> table(d)
d
10 20 30 40 50 60 
 3  7  5  1  2  1 
```

In the above example, we had data of 19 items. The frequency table shows the distribution of the data (e.g. 20 occurs more often than any other value). However, telling someone that `20` appeared seven times in the dataset is not enough unless you also tell them the size of the dataset.

To simplify this, the frequency count is replaced by the probability (the fraction of times a particular value appears in the dataset, also called the relative frequency). The following R script converts the frequency counts to probabilities.

```
> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> freq <- table(d)
> # divide each count by the dataset size to get a relative frequency
> for (i in seq_along(freq))
+ {
+   freq[[i]] <- freq[[i]]/length(d)
+ }
> freq
d
        10         20         30         40         50         60 
0.15789474 0.36842105 0.26315789 0.05263158 0.10526316 0.05263158 
```
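The loop above can also be written in one step. As a minimal sketch, base R's `prop.table()` divides every count in a table by the total, which should give the same proportions (the name `rel_freq` is an illustrative choice):

```r
d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)

freq     <- table(d)         # frequency count of each distinct value
rel_freq <- prop.table(freq) # each count divided by the total number of items

round(rel_freq, 3)           # proportions rounded for display
sum(rel_freq)                # proportions always add up to 1
```

This avoids overwriting the original counts, so both `freq` and `rel_freq` stay available.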

So, now you just need to show the probabilities. They communicate the frequency of each data value relative to the entire dataset. Now let's plot the distribution by running the `plot(freq)` command.

Tony used the same strategy and plotted the distribution for his data. However, his dataset contains **too many distinct values**. Therefore, he **divided the range of his data into bins** (e.g. 850-900, 900-950, and so on) and counted the number of values in each bin. Then he plotted these counts. Such a plot is known as a `histogram` (figure given below).

Tip: When you have discrete data (meaning only a few values appear again and again), you can use a frequency distribution. If you have continuous data (meaning there are too many distinct values in your dataset), use a histogram instead.
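The binning step can be sketched in base R with `cut()` and `hist()`. The example below uses only the first nine of Tony's readings and an assumed bin width of 50 ml; the names `weights`, `breaks`, and `bin_counts` are illustrative.

```r
# First nine readings from Tony's data (weight in ml); an illustrative subset only
weights <- c(1022.8537, 1046.8098, 1025.9010, 1011.9844, 991.4790,
             999.9071, 1066.9561, 1006.6503, 1057.1011)

# Bin edges every 50 ml (an assumed width): 850, 900, ..., 1150
breaks <- seq(850, 1150, by = 50)

# Assign each reading to a bin, then count the readings per bin
bin_counts <- table(cut(weights, breaks = breaks))
bin_counts

# hist() does the same binning and draws the histogram
h <- hist(weights, breaks = breaks,
          main = "Milk packet weights (subset)", xlab = "Weight (ml)")
h$counts  # the per-bin counts used for the bars
```

Narrower bins show more detail but make the plot noisier, so the bin width is worth experimenting with.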

The final plot generated for Tony's data is the following one. If the plot has a bell shape like the one shown, your data approximately follows a `Normal Distribution`. We will discuss it in detail in a later part of this series. For the time being, just keep in mind that bell-shaped data is likely to be normally distributed.

You can use the `densityplot()` function from the `lattice` package to plot the distribution of your data.

```
> library(lattice)   # densityplot() is provided by the lattice package
> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> densityplot(d)
```
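`densityplot()` comes from the lattice package. If you prefer to stay with base R, a similar curve can be sketched with `density()`, which computes a kernel density estimate, followed by `plot()`. The name `dens` is an illustrative choice.

```r
d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)

dens <- density(d)                 # kernel density estimate with default settings
plot(dens, main = "Density of d")  # draw the estimated curve
```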

#### 2. Measure of Centrality

After an initial plot of the data distribution, the next question is: what is the center of the data? You might think of the average, but it is just one of the measures of centrality. There are others as well, which we will discuss in this section.

###### 2.1 Mode

This is simply the **value with the highest frequency** in your data. For instance, in the following dataset, the value 20 has the highest frequency. **Hence 20 is the mode**. In R, there is no built-in function for computing the statistical mode (the existing `mode()` function reports the storage type of an object instead).

```
> d <- c(20,30,20,30,10,20,20,30,10,40,50,50,60,30,20,10,30,20,20)
> table(d)
d
10 20 30 40 50 60 
 3  7  5  1  2  1 
```

You can use the following script to compute the mode of your data.

```
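> a <- c(20,30,35,30,40)
> uniqv <- unique(a)   # distinct values, in order of first appearance
> # count occurrences of each distinct value and pick the most frequent one
> uniqv[which.max(tabulate(match(a,uniqv)))]
[1] 30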
```
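The script above returns only the first value when several values tie for the highest frequency. As a sketch of one way to handle ties, the hypothetical helper `stat_mode()` below (not part of R) returns every value that reaches the maximum count. It assumes numeric input.

```r
# Hypothetical helper: returns all values that tie for the highest frequency
stat_mode <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # labels of the most frequent values
  as.numeric(top)                              # table labels are strings; convert back
}

stat_mode(c(20, 30, 35, 30, 40))  # one mode
stat_mode(c(10, 10, 20, 20, 30))  # two modes
```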

###### 2.2 Median

Median is the **middle data value** in your **ordered dataset**. Let's say we have `20,30,35,30,40` and we need to find the median. First, arrange the data in increasing order (`20,30,30,35,40`), then take the middle value, `30`, as the median. In case of **an even number** of data values, take the **average of the two middle values**.

In R, you can simply use the `median()` function.

```
> a <- c(20,30,35,30,40)
> median(a)
[1] 30
```
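The even-count rule can be checked by hand. The sketch below uses an illustrative four-value vector `b`, picks the two middle values after sorting, and compares their average with `median()`.

```r
b <- c(20, 35, 40, 30)   # an even number of values (illustrative data)

sorted_b <- sort(b)                               # 20 30 35 40
mid <- length(sorted_b) / 2                       # position of the lower middle value
manual_median <- mean(sorted_b[c(mid, mid + 1)])  # average of the two middle values

manual_median
median(b)                # built-in result for comparison
```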

###### 2.3 Mean

Mean is the arithmetic average of the data: the sum of the data values divided by the number of values. In R, you can use the `mean()` function to compute the mean of your data.

```
> a <- c(20,30,35,30,40)
> mean(a)
[1] 31
```

**Missing values:** In R, missing values are represented by `NA`. If your data has some missing values, specify `na.rm=TRUE` when using the `mean()` and `median()` functions. An example is given below.

```
> a <- c(20,30,35,NA,40)
> mean(a,na.rm=TRUE)
[1] 31.25
```
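To see why the argument matters, the short sketch below compares the results with and without it, and counts the missing values using `is.na()`.

```r
a <- c(20, 30, 35, NA, 40)

mean(a)                   # NA: a single missing value makes the result missing
median(a, na.rm = TRUE)   # median of the four remaining values
sum(is.na(a))             # number of missing values in the data
```

Counting the missing values first is a useful habit, since dropping many of them silently can change the summary noticeably.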

#### 3. Variability of Data

Let's take an example. A person needs to cross a river. The person is 173 cm tall. The only information he has about the river is its average depth (170 cm). If he decides to cross the river on the basis of the average alone, it could be life-threatening (let's assume he cannot swim), because the river may be much deeper than 170 cm in some places. This is where variability comes in.

###### 3.1 Standard Deviation

This measure describes the spread of the data around its mean. For the data $(X_1,X_2,X_3,\ldots,X_N)$, it is computed using the following formula:

$$ SD = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} $$ here, $\mu$ is the mean of the data.

If you have a sample of data, then you need to use the following formula, where $\bar{X}$ is the sample mean: $$ SD = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N-1}} $$

In R, you can simply use the function `sd()` to compute the standard deviation. Note that `sd()` uses the sample formula (denominator $N-1$).
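Both versions can be computed by hand as a check. In each case the deviations from the mean are squared and summed; the population version divides the sum by N and the sample version by N - 1. The sketch below uses illustrative variable names and compares the results with `sd()`.

```r
a <- c(20, 30, 35, 30, 40)
n <- length(a)

sq_dev <- (a - mean(a))^2                 # squared deviations from the mean

pop_sd    <- sqrt(sum(sq_dev) / n)        # population version: divide by N
sample_sd <- sqrt(sum(sq_dev) / (n - 1))  # sample version: divide by N - 1

pop_sd
sample_sd
sd(a)                                     # agrees with sample_sd
```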

**Example:** We have two datasets, A and B. Both have the same average, but their spreads are different.

R code for generating the above diagram:

```
> library(lattice)
> A <- seq(30,50,.25)
> A <- c(A,20,60)
> B <- seq(22,58,.25)
> B <- c(B,20,60)
> stripplot(A)
> stripplot(B)
```
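The same point can be checked numerically. As a short sketch, the code below rebuilds the two datasets and compares their means and standard deviations.

```r
A <- c(seq(30, 50, 0.25), 20, 60)
B <- c(seq(22, 58, 0.25), 20, 60)

c(mean_A = mean(A), mean_B = mean(B))  # both centers are the same
c(sd_A = sd(A), sd_B = sd(B))          # B is more spread out than A
```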

**Variance:** It is the square of the standard deviation. It is also used for showing data variability. R provides the `var()` function for computing variance.
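A quick sketch to confirm the relationship between the two functions on a small vector:

```r
a <- c(20, 30, 35, 30, 40)

var(a)                       # sample variance (denominator N - 1)
sd(a)^2                      # square of the standard deviation
all.equal(var(a), sd(a)^2)   # TRUE up to floating-point tolerance
```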

#### 4. Quartiles

If you remember the `median` from the previous section, it divides the data into two halves. If you divide each of these halves further, you end up with four parts (quarters). Let's see the following example. In the following data, `30` is the median, which divides the data into two halves. The medians of those halves are known as `Quartile-1` and `Quartile-3`. The median itself is `Quartile-2`.

You can use the `summary()` function in R to show the quartiles of your data.

```
> a <- c(10,14,17,20,25,30,34,37,40,42,44)
> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.00   18.50   30.00   28.45   38.50   44.00 
```

One important measure is the **Inter-Quartile Range (IQR)**, which describes the spread of the middle 50% of the data. $$IQR = Q_3 - Q_1$$
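As a sketch, the quartiles can be pulled out individually with `quantile()` and subtracted by hand, or the range can be obtained directly with the built-in `IQR()` function. The names `q` and `iqr_manual` are illustrative.

```r
a <- c(10,14,17,20,25,30,34,37,40,42,44)

q <- quantile(a, probs = c(0.25, 0.5, 0.75))  # Q1, median (Q2), Q3
q

iqr_manual <- unname(q[3] - q[1])             # Q3 - Q1 computed by hand
iqr_manual
IQR(a)                                        # built-in equivalent
```

Note that `quantile()` supports several interpolation rules through its `type` argument, so other software may report slightly different quartiles for the same data.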

A box plot is a very useful way to show quartile-related information. It looks like the one given below. It gives you information such as the minimum, maximum, median, quartiles, and inter-quartile range at a glance.
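A box plot can be drawn with the base R function `boxplot()`. The sketch below also keeps the values that the function returns, so the numbers behind the picture can be inspected. The plot title is an illustrative choice.

```r
a <- c(10,14,17,20,25,30,34,37,40,42,44)

# Draw the box plot; the returned list holds the statistics used for drawing
b <- boxplot(a, horizontal = TRUE, main = "Box plot of a")

# Lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
b$stats
```

When the data contains outliers, the whiskers stop at the most extreme non-outlying points and the outliers are drawn as separate dots, so the whisker ends are not always the minimum and maximum.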

To conclude, this post covered the basics of descriptive statistics, particularly frequency/probability distribution plotting, measures of centrality, data spread, and quartiles. In the next post, we will learn about the central limit theorem and hypothesis testing using the t-test.