# Statistics in R - Part-1 : Basics of R

This post is the first part of a series on **Statistics in R**. The series is mainly aimed at people who are completely new to programming and want to learn how to apply statistical methods in R.

In this part, we will cover **basics of R**, particularly

1. Data types in R

2. Data structures in R

3. Basic Operations in R

## Installation

First of all, you need to install R on your system. You can then download and install RStudio, an editor for working with R. The link to download RStudio is here.

### 1. Data types in R

Think about your dataset (for example, a dataset containing student records). **What kind of data does it hold?** It could be numbers (e.g. age, marks), characters (e.g. name, address), true/false values, and so on. To store these different kinds of data, R uses the concept of data types. R has five basic data types:

- Numeric
- Character
- Logical
- Complex
- Raw

Now, we will further explore these data types one by one.

#### 1.1 Numeric

Let's begin with `Numeric`. This data type is used to store numbers (e.g. integers and decimals). Let's say we want to store 10 in a variable `a` (what is a variable? It is a name for a location in memory). For this task, we use the assignment operator `<-`. On the left-hand side of `<-` we specify the name of the variable, and on the right-hand side we specify the value. The statements are shown in the following code.

```
> a<-10
> typeof(a)
[1] "double"
```

The first statement creates a variable and stores 10 in it. The second statement, `typeof(a)`, shows the data type of `a`, which is "double". This data type is for storing decimal numbers, and by default R stores numeric data as `double`. That means whole numbers are also stored as `double` unless you explicitly add `L` at the end of the number.

For example, instead of `a<-10` we write `a<-10L`.

```
> a<-10L
> typeof(a)
[1] "integer"
```

By default, all numbers are treated as `double` by R.
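As a quick check of the double-versus-integer distinction, the following sketch uses the base R functions `is.numeric()`, `is.integer()`, and `as.integer()`. The variable names `x`, `y`, and `z` are only illustrative.

```r
x <- 10      # stored as double by default
y <- 10L     # the L suffix makes it an integer

is.numeric(x)   # TRUE
is.numeric(y)   # TRUE: integers also count as numeric
is.integer(x)   # FALSE
is.integer(y)   # TRUE

# Convert a double to an integer explicitly
z <- as.integer(x)
typeof(z)       # "integer"
```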

#### 1.2 Character

Datasets usually contain textual information such as names and addresses. R treats this kind of data as the `character` data type. To store such data, we need to enclose the **data within single or double quotes** (e.g. "Ram" or 'Ram').

```
> name<-"Ram"
> typeof(name)
[1] "character"
```
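A few base R functions that work on character data are sketched below (`nchar()`, `toupper()`, `paste()`). The variable `age` is a made-up example to show that a number inside quotes is still character data.

```r
name <- "Ram"
nchar(name)            # 3, the number of characters
toupper(name)          # "RAM"
paste("Hello", name)   # "Hello Ram" (joined with a space by default)

# A number written inside quotes is stored as character, not numeric
age <- "25"
typeof(age)            # "character"
```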

#### 1.3 Logical

For this data type, think about a condition such as 10 < 15 (is it true or not? The answer is true). To store this kind of information, R uses the logical data type. It has two values: `TRUE` and `FALSE`.

When you run the following instruction, it will show `TRUE`.

```
> 10 < 15
[1] TRUE
```

```
> a <- TRUE
> typeof(a)
[1] "logical"
```
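Logical values can also be combined. The sketch below stores the result of a comparison in a variable (the name `is_smaller` is just an illustration) and applies the operators `!`, `&`, and `|`.

```r
is_smaller <- 10 < 15    # the result of a comparison is a logical value
is_smaller               # TRUE
!is_smaller              # FALSE (! negates)
is_smaller & (5 > 7)     # FALSE (& means "and")
is_smaller | (5 > 7)     # TRUE  (| means "or")
```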

#### 1.4 Complex

This data type allows us to store complex numbers. You will rarely encounter this type of data in your dataset, but it is still good to know about it. A complex number contains two parts, real and imaginary. It is written as `a+bi`, where a represents the real part and b represents the imaginary part.

```
> c <- 10 + 15i
> typeof(c)
[1] "complex"
```
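To pull the two parts of a complex number back out, base R provides `Re()` and `Im()`, and `Mod()` gives the modulus. A minimal sketch, using `z` as an illustrative variable name:

```r
z <- 10 + 15i
Re(z)    # 10, the real part
Im(z)    # 15, the imaginary part
Mod(z)   # the modulus, i.e. sqrt(10^2 + 15^2)
```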

#### 1.5 Raw

The raw type stores raw bytes. We will not discuss it in depth. The following is a simple example of this data type.

```
> r <- raw(10)
> typeof(r)
[1] "raw"
```
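For a slightly more concrete picture of raw data, the sketch below converts a character string to its bytes with `charToRaw()` and back with `rawToChar()`. The string "Ram" is reused from the character example purely for illustration.

```r
r <- raw(3)            # three bytes, all initialised to 00
r                      # [1] 00 00 00
b <- charToRaw("Ram")  # the bytes behind a character string
b                      # [1] 52 61 6d
typeof(b)              # "raw"
rawToChar(b)           # "Ram"
```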

### 2. Data Structure

In the above section, we discussed the basic data types used in R. This section explores **how data can be stored in R**. Let's understand it with an example. Say we want to store the salaries of 10 employees. One option is to create ten variables and store one salary in each.

```
employee1 <- 2000
employee2 <- 2500
...
...
employee10 <- 3000
```

This strategy is not feasible for real-life datasets. We need a way to store our data that also makes it easy to access and update. This is where data structures come in.

R offers the following data structures to store our data. We will cover them with the help of examples.

- Vector
- List
- Matrix
- Data frame
- nd-Array
- Factors

#### 2.1 Vector

A vector is used to store multiple values of the **same data type**. Usually, an attribute in a dataset contains data of the same type, so you can think of each attribute as being stored as a vector.

Let's begin with our example of employee salaries. In the following subsections, we will create a vector and access its items.

**Create a vector**

To create a vector, just pass the values to the `c()` function. You can check the number of elements stored in the vector using the `length()` function.

```
> salary <- c(100000,200000,150000)
> length(salary)
[1] 3
> typeof(salary)
[1] "double"
```

**Generate a vector**

You can also generate a vector containing a sequence of numbers, e.g. 1 to 100. For that, you can use the `:` operator or the `seq()` function. With the `:` operator, we specify the start and end of the sequence. With the `seq()` function, we can also specify the step size (the difference between two consecutive numbers).

```
> a <- 1:100
> length(a)
[1] 100
> b <- seq(1,100,1)
> length(b)
[1] 100
> c <- seq(1,100,2)
> length(c)
[1] 50
```
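`seq()` accepts more than a whole-number step. The following sketch shows a fractional step via the `by` argument, a fixed number of points via `length.out`, and two ways of counting down. The names `s1` to `s4` are illustrative.

```r
s1 <- seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
s2 <- seq(1, 10, length.out = 4)   # 1 4 7 10 (four evenly spaced values)
s3 <- 5:1                          # 5 4 3 2 1 (the : operator can count down)
s4 <- rev(1:5)                     # 5 4 3 2 1 (reverse an existing vector)
```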

**Access items from vector**

You can access any item of a vector using its integer index (the index is just the position of an item; in the following example, 10 is stored as the first item, so its index is 1). Indexing starts at 1, so the first item has index 1, the second has index 2, and so on. You access an item using `[]`. In the following example, we access the second, first, and fifth elements of the vector.

```
> a <- c(10,20,30,40,50)
> a[2]
[1] 20
> a[1]
[1] 10
> a[5]
[1] 50
```
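Indexing with `[]` is not limited to a single position. As a short sketch on the same vector, you can pass several positions, a range, a negative index, or a condition, and you can also assign to a position to update it.

```r
a <- c(10, 20, 30, 40, 50)
a[c(1, 3)]     # 10 30, several positions at once
a[2:4]         # 20 30 40, a range of positions
a[-1]          # 20 30 40 50, a negative index drops that position
a[a > 25]      # 30 40 50, keep the elements that satisfy a condition

a[2] <- 25     # replace the second element
a              # 10 25 30 40 50
```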

`Important Note`: **When you try to store numeric and character data in the same vector, R will allow it. However, R will convert all the elements to a single data type that can represent every one of them.**

```
> a <- c(10,20,"hello","English",TRUE)
> typeof(a)
[1] "character"
```

In the above code, we created a vector with three different types of data (numeric, character, logical). However, when we checked the type of the vector, it showed "character", which means all the data is stored as the `character` type.
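The same conversion happens with other mixtures, and it can be undone explicitly when the text really contains numbers. The sketch below is illustrative (the names `v1`, `v2`, and `v` are made up) and uses the base function `as.numeric()`.

```r
v1 <- c(1, TRUE, FALSE)   # logical values become numbers: 1 1 0
typeof(v1)                # "double"

v2 <- c(1, "a", TRUE)     # everything becomes character: "1" "a" "TRUE"
typeof(v2)                # "character"

# Numbers stored as text can be converted back before doing arithmetic
v <- c("10", "20", "30")
sum(as.numeric(v))        # 60
```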

#### 2.2 List

You can consider a `List` as a generic form of a `Vector`. It means you can **store different types of data in a List**.

Let's create a list with three different types of data.

```
> a <- list(10,20,"hello","English",TRUE)
> typeof(a)
[1] "list"
```

**Accessing the data**

It is a bit tricky. Using `[]` to access data from a list returns another list. Let's understand it with an example.

```
> a <- list(c(10,20),"hello",5000)
> b <- a[1]
> typeof(b)
[1] "list"
```

You can think of the above list as a container of three items: the first of length 2, and the second and third of length 1 each. When we use `[]`, it simply returns another list. **To access a particular item, use `[[]]`**.

```
> a[[1]]
[1] 10 20
> b <- a[[1]]
> typeof(b)
[1] "double"
```
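List items can also be given names, which makes them easier to retrieve than by position. The following is a small sketch with a hypothetical `student` list; `$` and `[["name"]]` both return the item itself rather than a list.

```r
student <- list(name = "Ram", marks = c(87, 78), passed = TRUE)

student$name          # "Ram"
student[["marks"]]    # 87 78
student$marks[2]      # 78, index into the vector stored in the list
names(student)        # "name" "marks" "passed"
length(student)       # 3
```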

#### 2.3 Factors

Let's assume your dataset has an attribute language that takes three different values across the entire dataset (English, Hindi, Estonian). This is a nominal (categorical) variable. Another example is education level (primary, secondary, bachelor, master, doctoral, etc.), which is an ordinal variable. Factors come in handy for storing such data. Let's see how to store categorical and ordinal data in R.

**Categorical data**

Let's say we have an attribute `language` which stores the mother tongue of five students (English, Estonian, Estonian, Hindi, English). First we create a vector of the data and then use the `factor` function to convert the vector into a factor.

```
> language <- c("English","Estonian","Estonian","Hindi","English")
> typeof(language)
[1] "character"
> # Converting above data into categorical
> lan2 <- factor(language)
```
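Once the data is a factor, you can inspect its categories. The sketch below repeats the conversion and then uses the base functions `levels()`, `nlevels()`, `table()`, and `as.integer()`. Internally, a factor stores an integer code for each value.

```r
language <- c("English", "Estonian", "Estonian", "Hindi", "English")
lan2 <- factor(language)

levels(lan2)       # "English" "Estonian" "Hindi"
nlevels(lan2)      # 3
table(lan2)        # counts per category: English 2, Estonian 2, Hindi 1
as.integer(lan2)   # 1 2 2 3 1, the integer code of each value
typeof(lan2)       # "integer"
```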

**Ordinal data**

When our categories have an order, the data is considered ordinal. For example, consider a student's contribution to a project coded as `low`, `medium`, and `high`. These categories have an inherent order, so such data is best stored as an ordered factor.

```
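contribution <- c("low","low","high","medium","medium")
# Give the levels explicitly; otherwise R orders them alphabetically
con2 <- ordered(contribution, levels = c("low","medium","high"))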
```
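The benefit of an ordered factor is that its values can be compared. The sketch below assumes the levels are supplied explicitly in the intended order (without the `levels` argument, R sorts them alphabetically, which would put `high` before `low`). The name `con` is illustrative.

```r
contribution <- c("low", "low", "high", "medium", "medium")
con <- factor(contribution, levels = c("low", "medium", "high"), ordered = TRUE)

levels(con)        # "low" "medium" "high"
con[3] > con[1]    # TRUE: "high" ranks above "low"
con >= "medium"    # FALSE FALSE TRUE TRUE TRUE
max(con)           # high
```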

#### 2.4 Matrix

A matrix is a two-dimensional array (think of data of the same type in tabular format). An example is given below (a matrix with three rows and three columns).

$$
\begin{bmatrix}
1 & 2 & 23 \\
19 & 5 & 6 \\
20 & 78 & 6
\end{bmatrix}
$$

**Create a matrix**

Let's create the same matrix as shown above. You can create a matrix using the `matrix` function, where we specify the number of rows and columns in `nrow` and `ncol`, respectively. By default, `matrix` fills the values column by column. You can check the dimensions (number of rows and columns) of a matrix using the `dim` function.

```
> m <- matrix(c(1,19,20,2,5,78,23,6,6),nrow=3,ncol=3)
> m
     [,1] [,2] [,3]
[1,]    1    2   23
[2,]   19    5    6
[3,]   20   78    6
> dim(m)
[1] 3 3
```
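Notice in the output that the values run down each column. If you would rather fill the matrix row by row, `matrix` has a `byrow` argument. A short sketch comparing the two layouts (the names `values`, `by_col`, and `by_row` are illustrative):

```r
values <- c(1, 19, 20, 2, 5, 78, 23, 6, 6)

by_col <- matrix(values, nrow = 3, ncol = 3)                # default: fill down the columns
by_row <- matrix(values, nrow = 3, ncol = 3, byrow = TRUE)  # fill across the rows

by_col[1, ]   # 1 2 23
by_row[1, ]   # 1 19 20

# One layout is the transpose of the other
identical(by_row, t(by_col))   # TRUE
```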

**Accessing elements from Matrix**

You can access elements of the matrix by specifying the row number and column number. Let's say we want to access the element in the first row and third column. We write `m[1,3]` to access it. If we want the whole first row, we skip the column number. Similarly, if we want only the third column, we skip the row number.

```
> m[1,3]
[1] 23
> m[1,]
[1] 1 2 23
> m[,3]
[1] 23 6 6
```
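Row and column indices can also be ranges, and an indexed element can be assigned a new value. The following sketch rebuilds the same matrix `m` so that it runs on its own.

```r
m <- matrix(c(1, 19, 20, 2, 5, 78, 23, 6, 6), nrow = 3, ncol = 3)

sub <- m[1:2, 2:3]   # rows 1-2 and columns 2-3, a 2 x 2 matrix
dim(sub)             # 2 2

m[2, 2] <- 50        # replace a single element
m[2, ]               # 19 50 6

nrow(m)              # 3
ncol(m)              # 3
```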

#### 2.5 Data frame

It is a very common data structure for storing datasets. It stores data as a list of equal-length vectors. What does that mean? Think about a dataset with three attributes (student name, age, marks) which contains the records of five students. Each attribute has five values, and the attributes can be of different data types (e.g. name as character type, age and marks as numeric type).

```
> names<- c("John","Ram","Pradeep","Linda","Shashi")
> age<- c(19,20,18,21,10)
> marks <-c(87,78,80,90,88)
> data <- data.frame(names,age,marks)
> data
    names age marks
1    John  19    87
2     Ram  20    78
3 Pradeep  18    80
4   Linda  21    90
5  Shashi  10    88
```
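After creating a data frame, it is worth checking its size and column types. The sketch below rebuilds the same data frame and uses the base functions `nrow()`, `ncol()`, `str()`, `head()`, and `summary()`.

```r
names <- c("John", "Ram", "Pradeep", "Linda", "Shashi")
age   <- c(19, 20, 18, 21, 10)
marks <- c(87, 78, 80, 90, 88)
data  <- data.frame(names, age, marks)

nrow(data)            # 5 rows, one per student
ncol(data)            # 3 columns, one per attribute
str(data)             # compact overview: column names, types, first values
head(data, 2)         # the first two rows
summary(data$marks)   # minimum, quartiles, mean, and maximum of marks
```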

**Changing the name of columns**

You can change the names of your attributes as follows.

```
> names(data)<-c("Student-Name",'Student-Age','Total-marks')
> data
  Student-Name Student-Age Total-marks
1         John          19          87
2          Ram          20          78
3      Pradeep          18          80
4        Linda          21          90
5       Shashi          10          88
```

**Accessing data**

You can access data in the same way as we did for the `Matrix` data structure, using `[row-index,column-index]`. Let's access the marks of the student `Linda`. If we look at the above data, the row number for `Linda` is 4 and the column number for marks is 3.

```
> data[4,3]
[1] 90
```

**Accessing column by name**

You can access the columns directly by specifying the column names.

```
> data['Student-Name']
  Student-Name
1         John
2          Ram
3      Pradeep
4        Linda
5       Shashi
```

**Accessing multiple columns**

If you want to access multiple columns from your data then **you need to specify their names as a vector**.

```
> data[c('Student-Name','Total-marks')]
  Student-Name Total-marks
1         John          87
2          Ram          78
3      Pradeep          80
4        Linda          90
5       Shashi          88
```
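Selecting with `[]` and a name returns a data frame. To get a column as a plain vector, you can use `[[]]` or `$`, and a condition on a column can be used to pick rows. The sketch below rebuilds the renamed data frame so it is self-contained. Because the column names contain a hyphen, `$` needs backticks around the name.

```r
data <- data.frame(
  names = c("John", "Ram", "Pradeep", "Linda", "Shashi"),
  age   = c(19, 20, 18, 21, 10),
  marks = c(87, 78, 80, 90, 88)
)
names(data) <- c("Student-Name", "Student-Age", "Total-marks")

data[["Total-marks"]]   # 87 78 80 90 88, a plain numeric vector
data$`Total-marks`      # the same vector; backticks are needed for the hyphen

# Rows where the marks are above 85 (John, Linda, Shashi)
top <- data[data[["Total-marks"]] > 85, ]

# Marks of one student, looked up by name instead of row number
linda_marks <- data[data[["Student-Name"]] == "Linda", "Total-marks"]   # 90
```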

### 3. Basic Mathematical Operations

In this section, we will discuss some basic mathematical operations in R. The following table lists the operations that will be needed in the next part of the series.

Operation | Syntax |
---|---|
Addition | `10+25` |
Subtraction | `10-25` |
Multiplication | `10*25` |
Division | `10/5` |
Exponential (e.g. `e^2`) | `exp(2)` |
Log (base 10) (e.g. `log_10(10)`) | `log(10, base=10)` |
Log (base e) (e.g. `log_e(10)`) | `log(10,base=exp(1))` |
Square root (e.g. `\sqrt{25}`) | `sqrt(25)` |
Exponent (e.g. `3^3`) | `3^3` |
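
For reference, this is what the operations in the table evaluate to when typed into R. The sketch also includes two base R shorthands that are not in the table: `log10()` for base 10 and `log()` with no `base` argument for the natural logarithm.

```r
10 + 25               # 35
10 - 25               # -15
10 * 25               # 250
10 / 5                # 2
exp(2)                # e^2, about 7.389056
log(10, base = 10)    # 1
log10(10)             # 1, shorthand for base 10
log(10)               # natural log (base e), about 2.302585
sqrt(25)              # 5
3^3                   # 27
```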

#### 3.1 Operations on Vector

We can also perform mathematical operations on numeric vectors. In such cases, the operation is applied to each element of the vector.

```
> a <- c(10,20,30,40,50)
> a + 2
[1] 12 22 32 42 52
> sqrt(a)
[1] 3.162278 4.472136 5.477226 6.324555 7.071068
> log(a)
[1] 2.302585 2.995732 3.401197 3.688879 3.912023
```
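The same element-by-element behaviour applies when both operands are vectors of the same length: the first elements are combined, then the second elements, and so on. A brief sketch with a second, made-up vector `b`:

```r
a <- c(10, 20, 30, 40, 50)
b <- c(1, 2, 3, 4, 5)

a + b      # 11 22 33 44 55
a * b      # 10 40 90 160 250
a / 10     # 1 2 3 4 5
a > 25     # FALSE FALSE TRUE TRUE TRUE (comparisons are element-wise too)
```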

**Some useful functions**

Here, we will see some useful functions, e.g. for sorting, finding the minimum and maximum, and summing all the elements of a vector.

```
> a <- c(20,10,60,30,40,50)
> min(a)
[1] 10
> max(a)
[1] 60
> sum(a)
[1] 210
> sort(a)
[1] 10 20 30 40 50 60
```
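A few more base R functions in the same spirit are sketched below on the same vector: `mean()`, `median()`, `range()`, sorting in decreasing order, and finding positions with `which.max()` and `order()`.

```r
a <- c(20, 10, 60, 30, 40, 50)

mean(a)                      # 35
median(a)                    # 35
range(a)                     # 10 60, the minimum and maximum together
sort(a, decreasing = TRUE)   # 60 50 40 30 20 10
which.max(a)                 # 3, the position of the largest value
order(a)                     # 2 1 4 5 6 3, the positions in sorted order
```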