Statistics in R - Part-1 : Basics of R

This post is the first part of a series on Statistics in R. The series is mainly aimed at people who are completely new to programming and want to learn how to apply statistical methods in R.

In this part, we will cover the basics of R, in particular:
1. Data types in R
2. Data structures in R
3. Basic Operations in R


First of all, you need to install R on your system. You can then also install RStudio, an editor for R that is downloaded and installed separately from R itself.

1. Data types in R

Just think about your dataset (or a dataset containing records of students). What kind of data does it have? It could contain numbers (e.g. age, marks), characters (e.g. name, address), true/false values, etc. To store these different kinds of data, R uses the concept of data types. R has five basic data types:

  • Numeric
  • Character
  • Logical
  • Complex
  • Raw

Now, we will further explore these data types one by one.

1.1 Numeric

Let's begin with Numeric. This data type is used to store numbers (e.g. integers and decimals). Let's say we want to store 10 in a variable a (What is a variable? It is a name that refers to a location in memory). For this task, we will use the <- assignment operator. On the left-hand side of <-, we specify the name of the variable, and on the right-hand side, we specify the value. The statement is given in the following code.

> a <- 10
> typeof(a)
[1] "double"

The above code creates a variable and stores 10 in it. When you execute the second statement, typeof(a), it shows the data type of a, which is "double". This data type is meant for decimal numbers, and R stores numeric data as double by default. That means whole numbers are also stored as double unless you explicitly add L at the end of the number.
For example, instead of a<-10 we will write a<-10L.

> a <- 10L
> typeof(a)
[1] "integer"

By default all numbers are treated as double by R.
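
The difference between double and integer can be checked with a few base R functions. The following is a minimal sketch; the variable names x, y and z are only illustrative.

```r
# Illustrative variables (not part of the examples above)
x <- 10     # stored as double
y <- 10L    # stored as integer

is.numeric(x)      # TRUE: double counts as numeric
is.numeric(y)      # TRUE: integer also counts as numeric
x == y             # TRUE: the values are equal
identical(x, y)    # FALSE: the storage types differ

# as.integer() converts a double to an integer and drops the fraction
z <- as.integer(10.7)
typeof(z)          # "integer"
z                  # 10
```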

1.2 Character

Datasets usually contain textual information, e.g. address, name, etc. R treats this kind of data as the character data type. To store such data, we need to enclose it within single or double quotes (e.g. "Ram" or 'Ram').

> name <- "Ram"
> typeof(name)
[1] "character"
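
As a small sketch of working with character data, the base R functions nchar() and paste() can be used as follows. The variable greeting is only illustrative.

```r
name <- "Ram"

# Single and double quotes create the same value
identical("Ram", 'Ram')            # TRUE

# Number of characters in the string
nchar(name)                        # 3

# Join strings together (separated by a space by default)
greeting <- paste("Hello", name)   # "Hello Ram"

# A number written inside quotes is character data, not numeric
typeof("10")                       # "character"
```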

1.3 Logical

For this data type, think about a condition, e.g. 10 < 15 (is it true or not? The answer is true). To store this kind of information, R uses the logical data type. It has two values: TRUE and FALSE.
When you run the first instruction below, it will show TRUE.

> 10 < 15
[1] TRUE
> a <- TRUE
> typeof(a)
[1] "logical"
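
Logical values usually come from comparisons and can be combined with the operators & (and), | (or) and ! (not). A minimal sketch, with illustrative variable names:

```r
age <- 21

is_adult <- age >= 18                # TRUE
is_teen  <- age >= 13 & age <= 19    # FALSE: the second condition fails

!is_adult                            # FALSE
is_adult | is_teen                   # TRUE: at least one is TRUE

# In arithmetic, TRUE counts as 1 and FALSE as 0
sum(c(TRUE, FALSE, TRUE))            # 2
```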

1.4 Complex

This data type allows us to store complex numbers. You will rarely encounter this type of data in your dataset, but it is still good to know about it. A complex number contains two parts, real and imaginary. It is written as a+bi, where a is the real part and b is the imaginary part.

> c <- 10 + 15i
> typeof(c)
[1] "complex"
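
The two parts of a complex number can be extracted with the base R functions Re() and Im(). The sketch below uses the illustrative name z rather than c, to avoid confusion with the c() function used later.

```r
z <- 10 + 15i

Re(z)     # 10: the real part
Im(z)     # 15: the imaginary part
Mod(z)    # the modulus, sqrt(10^2 + 15^2)
Conj(z)   # 10-15i: the complex conjugate
```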

1.5 Raw

We will not discuss it in depth. The raw data type stores raw bytes. Following is a simple example of this data type:

> r <- raw(10)
> typeof(r)
[1] "raw"
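
To give a rough idea of what raw bytes look like, the sketch below converts a string to its bytes and back using the base R functions charToRaw() and rawToChar(). The variable b is only illustrative.

```r
# raw(3) creates three bytes, all initialised to 00
r <- raw(3)
length(r)            # 3

# Convert a string to its bytes (printed in hexadecimal: 52 61 6d)
b <- charToRaw("Ram")
typeof(b)            # "raw"

# The same bytes as integers
as.integer(b)        # 82 97 109

# Convert the bytes back to a string
rawToChar(b)         # "Ram"
```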

2. Data structures in R

In the above section, we discussed the basic data types used in R. This section explores how data can be stored in R. Let's understand it with an example. Say we want to store the salary of 10 employees. One option would be to create ten variables and store one salary in each of them.

employee1 <- 2000  
employee2 <- 2500  
employee10 <- 3000  

The above strategy is not feasible for a real-life dataset. We need a way to store all of the data together that also makes it easy to access and update. This is where data structures come in.

R offers the following data structures to store our data. We will cover each of them, except the n-dimensional array, with the help of an example.

  • Vector
  • List
  • Matrix
  • Data frame
  • nd-Array
  • Factors

2.1 Vector

A vector is used to store multiple values of the same data type. Usually, an attribute (column) in a dataset contains data of the same type. In that sense, you can think of each attribute as being stored as a vector.

Let's begin with our example of the employee's salary. In the following subsection, we will create a vector and access it.

Create a vector
To create a vector, just pass the values to the c() function. You can check the number of elements stored in the vector using the length() function.

> salary <- c(100000,200000,150000)
> length(salary)
[1] 3
> typeof(salary)
[1] "double"

Generate a vector
You can also generate a vector containing a sequence of numbers, e.g. 1 to 100. For that, you can use the : operator or the seq() function. With the : operator, we specify the start and end of the sequence. With the seq() function, we can also specify the step size (the difference between two consecutive numbers), which is 1 if left out.

> a <- 1:100
> length(a)
[1] 100
> b <- seq(1,100,1)
> length(b)
[1] 100
> c <- seq(1,100,2)
> length(c)
[1] 50
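
The seq() function also accepts named arguments. The sketch below shows a few variations using the by and length.out arguments, together with the related base R function seq_len(). The variable names are only illustrative.

```r
# A fractional step size
quarters <- seq(0, 1, by = 0.25)            # 0.00 0.25 0.50 0.75 1.00

# Ask for a number of points instead of a step size
four_points <- seq(1, 10, length.out = 4)   # 1 4 7 10

# The : operator can also count down
countdown <- 5:1                            # 5 4 3 2 1

# seq_len(n) generates 1, 2, ..., n
seq_len(5)                                  # 1 2 3 4 5
```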

Access items from vector
You can access any item of a vector using an integer index. An index is simply the position of an item: in the following example, 10 is stored as the first item, so its index is 1. Indexing in R starts at 1. You access an item by writing the index inside []. In the following example, we access the second, first, and fifth elements of the vector.

> a <- c(10,20,30,40,50)
> a[2]
[1] 20
> a[1]
[1] 10
> a[5]
[1] 50
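
Besides a single index, [] also accepts a vector of indices, negative indices, and a logical condition. The same notation can be used to update an item. A minimal sketch:

```r
a <- c(10, 20, 30, 40, 50)

first_and_third <- a[c(1, 3)]   # 10 30: several items at once
a[2:4]                          # 20 30 40: a range of positions
a[-1]                           # 20 30 40 50: everything except the first item
a[a > 25]                       # 30 40 50: items that satisfy a condition

# Update the second item
a[2] <- 25
a                               # 10 25 30 40 50
```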

Important Note: If you try to store numeric and character data in the same vector, R will allow it. However, R will convert all the elements to a single data type that can represent every one of them.

> a <- c(10,20,"hello","English",TRUE)
> typeof(a)
[1] "character"

In the above code, we have created a vector with three different types of data (numeric, character, logical). However, when we checked the type of vector, it showed "character" which means all data is stored as character type.
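
This conversion follows a fixed order: logical is converted to integer, integer to double, and anything to character when a character value is present. The sketch below illustrates this, and shows that as.numeric() can convert a number stored as text back to numeric. The variable mixed is only illustrative.

```r
typeof(c(TRUE, 10L))       # "integer": TRUE becomes 1L
typeof(c(10L, 2.5))        # "double"
typeof(c(2.5, "hello"))    # "character"

# Every element is converted to text
mixed <- c(10, "hello", TRUE)
mixed                      # "10" "hello" "TRUE"

# Convert a number stored as text back to numeric
as.numeric(mixed[1]) + 5   # 15
```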

2.2 List

You can think of a list as a more general form of a vector: a list can store different types of data together.

Let's create a list with three different types of data.

> a <- list(10,20,"hello","English",TRUE)
> typeof(a)
[1] "list"

Accessing the data
It is a bit tricky. Here, the use of [] to access data from the list will return a list itself. Let's understand it with an example.

> a <- list(c(10,20),"hello",5000)
> b <- a[1]
> typeof(b)
[1] "list"

You can think of the above list as a container of three items: the first of length 2, and the second and third of length 1 each. When we use [], it simply returns another list. To access a particular item itself, use [[]].

> a[[1]]
[1] 10 20
> b <- a[[1]]
> typeof(b)
[1] "double"
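
List items can also be given names, which makes them easier to access with the $ operator. The following is a small sketch; the list student and its item names are hypothetical.

```r
# A list whose items have names
student <- list(name = "Ram", marks = c(87, 78), passed = TRUE)

student$name          # "Ram": access an item by name
student[["marks"]]    # 87 78: the same idea using [[ ]]
student$marks[2]      # 78: second element of the marks vector

names(student)        # "name" "marks" "passed"
length(student)       # 3
```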

2.3 Factors

Let's assume your dataset has an attribute language that takes three different values across the entire dataset (English, Hindi, Estonian). This is a nominal (categorical) variable. Another example is education level, which is an ordinal variable, e.g. primary, secondary, bachelor, master, doctoral, etc. Factors come in handy for storing such data. Let's see how to store categorical and ordinal data in R.

Categorical data
Let's say we have an attribute language which stores the native language of five students (English, Estonian, Estonian, Hindi, English). First we create a vector of the data and then use the factor function to convert the vector into a factor.

> language <- c("English","Estonian","Estonian","Hindi","English")
> typeof(language)
[1] "character"
> # Converting above data into categorical
> lan2 <- factor(language)
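
Once the data is stored as a factor, its categories (called levels) can be inspected with base R functions such as levels() and table(). A minimal sketch:

```r
language <- c("English", "Estonian", "Estonian", "Hindi", "English")
lan2 <- factor(language)

levels(lan2)       # "English" "Estonian" "Hindi": the distinct categories
nlevels(lan2)      # 3
table(lan2)        # counts per category: English 2, Estonian 2, Hindi 1

# Internally, a factor stores an integer code for each value
as.integer(lan2)   # 1 2 2 3 1
typeof(lan2)       # "integer"
```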

Ordinal data
When the categories have an order, the data is considered ordinal. For example, consider students' contributions to a project coded as low, medium, and high. These categories have an inherent order, so such data is best stored as an ordered factor.

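> contribution <- c("low","low","high","medium","medium")
> # Give the levels explicitly; otherwise R orders them alphabetically
> con2 <- ordered(contribution, levels=c("low","medium","high"))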
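
One thing to watch out for: when no levels are given, ordered() arranges the levels alphabetically, which is rarely the intended order. The sketch below compares the default with an explicitly specified order, using the levels argument of the base R factor() function. The variable names default_order and con are only illustrative.

```r
contribution <- c("low", "low", "high", "medium", "medium")

# Without explicit levels, the order is alphabetical
default_order <- ordered(contribution)
levels(default_order)   # "high" "low" "medium"

# Specify the intended order from lowest to highest
con <- factor(contribution, levels = c("low", "medium", "high"), ordered = TRUE)
levels(con)             # "low" "medium" "high"

# Ordered factors can be compared
con[3] > con[1]         # TRUE: high is greater than low
max(con)                # high
```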

2.4 Matrix

A matrix is a two-dimensional array (think of data of the same type in tabular format). An example with three rows and three columns is given below.

1  19 20
2  5  78
23 6  6

Create a matrix
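Let's create a matrix from the values shown above. You can create a matrix using the matrix function, where the number of rows and columns is specified with nrow and ncol, respectively. Note that R fills a matrix column by column by default, so the values below end up arranged as the transpose of the table above; passing byrow=TRUE fills the matrix row by row instead. You can check the dimensions (number of rows and columns) of a matrix using the dim function.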

> m <- matrix(c(1,19,20,2,5,78,23,6,6),nrow=3,ncol=3)
> m
     [,1] [,2] [,3]
[1,]    1    2   23
[2,]   19    5    6
[3,]   20   78    6
> dim(m)
[1] 3 3
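
The order in which matrix() places the values matters. The sketch below compares the default column-wise filling with row-wise filling through the byrow argument. The variable names are only illustrative.

```r
values <- c(1, 19, 20, 2, 5, 78, 23, 6, 6)

# Default: values are filled column by column
by_col <- matrix(values, nrow = 3, ncol = 3)
by_col[1, ]                   # 1 2 23

# byrow = TRUE fills the matrix row by row
by_row <- matrix(values, nrow = 3, ncol = 3, byrow = TRUE)
by_row[1, ]                   # 1 19 20

# One is the transpose of the other
all(by_row == t(by_col))      # TRUE
```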

Accessing elements from Matrix
You can access elements of a matrix by specifying the row number and column number. Let's say we want to access the element in the first row and third column. We will write m[1,3] to access it. If we want the whole first row, we leave out the column number. Similarly, if we want only the third column, we leave out the row number.

> m[1,3]
[1] 23
> m[1,]
[1]  1  2 23
> m[,3]
[1] 23  6  6
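
The same notation can select several rows and columns at once, or select elements by a condition. A minimal sketch, using the matrix m created above; the names sub and large are only illustrative.

```r
m <- matrix(c(1, 19, 20, 2, 5, 78, 23, 6, 6), nrow = 3, ncol = 3)

# Rows 1 to 2 and columns 1 and 3: a 2 x 2 submatrix
sub <- m[1:2, c(1, 3)]
dim(sub)          # 2 2

# Elements greater than 10, returned column by column
large <- m[m > 10]
large             # 19 20 78 23

# Update a single element
m[2, 2] <- 0
m[2, ]            # 19 0 6
```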

2.5 Data frame

It is a very common data structure for storing datasets. A data frame stores data as a list of equal-length vectors. What does that mean? Think about a dataset with three attributes (student name, age, marks) that contains records of five students. Each attribute has five values, and the attributes can be of different data types (e.g. name as character, age as numeric, and marks as numeric).

> names <- c("John","Ram","Pradeep","Linda","Shashi")
> age <- c(19,20,18,21,10)
> marks <- c(87,78,80,90,88)
> data <- data.frame(names,age,marks)
> data
    names age marks
1    John  19    87  
2     Ram  20    78  
3 Pradeep  18    80  
4   Linda  21    90  
5  Shashi  10    88  

Changing the name of columns
You can change the names of your attributes (columns). The data frame below shows the result.
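
The renaming statement itself is not shown here. One way to obtain the column names shown below, assuming the base R colnames() function is used, is sketched here:

```r
names <- c("John", "Ram", "Pradeep", "Linda", "Shashi")
age <- c(19, 20, 18, 21, 10)
marks <- c(87, 78, 80, 90, 88)
data <- data.frame(names, age, marks)

# Assign new column names, in the same order as the existing columns
colnames(data) <- c("Student-Name", "Student-Age", "Total-marks")
colnames(data)    # "Student-Name" "Student-Age" "Total-marks"
```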

> data
  Student-Name Student-Age Total-marks
1         John          19          87  
2          Ram          20          78  
3      Pradeep          18          80  
4        Linda          21          90  
5       Shashi          10          88

Accessing data
You can access data in the same way as we did for the matrix data structure, using [row-index,column-index]. Let's access the marks of the student Linda. In the data above, Linda is in row 4 and the marks are in column 3.

> data[4,3]
[1] 90

Accessing column by name
You can access the columns directly by specifying the column names.

> data['Student-Name']
  Student-Name
1         John
2          Ram
3      Pradeep
4        Linda
5       Shashi

Accessing multiple columns
If you want to access multiple columns from your data then you need to specify their names as a vector.

> data[c('Student-Name','Total-marks')]
  Student-Name Total-marks
1         John          87  
2          Ram          78  
3      Pradeep          80  
4        Linda          90  
5       Shashi          88  
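
Two other common ways to access data in a data frame are the $ operator, which returns a column as a vector, and a condition in the row position, which selects matching rows. The sketch below uses a hypothetical data frame students with simple column names, so that $ works without extra quoting.

```r
# Hypothetical data frame with the same values as above
students <- data.frame(
  name  = c("John", "Ram", "Pradeep", "Linda", "Shashi"),
  age   = c(19, 20, 18, 21, 10),
  marks = c(87, 78, 80, 90, 88)
)

# A single column as a vector
students$marks                      # 87 78 80 90 88

# Rows where marks are above 85 (John, Linda and Shashi)
high_scorers <- students[students$marks > 85, ]
nrow(high_scorers)                  # 3

# Marks of a particular student, selected by name
linda_marks <- students[students$name == "Linda", "marks"]
linda_marks                         # 90
```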

3. Basic Mathematical Operations

In this section, we will discuss some of the basic mathematical operations in R. The following table lists the basic operations that will be needed in the next part of the series.

Operation                         Syntax
Addition                          10+25
Subtraction                       10-25
Multiplication                    10*25
Division                          10/5
Exponential (e.g. e^2)            exp(2)
Log, base 10 (e.g. log_10(10))    log(10, base=10)
Log, base e (e.g. log_e(10))      log(10, base=exp(1))
Square root (e.g. sqrt(25))       sqrt(25)
Exponent (e.g. 3^3)               3^3
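
A few related base R shortcuts are worth knowing: log() uses base e when no base is given, log10() and log2() are shorthand for the common bases, and %/% and %% give the integer quotient and the remainder of a division. A minimal sketch:

```r
# log() defaults to the natural logarithm (base e)
log(10)                  # same value as log(10, base = exp(1))

# Shorthand functions for base 10 and base 2
log10(100)               # 2
log2(8)                  # 3

# exp() and log() undo each other
exp(log(10))             # 10

# Integer division and remainder
17 %/% 5                 # 3
17 %% 5                  # 2
```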

3.1 Operations on Vector

We can also perform mathematical operations on numeric vectors. In such cases, the operation is performed on each element of the vector.

> a <- c(10,20,30,40,50)
> a + 2
[1] 12 22 32 42 52
> sqrt(a)
[1] 3.162278 4.472136 5.477226 6.324555 7.071068
> log(a)
[1] 2.302585 2.995732 3.401197 3.688879 3.912023
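
Operations between two vectors of the same length work in the same element-by-element way: the first element is combined with the first, the second with the second, and so on. A minimal sketch, where b is an illustrative second vector:

```r
a <- c(10, 20, 30, 40, 50)
b <- c(1, 2, 3, 4, 5)

a + b     # 11 22 33 44 55
a * b     # 10 40 90 160 250
a / 10    # 1 2 3 4 5

# Comparisons are also applied to each element
a > 25    # FALSE FALSE TRUE TRUE TRUE
```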

Some useful functions
Here, we will see some useful functions, e.g. for sorting, finding the minimum and maximum, and summing all elements of a vector.

> a <- c(20,10,60,30,40,50)
> min(a)
[1] 10
> max(a)
[1] 60
> sum(a)
[1] 210
> sort(a)
[1] 10 20 30 40 50 60
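
A few more base R functions in the same spirit are sketched below: mean() for the average, sort() with decreasing = TRUE for descending order, order() for the positions that would sort the vector, and which.max() for the position of the largest element.

```r
a <- c(20, 10, 60, 30, 40, 50)

mean(a)                      # 35
range(a)                     # 10 60: minimum and maximum together
sort(a, decreasing = TRUE)   # 60 50 40 30 20 10

# Positions of the elements in sorted order
order(a)                     # 2 1 4 5 6 3

# Position of the largest element
which.max(a)                 # 3
```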

Pankaj Chejara

Research Scholar, YouTuber, Blogger, Traveller

Tallinn, Estonia
