This post is the first part of a series on Statistics in R. The series is mainly aimed at people who are completely new to programming and want to learn how to apply statistical methods in R.
In this part, we will cover the basics of R, in particular:
1. Data types in R
2. Data structures in R
3. Basic Operations in R
First of all, you need to install R on your system. You can then download and install RStudio, an editor that makes working with R easier. The link to download RStudio is here.
1. Data types in R
Just think about your dataset (or a dataset containing records of students). What kind of data does it have? It could be numbers (e.g. age, marks), characters (e.g. name, address), true/false values, etc. To store these different kinds of data, R uses the concept of data types. R has five basic data types:
- Numeric
- Character
- Logical
- Complex
- Raw
Now, we will explore these data types one by one.
Let's begin with Numeric. This data type is used to store numbers (e.g. integers and decimals). Let's say we want to store 10 in a variable `a` (What is a variable? It is a name for a location in memory). For this task, we will use the `<-` assignment operator. On the left-hand side of `<-`, we specify the name of the variable, and on the right-hand side, we specify the value. The statement is given in the following code.
    > a <- 10
    > typeof(a)
    [1] "double"
The above code creates a variable and stores 10 in it. When you execute the second statement, `typeof(a)`, it shows the data type of `a`, which is "double". This data type is meant for decimal numbers, and by default R stores all numeric data as double. That means your integer data is also stored as double unless you explicitly add `L` at the end of the integer.
For example, instead of `a <- 10`, we will write
    > a <- 10L
    > typeof(a)
    [1] "integer"
By default, all numbers are treated as double.
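To make the difference between the two numeric types concrete, here is a minimal sketch. The variable names `x` and `y` are only illustrative; `is.integer()` and `is.double()` are base R functions.

```r
x <- 10    # stored as double by default
y <- 10L   # the L suffix asks R to store an integer

typeof(x)       # "double"
typeof(y)       # "integer"
is.double(x)    # TRUE
is.integer(y)   # TRUE
x == y          # TRUE: the values are equal even though the types differ
```

In everyday analysis the distinction rarely matters, because R converts between the two types automatically in arithmetic.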
Datasets usually contain information such as names and addresses. R treats this kind of data as the `character` data type. To store such data, we enclose it in single or double quotes (e.g. "Ram" or 'Ram').
    > name <- "Ram"
    > typeof(name)
    [1] "character"
For the logical data type, think about a condition, e.g. 10 < 15 (is it true or not? The answer is true). To store this kind of information, R uses the logical data type. It has two values: `TRUE` and `FALSE`.
When you run the following instruction, it will show `TRUE`.
    > 10 < 15
    [1] TRUE
    > a <- TRUE
    > typeof(a)
    [1] "logical"
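Logical values can also be combined and negated. The short sketch below uses the base R operators `!` (not), `&` (and), and `|` (or); the variable name `is_smaller` is made up for illustration.

```r
is_smaller <- 10 < 15    # the comparison itself produces a logical value
typeof(is_smaller)       # "logical"

!is_smaller              # FALSE: negation
is_smaller & (20 < 5)    # FALSE: both sides must be TRUE
is_smaller | (20 < 5)    # TRUE: one TRUE side is enough
```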
The complex data type allows us to store complex numbers. You will rarely encounter this type of data in your dataset, but it is still good to know about it. A complex number contains two parts, real and imaginary. It is written as `a+bi`, where a is the real part and b is the imaginary part.
    > c <- 10 + 15i
    > typeof(c)
    [1] "complex"
The last basic data type is raw, which stores raw bytes. We will not discuss it in depth. The following is a simple example of this data type.
    > r <- raw(10)
    > typeof(r)
    [1] "raw"
2. Data Structure
In the above section, we discussed the basic data types used in R. This section explores how data can be stored in R. Let's understand it with an example. Let's say we want to store the salaries of 10 employees. One option would be to create ten variables and store one salary in each of them.
    employee1 <- 2000
    employee2 <- 2500
    ...
    ...
    employee10 <- 3000
The above strategy is not feasible for a real-life dataset. We need a way to store our data more easily, and also an easier way to access and update that data. This is where data structures come in.
R offers the following data structures to store our data. We will cover each of them with the help of an example.
- Vector
- List
- Factor
- Matrix
- Data frame
A vector is used to store multiple values of the same data type. Usually, an attribute in a dataset contains data of the same type. In that sense, you can think of each attribute as being stored as a vector.
Let's begin with our example of employee salaries. In the following subsections, we will create a vector and access it.
Create a vector
To create a vector, just pass the values to the `c()` function. You can check the number of elements stored in the vector using `length()`.
    > salary <- c(100000,200000,150000)
    > length(salary)
    [1] 3
    > typeof(salary)
    [1] "double"
Generate a vector
You can also generate a vector containing a sequence of numbers, e.g. 1 to 100. For that, you can use the `:` operator or the `seq()` function. With the `:` operator, we need to specify the start and end of the sequence. With the `seq()` function, we also specify the step size (the difference between two consecutive numbers).
    > a <- 1:100
    > length(a)
    [1] 100
    > b <- seq(1,100,1)
    > length(b)
    [1] 100
    > c <- seq(1,100,2)
    > length(c)
    [1] 50
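As a small sketch of other ways to use the same tools: `seq()` also accepts the named arguments `by` and `length.out`, and the `:` operator can count downwards. The variable names below are only illustrative.

```r
evens <- seq(2, 20, by = 2)               # step size given by name
length(evens)                             # 10

five_points <- seq(0, 1, length.out = 5)  # ask for 5 evenly spaced values
five_points                               # 0.00 0.25 0.50 0.75 1.00

countdown <- 5:1                          # the : operator also counts down
countdown                                 # 5 4 3 2 1
```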
Access items from vector
You can access any item in a vector using an integer index (an index is just the position of an item; e.g. in the following example, 10 is stored as the first item, so its index is 1). The first item of the vector has index 1, the second has index 2, and so on. You access an item by writing its index inside square brackets `[ ]`. In the following example, we access the second, first, and fifth elements of the vector.
    > a <- c(10,20,30,40,50)
    > a[2]
    [1] 20
    > a[1]
    [1] 10
    > a[5]
    [1] 50
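Square brackets accept more than a single index. The following sketch shows a few common variations in base R (several indices at once, a range, a negative index, a logical condition, and updating an item); it reuses the vector `a` from the example above.

```r
a <- c(10, 20, 30, 40, 50)

a[c(1, 3)]   # 10 30: several positions at once
a[2:4]       # 20 30 40: a range of positions
a[-1]        # 20 30 40 50: a negative index drops that position
a[a > 25]    # 30 40 50: keep only items that satisfy a condition

a[2] <- 25   # update the second item
a            # 10 25 30 40 50
```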
Important Note: If you try to store numeric and character data in the same vector, R will allow it. However, R will convert all elements to a single data type that can represent every one of them.
    > a <- c(10,20,"hello","English",TRUE)
    > typeof(a)
    [1] "character"
In the above code, we created a vector with three different types of data (numeric, character, logical). However, when we checked the type of the vector, it showed "character", which means all the data is stored as character.
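The conversion follows a fixed order in base R: logical values can become numbers, and anything can become character. The sketch below illustrates this; the variable name `mixed` is only illustrative.

```r
typeof(c(TRUE, 10))      # "double": TRUE is converted to 1
typeof(c(10L, 2.5))      # "double": the integer is converted to double
typeof(c(10, "hello"))   # "character": the number is converted to text

mixed <- c(10, 20, "hello", TRUE)
mixed                    # "10" "20" "hello" "TRUE"

# Text that looks like numbers can be converted back explicitly
as.numeric(c("10", "20"))   # 10 20
```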
You can consider a List as a generic form of a Vector. It means you can store different types of data in a List.
Let's create a list with three different types of data.
    > a <- list(10,20,"hello","English",TRUE)
    > typeof(a)
    [1] "list"
Accessing the data
Accessing data in a list is a bit tricky. Here, using `[ ]` to access data from the list returns a list itself. Let's understand it with an example.
    > a <- list(c(10,20),"hello",5000)
    > b <- a[1]
    > typeof(b)
    [1] "list"
You can think of the above list as a container of three items: the first of length 2, the second of length 1, and the third also of length 1. When we use `a[1]`, it simply returns another list. To access a particular item itself, use `[[ ]]`.
    > a[[1]]
    [1] 10 20
    > b <- a[[1]]
    > typeof(b)
    [1] "double"
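List items can also be given names, which often makes access easier to read. The sketch below is an illustration with a made-up list called `student`; `$` and `names()` are base R.

```r
student <- list(name = "Ram", marks = c(87, 78), passed = TRUE)

typeof(student["marks"])     # "list": single brackets return a smaller list
typeof(student[["marks"]])   # "double": double brackets return the item itself

student$marks                # 87 78: $ is a shortcut for [[ ]] with a name
student[["marks"]][2]        # 78: second value of the marks item
names(student)               # "name" "marks" "passed"
```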
Let's assume your dataset has an attribute language that takes three different values across the entire dataset (English, Hindi, Estonian). This is a nominal (categorical) variable. Another example is education level, which is an ordinal variable, e.g. primary, secondary, bachelor, master, doctoral, etc. Factors come in handy for storing such data. Let's see how to store categorical or ordinal data in R.
Let's say we have an attribute `language` which stores the mother tongue of five students (English, Estonian, Estonian, Hindi, English). First we create a vector of the data, and then we use the `factor` function to convert the vector into a factor.
    > language <- c("English","Estonian","Estonian","Hindi","English")
    > typeof(language)
    [1] "character"
    > # Converting above data into categorical
    > lan2 <- factor(language)
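To inspect what a factor holds, base R offers `levels()`, `nlevels()`, and `table()`. The following sketch repeats the factor from the example above so that it runs on its own.

```r
language <- c("English", "Estonian", "Estonian", "Hindi", "English")
lan2 <- factor(language)

levels(lan2)       # "English" "Estonian" "Hindi": the distinct categories
nlevels(lan2)      # 3
table(lan2)        # how many students fall into each category

# Internally, each value is stored as the integer position of its level
as.integer(lan2)   # 1 2 2 3 1
```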
When our categories have an order, the data is considered ordinal. Example: consider a student's contribution to a project, coded as low, medium, or high. These categories have an inherent order among them. Such data is best stored as an ordered factor.
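    > contribution <- c("low","low","high","medium","medium")
    > # State the order explicitly; otherwise the levels are sorted alphabetically
    > con2 <- ordered(contribution, levels = c("low","medium","high"))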
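Once the order of the levels is known, ordinal values can be compared. The sketch below assumes the order low < medium < high and passes it through the `levels` argument of `factor()`; the variable name `con` is only illustrative.

```r
contribution <- c("low", "low", "high", "medium", "medium")
con <- factor(contribution,
              levels = c("low", "medium", "high"),
              ordered = TRUE)

levels(con)        # "low" "medium" "high"
con[1] < con[3]    # TRUE: low is below high
con >= "medium"    # FALSE FALSE TRUE TRUE TRUE
max(con)           # high
```

Comparisons like these are not defined for unordered factors, which is the practical difference between nominal and ordinal data in R.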
A matrix is a two-dimensional array (think of data of the same type in tabular format). For example, the one given below is a matrix with three rows and three columns.
    1   2  23
    19  5   6
    20  78  6
Create a matrix
Let's create the same matrix as shown above. You can create a matrix using the `matrix` function, where we specify the number of rows and columns in `nrow` and `ncol`, respectively. Note that R fills the matrix column by column. You can check the dimension (number of rows and columns) of a matrix using `dim()`.
    > m <- matrix(c(1,19,20,2,5,78,23,6,6),nrow=3,ncol=3)
    > m
         [,1] [,2] [,3]
    [1,]    1    2   23
    [2,]   19    5    6
    [3,]   20   78    6
    > dim(m)
    [1] 3 3
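By default, `matrix()` fills the values column by column, which is why 1, 19, 20 end up in the first column above. If you prefer to type the values row by row, the `byrow` argument of `matrix()` can be used. A minimal sketch, with illustrative variable names:

```r
values <- c(1, 19, 20, 2, 5, 78, 23, 6, 6)

by_col <- matrix(values, nrow = 3, ncol = 3)                # default: fill columns first
by_row <- matrix(values, nrow = 3, ncol = 3, byrow = TRUE)  # fill rows first

by_col[1, ]   # 1 2 23
by_row[1, ]   # 1 19 20

# One layout is the transpose of the other
identical(by_row, t(by_col))   # TRUE
```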
Accessing elements from Matrix
You can access elements of a matrix by specifying the row number and column number. Let's say we want to access the element in the first row and third column. We will write `m[1,3]` to access it. If we want to access the whole first row, we skip the column number. Similarly, if we want to access only the third column, we skip the row number.
    > m[1,3]
    [1] 23
    > m[1,]
    [1]  1  2 23
    > m[,3]
    [1] 23  6  6
2.5 Data frame
A data frame is a very common data structure for storing datasets. It stores data as a list of equal-length vectors.
What does that mean? Just think about a dataset with three attributes (student name, age, marks) that contains records of five students. Each attribute has five values, and the attributes can be of different data types (e.g. name as character type, age as numeric type, and marks as numeric type).
    > names <- c("John","Ram","Pradeep","Linda","Shashi")
    > age <- c(19,20,18,21,10)
    > marks <- c(87,78,80,90,88)
    > data <- data.frame(names,age,marks)
    > data
        names age marks
    1    John  19    87
    2     Ram  20    78
    3 Pradeep  18    80
    4   Linda  21    90
    5  Shashi  10    88
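After creating a data frame, it is useful to check its size and structure. The sketch below rebuilds the same data frame and inspects it with the base R functions `nrow()`, `ncol()`, `str()`, and `head()`.

```r
names <- c("John", "Ram", "Pradeep", "Linda", "Shashi")
age   <- c(19, 20, 18, 21, 10)
marks <- c(87, 78, 80, 90, 88)
data  <- data.frame(names, age, marks)

nrow(data)      # 5: number of records
ncol(data)      # 3: number of attributes
str(data)       # prints the type and first values of each column
head(data, 2)   # shows only the first two records
```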
Changing the name of columns
You can change the names of your attributes as follows
    > names(data) <- c("Student-Name",'Student-Age','Total-marks')
    > data
      Student-Name Student-Age Total-marks
    1         John          19          87
    2          Ram          20          78
    3      Pradeep          18          80
    4        Linda          21          90
    5       Shashi          10          88
You can access data in the same way as we did for the Matrix data structure, using `[row-index,column-index]`. Let's access the marks of student Linda. If we look at the above data, the row number for student Linda is 4 and the column number is 3.
    > data[4,3]
    [1] 90
Accessing column by name
You can access the columns directly by specifying the column names.
    > data['Student-Name']
      Student-Name
    1         John
    2          Ram
    3      Pradeep
    4        Linda
    5       Shashi
Accessing multiple columns
If you want to access multiple columns from your data then you need to specify their names as a vector.
    > data[c('Student-Name','Total-marks')]
      Student-Name Total-marks
    1         John          87
    2          Ram          78
    3      Pradeep          80
    4        Linda          90
    5       Shashi          88
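Single brackets return a smaller data frame. To get a column as a plain vector (for example, to compute on it), `[[ ]]` or `$` can be used, just as with lists. The sketch below rebuilds the data frame so that it runs on its own; `stringsAsFactors = FALSE` is set explicitly so the names stay character in older R versions too.

```r
data <- data.frame(
  names = c("John", "Ram", "Pradeep", "Linda", "Shashi"),
  age   = c(19, 20, 18, 21, 10),
  marks = c(87, 78, 80, 90, 88),
  stringsAsFactors = FALSE
)
names(data) <- c("Student-Name", "Student-Age", "Total-marks")

data[["Total-marks"]]         # 87 78 80 90 88 as a numeric vector
data$`Total-marks`            # same; backticks are needed because of the hyphen
mean(data[["Total-marks"]])   # 84.6

# Names of students with more than 85 marks
data[data[["Total-marks"]] > 85, "Student-Name"]   # "John" "Linda" "Shashi"
```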
3. Basic Mathematical Operations
In this section, we will discuss some of the basic mathematical operations in R. The following table lists some basic mathematical operations that will be needed in the next part of the series.
|Operation|R code|
|---|---|
|Exponential (e.g. `e^2`)|`exp(2)`|
|Log (base 10) (e.g. `log_10(10)`)|`log10(10)`|
|Log (base e) (e.g. `log_e(10)`)|`log(10)`|
|Square root (e.g. `\sqrt(25)`)|`sqrt(25)`|
|Exponent (e.g. `3^3`)|`3^3`|
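The operations in the table can be tried directly in the console. A minimal sketch using only base R functions (the last line, with the `base` argument of `log()`, is an extra not listed in the table):

```r
exp(2)             # e^2, about 7.389056
log10(10)          # 1
log(10)            # natural log, about 2.302585
sqrt(25)           # 5
3^3                # 27
log(8, base = 2)   # 3: log() accepts any base through the base argument
```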
3.1 Operations on Vector
We can also perform mathematical operations on numeric vectors. In such cases, the operation is performed on each element of the vector.
    > a <- c(10,20,30,40,50)
    > a + 2
    [1] 12 22 32 42 52
    > sqrt(a)
    [1] 3.162278 4.472136 5.477226 6.324555 7.071068
    > log(a)
    [1] 2.302585 2.995732 3.401197 3.688879 3.912023
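The same element-by-element rule applies when both operands are vectors of the same length: the first elements are combined, then the second, and so on. A short sketch with two made-up vectors:

```r
a <- c(10, 20, 30)
b <- c(1, 2, 3)

a + b    # 11 22 33
a * b    # 10 40 90
a / b    # 10 10 10
a > 15   # FALSE TRUE TRUE: comparisons are element-wise too
```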
Some useful functions
Here, we will look at some useful functions, e.g. for finding the minimum and maximum, summing all elements, and sorting a vector.
    > a <- c(20,10,60,30,40,50)
    > min(a)
    [1] 10
    > max(a)
    [1] 60
    > sum(a)
    [1] 210
    > sort(a)
    [1] 10 20 30 40 50 60
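A few more base R functions work in the same way and are likely to be useful for basic statistics. The sketch below applies them to the same vector.

```r
a <- c(20, 10, 60, 30, 40, 50)

mean(a)                      # 35: the average
median(a)                    # 35: the middle value of the sorted data
range(a)                     # 10 60: minimum and maximum together
sort(a, decreasing = TRUE)   # 60 50 40 30 20 10
which.max(a)                 # 3: position of the largest element
```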