Introduction to Pandas Python API

Pandas is a Python library that makes it easy to load and preprocess data, which saves developers a lot of time. This post offers an introduction to the pandas API for beginners who want to start using it.

Installation

You can set up pandas on your system in one of two ways. The first option is to install the Anaconda Distribution, which ships with all the essential Python packages. The second option is to install Python on your system and then install the pandas package using the following command:

Pankaj-Pro:~ panche$ pip install pandas

Introduction

Now we are all set to explore the API basics. We will cover the following topics, which should be sufficient to start programming with pandas:

  • Pandas DataFrame and Series objects
  • Reading a CSV data file and accessing basic information
  • Querying data using loc and iloc
  • Handling missing data
  • Adding, deleting columns or rows

1. Pandas DataFrame and Series objects

Pandas’ Series is a one-dimensional array of values. You can think of it as a single attribute (column) in a dataset. For instance, consider a dataset with three attributes: sid (student id), name, and marks. Each of these attributes is represented in pandas as a Series object.

(Figure: the sid, name, and marks attributes as pandas Series objects)

Let’s write a program to create these three Series objects sid, name and marks.

import pandas as pd

# Creating Series using list
id = pd.Series([101,102,103,104,105])

name = pd.Series(['pradeep','manoj','kiran','pushpendra','sambit'])

marks = pd.Series([20,30,40,32,28])

The first line, import pandas as pd, imports the pandas package into your program. The next three lines create three Series objects from the given lists.

Pandas’ DataFrame is a two-dimensional data structure in which each attribute (column) is a Series object. You can create a DataFrame from a dictionary of key:value pairs, where each key is an attribute name and each value holds that attribute’s values. Let’s create a DataFrame using the Series objects we created in the program above.

df = pd.DataFrame({'sid':id,'name':name,'marks':marks})
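A quick self-contained check (repeating the Series from above, but using the variable name sid instead of id to avoid shadowing Python's built-in id) confirms the shape and column names of the resulting DataFrame:

```python
import pandas as pd

# Recreate the three Series from above (sid instead of id to avoid the built-in)
sid = pd.Series([101, 102, 103, 104, 105])
name = pd.Series(['pradeep', 'manoj', 'kiran', 'pushpendra', 'sambit'])
marks = pd.Series([20, 30, 40, 32, 28])

# Build the DataFrame from a dict of Series
df = pd.DataFrame({'sid': sid, 'name': name, 'marks': marks})

print(df.shape)           # (5, 3): five rows, three columns
print(list(df.columns))   # ['sid', 'name', 'marks']
```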

2. Reading a CSV data file and accessing basic information

Pandas provides the read_csv() function to read data files in CSV format. This function comes with several useful options, which we will cover in this section. The data file used in this tutorial can be downloaded from the link; the downloaded file is named iris_csv.csv.

2.1 Open and read a CSV data file

import pandas as pd
df=pd.read_csv('iris_csv.csv')
df.head()

The first line, import pandas as pd, imports the pandas package; the second line, df=pd.read_csv('iris_csv.csv'), opens and reads the CSV file (you can substitute the name of your own data file here). The third line, df.head(), shows the first five records from your data file (you can pass a different number of records).

2.2 Assign/Rename column names

If your data file does not have column names, or you want to assign different ones, you can use the names option of the read_csv() function.
Example:
Example:

import pandas as pd
df = pd.read_csv('iris_csv.csv',skiprows=1,names=['sep-len','sep-wid','pet-len','pet-wid','class'])
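Note that if the file already contains a header row, passing names alone would read that old header in as a data row; adding skiprows=1 discards it. A minimal sketch using an in-memory file (the column names here are illustrative):

```python
import io
import pandas as pd

# An in-memory stand-in for a CSV file that already has a header row
data = "a,b\n1,2\n3,4\n"

# names alone: the old header 'a,b' leaks in as a data row
df_wrong = pd.read_csv(io.StringIO(data), names=['x', 'y'])

# names + skiprows=1: the old header row is discarded
df_right = pd.read_csv(io.StringIO(data), names=['x', 'y'], skiprows=1)

print(df_wrong.shape)  # (3, 2) -- header row became data
print(df_right.shape)  # (2, 2)
```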

2.3 Reading a data file with a different separator

Sometimes data files have columns separated by other characters (e.g. spaces or colons). In such cases, to read the file we need to pass the sep option to the read_csv() function.

# reading a file whose values are separated by :
df = pd.read_csv('data_file',sep=':')

2.4 Skipping rows while reading data

If your data file’s records do not start on the first line (say it begins with some summary or description and the records start on line 4), you can skip those lines with the skiprows option.

df = pd.read_csv('data_file',skiprows=3)

2.5 Accessing the size of the data

You can check the size of your dataset (the number of rows and columns) using the shape property of the DataFrame.

import pandas as pd
df = pd.read_csv('iris_csv.csv')
print(df.shape)
# output
# (150, 5)

Here, 150 is the number of rows and 5 is the number of columns.

2.6 Checking data types of columns

To check the data types of the columns in your data file, you can use the dtypes property.

import pandas as pd
df = pd.read_csv('iris_csv.csv')
df.dtypes

Output:

sep-len    float64
sep-wid    float64
pet-len    float64
pet-wid    float64
class       object
dtype: object

Since data-processing modules generally require your data to be in numeric types (e.g. int, float), it is best practice to check the data types before processing.
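If a column that should be numeric comes back as object (for example because of a stray non-numeric entry), pd.to_numeric can coerce it; the column name and values below are just illustrative:

```python
import pandas as pd

# A column read as object because of a stray non-numeric entry
df = pd.DataFrame({'marks': ['20', '30', 'n/a', '32']})
print(df['marks'].dtype)  # object

# errors='coerce' turns unparseable entries into NaN
df['marks'] = pd.to_numeric(df['marks'], errors='coerce')
print(df['marks'].dtype)  # float64
```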

2.7 Basic stats of data

If you want to explore your data in more depth, you can use the describe() function, which reports the count, minimum, maximum, mean, standard deviation, and quartiles for each numeric column. An example is given below.

import pandas as pd
df = pd.read_csv('iris_csv.csv',skiprows=1,names=['sep-len','sep-wid','pet-len','pet-wid','class'])
df.describe()

3. Querying data using loc and iloc

Pandas offers two indexers for accessing data in a DataFrame: .loc (label-based) and .iloc (position-based). (A third one, ix, is deprecated.) With these, you specify the labels or positions of the rows and columns you want to access. If you do not specify a column selector, all columns are returned by default.

The : operator is used for slicing, and it behaves differently for labels and positions. With labels (start:end), the end element is included in the result; with positions (start:end), it is not.
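A tiny Series makes the inclusive/exclusive difference concrete:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label slice: the end label 'c' IS included
print(list(s.loc['a':'c']))  # [10, 20, 30]

# Position slice: the end position 2 is NOT included
print(list(s.iloc[0:2]))     # [10, 20]
```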

Let’s understand the difference between labels and positions. In the following program, we are creating a simple dataframe with two columns sid and name.

import pandas as pd
df = pd.DataFrame({'sid':[101,102,103,104,105],'name':['pradeep','manoj','kiran','pushpendra','sambit']})
df.head()

The labels for this data frame are its index and column names. As shown in the figure below, the row labels are (0,1,2,3,4) and the column labels are (sid, name).
Positions for rows and columns begin at 0, which means the first row has position 0, the second row position 1, and so on.

In the example above, the row positions and labels are the same. To make the difference clear, let’s change the index of our dataframe and look at it again.
You can change the index of a dataframe using the set_index() function. In the following example, we set the first column, sid, as the index of our dataset. This function creates a copy of the dataframe, applies the change to it, and returns the updated copy. To modify the dataframe in place, pass the inplace=True parameter.

df.set_index('sid',inplace=True)
df.head()

As you can see in the figure below, the row labels are now (101,102,103,104,105), whereas the row positions are the same as before.

(Figure: row labels vs. row positions after set_index)

3.1 Some Examples

The figure given below shows some basic syntax for accessing data with label and position-based selections.

To access a particular row, specify its label or position in the row selector (for example, label 0 for the first row). To access multiple rows, list their labels or positions, or use the : operator for slicing (for example, the row selector for the first three rows can be [0,1,2] or 0:3).

# import pandas package
import pandas as pd

# create the dataframe
df = pd.DataFrame({'sid':[101,102,103,104,105],'name':['pradeep','manoj','kiran','pushpendra','sambit']})

# set sid as index
df.set_index('sid',inplace=True)

# Access first row
df.loc[101]  # label-based

df.iloc[0]  #  Position-based


# Access first three rows
df.loc[101:103]  # label-based

df.iloc[0:3]  # position-based
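A column selector can be added after the row selector. Building on the same dataframe (with sid as the index), a sketch:

```python
import pandas as pd

df = pd.DataFrame({'sid': [101, 102, 103, 104, 105],
                   'name': ['pradeep', 'manoj', 'kiran', 'pushpendra', 'sambit']})
df.set_index('sid', inplace=True)

# Row slice plus a column label (label slices include the end)
print(list(df.loc[101:103, 'name']))  # ['pradeep', 'manoj', 'kiran']

# Same selection by position: rows 0-2, column 0
print(list(df.iloc[0:3, 0]))          # ['pradeep', 'manoj', 'kiran']
```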

3.2 Condition-based data access

Both loc and iloc support condition-based data access using boolean arrays. Say we want to access the first row: we pass a boolean array for row selection containing one boolean value per row, with True at the positions of the rows we want to show. Column selection works the same way.

import pandas as pd
df = pd.DataFrame({'sid':[101,102,103,104,105],'name':['pradeep','manoj','kiran','pushpendra','sambit']})


df.loc[[True,False,False,False,False],[True,True]]




#Output
#      sid     name
#0     101     pradeep

In the example above, we specified one boolean array for row selection and one for column selection. The first array has True at the first index (which corresponds to the first row), and the second array is all True (which corresponds to both columns). As a result, it accesses the values from the first row across both columns.
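In practice the boolean array is usually produced by a comparison rather than written out by hand; for example:

```python
import pandas as pd

df = pd.DataFrame({'sid': [101, 102, 103, 104, 105],
                   'name': ['pradeep', 'manoj', 'kiran', 'pushpendra', 'sambit']})

# A comparison yields a boolean Series with one value per row
mask = df['sid'] > 103
print(list(mask))    # [False, False, False, True, True]

# Use it as the row selector in loc
print(df.loc[mask])  # the rows for sid 104 and 105
```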

4. Handling missing data

Pandas offers great support for handling missing values. If your dataset has some values missing, pandas automatically marks them as NaN. To demonstrate, I have prepared a file with three columns: eid, name, and salary.

In this file, I intentionally kept the salary field for the third record empty. Now let’s read this file using pandas.

df=pd.read_csv('emp.csv')
df.head()

(Output: the third record’s salary appears as NaN.)

We can handle missing values in two ways: deleting them, or replacing them with other values.
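Since emp.csv is not bundled with this post, an equivalent frame can be built directly for experimenting (the names and salaries below are illustrative); isna() confirms where the gap is:

```python
import numpy as np
import pandas as pd

# Stand-in for emp.csv: the third record's salary is missing
df = pd.DataFrame({'eid': [1, 2, 3, 4],
                   'name': ['asha', 'ravi', 'meena', 'vijay'],
                   'salary': [50000.0, 60000.0, np.nan, 55000.0]})

# Count the missing values per column
print(df.isna().sum())  # salary has 1 missing value, the others 0
```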

4.1 Deleting missing values

We can use the dropna() function to delete records with missing values. In this function, specify axis=0 to delete the row(s) containing NaN, or axis=1 to delete the column(s) containing NaN.

df=pd.read_csv('emp.csv')

#delete row
df1 = df.dropna(axis=0)
df1.head()

#delete column
df1 = df.dropna(axis=1)
df1.head()

If you want to change your original dataframe then specify inplace=True in the dropna() function.

4.2 Filling missing values

The fillna() function is used to fill missing values.

df=pd.read_csv('emp.csv')

#fill with 0
df1 = df.fillna(0)
df1.head()

#fill using forward fill method
df2 = df.fillna(method='ffill')
df2.head()

# fill using backward fill method
df3 = df.fillna(method='bfill')
df3.head()

# fill using mean value of column
df4 = df.fillna(df['salary'].mean())
df4.head()

ffill replaces a NaN with the previous value in the same column, while bfill replaces a NaN with the next value in the same column (reading values top to bottom).
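In recent pandas releases the method= keyword of fillna() is deprecated; the dedicated ffill() and bfill() methods do the same thing:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Forward fill: a NaN takes the previous value in the column
print(list(s.ffill()))  # [1.0, 1.0, 3.0]

# Backward fill: a NaN takes the next value in the column
print(list(s.bfill()))  # [1.0, 3.0, 3.0]
```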

5. Adding, deleting columns and rows

Here I am using our previous dataframe with the columns sid and name.

5.1 Adding a row or column

We can simply add a row using the append() function or the loc/iloc indexers. In append(), you pass key:value pairs, where the key is the attribute name and the value is the value you want to add; pandas automatically inserts NaN for any attributes you leave out. Let’s add a record with sid 106 and name ‘gaurav’.

import pandas as pd
df = pd.DataFrame({'sid':[101,102,103,104,105],'name':['pradeep','manoj','kiran','pushpendra','sambit']})

# using the append function
df = df.append({'sid':106,'name':'gaurav'},ignore_index=True)
print(df)

# Adding a column
df['marks'] = [20,30,40,32,28,50]

The same record can also be added using the statement df.loc[5]=[106,'gaurav'].
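Note that DataFrame.append() was removed in pandas 2.0, so on current versions the same record is added with pd.concat():

```python
import pandas as pd

df = pd.DataFrame({'sid': [101, 102, 103, 104, 105],
                   'name': ['pradeep', 'manoj', 'kiran', 'pushpendra', 'sambit']})

# Wrap the new record in a one-row DataFrame and concatenate
new_row = pd.DataFrame({'sid': [106], 'name': ['gaurav']})
df = pd.concat([df, new_row], ignore_index=True)

print(df.shape)  # (6, 2)
```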

5.2 Deleting rows or columns

To delete columns or rows, the drop() function can be used. In this function, you specify the label of the row or column you want to delete; for a column, you also need to pass the parameter axis=1. Note that drop() returns a new DataFrame rather than modifying the original (pass inplace=True to change it in place). Let’s see an example.

import pandas as pd
df = pd.DataFrame({'sid':[101,102,103,104,105],'name':['pradeep','manoj','kiran','pushpendra','sambit']})

# delete sid column
df.drop('sid',axis=1)
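Deleting a row works the same way, using the row’s index label and the default axis=0; a sketch on the same dataframe:

```python
import pandas as pd

df = pd.DataFrame({'sid': [101, 102, 103, 104, 105],
                   'name': ['pradeep', 'manoj', 'kiran', 'pushpendra', 'sambit']})

# Drop the first row by its index label (0); drop returns a new DataFrame
df1 = df.drop(0)
print(df1.shape)  # (4, 2)
print(df.shape)   # (5, 2) -- the original is unchanged
```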
 
