crash course on pandas library python for everyday data science

pandas is a fast, powerful, flexible and easy to use open-source data analysis and manipulation tool, built on top of the python programming language. This post acts as a cheat-sheet for using the pandas library in python for everyday data science. It contains some important functions and submodules of pandas library which are used day to day in data science and machine learning.

Installing pandas

pandas can be installed easily by using pip. First, create a virtual environment and activate it. Then run the command below to install pandas.

1
pip install pandas

Importing the library

Create a python project or open an existing project and import pandas using

1
import pandas

we can also use an alias pd to use pandas into the project by importing pandas as pd

1
import pandas as pd

Reading and Visualizing CSV files

data = pd.read_csv(“path_to_csv_file”)

1
data = pd.read_csv("corona.csv")

this will read and load CSV file from memory and creates a pandas DataFrame object.

data.head(number_of_columns)

this will help to visualize how loaded data looks. It is stored in the DataFrame object which is pandas way to structure CSV data in the form of tables. ex:

1
data.head(5)  #shows first 5 rows of dataframe
Name of State / UT Total Confirmed cases * Cured/Discharged/Migrated Death
0 Andhra Pradesh 19 1 0
1 Andaman and Nicobar Islands 9 0 0
2 Bihar 11 0 1
3 Chandigarh 8 0 0
4 Chhattisgarh 7 0 0

Similarly, this will help to visualize the last part of data in DataFrame

data.tail(numbers_of_columns)

1
data.tail(5)
Name of State / UT Total Confirmed cases * Cured/Discharged/Migrated Death
22 Tamil Nadu 50 4 1
23 Telengana 69 1 1
24 Uttarakhand 7 2 0
25 Uttar Pradesh 75 11 0
26 West Bengal 19 0 1

Getting shape or dimensions and size of loaded data

shape property gives out the shape or dimension of DataFrame as a tuple of the form (number_of_rows, number_of_columns)

1
data.shape
(27, 4)

size gives out the number of elements or number of unique cells in DataFrame.

1
data.size
108

len on the DataFrame object gives out the number of rows in DataFrame.

1
len(data)
27

Retrieving data from DataFrame

data[‘column_name’]

data[[‘column_1’,’column_2’,’column_3’]]

1
 data['Death']
0     0
1     0
2     1
3     0
4     0
5     2
6     0
7     5
8     0
9     1
10    2
11    3
12    1
13    0
14    2
15    8
16    0
17    0
18    0
19    0
20    1
21    0
22    1
23    1
24    0
25    0
26    1
Name: Death, dtype: int64
1
data[['Name of State / UT','Death']]
Name of State / UT Death
0 Andhra Pradesh 0
1 Andaman and Nicobar Islands 0
2 Bihar 1
3 Chandigarh 0
4 Chhattisgarh 0
5 Delhi 2
6 Goa 0
7 Gujarat 5
8 Haryana 0
9 Himachal Pradesh 1
10 Jammu and Kashmir 2
11 Karnataka 3
12 Kerala 1
13 Ladakh 0
14 Madhya Pradesh 2
15 Maharashtra 8
16 Manipur 0
17 Mizoram 0
18 Odisha 0
19 Puducherry 0
20 Punjab 1
21 Rajasthan 0
22 Tamil Nadu 1
23 Telengana 1
24 Uttarakhand 0
25 Uttar Pradesh 0
26 West Bengal 1

sort and return DataFrame column specified

1
data.sort_values(['Total Confirmed cases *']) #sort and return DataFrame column
Name of State / UT Total Confirmed cases * Cured/Discharged/Migrated Death
19 Puducherry 1 0 0
17 Mizoram 1 0 0
16 Manipur 1 0 0
18 Odisha 3 0 0
9 Himachal Pradesh 3 0 1
6 Goa 5 0 0
4 Chhattisgarh 7 0 0
24 Uttarakhand 7 2 0
3 Chandigarh 8 0 0
1 Andaman and Nicobar Islands 9 0 0
2 Bihar 11 0 1
13 Ladakh 13 3 0
0 Andhra Pradesh 19 1 0
26 West Bengal 19 0 1
10 Jammu and Kashmir 31 1 2
14 Madhya Pradesh 33 0 2
8 Haryana 33 17 0
20 Punjab 38 1 1
22 Tamil Nadu 50 4 1
5 Delhi 53 6 2
21 Rajasthan 57 3 0
7 Gujarat 58 1 5
23 Telengana 69 1 1
25 Uttar Pradesh 75 11 0
11 Karnataka 80 5 3
15 Maharashtra 193 25 8
12 Kerala 194 19 1

Using loc and iloc

pandas DataFrame provides powerful properties loc and iloc to retrieve data from DataFrame. This is the most widely used technique if we have lots of features to retrieve.

loc selects things by using the label

loc has following valid inputs -

  1. A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along with the index.).

  2. A list or array of labels [‘a’, ‘b’, ‘c’].

  3. A slice object with labels’ a ‘ : ‘ f ‘(Note that contrary to usual python slices, both the start and the stop are included.)

data.loc[row_needed,column_needed] #loc with input type 1

data.loc[[row_needed(multiple)],column_needed] #loc with input type 2

data.loc[row_range_needed,column_range_needed] #loc with input type 3

Example:

1
data.loc[:,'Total Confirmed cases *'] #selects all rows from column Total Confirmed cases
0      19
1       9
2      11
3       8
4       7
5      53
6       5
7      58
8      33
9       3
10     31
11     80
12    194
13     13
14     33
15    193
16      1
17      1
18      3
19      1
20     38
21     57
22     50
23     69
24      7
25     75
26     19
Name: Total Confirmed cases *, dtype: int64
1
data.loc[:,['Total Confirmed cases *','Death']]
Total Confirmed cases * Death
0 19 0
1 9 0
2 11 1
3 8 0
4 7 0
5 53 2
6 5 0
7 58 5
8 33 0
9 3 1
10 31 2
11 80 3
12 194 1
13 13 0
14 33 2
15 193 8
16 1 0
17 1 0
18 3 0
19 1 0
20 38 1
21 57 0
22 50 1
23 69 1
24 7 0
25 75 0
26 19 1
1
data.loc[:,'Total Confirmed cases *':'Death'] #( including death column [refer input type 3])
Total Confirmed cases * Cured/Discharged/Migrated Death
0 19 1 0
1 9 0 0
2 11 0 1
3 8 0 0
4 7 0 0
5 53 6 2
6 5 0 0
7 58 1 5
8 33 17 0
9 3 0 1
10 31 1 2
11 80 5 3
12 194 19 1
13 13 3 0
14 33 0 2
15 193 25 8
16 1 0 0
17 1 0 0
18 3 0 0
19 1 0 0
20 38 1 1
21 57 3 0
22 50 4 1
23 69 1 1
24 7 2 0
25 75 11 0
26 19 0 1

note: using loc slices contrary to usual python slices, both the start and the stop is included, when present in the index ie.. in ‘ a ‘ : ‘ c ‘ both ‘ a ‘ and ‘ c ‘ row are inclusive.

iloc selects rows and columns using integer positions

iloc has the following valid inputs -

1: An integer e.g. 5.

2: A list or array of integers [ 4, 3, 0 ].

3: A slice object with ints 1 : 7.

4: A boolean array.

data.iloc[row_index_needed,column_index_needed] #iloc with type 1

data.iloc[[row_index_needed(multiple)],column_index_needed] #iloc with type 2

data.iloc[row_index_range_needed,column_index_range_needed] #iloc with type 3

Example:

1
data.iloc[:,1]   #selects all rows of column with index 1
0      19
1       9
2      11
3       8
4       7
5      53
6       5
7      58
8      33
9       3
10     31
11     80
12    194
13     13
14     33
15    193
16      1
17      1
18      3
19      1
20     38
21     57
22     50
23     69
24      7
25     75
26     19
Name: Total Confirmed cases *, dtype: int64
1
data.iloc[:,[1,3]]   #selects every row of column with index 1 and 4
Total Confirmed cases * Death
0 19 0
1 9 0
2 11 1
3 8 0
4 7 0
5 53 2
6 5 0
7 58 5
8 33 0
9 3 1
10 31 2
11 80 3
12 194 1
13 13 0
14 33 2
15 193 8
16 1 0
17 1 0
18 3 0
19 1 0
20 38 1
21 57 0
22 50 1
23 69 1
24 7 0
25 75 0
26 19 1
1
data.iloc[:,1:3]  #selects every row of column with index 1 and 2 only
Total Confirmed cases * Cured/Discharged/Migrated
0 19 1
1 9 0
2 11 0
3 8 0
4 7 0
5 53 6
6 5 0
7 58 1
8 33 17
9 3 0
10 31 1
11 80 5
12 194 19
13 13 3
14 33 0
15 193 25
16 1 0
17 1 0
18 3 0
19 1 0
20 38 1
21 57 3
22 50 4
23 69 1
24 7 2
25 75 11
26 19 0
1
data.iloc[0:3,:]  #selects rows 0 to 2 and every column
Name of State / UT Total Confirmed cases * Cured/Discharged/Migrated Death
0 Andhra Pradesh 19 1 0
1 Andaman and Nicobar Islands 9 0 0
2 Bihar 11 0 1

both iloc and loc also take callables as valid input. Callables are functions that return row and column needed to select. This is useful when we have to filter some information before selecting it.

data.loc[callable_function]

data.iloc[callable_function]

Example

1
data.loc[lambda data:data["Name of State / UT"] == 'Uttarakhand'] #selects all rows and columns with country India
Name of State / UT Total Confirmed cases * Cured/Discharged/Migrated Death
24 Uttarakhand 7 2 0
1
data.iloc[lambda x:x.index % 2 == 0] #selects all rows and columns with even index
Name of State / UT Total Confirmed cases * Cured/Discharged/Migrated Death
0 Andhra Pradesh 19 1 0
2 Bihar 11 0 1
4 Chhattisgarh 7 0 0
6 Goa 5 0 0
8 Haryana 33 17 0
10 Jammu and Kashmir 31 1 2
12 Kerala 194 19 1
14 Madhya Pradesh 33 0 2
16 Manipur 1 0 0
18 Odisha 3 0 0
20 Punjab 38 1 1
22 Tamil Nadu 50 4 1
24 Uttarakhand 7 2 0
26 West Bengal 19 0 1

Plotting a DataFrame

We can also plot a DataFrame in the graph which uses Matplotlib library to plot graph. This can be used to plot simple graphs without needing to write plot code.

It plot graphs of graph type specified

Following types of graphs are valid in pandas:

  1. bar or barh for bar plots

  2. hist for histogram

  3. box for boxplot

  4. kde or density for density plots

  5. area for area plots

  6. scatter for scatter plots

  7. hexbin for hexagonal bin plots

  8. pie for pie plots

syntax: data.plot(kind=’graph_type’)

1
data.plot(x ='Name of State / UT', y='Total Confirmed cases *',kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x2272464b148>

png

Concatenating DataFrame with another

We can concatenate DataFrame using pd.concat method

syntax: pd.concat([dataframe1,dataframe2,….],axis= 0 or 1)

pass list of DataFrame to which are needed to concatenate and axis from where to concatenate

1
2
3
4
5
6
7
8
9
10
11
data1 = {'Enrollment No.': [1,2,3,4,5,6,7,8,9,10],
        'Score': [1500,1520,1525,1523,1515,1540,1545,1560,1555,1565]
       }
data2 = {'Class': [0,1,2,1,1,2,0,0,1,2],
        'Place': [1,3,1,2,5,1,5,4,2,1]
       }
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df=pd.concat([df1,df2],axis=1)
df.head(5)
Enrollment No. Score Class Place
0 1 1500 0 1
1 2 1520 1 3
2 3 1525 2 1
3 4 1523 1 2
4 5 1515 1 5

Checking null values in column

pd.isna function is used to check for missing values like NaN or blank etc.

syntax: pd.isna(data[column_name])

1
pd.isna(data['Death'])
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
Name: Death, dtype: bool

So thats what you needed to get started for everyday data science work. There are lot of other features but these can be learned as you progress your journey. All the Best !!!

IPython Notebook Link

References

pandas docs