crash course on pandas library python for everyday data science

pandas is a fast, powerful, flexible and easy to use open-source data analysis and manipulation tool, built on top of the python programming language. This post acts as a cheat-sheet for using the pandas library in python for everyday data science. It contains some important functions and submodules of pandas library which are used day to day in data science and machine learning.

Installing pandas

pandas can be installed easily by using pip. First, create a virtual environment and activate it. Then run the command below to install pandas.

1
pip install pandas

Importing the library

Create a python project or open an existing project and import pandas using

1
import pandas

we can also use an alias pd to use pandas into the project by importing pandas as pd

1
import pandas as pd

Reading and Visualizing CSV files

data = pd.read_csv(“path_to_csv_file”)

1
data = pd.read_csv("corona.csv")

this will read and load CSV file from memory and creates a pandas DataFrame object.

data.head(number_of_columns)

this will help to visualize how loaded data looks. It is stored in the DataFrame object which is pandas way to structure CSV data in the form of tables. ex:

1
data.head(5)  #shows first 5 rows of dataframe

	Name of State / UT	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	Andhra Pradesh	19	1	0
1	Andaman and Nicobar Islands	9	0	0
2	Bihar	11	0	1
3	Chandigarh	8	0	0
4	Chhattisgarh	7	0	0

Similarly, this will help to visualize the last part of data in DataFrame

data.tail(numbers_of_columns)

1
data.tail(5)

	Name of State / UT	Total Confirmed cases *	Cured/Discharged/Migrated	Death
22	Tamil Nadu	50	4	1
23	Telengana	69	1	1
24	Uttarakhand	7	2	0
25	Uttar Pradesh	75	11	0
26	West Bengal	19	0	1

Getting shape or dimensions and size of loaded data

shape property gives out the shape or dimension of DataFrame as a tuple of the form (number_of_rows, number_of_columns)

1
data.shape

(27, 4)

size gives out the number of elements or number of unique cells in DataFrame.

1
data.size

len on the DataFrame object gives out the number of rows in DataFrame.

1
len(data)

Retrieving data from DataFrame

data[‘column_name’]

data[[‘column_1’,’column_2’,’column_3’]]

1
 data['Death']

   0
   0
   1
   0
   0
   2
   0
   5
   0
   1
  2
  3
  1
  0
  2
  8
  0
  0
  0
  0
  1
  0
  1
  1
  0
  0
  1
Name: Death, dtype: int64

1
data[['Name of State / UT','Death']]

	Name of State / UT	Death
0	Andhra Pradesh	0
1	Andaman and Nicobar Islands	0
2	Bihar	1
3	Chandigarh	0
4	Chhattisgarh	0
5	Delhi	2
6	Goa	0
7	Gujarat	5
8	Haryana	0
9	Himachal Pradesh	1
10	Jammu and Kashmir	2
11	Karnataka	3
12	Kerala	1
13	Ladakh	0
14	Madhya Pradesh	2
15	Maharashtra	8
16	Manipur	0
17	Mizoram	0
18	Odisha	0
19	Puducherry	0
20	Punjab	1
21	Rajasthan	0
22	Tamil Nadu	1
23	Telengana	1
24	Uttarakhand	0
25	Uttar Pradesh	0
26	West Bengal	1

sort and return DataFrame column specified

1
data.sort_values(['Total Confirmed cases *']) #sort and return DataFrame column

	Name of State / UT	Total Confirmed cases *	Cured/Discharged/Migrated	Death
19	Puducherry	1	0	0
17	Mizoram	1	0	0
16	Manipur	1	0	0
18	Odisha	3	0	0
9	Himachal Pradesh	3	0	1
6	Goa	5	0	0
4	Chhattisgarh	7	0	0
24	Uttarakhand	7	2	0
3	Chandigarh	8	0	0
1	Andaman and Nicobar Islands	9	0	0
2	Bihar	11	0	1
13	Ladakh	13	3	0
0	Andhra Pradesh	19	1	0
26	West Bengal	19	0	1
10	Jammu and Kashmir	31	1	2
14	Madhya Pradesh	33	0	2
8	Haryana	33	17	0
20	Punjab	38	1	1
22	Tamil Nadu	50	4	1
5	Delhi	53	6	2
21	Rajasthan	57	3	0
7	Gujarat	58	1	5
23	Telengana	69	1	1
25	Uttar Pradesh	75	11	0
11	Karnataka	80	5	3
15	Maharashtra	193	25	8
12	Kerala	194	19	1

Using loc and iloc

pandas DataFrame provides powerful properties loc and iloc to retrieve data from DataFrame. This is the most widely used technique if we have lots of features to retrieve.

loc selects things by using the label

loc has following valid inputs -

A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along with the index.).
A list or array of labels [‘a’, ‘b’, ‘c’].
A slice object with labels’ a ‘ : ‘ f ‘(Note that contrary to usual python slices, both the start and the stop are included.)

data.loc[row_needed,column_needed] #loc with input type 1

data.loc[[row_needed(multiple)],column_needed] #loc with input type 2

data.loc[row_range_needed,column_range_needed] #loc with input type 3

Example:

1
data.loc[:,'Total Confirmed cases *'] #selects all rows from column Total Confirmed cases

    19
     9
    11
     8
     7
    53
     5
    58
    33
     3
   31
   80
  194
   13
   33
  193
    1
    1
    3
    1
   38
   57
   50
   69
    7
   75
   19
Name: Total Confirmed cases *, dtype: int64

1
data.loc[:,['Total Confirmed cases *','Death']]

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

1
data.loc[:,'Total Confirmed cases *':'Death'] #( including death column [refer input type 3])

	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	19	1	0
1	9	0	0
2	11	0	1
3	8	0	0
4	7	0	0
5	53	6	2
6	5	0	0
7	58	1	5
8	33	17	0
9	3	0	1
10	31	1	2
11	80	5	3
12	194	19	1
13	13	3	0
14	33	0	2
15	193	25	8
16	1	0	0
17	1	0	0
18	3	0	0
19	1	0	0
20	38	1	1
21	57	3	0
22	50	4	1
23	69	1	1
24	7	2	0
25	75	11	0
26	19	0	1

note: using loc slices contrary to usual python slices, both the start and the stop is included, when present in the index ie.. in ‘ a ‘ : ‘ c ‘ both ‘ a ‘ and ‘ c ‘ row are inclusive.

iloc selects rows and columns using integer positions

iloc has the following valid inputs -

1: An integer e.g. 5.

2: A list or array of integers [ 4, 3, 0 ].

3: A slice object with ints 1 : 7.

4: A boolean array.

data.iloc[row_index_needed,column_index_needed] #iloc with type 1

data.iloc[[row_index_needed(multiple)],column_index_needed] #iloc with type 2

data.iloc[row_index_range_needed,column_index_range_needed] #iloc with type 3

Example:

1
data.iloc[:,1]   #selects all rows of column with index 1

    19
     9
    11
     8
     7
    53
     5
    58
    33
     3
   31
   80
  194
   13
   33
  193
    1
    1
    3
    1
   38
   57
   50
   69
    7
   75
   19
Name: Total Confirmed cases *, dtype: int64

1
data.iloc[:,[1,3]]   #selects every row of column with index 1 and 4

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

1
data.iloc[:,1:3]  #selects every row of column with index 1 and 2 only

	Total Confirmed cases *	Cured/Discharged/Migrated
0	19	1
1	9	0
2	11	0
3	8	0
4	7	0
5	53	6
6	5	0
7	58	1
8	33	17
9	3	0
10	31	1
11	80	5
12	194	19
13	13	3
14	33	0
15	193	25
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	3
22	50	4
23	69	1
24	7	2
25	75	11
26	19	0

1
data.iloc[0:3,:]  #selects rows 0 to 2 and every column

	Name of State / UT	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	Andhra Pradesh	19	1	0
1	Andaman and Nicobar Islands	9	0	0
2	Bihar	11	0	1

both iloc and loc also take callables as valid input. Callables are functions that return row and column needed to select. This is useful when we have to filter some information before selecting it.

data.loc[callable_function]

data.iloc[callable_function]

Example

1
data.loc[lambda data:data["Name of State / UT"] == 'Uttarakhand'] #selects all rows and columns with country India

	Name of State / UT	Total Confirmed cases *	Cured/Discharged/Migrated	Death
24	Uttarakhand	7	2	0

1
data.iloc[lambda x:x.index % 2 == 0] #selects all rows and columns with even index

	Name of State / UT	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	Andhra Pradesh	19	1	0
2	Bihar	11	0	1
4	Chhattisgarh	7	0	0
6	Goa	5	0	0
8	Haryana	33	17	0
10	Jammu and Kashmir	31	1	2
12	Kerala	194	19	1
14	Madhya Pradesh	33	0	2
16	Manipur	1	0	0
18	Odisha	3	0	0
20	Punjab	38	1	1
22	Tamil Nadu	50	4	1
24	Uttarakhand	7	2	0
26	West Bengal	19	0	1

Plotting a DataFrame

We can also plot a DataFrame in the graph which uses Matplotlib library to plot graph. This can be used to plot simple graphs without needing to write plot code.

It plot graphs of graph type specified

Following types of graphs are valid in pandas:

bar or barh for bar plots
hist for histogram
box for boxplot
kde or density for density plots
area for area plots
scatter for scatter plots
hexbin for hexagonal bin plots
pie for pie plots

syntax: data.plot(kind=’graph_type’)

1
data.plot(x ='Name of State / UT', y='Total Confirmed cases *',kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x2272464b148>

png

Concatenating DataFrame with another

We can concatenate DataFrame using pd.concat method

syntax: pd.concat([dataframe1,dataframe2,….],axis= 0 or 1)

pass list of DataFrame to which are needed to concatenate and axis from where to concatenate

1
2
3
4
5
6
7
8
9
10
11
data1 = {'Enrollment No.': [1,2,3,4,5,6,7,8,9,10],
        'Score': [1500,1520,1525,1523,1515,1540,1545,1560,1555,1565]
       }
data2 = {'Class': [0,1,2,1,1,2,0,0,1,2],
        'Place': [1,3,1,2,5,1,5,4,2,1]
       }
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df=pd.concat([df1,df2],axis=1)
df.head(5)

	Enrollment No.	Score	Class	Place
0	1	1500	0	1
1	2	1520	1	3
2	3	1525	2	1
3	4	1523	1	2
4	5	1515	1	5

Checking null values in column

pd.isna function is used to check for missing values like NaN or blank etc.

syntax: pd.isna(data[column_name])

1
pd.isna(data['Death'])

   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
Name: Death, dtype: bool

So thats what you needed to get started for everyday data science work. There are lot of other features but these can be learned as you progress your journey. All the Best !!!

IPython Notebook Link

References

pandas docs

beginner
python

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	19	1	0
1	9	0	0
2	11	0	1
3	8	0	0
4	7	0	0
5	53	6	2
6	5	0	0
7	58	1	5
8	33	17	0
9	3	0	1
10	31	1	2
11	80	5	3
12	194	19	1
13	13	3	0
14	33	0	2
15	193	25	8
16	1	0	0
17	1	0	0
18	3	0	0
19	1	0	0
20	38	1	1
21	57	3	0
22	50	4	1
23	69	1	1
24	7	2	0
25	75	11	0
26	19	0	1

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

	Total Confirmed cases *	Cured/Discharged/Migrated
0	19	1
1	9	0
2	11	0
3	8	0
4	7	0
5	53	6
6	5	0
7	58	1
8	33	17
9	3	0
10	31	1
11	80	5
12	194	19
13	13	3
14	33	0
15	193	25
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	3
22	50	4
23	69	1
24	7	2
25	75	11
26	19	0

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	19	1	0
1	9	0	0
2	11	0	1
3	8	0	0
4	7	0	0
5	53	6	2
6	5	0	0
7	58	1	5
8	33	17	0
9	3	0	1
10	31	1	2
11	80	5	3
12	194	19	1
13	13	3	0
14	33	0	2
15	193	25	8
16	1	0	0
17	1	0	0
18	3	0	0
19	1	0	0
20	38	1	1
21	57	3	0
22	50	4	1
23	69	1	1
24	7	2	0
25	75	11	0
26	19	0	1

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

	Total Confirmed cases *	Cured/Discharged/Migrated
0	19	1
1	9	0
2	11	0
3	8	0
4	7	0
5	53	6
6	5	0
7	58	1
8	33	17
9	3	0
10	31	1
11	80	5
12	194	19
13	13	3
14	33	0
15	193	25
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	3
22	50	4
23	69	1
24	7	2
25	75	11
26	19	0

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

	Total Confirmed cases *	Cured/Discharged/Migrated	Death
0	19	1	0
1	9	0	0
2	11	0	1
3	8	0	0
4	7	0	0
5	53	6	2
6	5	0	0
7	58	1	5
8	33	17	0
9	3	0	1
10	31	1	2
11	80	5	3
12	194	19	1
13	13	3	0
14	33	0	2
15	193	25	8
16	1	0	0
17	1	0	0
18	3	0	0
19	1	0	0
20	38	1	1
21	57	3	0
22	50	4	1
23	69	1	1
24	7	2	0
25	75	11	0
26	19	0	1

	Total Confirmed cases *	Death
0	19	0
1	9	0
2	11	1
3	8	0
4	7	0
5	53	2
6	5	0
7	58	5
8	33	0
9	3	1
10	31	2
11	80	3
12	194	1
13	13	0
14	33	2
15	193	8
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	0
22	50	1
23	69	1
24	7	0
25	75	0
26	19	1

	Total Confirmed cases *	Cured/Discharged/Migrated
0	19	1
1	9	0
2	11	0
3	8	0
4	7	0
5	53	6
6	5	0
7	58	1
8	33	17
9	3	0
10	31	1
11	80	5
12	194	19
13	13	3
14	33	0
15	193	25
16	1	0
17	1	0
18	3	0
19	1	0
20	38	1
21	57	3
22	50	4
23	69	1
24	7	2
25	75	11
26	19	0