pandas is a fast, powerful, flexible and easy to use open-source data analysis and manipulation tool, built on top of the python programming language. This post acts as a cheat-sheet for using the pandas library in python for everyday data science. It contains some important functions and submodules of pandas library which are used day to day in data science and machine learning.
pandas can be installed easily by using pip. First, create a virtual environment and activate it. Then run the command below to install pandas.
1
pip install pandas
Create a python project or open an existing project and import pandas using
1
import pandas
we can also use an alias pd to use pandas into the project by importing pandas as pd
1
import pandas as pd
data = pd.read_csv(“path_to_csv_file”)
1
data = pd.read_csv("corona.csv")
this will read and load CSV file from memory and creates a pandas DataFrame object.
data.head(number_of_columns)
this will help to visualize how loaded data looks. It is stored in the DataFrame object which is pandas way to structure CSV data in the form of tables. ex:
1
data.head(5) #shows first 5 rows of dataframe
Name of State / UT | Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|---|
0 | Andhra Pradesh | 19 | 1 | 0 |
1 | Andaman and Nicobar Islands | 9 | 0 | 0 |
2 | Bihar | 11 | 0 | 1 |
3 | Chandigarh | 8 | 0 | 0 |
4 | Chhattisgarh | 7 | 0 | 0 |
Similarly, this will help to visualize the last part of data in DataFrame
data.tail(numbers_of_columns)
1
data.tail(5)
Name of State / UT | Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|---|
22 | Tamil Nadu | 50 | 4 | 1 |
23 | Telengana | 69 | 1 | 1 |
24 | Uttarakhand | 7 | 2 | 0 |
25 | Uttar Pradesh | 75 | 11 | 0 |
26 | West Bengal | 19 | 0 | 1 |
shape property gives out the shape or dimension of DataFrame as a tuple of the form (number_of_rows, number_of_columns)
1
data.shape
(27, 4)
size gives out the number of elements or number of unique cells in DataFrame.
1
data.size
108
len on the DataFrame object gives out the number of rows in DataFrame.
1
len(data)
27
data[‘column_name’]
data[[‘column_1’,’column_2’,’column_3’]]
1
data['Death']
0 0
1 0
2 1
3 0
4 0
5 2
6 0
7 5
8 0
9 1
10 2
11 3
12 1
13 0
14 2
15 8
16 0
17 0
18 0
19 0
20 1
21 0
22 1
23 1
24 0
25 0
26 1
Name: Death, dtype: int64
1
data[['Name of State / UT','Death']]
Name of State / UT | Death | |
---|---|---|
0 | Andhra Pradesh | 0 |
1 | Andaman and Nicobar Islands | 0 |
2 | Bihar | 1 |
3 | Chandigarh | 0 |
4 | Chhattisgarh | 0 |
5 | Delhi | 2 |
6 | Goa | 0 |
7 | Gujarat | 5 |
8 | Haryana | 0 |
9 | Himachal Pradesh | 1 |
10 | Jammu and Kashmir | 2 |
11 | Karnataka | 3 |
12 | Kerala | 1 |
13 | Ladakh | 0 |
14 | Madhya Pradesh | 2 |
15 | Maharashtra | 8 |
16 | Manipur | 0 |
17 | Mizoram | 0 |
18 | Odisha | 0 |
19 | Puducherry | 0 |
20 | Punjab | 1 |
21 | Rajasthan | 0 |
22 | Tamil Nadu | 1 |
23 | Telengana | 1 |
24 | Uttarakhand | 0 |
25 | Uttar Pradesh | 0 |
26 | West Bengal | 1 |
sort and return DataFrame column specified
1
data.sort_values(['Total Confirmed cases *']) #sort and return DataFrame column
Name of State / UT | Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|---|
19 | Puducherry | 1 | 0 | 0 |
17 | Mizoram | 1 | 0 | 0 |
16 | Manipur | 1 | 0 | 0 |
18 | Odisha | 3 | 0 | 0 |
9 | Himachal Pradesh | 3 | 0 | 1 |
6 | Goa | 5 | 0 | 0 |
4 | Chhattisgarh | 7 | 0 | 0 |
24 | Uttarakhand | 7 | 2 | 0 |
3 | Chandigarh | 8 | 0 | 0 |
1 | Andaman and Nicobar Islands | 9 | 0 | 0 |
2 | Bihar | 11 | 0 | 1 |
13 | Ladakh | 13 | 3 | 0 |
0 | Andhra Pradesh | 19 | 1 | 0 |
26 | West Bengal | 19 | 0 | 1 |
10 | Jammu and Kashmir | 31 | 1 | 2 |
14 | Madhya Pradesh | 33 | 0 | 2 |
8 | Haryana | 33 | 17 | 0 |
20 | Punjab | 38 | 1 | 1 |
22 | Tamil Nadu | 50 | 4 | 1 |
5 | Delhi | 53 | 6 | 2 |
21 | Rajasthan | 57 | 3 | 0 |
7 | Gujarat | 58 | 1 | 5 |
23 | Telengana | 69 | 1 | 1 |
25 | Uttar Pradesh | 75 | 11 | 0 |
11 | Karnataka | 80 | 5 | 3 |
15 | Maharashtra | 193 | 25 | 8 |
12 | Kerala | 194 | 19 | 1 |
pandas DataFrame provides powerful properties loc and iloc to retrieve data from DataFrame. This is the most widely used technique if we have lots of features to retrieve.
loc selects things by using the label
loc has following valid inputs -
A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along with the index.).
A list or array of labels [‘a’, ‘b’, ‘c’].
A slice object with labels’ a ‘ : ‘ f ‘(Note that contrary to usual python slices, both the start and the stop are included.)
data.loc[row_needed,column_needed] #loc with input type 1
data.loc[[row_needed(multiple)],column_needed] #loc with input type 2
data.loc[row_range_needed,column_range_needed] #loc with input type 3
Example:
1
data.loc[:,'Total Confirmed cases *'] #selects all rows from column Total Confirmed cases
0 19
1 9
2 11
3 8
4 7
5 53
6 5
7 58
8 33
9 3
10 31
11 80
12 194
13 13
14 33
15 193
16 1
17 1
18 3
19 1
20 38
21 57
22 50
23 69
24 7
25 75
26 19
Name: Total Confirmed cases *, dtype: int64
1
data.loc[:,['Total Confirmed cases *','Death']]
Total Confirmed cases * | Death | |
---|---|---|
0 | 19 | 0 |
1 | 9 | 0 |
2 | 11 | 1 |
3 | 8 | 0 |
4 | 7 | 0 |
5 | 53 | 2 |
6 | 5 | 0 |
7 | 58 | 5 |
8 | 33 | 0 |
9 | 3 | 1 |
10 | 31 | 2 |
11 | 80 | 3 |
12 | 194 | 1 |
13 | 13 | 0 |
14 | 33 | 2 |
15 | 193 | 8 |
16 | 1 | 0 |
17 | 1 | 0 |
18 | 3 | 0 |
19 | 1 | 0 |
20 | 38 | 1 |
21 | 57 | 0 |
22 | 50 | 1 |
23 | 69 | 1 |
24 | 7 | 0 |
25 | 75 | 0 |
26 | 19 | 1 |
1
data.loc[:,'Total Confirmed cases *':'Death'] #( including death column [refer input type 3])
Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|
0 | 19 | 1 | 0 |
1 | 9 | 0 | 0 |
2 | 11 | 0 | 1 |
3 | 8 | 0 | 0 |
4 | 7 | 0 | 0 |
5 | 53 | 6 | 2 |
6 | 5 | 0 | 0 |
7 | 58 | 1 | 5 |
8 | 33 | 17 | 0 |
9 | 3 | 0 | 1 |
10 | 31 | 1 | 2 |
11 | 80 | 5 | 3 |
12 | 194 | 19 | 1 |
13 | 13 | 3 | 0 |
14 | 33 | 0 | 2 |
15 | 193 | 25 | 8 |
16 | 1 | 0 | 0 |
17 | 1 | 0 | 0 |
18 | 3 | 0 | 0 |
19 | 1 | 0 | 0 |
20 | 38 | 1 | 1 |
21 | 57 | 3 | 0 |
22 | 50 | 4 | 1 |
23 | 69 | 1 | 1 |
24 | 7 | 2 | 0 |
25 | 75 | 11 | 0 |
26 | 19 | 0 | 1 |
note: using loc slices contrary to usual python slices, both the start and the stop is included, when present in the index ie.. in ‘ a ‘ : ‘ c ‘ both ‘ a ‘ and ‘ c ‘ row are inclusive.
iloc selects rows and columns using integer positions
iloc has the following valid inputs -
1: An integer e.g. 5.
2: A list or array of integers [ 4, 3, 0 ].
3: A slice object with ints 1 : 7.
4: A boolean array.
data.iloc[row_index_needed,column_index_needed] #iloc with type 1
data.iloc[[row_index_needed(multiple)],column_index_needed] #iloc with type 2
data.iloc[row_index_range_needed,column_index_range_needed] #iloc with type 3
Example:
1
data.iloc[:,1] #selects all rows of column with index 1
0 19
1 9
2 11
3 8
4 7
5 53
6 5
7 58
8 33
9 3
10 31
11 80
12 194
13 13
14 33
15 193
16 1
17 1
18 3
19 1
20 38
21 57
22 50
23 69
24 7
25 75
26 19
Name: Total Confirmed cases *, dtype: int64
1
data.iloc[:,[1,3]] #selects every row of column with index 1 and 4
Total Confirmed cases * | Death | |
---|---|---|
0 | 19 | 0 |
1 | 9 | 0 |
2 | 11 | 1 |
3 | 8 | 0 |
4 | 7 | 0 |
5 | 53 | 2 |
6 | 5 | 0 |
7 | 58 | 5 |
8 | 33 | 0 |
9 | 3 | 1 |
10 | 31 | 2 |
11 | 80 | 3 |
12 | 194 | 1 |
13 | 13 | 0 |
14 | 33 | 2 |
15 | 193 | 8 |
16 | 1 | 0 |
17 | 1 | 0 |
18 | 3 | 0 |
19 | 1 | 0 |
20 | 38 | 1 |
21 | 57 | 0 |
22 | 50 | 1 |
23 | 69 | 1 |
24 | 7 | 0 |
25 | 75 | 0 |
26 | 19 | 1 |
1
data.iloc[:,1:3] #selects every row of column with index 1 and 2 only
Total Confirmed cases * | Cured/Discharged/Migrated | |
---|---|---|
0 | 19 | 1 |
1 | 9 | 0 |
2 | 11 | 0 |
3 | 8 | 0 |
4 | 7 | 0 |
5 | 53 | 6 |
6 | 5 | 0 |
7 | 58 | 1 |
8 | 33 | 17 |
9 | 3 | 0 |
10 | 31 | 1 |
11 | 80 | 5 |
12 | 194 | 19 |
13 | 13 | 3 |
14 | 33 | 0 |
15 | 193 | 25 |
16 | 1 | 0 |
17 | 1 | 0 |
18 | 3 | 0 |
19 | 1 | 0 |
20 | 38 | 1 |
21 | 57 | 3 |
22 | 50 | 4 |
23 | 69 | 1 |
24 | 7 | 2 |
25 | 75 | 11 |
26 | 19 | 0 |
1
data.iloc[0:3,:] #selects rows 0 to 2 and every column
Name of State / UT | Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|---|
0 | Andhra Pradesh | 19 | 1 | 0 |
1 | Andaman and Nicobar Islands | 9 | 0 | 0 |
2 | Bihar | 11 | 0 | 1 |
both iloc and loc also take callables as valid input. Callables are functions that return row and column needed to select. This is useful when we have to filter some information before selecting it.
data.loc[callable_function]
data.iloc[callable_function]
Example
1
data.loc[lambda data:data["Name of State / UT"] == 'Uttarakhand'] #selects all rows and columns with country India
Name of State / UT | Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|---|
24 | Uttarakhand | 7 | 2 | 0 |
1
data.iloc[lambda x:x.index % 2 == 0] #selects all rows and columns with even index
Name of State / UT | Total Confirmed cases * | Cured/Discharged/Migrated | Death | |
---|---|---|---|---|
0 | Andhra Pradesh | 19 | 1 | 0 |
2 | Bihar | 11 | 0 | 1 |
4 | Chhattisgarh | 7 | 0 | 0 |
6 | Goa | 5 | 0 | 0 |
8 | Haryana | 33 | 17 | 0 |
10 | Jammu and Kashmir | 31 | 1 | 2 |
12 | Kerala | 194 | 19 | 1 |
14 | Madhya Pradesh | 33 | 0 | 2 |
16 | Manipur | 1 | 0 | 0 |
18 | Odisha | 3 | 0 | 0 |
20 | Punjab | 38 | 1 | 1 |
22 | Tamil Nadu | 50 | 4 | 1 |
24 | Uttarakhand | 7 | 2 | 0 |
26 | West Bengal | 19 | 0 | 1 |
We can also plot a DataFrame in the graph which uses Matplotlib library to plot graph. This can be used to plot simple graphs without needing to write plot code.
It plot graphs of graph type specified
Following types of graphs are valid in pandas:
bar or barh for bar plots
hist for histogram
box for boxplot
kde or density for density plots
area for area plots
scatter for scatter plots
hexbin for hexagonal bin plots
pie for pie plots
syntax: data.plot(kind=’graph_type’)
1
data.plot(x ='Name of State / UT', y='Total Confirmed cases *',kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x2272464b148>
We can concatenate DataFrame using pd.concat method
syntax: pd.concat([dataframe1,dataframe2,….],axis= 0 or 1)
pass list of DataFrame to which are needed to concatenate and axis from where to concatenate
1
2
3
4
5
6
7
8
9
10
11
data1 = {'Enrollment No.': [1,2,3,4,5,6,7,8,9,10],
'Score': [1500,1520,1525,1523,1515,1540,1545,1560,1555,1565]
}
data2 = {'Class': [0,1,2,1,1,2,0,0,1,2],
'Place': [1,3,1,2,5,1,5,4,2,1]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df=pd.concat([df1,df2],axis=1)
df.head(5)
Enrollment No. | Score | Class | Place | |
---|---|---|---|---|
0 | 1 | 1500 | 0 | 1 |
1 | 2 | 1520 | 1 | 3 |
2 | 3 | 1525 | 2 | 1 |
3 | 4 | 1523 | 1 | 2 |
4 | 5 | 1515 | 1 | 5 |
pd.isna function is used to check for missing values like NaN or blank etc.
syntax: pd.isna(data[column_name])
1
pd.isna(data['Death'])
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
Name: Death, dtype: bool
So thats what you needed to get started for everyday data science work. There are lot of other features but these can be learned as you progress your journey. All the Best !!!