PyTorch RCNN Tutorial: Dive into Object Detection with RCNN
Object detection is a computer vision task in which, given an image, we must locate the objects in it and classify each detected object into one of several categories.
We will be using the 2007 version of the Pascal VOC dataset, which provides standardised image datasets for object class recognition. More information about the dataset can be accessed at this url. We will use torchvision.datasets.VOCDetection to load the dataset.
Loading dataset
# loading detection data
import torchvision

voc_dataset_train = torchvision.datasets.VOCDetection(
    root="content/voc", image_set="train", download=True, year="2007"
)
voc_dataset_val = torchvision.datasets.VOCDetection(
    root="content/voc", image_set="val", download=True, year="2007"
)
Downloading http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar to content/voc/VOCtrainval_06-Nov-2007.tar
100%|██████████| 460032000/460032000 [00:14<00:00, 31429061.40it/s]
Extracting content/voc/VOCtrainval_06-Nov-2007.tar to content/voc
Using downloaded and verified file: content/voc/VOCtrainval_06-Nov-2007.tar
Extracting content/voc/VOCtrainval_06-Nov-2007.tar to content/voc
Each sample in the dataset is a tuple with two elements (index 0: the image, index 1: the annotations).
We will only use the object annotation information, which is a list of dictionaries containing each object's class and its bounding box.
Image shapes in the Pascal VOC dataset are not uniform; it contains images of different sizes.
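For example, printing one annotation shows the structure VOCDetection returns (the exact values will differ by image):

# inspect one sample: VOCDetection returns (PIL image, annotation dict parsed from the VOC XML)
image, target = voc_dataset_train[0]
print(target["annotation"]["object"][0]["name"])    # e.g. 'dog'
print(target["annotation"]["object"][0]["bndbox"])  # e.g. {'xmin': '48', 'ymin': '240', 'xmax': '195', 'ymax': '371'} (note: values are strings)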
Next we will find all unique object classes present in the dataset.
all_objs = []
for ds in voc_dataset_train:
    obj_annots = ds[1]["annotation"]["object"]
    for obj in obj_annots:
        all_objs.append(obj["name"])

unique_class_labels = set(all_objs)
print("Number of unique objects in dataset: ", len(unique_class_labels))
print("Unique labels in dataset: \n", unique_class_labels)
Number of unique objects in dataset: 20
Unique labels in dataset:
{'diningtable', 'pottedplant', 'train', 'cat', 'cow', 'tvmonitor', 'bottle', 'bicycle', 'motorbike', 'aeroplane', 'bus', 'car', 'sofa', 'chair', 'sheep', 'bird', 'boat', 'person', 'horse', 'dog'}
Next we will create two dictionaries: one maps class labels to integers, and the other maps integers back to class labels.
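A minimal sketch of those two maps, assuming index 0 is reserved for the background class (consistent with how the background label and class_weights are used later):

# build label maps; index 0 is reserved for the background class
label_2_idx = {name: i + 1 for i, name in enumerate(sorted(unique_class_labels))}
label_2_idx["bg"] = 0  # "bg" is an assumed name for the background entry
idx_2_label = {v: k for k, v in label_2_idx.items()}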
# img: image as an np array
# boxes: [[x_min, y_min, x_max, y_max], ...]
# labels: labels of the objects inside the bounding boxes
# scores: array of probabilities that the given object is present in each bounding box
# class_map: dictionary that maps an index to a class name
import cv2
import numpy as np

def draw_boxes(img, boxes, scores, labels, class_map=None):
    nums = len(boxes)
    for i in range(nums):
        x1y1 = tuple((np.array(boxes[i][0:2])).astype(np.int32))
        x2y2 = tuple((np.array(boxes[i][2:4])).astype(np.int32))
        img = cv2.rectangle(img, x1y1, x2y2, (255, 0, 0), 2)
        label = int(labels[i])
        if class_map is not None:
            label_txt = class_map[label]
        else:
            label_txt = str(label)
        img = cv2.putText(
            img,
            "{} {:.4f}".format(label_txt, scores[i]),
            x1y1,
            cv2.FONT_HERSHEY_COMPLEX_SMALL,
            1,
            (0, 0, 255),
            2,
        )
    return img
We will plot one image along with its ground-truth bounding boxes using the draw_boxes function.
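A sketch of that visualization, assuming the label maps defined above; VOC stores box coordinates as strings, so they are cast to int first:

import matplotlib.pyplot as plt

# draw ground-truth boxes (score 1.0 for true boxes) on one training image
image, target = voc_dataset_train[0]
img = np.array(image)
boxes, labels = [], []
for obj in target["annotation"]["object"]:
    bb = obj["bndbox"]
    boxes.append([int(bb["xmin"]), int(bb["ymin"]), int(bb["xmax"]), int(bb["ymax"])])
    labels.append(label_2_idx[obj["name"]])
img = draw_boxes(img, boxes, np.ones(len(boxes)), labels, class_map=idx_2_label)
plt.imshow(img)
plt.axis("off")
plt.show()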
def calculate_iou_score(box_1, box_2):
    '''
    box_1 = a single ground truth bounding box
    box_2 = a single predicted bounding box
    '''
    box_1_x1 = box_1[0]
    box_1_y1 = box_1[1]
    box_1_x2 = box_1[2]
    box_1_y2 = box_1[3]
    box_2_x1 = box_2[0]
    box_2_y1 = box_2[1]
    box_2_x2 = box_2[2]
    box_2_y2 = box_2[3]
    # coordinates of the intersection rectangle
    x1 = np.maximum(box_1_x1, box_2_x1)
    y1 = np.maximum(box_1_y1, box_2_y1)
    x2 = np.minimum(box_1_x2, box_2_x2)
    y2 = np.minimum(box_1_y2, box_2_y2)
    # the +1 accounts for VOC's pixel-inclusive coordinates
    area_of_intersection = max(0, x2 - x1 + 1) * max(0, y2 - y1 + 1)
    area_box_1 = (box_1_x2 - box_1_x1 + 1) * (box_1_y2 - box_1_y1 + 1)
    area_box_2 = (box_2_x2 - box_2_x1 + 1) * (box_2_y2 - box_2_y1 + 1)
    area_of_union = area_box_1 + area_box_2 - area_of_intersection
    return area_of_intersection / float(area_of_union)
RCNN uses the selective search algorithm to generate region proposals by merging similar pixels into regions. The regions produced by this step are warped, resized, and preprocessed, then passed through a CNN that produces feature vectors. These feature vectors are then used for classification and bounding-box regression, which yields the objects' bounding boxes and their classes. RCNN gave a significant performance boost on the VOC07 dataset, with a large improvement of mean Average Precision (mAP) from 33.7% with DPM-v5 to 58.5%. The algorithm is fast compared with sliding a window over the image and passing each window through a CNN, but it is still far too slow for real-time object detection.
Steps to Train RCNN
We apply the selective search algorithm to find box proposals.
We keep the proposed boxes for which iou_score(proposed_box, true_box) > threshold.
We also save some boxes containing no objects and label them 0 (background). These will be useful while training the classifier.
We then train a CNN-based model on the prepared dataset (see the sketch below).
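The dataset-preparation code further down calls a helper named process_data_for_rcnn that the original does not reproduce. Here is a minimal sketch of what such a helper could look like, assuming each proposal is labeled by its best IoU against the ground-truth boxes, with a 0.2 background cutoff as an illustrative choice:

def process_data_for_rcnn(image, rects, label_2_idx, boxes_annots, max_iou_threshold, max_boxes):
    # hypothetical helper: crop proposals and label them via IoU with the ground truth
    true_boxes = [
        [int(a["bndbox"][k]) for k in ("xmin", "ymin", "xmax", "ymax")]
        for a in boxes_annots
    ]
    true_labels = [label_2_idx[a["name"]] for a in boxes_annots]
    images, classes = [], []
    num_fg = num_bg = 0
    for rect in rects:
        ious = [calculate_iou_score(tb, rect) for tb in true_boxes]
        best = int(np.argmax(ious))
        x1, y1, x2, y2 = rect
        crop = cv2.resize(image[y1:y2, x1:x2], (224, 224))
        if ious[best] > max_iou_threshold and num_fg < max_boxes:
            # proposal overlaps a ground-truth box strongly: keep it with that class
            images.append(crop)
            classes.append(true_labels[best])
            num_fg += 1
        elif ious[best] < 0.2 and num_bg < max_boxes:  # 0.2 is an assumed background cutoff
            images.append(crop)
            classes.append(0)  # background
            num_bg += 1
    return images, classes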
Steps for inference with RCNN
We first apply the selective search algorithm to the image.
We pass all proposed bounding boxes through the CNN model trained in the steps above.
We postprocess the outputs from the model (this includes selecting the best boxes and applying non-maximum suppression; we will look at these later).
Creating the dataset for RCNN takes some time, so we have already created a processed version of the dataset for RCNN training and provided it through Kaggle datasets. The size of the preprocessed dataset is approximately 5 GB.
To download the dataset from Kaggle we first need to download the Kaggle API token file and upload it to Colab. Saving it to Google Drive and then loading it from there is recommended.
Setting up kaggle API
Go to your Kaggle account and scroll to the API section.
Click on Create New Token - it will download a kaggle.json file to your machine.
Next we need to upload kaggle.json to Google Drive or directly to Colab and use it to authenticate with the Kaggle servers.
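A typical Colab setup looks like the following; the dataset slug is a placeholder, since the original does not state it:

# place the API token where the Kaggle CLI expects it
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/   # path assumes the token was saved to Drive
!chmod 600 ~/.kaggle/kaggle.json
# download the preprocessed dataset (replace <dataset-slug> with the actual slug)
!kaggle datasets download -d <dataset-slug>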
import os
import pickle
from tqdm import tqdm

all_images = []
all_labels = []
count = 0
if len(os.listdir(processed_data_save_path_train)) < 80000:
    for image, annot in tqdm(voc_dataset_train):
        image = np.array(image)
        boxes_annots = annot["annotation"]["object"]
        ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
        ss.setBaseImage(image)
        ss.switchToSelectiveSearchFast()
        rects = ss.process()[:max_selections]
        rects = np.array([[x, y, x + w, y + h] for x, y, w, h in rects])
        images, classes = process_data_for_rcnn(
            image, rects, label_2_idx, boxes_annots, max_iou_threshold, max_boxes
        )
        count += 1
        all_images += images
        all_labels += classes
    # saving processed data to pickle files
    for idx, (image, label) in enumerate(zip(all_images, all_labels)):
        with open(os.path.join(processed_data_save_path_train, f"img_{idx}.pkl"), "wb") as pkl:
            pickle.dump({"image": image, "label": label}, pkl)
else:
    print("Data Already Prepared.")
all_images = []
all_labels = []
count = 0
if len(os.listdir(processed_data_save_path_val)) < 80000:
    for image, annot in tqdm(voc_dataset_val):
        image = np.array(image)
        boxes_annots = annot["annotation"]["object"]
        ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
        ss.setBaseImage(image)
        ss.switchToSelectiveSearchFast()
        rects = ss.process()[:max_selections]
        rects = np.array([[x, y, x + w, y + h] for x, y, w, h in rects])
        images, classes = process_data_for_rcnn(
            image, rects, label_2_idx, boxes_annots, max_iou_threshold, max_boxes
        )
        count += 1
        all_images += images
        all_labels += classes
    # saving processed data to pickle files
    for idx, (image, label) in enumerate(zip(all_images, all_labels)):
        with open(os.path.join(processed_data_save_path_val, f"img_{idx}.pkl"), "wb") as pkl:
            pickle.dump({"image": image, "label": label}, pkl)
else:
    print("Data Already Prepared.")
Data Already Prepared.
Creating PyTorch dataset
We use torch.utils.data.Dataset to create the dataset. This class reads the processed data folder generated by the methods above.
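The dataset class itself is not reproduced here; a minimal sketch consistent with the sample shapes printed below (224x224 crops normalized with ImageNet statistics, int64 labels) might look like this, with RCNNDataset as a hypothetical name:

import os
import pickle
import torch
from torchvision import transforms

class RCNNDataset(torch.utils.data.Dataset):  # hypothetical name
    def __init__(self, processed_data_path):
        self.processed_data_path = processed_data_path
        self.files = sorted(os.listdir(processed_data_path))
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # each pickle file holds one {"image": ..., "label": ...} sample
        with open(os.path.join(self.processed_data_path, self.files[idx]), "rb") as pkl:
            sample = pickle.load(pkl)
        image = self.transform(sample["image"])
        label = torch.tensor(sample["label"], dtype=torch.int64)
        return image, label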
To visualize batches of data we have defined an imshow function that takes a batch of images and labels as input and plots them in a grid.
def imshow(inp, labels, num_rows=16, num_cols=4):
    """Display image for Tensor."""
    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 30))
    axes = axes.ravel()
    # undo the channel-wise normalization before display
    mean = torch.tensor([0.485, 0.456, 0.406]).reshape(1, -1, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).reshape(1, -1, 1, 1)
    inp = std * inp + mean
    inp = inp.permute((0, 2, 3, 1))
    inp = inp.type(torch.uint8)
    for idx, ax in enumerate(axes):
        ax.imshow(inp[idx])
        ax.set_title(labels[idx])
        ax.grid(False)
        ax.set_axis_off()
    plt.show()
print("Train Dataset one sample images shape: ",train_dataset[0][0].shape)print("Train Dataset one sample labels shape: ",train_dataset[0][1].shape)print("Train Dataset one sample images dtype: ",train_dataset[0][0].dtype)print("Train Dataset one sample labels dtype: ",train_dataset[0][1].dtype)print("Train Dataset number of samples: ",len(train_dataset))
Train Dataset one sample images shape: torch.Size([3, 224, 224])
Train Dataset one sample labels shape: torch.Size([])
Train Dataset one sample images dtype: torch.float32
Train Dataset one sample labels dtype: torch.int64
Train Dataset number of samples: 80217
print("Val Dataset one sample images shape: ",val_dataset[0][0].shape)print("Val Dataset one sample labels shape: ",val_dataset[0][1].shape)print("Val Dataset one sample images dtype: ",val_dataset[0][0].dtype)print("Val Dataset one sample labels dtype: ",val_dataset[0][1].dtype)print("Val Dataset number of samples: ",len(val_dataset))
Val Dataset one sample images shape: torch.Size([3, 224, 224])
Val Dataset one sample labels shape: torch.Size([])
Val Dataset one sample images dtype: torch.float32
Val Dataset one sample labels dtype: torch.int64
Val Dataset number of samples: 80139
Next we define the dataloaders for training the model using torch.utils.data.DataLoader, as sketched below.
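The exact arguments are not given in the original; the batch sizes below are inferred from the logged batch counts (2507 train batches for 80217 samples suggests a batch size of 32, and 1253 val batches for 80139 samples suggests 64):

from torch.utils.data import DataLoader

# shuffle only the training data; num_workers is an assumed value
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=2)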
We define a function build_model that takes any ResNet architecture from the torchvision library and builds a model based on the number of classes in the dataset.
import torch.nn as nn

def build_model(backbone, num_classes):
    num_ftrs = backbone.fc.in_features
    # num_classes = number of object categories; +1 for the background class
    backbone.fc = nn.Sequential(
        nn.Dropout(0.2),
        nn.Linear(num_ftrs, 512),
        nn.Dropout(0.2),
        nn.Linear(512, num_classes + 1),
    )
    return backbone
We load resnet50 with pretrained weights from ImageNet. We also freeze the whole ResNet backbone, so only the classifier head will be trained.
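A sketch of that setup, assuming the current torchvision weights API (older torchvision versions use pretrained=True instead):

import torch
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
backbone = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
# freeze the backbone first; the classifier head added by build_model stays trainable
for param in backbone.parameters():
    param.requires_grad = False
model = build_model(backbone, num_classes=len(unique_class_labels)).to(device)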
We define class_weights to give more weight to the object classes and less to the background class. Then we define the cross-entropy loss and the Adam optimizer for training.
import torch.optim as optim

class_weights = [1.0] + [2.0] * len(unique_class_labels)  # 1 for bg and 2 for other classes
class_weights = torch.tensor(class_weights).to(device)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
torch.cuda.empty_cache()
num_epochs = 100
best_val_loss = 1000
epoch_train_losses = []
epoch_val_losses = []
train_accuracy = []
val_accuracy = []
count = 0
for idx in range(num_epochs):
    train_losses = []
    total_train = 0
    correct_train = 0
    model.train()
    for images, labels in tqdm(train_loader):
        optimizer.zero_grad()
        images = images.to(device)
        labels = labels.to(device)
        pred = model(images)
        loss = criterion(pred, labels)
        predicted = torch.argmax(pred, 1)
        total_train += labels.shape[0]
        correct_train += (predicted == labels).sum().item()
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    accuracy_train = (100 * correct_train) / total_train
    train_accuracy.append(accuracy_train)
    epoch_train_loss = np.mean(train_losses)
    epoch_train_losses.append(epoch_train_loss)
    val_losses = []
    total_val = 0
    correct_val = 0
    model.eval()
    with torch.no_grad():
        for images, labels in tqdm(val_loader):
            images = images.to(device)
            labels = labels.to(device)
            pred = model(images)
            loss = criterion(pred, labels)
            val_losses.append(loss.item())
            predicted = torch.argmax(pred, 1)
            total_val += labels.shape[0]
            correct_val += (predicted == labels).sum().item()
    accuracy_val = (100 * correct_val) / total_val
    val_accuracy.append(accuracy_val)
    epoch_val_loss = np.mean(val_losses)
    epoch_val_losses.append(epoch_val_loss)
    print('\nEpoch: {}/{}, Train Loss: {:.8f}, Train Accuracy: {:.8f}, Val Loss: {:.8f}, Val Accuracy: {:.8f}'.format(
        idx + 1, num_epochs, epoch_train_loss, accuracy_train, epoch_val_loss, accuracy_val))
    if epoch_val_loss < best_val_loss:
        best_val_loss = epoch_val_loss
        print("Saving the model state dictionary for Epoch: {} with Validation loss: {:.8f}".format(idx + 1, epoch_val_loss))
        torch.save(model.state_dict(), "rcnn_model.pt")
        count = 0
    else:
        count += 1
        if count == 5:
            break
100%|██████████| 2507/2507 [15:12<00:00, 2.75it/s]
100%|██████████| 1253/1253 [15:33<00:00, 1.34it/s]
Epoch: 1/100, Train Loss: 0.64668946, Train Accuracy: 86.98405575, Val Loss: 0.45590416, Val Accuracy: 89.13887121
Saving the model state dictionary for Epoch: 1 with Validation loss: 0.45590416
100%|██████████| 2507/2507 [08:06<00:00, 5.15it/s]
100%|██████████| 1253/1253 [07:13<00:00, 2.89it/s]
Epoch: 2/100, Train Loss: 0.41645256, Train Accuracy: 89.96596731, Val Loss: 0.48327551, Val Accuracy: 89.27363706
100%|██████████| 2507/2507 [06:44<00:00, 6.19it/s]
100%|██████████| 1253/1253 [06:47<00:00, 3.08it/s]
Epoch: 3/100, Train Loss: 0.36805147, Train Accuracy: 90.77252951, Val Loss: 0.45782487, Val Accuracy: 89.73907835
100%|██████████| 2507/2507 [06:28<00:00, 6.45it/s]
100%|██████████| 1253/1253 [06:47<00:00, 3.08it/s]
Epoch: 4/100, Train Loss: 0.34133269, Train Accuracy: 91.15274817, Val Loss: 0.45103499, Val Accuracy: 89.39218109
Saving the model state dictionary for Epoch: 4 with Validation loss: 0.45103499
100%|██████████| 2507/2507 [06:32<00:00, 6.39it/s]
100%|██████████| 1253/1253 [06:56<00:00, 3.01it/s]
Epoch: 5/100, Train Loss: 0.32376182, Train Accuracy: 91.46315619, Val Loss: 0.60181320, Val Accuracy: 89.54067308
100%|██████████| 2507/2507 [06:42<00:00, 6.23it/s]
100%|██████████| 1253/1253 [07:29<00:00, 2.78it/s]
Epoch: 6/100, Train Loss: 0.31392360, Train Accuracy: 91.59779099, Val Loss: 0.48469237, Val Accuracy: 89.81020477
100%|██████████| 2507/2507 [07:20<00:00, 5.69it/s]
100%|██████████| 1253/1253 [07:38<00:00, 2.73it/s]
Epoch: 7/100, Train Loss: 0.29900573, Train Accuracy: 91.90570577, Val Loss: 0.46181227, Val Accuracy: 90.14836721
100%|██████████| 2507/2507 [07:35<00:00, 5.51it/s]
100%|██████████| 1253/1253 [07:41<00:00, 2.71it/s]
Epoch: 8/100, Train Loss: 0.29431356, Train Accuracy: 92.06277971, Val Loss: 0.45976700, Val Accuracy: 89.70538689
100%|██████████| 2507/2507 [07:14<00:00, 5.77it/s]
100%|██████████| 1253/1253 [07:27<00:00, 2.80it/s]
Epoch: 9/100, Train Loss: 0.28652644, Train Accuracy: 92.23481307, Val Loss: 0.47067098, Val Accuracy: 90.02982318
Inference
Next, we define some helper functions for inference. process_inputs processes an input image and returns box proposals and the corresponding image crops as PyTorch tensors.
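process_inputs is not reproduced in the original; a minimal sketch consistent with how it is called below (selective search proposals, crops resized to 224x224 and normalized like the training data) could be:

from torchvision import transforms

def process_inputs(image, max_selections=100):
    # hypothetical sketch: propose regions, then crop, resize, and normalize them
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    rects = ss.process()[:max_selections]
    boxes = np.array([[x, y, x + w, y + h] for x, y, w, h in rects])
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    crops = torch.stack([
        transform(cv2.resize(image[y1:y2, x1:x2], (224, 224)))
        for x1, y1, x2, y2 in boxes
    ])
    return crops, boxes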
Non-Maximum Suppression (NMS) is a computer vision method that selects a single bounding box from many overlapping bounding boxes in object detection. The usual criterion is to first discard boxes whose probability falls below a given threshold, and then, among boxes that overlap heavily, keep only the highest-scoring one.
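The process_outputs helper used below is not shown in the original; here is a sketch of the NMS step at its core, in the standard greedy keep-highest-score formulation, reusing calculate_iou_score from above:

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # greedy NMS: repeatedly keep the highest-scoring box and drop boxes that overlap it
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        remaining = []
        for i in order[1:]:
            if calculate_iou_score(boxes[best], boxes[i]) <= iou_threshold:
                remaining.append(i)
        order = np.array(remaining, dtype=int)
    return keep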
model.eval()
with torch.no_grad():
    output = model(prep_val_images.to(device))
# postprocess output from model
scores = torch.softmax(output, dim=1).cpu().numpy()
boxes, boxes_scores, boxes_labels = process_outputs(scores, prep_val_boxes, threshold=0.5, iou_threshold=0.5)
def predict(image, only_boxed_image=False, label_map=None, max_boxes=100, threshold=0.5, iou_threshold=0.5):
    # preprocess input image
    prep_val_images, prep_val_boxes = process_inputs(image, max_selections=max_boxes)
    model.eval()
    with torch.no_grad():
        output = model(prep_val_images.to(device))
    # postprocess output from model
    scores = torch.softmax(output, dim=1).cpu().numpy()
    boxes, boxes_scores, boxes_labels = process_outputs(scores, prep_val_boxes, threshold=threshold, iou_threshold=iou_threshold)
    if only_boxed_image:
        box_image = draw_boxes(image, boxes, boxes_scores, boxes_labels, label_map)
        return box_image
    return boxes, boxes_scores, boxes_labels
As an exercise, modify the process_data_for_rcnn function to use multiprocessing via the Joblib module. Joblib is a Python library for executing tasks in parallel as pipeline jobs rather than sequentially, one after another.
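A sketch of how the per-image work could be parallelized with joblib's Parallel and delayed helpers, assuming a hypothetical per-sample worker that wraps the selective-search and labeling steps:

from joblib import Parallel, delayed

def process_one_sample(image, annot):
    # hypothetical worker: selective search + IoU labeling for one image
    image = np.array(image)
    boxes_annots = annot["annotation"]["object"]
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    rects = ss.process()[:max_selections]
    rects = np.array([[x, y, x + w, y + h] for x, y, w, h in rects])
    return process_data_for_rcnn(image, rects, label_2_idx, boxes_annots, max_iou_threshold, max_boxes)

# run the workers across all CPU cores
results = Parallel(n_jobs=-1)(
    delayed(process_one_sample)(image, annot) for image, annot in voc_dataset_train
)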