Continuing from the previous post, 【多模态融合】TransFusion学习笔记(1) (TransFusion study notes, part 1).

From TransFusion-L to TransFusion

OK, we can finally show the complete framework diagram from the paper. The first time I looked at it, several questions came to mind:

Q: What does the Query Initialization pointed to by the dashed Image Guidance line mean?
Q: In the image branch, does "Image Features as K, V" mean the entire image feature map is used as K and V?
Q: With the second stage in place, is the Initial Prediction still needed?
Q: If the first-stage queries come from the pure LiDAR BEV feature map, is it reliable to use them to aggregate image features? After all, these are features from two different modalities.
Q: In the second stage's "Transformer Decoder Layer with SMCA", what does SMCA mean?
Q: If the final predictions came only from object queries of the pure LiDAR branch aggregating image features, that would surely not be enough. You might get more accurate boxes or classifications after refinement, but boxes missed by the LiDAR could never be recovered, so there should also be a mechanism to recover those misses.

With these questions in mind, let's continue the analysis against the paper and the code, still assuming a batch size of 2 and the nuScenes dataset. Speaking of nuScenes, it helps to roughly know its LiDAR and camera setup: a single 32-beam LiDAR sits on the roof, and 6 cameras are arranged in a 3-2-1 layout, so at inference time every sample in a batch comes with 6 images.

```python
# Source file: mmdet3d/models/dense_heads/transfusion_head.py
def forward(self, feats, img_feats, img_metas):
    """Forward pass.

    Args:
        feats (list[torch.Tensor]): Multi-level features, e.g.,
            features produced by FPN.

    Returns:
        tuple(list[dict]): Output results. first index by level, second index by layer
    """
    if img_feats is None:
        img_feats = [None]
    res = multi_apply(self.forward_single, feats, img_feats, [img_metas])
    assert len(res) == 1, 'only support one level features.'
    return res
```

Looking again at the entry point of the TransFusion head, forward, img_feats and img_metas now carry all the images and their features. img_feats has shape (12, 256, 112, 200), where 12 = batch (2) * 6 (number of cameras): batch and n_views are merged into a single dimension. Keeping this in mind is important.

```python
def forward_single(self, inputs, img_inputs, img_metas):
    """Forward function for CenterPoint.

    Args:
        inputs (torch.Tensor): Input feature map with the shape of
            [B, 512, 128(H), 128(W)]. (consistent with L748)

    Returns:
        list[dict]: Output results for tasks.
    """
    batch_size = inputs.shape[0]
    lidar_feat = self.shared_conv(inputs)  # [2, 128, 128, 128]
    lidar_feat_flatten = lidar_feat.view(batch_size, lidar_feat.shape[1], -1)  # [BS, C, H*W]
    bev_pos = self.bev_pos.repeat(batch_size, 1, 1).to(lidar_feat.device)

    if self.fuse_img:
        img_feat = self.shared_conv_img(img_inputs)  # [BS * n_views, C, H, W]

        img_h, img_w, num_channel = img_inputs.shape[-2], img_inputs.shape[-1], img_feat.shape[1]
        # [B, C, H, n_views, W]
        raw_img_feat = img_feat.view(batch_size, self.num_views, num_channel, img_h, img_w).permute(0, 2, 3, 1, 4)
        # [B, C, H, n_views*W]
        img_feat = raw_img_feat.reshape(batch_size, num_channel, img_h, img_w * self.num_views)
        # collapse the height dimension: [B, C, n_views*W]
        img_feat_collapsed = img_feat.max(2).values

        # positional encoding for image guided query initialization
        if self.img_feat_collapsed_pos is None:
            img_feat_collapsed_pos = self.img_feat_collapsed_pos = self.create_2D_grid(1, img_feat_collapsed.shape[-1]).to(img_feat.device)
        else:
            img_feat_collapsed_pos = self.img_feat_collapsed_pos

        bev_feat = lidar_feat_flatten
        for idx_view in range(self.num_views):
            # cross attention: LiDAR BEV features (Q) attend to the idx_view-th height-collapsed image features (K, V)
            bev_feat = self.decoder[2 + idx_view](...)
```

The code entered through the `if self.fuse_img` branch is what produces the fused LiDAR-camera BEV feature map F_lc.

(Figure: how the 6 image feature maps in a batch are turned into the height-collapsed K, V.)

(Figure: how the LiDAR features are fused with the 6 height-collapsed image feature maps.)

The dense LiDAR BEV features serve as Q, and the height-collapsed image features serve as K and V. Why collapse the image features along the height dimension? The authors explain this in the paper as well.

How to fuse LiDAR BEV features and image features into a more expressive BEV feature map has been explored in several other papers; a well-known example is BEVFusion, shown in the figure below.

BEVFusion's way of fusing features is very intuitive, but it has to encode the multi-view image features into BEV space via LSS or some other view transform, and then obtain the fused features with a Dynamic Fusion Module. This kind of fusion is simple and brute-force, and it is a hard association.

Consider a question: what happens if the BEV feature map produced by pure LiDAR in TransFusion-L is replaced with BEVFusion's multi-modal fused BEV feature map? The BEVFusion authors ran exactly that experiment.

Judging from the last column, on nuScenes validation the mAP and NDS go up by about 3 points and 1.1 points respectively. How to put it: it helps, but it does not feel like a big win. After all the effort of extracting image features from every view and encoding them into BEV space, the fused map only buys roughly 3 mAP over pure LiDAR; it is still largely the LiDAR carrying the result.
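Before moving on, here is a minimal, self-contained sketch of the tensor bookkeeping above. This is not the repo's code: the sizes are toy values, and the per-view TransformerDecoderLayer (with its learned positional embeddings) is replaced by a single plain nn.MultiheadAttention over all views at once, just to show how the height-collapsed image columns act as K, V for the LiDAR BEV queries:

```python
import torch
import torch.nn as nn

# Toy sizes: batch of 2, 6 cameras, C=128 channels, a 14 x 25 image feature map,
# and a 16 x 16 LiDAR BEV grid flattened to 256 tokens.
B, n_views, C, H, W = 2, 6, 128, 14, 25
bev_tokens = 16 * 16

img_feat = torch.randn(B * n_views, C, H, W)                               # [BS * n_views, C, H, W]
raw_img_feat = img_feat.view(B, n_views, C, H, W).permute(0, 2, 3, 1, 4)   # [B, C, H, n_views, W]
img_feat_wide = raw_img_feat.reshape(B, C, H, W * n_views)                 # [B, C, H, n_views*W]
img_feat_collapsed = img_feat_wide.max(2).values                           # [B, C, n_views*W]: one feature per image column

lidar_feat_flatten = torch.randn(B, C, bev_tokens)                         # [B, C, H*W] flattened BEV features

# Cross attention: every BEV cell is a query, the n_views*W collapsed image columns are keys/values.
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
q = lidar_feat_flatten.permute(0, 2, 1)                                    # [B, H*W, C]
kv = img_feat_collapsed.permute(0, 2, 1)                                   # [B, n_views*W, C]
bev_feat, _ = attn(q, kv, kv)                                              # [B, H*W, C] image-aware LiDAR BEV features
print(img_feat_collapsed.shape, bev_feat.shape)                            # [2, 128, 150], [2, 256, 128]
```

Collapsing the height dimension means each BEV query only attends to n_views*W column features instead of n_views*H*W pixel features, which is what keeps this fusion step cheap.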
```python
    #################################
    # image guided query initialization
    #################################
    if self.initialize_by_heatmap:
        # [2, 10, 128, 128]
        dense_heatmap = self.heatmap_head(lidar_feat)
        dense_heatmap_img = None
        if self.fuse_img:
            dense_heatmap_img = self.heatmap_head_img(bev_feat.view(lidar_feat.shape))  # [BS, num_classes, H, W]
            heatmap = (dense_heatmap.detach().sigmoid() + dense_heatmap_img.detach().sigmoid()) / 2
        else:
            heatmap = dense_heatmap.detach().sigmoid()
        padding = self.nms_kernel_size // 2
        local_max = torch.zeros_like(heatmap)
        local_max_inner = F.max_pool2d(heatmap, kernel_size=self.nms_kernel_size, stride=1, padding=0)
        local_max[:, :, padding:(-padding), padding:(-padding)] = local_max_inner
        ## for Pedestrian & Traffic_cone in nuScenes
        if self.test_cfg['dataset'] == 'nuScenes':
            local_max[:, 8, ] = F.max_pool2d(heatmap[:, 8], kernel_size=1, stride=1, padding=0)
            local_max[:, 9, ] = F.max_pool2d(heatmap[:, 9], kernel_size=1, stride=1, padding=0)
        elif self.test_cfg['dataset'] == 'Waymo':  # for Pedestrian & Cyclist in Waymo
            local_max[:, 1, ] = F.max_pool2d(heatmap[:, 1], kernel_size=1, stride=1, padding=0)
            local_max[:, 2, ] = F.max_pool2d(heatmap[:, 2], kernel_size=1, stride=1, padding=0)
        ## positions that are not the local heat maximum are set to 0
        heatmap = heatmap * (heatmap == local_max)
        ## torch.Size([2, 10, 16384])
        heatmap = heatmap.view(batch_size, heatmap.shape[1], -1)

        # top #num_proposals among all classes
        top_proposals = heatmap.view(batch_size, -1).argsort(dim=-1, descending=True)[..., :self.num_proposals]
        top_proposals_class = top_proposals // heatmap.shape[-1]
        ## what is this index used for??
        top_proposals_index = top_proposals % heatmap.shape[-1]
        query_feat = lidar_feat_flatten.gather(index=top_proposals_index[:, None, :].expand(-1, lidar_feat_flatten.shape[1], -1), dim=-1)
        self.query_labels = top_proposals_class

        one_hot = F.one_hot(top_proposals_class, num_classes=self.num_classes).permute(0, 2, 1)
        query_cat_encoding = self.class_encoding(one_hot.float())
        query_feat += query_cat_encoding

        query_pos = bev_pos.gather(index=top_proposals_index[:, None, :].permute(0, 2, 1).expand(-1, -1, bev_pos.shape[-1]), dim=1)
    else:
        query_feat = self.query_feat.repeat(batch_size, 1, 1)  # [BS, C, num_proposals]
        base_xyz = self.query_pos.repeat(batch_size, 1, 1).to(lidar_feat.device)  # [BS, num_proposals, 2]
```

Back to TransFusion. Before image features were fused in, the heatmap had to come from the pure LiDAR feature map. Now that a fused feature map is available, the heatmap naturally has a second source. That is why the code has both a dense_heatmap and an extra dense_heatmap_img, which are finally combined as follows:

`heatmap = (dense_heatmap.detach().sigmoid() + dense_heatmap_img.detach().sigmoid()) / 2`

Without reading the code I had assumed only the heatmap derived from dense_heatmap_img was used; in fact the author combines the two, and the combination is simple: apply a sigmoid to each and take the average.

```python
    ret_dicts = []
    for i in range(self.num_decoder_layers):
        prefix = 'last' if (i == self.num_decoder_layers - 1) else f'{i}head'

        # Transformer Decoder Layer
        # :param query: B C Pq    :param query_pos: B Pq 3/6
        query_feat = self.decoder[i](query_feat, lidar_feat_flatten, query_pos, bev_pos)

        # Prediction
        res_layer = self.prediction_heads[i](query_feat)  ## FFN
        res_layer['center'] = res_layer['center'] + query_pos.permute(0, 2, 1)
        first_res_layer = res_layer
        if not self.fuse_img:
            ret_dicts.append(res_layer)

        # for next level positional embedding
        query_pos = res_layer['center'].detach().clone().permute(0, 2, 1)
```

Compared with the single-modality TransFusion-L, query_feat is still gathered from the pure LiDAR BEV feature map, and lidar_feat_flatten is still the same flattened LiDAR BEV feature map. What has changed is that the hot spots at which query_feat is sampled are now selected on the fused BEV feature map, and that is what "Image Guidance" refers to.
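To make this query initialization easier to play with in isolation, here is a small sketch with toy tensors (my own sizes; the class-specific kernel handling for pedestrians and traffic cones is omitted) of its three steps: local-maximum filtering of the heatmap, picking the top num_proposals peaks across all classes, and gathering the BEV features at those cells:

```python
import torch
import torch.nn.functional as F

B, num_classes, H, W, C = 2, 10, 128, 128, 128
num_proposals, kernel = 200, 3

heatmap = torch.rand(B, num_classes, H, W)               # toy class heatmap in [0, 1]
lidar_feat_flatten = torch.randn(B, C, H * W)            # toy flattened LiDAR BEV features

# 1) keep only local maxima (a cheap NMS on the heatmap)
pad = kernel // 2
local_max = F.max_pool2d(heatmap, kernel_size=kernel, stride=1, padding=pad)
heatmap = heatmap * (heatmap == local_max)

# 2) top-k peaks over all classes and positions
heatmap = heatmap.view(B, num_classes, -1)                                   # [B, num_classes, H*W]
top = heatmap.view(B, -1).argsort(dim=-1, descending=True)[..., :num_proposals]
top_class = top // heatmap.shape[-1]                                         # which class each peak belongs to
top_index = top % heatmap.shape[-1]                                          # which BEV cell each peak sits in

# 3) gather the LiDAR BEV features at the peak cells -> initial object queries
query_feat = lidar_feat_flatten.gather(
    index=top_index[:, None, :].expand(-1, C, -1), dim=-1)                   # [B, C, num_proposals]
print(top_class.shape, query_feat.shape)  # torch.Size([2, 200]) torch.Size([2, 128, 200])
```

In the real head, the class index of each peak is additionally turned into a one-hot vector, passed through self.class_encoding, and added to query_feat as a category embedding.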
The head then moves on to the second-stage decoder layer, where the image features serve as K and V:

```python
    #################################
    # transformer decoder layer (img feature as K, V)
    #################################
    if self.fuse_img:
        # positional encoding for image fusion
        img_feat = raw_img_feat.permute(0, 3, 1, 2, 4)  # [BS, n_views, C, H, W]
        img_feat_flatten = img_feat.view(batch_size, self.num_views, num_channel, -1)  # [BS, n_views, C, H*W]
        if self.img_feat_pos is None:
            (h, w) = img_inputs.shape[-2], img_inputs.shape[-1]
            img_feat_pos = self.img_feat_pos = self.create_2D_grid(h, w).to(img_feat_flatten.device)
        else:
            img_feat_pos = self.img_feat_pos

        prev_query_feat = query_feat.detach().clone()
        query_feat = torch.zeros_like(query_feat)  # create new container for img query feature
        query_pos_realmetric = query_pos.permute(0, 2, 1) * self.test_cfg['out_size_factor'] * self.test_cfg['voxel_size'][0] + self.test_cfg['pc_range'][0]
        query_pos_3d = torch.cat([query_pos_realmetric, res_layer['height']], dim=1).detach().clone()
        if 'vel' in res_layer:
            vel = copy.deepcopy(res_layer['vel'].detach())
        else:
            vel = None
        pred_boxes = self.bbox_coder.decode(
            copy.deepcopy(res_layer['heatmap'].detach()),
            copy.deepcopy(res_layer['rot'].detach()),
            copy.deepcopy(res_layer['dim'].detach()),
            copy.deepcopy(res_layer['center'].detach()),
            copy.deepcopy(res_layer['height'].detach()),
            vel,
        )

        on_the_image_mask = torch.ones([batch_size, self.num_proposals]).to(query_pos_3d.device) * -1

        for sample_idx in range(batch_size if self.fuse_img else 0):
            lidar2img_rt = query_pos_3d.new_tensor(img_metas[sample_idx]['lidar2img'])
            img_scale_factor = (
                query_pos_3d.new_tensor(img_metas[sample_idx]['scale_factor'][:2]
                                        if 'scale_factor' in img_metas[sample_idx].keys() else [1.0, 1.0]))
            img_flip = img_metas[sample_idx]['flip'] if 'flip' in img_metas[sample_idx].keys() else False
            img_crop_offset = (
                query_pos_3d.new_tensor(img_metas[sample_idx]['img_crop_offset'])
                if 'img_crop_offset' in img_metas[sample_idx].keys() else 0)
            img_shape = img_metas[sample_idx]['img_shape'][:2]
            img_pad_shape = img_metas[sample_idx]['input_shape'][:2]
            boxes = LiDARInstance3DBoxes(pred_boxes[sample_idx]['bboxes'][:, :7], box_dim=7)
            query_pos_3d_with_corners = torch.cat([query_pos_3d[sample_idx], boxes.corners.permute(2, 0, 1).view(3, -1)], dim=-1)  # [3, num_proposals] + [3, num_proposals*8]
            # transform point clouds back to original coordinate system by reverting the data augmentation
            if batch_size == 1:  # skip during inference to save time
                points = query_pos_3d_with_corners.T
            else:
                points = apply_3d_transformation(query_pos_3d_with_corners.T, 'LIDAR', img_metas[sample_idx], reverse=True).detach()
            num_points = points.shape[0]

            for view_idx in range(self.num_views):
                pts_4d = torch.cat([points, points.new_ones(size=(num_points, 1))], dim=-1)
                pts_2d = pts_4d @ lidar2img_rt[view_idx].t()
                ## the 1/z factor that multiplies the camera intrinsics
                pts_2d[:, 2] = torch.clamp(pts_2d[:, 2], min=1e-5)
                pts_2d[:, 0] /= pts_2d[:, 2]
                pts_2d[:, 1] /= pts_2d[:, 2]

                # img transformation: scale -> crop -> flip
                # the image is resized by img_scale_factor
                img_coors = pts_2d[:, 0:2] * img_scale_factor  # Nx2
                img_coors -= img_crop_offset

                # grid sample, the valid grid range should be in [-1, 1]
                coor_x, coor_y = torch.split(img_coors, 1, dim=1)  # each is Nx1

                if img_flip:
                    # by default we take it as horizontal flip
                    # use img_shape before padding for flip
                    orig_h, orig_w = img_shape
                    coor_x = orig_w - coor_x
```
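The projection inside the view loop is the standard homogeneous pinhole chain. As a reference, here is a standalone sketch of it with a made-up 4x4 lidar2img matrix (the function name and arguments are mine; augmentation reversal and the scale/crop/flip corrections are left out):

```python
import torch

def project_lidar_to_image(points_xyz: torch.Tensor, lidar2img: torch.Tensor, img_h: int, img_w: int):
    """points_xyz: [N, 3] in the LiDAR frame; lidar2img: [4, 4] projection matrix.
    Returns pixel coordinates [N, 2] and a boolean mask of points landing inside the image."""
    n = points_xyz.shape[0]
    pts_4d = torch.cat([points_xyz, points_xyz.new_ones(n, 1)], dim=-1)   # homogeneous [N, 4]
    pts_2d = pts_4d @ lidar2img.t()                                       # [N, 4] holding (u*z, v*z, z, 1)
    depth = pts_2d[:, 2].clamp(min=1e-5)                                  # avoid dividing by ~0 for points behind the camera
    uv = pts_2d[:, :2] / depth[:, None]                                   # perspective divide -> pixel coords
    on_image = (uv[:, 0] > 0) & (uv[:, 0] < img_w) & (uv[:, 1] > 0) & (uv[:, 1] < img_h) & (pts_2d[:, 2] > 0)
    return uv, on_image

# toy usage: identity projection, two points (one lands outside the image)
lidar2img = torch.eye(4)
pts = torch.tensor([[10.0, 5.0, 1.0], [-3.0, 2.0, 0.5]])
uv, mask = project_lidar_to_image(pts, lidar2img, img_h=900, img_w=1600)
print(uv, mask)
```

The head does this for both the query centers and the 8 corners of each predicted box, so that a per-query radius on the image plane can later be derived from the projected corners.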
Back inside the per-view loop, the projected centers and box corners are separated again, queries that actually fall on the current image are marked, and a Gaussian mask is prepared for the spatially modulated cross attention:

```python
                ## e.g. with 200 proposals there are 200 + 200*8 = 1800 projected points in total
                coor_x, coor_corner_x = coor_x[0:self.num_proposals, :], coor_x[self.num_proposals:, :]
                coor_y, coor_corner_y = coor_y[0:self.num_proposals, :], coor_y[self.num_proposals:, :]
                coor_corner_x = coor_corner_x.reshape(self.num_proposals, 8, 1)
                coor_corner_y = coor_corner_y.reshape(self.num_proposals, 8, 1)
                coor_corner_xy = torch.cat([coor_corner_x, coor_corner_y], dim=-1)

                h, w = img_pad_shape
                on_the_image = (coor_x > 0) * (coor_x < w) * (coor_y > 0) * (coor_y < h)
                on_the_image = on_the_image.squeeze()
                # skip the following computation if no object query fall on current image
                if on_the_image.sum() <= 1:
                    continue
                on_the_image_mask[sample_idx, on_the_image] = view_idx

                # add spatial constraint
                # out_size_factor_img: the downsampling factor of what exactly?
                center_ys = (coor_y[on_the_image] / self.out_size_factor_img)
                center_xs = (coor_x[on_the_image] / self.out_size_factor_img)
                centers = torch.cat([center_xs, center_ys], dim=-1).int()  # center on the feature map
                corners = (coor_corner_xy[on_the_image].max(1).values - coor_corner_xy[on_the_image].min(1).values) / self.out_size_factor_img
                # gaussian circle
                radius = torch.ceil(corners.norm(dim=-1, p=2) / 2).int()  # radius of the minimum circumscribed circle of the wireframe
                sigma = (radius * 2 + 1) / 6.0
                # The 2D gaussian weight mask M is generated in a similar way as CenterNet:
                #   M_ij = exp(-((i - cx)^2 + (j - cy)^2) / (2 * sigma^2)),
                # where (i, j) are the spatial indices of the weight mask M and (cx, cy) is the 2D center
                # computed by projecting the query prediction onto the image plane.
                distance = (centers[:, None, :] - (img_feat_pos - 0.5)).norm(dim=-1) ** 2
                gaussian_mask = (-distance / (2 * sigma[:, None] ** 2)).exp()
                gaussian_mask[gaussian_mask < torch.finfo(torch.float32).eps] = 0  ## weights that are too far away are simply set to 0
                attn_mask = gaussian_mask

                query_feat_view = prev_query_feat[sample_idx, :, on_the_image]
                query_pos_view = torch.cat([center_xs, center_ys], dim=-1)
                # SMCA decoder layer: the on-image queries attend to this view's image features, modulated by attn_mask
                query_feat_view = self.decoder[self.num_decoder_layers](...)
                query_feat[sample_idx, :, on_the_image] = query_feat_view.clone()

        self.on_the_image_mask = (on_the_image_mask != -1)

        res_layer = self.prediction_heads[self.num_decoder_layers](query_feat)
        res_layer['center'] = res_layer['center'] + query_pos.permute(0, 2, 1)
        for key, value in res_layer.items():
            pred_dim = value.shape[1]
            # queries that never landed on any image keep their first-stage (LiDAR-only) predictions
            res_layer[key][~self.on_the_image_mask.unsqueeze(1).repeat(1, pred_dim, 1)] = first_res_layer[key][~self.on_the_image_mask.unsqueeze(1).repeat(1, pred_dim, 1)]

        ret_dicts.append(res_layer)
```

The code above is the climax of TransFusion: it is still a transformer decoder layer, only now the K and V are taken from the image features. Of course, having every object query aggregate image features from all views would be far too expensive and hard to converge; the crux is deciding which image features each object query should associate with. With the initial boxes predicted by the first stage, this becomes much easier. How the first-stage boxes and the Gaussian circle are used in the SMCA (spatially modulated cross attention) is explained clearly in the paper and is a fairly standard trick seen in many papers.

Having read this far, the "soft association" the author talks about should be largely clear. Going from predicted boxes to image features still relies on the calibration (lidar2img), but aggregating the corresponding local image features through object queries uses the Transformer, in particular its cross attention, to perform feature aggregation conditioned on each object query's context, which stays robust even when the sensors are not strictly aligned.
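To close the loop on SMCA, here is a sketch of the Gaussian weight mask on its own (toy centers and radii of my own choosing; img_feat_pos in the real code is the dense grid of feature-map coordinates produced by create_2D_grid, which the meshgrid below imitates):

```python
import torch

def gaussian_attn_mask(centers: torch.Tensor, radii: torch.Tensor, feat_h: int, feat_w: int):
    """centers: [Q, 2] (x, y) on the image feature map; radii: [Q] circumscribed-circle radii.
    Returns a [Q, feat_h*feat_w] mask with M_ij = exp(-((i-cx)^2 + (j-cy)^2) / (2*sigma^2))."""
    ys, xs = torch.meshgrid(torch.arange(feat_h), torch.arange(feat_w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()            # [feat_h*feat_w, 2] in (x, y) order
    sigma = (radii * 2 + 1) / 6.0                                          # same heuristic as the head
    dist2 = (centers[:, None, :] - grid[None]).norm(dim=-1) ** 2           # squared distance to every grid cell
    mask = (-dist2 / (2 * sigma[:, None] ** 2)).exp()
    mask[mask < torch.finfo(torch.float32).eps] = 0                        # zero-out positions that are too far away
    return mask

centers = torch.tensor([[40.0, 20.0], [10.0, 5.0]])   # two projected query centers on the feature map
radii = torch.tensor([4.0, 2.0])
mask = gaussian_attn_mask(centers, radii, feat_h=28, feat_w=50)
print(mask.shape, mask.max().item())                   # torch.Size([2, 1400]) ~1.0
```

This mask is handed to the cross attention of the final decoder layer, so each object query effectively attends only to the image features inside the circle around its projected box; that spatial modulation is the "SM" in SMCA.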