Posts by Collection

Portfolio item number 1

Short description of portfolio item number 1

Portfolio item number 2

Short description of portfolio item number 2

Human Interaction Learning on 3D Skeleton Point Clouds for Video Violence Recognition

Published in , 2020

Download paper here

Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning

Published in , 2021

Download paper here

Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation

Published in , 2021

Abstract: Data augmentation is vital for deep learning neural networks. By providing massive training samples, it helps to improve the generalization ability of the model. Weakly supervised semantic segmentation (WSSS) is a challenging problem that has been deeply studied in recent years, conventional data augmentation approaches for WSSS usually employ geometrical transformations, random cropping, and color jittering. However, merely increasing the same contextual semantic data does not bring much gain to the networks to distinguish the objects, eg, the correct image-level classification of” aeroplane” may be not only due to the recognition of the object itself but also its co-occurrence context like” sky”, which will cause the model to focus less on the object features. To this end, we present a Context Decoupling Augmentation (CDA) method, to change the inherent context in which the objects appear and thus drive the network to remove the dependence between object instances and contextual information. To validate the effectiveness of the proposed method, extensive experiments on PASCAL VOC 2012 dataset with several alternative network architectures demonstrate that CDA can boost various popular WSSS methods to the new state-of-the-art by a large margin.

Self-supervised 3D Skeleton Action Representation Learning with Motion Consistency and Continuity

Published in , 2021

Abstract: Recently, self-supervised learning (SSL) has been proved very effective and it can help boost the performance in learning representations from unlabeled data in the image domain. Yet, very little is explored about its usefulness in 3D skeleton-based action recognition understanding. Directly applying existing SSL techniques for 3D skeleton learning, however, suffers from trivial solutions and imprecise representations. To tackle these drawbacks, we consider perceiving the consistency and continuity of motion at different playback speeds are two critical issues. To this end, we propose a novel SSL method to learn the 3D skeleton representation in an efficacious way. Specifically, by constructing a positive clip (speed-changed) and a negative clip (motion-broken) of the sampled action sequence, we encourage the positive pairs closer while pushing the negative pairs to force the network to learn the intrinsic dynamic motion consistency information. Moreover, to enhance the learning features, skeleton interpolation is further exploited to model the continuity of human skeleton data. To validate the effectiveness of the proposed method, extensive experiments are conducted on Kinetics, NTU60, NTU120, and PKUMMD datasets with several alternative network architectures. Experimental evaluations demonstrate the superiority of our approach and through which, we can gain significant performance improvement without using extra labeled data.

Self-Supervised Object Localization with Joint Graph Partition

Published in , 2022

Abstract: Object localization aims to generate a tight bounding box for the target object, which is a challenging problem that has been deeply studied in recent years. Since collecting bounding-box labels is time-consuming and laborious, many researchers focus on weakly supervised object localization (WSOL). As the recent appealing self-supervised learning technique shows its powerful function in visual tasks, in this paper, we take the early attempt to explore unsupervised object localization by self-supervision. Specifically, we adopt different geometric transformations to image and utilize their parameters as pseudo labels for self-supervised learning. Then, the class-agnostic activation map (CAAM) is used to highlight the target object potential regions. However, such attention maps merely focus on the most discriminative part of the objects, which will affect the quality of the predicted bounding box. Based on the motivation that the activation maps of different transformations of the same image should be equivariant, we further design a siamese network that encodes the paired images and propose a joint graph cluster partition mechanism in an unsupervised manner to enhance the object co-occurrent regions. To validate the effectiveness of the proposed method, extensive experiments are conducted on CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets. Experimental results show that our method outperforms state-of-the-art methods using the same level of supervision, even outperforms some weakly-supervised methods.

General Object Pose Transformation Network from Unpaired Data

Published in , 2022

Abstract: Adding visible watermark into image is a common copyright protection method of medias. Meanwhile, public research on watermark removal can be utilized as an adversarial technology to help the further development of watermarking. Existing watermark removal methods mainly adopt multi-task learning networks, which locate the watermark and restore the background simultaneously. However, these approaches view the task as an image-to-image reconstruction problem, where they only impose supervision after the final output, making the high-level semantic features shared between different tasks. To this end, inspired by the two-stage coarse-refinement network, we propose a novel contrastive learning mechanism to disentangle the high-level embedding semantic information of the images and watermarks, driving the respective network branch more oriented. Specifically, the proposed mechanism is leveraged for watermark image decomposition, which aims to decouple the clean image and watermark hints in the high-level embedding space. This can guarantee the learning representation of the restored image enjoy more task-specific cues. In addition, we introduce a self-attention-based enhancement module, which promotes the network’s ability to capture semantic information among different regions, leading to further improvement on the contrastive learning mechanism. To validate the effectiveness of our proposed method, extensive experiments are conducted on different challenging benchmarks. Experimental evaluations show that our approach can achieve state-of-the-art performance and yield high-quality images.

EpNet: Power lines foreign object detection with Edge Proposal Network and data composition

Published in , 2022

Download paper here

Dual Progressive Transformations for Weakly Supervised Semantic Segmentation

Published in , 2022

Download paper here

DENet: Disentangled Embedding Network for Visible Watermark Removal

Published in , 2023

Abstract: Adding visible watermark into image is a common copyright protection method of medias. Meanwhile, public research on watermark removal can be utilized as an adversarial technology to help the further development of watermarking. Existing watermark removal methods mainly adopt multi-task learning networks, which locate the watermark and restore the background simultaneously. However, these approaches view the task as an image-to-image reconstruction problem, where they only impose supervision after the final output, making the high-level semantic features shared between different tasks. To this end, inspired by the two-stage coarse-refinement network, we propose a novel contrastive learning mechanism to disentangle the high-level embedding semantic information of the images and watermarks, driving the respective network branch more oriented. Specifically, the proposed mechanism is leveraged for watermark image decomposition, which aims to decouple the clean image and watermark hints in the high-level embedding space. This can guarantee the learning representation of the restored image enjoy more task-specific cues. In addition, we introduce a self-attention-based enhancement module, which promotes the network’s ability to capture semantic information among different regions, leading to further improvement on the contrastive learning mechanism. To validate the effectiveness of our proposed method, extensive experiments are conducted on different challenging benchmarks. Experimental evaluations show that our approach can achieve state-of-the-art performance and yield high-quality images.

Contrastive Generative Network with Recursive-Loop for 3D point cloud generalized zero-shot classification

Published in , 2023

Abstract: Generalized Zero-Shot Learning (GZSL) aims to recognize objects from both seen and unseen categories by transferring semantic knowledge and merely utilizing seen class data for training. Recent feature generation methods in the 2D image domain have made great progress. However, very little is known about its usefulness in 3D point cloud zero-shot learning. This work aims to facilitate research on 3D point cloud generalized zero-shot learning. Different from previous works, we focus on synthesizing the more high-level discriminative point cloud features. To this end, we design a representation enhancement strategy to generate the features. Specifically, we propose a Contrastive Generative Network with Recursive-Loop, termed as CGRL, which can be leveraged to enlarge the inter-class distances and narrow the intra-class gaps. By applying the contrastive representations to the generative model in a recursive-loop form, it can provide the self-guidance for the generator recurrently, which can help yield more discriminative features and train a better classifier. To validate the effectiveness of the proposed method, extensive experiments are conducted on three benchmarks, including ModelNet40, McGill, and ScanObjectNN. Experimental evaluations demonstrate the superiority of our approach and it can outperform the state-of-the-arts by a large margin.

A unified transformer framework for group-based segmentation: Co-segmentation, co-saliency detection and video salient object detection

Published in , 2023

Abstract: Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world. In the computer vision area, many researches focus on co-segmentation (CoS), co-saliency detection (CoSD) and video salient object detection (VSOD) to discover the co-occurrent objects. However, previous approaches design different networks on these similar tasks separately, and they are difficult to apply to each other, which lowers the upper bound of the transferability of deep learning frameworks. Besides, they fail to take full advantage of the cues among inter- and intra-feature within a group of images. In this paper, we introduce a unified framework to tackle these issues, term as UFO (Unified Framework for Co-Object Segmentation). Specifically, we first introduce a transformer block, which views the image feature as a patch token and then captures their long-range dependencies through the self-attention mechanism. This can help the network to excavate the patch structured similarities among the relevant objects. Furthermore, we propose an intra-MLP learning module to produce self-mask to enhance the network to avoid partial activation. Extensive experiments on four CoS benchmarks (PASCAL, iCoseg, Internet and MSRC), three CoSD benchmarks (Cosal2015, CoSOD3k, and CocA) and four VSOD benchmarks (DAVIS16, FBMS, ViSal and SegV2) show that our method outperforms other state-of-the-arts on three different tasks in both accuracy and speed by using the same network architecture , which can reach 140 FPS in real-time.

Occlusion-aware detection and re-id calibrated network for multi-object tracking

Published in , 2023

Download paper here

DI-Net: Decomposed Implicit Garment Transfer Network for Digital Clothed 3D Human

Published in , 2023

Download paper here

Spatial-Semantic Collaborative Cropping for User Generated Content

Published in , 2024

Download paper here

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

Published in , 2024

Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision. Due to the local receptive fields generated by convolution operations, previous CNN-based methods suffer from partial activation issues, concentrating on the object’s discriminative part instead of the entire entity scope. Benefiting from the capability of the self-attention mechanism to acquire long-range feature dependencies, Vision Transformer has been recently applied to alleviate the local activation drawbacks. However, since the transformer lacks the inductive localization bias that are inherent in CNNs, it may cause a divergent activation problem resulting in an uncertain distinction between foreground and background. In this work, we proposed a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation. Specifically, we first propose a local patch shuffle strategy to construct the image pairs, disrupting local patches while guaranteeing global consistency. The paired images that contain the common object in spatial are then fed into the Siamese network encoder. We further design a semantic-constraint matching module, which aims to mine the co-object part by matching the coarse class activation maps (CAMs) extracted from the pair images, thus implicitly guiding and calibrating the transformer network to alleviate the divergent activation. Extensive experimental results conducted on two challenging benchmarks, including CUB-200-2011 and ILSVRC datasets show that our method can achieve the new state-of-the-art performance and outperform the previous method by a large margin.

Yukun Su

Posts by Collection

portfolio

Portfolio item number 1

Portfolio item number 2

publications

Human Interaction Learning on 3D Skeleton Point Clouds for Video Violence Recognition

Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning

Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation

Self-supervised 3D Skeleton Action Representation Learning with Motion Consistency and Continuity

Self-Supervised Object Localization with Joint Graph Partition

General Object Pose Transformation Network from Unpaired Data

EpNet: Power lines foreign object detection with Edge Proposal Network and data composition

Dual Progressive Transformations for Weakly Supervised Semantic Segmentation

DENet: Disentangled Embedding Network for Visible Watermark Removal

Contrastive Generative Network with Recursive-Loop for 3D point cloud generalized zero-shot classification

A unified transformer framework for group-based segmentation: Co-segmentation, co-saliency detection and video salient object detection

Occlusion-aware detection and re-id calibrated network for multi-object tracking

DI-Net: Decomposed Implicit Garment Transfer Network for Digital Clothed 3D Human

Spatial-Semantic Collaborative Cropping for User Generated Content

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

talks

Talk 1 on Relevant Topic in Your Field

Tutorial 1 on Relevant Topic in Your Field

Talk 2 on Relevant Topic in Your Field

Conference Proceeding talk 3 on Relevant Topic in Your Field

teaching

Teaching experience 1

Teaching experience 2