ISBN:
(Print) 9781510642782; 9781510642775
Because most current algorithms use stacked RGB information for spatial-temporal action detection, the time-sequence information is easily lost during convolution and down-sampling, which makes it difficult to model spatial-temporal actions and limits the development of action detection. Given that current advanced pose estimation algorithms have achieved good detection accuracy, we propose an end-to-end network that fuses RGB with skeleton information to address spatial-temporal action detection. We use RGB to describe the appearance of the object and the skeleton to describe the action. Specifically, in the first part, we generate initial classification and location proposals from the RGB information with an SSD network. Second, we generate frame-level skeleton information with an advanced pose estimation algorithm; the skeleton helps the SSD network filter negative samples during training. We then stack the completed and normalized skeletons and feed them into an LSTM network for classification. Finally, we fuse the outputs of the SSD and LSTM networks. We believe that introducing skeleton information can effectively address the insufficient capacity of RGB information for spatial-temporal action modeling. It is worth noting that our skeleton information is produced by advanced pose estimation algorithms rather than annotated by hand. For the datasets, we select the single-person action videos in UCF101 and UCF50. The final experimental results show that our method can significantly improve the action modeling ability of the neural network and yields effective results in action detection.
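A minimal sketch of the skeleton branch and the late-fusion step described in this abstract, assuming per-proposal SSD class scores and an LSTM classifier over stacked, normalized skeleton coordinates; the module names (SkeletonLSTM, fuse_scores), the joint/feature layout, and the fusion weight alpha are illustrative assumptions, not details reported by the paper.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """Classify a stacked, normalized skeleton sequence (assumed layout:
    [batch, frames, joints * 2]) with a single-layer LSTM."""
    def __init__(self, num_joints=17, hidden=128, num_classes=24):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2, hidden_size=hidden,
                            batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, skel_seq):
        _, (h_n, _) = self.lstm(skel_seq)      # h_n: [1, batch, hidden]
        return self.fc(h_n[-1])                # logits: [batch, num_classes]

def fuse_scores(ssd_logits, lstm_logits, alpha=0.5):
    """Weighted late fusion of the RGB (SSD) and skeleton (LSTM) branches;
    alpha is an assumed hyper-parameter, not reported in the abstract."""
    return alpha * ssd_logits.softmax(-1) + (1 - alpha) * lstm_logits.softmax(-1)

# Toy usage: 4 clips, 16 frames, 17 joints, 24 action classes.
skel = torch.randn(4, 16, 17 * 2)
ssd_logits = torch.randn(4, 24)               # stand-in for per-proposal SSD scores
fused = fuse_scores(ssd_logits, SkeletonLSTM()(skel))
print(fused.shape)                             # torch.Size([4, 24])
```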
Spatial-temporal action detection in videos is a challenging problem that has attracted considerable attention in recent years. Most current approaches address action detection as an object detection problem, using successful object detection frameworks such as Faster R-CNN to perform action detection at every single frame first and then generating action tubes by linking bounding boxes across the whole video in an offline fashion. However, unlike object detection in static images, temporal context information is vital for action detection in videos. Therefore, we propose an online action detection model that leverages the spatial-temporal context information in videos to perform action inference and localization. More specifically, we depict the spatial-temporal context pattern of actions via an encoder-decoder model based on a convolutional recurrent neural network. The model accepts a video snippet as input and encodes the dynamic information inside the snippet in the forward pass. During the backward pass, the decoder resolves the information for action detection with the current appearance or motion cue at each time stamp. In addition, we devise an incremental action-tube construction algorithm that enables our model to predict actions ahead of time and perform action detection in an online fashion. To evaluate the performance of our method, we conduct experiments on three popular public datasets: UCF-101, UCF-Sports, and J-HMDB-21. The experimental results demonstrate that our method achieves competitive or superior performance compared to state-of-the-art methods. To encourage further research, we release our project at "https://***."
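A rough sketch of the encoder-decoder idea over a video snippet: an encoder summarizes the snippet's dynamics in the forward pass and a decoder emits per-frame action scores in reverse temporal order. Plain GRUs stand in for the paper's convolutional recurrent cells, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SnippetEncoderDecoder(nn.Module):
    """Illustrative encoder-decoder: a GRU encoder compresses a snippet of
    per-frame features, and a GRU decoder, initialized with that state, produces
    per-frame logits while walking the snippet backwards in time."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=24):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):              # [batch, T, feat_dim]
        _, h = self.encoder(frame_feats)         # encode snippet dynamics
        rev = torch.flip(frame_feats, dims=[1])  # decode in reverse temporal order
        dec_out, _ = self.decoder(rev, h)        # condition on the encoded state
        dec_out = torch.flip(dec_out, dims=[1])  # restore original frame order
        return self.cls(dec_out)                 # per-frame logits [batch, T, C]

feats = torch.randn(2, 8, 512)                   # 2 snippets, 8 frames each
print(SnippetEncoderDecoder()(feats).shape)      # torch.Size([2, 8, 24])
```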
In real urban driving scenes, human actions are very complex and often occur concurrently. Detecting human actions in urban traffic scenes is of great significance for assisted or autonomous driving systems. With this in view, we introduce the TITAN-Human action dataset for multi-person spatial-temporal action detection in urban driving scenes. TITAN-Human action provides fine-grained action labels and location coordinates for 17,574 persons in frames processed from the TITAN dataset. Furthermore, we propose a semantics-guided detection network (SGDNet) based on a semantic inference module (SIM) for spatial-temporal human action detection in urban driving scenes. SIM encodes the category labels into sentence vectors at the semantic level with prompting and embedding, uses graphs to represent the directed co-occurrence relations between categories, and adopts a graph convolutional network for semantic inference. SGDNet exploits the inference results of SIM to guide the visual branch toward better human action detection, thereby integrating visual and linguistic information. We conducted experiments to evaluate SGDNet and several baseline methods on the TITAN-Human action dataset, and reveal the generalizability of SIM in spatial-temporal human action detection. The source code and annotation files will be available at https://***/yyhbswyn/SGDNet.
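A hedged sketch of the graph-convolution step that such a semantic inference module implies: label sentence vectors are propagated along a directed co-occurrence adjacency matrix. The two-layer design, dimensions, and random inputs below are assumptions for illustration, not SGDNet's actual configuration.

```python
import torch
import torch.nn as nn

class SemanticInference(nn.Module):
    """Two-layer GCN-style propagation of category embeddings along a directed
    co-occurrence graph; returns one semantic feature vector per category."""
    def __init__(self, embed_dim=768, hidden=256, out_dim=512):
        super().__init__()
        self.w1 = nn.Linear(embed_dim, hidden)
        self.w2 = nn.Linear(hidden, out_dim)

    def forward(self, label_embed, adj):
        # adj: [C, C] row-normalized directed co-occurrence matrix
        x = torch.relu(adj @ self.w1(label_embed))
        return adj @ self.w2(x)                   # [C, out_dim] semantic features

num_classes = 10
label_embed = torch.randn(num_classes, 768)       # stand-in for sentence embeddings
adj = torch.softmax(torch.randn(num_classes, num_classes), dim=-1)
sem = SemanticInference()(label_embed, adj)
print(sem.shape)                                   # torch.Size([10, 512])
```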
Due to the movement expressiveness and privacy assurance of human skeleton data, 3D skeleton-based action inference is becoming popular in healthcare applications. These scenarios call for more advanced performance in application-specific algorithms and efficient hardware support. Warnings for health emergencies sensitive to response speed require low-latency output and early action detection capabilities. Medical monitoring that runs on an always-on edge platform needs the system processor to be extremely energy efficient. Therefore, in this paper, we propose MC-LSTM, a functional and versatile 3D skeleton-based action detection system, to meet the above demands. Our system achieves state-of-the-art accuracy on trimmed and untrimmed cases of general-purpose and medical-specific datasets with early-detection features. Further, the MC-LSTM accelerator supports parallel inference on up to 64 input channels. The implementation on a Xilinx ZCU104 reaches a throughput of 18,658 frames per second (FPS) and an inference latency of 3.5 ms with a batch size of 64. Accordingly, the power consumption is 3.6 W for the whole FPGA+ARM system, which is 37.8x and 10.4x more energy-efficient than the high-end Titan X GPU and i7-9700 CPU, respectively. Meanwhile, our accelerator also maintains a 4~5x energy-efficiency advantage over the low-power, high-performance Firefly-RK3399 board carrying an ARM Cortex-A72+A53 CPU. We further synthesize an 8-bit quantized version on the same hardware, providing a 48.8% increase in energy efficiency at the same throughput.
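A back-of-the-envelope reading of the reported figures: the per-frame energy implied by 3.6 W at 18,658 FPS, and the per-frame energy this would imply for the GPU and CPU baselines given the 37.8x and 10.4x ratios. The derived baseline numbers are my own arithmetic from the abstract's ratios, not measurements from the paper.

```python
# Energy per frame for the FPGA+ARM system, plus the values implied for the
# Titan X GPU and i7-9700 CPU by the reported efficiency ratios.
power_w = 3.6            # whole FPGA+ARM system power (reported)
throughput_fps = 18_658  # frames per second at batch size 64 (reported)

energy_per_frame_mj = power_w / throughput_fps * 1e3
print(f"FPGA+ARM: {energy_per_frame_mj:.3f} mJ/frame")                    # ~0.193
print(f"Implied Titan X GPU: {energy_per_frame_mj * 37.8:.2f} mJ/frame")  # ~7.29
print(f"Implied i7-9700 CPU: {energy_per_frame_mj * 10.4:.2f} mJ/frame")  # ~2.01
```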