ISBN:
(Print) 9781510642782; 9781510642775
Because most current algorithms use stacked RGB information for spatial-temporal action detection, the time-sequence information is easily lost during convolution and down-sampling, which makes it difficult to model spatial-temporal actions and limits the development of action detection. Given that current advanced pose estimation algorithms have achieved good detection accuracy, we propose an end-to-end network that fuses RGB with skeleton information to address spatial-temporal action detection. We use RGB to describe the appearance of the object and the skeleton to describe the action. Specifically, in the first part, we generate initial classification and location proposals from the RGB information with an SSD network. Second, we generate frame-level skeleton information with an advanced pose estimation algorithm; the skeleton helps the SSD network filter negative samples during training. We then stack the completed and normalized skeletons and feed them into an LSTM network for classification. Finally, we fuse the outputs of the SSD and LSTM networks. We believe that introducing skeleton information can effectively address the insufficient capacity of RGB information for spatial-temporal action modeling. It is worth noting that our skeleton information is produced by advanced pose estimation algorithms rather than annotated by hand. For the datasets, we select the single-person action videos in UCF101 and UCF50. The final experimental results show that our method can significantly improve the action modeling ability of the neural network and yields effective results in action detection.
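A minimal sketch of the skeleton branch and the late-fusion step described in this abstract, assuming per-proposal SSD class scores and an LSTM classifier over stacked, normalized skeleton coordinates; the module names (SkeletonLSTM, fuse_scores), the joint/feature layout, and the fusion weight alpha are illustrative assumptions, not details reported by the paper.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """Classify a stacked, normalized skeleton sequence (assumed layout:
    [batch, frames, joints * 2]) with a single-layer LSTM."""
    def __init__(self, num_joints=17, hidden=128, num_classes=24):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2, hidden_size=hidden,
                            batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, skel_seq):
        _, (h_n, _) = self.lstm(skel_seq)      # h_n: [1, batch, hidden]
        return self.fc(h_n[-1])                # logits: [batch, num_classes]

def fuse_scores(ssd_logits, lstm_logits, alpha=0.5):
    """Weighted late fusion of the RGB (SSD) and skeleton (LSTM) branches;
    alpha is an assumed hyper-parameter, not reported in the abstract."""
    return alpha * ssd_logits.softmax(-1) + (1 - alpha) * lstm_logits.softmax(-1)

# Toy usage: 4 clips, 16 frames, 17 joints, 24 action classes.
skel = torch.randn(4, 16, 17 * 2)
ssd_logits = torch.randn(4, 24)               # stand-in for per-proposal SSD scores
fused = fuse_scores(ssd_logits, SkeletonLSTM()(skel))
print(fused.shape)                             # torch.Size([4, 24])
```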
Spatial-temporal action detection in videos is a challenging problem that has attracted considerable attention in recent years. Most current approaches address action detection as an object detection problem, using successful object detection frameworks such as Faster R-CNN to perform action detection at every single frame first and then generating action tubes by linking bounding boxes across the whole video in an offline fashion. However, unlike object detection in static images, temporal context information is vital for action detection in videos. Therefore, we propose an online action detection model that leverages the spatial-temporal context information in videos to perform action inference and localization. More specifically, we depict the spatial-temporal context pattern of actions via an encoder-decoder model based on a convolutional recurrent neural network. The model accepts a video snippet as input and encodes the dynamic information inside the snippet in the forward pass. During the backward pass, the decoder resolves the information for action detection with the current appearance or motion cue at each time stamp. In addition, we devise an incremental action-tube construction algorithm that enables our model to predict actions ahead of time and perform action detection in an online fashion. To evaluate the performance of our method, we conduct experiments on three popular public datasets: UCF-101, UCF-Sports, and J-HMDB-21. The experimental results demonstrate that our method achieves competitive or superior performance compared to state-of-the-art methods. To encourage further research, we release our project at "https://***."
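A rough sketch of the encoder-decoder idea over a video snippet: an encoder summarizes the snippet's dynamics in the forward pass and a decoder emits per-frame action scores in reverse temporal order. Plain GRUs stand in for the paper's convolutional recurrent cells, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SnippetEncoderDecoder(nn.Module):
    """Illustrative encoder-decoder: a GRU encoder compresses a snippet of
    per-frame features, and a GRU decoder, initialized with that state, produces
    per-frame logits while walking the snippet backwards in time."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=24):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):              # [batch, T, feat_dim]
        _, h = self.encoder(frame_feats)         # encode snippet dynamics
        rev = torch.flip(frame_feats, dims=[1])  # decode in reverse temporal order
        dec_out, _ = self.decoder(rev, h)        # condition on the encoded state
        dec_out = torch.flip(dec_out, dims=[1])  # restore original frame order
        return self.cls(dec_out)                 # per-frame logits [batch, T, C]

feats = torch.randn(2, 8, 512)                   # 2 snippets, 8 frames each
print(SnippetEncoderDecoder()(feats).shape)      # torch.Size([2, 8, 24])
```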
In real urban driving scenes, human actions are very complex and often occur concurrently. Detecting human actions in urban traffic scenes is of great significance for assisted or autonomous driving systems. With this in view, we introduce the TITAN-Human action dataset for multi-person spatial-temporal action detection in urban driving scenes. TITAN-Human action provides fine-grained action labels and location coordinates for 17,574 persons in frames processed from the TITAN dataset. Furthermore, we propose a semantics-guided detection network (SGDNet) based on a semantic inference module (SIM) for spatial-temporal human action detection in urban driving scenes. SIM encodes the category labels into sentence vectors at the semantic level with prompting and embedding, uses graphs to represent the directed co-occurrence relations between categories, and adopts a graph convolutional network for semantic inference. SGDNet exploits the inference results of SIM to guide the visual branch toward better human action detection, thereby integrating visual and linguistic information. We conducted experiments to evaluate SGDNet and several baseline methods on the TITAN-Human action dataset, and reveal the generalizability of SIM in spatial-temporal human action detection. The source code and annotation files will be available at https://***/yyhbswyn/SGDNet.
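A hedged sketch of the graph-convolution step that such a semantic inference module implies: label sentence vectors are propagated along a directed co-occurrence adjacency matrix. The two-layer design, dimensions, and random inputs below are assumptions for illustration, not SGDNet's actual configuration.

```python
import torch
import torch.nn as nn

class SemanticInference(nn.Module):
    """Two-layer GCN-style propagation of category embeddings along a directed
    co-occurrence graph; returns one semantic feature vector per category."""
    def __init__(self, embed_dim=768, hidden=256, out_dim=512):
        super().__init__()
        self.w1 = nn.Linear(embed_dim, hidden)
        self.w2 = nn.Linear(hidden, out_dim)

    def forward(self, label_embed, adj):
        # adj: [C, C] row-normalized directed co-occurrence matrix
        x = torch.relu(adj @ self.w1(label_embed))
        return adj @ self.w2(x)                   # [C, out_dim] semantic features

num_classes = 10
label_embed = torch.randn(num_classes, 768)       # stand-in for sentence embeddings
adj = torch.softmax(torch.randn(num_classes, num_classes), dim=-1)
sem = SemanticInference()(label_embed, adj)
print(sem.shape)                                   # torch.Size([10, 512])
```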
Due to the movement expressiveness and privacy assurance of human skeleton data, 3D skeleton-based action inference is becoming popular in healthcare applications. These scenarios call for more advanced performance in application-specific algorithms and efficient hardware support. Warnings for health emergencies sensitive to response speed require low-latency output and early action detection capabilities. Medical monitoring that runs on an always-on edge platform needs the system processor to be extremely energy efficient. Therefore, in this paper, we propose MC-LSTM, a functional and versatile 3D skeleton-based action detection system, to meet the above demands. Our system achieves state-of-the-art accuracy on trimmed and untrimmed cases of general-purpose and medical-specific datasets with early-detection features. Further, the MC-LSTM accelerator supports parallel inference on up to 64 input channels. The implementation on a Xilinx ZCU104 reaches a throughput of 18,658 frames per second (FPS) and an inference latency of 3.5 ms with a batch size of 64. Accordingly, the power consumption is 3.6 W for the whole FPGA+ARM system, which is 37.8x and 10.4x more energy-efficient than the high-end Titan X GPU and i7-9700 CPU, respectively. Meanwhile, our accelerator also maintains a 4~5x energy-efficiency advantage over the low-power, high-performance Firefly-RK3399 board carrying an ARM Cortex-A72+A53 CPU. We further synthesize an 8-bit quantized version on the same hardware, providing a 48.8% increase in energy efficiency at the same throughput.
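A back-of-the-envelope reading of the reported figures: the per-frame energy implied by 3.6 W at 18,658 FPS, and the per-frame energy this would imply for the GPU and CPU baselines given the 37.8x and 10.4x ratios. The derived baseline numbers are my own arithmetic from the abstract's ratios, not measurements from the paper.

```python
# Energy per frame for the FPGA+ARM system, plus the values implied for the
# Titan X GPU and i7-9700 CPU by the reported efficiency ratios.
power_w = 3.6            # whole FPGA+ARM system power (reported)
throughput_fps = 18_658  # frames per second at batch size 64 (reported)

energy_per_frame_mj = power_w / throughput_fps * 1e3
print(f"FPGA+ARM: {energy_per_frame_mj:.3f} mJ/frame")                    # ~0.193
print(f"Implied Titan X GPU: {energy_per_frame_mj * 37.8:.2f} mJ/frame")  # ~7.29
print(f"Implied i7-9700 CPU: {energy_per_frame_mj * 10.4:.2f} mJ/frame")  # ~2.01
```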