
Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications, such as surveillance and robotics. While traditional approaches follow a two-step pipeline by generating frame-wise probabilities and then feeding them to high-level temporal models, recent approaches use temporal convolutions to directly classify the video frames. In this paper, we introduce a multi-stage architecture for the temporal action segmentation task. Each stage features a set of dilated temporal convolutions to generate an initial prediction that is refined by the next one. This architecture is trained using a combination of a classification loss and a proposed smoothing loss that penalizes over-segmentation errors. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our model achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
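To make the multi-stage design concrete, the following is a minimal PyTorch sketch of the idea described above. The class names (DilatedResidualLayer, SingleStage, MultiStageModel), layer count, and channel width are illustrative assumptions, not the authors' exact configuration; the essential points are that the dilation doubles at each layer, growing the receptive field exponentially, and that each later stage refines the softmax predictions of the previous one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedResidualLayer(nn.Module):
    """One dilated temporal convolution with a residual connection."""
    def __init__(self, dilation, channels):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        out = F.relu(self.conv_dilated(x))
        out = self.conv_1x1(out)
        return x + out  # residual connection

class SingleStage(nn.Module):
    """A stack of dilated residual layers producing frame-wise class logits."""
    def __init__(self, in_dim, channels, num_classes, num_layers=10):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, channels, kernel_size=1)
        # Doubling the dilation at each layer grows the receptive field
        # exponentially, which captures long-range temporal dependencies.
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(2 ** i, channels) for i in range(num_layers)])
        self.conv_out = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.conv_in(x)
        for layer in self.layers:
            x = layer(x)
        return self.conv_out(x)

class MultiStageModel(nn.Module):
    """The first stage sees frame features; each subsequent stage refines
    the previous stage's softmax predictions."""
    def __init__(self, num_stages, in_dim, channels, num_classes):
        super().__init__()
        self.stage1 = SingleStage(in_dim, channels, num_classes)
        self.stages = nn.ModuleList(
            [SingleStage(num_classes, channels, num_classes)
             for _ in range(num_stages - 1)])

    def forward(self, x):  # x: (batch, in_dim, num_frames)
        out = self.stage1(x)
        outputs = [out]
        for stage in self.stages:
            out = stage(F.softmax(out, dim=1))
            outputs.append(out)
        return outputs  # one prediction per stage; all are supervised
```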

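The training objective can be sketched in the same spirit: a frame-wise cross-entropy classification term plus a smoothing term that penalizes large frame-to-frame changes in the log-probabilities, which is what discourages over-segmentation. The truncation threshold tau and weight lam below are illustrative assumptions; the abstract specifies only that the smoothing loss penalizes over-segmentation errors.

```python
def segmentation_loss(logits, labels, lam=0.15, tau=4.0):
    """Classification loss plus a truncated smoothing penalty.
    logits: (batch, num_classes, num_frames), labels: (batch, num_frames)."""
    ce = F.cross_entropy(logits, labels)
    log_p = F.log_softmax(logits, dim=1)
    # Penalize jumps between consecutive frames' log-probabilities.
    # detach() stops gradients through the previous frame, so only the
    # current frame is pulled toward its neighbor; clamping the squared
    # difference keeps genuine action boundaries from dominating the loss.
    diff = log_p[:, :, 1:] - log_p[:, :, :-1].detach()
    tmse = torch.clamp(diff ** 2, max=tau ** 2).mean()
    return ce + lam * tmse

# With a multi-stage model, the total loss sums this term over the
# predictions returned by every stage.
```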