Personalized Highlight Detection for Automatic GIF Creation


Highlight detection models are typically trained to identify cues that make visual content appealing or interesting for the general public, with the objective of reducing a video to such moments. However, this “interestingness” of a video segment or image is subjective. Thus, such highlight models provide results of limited relevance for the individual user. On the other hand, training one model per user is inefficient and requires large amounts of personal information, which is typically not available. To overcome these limitations, we present a global ranking model which can condition on a particular user’s interests. Rather than training one model per user, our model is personalized via its inputs, which allows it to effectively adapt its predictions given only a few user-specific examples. To train this model, we create a large-scale dataset of users and the GIFs they created, giving us an accurate indication of their interests.

Our experiments show that using the user history substantially improves prediction accuracy. On our test set of 850 videos, our model improves recall by more than 8% compared to generic highlight detectors. Our method remains more precise than the user-agnostic baselines even with just a single user-specific example.


The awesomeness of PHD-GIFs:


Method Overview

Our model predicts the score of a segment as a function of both the segment itself and the user’s previously selected highlights. As such, the model learns to take into account the user history to make accurate personalized predictions.

While there are several ways to do personalization, making the user history an input to the model has the advantage that a single model suffices: it can be trained on the annotations of all users jointly, it predicts personalized highlights for every user, and new user information can be included trivially.
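
For illustration, a minimal sketch of such a history-conditioned scorer is shown below (PyTorch-style; the feature dimensions, aggregation, and layer sizes are placeholders rather than the exact architecture used in our model):

```python
import torch
import torch.nn as nn


class PersonalizedScorer(nn.Module):
    """Scores a video segment conditioned on a user's history of past GIFs.

    The history is aggregated (here: mean-pooled) into a single vector and
    concatenated with the segment features, so one shared model serves all
    users and new history items can be added without retraining.
    """

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, segment: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # segment: (batch, feat_dim); history: (batch, n_history, feat_dim)
        history_agg = history.mean(dim=1)              # aggregate the user's past GIFs
        x = torch.cat([segment, history_agg], dim=-1)  # personalization enters via the input
        return self.mlp(x).squeeze(-1)                 # one highlight score per segment
```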

We propose two models for our ranking objective, which are combined with late fusion. One takes the segment representation and the aggregated history as input (the FNN model), while the second directly uses the distances between the segments and the history (the SVM model).
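
The sketch below illustrates the late-fusion step, with the SVM ranker simplified to a plain nearest-history similarity score; the min-max normalization and the fusion weight are illustrative choices, not the values used in our experiments:

```python
import numpy as np


def distance_scores(segments: np.ndarray, history: np.ndarray) -> np.ndarray:
    """Score each segment by its cosine similarity to the closest history GIF."""
    seg = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    his = history / np.linalg.norm(history, axis=1, keepdims=True)
    return (seg @ his.T).max(axis=1)


def late_fusion(fnn_scores: np.ndarray, dist_scores: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Combine the two rankers with a convex combination of normalized scores."""
    def norm(s: np.ndarray) -> np.ndarray:
        return (s - s.min()) / (s.max() - s.min() + 1e-8)
    return w * norm(fnn_scores) + (1.0 - w) * norm(dist_scores)
```

Fusing at the score level keeps the two rankers independent, so each can be trained and tuned on its own before they are combined.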


Experimental Results

When analyzing the results, we find that our method outperforms the baselines by a significant margin. Models using only generic highlight information or only the similarity to previous GIFs perform similarly (15.86% mAP for Video2GIF (ours) vs. 15.64% mAP for SVM-D), despite the simplicity of the distance model. Thus, we conclude that both kinds of information are important and that a user’s history carries a lot of signal about their future choice of highlights.
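
For reference, mAP here refers to mean average precision. A generic sketch of computing it, assuming per-video segment scores and binary labels marking the user's GIF segments (not necessarily our exact evaluation protocol), is:

```python
import numpy as np
from sklearn.metrics import average_precision_score


def mean_average_precision(per_video_scores, per_video_labels):
    """Average precision of the predicted segment ranking against the user's
    ground-truth GIF segments, averaged over all test videos."""
    aps = [
        average_precision_score(labels, scores)
        for scores, labels in zip(per_video_scores, per_video_labels)
        if np.asarray(labels).any()  # skip videos without positive segments
    ]
    return float(np.mean(aps))
```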