Personalized Highlight Detection for Automatic GIF Creation
Highlight detection models are typically trained to identify cues
that make visual content appealing or interesting for the general
public, with the objective of reducing a video to such moments.
However, this “interestingness” of a video segment or image is
subjective. Thus, such highlight models provide results of limited
relevance for the individual user. On the other hand, training one
model per user is inefficient and requires large amounts of personal
information, which is typically not available. To overcome these
limitations, we present a global ranking model which can condition
on a particular user’s interests. Rather than training one model per
user, our model is personalized via its inputs, which allows it to
effectively adapt its predictions, given only a few user-specific
examples. To train this model, we create a large-scale dataset of
users and the GIFs they created, giving us an accurate indication
of their interests.
Our experiments show that using the user history
substantially improves the prediction accuracy. On our test set of
850 videos, our model improves recall by over 8% with respect to
generic highlight detectors. Our method proves more precise than the
user-agnostic baselines even with just one user-specific example.
Our model predicts the score of a segment
as a function of both the segment itself and the user’s previously
selected highlights. As such, the model learns to take into account
the user history to make accurate personalized predictions.
While there are several ways to achieve personalization, making the
user history an input to the model has the advantage that a single
model is sufficient and can be trained on the annotations of all
users. This one model predicts personalized highlights for any user,
and new user information can trivially be included.
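As a concrete illustration of personalizing via the inputs, the sketch below scores a candidate segment from its features together with a mean-pooled aggregate of the user's past GIF segments, and is trained with a pairwise ranking loss so that selected segments outrank non-selected ones. The layer sizes, aggregation choice, and margin are assumptions for illustration, not the exact configuration used in the paper.

```python
# A minimal sketch (PyTorch) of personalization via the model inputs,
# assuming precomputed segment features and a fixed-size aggregate of the
# user's previously chosen GIF segments (here, their mean feature vector).
import torch
import torch.nn as nn


class PersonalizedScorer(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Segment and aggregated history are concatenated, so one model
        # serves all users and can be trained on all users' annotations.
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, segment: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # segment: (batch, feat_dim) features of candidate segments
        # history: (batch, feat_dim) aggregated features of each user's past GIFs
        return self.net(torch.cat([segment, history], dim=-1)).squeeze(-1)


def aggregate_history(past_segments: torch.Tensor) -> torch.Tensor:
    # One simple aggregation choice: mean-pool the user's previous GIF segments.
    return past_segments.mean(dim=0)


def ranking_loss(pos_score: torch.Tensor, neg_score: torch.Tensor) -> torch.Tensor:
    # Pairwise hinge loss: a segment the user actually turned into a GIF
    # should score higher than a non-selected segment from the same video.
    return torch.clamp(1.0 - pos_score + neg_score, min=0.0).mean()
```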
We propose two models for our ranking objective, which are combined with late
fusion. One takes the segment representation and the aggregated history
as input (the FNN model), while the other directly uses the distances
between the segment and the history (the SVM model).
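The sketch below illustrates the second branch and the fusion step under similar assumptions: a few distance features summarize how close a candidate segment is to the user's history (the kind of input a distance-based SVM ranker could consume), and the two model scores are merged afterwards with a convex combination. The particular distance features and the fusion weight alpha are hypothetical choices, not necessarily those used in the paper.

```python
# Hedged sketch of the distance-based branch and the late-fusion step
# (NumPy). `distance_features` is a hypothetical feature choice; a linear
# SVM trained on such features would produce `svm_score` below.
import numpy as np


def distance_features(segment: np.ndarray, history: np.ndarray) -> np.ndarray:
    # Euclidean and cosine distances between one segment (d,) and each of the
    # user's past GIF segments (n, d), summarized by their minimum and mean.
    diffs = history - segment
    euclid = np.linalg.norm(diffs, axis=1)
    cosine = 1.0 - history @ segment / (
        np.linalg.norm(history, axis=1) * np.linalg.norm(segment) + 1e-8
    )
    return np.array([euclid.min(), euclid.mean(), cosine.min(), cosine.mean()])


def late_fusion(fnn_score: float, svm_score: float, alpha: float = 0.5) -> float:
    # Late fusion: each model scores the segment independently and the
    # scores are combined afterwards; alpha is a placeholder weight.
    return alpha * fnn_score + (1.0 - alpha) * svm_score
```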
Experimental Results
When analyzing the results, we find that our
method outperforms the baselines by a significant
margin.
Models using only generic highlight information or only the similarity
to previous GIFs perform similarly (15.86% mAP for Video2GIF
(ours) vs. 15.64% mAP for SVM-D), despite the simplicity of the
distance model. Thus, we conclude that both kinds of information
are important and that a user's history contains a strong signal
about their future choice of highlights.