Extracting Topic Data From YouTube Activity
YouTube is an amazing resource for creating and discovering videos, but it is valuable for people building non-video experiences as well. Looking at YouTube usage is a great way of finding out what a user is into, and can provide the information needed for a more tailored out-of-the-box experience. It's really easy to request access to YouTube alongside Google+ Sign-In, and then use the YouTube API to retrieve the user's watch history or their likes.
YouTube offers several scopes for allowing access to different facets of the functionality, but in this case we will be using the readonly scope. This means the user only consents for the application to view their activity on YouTube, and doesn't grant it the ability to upload videos or subscribe them to channels. In an Android setup, we can request the scope with our call to PlusClient.Builder. Because the YouTube API is not part of Google Play services, we will also need to create a GoogleAccountCredential from the Google Java client library to use when making YouTube API calls.
When the user signs in, we need to set the chosen account on the GoogleAccountCredential object. In my test app I immediately kicked off the async task to retrieve the playlist as well, but you may want to do that later or maybe on a server.
We're aiming to retrieve a list of interests, which is done by retrieving the list of videos the user has watched, or has liked, and extracting some topic indicator from them. A like is a stronger signal, but not every user will have liked videos, so you may want to implement both. To actually do this there are four stages:
- Retrieve the user's channel details to get their playlist IDs
- Retrieve the videos in the user's Likes or WatchHistory playlist
- Retrieve the video details of each video on the list
- Extract the Freebase topic ID, or the category of each video
The final point is where you actually extract the topic information. Sometimes there is a good match between category and your needs - for example if you are looking for music videos, then filtering by category 10 (music) and saving the video titles is likely to give you a good set of keywords to plug into your own system. However, if you need more specific data, then you can take advantage of the fact that every video on YouTube has one or more Freebase topic MIDs associated with it. You can aggregate these topics, and retrieve details on them from the Freebase API.
Retrieving The Channel
Every user on YouTube has an implicit channel, and retrieving it is the first call we'll need to make. We call the channels API with the mine parameter set to true to indicate we want the current user's channels. We also have to specify a part parameter to indicate which part of the response we're interested in - here we use contentDetails. A user may have more than one channel, but we're interested in the default channel, which should be first. From there we can get the playlist ID for the likes or watch history playlists.
Retrieving the playlist video IDs
Next we need to query the playlist. We may well get more videos back than we can retrieve in a single call, so the proper approach here is to loop over the pages of results until we have all of them.
If we just wanted a sampling of videos, we could avoid this looping and just make a single call.
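The paging loop can be sketched in plain Java as below. This is only an illustration: Page and PageFetcher are hypothetical stand-ins for the playlist response and for whatever code performs the request, and the sketch assumes each response carries a token for the next page, which is null on the last one. Passing true for firstPageOnly gives the single-call sampling behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one page of playlist results. */
class Page {
    final List<String> videoIds;
    final String nextPageToken; // null when there are no more pages

    Page(List<String> videoIds, String nextPageToken) {
        this.videoIds = videoIds;
        this.nextPageToken = nextPageToken;
    }
}

/** Hypothetical hook that performs one playlist request for a page token. */
interface PageFetcher {
    Page fetch(String pageToken);
}

class PlaylistPager {
    /** Collects video IDs from every page, or from the first page only. */
    static List<String> collectVideoIds(PageFetcher fetcher, boolean firstPageOnly) {
        List<String> ids = new ArrayList<>();
        String token = null; // a null token asks for the first page
        do {
            Page page = fetcher.fetch(token);
            ids.addAll(page.videoIds);
            token = page.nextPageToken;
        } while (token != null && !firstPageOnly);
        return ids;
    }
}
```

In a real app the PageFetcher would wrap the playlist call made with the GoogleAccountCredential, and would run off the main thread.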
Retrieving videos
While for many uses the playlist actually gives us enough information, for retrieving the category or the topic IDs we need to retrieve the videos themselves. The videos.list call can return all this information, and luckily enough can take a comma-separated list of video IDs. In this example, I'm just taking the first 50 video IDs, but we could make several calls and retrieve all the videos.
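If we did want every video rather than the first 50, one way is to split the IDs into batches and build one comma-separated parameter per call. The following is a rough sketch; VideoIdBatcher and toIdParameters are names made up for illustration, and the batch size is left as a parameter so it can match the 50 used above.

```java
import java.util.ArrayList;
import java.util.List;

class VideoIdBatcher {
    /** Splits the IDs into batches and joins each batch with commas. */
    static List<String> toIdParameters(List<String> videoIds, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> params = new ArrayList<>();
        for (int start = 0; start < videoIds.size(); start += batchSize) {
            int end = Math.min(start + batchSize, videoIds.size());
            // Each entry is the value for one videos.list request.
            params.add(String.join(",", videoIds.subList(start, end)));
        }
        return params;
    }
}
```

Each string in the returned list would then be used as the video ID parameter of a separate videos.list call.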
In the loop across videos.getItems we are extracting the topic IDs. We're also keeping a count in case we want to filter by only the most popular topics. Sometimes we will see that topicDetails contains two entries: alongside the main topic IDs, relevantTopicIds can give us better topical matches, though these are usually more general and there are fewer of them. We could use that in preference, or add it in.
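The counting step on its own looks roughly like this in plain Java. The input is a hypothetical stand-in: one list of topic IDs per video, taken from wherever you read topicDetails, with null assumed to mark a video that has no topic data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopicCounter {
    /** Counts how many videos mention each topic ID. */
    static Map<String, Integer> countTopics(List<List<String>> topicIdsPerVideo) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> topicIds : topicIdsPerVideo) {
            if (topicIds == null) {
                continue; // assumption: some videos carry no topic data
            }
            for (String topicId : topicIds) {
                // Start at 1 for a new topic, otherwise add 1 to the existing count.
                counts.merge(topicId, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

To add relevantTopicIds in as well, the same method can be fed a second list per video.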
Rather than retrieving the topic details and doing further processing, we could alternatively filter by category here if we wished, as indicated in the comment in the code above. An example of retrieving videos in the music category from my recent likes returns the list in the screenshot below. Also note the YouTube consent line in the dialog, which makes it clear that the access is read only.
Retrieving topic information
We will now have a list of topic-to-count mappings we can use, a series of pairs like this:
/m/07lb3 => 5
/m/09xp_ => 3
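If we only care about the most popular topics, the mappings can be sorted by count and cut off at a limit. This is a small sketch in plain Java; TopTopics and mostFrequent are illustrative names, and breaking ties alphabetically by ID is just an arbitrary choice to keep the order stable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopTopics {
    /** Returns up to limit topic IDs, highest count first. */
    static List<String> mostFrequent(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Ties fall back to the ID so the result is deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```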
There are a lot of ways of actually retrieving the topic information. In many cases the identifiers will be enough - if you can map your own data to Freebase topics then it will just require a look up. You can actually get the whole Freebase data in a data dump, though this is quite large! An alternative is to try to find Freebase entities for parts of your database by using the Freebase search API. This allows finding entities based on natural language queries, so could be good for looking up known entities and storing MIDs.
If we don't have a pre-existing mapping, we can make a call to the Freebase API to get topic details in JSON format with a simple GET request to the URL "https://www.googleapis.com/freebase/v1/topic" with the topic ID appended, for example /m/09xp_. Working out how to fit this in with your own data can be quite tricky, however. There are some guides to the Freebase structure in the developer documentation, where you can start to explore the ways that different topics and types are related, and how it might make sense to map to your own models.
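Building that request URL is just string concatenation, sketched here in Java. The class and method names are made up for illustration, and the helper accepts the MID with or without its leading slash; it does not add any other query parameters your setup might need.

```java
class FreebaseTopicUrl {
    static final String BASE = "https://www.googleapis.com/freebase/v1/topic";

    /** Appends the topic MID to the base URL, adding a leading slash if missing. */
    static String forTopic(String mid) {
        String path = mid.startsWith("/") ? mid : "/" + mid;
        return BASE + path;
    }
}
```

The resulting string can be passed to whichever HTTP client the app already uses for the GET request.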
The easiest way is often to look for notable types in the returned topic data. This should likely give you a general enough category to search again.
"/common/topic/notable_types": {
  "valuetype": "object",
  "values": [
    {
      "text": "Sport",
      "lang": "en",
      "id": "/sports/sport"
    }
  ],
  "count": 1.0
}
If you just want to play around with the YouTube data, the easiest way is to use the Google API explorer, where you can make all the calls and see the responses easily (the parameters hopefully should be clear enough from the code above).