Tuesday, April 19, 2016

Monitoring YouTube Channel Comments with PHP



If you are a programmer who owns or follows YouTube channels and wishes to monitor and keeps track of the latest posted comments programmatically using PHP and MySQL, you can use the YouTube Data API along with an API key provided by Google.  The Google API key is available at the Google developers console and you must enable YouTube Data API there in order to access the interface.

The YouTube RESTful Data API is fairly well documented and contains code samples for at least Java, PHP and Python.  To expedite development, it also has a "Try It!" snippet where you can test your API call live from the page.  You have two authentication options to access the API - OAuth or API key.  For simplicity, we will be using an API key.

In this little hackathon exercise, I'll provide the complete running PHP and MySQL code to crawl the latest channel and video comments from YouTube.  The crawling repeats in an infinite loop (while observing API rate limiting) so the comments are constantly refreshed to the database.  A web page for viewing comments is added with capabilities such as pagination, sorting and searching all thanks to the wonderful javascript Datatables.  The end result looks like this:

Screenshot of crawled comments on web page.


How to Set it Up?  (Go to Bitbucket)


To get it up and running, make sure you have MySQL and PHP set up first.  After that, do the following steps:

1. Download the PHP source code from Bitbucket.  At the time of writing, the repo contains:

crawler.php  Monitors and crawls YouTube channel and video comments
top.secrets.php  Database access info and Google API key
index.php  Simple web UI using Datatables
server_processing.php  DataTables example server-side processing script
ssp.class.php  Helper functions for building DataTables server-side processing SQL query

2. Add database table for storing YouTube comments.

   CREATE TABLE comments (  
     `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,  
     `crawl_timestamp` INT UNSIGNED NOT NULL,  
     `timestamp` INT UNSIGNED NOT NULL,  
     `uid` VARCHAR(100) NOT NULL,  
     `pid` VARCHAR(100),  
     `username` VARCHAR(255) NOT NULL,  
     `message` TEXT NOT NULL,  
     PRIMARY KEY (`id`)  
   ) ENGINE=InnoDB DEFAULT CHARSET=UTF8;  

A simple explanation of the table structure is warranted.

crawl_timestamp  Unix timestamp when comment was crawled from YouTube
timestamp  Unix timestamp when comment was published
uid  Comment ID assigned by YouTube
gid  Parent comment ID. For top level comments, gid is same as uid
username  Commentator username
message  Comment text

Add unique index on uid column to support "replace into" query.

 CREATE UNIQUE INDEX index_comments_uid ON comments (uid);  

3. Configure top.secrets.php with your database access information and Google API key.

4. Execute "php crawler.php" at command console to begin crawling.  The crawled comments are updated to database.


How Does it Work?


The actual PHP code for crawling is fairly simple and short (see code snippet below).  The crawler runs in an infinite loop.  Each loop iteration crawls all thread comments and replies for the specified YouTube channel and updates them to the database.  Upon completion of an iteration, previously crawled comments marked with an earlier crawl timestamp are deleted from database so that the database always stores the latest comments.  In the script, I hardcoded the minimum number of most recent comments to be retrieved to 100.  While it's possible that the crawler may retrieve more than the given minimum, it will not retrieve less than the minimum unless the total number of comments in the channel is less than 100.

 require('top.secrets.php');  
 date_default_timezone_set('UTC');  
 $maxCount = 100; // min number of latest comments to monitor  
 $secondsInYear = 365 * 24 * 60 * 60;  
 while (true) {  
   $crawledThreads = 0;  
   $crawledReplies = 0;  
   $crawlTimestamp = time(); // comments crawled in this loop iteration are timestamped  
   $nextCommentPageToken = null;  
   do {  
     $nextCommentPageToken = fetchThread($key, $channelId, $nextCommentPageToken, $threadJson);  
     sleep(1); // respect api rate limiting  
     // parse each thread and replies  
     if (array_key_exists('items', $threadJson)) {  
       foreach ($threadJson['items'] as $thread) {  
         // parse thread  
         $threadId = $thread['id'];  
         $threadData = parseComment($threadId, $threadId, $thread['snippet']['topLevelComment']);  
         $threadTimestamp = $threadData['timestamp'];  
         // save thread to db  
         persistDatabase($servername, $username, $password, $dbname, $crawlTimestamp, $threadData);  
         $crawledThreads++;  
         // parse replies  
         $totalReplyCount = $thread['snippet']['totalReplyCount'];  
         if ($totalReplyCount) {  
           $replies = $thread['replies']['comments'];  
           if (count($replies) == $totalReplyCount) {  
             // fetch comment replies from thread json  
             foreach ($replies as $reply) {  
               // save reply to db  
               persistDatabase($servername, $username, $password, $dbname, $crawlTimestamp, parseComment($reply['id'], $threadId, $reply));  
               $crawledReplies++;  
             }  
           }  
           else {  
             // fetch comment replies via api call  
             $nextRepliesPageToken = null;  
             do {  
               $nextRepliesPageToken = fetchReplies($key, $threadId, $nextRepliesPageToken, $repliesJson);  
               sleep(1); // respect api rate limiting  
               if (array_key_exists('items', $repliesJson)) {  
                 foreach ($repliesJson['items'] as $reply) {  
                   // save reply to db  
                   persistDatabase($servername, $username, $password, $dbname, $crawlTimestamp, parseComment($reply['id'], $threadId, $reply));  
                   $crawledReplies++;  
                 }  
               }  
             } while ($nextRepliesPageToken);  
           }  
         }  
         // threads older than 1 year have no comments so stop crawling if min comments quota is met  
         if (time() - $threadTimestamp > $secondsInYear && $crawledThreads + $crawledReplies >= $maxCount) {  
           break 2;  
         }  
       }  
     }  
   } while ($nextCommentPageToken);  
   // delete previously crawled comments from db  
   refreshDatabase($servername, $username, $password, $dbname, $crawlTimestamp);  


What's Next?


To keep the code to barebone minimum, I have omitted error handling, logging, batch SQLs and other things to make the code robust and optimized.  Feel free to do anything you'd would like with the code but I'd love to hear what you are doing with it!