Wednesday, December 28, 2016

Create a YouTube metadata crawler using Java

This blog post shows how to extract metadata from a YouTube video page using Java. It makes use of the jsoup library.

YouTube Video Metadata Crawled:
1) Comments
2) Likes/Dislikes
3) Number of subscribers
4) Video Title
5) Number of views for the video
6) Video Description

Language Used:
Java

POM Dependency:
  <dependency>  
       <groupId>org.jsoup</groupId>  
       <artifactId>jsoup</artifactId>  
       <version>1.8.3</version>  
  </dependency>  

Git Repo:
https://github.com/csanuragjain/extra/tree/master/YoutubeMetadataExtractor

Program:

main Method:
      public static void main(String[] args) {  
           String link="";  
           Scanner s=new Scanner(System.in);  
           try {  
                System.out.println("Enter the youtube video for which metadata need to be extracted");  
                link=s.nextLine();  
                Document doc = Jsoup.connect(link).ignoreContentType(true).timeout(5000).get();  
                YoutubeMetadataCrawler ymc=new YoutubeMetadataCrawler();  
                String title=ymc.getTitle(doc);  
                String desc=ymc.getDesc(doc);  
                String views=ymc.getViews(doc);  
                String subscribed=ymc.getPeopleSubscribed(doc);  
                int liked=ymc.getPeopleLiked(doc);  
                int disliked=ymc.getPeopleDisliked(doc);  
                String vid=ymc.getVideoId(link);  
                List<String> comments=ymc.getCommentsDesc(link,vid);  
                System.out.println(title);  
                System.out.println(desc);  
                System.out.println("Video Views: \n"+views);  
                System.out.println("People subscribed: \n"+subscribed);  
                System.out.println("People who liked the video: \n"+liked);  
                System.out.println("People who disliked the video: \n"+disliked);  
                System.out.println("Top Comments: ");  
                int i=0;  
                for(String comment:comments)  
                {  
                     System.out.println(++i+") "+comment);  
                }  
           } catch (IOException e) {  
                System.out.println("JSoup is unable to connect to the website");  
           }  
           finally  
           {  
                s.close();  
           }  
      }  

How it works:
1) We use a Scanner object to read the YouTube video URL from the user.
2) We use jsoup to connect to the video page and fetch it as a Document.
3) We call the helper methods to obtain the title, description, views, subscriber count, likes, dislikes and comments, and print them to the user.

removeUTFCharacters method:
      public static String removeUTFCharacters(String data){  
           Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");  
           Matcher m = p.matcher(data);  
           StringBuffer buf = new StringBuffer(data.length());  
           while (m.find()) {  
           String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));  
           m.appendReplacement(buf, Matcher.quoteReplacement(ch));  
           }  
           m.appendTail(buf);  
           return new String(buf);  
           }  

How it works:
1) This method decodes Unicode escape sequences (a backslash, the letter u and four hex digits) that appear as literal text in a string.
2) It uses a regular expression to find every such escape sequence in the string passed.
3) Each escape sequence found is replaced with the character it represents.
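To see the effect, here is a small self-contained sketch of the same technique. The class name UnicodeEscapeDemo and the method name decodeUnicodeEscapes are made up for this example; the logic mirrors removeUTFCharacters above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Hypothetical name; same idea as removeUTFCharacters in the post
    static String decodeUnicodeEscapes(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            // Turn the four hex digits into the character they encode
            char ch = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the result from being misread
            m.appendReplacement(buf, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        // The input holds the escapes as plain text, which is how they
        // show up in the comment response
        System.out.println(decodeUnicodeEscapes("\\u0048\\u0069 there")); // prints: Hi there
    }
}
```

Text without any escape sequence passes through unchanged, so it is safe to call this on every comment.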

getCommentToken method:
      public String getCommentToken(Response res)  
      {  
           String pageSource=res.body();  
           String commentToken=pageSource.substring(pageSource.indexOf("COMMENTS_TOKEN': \"")+18);  
           commentToken=commentToken.substring(0,commentToken.indexOf("\""));  
           commentToken=commentToken.replaceAll("%", "%25");  
           return commentToken;  
      }  

How it works:
1) If you open any YouTube video page, you will find the token below in the page source. YouTube uses it when loading comments.
 'COMMENTS_TOKEN': "SOME_TOKEN",  
2) The method receives a Response object as its argument. It holds the response returned when the user-provided video URL is requested.
3) First we obtain the page source by calling the body method on the Response object.
4) We then search the page source for the text COMMENTS_TOKEN': " and extract everything up to the next ". This is the comment token.
5) Every % in the token is escaped as %25, since the token is later placed in a request URL, and the token is returned.
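The method above assumes the marker is always present. If YouTube changes its page source, indexOf returns -1 and the substring calls give back unrelated text. A minimal sketch of a more defensive variant follows. The class TokenExtractDemo, the helper extractAfter and the sample page fragment are all made up for illustration; real pages and tokens will look different.

```java
import java.util.Optional;

final class TokenExtractDemo {
    // Returns the text between `marker` and the next double quote,
    // or an empty Optional when either of them is missing.
    static Optional<String> extractAfter(String pageSource, String marker) {
        int start = pageSource.indexOf(marker);
        if (start < 0) {
            return Optional.empty();
        }
        start += marker.length();
        int end = pageSource.indexOf('"', start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(pageSource.substring(start, end));
    }

    public static void main(String[] args) {
        // Invented fragment that only imitates the shape described in the post
        String page = "{'COMMENTS_TOKEN': \"EhYSC3%3D%3D\", 'XSRF_TOKEN': \"abc123\"}";
        String token = extractAfter(page, "COMMENTS_TOKEN': \"")
                .map(t -> t.replace("%", "%25")) // same escaping as getCommentToken
                .orElse("");
        System.out.println(token); // prints: EhYSC3%253D%253D
    }
}
```

Because the marker is passed in, the same helper could also be used for the XSRF token by calling it with XSRF_TOKEN': " as the marker.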


getSessionToken method:
 public String getSessionToken(Response res)  
      {  
           String pageSource=res.body();  
           String xsrfToken=pageSource.substring(pageSource.indexOf("XSRF_TOKEN': \"")+14);  
           xsrfToken=xsrfToken.substring(0,xsrfToken.indexOf("\""));  
           return xsrfToken;  
      }  

How it works:
1) If you open any YouTube video page, you will find the token below in the page source. YouTube uses it for CSRF protection.
 'XSRF_TOKEN': "SOME_TOKEN",  
2) The method receives a Response object as its argument. It holds the response returned when the user-provided video URL is requested.
3) First we obtain the page source by calling the body method on the Response object.
4) We then search the page source for the text XSRF_TOKEN': " and extract everything up to the next ". This is the XSRF token.
5) The XSRF token is returned.

getVideoId method:
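 public String getVideoId(String url)  
      {  
           // Take everything after "v=", then cut at "&" if more parameters follow  
           url=url.substring(url.indexOf("v=")+2);  
           if(url.contains("&"))  
           {  
                url=url.substring(0,url.indexOf("&"));  
           }  
           return url;  
      }  

How it works:
1) A YouTube watch URL carries the video id in a query parameter named v.
2) We find v= in the URL and take the text that follows it.
3) If other parameters follow (they are separated by &), we cut the value at the first & and return the video id.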
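Searching for v= works for the usual watch URL, but it can also match inside another parameter name and it does not notice when v is missing. As a rough alternative sketch, the query string can be split into parameters with java.net.URI. The class VideoIdDemo and the method videoId are hypothetical names, and the sketch assumes the input is a syntactically valid URL (URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;
import java.util.Optional;

final class VideoIdDemo {
    // Returns the value of the "v" query parameter, if the URL has one.
    static Optional<String> videoId(String url) {
        String query = URI.create(url).getRawQuery(); // null when there is no query part
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            // Only accept a parameter whose name is exactly "v" and whose value is non-empty
            if (pair.startsWith("v=") && pair.length() > 2) {
                return Optional.of(pair.substring(2));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(videoId("https://www.youtube.com/watch?v=x_7YlGv9u1g&t=30s").orElse("none"));
        // prints: x_7YlGv9u1g
    }
}
```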

getTitle method:
 public String getTitle(Document doc)  
      {  
           return doc.select("#eow-title").text();  
      }  

How it works:
1) The YouTube video page holds the video title in an HTML element with the id eow-title:
 <span id="eow-title" class="watch-title" dir="ltr" title="Dangal | Official Trailer | Aamir Khan | In Cinemas Dec 23, 2016">  
   Dangal | Official Trailer | Aamir Khan | In Cinemas Dec 23, 2016  
  </span>  
2) The Document object holds the parsed HTML of the video page.
3) We use the Document to select the element with the id eow-title and read its text.
4) We return the title.

getViews method:
 public String getViews(Document doc)  
      {  
           return doc.select(".watch-view-count").text();  
      }  

How it works:
1) The YouTube video page holds the view count in an HTML element with the class watch-view-count:
 <div class="watch-view-count">40,095,041 views</div>  
2) The Document object holds the parsed HTML of the video page.
3) We use the Document to select the element with the class watch-view-count and read its text.
4) We return the views as text, for example 40,095,041 views.
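The view count comes back as display text such as 40,095,041 views. If a number is needed, one possible approach is to strip everything that is not a digit before parsing. The sketch below is only an illustration: the class CountParseDemo and the method parseCount are made-up names, and it assumes the page shows the full number with separators rather than an abbreviated form like 1.2M.

```java
import java.util.OptionalLong;

final class CountParseDemo {
    // Converts display text like "40,095,041 views" into a number.
    // Returns an empty OptionalLong when the text holds no usable digits,
    // e.g. when the selector matched nothing and text() returned "".
    static OptionalLong parseCount(String text) {
        String digits = text.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(digits));
        } catch (NumberFormatException e) {
            // More digits than a long can hold
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCount("40,095,041 views").getAsLong()); // prints: 40095041
        System.out.println(parseCount("").isPresent());                 // prints: false
    }
}
```

The same helper would also avoid the NumberFormatException that Integer.parseInt throws on an empty string if the like or dislike element is not found.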

getDesc method:
 public String getDesc(Document doc)  
      {  
           return doc.select("#watch-description-clip").text();  
      }  

How it works:
1) The YouTube video page holds the video description in an HTML element with the id watch-description-clip:
 <div id="watch-description-clip"><div id="watch-uploader-info"><strong class="watch-time-text">Some description  
2) The Document object holds the parsed HTML of the video page.
3) We use the Document to select the element with the id watch-description-clip and read its text.
4) We return the video description.

getPeopleSubscribed method:
 public String getPeopleSubscribed(Document doc)  
      {  
           return doc.select(".yt-subscriber-count").text().replace(",", "");  
      }  

How it works:
1) The YouTube video page holds the channel's subscriber count in an HTML element with the class yt-subscriber-count:
 <span class="yt-subscription-button-subscriber-count-branded-horizontal yt-subscriber-count" title="1,057,873" aria-label="1,057,873" tabindex="0">1,057,873</span>  
2) The Document object holds the parsed HTML of the video page.
3) We use the Document to select the element with the class yt-subscriber-count and read its text.
4) We remove the commas and return the subscriber count.

getPeopleLiked method:
 public int getPeopleLiked(Document doc)  
      {  
           return Integer.parseInt(doc.select("button.like-button-renderer-like-button-unclicked span").text().replace(",", ""));  
      }  

How it works:
1) The YouTube video page holds the like count in a span inside the button with the class like-button-renderer-like-button-unclicked:
 <button class="yt-uix-button yt-uix-button-size-default yt-uix-button-opacity yt-uix-button-has-icon no-icon-markup like-button-renderer-like-button like-button-renderer-like-button-unclicked yt-uix-clickcard-target  yt-uix-tooltip" type="button" onclick=";return false;" title="I like this" aria-label="like this video along with 327,229 other people" data-force-position="true" data-orientation="vertical" data-position="bottomright" data-tooltip-text="I like this" aria-labelledby="yt-uix-tooltip620-arialabel"><span class="yt-uix-button-content">327,229</span></button>  
2) The Document object holds the parsed HTML of the video page.
3) We use the Document to select the span inside the button with the class like-button-renderer-like-button-unclicked and read its text.
4) We remove the commas, parse the text as an integer and return the like count.

getPeopleDisliked method:
 public int getPeopleDisliked(Document doc)  
      {  
           return Integer.parseInt(doc.select("button.like-button-renderer-dislike-button-unclicked span").text().replace(",", ""));  
      }  

How it works:
1) The YouTube video page holds the dislike count in a span inside the button with the class like-button-renderer-dislike-button-unclicked:
 <button class="yt-uix-button yt-uix-button-size-default yt-uix-button-opacity yt-uix-button-has-icon no-icon-markup like-button-renderer-dislike-button like-button-renderer-dislike-button-unclicked yt-uix-clickcard-target  yt-uix-tooltip" type="button" onclick=";return false;" title="I dislike this" aria-label="dislike this video along with 28,697 other people" data-force-position="true" data-orientation="vertical" data-position="bottomright" data-tooltip-text="I dislike this" aria-labelledby="yt-uix-tooltip621-arialabel"><span class="yt-uix-button-content">28,697</span></button>  
2) The Document object holds the parsed HTML of the video page.
3) We use the Document to select the span inside the button with the class like-button-renderer-dislike-button-unclicked and read its text.
4) We remove the commas, parse the text as an integer and return the dislike count.

Output:
 Enter the youtube video for which metadata need to be extracted  
 https://www.youtube.com/watch?v=x_7YlGv9u1g  
 Dangal | Official Trailer | Aamir Khan | In Cinemas Dec 23, 2016  
 Published on Oct 19, 2016 Dangal is an extraordinary true story based on the life of Mahavir Singh and his two daughters, Geeta and Babita Phogat. The film traces the inspirational journey of a father who trains his daughters to become world class wrestlers. Release Date: 23rd December 2016 Starring: Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh, Sanya Malhotra, Zaira Wasim, Suhani Bhatnagar Directed By: Nitesh Tiwari Written By: Nitesh Tiwari, Shreyas Jain, Piyush Gupta, Nikhil Mehrotra Produced By: Aamir Khan, Kiran Rao & Siddharth Roy Kapur Music: Pritam Lyrics: Amitabh Bhattacharya Director of Photography: Setu Production Designer: Laxmi Keluskar & Sandeep Meher Editor: Ballu Saluja Casting Director: Mukesh Chhabra Costume Designer: Maxima Basu Wrestling Choreography and Coach: Kripashankar Patel Bishnoi Action Director: Sham Kaushal Sound Designer: Shajith Koyeri SUBSCRIBE UTV Motion Pictures: http://www.youtube.com/utvmotionpictures Keep up with UTV Motion Pictures on: TWITTER: https://twitter.com/utvfilms FACEBOOK: https://www.facebook.com/utvmotionpic... INSTAGRAM: http://instagram.com/utvfilms GOOGLE+: https://plus.google.com/+UTVMotionPic... PINTEREST: http://pinterest.com/utvfilms/ Category Entertainment License Standard YouTube License  
 Video Views:   
 42,718,537 views  
 People subscribed:   
 1072189  
 People who liked the video:   
 333530  
 People who disliked the video:   
 29338  
 Top Comments:   
 1) Where are the haters who wanted to boycott this movie. The movie has already crossed 100 crores in 3 days.  
 2) No Words To Express The AWESOMENESS Of This Movie...😘😍😎😯  
 3) AWESOME MOVIE BEST MOVIE OF 2016 YALL WATCH IT  

Full Program:
 package com.cooltrickshome;  
 import java.io.IOException;  
 import java.util.ArrayList;  
 import java.util.List;  
 import java.util.Scanner;  
 import java.util.regex.Matcher;  
 import java.util.regex.Pattern;  
 import org.jsoup.Connection.Response;  
 import org.jsoup.Jsoup;  
 import org.jsoup.nodes.Document;  
 import org.jsoup.nodes.Element;  
 import org.jsoup.select.Elements;  
 public class YoutubeMetadataCrawler {  
      /**  
       * @param args  
       */  
      public static void main(String[] args) {  
           String link="";  
           Scanner s=new Scanner(System.in);  
           try {  
                System.out.println("Enter the youtube video for which metadata need to be extracted");  
                link=s.nextLine();  
                Document doc = Jsoup.connect(link).ignoreContentType(true).timeout(5000).get();  
                YoutubeMetadataCrawler ymc=new YoutubeMetadataCrawler();  
                String title=ymc.getTitle(doc);  
                String desc=ymc.getDesc(doc);  
                String views=ymc.getViews(doc);  
                String subscribed=ymc.getPeopleSubscribed(doc);  
                int liked=ymc.getPeopleLiked(doc);  
                int disliked=ymc.getPeopleDisliked(doc);  
                String vid=ymc.getVideoId(link);  
                List<String> comments=ymc.getCommentsDesc(link,vid);  
                System.out.println(title);  
                System.out.println(desc);  
                System.out.println("Video Views: \n"+views);  
                System.out.println("People subscribed: \n"+subscribed);  
                System.out.println("People who liked the video: \n"+liked);  
                System.out.println("People who disliked the video: \n"+disliked);  
                System.out.println("Top Comments: ");  
                int i=0;  
                for(String comment:comments)  
                {  
                     System.out.println(++i+") "+comment);  
                }  
           } catch (IOException e) {  
                System.out.println("JSoup is unable to connect to the website");  
           }  
           finally  
           {  
                s.close();  
           }  
      }  
      public static String removeUTFCharacters(String data){  
           Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");  
           Matcher m = p.matcher(data);  
           StringBuffer buf = new StringBuffer(data.length());  
           while (m.find()) {  
           String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));  
           m.appendReplacement(buf, Matcher.quoteReplacement(ch));  
           }  
           m.appendTail(buf);  
           return new String(buf);  
           }  
      public String getTitle(Document doc)  
      {  
           return doc.select("#eow-title").text();  
      }  
      public String getViews(Document doc)  
      {  
           return doc.select(".watch-view-count").text();  
      }  
      public String getDesc(Document doc)  
      {  
           return doc.select("#watch-description-clip").text();  
      }  
      public List<String> getCommentsDesc(String link, String vid)  
      {  
           List<String> comments=new ArrayList<>();  
           try {  
                Response response = Jsoup.connect(link).ignoreContentType(true).timeout(5000).execute();  
                Document doc = Jsoup.connect("https://www.youtube.com/watch_fragments_ajax?v="+vid+"&tr=time&distiller=1&ctoken="+getCommentToken(response)+"&frags=comments&spf=load")  
                     .ignoreContentType(true)  
                     .cookies(response.cookies())  
                     .header("X-Client-Data", "")  
                     .timeout(5000)  
                     .data("session_token",getSessionToken(response))  
                     .post();  
                String commentSource=doc.body().text();  
                while(commentSource.indexOf("comment-renderer-text-content")>-1)  
                {  
                     int pos=commentSource.indexOf("comment-renderer-text-content")+37;  
                     commentSource=commentSource.substring(pos);  
                     //System.out.println(commentSource);  
                     String comment=commentSource.substring(0,commentSource.indexOf("div")-8);  
                     comments.add(removeUTFCharacters(comment));  
                }  
           Elements e= doc.select(".comment-renderer-text-content");  
           for(Element e1:e)  
           {  
                comments.add(e1.text());  
           }  
           } catch (IOException e2) {  
                System.out.println("Unable to retrieve comments "+e2.getMessage());  
                e2.printStackTrace();  
           }  
           return comments;  
      }  
      public String getCommentToken(Response res)  
      {  
           String pageSource=res.body();  
           String commentToken=pageSource.substring(pageSource.indexOf("COMMENTS_TOKEN': \"")+18);  
           commentToken=commentToken.substring(0,commentToken.indexOf("\""));  
           commentToken=commentToken.replaceAll("%", "%25");  
           return commentToken;  
      }  
      public String getSessionToken(Response res)  
      {  
           String pageSource=res.body();  
           String xsrfToken=pageSource.substring(pageSource.indexOf("XSRF_TOKEN': \"")+14);  
           xsrfToken=xsrfToken.substring(0,xsrfToken.indexOf("\""));  
           return xsrfToken;  
      }  
      public String getVideoId(String url)  
      {  
           url=url.substring(url.indexOf("v=")+2);  
           if(url.contains("&"))  
           {  
                url=url.substring(0,url.indexOf("&"));  
           }  
           return url;  
      }  
      public String getPeopleSubscribed(Document doc)  
      {  
           return doc.select(".yt-subscriber-count").text().replace(",", "");  
      }  
      public int getPeopleLiked(Document doc)  
      {  
           return Integer.parseInt(doc.select("button.like-button-renderer-like-button-unclicked span").text().replace(",", ""));  
      }  
      public int getPeopleDisliked(Document doc)  
      {  
           return Integer.parseInt(doc.select("button.like-button-renderer-dislike-button-unclicked span").text().replace(",", ""));  
      }  
 }  


Hope it helps :)

6 comments:

  1. Hey Bro! Thanks For Knowledge Sharing. Everything is work fine, but comments crawling is not working! May be YouTube has change its Comments_Token. Can You Fix That Pleassss

    1. Updated the program, it should work fine now :)
