[C#] Regex Notes: Fetching YouTube Video Information from the Web Page
2012-11-01
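I wrote about this before, but someone asked about it, so I recently tidied it up again
and figured I would write it down.
Basically, instead of calling the API, we download the web page and parse the relevant information out of it.
Please note that this article is for teaching purposes only. Do not use it for anything illegal; if you do, the legal responsibility is yours alone.
Also, with this approach, everything stops working as soon as YouTube changes its page markup,
so calling the API is the proper way to go.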
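The information I want to retrieve is as follows:
YoutubeURL – the video's URL
Id – the video's Id
Title – the video's title
Intro – the video's description
ImageLarge – the video's large thumbnail
ImageSmall – the video's small thumbnail
Let's go straight to the class that fetches these with Regex: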
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace FatchYoutueInfo
{
    public class FatchU2BUtility
    {
        public string YoutubeURL { get; private set; }
        public string Id { get; private set; }
        public string Title { get; private set; }
        public string Intro { get; private set; }
        public string ImageLarge { get; private set; }
        public string ImageSmall { get; private set; }

        public FatchU2BUtility(string youtubeURL)
        {
            var src = GetSourceFromUrl(youtubeURL);

            // Description: <p id="eow-description" >...</p>
            // Singleline lets "." match newlines in case the description spans several lines.
            var regexIntro = new Regex(
                @"(p id=""eow-description"" >)(?<INTRO>.*?)(</p>)",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
            Match matchIntro = regexIntro.Match(src);

            // Title: <meta name="title" content="...">
            var regexTitle = new Regex(
                @"(<meta name=""title"" content="")(?<TITLE>.*?)("">)",
                RegexOptions.IgnoreCase);
            Match matchTitle = regexTitle.Match(src);

            // Video Id: the data-video-id attribute
            var regexId = new Regex(
                @"(data-button-menu-id=""some-nonexistent-menu"" data-video-id="")(?<ID>.*?)("")",
                RegexOptions.IgnoreCase);
            Match matchId = regexId.Match(src);

            // Fail loudly when the page markup no longer matches a pattern.
            if (!matchIntro.Success)
                throw new Exception("Can't find Intro");
            Intro = matchIntro.Groups["INTRO"].Value;

            if (!matchTitle.Success)
                throw new Exception("Can't find Title");
            Title = matchTitle.Groups["TITLE"].Value;

            if (!matchId.Success)
                throw new Exception("Can't find Id");
            Id = matchId.Groups["ID"].Value;

            // The thumbnails and the canonical URL are all derived from the Id.
            ImageSmall = "http://img.youtube.com/vi/" + Id + "/2.jpg";
            ImageLarge = "http://img.youtube.com/vi/" + Id + "/0.jpg";
            YoutubeURL = "http://www.youtube.com/watch?v=" + Id;
        }

        /// <summary>
        /// Downloads the HTML source of a page.
        /// </summary>
        /// <param name="url">Address of the page to download.</param>
        /// <returns>The page source as a string.</returns>
        private string GetSourceFromUrl(string url)
        {
            // WebClient is IDisposable, so release it as soon as the download finishes.
            using (var client = new WebClient())
            {
                // Just in case, present ourselves as a browser.
                client.Headers.Add("User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5");
                client.Headers.Add("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
                client.Headers.Add("Accept-Encoding: identity");
                client.Headers.Add("Accept-Language: zh-TW,en;q=0.8");
                client.Headers.Add("Accept-Charset: utf-8;q=0.7,*;q=0.3");
                // No Content-Type header is sent: a GET request has no body to describe.
                client.Encoding = Encoding.UTF8;
                return client.DownloadString(url);
            }
        }
    }
}
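To try the named-group technique without touching the network, the title pattern can be run against a hand-written HTML string. The following is only a minimal sketch: the class name MetaTitleDemo and the sample markup are made up for illustration, and the call to WebUtility.HtmlDecode is an addition that assumes the title will be used as plain text (the attribute value in the page source is HTML-escaped).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the class above.
public static class MetaTitleDemo
{
    // Same idea as regexTitle above: TITLE is a named capture group,
    // and ".*?" is lazy so it stops at the first closing quote + ">".
    private static readonly Regex TitleRegex = new Regex(
        @"<meta name=""title"" content=""(?<TITLE>.*?)"">",
        RegexOptions.IgnoreCase);

    // Returns the decoded title, or null when the markup does not match.
    public static string ExtractTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        if (!m.Success)
            return null;

        // Turn entities such as &amp; back into plain characters.
        return WebUtility.HtmlDecode(m.Groups["TITLE"].Value);
    }
}
```

Calling MetaTitleDemo.ExtractTitle on a string containing <meta name="title" content="Tom &amp; Jerry"> gives back the text Tom & Jerry, and a page without that tag gives null instead of an exception. Whether to return null or throw, as the class above does, is a matter of taste.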
Here is how to use it:
try
{
    FatchU2BUtility util = new FatchU2BUtility(txtURL.Text);
    ltlResult.Text += "Title:" + util.Title + "<br />";
    ltlResult.Text += "Intro:" + util.Intro + "<br />";
    ltlResult.Text += "URL:" + util.YoutubeURL + "<br />";
    ltlResult.Text += "Id:" + util.Id + "<br />";
    ltlResult.Text += "Image Small:" + "<img src='" + util.ImageSmall + "' />" + "<br />";
    ltlResult.Text += "Image Large:" + "<img src='" + util.ImageLarge + "' />" + "<br />";
}
catch
{
    // Any failure (download error or a pattern that no longer matches) ends up here.
    ltlResult.Text = "Sorry, I couldn't fetch it";
}
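You may find it odd that the constructor takes a URL and the class then hands a URL back at the end.
The reason is that people often do not supply a YouTube URL in the canonical form, which looks like this: http://www.youtube.com/watch?v=0cay2dnuhcs
So after the class has done its work, I still want to get back a canonical, uniform URL.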
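When only the Id is needed, the same normalization can be sketched without downloading the page at all. Everything in the block below is an assumption rather than part of the class above: the helper name YoutubeUrlHelper, the three URL shapes it recognizes (youtu.be/ID, a v= query parameter, /embed/ID), and the premise that a video Id is 11 characters drawn from letters, digits, "_" and "-". The canonical URL and thumbnail URLs are built the same way as in the constructor.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class YoutubeUrlHelper
{
    // Assumed URL shapes: youtu.be/ID, ...?v=ID or ...&v=ID, and /embed/ID.
    // The negative lookahead rejects candidates longer than 11 characters.
    private static readonly Regex IdRegex = new Regex(
        @"(?:youtu\.be/|[?&]v=|/embed/)(?<ID>[A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])");

    // Returns the video Id, or null when no known shape matches.
    public static string ExtractId(string url)
    {
        Match m = IdRegex.Match(url ?? string.Empty);
        return m.Success ? m.Groups["ID"].Value : null;
    }

    // Builds the canonical watch URL from an Id.
    public static string ToCanonicalUrl(string id)
    {
        return "http://www.youtube.com/watch?v=" + id;
    }

    // Builds a thumbnail URL: 0.jpg is the large image, 2.jpg the small one.
    public static string ToThumbnailUrl(string id, bool large)
    {
        return "http://img.youtube.com/vi/" + id + (large ? "/0.jpg" : "/2.jpg");
    }
}
```

For example, YoutubeUrlHelper.ExtractId("http://youtu.be/0cay2dnuhcs") and the same call with "http://www.youtube.com/watch?feature=related&v=0cay2dnuhcs" both yield 0cay2dnuhcs, which ToCanonicalUrl turns back into the uniform address. The title and description still require fetching the page (or calling the API), so this only complements the class above.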