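1. Introduction
I have been working with the Abot crawler for a few days now, and with some free time on my hands I decided to crawl some film data from IMDB for fun. Captain America 3 happens to be in cinemas, so the plan is to crawl Marvel's films from recent years and use the JS library vis to display the films of the Marvel universe.
Abot is an open-source C# crawler with a very lightweight codebase. For an introduction to Abot, see the article "Using Abot to crawl cnblogs news data".
Vis is a JS visualization library, similar to D3. It provides visualizations such as the Network graph and the Timeline. This post uses Network: you only need to give vis some simple node information and edge information, and it builds the network graph automatically.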
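2. Implementation
Start with the data: the names of all the films in the Marvel universe. This list is easy to find online:
Going from a film name to its IMDB page really involves a search step. Fortunately there are not many films, so I took a shortcut and used the IMDB film links directly as seed URLs: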
public static List<string> ImdbFeedMovies = new List<string>()
{
    // Iron Man (2008)
    "http://www.imdb.com/title/tt1233205/",
    // The Incredible Hulk (2008)
    "http://www.imdb.com/title/tt0800080/",
    // Iron Man 2 (2010)
    "http://www.imdb.com/title/tt1228705/",
    // Thor (2011)
    "http://www.imdb.com/title/tt0800369/",
    // Captain America
    "http://www.imdb.com/title/tt0458339/",
    // The Avengers
    "http://www.imdb.com/title/tt0848228/",
    // Iron Man 3
    "http://www.imdb.com/title/tt1300854/",
    // Thor 2
    "http://www.imdb.com/title/tt1981115/",
    // Captain America 2
    "http://www.imdb.com/title/tt1843866/",
    // Guardians of the Galaxy
    "http://www.imdb.com/title/tt2015381/",
    // Avengers: Age of Ultron
    "http://www.imdb.com/title/tt2395427/",
    // Ant-Man
    "http://www.imdb.com/title/tt0478970/",
    // Captain America: Civil War
    "http://www.imdb.com/title/tt3498820/",
    // Doctor Strange
    "http://www.imdb.com/title/tt1211837/",
    // Guardians of the Galaxy 2
    "http://www.imdb.com/title/tt3896198/",
    // Thor 3
    "http://www.imdb.com/title/tt3501632/",
    // Black Panther
    "http://www.imdb.com/title/tt1825683/",
    // Avengers: Infinity War - Part I
    "http://www.imdb.com/title/tt4154756/"
};
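With the seed URLs in place, Abot can crawl the film data. Here I only crawl the film name, the film poster, and the cast.
First, define the data structures that will be needed: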
public class MarvellItem
{
    /// <summary>
    /// http://www.imdb.com/title/tt0800369/
    /// </summary>
    public string ImdbUrl { get; set; }
    public string Name { get; set; }
    public string Image { get; set; }
}

public class ImdbMovie
{
    public string ImdbUrl { get; set; }
    public string Name { get; set; }
    public string Image { get; set; }
    public DateTime Date { get; set; }
    public List<MarvellItem> Actors { get; set; }
}

// Matches IMDB film pages; the dots are escaped so they only match a literal "."
public static readonly Regex MovieRegex = new Regex(@"http://www\.imdb\.com/title/tt\d+", RegexOptions.Compiled);
In Abot, the main handler that runs after a page has been crawled is PageCrawlCompletedAsync. Here is the completion callback that runs after each film page is crawled:
private ConcurrentDictionary<string, ImdbMovie> movieResult; // crawled film data, keyed by page URL

public void Moviecrawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
{
    if (MovieRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
    {
        // Film title
        var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > h1");
        string title = HtmlData.HtmlDecode(csTitle.Text().Trim());

        // Release date, read from the <meta> tag inside the last link of the subtext line
        var datetime =
            e.CrawledPage.CsQueryDocument.Select(
                ".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > .subtext > a:last > meta");
        var year = (datetime.Attr("content") ?? string.Empty).Trim();
        // TryParse leaves releaseDate at DateTime.MinValue when the value is missing or unparsable
        DateTime releaseDate;
        DateTime.TryParse(year, out releaseDate);

        // Poster: Attr may return null when nothing matches, so guard before trimming
        var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
        string image = (csImg.Attr("src") ?? string.Empty).Trim();
        if (!string.IsNullOrEmpty(image))
        {
            HttpWebRequest webRequest = (HttpWebRequest) WebRequest.Create(image);
            webRequest.Credentials = CredentialCache.DefaultCredentials;
            // Dispose the response, the stream and the bitmap once the file is saved
            using (var response = webRequest.GetResponse())
            using (var stream = response.GetResponseStream())
            {
                if (stream != null)
                {
                    using (Image bitmap = new Bitmap(stream))
                    {
                        // From here on, image holds the local file name instead of the remote URL
                        image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                        bitmap.Save(image);
                    }
                }
            }
        }

        // Cast table: rows that contain a link are treated as actors
        var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
        var csTrs = csTable.Select("tr", csTable);
        List<MarvellItem> actors = new List<MarvellItem>();
        foreach (var tr in csTrs)
        {
            var csTr = new CsQuery.CQ(tr);
            var cslink = csTr.Select("td > a", csTr);
            if (cslink.Any())
            {
                string url = NormUrl(cslink.Attr("href").Trim());
                string actorTitle = cslink.Select("img", cslink).Attr("title").Trim();
                string actorImage = cslink.Select("img", cslink).Attr("src").Trim();
                actors.Add(new MarvellItem()
                {
                    Name = actorTitle,
                    ImdbUrl = url,
                    Image = actorImage
                });
            }
        }

        this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new ImdbMovie()
        {
            Name = title,
            Image = image,
            Date = releaseDate,
            ImdbUrl = e.CrawledPage.Uri.AbsoluteUri,
            Actors = actors
        });
    }
}
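The main job of this function is to parse the film page and extract the film name, the film poster, and the cast. There is one small trick here: because of IMDB's restrictions, the crawled images have to be downloaded, otherwise an <img src=""/> that points at them will not display in a production environment.
For more details on this trick, see "Some thoughts on img 403 forbidden".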
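One detail of the download step is the local file name. The C# code derives it from GetHashCode() on the page URL; that value can be negative, and .NET does not guarantee that string hash codes stay the same across versions or platforms. Below is a minimal sketch of a more stable naming scheme, written in JavaScript (Node.js) to match the front-end code later in this post: hash the page URL with SHA-1 and use the hex digest as the file name. The function name localImageName is hypothetical, and SHA-1 is a substitution made for illustration, not what the code above does.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive a stable local file name from a page URL.
// The same URL always yields the same 40-character hex name.
function localImageName(pageUrl, extension = ".jpg") {
  const digest = crypto.createHash("sha1").update(pageUrl, "utf8").digest("hex");
  return digest + extension;
}

// Example: name for the Iron Man page used as a seed URL above.
console.log(localImageName("http://www.imdb.com/title/tt1233205/"));
```

Because the name depends only on the URL, re-running the crawl overwrites the same file instead of leaving differently named copies behind.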
All of the film links can be crawled in parallel using Tasks:
Task[] movieTasks = new Task[ImdbFeedMovies.Count];
System.Console.WriteLine("Start crawl Movies");
for (var i = 0; i < ImdbFeedMovies.Count; i++)
{
    // Loop-local copy, so each closure captures its own URL
    var url = ImdbFeedMovies[i];
    movieTasks[i] = new Task(() =>
    {
        System.Console.WriteLine("Start crawl:" + url);
        // One crawler instance per film page
        var crawler = GetManuallyConfiguredWebCrawler();
        ConfigMovieCrawl(crawler);
        crawler.Crawl(new Uri(url));
        System.Console.WriteLine("End crawl:" + url);
    });
    movieTasks[i].Start();
}
// Block until every crawl has finished
Task.WaitAll(movieTasks);
System.Console.WriteLine("End crawl Movies");
When the crawl finishes, we end up with a pile of JSON data.
Pass it to the front end:
@model List<ImdbMovie>
<div class="clearfix" style=" position: relative">
    <div id="marvel-graph">
    </div>
</div>
@section PostScripts{
    <script type="text/javascript">
        $(function () {
            var nodes = [];
            var edges = [];
            @for (int i = 0; i < Model.Count; i++)
            {
                var film = Model[i];
                <text>
                    nodes.push({
                        id: '@film.ImdbUrl',
                        title: '@film.Name',
                        borderWidth: 4,
                        shapeProperties: {useBorderWithImage: true},
                        shape: "image",
                        image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../Images/marvel/"+film.Image)))',
                        color: { border: '#4db6ac', background: '#009688' }
                    });
                    @if (i != Model.Count - 1)
                    {
                        <text>
                            edges.push({
                                from: '@film.ImdbUrl',
                                to: '@Model[i+1].ImdbUrl',
                                arrows: { to: true },
                                width: 4,
                                length:360,
                                color: "red"
                            });
                        </text>
                    }
                    @foreach (var actor in film.Actors)
                    {
                        <text>
                            nodes.push({
                                id: '@film.ImdbUrl' + '@actor.ImdbUrl',
                                title: '@actor.Name',
                                borderWidth: 4,
                                shapeProperties: { useBorderWithImage: true },
                                shape: "circularImage",
                                image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../Images/marvel/"+actor.Image)))',
                            });
                            edges.push({
                                from: '@film.ImdbUrl',
                                to: '@film.ImdbUrl' + '@actor.ImdbUrl',
                                arrows: { to: true }
                            });
                        </text>
                    }
                </text>
            }
            var container = document.getElementById("marvel-graph");
            var visNodes = new vis.DataSet(nodes);
            var data = {
                nodes: visNodes,
                edges: edges
            };
            var options = {
                layout: { improvedLayout: false },
                nodes: {
                    borderWidth: 3,
                    font: {
                        color: '#000000',
                        size: 12,
                        face: 'Segoe UI'
                    },
                    color: { background: '#4db6ac', border: '#009688' }
                },
                edges: {
                    color: '#c1c1c1',
                    width: 2,
                    font: {
                        color: '#2d2d2d',
                        size: 12
                    },
                    smooth: {
                        enabled: false,
                        type: 'continuous'
                    }
                }
            };
            var network = new vis.Network(container, data, options);
        });
    </script>
}
The vis network part boils down to new Network(container, data, options): pass in the nodes and the edges, and vis takes care of the rest.
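The node and edge lists that the template writes out inline can also be assembled by a plain function, which is easier to check outside the page. The following is a minimal sketch, assuming the crawl results arrive as an array of objects with the same fields as ImdbMovie (ImdbUrl, Name, Image, Actors). The name buildGraph is hypothetical and is not part of vis, and the styling properties are trimmed to the essentials.

```javascript
// Hypothetical helper: turns crawled films into vis-style node and edge lists.
// Assumes each film looks like
//   { ImdbUrl, Name, Image, Actors: [{ ImdbUrl, Name, Image }] }.
function buildGraph(movies) {
  const nodes = [];
  const edges = [];
  movies.forEach((film, i) => {
    nodes.push({ id: film.ImdbUrl, title: film.Name, shape: "image", image: film.Image });
    // Chain the films in list order, like the red arrows in the template.
    if (i !== movies.length - 1) {
      edges.push({ from: film.ImdbUrl, to: movies[i + 1].ImdbUrl, arrows: { to: true }, color: "red" });
    }
    for (const actor of film.Actors || []) {
      // Prefix with the film URL so an actor who appears in two films
      // gets two distinct node ids, one next to each film.
      const actorId = film.ImdbUrl + actor.ImdbUrl;
      nodes.push({ id: actorId, title: actor.Name, shape: "circularImage", image: actor.Image });
      edges.push({ from: film.ImdbUrl, to: actorId, arrows: { to: true } });
    }
  });
  return { nodes, edges };
}
```

In the page, the result could be handed to vis in the same way as before, for example new vis.Network(container, buildGraph(movies), options), where movies is assumed to hold the serialized crawl results.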
The final result is shown in the figure:
Feel free to visit my personal homepage, 51zhang.net. The site is still under active development.




