【Big Data / Spark】Movie Recommendation from Movie Rating Records


Use movie rating records to produce a movie recommendation list.


Preparation

1. Task description

A well-known open test set in the recommendation field is available at http://grouplens.org/datasets/movielens/. It contains three data files: ratings.dat, users.dat, and movies.dat; see README.txt for details.

Task: join ratings.dat and movies.dat to produce the list of movies whose average rating exceeds 4.0. The dataset used here is ml-1m.

2. Download the data

After downloading (about 5.64 MB) and extracting, you get four files: movies.dat, ratings.dat, README, and users.dat.

3. Sample data

Part of movies.dat:

MovieID::Title::Genres

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
13::Balto (1995)::Animation|Children's
14::Nixon (1995)::Drama
15::Cutthroat Island (1995)::Action|Adventure|Romance
16::Casino (1995)::Drama|Thriller
17::Sense and Sensibility (1995)::Drama|Romance
18::Four Rooms (1995)::Thriller
19::Ace Ventura: When Nature Calls (1995)::Comedy
20::Money Train (1995)::Action

Part of ratings.dat:

UserID::MovieID::Rating::Timestamp

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
1::938::4::978301752
1::2398::4::978302281
1::2918::4::978302124
1::1035::5::978301753
1::2791::4::978302188
1::2687::3::978824268
1::2018::4::978301777
1::3105::5::978301713
1::2797::4::978302039

Implementation

Place the downloaded data files in the project, then create the main program movie.scala.

1. Set the input and output paths

An array holds the input and output file paths, which makes them easy to change and reuse later:

    val files = Array("src/main/java/day_20200425/data/movies.dat",
      "src/main/java/day_20200425/data/ratings.dat",
      "src/main/java/day_20200425/output")

2. Configure Spark

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SparkJoin").setMaster("local")
val sc = new SparkContext(conf)

3. Read the ratings file

Read ratings.dat, split each line on the :: delimiter, and then compute the average rating per movie.

    // Read the ratings file
    val textFile = sc.textFile(files(1))

    // Extract (movieId, rating)
    val rating = textFile.map(line => {
      val fields = line.split("::")
      (fields(1).toInt, fields(2).toDouble)
    })

    // Compute (movieId, averageRating)
    val movieScores = rating
      .groupByKey()
      .map(data => {
        val avg = data._2.sum / data._2.size
        (data._1, avg)
      })
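The parse-and-average logic above can be checked without a Spark cluster by running the same steps on plain Scala collections. This is a minimal sketch with made-up sample lines, not part of the original program:

```scala
// Hypothetical sample lines in the ratings.dat format (UserID::MovieID::Rating::Timestamp)
val lines = List(
  "1::1193::5::978300760",
  "2::1193::4::978300000",
  "1::661::3::978302109"
)

// Same parsing and averaging logic as the RDD version,
// expressed on plain Scala collections
val avgScores = lines
  .map { line =>
    val fields = line.split("::")
    (fields(1).toInt, fields(2).toDouble)
  }
  .groupBy(_._1)
  .map { case (movieId, pairs) =>
    val ratings = pairs.map(_._2)
    (movieId, ratings.sum / ratings.size)
  }

println(avgScores(1193)) // 4.5
println(avgScores(661))  // 3.0
```

`groupBy` here plays the role of `groupByKey`; the per-key average is computed identically in both versions.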

4. Read the movies file

The join result has the shape (ID, ((ID, Rating), (ID, MovieName))).
RDD's keyBy(func) turns each element into a (key, value) pair, where the key is computed by func and the value is the original element.

Because a dataset may have many columns, the key you want to join or sort on is not always the element itself; keyBy lets you derive the key from any field (or combination of fields).

    val movies = sc.textFile(files(0))
    val movieskey = movies.map(line => {
      val fields = line.split("::")
      (fields(0).toInt, fields(1)) // (MovieID, MovieName)
    }).keyBy(tup => tup._1)
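The effect of `keyBy(tup => tup._1)` can be sketched on a plain list: each element is paired with its computed key, leaving the original tuple intact as the value. (There is no `keyBy` on Scala collections, so the equivalent `map` is shown; the data is illustrative.)

```scala
// Illustrative (MovieID, MovieName) pairs
val movies = List((1, "Toy Story (1995)"), (2, "Jumanji (1995)"))

// keyBy(tup => tup._1) is equivalent to map(tup => (tup._1, tup)):
val keyed = movies.map(tup => (tup._1, tup))
// keyed: List((1, (1, "Toy Story (1995)")), (2, (2, "Jumanji (1995)")))
```

This is why the joined records later carry the ID twice: once as the key and once inside the value tuple.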

5. Save the results

Keep only movies whose average rating exceeds 4.0:

    import java.io.File

    val result = movieScores
      .keyBy(tup => tup._1)
      .join(movieskey)
      .filter(f => f._2._1._2 > 4.0)
      .map(f => (f._1, f._2._1._2, f._2._2._2))
    // .foreach(s => println(s))

    val file = new File(files(2))
    if (file.exists()) {
      deleteDir(file)
    }
    result.saveAsTextFile(files(2))
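The nested tuple accessors (`f._2._1._2` and so on) are the trickiest part of this step. The shape of the joined records and the filter/map logic can be traced with tiny hand-made inputs on plain collections (the data here is hypothetical, chosen to exercise both branches of the filter):

```scala
// Tiny inputs mirroring the RDD shapes
val movieScores = List((1, 4.15), (9, 2.7))           // (MovieID, avgRating)
val movieskey   = List((1, (1, "Toy Story (1995)")),  // keyBy output: (ID, (ID, Name))
                       (9, (9, "Sudden Death (1995)")))

// A pair-wise join: each match yields (key, (leftValue, rightValue))
val joined = for {
  (k1, score) <- movieScores.map(t => (t._1, t))      // keyBy(tup => tup._1)
  (k2, movie) <- movieskey
  if k1 == k2
} yield (k1, (score, movie))
// Each element has the shape (ID, ((ID, rating), (ID, name)))

val result = joined
  .filter(f => f._2._1._2 > 4.0)                      // keep average rating > 4.0
  .map(f => (f._1, f._2._1._2, f._2._2._2))           // (ID, rating, name)
// Only Toy Story survives the > 4.0 filter
```

Walking through the accessors: `f._2._1` is the `(ID, rating)` pair, so `f._2._1._2` is the rating; `f._2._2._2` is the movie name from the `(ID, name)` pair.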

6. Results

Spark automatically creates the output folder, which contains four files. _SUCCESS marks a successful run and is empty; part-00000 contains the data we need.

Partial results:

(1084,4.096209912536443,Bonnie and Clyde (1967))
(3007,4.013559322033898,American Movie (1999))
(2493,4.142857142857143,Harmonists, The (1997))
(3517,4.5,Bells, The (1926))
(1,4.146846413095811,Toy Story (1995))
(1780,4.125,Ayn Rand: A Sense of Life (1997))
(2351,4.207207207207207,Nights of Cabiria (Le Notti di Cabiria) (1957))
(759,4.101694915254237,Maya Lin: A Strong Clear Vision (1994))
(1300,4.1454545454545455,My Life as a Dog (Mitt liv som hund) (1985))
(1947,4.057818659658344,West Side Story (1961))
(2819,4.040752351097178,Three Days of the Condor (1975))
(162,4.063136456211812,Crumb (1994))
(1228,4.1875923190546525,Raging Bull (1980))
(1132,4.259090909090909,Manon of the Spring (Manon des sources) (1986))
(306,4.227544910179641,Three Colors: Red (1994))
(2132,4.074074074074074,Who's Afraid of Virginia Woolf? (1966))
(720,4.426940639269406,Wallace & Gromit: The Best of Aardman Animation (1996))
(2917,4.031746031746032,Body Heat (1981))
(1066,4.1657142857142855,Shall We Dance? (1937))
(2972,4.015384615384615,Red Sorghum (Hong Gao Liang) (1987))

Problems you may run into

Problem 1: the output directory already exists

Description:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/D:/Projects/JAVA/Scala/src/main/java/day_20200425/data/output already exist

Analysis: a previous run already created the output directory, and Spark refuses to overwrite it.
Solution: either delete the directory manually, or add the following code.

1. In the main program:
    val file = new File(files(2))
    if(file.exists()){
      deleteDir(file)
    }


2. The delete function:
  /**
    * Recursively delete a directory and its contents.
    * (Adapted from https://www.cnblogs.com/honeybee/p/6831346.html)
    *
    * @param dir the directory to delete
    */
  def deleteDir(dir: File): Unit = {
    val files = dir.listFiles()
    files.foreach(f => {
      if (f.isDirectory) {
        deleteDir(f)
      } else {
        f.delete()
        println("delete file " + f.getAbsolutePath)
      }
    })
    dir.delete()
    println("delete dir " + dir.getAbsolutePath)
  }

Problem 2: missing Hadoop environment configuration

Description:

ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException

Analysis: on Windows, the Hadoop environment (winutils) has not been configured.
Solution: download https://github.com/amihalik/hadoop-common-2.6.0-bin, add its bin directory to the system PATH, then set the Hadoop home in code. For example, if the directory is E:\\Program\\hadoop\\hadoop-common-2.6.0-bin, add:

 System.setProperty("hadoop.home.dir", "E:\\Program\\hadoop\\hadoop-common-2.6.0-bin")

