Neo4j bulk import with neo4j-import



Bulk-importing data into Neo4j

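There are currently several main ways to insert data (adapted from "How to import large-scale data into Neo4j"):
  • Cypher CREATE statements: write one CREATE per record.
  • Cypher LOAD CSV statements: convert the data to CSV and read it with LOAD CSV.
  • The official Java API: Batch Inserter.
  • The community-written Batch Import tool.
  • The official neo4j-import tool.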


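This post focuses on neo4j-import, the fastest of the official options. Its prerequisites:

  • graph.db must be empty;
  • Neo4j must be stopped;
  • the input must be CSV, in a fairly rigid format;
  • use case: the initial import;
  • node IDs must be unique.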

Best suited for:

the initial load only; it cannot apply incremental updates.
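The "graph.db must be empty" prerequisite can be checked before starting a long import. The following is a minimal Python sketch, not part of the Neo4j tooling: the function name can_import_into is made up for illustration, and the file name neostore is only a stand-in for existing store files. It does not check whether Neo4j itself is stopped.

```python
import tempfile
from pathlib import Path


def can_import_into(db_dir):
    """Return True if db_dir is missing or is an empty directory."""
    path = Path(db_dir)
    if not path.exists():
        return True
    if not path.is_dir():
        return False
    # any() stops at the first entry, so a large store is not fully listed.
    return not any(path.iterdir())


# Demo on a throwaway directory instead of a real Neo4j installation.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "graph.db"
    print(can_import_into(target))   # the directory does not exist yet
    target.mkdir()
    (target / "neostore").touch()    # stand-in for existing store files
    print(can_import_into(target))   # now the directory is not empty
```

In a real setup the argument would be the database directory, for example data/databases/graph.db under the Neo4j home.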
   
   
  
  
          

See the official example: Use the Import tool


1 Basic Neo4j commands and options

1.1 Starting and stopping:

bin\neo4j start
bin\neo4j stop
bin\neo4j restart
bin\neo4j status
   
   
  
  
          

1.2 neo4j-admin options: memory control

Source: 10.5. Memory recommendations

1.2.1 memrec: show recommended memory settings

neo4j-admin memrec [--memory=<memory dedicated to Neo4j>] [--database=<name>]
   
   
  
  
          
Option | Default | Description
--memory | The memory capacity of the machine | The amount of memory to allocate to Neo4j. Valid units are: k, K, m, M, g, G.
--database | graph.db | The name of the database. This option will generate numbers for Lucene indexes, and for data volume and native indexes in the database. These can be used as an input into more detailed memory analysis.

Example:

$neo4j-home> bin/neo4j-admin memrec --memory=16g
   
   
  
  
          

1.2.2 Setting the page cache with --pagecache

--pagecache sets the page cache size for a single command:

bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --pagecache=4G
   
   
  
  
          

That is, the page cache setting applies only to the one command it is passed to (a backup in this example).

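1.3 neo4j-admin options: dump and load databases (offline backup)

Both operations require the database to be shut down. See: 10.7. Dump and load databases

Dump: write graph.db out to a .dump file

The database must be stopped first.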

$neo4j-home> bin/neo4j-admin dump --database=graph.db --to=/backups/graph.db/2016-10-02.dump
$neo4j-home> ls /backups/graph.db
2016-10-02.dump
   
   
  
  
          

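Load: read a .dump file back into a database

The database must be stopped for this as well, which is why the example runs bin/neo4j stop first.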

$neo4j-home> bin/neo4j stop
Stopping Neo4j.. stopped
$neo4j-home> bin/neo4j-admin load --from=/backups/graph.db/2016-10-02.dump --database=graph.db --force
   
   
  
  
          

With --force, any existing database of that name is overwritten by the load.

1.4 neo4j-admin options: backup and restore (online backup)

See: 6.2. Perform a backup

Online backup:

$neo4j-home> export HEAP_SIZE=2G
$neo4j-home> mkdir /mnt/backup
$neo4j-home> bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --pagecache=4G
   
   
  
  
          

The backup is written into the directory given by --backup-dir.

Incremental backup:

$neo4j-home> export HEAP_SIZE=2G
$neo4j-home> bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --fallback-to-full=true --check-consistency=true --pagecache=4G
   
   
  
  
          



2 A simple demo

movies.csv.

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
   
   
  
  
          

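Here title is a property; its values are wrapped in double quotes, which is needed whenever a value contains the delimiter. year:int is also a property, but one typed as an integer.
:LABEL, like :ID, is a reserved field rather than a property: it assigns labels to the node, and one row can carry several labels separated by ; (for example Movie;Sequel). It does not create additional nodes.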
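To make the header format concrete, here is a small Python sketch that splits a header line into name and type parts and splits a multi-label value. It only mimics the convention shown in movies.csv; parse_header and split_labels are hypothetical helpers, not functions of the import tool, and the sketch ignores extras such as ID spaces.

```python
def parse_header(header_line, delimiter=","):
    """Split an import header line into (name, kind) pairs."""
    fields = []
    for raw in header_line.strip().split(delimiter):
        name, _, kind = raw.partition(":")
        # A field without ":kind" is treated as a plain string property.
        fields.append((name, kind or "string"))
    return fields


def split_labels(value, array_delimiter=";"):
    """Split a :LABEL cell such as 'Movie;Sequel' into separate labels."""
    return [label for label in value.split(array_delimiter) if label]


print(parse_header("movieId:ID,title,year:int,:LABEL"))
print(split_labels("Movie;Sequel"))
```

The reserved fields (:ID, :LABEL) come out with an empty or ordinary name and a kind of ID or LABEL, while title falls back to a string property and year is marked as int.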
actors.csv.

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
   
   
  
  
          

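In actors.csv, :LABEL holds the node's labels. personId:ID must be unique, whereas the same :LABEL value can appear on many rows.
After loading, each distinct label exists once in the database and is shared by all nodes that carry it; labels do not become nodes of their own.

roles.csv.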

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
   
   
  
  
          

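Here role has no : marker, so it is a plain property of the relationship; its values can be quoted or left unquoted. It is best to declare the type explicitly, for example :int for integers or roles:string[] for an array of strings.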
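The effect of type suffixes such as :int and :string[] can be illustrated with a short Python sketch that reads CSV text and converts each cell according to its header. This is an approximation for illustration only: read_typed_rows and convert are hypothetical names, only int and array types are handled, and the roles:string[] row is made-up sample data rather than part of the demo files.

```python
import csv
import io


def convert(value, kind, array_delimiter=";"):
    """Convert one cell according to the type suffix in the header."""
    if kind == "int":
        return int(value)
    if kind.endswith("[]"):
        return value.split(array_delimiter) if value else []
    return value  # plain strings and reserved fields stay as text


def read_typed_rows(text):
    """Parse CSV text whose header uses the name:type convention."""
    reader = csv.reader(io.StringIO(text))
    header = [field.partition(":") for field in next(reader)]
    rows = []
    for row in reader:
        record = {}
        for (name, _, kind), value in zip(header, row):
            key = name or ":" + kind  # e.g. ":START_ID", ":TYPE"
            record[key] = convert(value, kind)
        rows.append(record)
    return rows


movies = 'movieId:ID,title,year:int\ntt0133093,"The Matrix",1999\n'
print(read_typed_rows(movies))

# Hypothetical row: one relationship carrying an array of role names.
roles = ':START_ID,roles:string[],:END_ID,:TYPE\nkeanu,"Neo;Thomas Anderson",tt0133093,ACTED_IN\n'
print(read_typed_rows(roles))
```

The quoted and unquoted cells parse the same way; the quotes only matter when a value contains the field delimiter.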

Run on Linux:

neo4j_home$ bin/neo4j-admin import --nodes import/movies.csv --nodes import/actors.csv --relationships import/roles.csv
   
   
  
  
          

Older versions used the standalone neo4j-import tool for bulk loading; newer versions use neo4j-admin import.

Run on Windows:

neo4j-import.bat --into ../data/databases/graph.db --id-type string --nodes:attribute ../import/node_attribute.csv --relationships ../import/product_SecondLeaf.csv --relationships ../import/scene_isDemond.csv
   
   
  
  
          
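  • --into names the target database directory; change the name to keep separate attempts apart.
  • --nodes:attribute: the text after nodes: is the label given to every node in that file.
  • --id-type string: "The --id-type string is indicating that all :ID columns contain alphanumeric values (there is an optimization for numeric-only id's)." In other words, node IDs may mix letters and digits rather than being digits only.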
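When the list of files grows, it can help to assemble the command line programmatically. The sketch below builds the argument list for the Windows command shown above, using only the flags that appear in this post; build_import_command is a hypothetical helper, and the command is only printed, not executed.

```python
def build_import_command(into, nodes, relationships, id_type="string"):
    """Build a neo4j-import argument list.

    nodes and relationships are lists of (label_or_type, csv_path);
    an empty label_or_type means the plain --nodes / --relationships form.
    """
    cmd = ["neo4j-import.bat", "--into", into, "--id-type", id_type]
    for label, path in nodes:
        cmd += ["--nodes" + (":" + label if label else ""), path]
    for rel_type, path in relationships:
        cmd += ["--relationships" + (":" + rel_type if rel_type else ""), path]
    return cmd


cmd = build_import_command(
    "../data/databases/graph.db",
    nodes=[("attribute", "../import/node_attribute.csv")],
    relationships=[
        ("", "../import/product_SecondLeaf.csv"),
        ("", "../import/scene_isDemond.csv"),
    ],
)
print(" ".join(cmd))
```

Keeping the arguments as a list rather than one string means the same list could later be handed to subprocess.run without worrying about shell quoting.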

Finally, start Neo4j on Linux:

./bin/neo4j start
   
   
  
  
          

Finally, start Neo4j on Windows:

neo4j.bat console
   
   
  
  
          

Interpreting errors during the import:

1. Error details are written to bad.log

\data\databases\graph.db\bad.log
   
   
  
  
          

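An error that mentions the global id space means that a node ID is either undefined or duplicated.

2. If node IDs are not unique, the import fails with a global id space error and the rest of the load is aborted. Delete data/databases/graph.db and run the import again.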


3 Other import scenarios

Mainly based on: B.2. Use the Import tool

3.1 Importing with different delimiters

Suppose the input data looks like this:

:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
   
   
  
  
          

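The field delimiter can then be set with --delimiter, along with --array-delimiter and --quote for the array separator and the quote character: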

neo4j_home$ bin/neo4j-admin import --nodes import/movies2.csv --nodes import/actors2.csv --relationships import/roles2.csv --delimiter ";" --array-delimiter "|" --quote "'"
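Before running the import, a file with non-default delimiters can be sanity-checked with Python's standard csv module, which accepts the same two ideas (a field delimiter and a quote character). This is only a way to inspect the file; it is not the parser the import tool uses, and read_rows is a hypothetical helper name.

```python
import csv
import io

SAMPLE = (
    ":START_ID;role;:END_ID;:TYPE\n"
    "keanu;'Neo';tt0133093;ACTED_IN\n"
    "keanu;'Neo';tt0234215;ACTED_IN\n"
)


def read_rows(text, delimiter=";", quote="'"):
    """Parse CSV text that uses a custom delimiter and quote character."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote)
    return list(reader)


for row in read_rows(SAMPLE):
    print(row)
```

If every row comes back with the same number of fields as the header, the delimiter and quote settings are at least consistent with the data.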
   
   
  
  
          

3.2 Defining the same kind of node in different files

movies5a.csv.

movieId:ID,title,year:int
tt0133093,"The Matrix",1999
   
   
  
  
          

sequels5a.csv.

movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003
   
   
  
  
          

actors5a.csv.

personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"
   
   
  
  
          

Command:

neo4j_home$ bin/neo4j-admin import --nodes:Movie import/movies5a.csv --nodes:Movie:Sequel import/sequels5a.csv --nodes:Actor import/actors5a.csv
   
   
  
  
          

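When this runs, --nodes:Movie gives every node in movies5a.csv the label Movie,
while --nodes:Movie:Sequel gives every node in sequels5a.csv two labels: Movie and Sequel.

3.3 Defining the relationship type and relationship properties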

roles5b.csv.

:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
   
   
  
  
          

Command:

neo4j_home$ bin/neo4j-admin import --relationships:ACTED_IN import/roles5b.csv
   
   
  
  
          

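Here :ACTED_IN sets the relationship type to ACTED_IN for every row in the file, and role becomes a property of each relationship. The command above shows only the relationship argument; a complete import also needs --nodes arguments for the files that define the referenced IDs.

3.4 Splitting a dataset into several files to speed up loading

Node dataset, header: movies4-header.csv.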

movieId:ID,title,year:int,:LABEL
   
   
  
  
          

Node dataset, part 1: movies4-part1.csv.

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
   
   
  
  
          

Node dataset, part 2: movies4-part2.csv.

tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
   
   
  
  
          

Relationship dataset, header: roles4-header.csv.

:START_ID,role,:END_ID,:TYPE
   
   
  
  
          

Relationship dataset, part 1: roles4-part1.csv.

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
   
   
  
  
          

Relationship dataset, part 2: roles4-part2.csv.

laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
   
   
  
  
          

Command:

neo4j_home$ bin/neo4j-admin import --nodes "import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv" --relationships "import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv"
   
   
  
  
          

The header and the data are kept in separate files and then imported together in the order: header, part 1, part 2.
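The split itself is easy to script. The following Python sketch separates the header from the data rows, cuts the rows into fixed-size parts, and builds the comma-separated file list in the header-first order used above. split_csv and file_list are hypothetical helpers, and the naming pattern (stem-header.csv, stem-partN.csv) simply copies the file names in this example.

```python
def split_csv(lines, rows_per_part):
    """Return (header, parts), where parts is a list of row lists."""
    header, data = lines[0], lines[1:]
    parts = [data[i:i + rows_per_part] for i in range(0, len(data), rows_per_part)]
    return header, parts


def file_list(stem, n_parts, folder="import"):
    """Build the 'header,part1,part2,...' argument for --nodes/--relationships."""
    names = [f"{folder}/{stem}-header.csv"]
    names += [f"{folder}/{stem}-part{i}.csv" for i in range(1, n_parts + 1)]
    return ",".join(names)


movies = [
    "movieId:ID,title,year:int,:LABEL",
    'tt0133093,"The Matrix",1999,Movie',
    'tt0234215,"The Matrix Reloaded",2003,Movie;Sequel',
    'tt0242653,"The Matrix Revolutions",2003,Movie;Sequel',
]
header, parts = split_csv(movies, rows_per_part=2)
print(header)
print(parts)
print(file_list("movies4", len(parts)))
```

Writing each part to disk is left out to keep the sketch free of side effects; each entry of parts would become one partN file and header its own header file.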

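3.5 Two node sets that share the same ID values

This comes up often: two node sets use the same ID values (here both movies and actors are numbered 1, 2, 3). Unless each set gets its own ID space, the import reports an error.
movies7.csv.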

movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
   
   
  
  
          

Here (Movie-ID) names the ID space that this ID column belongs to.
actors7.csv.

personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
   
   
  
  
          

roles7.csv.

:START_ID(Actor-ID),role,:END_ID(Movie-ID)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3
   
   
  
  
          

Command:

neo4j_home$ bin/neo4j-admin import --nodes import/movies7.csv --nodes import/actors7.csv --relationships:ACTED_IN import/roles7.csv
   
   
  
  
          

In the relationship file, :START_ID(Actor-ID) and :END_ID(Movie-ID) state which ID space each end of the relationship refers to.
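Why ID spaces remove the clash can be shown with a small Python model: IDs are looked up either in one shared namespace or per named space. This is a conceptual sketch, not how the import tool is implemented, and find_clashes is a hypothetical helper.

```python
def find_clashes(node_sets, use_id_spaces):
    """Return (id, first_set, second_set) for every ID defined twice.

    node_sets maps an ID-space name to the list of IDs declared in it.
    """
    seen = {}
    clashes = []
    for space, ids in node_sets.items():
        for node_id in ids:
            # With ID spaces the key is (space, id); otherwise all IDs
            # share one global namespace.
            key = (space, node_id) if use_id_spaces else ("global", node_id)
            if key in seen:
                clashes.append((node_id, seen[key], space))
            else:
                seen[key] = space
    return clashes


node_sets = {"Movie-ID": ["1", "2", "3"], "Actor-ID": ["1", "2", "3"]}
print(find_clashes(node_sets, use_id_spaces=False))  # every actor ID collides
print(find_clashes(node_sets, use_id_spaces=True))   # no collisions
```

With a single namespace, actor 1 collides with movie 1; once the key includes the space name, the two are distinct and a relationship can name the space it means for each end.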

3.6 Skipping errors: relationships that point to missing nodes

Suppose a bad relationship appears:
roles8a.csv.

:START_ID,role,:END_ID,:TYPE
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
   
   
  
  
          

For example, a relationship refers to a node, emil, that is not defined in any node file.
In that case, run:

neo4j_home$ bin/neo4j-admin import --nodes import/movies8a.csv --nodes import/actors8a.csv --relationships import/roles8a.csv --ignore-missing-nodes
   
   
  
  
          

--ignore-missing-nodes skips relationships whose start or end node is missing; each skipped relationship is recorded in bad.log:

InputRelationship:
   source: roles8a.csv:11
   properties: [role, Emil]
   startNode: emil (global id space)
   endNode: tt0133093 (global id space)
   type: ACTED_IN
 referring to missing node emil
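The same kind of problem can be caught before the import with a short pre-check. The Python sketch below lists relationship rows whose start or end ID is not among the known node IDs; missing_endpoints is a hypothetical helper, and the sample text is only the two-row excerpt of roles8a.csv shown above, so its line numbers differ from the roles8a.csv:11 in the log.

```python
import csv
import io

NODE_IDS = {
    "keanu", "laurence", "carrieanne",          # actors
    "tt0133093", "tt0234215", "tt0242653",      # movies
}

ROLES = (
    ":START_ID,role,:END_ID,:TYPE\n"
    'carrieanne,"Trinity",tt0242653,ACTED_IN\n'
    'emil,"Emil",tt0133093,ACTED_IN\n'
)


def missing_endpoints(node_ids, relationships_text):
    """Return (line_number, column, id) for each unknown start/end ID."""
    reader = csv.DictReader(io.StringIO(relationships_text))
    problems = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in (":START_ID", ":END_ID"):
            if row[column] not in node_ids:
                problems.append((line_no, column, row[column]))
    return problems


print(missing_endpoints(NODE_IDS, ROLES))
```

Fixing the source data this way keeps the relationship instead of silently dropping it, which is what --ignore-missing-nodes would do.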
   
   
  
  
          

3.7 Skipping errors: duplicate nodes

actors8b.csv.


personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
   
   
  
  
          

The node file actors8b.csv contains a duplicate node ID: laurence.
Run:

neo4j_home$ bin/neo4j-admin import --nodes import/actors8b.csv --ignore-duplicate-nodes
   
   
  
  
          

Here --ignore-duplicate-nodes makes the import skip duplicate nodes.
The duplicates are reported in bad.log:

Id 'laurence' is defined more than once in global id space, at least at actors8b.csv:3 and actors8b.csv:5
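Duplicate IDs can also be found before the import rather than read out of bad.log afterwards. This Python sketch groups the line numbers of a node file by ID and reports any ID that occurs more than once; duplicate_ids is a hypothetical helper and the ID column name is taken from the actors8b.csv header.

```python
import csv
import io
from collections import defaultdict

ACTORS = (
    "personId:ID,name,:LABEL\n"
    'keanu,"Keanu Reeves",Actor\n'
    'laurence,"Laurence Fishburne",Actor\n'
    'carrieanne,"Carrie-Anne Moss",Actor\n'
    'laurence,"Laurence Harvey",Actor\n'
)


def duplicate_ids(text, id_column="personId:ID"):
    """Map each repeated ID to the file line numbers where it occurs."""
    lines_by_id = defaultdict(list)
    reader = csv.DictReader(io.StringIO(text))
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        lines_by_id[row[id_column]].append(line_no)
    return {key: lines for key, lines in lines_by_id.items() if len(lines) > 1}


print(duplicate_ids(ACTORS))
```

For this file the reported lines are 3 and 5, the same two positions named in the bad.log message.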
   
   
  
  
          

