Some Issues with Titan DB

Reposted article. 2016-11-28 10:49

I tested TitanDB on a stack I already know fairly well, HBase + ES (Elasticsearch), and wrote down a few small tips along the way.

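1. First, the only Hadoop version TitanDB supports is 1.2.1, so HBase can go no higher than 0.98. The official site does offer titan-1.0-hadoop2, but it does not work well: writing data to HBase throws errors. The cause is that the Hadoop 1 configuration format differs from Hadoop 2's, so the config that gets created cannot be used by HBase and Hadoop, and the only option is to fall back to the versions above. (The bundled ES package is 1.5.1; I recommend 1.5.2 to avoid strange errors.)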

2. Add indexes from gremlin following the procedure in the official documentation (see section 8 of the official docs):

mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameComposite').call()
mgmt.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

After mgmt.commit() runs, the first management transaction is closed, so you must open a new management instance before calling updateIndex. This differs from version 0.5.1.

3. The Java API has a catch when adding an index. Titan builds an index roughly like this:

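(1) Create an empty index. For a property key that already exists, the new index starts in the INSTALLED state (mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex())

(2) Wait for the index status to move to REGISTERED (mgmt.awaitGraphIndexStatus(graph, 'byNameComposite').call())

(3) Reindex the data already in the table (mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get())

At first I could not find a method in the Java API for step (2), getting the index to REGISTERED, and was still feeling my way around. An index created through gremlin, however, is used correctly by queries issued from Java.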
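For reference, Titan's SchemaStatus names the states INSTALLED, REGISTERED, ENABLED and DISABLED, and as far as I understand the documented lifecycle, a new index on existing data is expected to go INSTALLED, then REGISTERED, then ENABLED once the reindex job finishes. The sketch below models that order as a tiny state machine. It is a stand-alone illustration, not Titan code: the `IndexLifecycle` class and its `register`/`reindex` methods are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexLifecycle {
    // State names mirror Titan's SchemaStatus; this enum is only a stand-in.
    enum Status { INSTALLED, REGISTERED, ENABLED }

    private Status status = Status.INSTALLED; // assumed state right after build + commit
    private final List<Status> history = new ArrayList<>();

    IndexLifecycle() {
        history.add(status);
    }

    // Models step (2): every instance has acknowledged the index.
    void register() {
        require(Status.INSTALLED);
        move(Status.REGISTERED);
    }

    // Models step (3): a finished reindex job makes the index usable.
    void reindex() {
        require(Status.REGISTERED);
        move(Status.ENABLED);
    }

    List<Status> history() {
        return history;
    }

    private void require(Status expected) {
        if (status != expected) {
            throw new IllegalStateException("expected " + expected + " but was " + status);
        }
    }

    private void move(Status next) {
        status = next;
        history.add(next);
    }

    public static void main(String[] args) {
        IndexLifecycle idx = new IndexLifecycle();
        idx.register();
        idx.reindex();
        System.out.println(idx.history()); // [INSTALLED, REGISTERED, ENABLED]
    }
}
```

Calling `reindex()` before `register()` throws in this model, which is the same ordering constraint the gremlin snippet respects by waiting before it reindexes.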

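Because the Titan project is not very active, importing it with Eclipse's "import Maven project" produced all kinds of errors. In the end I fell back to the most primitive approach: download the source package and add every dependency jar by hand. There are still compile errors, but at least they no longer get in the way of reading the code. I found the TitanFactory.open method, which looks like this: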

    public static TitanGraph open(ReadConfiguration configuration) {
        return new StandardTitanGraph(new GraphDatabaseConfiguration(configuration));
    }

Next, look at the openManagement method of StandardTitanGraph:

    public TitanManagement openManagement() {
        return new ManagementSystem(this,backend.getGlobalSystemConfig(),backend.getSystemMgmtLog(), mgmtLogger, schemaCache);
    }

OK, so what comes back is ManagementSystem, an implementation of TitanManagement. Opening the ManagementSystem source shows:

    public static GraphIndexStatusWatcher awaitGraphIndexStatus(TitanGraph g, String graphIndexName) {
        return new GraphIndexStatusWatcher(g, graphIndexName);
    }

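So it is a static method after all. In gremlin, mgmt.awaitGraphIndexStatus is what waits for the index status change, while in the Java API the same thing is done by calling the static method ManagementSystem.awaitGraphIndexStatus(g, indexName).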
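The shape of that call, a static factory that returns a watcher whose call() does the waiting, can be pictured as a simple poll loop. The sketch below is a stand-alone model under my own assumptions, not Titan's implementation: `StatusWatcher`, `awaitStatus`, `maxPolls` and `sleepMillis` are hypothetical names, and the status source is a fake supplier instead of a graph.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class StatusWatcher implements Callable<Boolean> {
    private final Supplier<String> currentStatus;
    private final String target;
    private int maxPolls = 10;
    private long sleepMillis = 10;

    private StatusWatcher(Supplier<String> currentStatus, String target) {
        this.currentStatus = currentStatus;
        this.target = target;
    }

    // Static factory, similar in shape to ManagementSystem.awaitGraphIndexStatus(g, name).
    static StatusWatcher awaitStatus(Supplier<String> currentStatus, String target) {
        return new StatusWatcher(currentStatus, target);
    }

    StatusWatcher maxPolls(int n) {
        this.maxPolls = n;
        return this;
    }

    StatusWatcher sleepMillis(long ms) {
        this.sleepMillis = ms;
        return this;
    }

    // Polls until the target status is seen or the poll budget runs out.
    // It only observes the status; it never changes it.
    @Override
    public Boolean call() {
        for (int i = 0; i < maxPolls; i++) {
            if (target.equals(currentStatus.get())) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fake status source: INSTALLED for two polls, REGISTERED from the third on.
        AtomicInteger polls = new AtomicInteger();
        Supplier<String> fake = () -> polls.incrementAndGet() < 3 ? "INSTALLED" : "REGISTERED";
        System.out.println(StatusWatcher.awaitStatus(fake, "REGISTERED").call()); // true
    }
}
```

Because the factory is static, it does not need an open management transaction, which fits the observation that it can be called after mgmt.commit() has closed the first one.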

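One thing still feels odd, though. When I change the status with updateIndex, Titan puts the change into a trigger instead of applying it directly. In principle nothing in the underlying data should be touched before commit, so why implement it as a trigger, and why put it in the log class?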

The whole flow now runs without problems. The Titan + HBase + ES code is below (Scala):

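    import com.thinkaurelius.titan.core.TitanFactory
    import com.thinkaurelius.titan.core.schema.SchemaAction
    import com.thinkaurelius.titan.graphdb.database.management.ManagementSystem
    import org.apache.tinkerpop.gremlin.structure.Vertex

    // conf is the Titan configuration (HBase storage, ES index backend), defined elsewhere
    val g = TitanFactory.open(conf)
    var mgmt = g.openManagement
    val name = mgmt getPropertyKey "movieId"
    // Define the composite index and ask Titan to register it
    var index = mgmt.buildIndex("movie2WIndex666", classOf[Vertex]).addKey(name).buildCompositeIndex
    mgmt.updateIndex(index, SchemaAction.REGISTER_INDEX)
    mgmt.commit
    // Static call: wait for the index status to change
    val ms = ManagementSystem.awaitGraphIndexStatus(g, "movie2WIndex666").call
    println(ms.getTargetStatus)
    // commit closed the first management transaction, so open a new one
    mgmt = g.openManagement
    index = mgmt.getGraphIndex("movie2WIndex666")
    println(index.getIndexStatus(name))
    // Reindex the data that already exists
    mgmt.updateIndex(index,  SchemaAction.REINDEX).get
    mgmt.commit
    val res = g.traversal().V().has("movieId",12345).out()
    println(res)
    g.close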

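4. The HBase connection uses the get method and fetches one row per get, so without an index it can only go through about 100 rows per second. In my test, running g.V().has(XX) over 180,000 rows took about 34 minutes. With an index in place, a single lookup takes only about 200 ms.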

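Building an index also means one full pass over the data (again as row-by-row gets), so building a CompositeIndex on a single property over millions of rows takes several hours. Mixed indexes and larger datasets still make me uneasy.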
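The numbers above are consistent with plain sequential-get arithmetic. The sketch below only redoes that arithmetic; the class and method names are made up for the example, and the 100 rows/s figure is the rough rate quoted above, not a measured constant.

```java
final class ScanEstimate {
    // Seconds needed to read `rows` rows at `rowsPerSecond` sequential gets.
    static double seconds(long rows, double rowsPerSecond) {
        return rows / rowsPerSecond;
    }

    // Effective rate implied by an observed run.
    static double rowsPerSecond(long rows, double elapsedSeconds) {
        return rows / elapsedSeconds;
    }

    public static void main(String[] args) {
        // 180,000 rows at ~100 rows/s: 1800 s, i.e. 30 minutes
        System.out.println(seconds(180_000, 100) / 60 + " min");
        // The observed 34 minutes implies a bit under 100 rows/s
        System.out.println(rowsPerSecond(180_000, 34 * 60) + " rows/s");
        // One million rows at ~100 rows/s: 10,000 s, a little under 3 hours
        System.out.println(seconds(1_000_000, 100) / 3600 + " h");
    }
}
```

At this rate the cost grows linearly with the row count, which is why an index build over millions of rows lands in the range of hours.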

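5. For the reasons above, the HBase timeouts have to be set very long. I currently use 180000 seconds (the values in the file are in milliseconds). The configuration is as follows.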

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://cloud12:9000/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.master</name>
  <value>cloud12:60000</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/Titan/hbase/zookeeperDir</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>192.168.12.148</value>
</property>
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>180000000</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>180000000</value>
</property>
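As a quick sanity check on the two timeout values, the sketch below converts 180000000 into seconds and hours, assuming both properties are read as milliseconds. The class and method names are hypothetical and exist only for this example.

```java
import java.time.Duration;

final class HBaseTimeout {
    // Interpret a raw config value as milliseconds (the assumption stated above).
    static Duration fromConfigMillis(long rawValue) {
        return Duration.ofMillis(rawValue);
    }

    public static void main(String[] args) {
        Duration d = fromConfigMillis(180_000_000L);
        // prints "180000 s, 50 h"
        System.out.println(d.getSeconds() + " s, " + d.toHours() + " h");
    }
}
```

So the configured timeout is 50 hours, comfortably longer than the several hours an index build over millions of rows was taking.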
