HBase SingleColumnValueFilter: rows that lack the column are not filtered out

Reposted article; original dated 2015-03-11 19:03

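Problem description

A log table needs to be filtered by time.

In normal data, every row has a timestamp column holding the operation time, and filtering on that column's value returns the log entries for a given time range.

Somehow, though, the log table has picked up some junk rows (not necessarily junk, just rows without the timestamp field).

Filtering for the first day returns 5800 rows that have no operation time (timestamp),

filtering for the second day again returns 5800 rows without an operation time,

and filtering for the first two days together still returns 5800.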

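Problem analysis

The cause is clear: when a row does not contain the column being filtered on, SingleColumnValueFilter by default treats that row as passing the filter.

The next step is to change that policy in SingleColumnValueFilter's decision.

Reading the source shows that there is a method for changing it.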
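The behaviour described above can be sketched in plain Java, without HBase on the classpath. This is a simplified model of the row-level decision, not the HBase implementation. The class name, the sample rows, and the numeric time ranges (day one as [100, 200), day two as [200, 300)) are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterIfMissingDemo {

    /**
     * Simplified model: a row without the "timestamp" column is emitted
     * unless filterIfMissing is true; otherwise the value must fall in
     * the half-open range [start, end).
     */
    public static boolean rowIsEmitted(Map<String, Long> row, long start, long end,
                                       boolean filterIfMissing) {
        Long ts = row.get("timestamp");
        if (ts == null) {
            return !filterIfMissing; // column missing: default lets the row through
        }
        return ts >= start && ts < end;
    }

    /** Counts how many rows the modelled filter would emit. */
    public static int countEmitted(List<Map<String, Long>> rows, long start, long end,
                                   boolean filterIfMissing) {
        int emitted = 0;
        for (Map<String, Long> row : rows) {
            if (rowIsEmitted(row, start, end, filterIfMissing)) {
                emitted++;
            }
        }
        return emitted;
    }

    /** Hypothetical data: one row per day plus two rows with no timestamp column. */
    public static List<Map<String, Long>> sampleRows() {
        List<Map<String, Long>> rows = new ArrayList<>();
        Map<String, Long> dayOne = new HashMap<>();
        dayOne.put("timestamp", 150L);
        rows.add(dayOne);
        Map<String, Long> dayTwo = new HashMap<>();
        dayTwo.put("timestamp", 250L);
        rows.add(dayTwo);
        for (int i = 0; i < 2; i++) {
            Map<String, Long> noTimestamp = new HashMap<>();
            noTimestamp.put("level", 1L); // some other column, but no timestamp
            rows.add(noTimestamp);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> rows = sampleRows();
        // Default policy: the two rows without a timestamp show up in every range.
        System.out.println(countEmitted(rows, 100L, 200L, false)); // 3
        System.out.println(countEmitted(rows, 200L, 300L, false)); // 3
        System.out.println(countEmitted(rows, 100L, 300L, false)); // 4
        // With filterIfMissing = true, only rows inside the range remain.
        System.out.println(countEmitted(rows, 100L, 200L, true));  // 1
        System.out.println(countEmitted(rows, 100L, 300L, true));  // 2
    }
}
```

In this model the rows without a timestamp contribute the same constant to every query, which matches the constant 5800 seen in the real table.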

Code

The comment at the top of the SingleColumnValueFilter source describes the method (see the paragraph that mentions setFilterIfMissing):

/**
 * This filter is used to filter cells based on value. It takes a {@link CompareFilter.CompareOp}
 * operator (equal, greater, not equal, etc), and either a byte [] value or
 * a ByteArrayComparable.
 * <p>
 * If we have a byte [] value then we just do a lexicographic compare. For
 * example, if passed value is 'b' and cell has 'a' and the compare operator
 * is LESS, then we will filter out this cell (return true).  If this is not
 * sufficient (eg you want to deserialize a long and then compare it to a fixed
 * long value), then you can pass in your own comparator instead.
 * <p>
 * You must also specify a family and qualifier.  Only the value of this column
 * will be tested. When using this filter on a {@link Scan} with specified
 * inputs, the column to be tested should also be added as input (otherwise
 * the filter will regard the column as missing).
 * <p>
 * To prevent the entire row from being emitted if the column is not found
 * on a row, use {@link #setFilterIfMissing}.
 * Otherwise, if the column is found, the entire row will be emitted only if
 * the value passes. If the value fails, the row will be filtered out.
 * <p>
 * In order to test values of previous versions (timestamps), set
 * {@link #setLatestVersionOnly} to false. The default is true, meaning that
 * only the latest version's value is tested and all previous versions are ignored.
 * <p>
 * To filter based on the value of all scanned columns, use {@link ValueFilter}.
 */
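The Javadoc's remark about lexicographic comparison matters for a time-range filter, because the result depends on how the timestamp was encoded. The article does not say whether starttime and endtime are longs or strings, so the sketch below tries both as an assumption. It uses only the JDK (Java 9 or later for Arrays.compareUnsigned): encodeLong mimics the 8-byte big-endian layout that Bytes.toBytes(long) produces, and compareLex stands in for an unsigned byte-by-byte comparison. Both helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderDemo {

    /** 8-byte big-endian encoding of a long (ByteBuffer is big-endian by default). */
    public static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    public static int compareLex(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        // Non-negative longs keep their numeric order when compared as bytes.
        System.out.println(compareLex(encodeLong(9L), encodeLong(10L)) < 0);

        // Decimal strings do not: "9" sorts after "10" byte by byte.
        byte[] nine = "9".getBytes(StandardCharsets.UTF_8);
        byte[] ten = "10".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareLex(nine, ten) > 0);

        // Negative longs sort after positive ones, because the sign bit is
        // compared as an unsigned high bit.
        System.out.println(compareLex(encodeLong(-1L), encodeLong(1L)) > 0);
    }
}
```

All three lines print true. If timestamps are stored as strings, a fixed-width format (for example zero-padded values) keeps byte order and chronological order aligned.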

Changed code

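import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;
SingleColumnValueFilter f1 = new SingleColumnValueFilter(Bytes.toBytes(FAMILY), Bytes.toBytes("timestamp"), CompareOp.GREATER_OR_EQUAL, Bytes.toBytes(starttime));
SingleColumnValueFilter f2 = new SingleColumnValueFilter(Bytes.toBytes(FAMILY), Bytes.toBytes("timestamp"), CompareOp.LESS, Bytes.toBytes(endtime));
// true: skip rows that lack the column; false (the default): let them through
f1.setFilterIfMissing(true);
f2.setFilterIfMissing(true);
filters.add(f1);
filters.add(f2);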

Reflection

At first I planned to subclass the filter and override some of its methods, but setting this option seems more flexible.
