Background: a batch of data arrives from Kafka; once received it has to be processed and then written to the database. We consume it with a single thread, 30,000 records per poll. In object terms the batch can be modelled as a List<Person>. For certain reasons the records first need to be de-duplicated in memory by their primary key personId, with later records overwriting earlier ones.
Before the new Java 8 features, the usual approach would be to convert the list into a map keyed by personId; when putting, a later record with the same personId overwrites the earlier one, as sketched below.
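For comparison, here is a minimal sketch of that traditional approach (the method name coverDuplicateByLoop and the LinkedHashMap choice are ours, not from the original; it assumes Person exposes a getPersonId() getter):

public static List<Person> coverDuplicateByLoop(List<Person> sourceList) {
    Map<String, Person> byId = new LinkedHashMap<>();  // keeps keys in first-seen order
    for (Person p : sourceList) {
        byId.put(p.getPersonId(), p);  // a later put overwrites the earlier record with the same personId
    }
    return new ArrayList<>(byId.values());
}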
The new features in Java 8 offer a more elegant way to handle this situation. We distinguish two cases:
(1) We do not care which record wins; just keep a single record per personId.
public static List<Person> coverDuplicate(List<Person> sourceList) {
    if (CollectionUtils.isEmpty(sourceList)) {
        return new ArrayList<>();
    }
    List<Person> distinctList = sourceList.stream().collect(
            Collectors.collectingAndThen(
                    Collectors.toCollection(
                            () -> new TreeSet<>(Comparator.comparing(Person::getPersonId))),
                    ArrayList::new));
    return distinctList;
}
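A note on this design choice: when two records share a personId, TreeSet.add silently drops the later one, so this version keeps the first record seen for each personId, and the resulting list comes back sorted by personId rather than in the original encounter order.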
(2) For records with the same personId, the later record must overwrite the earlier one.
public static List<Person> coverDuplicate1(List<Person> sourceList) {
    if (CollectionUtils.isEmpty(sourceList)) {
        return new ArrayList<>();
    }
    List<Person> distinctList = sourceList.stream()
            .collect(Collectors.toMap(Person::getPersonId, Function.identity(), (e1, e2) -> e2))
            .values().stream()
            .collect(Collectors.toList());
    return distinctList;
}
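In the merge function (e1, e2) -> e2, e1 is the value already in the map and e2 is the incoming record, so the later record wins. One caveat: the two- and three-argument toMap collectors back onto a HashMap, so the iteration order of values() is not guaranteed. If the original encounter order matters, the four-argument toMap overload accepts a map supplier; a minimal sketch under that assumption follows (the method name coverDuplicateOrdered is ours, not from the original):

public static List<Person> coverDuplicateOrdered(List<Person> sourceList) {
    if (CollectionUtils.isEmpty(sourceList)) {
        return new ArrayList<>();
    }
    Map<String, Person> byId = sourceList.stream().collect(Collectors.toMap(
            Person::getPersonId,        // key: the primary key we de-duplicate on
            Function.identity(),        // value: the record itself
            (earlier, later) -> later,  // merge: the later record overwrites the earlier one
            LinkedHashMap::new));       // map supplier: keeps keys in first-seen order
    return new ArrayList<>(byId.values());
}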
Test case:
public class Person {
    private String personId;
    private String name;
    private Integer operateTag;
    // an all-args constructor and getters are assumed here (omitted in the original)
}

public static void main(String[] args) {
    Person p1 = new Person("1", "111", 1);
    Person p2 = new Person("1", "222", 0);
    Person p3 = new Person("3", "333", 1);
    Person p4 = new Person("4", "444", 0);
    Person p5 = new Person("4", "555", 1);

    List<Person> sourceList = new ArrayList<>();
    sourceList.add(p1);
    sourceList.add(p2);
    sourceList.add(p3);
    sourceList.add(p4);
    sourceList.add(p5);

    List<Person> unique = coverDuplicate(sourceList);
    unique.forEach(e -> System.out.println(e.getPersonId() + "," + e.getName() + "," + e.getOperateTag()));
}
Both approaches print the results as expected; for reference, the expected output is sketched below.
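With the test data above, calling coverDuplicate should print the first record kept for each personId, sorted by personId; switching the call to coverDuplicate1 should print the last record for each personId, though the line order is not guaranteed because the values come from a HashMap:

Expected output of coverDuplicate (first record per personId, sorted by personId):
1,111,1
3,333,1
4,444,0

Expected output of coverDuplicate1 (last record per personId; line order may vary):
1,222,0
3,333,1
4,555,1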