I. Original Problem Description
Table: Failed
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| fail_date | date |
+--------------+---------+
The primary key for this table is fail_date.
The Failed table contains the days on which tasks failed.
Table: Succeeded
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| success_date | date |
+--------------+---------+
The primary key for this table is success_date.
The Succeeded table contains the days on which tasks succeeded.
A system runs one task every day. Each task is independent of the previous tasks, and a task can either fail or succeed.

Write an SQL query to generate a report of period_state for each continuous interval of days in the period from 2019-01-01 to 2019-12-31. period_state is 'failed' if the tasks in the interval failed, or 'succeeded' if the tasks in the interval succeeded. Each interval of days is reported with its start_date and end_date.

Order the result by start_date.

The query result format is in the following example:
Failed table:
+-------------------+
| fail_date         |
+-------------------+
| 2018-12-28        |
| 2018-12-29        |
| 2019-01-04        |
| 2019-01-05        |
+-------------------+

Succeeded table:
+-------------------+
| success_date      |
+-------------------+
| 2018-12-30        |
| 2018-12-31        |
| 2019-01-01        |
| 2019-01-02        |
| 2019-01-03        |
| 2019-01-06        |
+-------------------+

Result table:
+--------------+--------------+--------------+
| period_state | start_date   | end_date     |
+--------------+--------------+--------------+
| succeeded    | 2019-01-01   | 2019-01-03   |
| failed       | 2019-01-04   | 2019-01-05   |
| succeeded    | 2019-01-06   | 2019-01-06   |
+--------------+--------------+--------------+

The report ignores the system state in 2018, since we only care about the system in the period 2019-01-01 to 2019-12-31.
From 2019-01-01 to 2019-01-03 all tasks succeeded and the system state was "succeeded".
From 2019-01-04 to 2019-01-05 all tasks failed and the system state was "failed".
From 2019-01-06 to 2019-01-06 all tasks succeeded and the system state was "succeeded".
II. Problem Summary
Two tables, Failed and Succeeded, record the failures and successes of a system task that runs once a day. Return a result table that lists, in order, the continuous periods during which the task kept failing or kept succeeding, together with each period's start and end dates.
III. SQL Analysis
0. Table creation statements
CREATE TABLE IF NOT EXISTS `Failed` (
    `fail_date` DATE,
    PRIMARY KEY (`fail_date`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `Succeeded` (
    `success_date` DATE,
    PRIMARY KEY (`success_date`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

INSERT INTO Succeeded(success_date)
VALUES ('2018-12-30'), ('2018-12-31'), ('2019-01-01'), ('2019-01-02'), ('2019-01-03'), ('2019-01-06');

INSERT INTO Failed(fail_date)
VALUES ('2018-12-28'), ('2018-12-29'), ('2019-01-04'), ('2019-01-05'), ('2019-01-07');

(Note that this test data adds one extra failed day, 2019-01-07, beyond the example above, so the result will contain a fourth interval.)
1. Group the rows of each table by whether their dates are consecutive.

The trick: scan the dates in order while carrying the previous date in the user variable @date and a running group number in @rank. @rank is incremented only when the current date is not exactly one day after the previous one, so every run of consecutive days ends up sharing one group number.
1 #失敗表 2 SELECT
3 fail_date, 4 IF( 5 @date = DATE_SUB(fail_date, INTERVAL 1 DAY), 6 @rank := @rank, 7 @rank := @rank + 1
8 ) AS rank, 9 @date := fail_date 10 FROM
11 Failed, 12 (SELECT
13 @date := NULL, 14 @rank := 0) r 15 WHERE fail_date > '2018-12-31'
16 AND fail_date < '2020-01-01'
17 ORDER BY fail_date; 18
19 #成功表 20 SELECT
21 success_date, 22 IF( 23 @date = DATE_SUB(success_date, INTERVAL 1 DAY), 24 @rank := @rank, 25 @rank := @rank + 1
26 ) AS rank, 27 @date := success_date 28 FROM
29 Succeeded, 30 (SELECT
31 @date := NULL, 32 @rank := 0) r 33 WHERE success_date > '2018-12-31'
34 AND success_date < '2020-01-01'
35 ORDER BY success_date
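As an aside: user variables in SELECT are deprecated in MySQL 8.0, where the same grouping falls out of the classic date-minus-row-number trick. A minimal sketch, assuming MySQL 8.0+ (not part of the original solution):

-- Within a run of consecutive dates, (date - row_number) is constant,
-- so the derived value grp can replace @rank as the group key.
SELECT
    fail_date,
    DATE_SUB(fail_date,
             INTERVAL ROW_NUMBER() OVER (ORDER BY fail_date) DAY) AS grp
FROM Failed
WHERE fail_date > '2018-12-31'
  AND fail_date < '2020-01-01';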
2. For each group from the previous step, take the minimum and maximum dates as the interval bounds, then merge the two result sets with UNION ALL and sort by start_date.
SELECT
    *
FROM
    (SELECT
        'succeeded' AS period_state,
        MIN(success_date) AS start_date,
        MAX(success_date) AS end_date
    FROM
        (SELECT
            success_date,
            IF(
                @date = DATE_SUB(success_date, INTERVAL 1 DAY),
                @rank := @rank,
                @rank := @rank + 1
            ) AS `rank`,
            @date := success_date
        FROM
            Succeeded,
            (SELECT @date := NULL, @rank := 0) r
        WHERE success_date > '2018-12-31'
            AND success_date < '2020-01-01'
        ORDER BY success_date) t1
    GROUP BY t1.`rank`
    UNION ALL
    SELECT
        'failed' AS period_state,
        MIN(fail_date) AS start_date,
        MAX(fail_date) AS end_date
    FROM
        (SELECT
            fail_date,
            IF(
                @date = DATE_SUB(fail_date, INTERVAL 1 DAY),
                @rank := @rank,
                @rank := @rank + 1
            ) AS `rank`,
            @date := fail_date
        FROM
            Failed,
            (SELECT @date := NULL, @rank := 0) r
        WHERE fail_date > '2018-12-31'
            AND fail_date < '2020-01-01'
        ORDER BY fail_date) t2
    GROUP BY t2.`rank`) temp
ORDER BY start_date;
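For comparison, the whole answer collapses nicely on MySQL 8.0+: label and merge both tables first, then apply the row-number grouping once. A sketch under that version assumption, not the original answer:

SELECT
    period_state,
    MIN(task_date) AS start_date,
    MAX(task_date) AS end_date
FROM (
    SELECT
        period_state,
        task_date,
        -- same trick as above: consecutive dates share one grp value per state
        DATE_SUB(task_date,
                 INTERVAL ROW_NUMBER() OVER (PARTITION BY period_state
                                             ORDER BY task_date) DAY) AS grp
    FROM (
        SELECT 'failed'    AS period_state, fail_date    AS task_date FROM Failed
        UNION ALL
        SELECT 'succeeded' AS period_state, success_date AS task_date FROM Succeeded
    ) all_tasks
    WHERE task_date > '2018-12-31'
      AND task_date < '2020-01-01'
) grouped
GROUP BY period_state, grp
ORDER BY start_date;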
IV. Hive Version
0.1 Table creation statements: omitted.
0.2 Maven dependencies (pom.xml)
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
</dependency>
<dependency>
<groupId>com.janeluo</groupId>
<artifactId>ikanalyzer</artifactId>
<version>2012_u6</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>4.7.2</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-api</artifactId>
<version>RELEASE</version>
<scope>compile</scope>
</dependency>
</dependencies>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.8</version>
</dependency>
</dependencies>
</dependencyManagement>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
1. Use a custom UDTF to replace step 1 of section III.

The idea: the UDTF receives the dates one at a time in sorted order, remembers where the current run of consecutive days started, and forwards a (start_date, end_date) row whenever an incoming date breaks the one-day sequence; close() flushes the final, still-open run.
package com.hive.udf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class TestGenericUDTF extends GenericUDTF {
    private Date startDate = null;    // first date of the current run
    private Date lastDate = null;     // most recent date seen
    private Date lastEmitDate = null; // start of the last run already forwarded
    private static SimpleDateFormat s = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    public void close() throws HiveException {
        // Emit the final, still-open interval. The reference comparison is
        // intentional: lastEmitDate is assigned the startDate object itself
        // after every forward, so != detects an unemitted run.
        if (startDate != null && lastDate != null && lastEmitDate != startDate) {
            String[] result = new String[]{s.format(startDate), s.format(lastDate)};
            forward(result);
        }
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("TestGenericUDTF takes only one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("TestGenericUDTF takes string as a parameter");
        }

        // The UDTF outputs two string columns: start_date and end_date.
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("start_date");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("end_date");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        String dateStr = args[0].toString();
        Date date = null;
        try {
            date = s.parse(dateStr);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        if (date == null) {
            // An unparseable or null date ends the current run: emit and reset.
            if (startDate != null && lastDate != null && lastEmitDate != startDate) {
                String[] result = new String[]{s.format(startDate), s.format(lastDate)};
                forward(result);
                lastEmitDate = startDate;
            }
            startDate = null;
            lastDate = null;
            return;
        }
        if (startDate == null) {
            startDate = date;
        } else if (lastDate != null) {
            // If the incoming date is not lastDate + 1 day, the run is broken:
            // emit the finished interval and start a new one.
            GregorianCalendar calendar = new GregorianCalendar();
            calendar.setTime(lastDate);
            calendar.add(Calendar.DATE, 1);
            Date next = calendar.getTime();
            if (!next.equals(date)) {
                String[] result = new String[]{s.format(startDate), s.format(lastDate)};
                forward(result);
                lastEmitDate = startDate;
                startDate = date;
            }
        }
        lastDate = date;
    }
}
2. Build the fat jar (the maven-assembly-plugin configured above produces a *-jar-with-dependencies.jar during mvn package), upload it to the server, and register the UDTF:

add JAR /home/hadoop/hive-dateudtf-jar-with-dependencies.jar;
create temporary function dateUDTF as 'com.hive.udf.TestGenericUDTF';
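The post stops at registering the function. As a hypothetical usage sketch (table and column names assumed to mirror the MySQL schema above, since the Hive DDL was omitted), the failed-table half of step 1 might look like the following. The UDTF keeps state across rows, so the plan must deliver all dates to a single UDTF instance in ascending order for the intervals to come out right:

-- Hypothetical sketch: one (start_date, end_date) row per consecutive run.
SELECT dateUDTF(fail_date) AS (start_date, end_date)
FROM (
    SELECT fail_date
    FROM Failed
    WHERE fail_date > '2018-12-31'
      AND fail_date < '2020-01-01'
    ORDER BY fail_date
) t;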