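1. Preface
I recently interviewed for data analyst positions, and two companies asked this same question: one wanted a SQL query for the top three products by sales amount in each shop, the other wanted the same statistic computed in Python. The LeetCode database problem "Department Top Three Salaries" (部門工資前三高的所有員工) is of the same type, and it ranks first by frequency among all the problems there. Today I will start with a review of the SQL solutions.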
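2. Problem
The sales table holds all order information. Each order has an order id orderid, shop id shopid, product id goodsid, quantity sold salenum, unit price price, and order date orderdate;
The shop table holds shop information: shop id shopid and shop name shopname;
The goods table holds product information: product id goodsid and product name goodsname;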
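1) Basic version: write a SQL query that finds, for each shop, the three products with the highest sales amount (unit price * quantity sold) in Q1 2020. Given the tables above, the query should return one row per shop and product with the columns shopname, goodsname, sumprice, and sumprice_rank.
2) Advanced version: write a SQL query that finds, for each shop, the three products with the highest sales amount (unit price * quantity sold) within the last three months, shown as separate columns. Given the tables above, the query should return one row per shop with the columns shopname, goodsname1, goodsname2, and goodsname3.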
The CREATE TABLE and INSERT statements are attached below:
# Create the sales table and insert data
CREATE TABLE `sales` (
  `orderid` int NOT NULL AUTO_INCREMENT,
  `shopid` int NOT NULL,
  `goodsid` int NOT NULL,
  `salenum` int NOT NULL,
  `price` int NOT NULL,
  `orderdate` date NOT NULL,
  PRIMARY KEY (`orderid`)
);
INSERT INTO `sales` (`shopid`, `goodsid`, `salenum`, `price`, `orderdate`) VALUES
  (1, 10001, 1, 90, '2020-01-15'),
  (1, 10002, 1, 50, '2020-02-23'),
  (2, 10004, 2, 120, '2020-01-18'),
  (1, 10003, 3, 60, '2020-01-19'),
  (2, 10002, 1, 50, '2020-02-23'),
  (1, 10002, 1, 40, '2020-03-01'),
  (1, 10004, 3, 20, '2020-02-14'),
  (1, 10003, 1, 10, '2020-03-01'),
  (2, 10002, 1, 50, '2020-02-02'),
  (2, 10001, 1, 40, '2020-02-09');
# Create the shop table and insert data
CREATE TABLE `shop` (
  `shopid` int NOT NULL,
  `shopname` varchar(10) NOT NULL
);
INSERT INTO `shop` VALUES
  (1, 'SexyBaby'),
  (2, 'AngelCity');
# Create the goods table and insert data
CREATE TABLE `goods` (
  `goodsid` int NOT NULL,
  `goodsname` varchar(10) NOT NULL
);
INSERT INTO `goods` VALUES
  (10001, 'dress'),
  (10002, 'shirt'),
  (10003, 'coat'),
  (10004, 'blouse');
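3. Solving with window functions
Note: MySQL supports window functions starting from version 8.0.
Since we need statistics per shop and per product, first recall the difference between GROUP BY and PARTITION BY, both of which compute statistics over groups. GROUP BY summarizes: it keeps only the grouping columns and the aggregate results. PARTITION BY keeps every row and only computes the statistic over each partition, and it is often combined with ranking functions. (Note that when an aggregate function is applied over a partition whose OVER clause also contains ORDER BY, the value is accumulated row by row; without ORDER BY, every row gets the aggregate over its whole partition. For details, see this blog post: https://www.cnblogs.com/hello-yz/p/9962356.html)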
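To make the GROUP BY versus PARTITION BY difference concrete, here is a minimal sketch that runs both against the sample sales rows. It is an illustration under assumptions, not the article's setup: it uses Python's built-in sqlite3 module with an in-memory SQLite database (which needs SQLite 3.25 or later for window functions) instead of MySQL 8.0, and the variable names grouped, partition_total, and running_total are made up for this example.

```python
import sqlite3

# Sample rows copied from the INSERT statement for the sales table:
# (shopid, goodsid, salenum, price, orderdate)
SALES = [
    (1, 10001, 1, 90, "2020-01-15"), (1, 10002, 1, 50, "2020-02-23"),
    (2, 10004, 2, 120, "2020-01-18"), (1, 10003, 3, 60, "2020-01-19"),
    (2, 10002, 1, 50, "2020-02-23"), (1, 10002, 1, 40, "2020-03-01"),
    (1, 10004, 3, 20, "2020-02-14"), (1, 10003, 1, 10, "2020-03-01"),
    (2, 10002, 1, 50, "2020-02-02"), (2, 10001, 1, 40, "2020-02-09"),
]

# Assumption: SQLite stands in for MySQL; orderid is assigned 1..10 in order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (orderid INTEGER PRIMARY KEY, shopid INT, "
    "goodsid INT, salenum INT, price INT, orderdate TEXT)"
)
conn.executemany(
    "INSERT INTO sales (shopid, goodsid, salenum, price, orderdate) "
    "VALUES (?, ?, ?, ?, ?)",
    SALES,
)

# GROUP BY collapses each shop into a single summary row.
grouped = conn.execute(
    "SELECT shopid, SUM(salenum * price) FROM sales "
    "GROUP BY shopid ORDER BY shopid"
).fetchall()

# PARTITION BY keeps all ten order rows; with no ORDER BY in the window,
# every row carries the total of its whole partition.
partition_total = conn.execute(
    "SELECT orderid, shopid, "
    "SUM(salenum * price) OVER (PARTITION BY shopid) "
    "FROM sales ORDER BY orderid"
).fetchall()

# Adding ORDER BY inside the window turns the same SUM into a running total.
running_total = conn.execute(
    "SELECT orderid, shopid, "
    "SUM(salenum * price) OVER (PARTITION BY shopid ORDER BY orderid) "
    "FROM sales WHERE shopid = 2 ORDER BY orderid"
).fetchall()

print(grouped)               # [(1, 430), (2, 380)]
print(len(partition_total))  # 10
print(running_total)         # [(3, 2, 240), (5, 2, 290), (9, 2, 340), (10, 2, 380)]
```

The last query shows the cumulative behaviour: the four orders of shop 2 accumulate to 240, 290, 340, and 380, while the second query would report 380 on each of those rows.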
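Approach for the basic version:
1. Use WHERE to select the orders from Q1 2020;
2. The same product in a shop may appear in several order records, so use GROUP BY to aggregate the sales amount sumprice of each product in each shop;
3. Use ROW_NUMBER() OVER (PARTITION BY ...) to sort the product sales amounts within each shop in descending order, which gives each product's rank sumprice_rank within its shop. ROW_NUMBER() assigns distinct ranks even to tied amounts, in no guaranteed order unless ORDER BY includes a tie-breaker;
4. Join the result with the shop and goods tables to get shopname and goodsname, then filter in the outer query with WHERE sumprice_rank <= 3 to get the top three products by sales amount in each shop.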
SELECT shop.shopname, goods.goodsname, a.sumprice, a.sumprice_rank
FROM (
    SELECT shopid, goodsid,
           SUM(salenum * price) AS sumprice,
           # goodsid breaks ties so that the ranks are deterministic
           ROW_NUMBER() OVER (PARTITION BY shopid
                              ORDER BY SUM(salenum * price) DESC, goodsid) AS sumprice_rank
    FROM sales
    # half-open range: covers all of Q1 2020, including Jan 1 and Mar 31
    WHERE orderdate >= '2020-01-01' AND orderdate < '2020-04-01'
    GROUP BY shopid, goodsid
) a
LEFT JOIN shop ON a.shopid = shop.shopid
LEFT JOIN goods ON a.goodsid = goods.goodsid
WHERE a.sumprice_rank <= 3
ORDER BY shopname, sumprice_rank;
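The four steps can be checked against the sample data with a small script. This is a sketch under assumptions rather than the article's exact query: it runs on SQLite through Python's sqlite3 module instead of MySQL, it splits the work into two CTEs with the made-up names totals and ranked so each step is visible, it filters Q1 2020 with a half-open date range, and it adds goodsid as a tie-breaker because two products in shop 1 both total 90. build_db is a hypothetical helper that loads the article's sample rows.

```python
import sqlite3


def build_db():
    """Load the article's sample data into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (orderid INTEGER PRIMARY KEY, shopid INT,
                            goodsid INT, salenum INT, price INT, orderdate TEXT);
        INSERT INTO sales (shopid, goodsid, salenum, price, orderdate) VALUES
          (1, 10001, 1, 90, '2020-01-15'), (1, 10002, 1, 50, '2020-02-23'),
          (2, 10004, 2, 120, '2020-01-18'), (1, 10003, 3, 60, '2020-01-19'),
          (2, 10002, 1, 50, '2020-02-23'), (1, 10002, 1, 40, '2020-03-01'),
          (1, 10004, 3, 20, '2020-02-14'), (1, 10003, 1, 10, '2020-03-01'),
          (2, 10002, 1, 50, '2020-02-02'), (2, 10001, 1, 40, '2020-02-09');
        CREATE TABLE shop (shopid INT, shopname TEXT);
        INSERT INTO shop VALUES (1, 'SexyBaby'), (2, 'AngelCity');
        CREATE TABLE goods (goodsid INT, goodsname TEXT);
        INSERT INTO goods VALUES (10001, 'dress'), (10002, 'shirt'),
                                 (10003, 'coat'), (10004, 'blouse');
    """)
    return conn


TOP3_SQL = """
WITH totals AS (      -- steps 1 and 2: filter Q1 2020, then aggregate
    SELECT shopid, goodsid, SUM(salenum * price) AS sumprice
    FROM sales
    WHERE orderdate >= '2020-01-01' AND orderdate < '2020-04-01'
    GROUP BY shopid, goodsid
), ranked AS (        -- step 3: rank products inside each shop
    SELECT shopid, goodsid, sumprice,
           ROW_NUMBER() OVER (PARTITION BY shopid
                              ORDER BY sumprice DESC, goodsid) AS sumprice_rank
    FROM totals
)
SELECT shop.shopname, goods.goodsname, r.sumprice, r.sumprice_rank
FROM ranked r         -- step 4: join the names and keep ranks 1 to 3
LEFT JOIN shop ON r.shopid = shop.shopid
LEFT JOIN goods ON r.goodsid = goods.goodsid
WHERE r.sumprice_rank <= 3
ORDER BY shop.shopname, r.sumprice_rank
"""

rows = build_db().execute(TOP3_SQL).fetchall()
for row in rows:
    print(row)
```

On the sample data this prints blouse (240), shirt (100), and dress (40) for AngelCity, then coat (190), dress (90), and shirt (90) for SexyBaby. Without the goodsid tie-breaker, dress and shirt could swap ranks 2 and 3 from one run to the next.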
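Approach for the advanced version:
1. Starting from the previous version, change the date filter to the last three months;
2. Pivot the rows into columns; note that the values being pivoted here are strings.
(For filtering dates by the last N days/months/years, see this blog post: https://blog.csdn.net/weixin_33739523/article/details/85820328
For the row-to-column method, see this blog post: https://www.cnblogs.com/hiwuchong/p/10080215.html)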
SELECT shopname,
       MAX(CASE WHEN sumprice_rank = 1 THEN t.goodsname ELSE '' END) AS goodsname1,
       MAX(CASE WHEN sumprice_rank = 2 THEN t.goodsname ELSE '' END) AS goodsname2,
       MAX(CASE WHEN sumprice_rank = 3 THEN t.goodsname ELSE '' END) AS goodsname3
FROM (
    SELECT shop.shopname, goods.goodsname, a.sumprice, a.sumprice_rank
    FROM (
        SELECT shopid, goodsid,
               SUM(salenum * price) AS sumprice,
               # goodsid breaks ties so that the ranks are deterministic
               ROW_NUMBER() OVER (PARTITION BY shopid
                                  ORDER BY SUM(salenum * price) DESC, goodsid) AS sumprice_rank
        FROM sales
        # orderdate is already a DATE, so it can be compared directly
        WHERE orderdate >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
        GROUP BY shopid, goodsid
    ) a
    LEFT JOIN shop ON a.shopid = shop.shopid
    LEFT JOIN goods ON a.goodsid = goods.goodsid
    WHERE a.sumprice_rank <= 3
) t
GROUP BY shopname;
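Because the sample orders are all dated in early 2020, a filter based on the current date returns nothing for them today. The sketch below therefore takes the reference date as a parameter so the pivot can be tried on the sample data. It is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module, uses SQLite's date(:as_of, '-3 months') in place of MySQL's DATE_SUB(CURDATE(), INTERVAL 3 MONTH), and the names build_db, top3_pivot, as_of, totals, and ranked are made up for this example.

```python
import sqlite3


def build_db():
    """Load the article's sample data into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (orderid INTEGER PRIMARY KEY, shopid INT,
                            goodsid INT, salenum INT, price INT, orderdate TEXT);
        INSERT INTO sales (shopid, goodsid, salenum, price, orderdate) VALUES
          (1, 10001, 1, 90, '2020-01-15'), (1, 10002, 1, 50, '2020-02-23'),
          (2, 10004, 2, 120, '2020-01-18'), (1, 10003, 3, 60, '2020-01-19'),
          (2, 10002, 1, 50, '2020-02-23'), (1, 10002, 1, 40, '2020-03-01'),
          (1, 10004, 3, 20, '2020-02-14'), (1, 10003, 1, 10, '2020-03-01'),
          (2, 10002, 1, 50, '2020-02-02'), (2, 10001, 1, 40, '2020-02-09');
        CREATE TABLE shop (shopid INT, shopname TEXT);
        INSERT INTO shop VALUES (1, 'SexyBaby'), (2, 'AngelCity');
        CREATE TABLE goods (goodsid INT, goodsname TEXT);
        INSERT INTO goods VALUES (10001, 'dress'), (10002, 'shirt'),
                                 (10003, 'coat'), (10004, 'blouse');
    """)
    return conn


PIVOT_SQL = """
WITH totals AS (
    SELECT shopid, goodsid, SUM(salenum * price) AS sumprice
    FROM sales
    WHERE orderdate >= date(:as_of, '-3 months')  -- last three months
    GROUP BY shopid, goodsid
), ranked AS (
    SELECT shopid, goodsid,
           ROW_NUMBER() OVER (PARTITION BY shopid
                              ORDER BY sumprice DESC, goodsid) AS sumprice_rank
    FROM totals
)
SELECT shop.shopname,
       -- each CASE keeps the name only on the row with the wanted rank;
       -- MAX then picks that name over the empty strings of the other rows
       MAX(CASE WHEN r.sumprice_rank = 1 THEN goods.goodsname ELSE '' END),
       MAX(CASE WHEN r.sumprice_rank = 2 THEN goods.goodsname ELSE '' END),
       MAX(CASE WHEN r.sumprice_rank = 3 THEN goods.goodsname ELSE '' END)
FROM ranked r
LEFT JOIN shop ON r.shopid = shop.shopid
LEFT JOIN goods ON r.goodsid = goods.goodsid
WHERE r.sumprice_rank <= 3
GROUP BY shop.shopname
ORDER BY shop.shopname
"""


def top3_pivot(conn, as_of):
    """Return one row per shop: (shopname, top1, top2, top3) as of a date."""
    return conn.execute(PIVOT_SQL, {"as_of": as_of}).fetchall()


conn = build_db()
print(top3_pivot(conn, "2020-03-31"))
print(top3_pivot(conn, "2020-05-15"))
```

With as_of set to 2020-03-31 every sample order falls inside the window, so each shop gets three names. With 2020-05-15 only orders from 2020-02-15 onward count, and the shops with fewer than three qualifying products show empty strings in the remaining columns, which is what the ELSE '' branch is for.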
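4. Solving with basic syntax
The last column of the basic version's expected result set, sumprice_rank, requires user-defined variables if window functions are not available. I will not go into that here; instead I will focus on how to find the top values within each group using basic syntax.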
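Approach for the basic version:
1. As in steps 1 and 2 of the window-function approach, first filter and aggregate to get the sales amount sumprice of each product in each shop for Q1 2020, then continue from that table;
2. To find the top three products by sales amount in each shop, compare the table from the previous step against a second copy of itself (a self-join written as a correlated subquery), using the condition
t1.sumprice < t2.sumprice AND t1.shopid = t2.shopid
then count the products that satisfy it
COUNT(t2.goodsid) < 3
If fewer than three products in the same shop have a higher sales amount, the product is among the shop's top three. Products tied on sales amount are all kept, so a shop can return more than three rows;
3. Join the inner result with the shop and goods tables to get shopname and goodsname, then select the required columns in the outer query.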
SELECT shop.shopname, goods.goodsname, t1.sumprice
FROM (
    SELECT shopid, goodsid, SUM(salenum * price) AS sumprice
    FROM sales
    # half-open range: covers all of Q1 2020, including Jan 1 and Mar 31
    WHERE orderdate >= '2020-01-01' AND orderdate < '2020-04-01'
    GROUP BY shopid, goodsid
) t1
LEFT JOIN shop ON t1.shopid = shop.shopid
LEFT JOIN goods ON t1.goodsid = goods.goodsid
WHERE (
    # number of products in the same shop with a strictly higher sales amount
    SELECT COUNT(t2.goodsid)
    FROM (
        SELECT shopid, goodsid, SUM(salenum * price) AS sumprice
        FROM sales
        WHERE orderdate >= '2020-01-01' AND orderdate < '2020-04-01'
        GROUP BY shopid, goodsid
    ) t2
    WHERE t1.sumprice < t2.sumprice AND t1.shopid = t2.shopid
) < 3
ORDER BY t1.shopid, t1.sumprice DESC;
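The counting idea can be tried on the sample data as follows. This is a sketch under assumptions, not the query above verbatim: it runs on SQLite through Python's sqlite3 module rather than MySQL, it stores the Q1 2020 aggregate in a view with the made-up name q1_totals so the subquery is not written twice, and the helper names build_db and top_n are hypothetical. The limit is passed as a parameter to show how the method treats ties.

```python
import sqlite3


def build_db():
    """Load the sample data and define a view with the Q1 2020 totals."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (orderid INTEGER PRIMARY KEY, shopid INT,
                            goodsid INT, salenum INT, price INT, orderdate TEXT);
        INSERT INTO sales (shopid, goodsid, salenum, price, orderdate) VALUES
          (1, 10001, 1, 90, '2020-01-15'), (1, 10002, 1, 50, '2020-02-23'),
          (2, 10004, 2, 120, '2020-01-18'), (1, 10003, 3, 60, '2020-01-19'),
          (2, 10002, 1, 50, '2020-02-23'), (1, 10002, 1, 40, '2020-03-01'),
          (1, 10004, 3, 20, '2020-02-14'), (1, 10003, 1, 10, '2020-03-01'),
          (2, 10002, 1, 50, '2020-02-02'), (2, 10001, 1, 40, '2020-02-09');
        CREATE TABLE shop (shopid INT, shopname TEXT);
        INSERT INTO shop VALUES (1, 'SexyBaby'), (2, 'AngelCity');
        CREATE TABLE goods (goodsid INT, goodsname TEXT);
        INSERT INTO goods VALUES (10001, 'dress'), (10002, 'shirt'),
                                 (10003, 'coat'), (10004, 'blouse');
        -- hypothetical view: sales amount per shop and product in Q1 2020
        CREATE VIEW q1_totals AS
        SELECT shopid, goodsid, SUM(salenum * price) AS sumprice
        FROM sales
        WHERE orderdate >= '2020-01-01' AND orderdate < '2020-04-01'
        GROUP BY shopid, goodsid;
    """)
    return conn


TOP_N_SQL = """
SELECT shop.shopname, goods.goodsname, t1.sumprice
FROM q1_totals t1
LEFT JOIN shop ON t1.shopid = shop.shopid
LEFT JOIN goods ON t1.goodsid = goods.goodsid
WHERE (
    -- how many products in the same shop sold strictly more than this one
    SELECT COUNT(*) FROM q1_totals t2
    WHERE t2.shopid = t1.shopid AND t2.sumprice > t1.sumprice
) < ?
ORDER BY shop.shopname, t1.sumprice DESC, goods.goodsname
"""


def top_n(conn, n):
    """Products with fewer than n better-selling products in their shop."""
    return conn.execute(TOP_N_SQL, (n,)).fetchall()


conn = build_db()
for row in top_n(conn, 3):
    print(row)
print(len(top_n(conn, 2)))  # 5: the tie at 90 in SexyBaby keeps both products
```

For n = 3 the result matches the window-function version: blouse, shirt, and dress for AngelCity, and coat, dress, and shirt for SexyBaby. For n = 2, SexyBaby still returns three rows because dress and shirt are tied at 90 and each has only one better-selling product, so the counting method behaves like RANK() rather than ROW_NUMBER().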
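I am a beginner in data analysis and machine learning. If you have any questions, feel free to discuss them in the comments; I look forward to making progress together with everyone.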