Download Spark from the official Spark website
Download Spark (any version will do), extract it, and place it under the bigdata directory (you can use a different directory).
Download winutils.exe, the file Hadoop needs on Windows
You can find this file online yourselves; it is not uploaded here. It is in fact optional: the error Spark logs without it does not affect Spark at all, so download it only if the warning bothers you (it bothers me). Once downloaded, place it in the bigdata\hadoop\bin directory.
There is no need to create an environment variable; just set the system property at the very beginning of your Java code, like this:
System.setProperty("hadoop.home.dir", HADOOP_HOME);
Create a Java Maven project named java-spark-sql-excel
Set up the directory structure as follows:
parent directory (where the project lives)
- java-spark-sql-excel
- bigdata
  - spark
  - hadoop
    - bin
      - winutils.exe
Coding
Initialize the SparkSession
static {
    System.setProperty("hadoop.home.dir", HADOOP_HOME);
    spark = SparkSession.builder()
            .appName("test")
            .master("local[*]")
            .config("spark.sql.warehouse.dir", SPARK_HOME)
            .config("spark.sql.parquet.binaryAsString", "true")
            .getOrCreate();
}
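The static block above references HADOOP_HOME, SPARK_HOME, and a spark field that are declared elsewhere in the class and not shown in this post. A minimal sketch of those declarations, assuming the bigdata layout from the earlier step (the class name and paths are placeholders, adjust them to your own machine):

import org.apache.spark.sql.SparkSession;

public class JavaSparkSqlExcel {
    // Hypothetical paths -- point them at wherever bigdata\spark and bigdata\hadoop actually live
    private static final String SPARK_HOME  = "D:\\bigdata\\spark";
    private static final String HADOOP_HOME = "D:\\bigdata\\hadoop";
    // Initialized once in the static block shown above
    private static SparkSession spark;
}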
Read the Excel file
public static void readExcel(String filePath, String tableName) throws IOException {
    DecimalFormat format = new DecimalFormat();
    format.applyPattern("#");
    // Create the file (this could also be an uploaded file: Spring MVC uses CommonsMultipartFile,
    // Jersey can use org.glassfish.jersey.media.multipart.FormDataParam -- see my file-upload post)
    File file = new File(filePath);
    // Open a file stream
    InputStream inputStream = new FileInputStream(file);
    // Wrap it in a buffered stream
    BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);
    // Workbook reference
    Workbook workbook = null;
    // .xlsx files use the XSSFWorkbook subclass, .xls files use HSSFWorkbook
    if (file.getName().contains("xlsx"))
        workbook = new XSSFWorkbook(bufferedInputStream);
    if (file.getName().contains("xls") && !file.getName().contains("xlsx"))
        workbook = new HSSFWorkbook(bufferedInputStream);
    System.out.println(file.getName());
    // Iterate over the sheets
    Iterator<Sheet> dataTypeSheets = workbook.sheetIterator();
    while (dataTypeSheets.hasNext()) {
        // Each sheet becomes one table; column names for this sheet
        ArrayList<String> schemaList = new ArrayList<String>();
        // Row data for this sheet
        ArrayList<org.apache.spark.sql.Row> dataList = new ArrayList<org.apache.spark.sql.Row>();
        // Schema fields
        List<StructField> fields = new ArrayList<>();
        // Current sheet
        Sheet dataTypeSheet = dataTypeSheets.next();
        // The first row supplies the column names
        Iterator<Row> iterator = dataTypeSheet.iterator();
        // Skip empty sheets
        if (!iterator.hasNext())
            continue;
        // Read the first row to build the table schema
        Iterator<Cell> firstRowCellIterator = iterator.next().iterator();
        while (firstRowCellIterator.hasNext()) {
            // Each cell of the first row becomes a column name
            Cell currentCell = firstRowCellIterator.next();
            // String cell
            if (currentCell.getCellTypeEnum() == CellType.STRING)
                schemaList.add(currentCell.getStringCellValue().trim());
            // Numeric cell
            if (currentCell.getCellTypeEnum() == CellType.NUMERIC)
                schemaList.add((currentCell.getNumericCellValue() + "").trim());
        }
        // Create a StructField per column (name, type, third argument true = nullable) and collect them
        for (String fieldName : schemaList) {
            StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
            fields.add(field);
        }
        // Build the Spark schema org.apache.spark.sql.types.StructType from the fields
        StructType schema = DataTypes.createStructType(fields);
        // Number of columns
        int len = schemaList.size();
        // Index of the last row in this sheet
        int rowEnd = dataTypeSheet.getLastRowNum();
        // Walk every data row of the sheet
        for (int rowNum = 1; rowNum <= rowEnd; rowNum++) {
            // One row becomes one list of strings
            ArrayList<String> rowDataList = new ArrayList<String>();
            // Fetch the row
            Row r = dataTypeSheet.getRow(rowNum);
            if (r != null) {
                // Visit as many cells as there are columns
                for (int cn = 0; cn < len; cn++) {
                    Cell c = r.getCell(cn, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
                    if (c == null)
                        rowDataList.add("0"); // blank cells are simply filled with zero
                    if (c != null && c.getCellTypeEnum() == CellType.STRING)
                        rowDataList.add(c.getStringCellValue().trim()); // string cell
                    if (c != null && c.getCellTypeEnum() == CellType.NUMERIC) {
                        double value = c.getNumericCellValue();
                        if (p.matcher(value + "").matches())
                            rowDataList.add(format.format(value)); // whole number: drop the decimal part
                        if (!p.matcher(value + "").matches())
                            rowDataList.add(value + ""); // keep the decimal part
                    }
                }
            }
            // Append the row to the data set
            dataList.add(RowFactory.create(rowDataList.toArray()));
        }
        // Create a temporary view named tableName + sheet name from the data and schema
        spark.createDataFrame(dataList, schema).createOrReplaceTempView(tableName + dataTypeSheet.getSheetName());
    }
}
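readExcel also uses a Pattern field p that is not shown in the post; it decides whether a numeric cell holds a whole number so the decimal part can be dropped. A minimal sketch, assuming p simply matches doubles whose fractional part is zero (declare it alongside the other static fields):

import java.util.regex.Pattern;

// Hypothetical declaration: "3.0" matches (whole number), "3.14" does not
private static final Pattern p = Pattern.compile("-?\\d+\\.0*");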
Create test files in the project directory
First sheet:
Second sheet:
Third sheet:
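The sheets above were shown as screenshots in the original post. If you would rather generate a test workbook in code, here is a minimal sketch using the same POI dependency; the file name test2.xlsx matches the test below, but the column names and values are made up purely for illustration:

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class CreateTestWorkbook {
    public static void main(String[] args) throws Exception {
        try (XSSFWorkbook wb = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("test2.xlsx")) {
            Sheet sheet = wb.createSheet("Sheet1");
            // Header row -- these column names are only an example
            Row header = sheet.createRow(0);
            header.createCell(0).setCellValue("id");
            header.createCell(1).setCellValue("name");
            header.createCell(2).setCellValue("score");
            // One data row with example values
            Row row = sheet.createRow(1);
            row.createCell(0).setCellValue(1);
            row.createCell(1).setCellValue("tom");
            row.createCell(2).setCellValue(95.5);
            wb.write(out);
        }
    }
}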
Test
public static void main(String[] args) throws Exception {
    // Paths of the Excel files to query
    String xlsxPath = "test2.xlsx";
    String xlsPath = "test.xls";
    // Table name prefixes
    String tableName1 = "test_table1";
    String tableName2 = "test_table2";
    // readExcel registers each sheet as a view named tableNameN + sheet name
    readExcel(xlsxPath, tableName2);
    spark.sql("select * from " + tableName2 + "Sheet1").show();
    readExcel(xlsPath, tableName1);
    spark.sql("select * from " + tableName1 + "Sheet1").show();
    spark.sql("select * from " + tableName1 + "Sheet2").show();
    spark.sql("select * from " + tableName1 + "Sheet3").show();
}
Results
Dependencies
<dependencies>
    <dependency>
        <groupId>org.spark-project.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.2.1.spark2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
</dependencies>