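1. The schema parameter: AssertionError: dataType should be DataType

    from pyspark.sql.types import StructType, StructField, StringType

    # Raises AssertionError: dataType should be DataType
    schema = StructType([
        # True means the column is nullable (it may contain null values)
        StructField("col_1", StringType, True),
        StructField("col_2", StringType, True),
        StructField("col_3", StringType, True),
    ])

    # Cause: StringType (and the other type names) is missing the parentheses "()",
    # so the class itself is passed instead of an instance of it.
    # Fix:
    schema = StructType([
        # True means the column is nullable (it may contain null values)
        StructField("col_1", StringType(), True),
        StructField("col_2", StringType(), True),
        StructField("col_3", StringType(), True),
    ])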
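
The error comes down to an `isinstance` check on the `dataType` argument. The sketch below reproduces that check with simplified stand-in classes so it runs without Spark installed. The classes here are hypothetical and are not pyspark's real implementation, and `try_field` is a helper made up for this illustration.

```python
class DataType:
    """Stand-in base class for column types (hypothetical, simplified)."""


class StringType(DataType):
    """Stand-in for a string column type."""


class StructField:
    """Stand-in field that validates its dataType argument."""

    def __init__(self, name, dataType, nullable=True):
        # The same kind of check that produces the message in the traceback.
        assert isinstance(dataType, DataType), "dataType should be DataType"
        self.name = name
        self.dataType = dataType
        self.nullable = nullable


def try_field(data_type):
    """Return 'ok' if the field can be built, else the assertion message."""
    try:
        StructField("col_1", data_type, True)
        return "ok"
    except AssertionError as exc:
        return str(exc)


print(try_field(StringType))    # the class itself is not a DataType instance
print(try_field(StringType()))  # an instance passes the check
```

`StringType` without parentheses is a class object, so it is not an instance of `DataType` and the assertion fails. `StringType()` creates the instance the check expects.
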
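2. PySpark currently provides the following data types:
NullType, StringType, BinaryType, BooleanType, DateType, TimestampType, DecimalType, DoubleType, FloatType, ByteType, IntegerType, LongType, ShortType, ArrayType, MapType, StructType (StructField), and others. Choose the type that fits the data, and watch for possible overflow when a value exceeds the range of a fixed-width type.
The corresponding Python types, as summarized by an experienced community member, are as follows: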
| PySpark type | Python type |
| --- | --- |
| NullType | None |
| StringType | str (basestring in Python 2) |
| BinaryType | bytearray |
| BooleanType | bool |
| DateType | datetime.date |
| TimestampType | datetime.datetime |
| DecimalType | decimal.Decimal |
| DoubleType | float (double-precision floats) |
| FloatType | float (single-precision floats) |
| ByteType | int (a signed 8-bit integer) |
| ShortType | int (a signed 16-bit integer) |
| IntegerType | int (a signed 32-bit integer) |
| LongType | int (long in Python 2; a signed 64-bit integer) |
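
To make the overflow warning concrete, here is a minimal pure-Python sketch that checks whether values fit in each signed integer type, assuming the standard Spark widths of 8, 16, 32, and 64 bits. The names `INTEGER_BITS`, `int_range`, `fits`, and `smallest_type` are hypothetical helpers written for this note and are not part of pyspark.

```python
# Bit widths of the signed integer types, narrowest first.
INTEGER_BITS = {
    "ByteType": 8,
    "ShortType": 16,
    "IntegerType": 32,
    "LongType": 64,
}


def int_range(type_name):
    """Return the (min, max) values a signed integer type can hold."""
    bits = INTEGER_BITS[type_name]
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(value, type_name):
    """Return True if value lies within the range of the given type."""
    low, high = int_range(type_name)
    return low <= value <= high


def smallest_type(values):
    """Return the narrowest type name that holds every value, or None."""
    for type_name in INTEGER_BITS:  # dicts preserve insertion order
        if all(fits(v, type_name) for v in values):
            return type_name
    return None


print(int_range("IntegerType"))
print(fits(2 ** 31, "IntegerType"))
print(smallest_type([1, 300]))
```

A check like this can be run on sample data before choosing a column type. A Python `int` is unbounded, so a value that is valid in Python may still fall outside the range of the column type chosen in the schema.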
