What data types does Spark have? https://spark.apache.org/docs/latest/sql-reference.html
Spark data types
Data Types
Spark SQL and DataFrames support the following data types:
- Numeric types
  - ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
  - ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
  - IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
  - LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
  - FloatType: Represents 4-byte single-precision floating point numbers.
  - DoubleType: Represents 8-byte double-precision floating point numbers.
  - DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
- String type
  - StringType: Represents character string values.
- Binary type
  - BinaryType: Represents byte sequence values.
- Boolean type
  - BooleanType: Represents boolean values.
- Datetime type
  - TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second.
  - DateType: Represents values comprising values of fields year, month, day.
- Complex types
  - ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull is used to indicate if elements in an ArrayType value can have null values.
  - MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
  - StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
  - StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this field can have null values.
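The integer ranges in the list all follow from the byte width: an n-byte signed integer spans -2^(8n-1) to 2^(8n-1)-1. The plain-Python sketch below recomputes them and illustrates the "unscaled value plus scale" representation described for DecimalType. Python's decimal module is used here only as a stand-in for java.math.BigDecimal, and the helper names signed_range and unscaled_and_scale are hypothetical.

```python
from decimal import Decimal


def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Byte widths taken from the type list.
WIDTHS = {"ByteType": 1, "ShortType": 2, "IntegerType": 4, "LongType": 8}
RANGES = {name: signed_range(width) for name, width in WIDTHS.items()}


def unscaled_and_scale(value):
    """Split a finite Decimal into (unscaled, scale) so value == unscaled * 10**-scale."""
    sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(map(str, digits)))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent


for name, (low, high) in RANGES.items():
    print(f"{name}: {low} to {high}")

# Stand-in for BigDecimal: -123.45 is stored as unscaled value -12345 with scale 2.
print(unscaled_and_scale(Decimal("-123.45")))
```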
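The corresponding PySpark data types are in pyspark.sql.types.
Some common conversion scenarios (F below is pyspark.sql.functions, imported with from pyspark.sql import functions as F):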
1. Convert a date/timestamp/string column to a string. The output format is given by the second argument.
df.withColumn('test', F.date_format(F.col('Last_Update'), 'yyyy/MM/dd')).show()
2. Once the value is a string, you can cast it to the type you want, for example the date type below.
df = df.withColumn('date', F.date_format(F.col('Last_Update'), 'yyyy-MM-dd').cast('date'))
3. Convert a timestamp in seconds (counted from 1970) to a formatted date string, for example with F.from_unixtime

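4. F.unix_timestamp converts a date string to a timestamp in seconds. It is the inverse of the operation above.
unix_timestamp returns whole seconds and drops milliseconds. If you do need them, parse the seconds and add the millisecond digits back separately. The substring call below assumes TIME ends with three millisecond digits, a space, and a three-letter time zone (for example ".123 UTC"):
df1 = df.withColumn('unix_timestamp', F.unix_timestamp(df.TIME, 'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME, -7, 3).cast('float') / 1000)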
5. To convert a timestamp in seconds to the timestamp type, you can use F.to_timestamp
6. Extract the time, date, and similar information from a timestamp or a date string