Spark SQL DataType is the base class of all data types in Spark and is defined in the package org.apache.spark.sql.types. Data types are primarily used while working with DataFrames. In this article, you will learn about the different data types and their utility methods, with Scala examples.

1. Spark SQL DataType – base class of all Data Types

All data types listed below are supported in Spark SQL, and the DataType class is the base class for all of them. Some types, like IntegerType, DecimalType, and ByteType, are subclasses of NumericType, which is itself a subclass of DataType.

StringType, ShortType, ArrayType, IntegerType, MapType, LongType, StructType, FloatType, DateType, DoubleType, TimestampType, DecimalType, BooleanType, ByteType, CalendarIntervalType, HiveStringType, BinaryType, ObjectType, NumericType, NullType
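The hierarchy described above can be checked directly; here is a minimal sketch (assuming spark-sql is on the classpath):

```scala
import org.apache.spark.sql.types._

// Every type object in the list above extends DataType
val types: Seq[DataType] = Seq(StringType, IntegerType, MapType(StringType, LongType))
types.foreach(t => println(t.typeName + " is a DataType"))

// Numeric types such as IntegerType additionally extend NumericType
println("IntegerType is numeric : " + IntegerType.isInstanceOf[NumericType]) // true
println("StringType is numeric  : " + StringType.isInstanceOf[NumericType]) // false
```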

1.1 DataType common methods

All Spark SQL data types extend the DataType class and provide implementations for the methods shown in this example.

val arrayType = ArrayType(StringType, true)
println("json() : " + arrayType.json) // JSON string of the data type
println("prettyJson() : " + arrayType.prettyJson) // JSON in pretty format
println("simpleString() : " + arrayType.simpleString) // simple string
println("sql() : " + arrayType.sql) // SQL format
println("typeName() : " + arrayType.typeName) // type name
println("catalogString() : " + arrayType.catalogString) // catalog string
println("defaultSize() : " + arrayType.defaultSize) // default size

Yields below output.

json() : {"type":"array","elementType":"string","containsNull":true}
prettyJson() : {
  "type" : "array",
  "elementType" : "string",
  "containsNull" : true
}
simpleString() : array<string>
sql() : ARRAY<STRING>
typeName() : array
catalogString() : array<string>
defaultSize() : 20

Besides these, the DataType class has the following static methods.

1.2 DataType.fromJson()

If you have a JSON string and want to convert it to a DataType, use fromJson(). For example, you may want to convert a JSON schema string into a StructType or, as below, into an ArrayType.

val typeFromJson = DataType.fromJson(
  """{"type":"array",
    |"elementType":"string","containsNull":false}""".stripMargin)
println(typeFromJson.getClass)

val typeFromJson2 = DataType.fromJson("\"string\"")
println(typeFromJson2.getClass)

// This prints
// class org.apache.spark.sql.types.ArrayType
// class org.apache.spark.sql.types.StringType$

1.3 DataType.fromDDL()

As with loading a structure from a JSON string, we can also create one from a DDL string using fromDDL().

val ddlSchemaStr = "`fullName` STRUCT<`firstname`: STRING, `lastname`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = DataType.fromDDL(ddlSchemaStr)
println(ddlSchema.getClass)

// This prints
// class org.apache.spark.sql.types.StructType

1.4 DataType.canWrite()

canWrite() checks whether values of one data type can safely be written to another data type (for example, when writing a DataFrame into an existing table). Note that in recent Spark versions this method is marked private[sql], so it is used internally by Spark rather than called directly from user code.

1.5 DataType.equalsStructurally()

equalsStructurally() returns true when two data types have the same structure, optionally ignoring nullability. Like canWrite(), it is marked private[sql] in recent Spark versions and is primarily used internally by Spark.
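A publicly accessible way to illustrate structural comparison is plain equality on the type objects themselves; note that nullability participates in this comparison (a sketch, assuming spark-sql is on the classpath):

```scala
import org.apache.spark.sql.types._

// Two array types that differ only in nullability are not equal...
val a = ArrayType(IntegerType, containsNull = true)
val b = ArrayType(IntegerType, containsNull = false)
println("a == b : " + (a == b)) // false

// ...while identical structures compare equal
val c = ArrayType(IntegerType, containsNull = true)
println("a == c : " + (a == c)) // true
```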

2. Use Spark SQL DataTypes class to get a type object

To get or create a specific data type, use the objects and factory methods provided by the org.apache.spark.sql.types.DataTypes class. For example, use the object DataTypes.StringType to get StringType, and the factory method DataTypes.createArrayType(StringType) to get an ArrayType of string.

// Below are some examples
val strType = DataTypes.StringType
val arrayType = DataTypes.createArrayType(StringType)
val structType = DataTypes.createStructType(
  Array(DataTypes.createStructField("fieldName", StringType, true)))

3. StringType

StringType (org.apache.spark.sql.types.StringType) is used to represent string values. To create a string type, use either DataTypes.StringType or the StringType object; both return the same StringType instance.

val strType = DataTypes.StringType
println("json : " + strType.json)
println("prettyJson : " + strType.prettyJson)
println("simpleString : " + strType.simpleString)
println("sql : " + strType.sql)
println("typeName : " + strType.typeName)
println("catalogString : " + strType.catalogString)
println("defaultSize : " + strType.defaultSize)

Outputs

json : "string"
prettyJson : "string"
simpleString : string
sql : STRING
typeName : string
catalogString : string
defaultSize : 20

4. ArrayType

Use ArrayType to represent arrays in a DataFrame, and use either the factory method DataTypes.createArrayType() or the ArrayType() constructor to get an array object of a specific type.

On an ArrayType object you can access all the methods defined in section 1.1, and additionally it provides containsNull(), elementType(), and productElement(), to name a few.

val arr = ArrayType(IntegerType, false)

val arrayType = DataTypes.createArrayType(StringType, true)
println("containsNull : " + arrayType.containsNull)
println("elementType : " + arrayType.elementType)
println("productElement : " + arrayType.productElement(0))

Yields below output.

containsNull : true
elementType : StringType
productElement : StringType

For more examples and usage, please refer to Using ArrayType on DataFrame.

5. MapType

Use MapType to represent key-value pair maps in a DataFrame, and use either the factory method DataTypes.createMapType() or the MapType() constructor to get a map object of a specific key and value type.

On a MapType object you can access all the methods defined in section 1.1, and additionally it provides keyType(), valueType(), valueContainsNull(), and productElement(), to name a few.

val mapType1 = MapType(StringType, IntegerType)

val mapType = DataTypes.createMapType(StringType, IntegerType)
println("keyType() : " + mapType.keyType)
println("valueType() : " + mapType.valueType)
println("valueContainsNull() : " + mapType.valueContainsNull)
println("productElement(1) : " + mapType.productElement(1))

Yields below output.

keyType() : StringType
valueType() : IntegerType
valueContainsNull() : true
productElement(1) : IntegerType

For more examples and usage, please refer to Using MapType on DataFrame.

6. DateType

Use DateType (org.apache.spark.sql.types.DateType) to represent a date on a DataFrame, and use DataTypes.DateType or the DateType object to get a date type.

On a DateType object you can access all the methods defined in section 1.1.
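For instance, the common methods from section 1.1 work on DateType just as they do on ArrayType (a minimal sketch, assuming spark-sql is on the classpath):

```scala
import org.apache.spark.sql.types._

val dateType = DataTypes.DateType
println("json : " + dateType.json)               // "date"
println("sql : " + dateType.sql)                 // DATE
println("typeName : " + dateType.typeName)       // date
println("defaultSize : " + dateType.defaultSize) // 4 (stored as an Int of days)
```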

7. TimestampType

Use TimestampType (org.apache.spark.sql.types.TimestampType) to represent a timestamp on a DataFrame, and use DataTypes.TimestampType or the TimestampType object to get a timestamp type.

On a TimestampType object you can access all the methods defined in section 1.1.
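As with DateType, the section 1.1 methods are available (a minimal sketch, assuming spark-sql is on the classpath):

```scala
import org.apache.spark.sql.types._

val tsType = DataTypes.TimestampType
println("json : " + tsType.json)                   // "timestamp"
println("sql : " + tsType.sql)                     // TIMESTAMP
println("catalogString : " + tsType.catalogString) // timestamp
println("defaultSize : " + tsType.defaultSize)     // 8 (stored as a Long of microseconds)
```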

8. StructType

Use StructType (org.apache.spark.sql.types.StructType) to define the nested structure, or schema, of a DataFrame, and use either DataTypes.createStructType() or the StructType() constructor to get a struct object.

The StructType object provides a lot of functions, such as toDDL(), fields(), fieldNames(), and length(), to name a few.

// StructType
val structType = DataTypes.createStructType(
  Array(DataTypes.createStructField("fieldName", StringType, true)))

val simpleSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("id", IntegerType, true),
  StructField("gender", StringType, true),
  StructField("salary", DoubleType, true)
))

val anotherSchema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("lastname", StringType))
  .add("id", IntegerType)
  .add("salary", DoubleType)
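The helper methods mentioned above can be exercised on a schema like this (a minimal sketch, assuming spark-sql is on the classpath):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("id", IntegerType, true)
))
println("toDDL : " + schema.toDDL)                          // DDL string of the schema
println("fieldNames : " + schema.fieldNames.mkString(",")) // name,id
println("length : " + schema.length)                       // 2
```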

For more examples and usage, please refer to StructType.

9. All other remaining Spark SQL Data Types

Similar to the types described above, for the rest of the data types use the appropriate method on the DataTypes class, or the data type's constructor, to create an object of the desired type. All the common methods described in section 1.1 are available on these types as well.
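For example, a few of the remaining types can be created the same two ways (a minimal sketch, assuming spark-sql is on the classpath):

```scala
import org.apache.spark.sql.types._

// DecimalType via its factory method, with precision 10 and scale 2
val decimalType = DataTypes.createDecimalType(10, 2)
println("decimal sql : " + decimalType.sql)        // DECIMAL(10,2)

// BooleanType and BinaryType via their DataTypes objects
val boolType = DataTypes.BooleanType
println("boolean typeName : " + boolType.typeName) // boolean
val binType = DataTypes.BinaryType
println("binary defaultSize : " + binType.defaultSize)
```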

Conclusion

In this article, you have learned about the different Spark SQL data types, the DataType and DataTypes classes, and their methods, using Scala examples. I would recommend referring to the DataType and DataTypes API documentation for more details.

Thanks for reading. If you liked it, please share the article using the social links below; any comments or suggestions are welcome in the comments section!

Happy Learning !!