How to Convert JavaRDD<String> of JSON to Dataset<Row> in Spark Java
Suppose we have an instance of SparkSession in Java.
// getOrCreate() reuses the SparkContext that is already running
SparkSession spark = SparkSession
  .builder()
  .config(javaSparkContext.getConf())
  .getOrCreate();
We also have a JavaRDD<String> of JSON strings, which we want to convert into a Dataset<Row>.
JavaRDD<String> jsonStrings = ...;
First, we can convert our RDD to a Dataset<String> using spark.createDataset().
Dataset<String> tempDs = spark.createDataset(
  jsonStrings.rdd(),
  Encoders.STRING()
);
1. Using spark.read().json()
Then, we can parse each JSON string using spark.read().json().
In this operation, Spark SQL infers the schema of the JSON dataset and loads it as a Dataset<Row>.
Dataset<Row> finalDs = spark.read().json(tempDs);
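To see the steps above working together, here is a minimal end-to-end sketch. It assumes a local Spark installation on the classpath (Spark 2.2 or later, where read().json() accepts a Dataset<String>). The class name JsonRddToDataset, the helper toRows(), and the sample records are hypothetical and used only for illustration.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonRddToDataset {

    // Hypothetical helper: wraps the two steps described above.
    static Dataset<Row> toRows(SparkSession spark, JavaRDD<String> jsonStrings) {
        // Step 1: JavaRDD<String> -> Dataset<String> (single column named "value")
        Dataset<String> tempDs = spark.createDataset(jsonStrings.rdd(), Encoders.STRING());
        // Step 2: let Spark SQL infer the schema and parse every record
        return spark.read().json(tempDs);
    }

    public static void main(String[] args) {
        // Assumption: a local master is good enough for a quick experiment
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("json-rdd-to-dataset")
            .getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Made-up sample data, one JSON object per string
        JavaRDD<String> jsonStrings = jsc.parallelize(Arrays.asList(
            "{\"name\":\"Ada\",\"age\":36}",
            "{\"name\":\"Linus\",\"age\":28}"
        ));

        Dataset<Row> finalDs = toRows(spark, jsonStrings);
        finalDs.printSchema(); // shows the inferred column names and types
        finalDs.show();

        spark.stop();
    }
}
```

Because the schema is inferred, Spark has to scan the JSON strings once before it can build the Dataset<Row>, which is worth keeping in mind for large inputs.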
2. Using from_json()
Alternatively, we can first infer the schema from the JSON string dataset and keep it as a StructType.
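StructType schema = spark
  .read()
  .json(tempDs)
  .schema();
Then, we can apply that schema to the value column with from_json() and expand the resulting struct into top-level columns. Here, col() and from_json() are static imports from org.apache.spark.sql.functions.
Dataset<Row> finalDs = tempDs
  .withColumn("json", from_json(col("value"), schema))
  .select(col("json.*"));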
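If the structure of the JSON is known in advance, from_json() can also be given a hand-written schema, which skips the inference step entirely. The following is a sketch under that assumption: the fields name and age, the class FromJsonWithExplicitSchema, and the helper parse() are hypothetical and should be replaced with whatever matches the real data.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class FromJsonWithExplicitSchema {

    // Hypothetical schema for illustration; adjust the fields to your JSON.
    static final StructType SCHEMA = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.LongType);

    // Parses the "value" column of a Dataset<String> using the fixed schema.
    static Dataset<Row> parse(Dataset<String> jsonDs) {
        return jsonDs
            .withColumn("json", from_json(col("value"), SCHEMA))
            .select(col("json.*")); // flatten the struct into top-level columns
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("from-json-explicit-schema")
            .getOrCreate();

        // Made-up sample data standing in for tempDs
        Dataset<String> tempDs = spark.createDataset(
            Arrays.asList(
                "{\"name\":\"Ada\",\"age\":36}",
                "{\"name\":\"Linus\",\"age\":28}"
            ),
            Encoders.STRING()
        );

        parse(tempDs).show();

        spark.stop();
    }
}
```

The trade-off is that an explicit schema has to be kept in sync with the data by hand, whereas the inferred schema adapts automatically at the cost of an extra read of the input.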