How to Convert JavaRDD<String> of JSON to Dataset<Row> in Spark Java


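Suppose we have an instance of SparkSession in Java. The SparkSession constructor is not part of the public API, so we obtain the session through the builder; getOrCreate() reuses the SparkContext that is already running.

SparkSession spark = SparkSession.builder()
  .config(javaSparkContext.getConf())
  .getOrCreate();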

We also have a JavaRDD<String> of JSON strings that we want to convert into a Dataset<Row>.

JavaRDD<String> jsonStrings = ...;

First, we can convert our RDD to a Dataset<String> using spark.createDataset().

Dataset<String> tempDs = spark.createDataset(
  jsonStrings.rdd(),
  Encoders.STRING()
);

1. Using spark.read().json()

With the Dataset<String> in hand, we can parse each JSON string using spark.read().json().

In this operation, Spark SQL infers the schema of a JSON dataset and loads it as a Dataset<Row>.

Dataset<Row> finalDs = spark.read().json(tempDs);
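To see the steps so far working together, here is a minimal sketch that runs against a local session. The class name JsonRddToDataset, the helper toRows(), the local[*] master and the two sample records are assumptions made for illustration; they are not part of the original setup.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;

public class JsonRddToDataset {

  // Hypothetical helper: wraps the RDD in a Dataset<String>, then lets Spark infer the schema.
  static Dataset<Row> toRows(SparkSession spark, JavaRDD<String> jsonStrings) {
    Dataset<String> tempDs = spark.createDataset(jsonStrings.rdd(), Encoders.STRING());
    return spark.read().json(tempDs);
  }

  public static void main(String[] args) {
    // Assumption: a local session is enough for a demonstration.
    SparkSession spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-rdd-sketch")
      .getOrCreate();
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

    // Made-up sample data: one complete JSON object per RDD element.
    JavaRDD<String> jsonStrings = jsc.parallelize(Arrays.asList(
      "{\"name\":\"Ann\",\"age\":31}",
      "{\"name\":\"Bob\",\"age\":45}"
    ));

    Dataset<Row> finalDs = toRows(spark, jsonStrings);
    finalDs.printSchema();
    finalDs.show();

    spark.stop();
  }
}
```

Note that spark.read().json(Dataset<String>) is available from Spark 2.2 onwards, and that schema inference requires a pass over the data before the rows are parsed.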

2. Using from_json()

Alternatively, we can first infer the schema from the JSON string dataset as a StructType, and then apply that schema to each string with from_json().

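import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

// Infer the schema once from the JSON strings.
StructType schema = spark
  .read()
  .json(tempDs)
  .schema();

// Parse the "value" column with that schema and flatten the resulting struct.
Dataset<Row> finalDs = tempDs
  .withColumn("json", from_json(col("value"), schema))
  .select(col("json.*"));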
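When the structure of the JSON is known in advance, from_json() can also be given a hand-written StructType, which skips the inference pass over the data. The sketch below assumes records with a string field name and a numeric field age; the class name FromJsonExplicitSchema, the helper parse() and the sample data are hypothetical and only serve the illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class FromJsonExplicitSchema {

  // Hypothetical helper: parses the "value" column with the given schema
  // and expands the resulting struct into top-level columns.
  static Dataset<Row> parse(Dataset<String> jsonDs, StructType schema) {
    return jsonDs
      .withColumn("json", from_json(col("value"), schema))
      .select(col("json.*"));
  }

  public static void main(String[] args) {
    // Assumption: a local session is enough for a demonstration.
    SparkSession spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-sketch")
      .getOrCreate();

    // Made-up sample data; a Dataset<String> exposes its content as column "value".
    Dataset<String> jsonDs = spark.createDataset(
      Arrays.asList(
        "{\"name\":\"Ann\",\"age\":31}",
        "{\"name\":\"Bob\",\"age\":45}"
      ),
      Encoders.STRING()
    );

    // Assumed schema, written by hand instead of inferred.
    StructType schema = new StructType()
      .add("name", DataTypes.StringType)
      .add("age", DataTypes.LongType);

    parse(jsonDs, schema).show();

    spark.stop();
  }
}
```

The trade-off is that a hand-written schema has to be kept in sync with the data, whereas the inferred schema adapts to whatever fields are present.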