How to Avoid Throwing Exceptions in spark.read()


We can read data programmatically in Spark using spark.read().

How can we prevent Spark from throwing an Exception when a file is not found?

Suppose we want to use an instance of SparkSession called spark to read from S3.

We can wrap our spark.read() command inside a try-catch block to handle the errors manually. Let’s check out some errors we may run into.

Handling FileNotFoundException

If we specify a non-existent bucket in the S3 path, then we’ll hit a FileNotFoundException.

java.io.FileNotFoundException: Bucket fake-bucket does not exist

Handling AnalysisException

If our glob does not match any files, we’ll get an AnalysisException.

org.apache.spark.sql.AnalysisException: Path does not exist: s3a://real-bucket/fake/path/*.json;

Avoid exceptions in spark.read()

In this scenario, let’s return an empty Dataset<Row> when the bucket does not exist or no files match our S3 path.

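try {
  // Return the result directly; a variable declared inside the try block
  // would go out of scope before it could be used.
  return spark.read().json(s3Path);
} catch (Exception e) {
  // Both are checked exceptions that json() does not declare, so we catch
  // the broad Exception type and test the concrete type with instanceof.
  if (e instanceof AnalysisException || e instanceof FileNotFoundException) {
    LOG.error(e.toString());
    return spark.emptyDataFrame();
  }
  // Anything else is unexpected: rethrow it as an unchecked exception.
  throw new RuntimeException(e);
}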

Any other exception is rethrown, wrapped in a RuntimeException, so unexpected failures are not silently swallowed.
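If the same fallback logic is needed for several reads, the try-catch pattern can be pulled into a small reusable helper. The following is only a sketch under stated assumptions: the class name SafeRead and the method readOrElse are hypothetical and not part of Spark, and the main method uses a stand-in reader that always fails rather than a real spark.read() call, so it runs without Spark on the classpath.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper (not a Spark API) that generalizes the try-catch pattern.
public class SafeRead {

  // Runs `reader`. If it fails with an exception accepted by `isRecoverable`,
  // returns `fallback.get()` instead. Any other exception is rethrown wrapped
  // in a RuntimeException.
  public static <T> T readOrElse(Callable<T> reader,
                                 Predicate<Exception> isRecoverable,
                                 Supplier<T> fallback) {
    try {
      return reader.call();
    } catch (Exception e) {
      if (isRecoverable.test(e)) {
        return fallback.get();
      }
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // Stand-in for spark.read().json(s3Path): always fails like a missing bucket.
    Callable<String> missingBucket = () -> {
      throw new FileNotFoundException("Bucket fake-bucket does not exist");
    };

    String result = SafeRead.<String>readOrElse(
        missingBucket,
        e -> e instanceof FileNotFoundException,
        () -> "empty");

    System.out.println(result); // prints "empty"
  }
}
```

With Spark on the classpath, a call might look like readOrElse(() -> spark.read().json(s3Path), e -> e instanceof AnalysisException || e instanceof FileNotFoundException, spark::emptyDataFrame). Passing the recoverable check as a predicate keeps the helper free of Spark imports and lets each caller decide which failures count as "no data".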