How to Avoid Throwing Exceptions in spark.read()
We can read data programmatically in Spark using spark.read(). How can we prevent Spark from throwing an exception when a file is not found?
Suppose we want to use an instance of SparkSession called spark to read from S3.
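As a minimal sketch of the happy path, a read might look like the following. The bucket name and key prefix are placeholders, and we assume the hadoop-aws connector and valid AWS credentials are available so that s3a:// paths resolve.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class S3ReadExample {
    public static void main(String[] args) {
        // Local session for illustration; in production this would target a cluster
        SparkSession spark = SparkSession.builder()
                .appName("s3-read-example")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical bucket and glob; any of these path components may not exist
        String s3Path = "s3a://my-bucket/events/*.json";
        Dataset<Row> dataset = spark.read().json(s3Path);
        dataset.show();
    }
}
```

If the bucket or the glob target is missing, the spark.read().json(s3Path) call is exactly where the exceptions below originate.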
We can wrap our spark.read() call inside a try-catch block to handle the errors manually. Let’s check out some errors we may run into.
Handling FileNotFoundException
If we specify a non-existent bucket in the S3 path, then we’ll hit a FileNotFoundException.
java.io.FileNotFoundException: Bucket fake-bucket does not exist
Handling AnalysisException
If our glob does not match any files, we’ll get an AnalysisException.
org.apache.spark.sql.AnalysisException: Path does not exist: s3a://real-bucket/fake/path/*.json;
Avoid exceptions in spark.read()
In this scenario, let’s return an empty Dataset&lt;Row&gt; when no files match our S3 path.
try {
    // Attempt the read; this is the call that can fail
    return spark.read().json(s3Path);
} catch (Exception e) {
    if (e instanceof AnalysisException || e instanceof FileNotFoundException) {
        LOG.error(e.toString());
        // No matching files: fall back to an empty, schemaless Dataset<Row>
        return spark.emptyDataFrame();
    }
    throw new RuntimeException(e);
}
In the case of any other exception, we’ll throw a RuntimeException.
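Putting the pieces together, the pattern can be wrapped in a helper method. This is a sketch: the class and method names (SafeReader, readJsonOrEmpty) and the SLF4J logger are our own choices, not part of Spark's API.

```java
import java.io.FileNotFoundException;

import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SafeReader {
    private static final Logger LOG = LoggerFactory.getLogger(SafeReader.class);

    // Reads JSON from s3Path, returning an empty Dataset<Row> when the
    // bucket does not exist or the glob matches no files.
    public static Dataset<Row> readJsonOrEmpty(SparkSession spark, String s3Path) {
        try {
            return spark.read().json(s3Path);
        } catch (Exception e) {
            if (e instanceof AnalysisException || e instanceof FileNotFoundException) {
                LOG.error(e.toString());
                return spark.emptyDataFrame();
            }
            // Anything else is unexpected: rethrow as unchecked
            throw new RuntimeException(e);
        }
    }
}
```

One caveat: emptyDataFrame() has no columns, so downstream code that expects the JSON schema must tolerate a schemaless result (for example, by checking dataset.isEmpty() before selecting columns).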