How to Read Multiple Files or Directories in Spark (glob)


How can we match multiple files or directories in spark.read()?

The examples use Java, but the same glob syntax applies when calling Spark from any of its language APIs.

Read a single file using spark.read()

Spark allows us to load data programmatically using spark.read() into a Dataset.

Dataset<Row> ds;

We can read a variety of data sources into our Dataset.

ds = spark.read().json("/path/to/file.json");
ds = spark.read().csv("/path/to/file.csv");
ds = spark.read().text("/path/to/file.text");
ds = spark.read().parquet("/path/to/file.parquet");
ds = spark.read().orc("/path/to/file.orc");
ds = spark.read().format("avro").load("/path/to/file.avro");

The spark-avro module is external, so it has to be added to the application as a dependency, and there is no avro() method in DataFrameReader or DataFrameWriter. We use format("avro") instead.

With the correct credentials, we can also read from S3, HDFS, and many other file systems.

ds = spark.read().json("s3a://bucket/path/to/file.json");
ds = spark.read().json("hdfs://nn1home:8020/file.json");

For the rest of this article, we’ll use json() for the examples.

Read directories and files using spark.read()

We can read multiple files quite easily by simply specifying a directory in the path.

ds = spark.read().json("/path/to/dir");

We can also specify multiple paths, each as its own argument.

ds = spark.read().json("/path/to/dir1", "/path/to/dir2");
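When the list of directories is not known up front, the paths can be built in code and passed as an array, since json() accepts a variable number of path arguments. The sketch below is only an illustration: the DailyPaths class, its between() helper, and the /dir/yyyy/MM/dd layout are assumptions, not part of Spark.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class DailyPaths {
    // Assumed directory layout: <baseDir>/yyyy/MM/dd
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Hypothetical helper: one path per day from start to end, both inclusive.
    static String[] between(String baseDir, LocalDate start, LocalDate end) {
        List<String> paths = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            paths.add(baseDir + "/" + d.format(DAY));
        }
        return paths.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] paths = between("/dir", LocalDate.of(2021, 6, 19), LocalDate.of(2021, 6, 21));
        for (String p : paths) {
            System.out.println(p);
        }
        // With a SparkSession named spark in scope, the array is passed as varargs:
        // Dataset<Row> ds = spark.read().json(paths);
    }
}
```

Listing paths explicitly keeps the read limited to exactly the directories we name, which can be easier to reason about than a broad pattern.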

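We can filter files by name using the pathGlobFilter option, available since Spark 3.0. The pattern is matched against file names only, so it does not change which directories are read.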

ds = spark.read()
          .option("pathGlobFilter", "*.json")
          .json("/path/to/dir");

Glob patterns to match file and directory names

Glob patterns look similar to regular expressions, but they are designed to match file and directory names in a hierarchical file system rather than arbitrary text.

These are some common characters we can use:

  • *: match 0 or more characters except forward slash / (to match a single file or directory name)
  • **: same as * but also matches forward slash / (to match any number of directory levels) - the “globstar”; support varies between glob implementations, so verify it against the Spark and Hadoop versions in use
  • ?: match any single character
  • [ab]: match any single character in the character class
  • [^ab]: match any single character not in the character class (! works in place of ^)
  • [a-b]: match any single character in the character range
  • {a,b}: match exactly one of the options (alternation)

We can use these glob characters to match specific files or folders.

ds = spark.read().json("/dir/*/subdir");
ds = spark.read().json("/dir/**/subdir");
ds = spark.read().json("/dir/2021/06/{19,20,21}");
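The alternation pattern can also be assembled from a list of names rather than typed by hand. This is a minimal sketch; BraceGlob and anyOf() are hypothetical names introduced here, and the helper assumes the names contain no commas or braces.

```java
import java.util.Arrays;
import java.util.List;

class BraceGlob {
    // Hypothetical helper (not part of Spark): builds parentDir/{a,b,c}.
    static String anyOf(String parentDir, List<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("at least one name is required");
        }
        if (names.size() == 1) {
            // A single name needs no alternation, so return a plain path.
            return parentDir + "/" + names.get(0);
        }
        return parentDir + "/{" + String.join(",", names) + "}";
    }

    public static void main(String[] args) {
        String glob = anyOf("/dir/2021/06", Arrays.asList("19", "20", "21"));
        System.out.println(glob); // prints /dir/2021/06/{19,20,21}
        // With a SparkSession named spark in scope:
        // Dataset<Row> ds = spark.read().json(glob);
    }
}
```

Returning a plain path for a single name avoids relying on how a particular glob implementation treats braces with only one option.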

Below are some common glob patterns to filter through files. Be sure to also test out glob patterns using DigitalOcean’s Glob Tools.

Syntax    Matches                               Does not match
/x/*/y/   /x/a/y/, /x/b/y/                      /x/y/, /x/a/b/c/y/
/x/**/y/  /x/a/y/, /x/b/y/, /x/y/, /x/a/b/c/y/  /x/y/a, /a/x/b/y
d?g       dog, dag, dmg                         dg, Dog
[abc]     a, b, c                               d, ab, A, B
[^abc]    d, A, B                               a, b, c, ab
[!abc]    d, A, B                               a, b, c, ab
[a-c]     a, b, c                               d, ab, A, B
{ab,bc}   ab, bc                                ac, xy, a
{x,}      x                                     xyz, xy, a, y
{x}       (none)                                xyz, xy, a, y, x
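Some of these patterns can be spot-checked locally with the JDK's own glob matcher. This is only a rough sketch: java.nio's glob dialect is a separate implementation from the one Spark uses to resolve paths, so results can differ in edge cases (as far as we know, it negates a character class with ! but treats a leading ^ literally, and its ** requires at least one directory level in a pattern like /x/**/y). The GlobCheck class is a hypothetical name, and the checks assume a Unix-like default file system.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobCheck {
    // Returns true if the whole path matches the glob, using java.nio's glob dialect.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Each row: pattern, candidate path.
        String[][] checks = {
            {"/x/*/y", "/x/a/y"},
            {"/x/*/y", "/x/a/b/c/y"},
            {"/x/**/y", "/x/a/b/c/y"},
            {"d?g", "dog"},
            {"d?g", "dg"},
            {"[a-c]", "b"},
            {"[!abc]", "d"},
            {"{ab,bc}", "bc"},
            {"{ab,bc}", "ac"},
        };
        for (String[] c : checks) {
            System.out.println(c[0] + " vs " + c[1] + " -> " + matches(c[0], c[1]));
        }
    }
}
```

Patterns here are written without a trailing slash because Paths.get() drops it from the candidate path.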

Read more about globs in the glob man page.