How to Read Multiple Files or Directories in Spark (glob)
How can we match multiple files or directories in spark.read()?
The examples below use Java, but the same glob syntax applies to Spark's other language APIs.
Read a single file using spark.read()
Spark allows us to load data programmatically using spark.read() into a Dataset.
Dataset<Row> ds;
We can read a variety of data sources into our Dataset.
ds = spark.read().json("/path/to/file.json");
ds = spark.read().csv("/path/to/file.csv");
ds = spark.read().text("/path/to/file.text");
ds = spark.read().parquet("/path/to/file.parquet");
ds = spark.read().orc("/path/to/file.orc");
ds = spark.read().format("avro").load("/path/to/file.avro");
The spark-avro module is external, so there is no avro() API in DataFrameReader or DataFrameWriter.
With the correct credentials, we can also read from S3, HDFS, and many other file systems.
ds = spark.read().json("s3a://bucket/path/to/file.json");
ds = spark.read().json("hdfs://nn1home:8020/file.json");
For the rest of this article, we’ll use json() for the examples.
Read directories and files using spark.read()
We can read multiple files quite easily by simply specifying a directory in the path.
ds = spark.read().json("/path/to/dir");
We can also specify multiple paths, each as its own argument.
ds = spark.read().json("/path/to/dir1", "/path/to/dir2");
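When the directories follow a predictable layout, we can generate the path arguments instead of typing them out. Below is a minimal sketch, assuming a hypothetical /data/events/yyyy/MM/dd layout; the DailyPaths class and its dailyPaths helper are illustrative names, not part of Spark.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class DailyPaths {
    // Assumed directory layout: <baseDir>/yyyy/MM/dd (hypothetical).
    private static final DateTimeFormatter LAYOUT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Returns one directory path per day from start to end, inclusive.
    static String[] dailyPaths(String baseDir, LocalDate start, LocalDate end) {
        List<String> paths = new ArrayList<>();
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            paths.add(baseDir + "/" + day.format(LAYOUT));
        }
        return paths.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] paths = dailyPaths("/data/events", LocalDate.of(2021, 6, 19), LocalDate.of(2021, 6, 21));
        // With a SparkSession in scope, the array can be passed as the varargs argument:
        // Dataset<Row> ds = spark.read().json(paths);
        for (String p : paths) {
            System.out.println(p);
        }
    }
}
```

Building the array in code keeps the list of directories in one place, which helps when the date range changes between runs.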
We can filter files using the pathGlobFilter option.
ds = spark.read()
    .option("pathGlobFilter", "*.json")
    .json("/path/to/dir");
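Spark documents pathGlobFilter as matching on file names, not on the full path. The sketch below imitates that behavior with the JDK's own PathMatcher so it can run without a cluster. It is a stand-in rather than Spark's implementation: Spark follows Hadoop's GlobFilter, whose syntax is close to the JDK's but not guaranteed to be identical. The NameFilterSketch class and filterByName helper are hypothetical names.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NameFilterSketch {
    // Keeps only the paths whose last component (the file name) matches the glob.
    static List<String> filterByName(List<String> paths, String nameGlob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + nameGlob);
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            Path name = Paths.get(p).getFileName();
            if (name != null && matcher.matches(name)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "/path/to/dir/a.json",
            "/path/to/dir/b.csv",
            "/path/to/dir/json/c.txt");
        // Only the name is tested, so the "json" directory does not make c.txt match.
        System.out.println(filterByName(files, "*.json"));
    }
}
```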
Glob patterns to match file and directory names
Glob patterns look similar to regular expressions; however, they are designed to match file and directory names rather than arbitrary text. Globbing is specifically for hierarchical file systems.
These are some common characters we can use:
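*: matches 0 or more characters except the forward slash / (to match a single file or directory name)
**: same as * but also matches the forward slash / (to match any number of directory levels); this is the “globstar”. Support depends on the implementation: the glob syntax Hadoop documents for FileSystem.globStatus, which Spark uses to expand paths, does not list **, so test it before relying on it.
?: matches any single character
[ab]: matches any one character in the character class
[^ab]: matches any one character not in the character class (! works in place of ^)
[a-b]: matches any one character in the character range
{a,b}: matches exactly one of the options (alternation)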
We can use these glob characters to match specific files or folders.
ds = spark.read().json("/dir/*/subdir");
ds = spark.read().json("/dir/**/subdir");
ds = spark.read().json("/dir/2021/06/{19,20,21}");
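To get a feel for how these patterns behave on whole paths, we can try them locally. The sketch below uses the JDK's PathMatcher, not Spark: Spark resolves globs through Hadoop, so results may differ there, particularly for **. The PathGlobSketch class and matches helper are hypothetical names.

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

class PathGlobSketch {
    // Tests a full path against a glob using the JDK's glob rules.
    static boolean matches(String glob, String path) {
        return FileSystems.getDefault()
            .getPathMatcher("glob:" + glob)
            .matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // * stays within one directory level.
        System.out.println(matches("/dir/*/subdir", "/dir/a/subdir"));
        System.out.println(matches("/dir/*/subdir", "/dir/a/b/subdir"));
        // ** may span several directory levels (in the JDK's matcher).
        System.out.println(matches("/dir/**/subdir", "/dir/a/b/subdir"));
        // {19,20,21} matches exactly one of the listed options.
        System.out.println(matches("/dir/2021/06/{19,20,21}", "/dir/2021/06/20"));
        System.out.println(matches("/dir/2021/06/{19,20,21}", "/dir/2021/06/22"));
    }
}
```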
Below are some common glob patterns to filter through files. Be sure to also test out glob patterns using DigitalOcean’s Glob Tools.
Syntax | Matches | Does not match |
---|---|---|
/x/*/y/ | /x/a/y/, /x/b/y/ | /x/y/, /x/a/b/c/y/ |
/x/**/y/ | /x/a/y/, /x/b/y/, /x/y/, /x/a/b/c/y/ | /x/y/a, /a/x/b/y |
d?g | dog, dag, dmg | dg, Dog |
[abc] | a, b, c | d, ab, A, B |
[^abc] | d, A, B | a, b, c, ab |
[!abc] | d, A, B | a, b, c, ab |
[a-c] | a, b, c | d, ab, A, B |
{ab,bc} | ab, bc | ac, xy, a |
{x,} | x | xyz, xy, a, y |
{x} |  | xyz, xy, a, y, x |
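Several of the name-level rows above can also be checked in code. This is a minimal sketch using the JDK's PathMatcher as an assumed stand-in for a glob tester; GlobTableCheck and matchesName are hypothetical names. Only the [!abc] form of negation is exercised, because that is the form the JDK documents, and the brace rows with a single or empty option are left out because implementations treat them differently.

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

class GlobTableCheck {
    // Tests a single file or directory name against a glob using the JDK's glob rules.
    static boolean matchesName(String glob, String name) {
        return FileSystems.getDefault()
            .getPathMatcher("glob:" + glob)
            .matches(Paths.get(name));
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"d?g", "dog"}, {"d?g", "dg"},
            {"[abc]", "a"}, {"[abc]", "ab"},
            {"[!abc]", "d"}, {"[!abc]", "a"},
            {"[a-c]", "b"}, {"[a-c]", "d"},
            {"{ab,bc}", "ab"}, {"{ab,bc}", "ac"},
        };
        // Print each pattern, candidate name, and whether the JDK matcher accepts it.
        for (String[] row : rows) {
            System.out.println(row[0] + " vs " + row[1] + ": " + matchesName(row[0], row[1]));
        }
    }
}
```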
Read more about globs in the glob man page.