When querying ADLS data in a serverless SQL pool, we can specify different wildcard patterns to access the data.
The following examples explain the different approaches.
Important note: files or folders whose names start with an underscore (_) or a dot (.) are treated as system or hidden files and are not included by default when querying data.
ADLS folder structure
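For the examples that follow, assume a layout along these lines (the storage account mystorageaccount, the container mycontainer, and all file and folder names are hypothetical):

test/
    FileA.csv
    FileB.csv
    FileC.txt
    _staging.csv        <-- hidden file (name starts with an underscore)
    _temp/              <-- hidden folder (name starts with an underscore)
        FileX.csv
    folder1/
        FileD.csv
    folder2/
        FileE.csv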
The following query includes only the FileA.csv file.
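A minimal sketch, using the assumed names from the layout above; pointing BULK at the exact file path returns only that file:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/FileA.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;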
The following query includes all .csv files (except system and hidden files) directly under the test folder, but does not include .csv files in its sub-folders.
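A sketch with the same assumed names; the *.csv wildcard matches FileA.csv and FileB.csv, but nothing inside folder1, folder2, or the hidden items:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;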
We might assume that the following query would "include all .csv files (except system and hidden files) under the test folder and its sub-folders, recursively", but in reality it throws an error.
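The exact query is not reproduced here; as an illustration only (an assumption about the failing pattern), a path that combines the recursive ** wildcard with a file-name filter is the kind of pattern that fails rather than matching .csv files recursively:

SELECT *
FROM OPENROWSET(
    -- combining ** with a file-name pattern is rejected (assumed failing pattern)
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/**/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;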
The following query includes all .csv files (except system and hidden files) in the first-level sub-folders of the test folder.
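A sketch using the assumed layout; test/*/*.csv matches FileD.csv and FileE.csv in folder1 and folder2, but not the files sitting directly under test:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/*/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;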
The following query includes all files (except system and hidden files), regardless of file extension, directly under the test folder, but does not include any files in its sub-folders.
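A sketch with the assumed names; test/* matches FileA.csv, FileB.csv, and FileC.txt. Note that every matched file is still parsed with the FORMAT you specify, so mixing file formats under one wildcard can cause parse errors:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/*',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;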
Likewise, the following query also includes all files (except system and hidden files), regardless of file extension, directly under the test folder, and none of the files in its sub-folders.
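This second query presumably uses a different path form with the same effect; one pattern that behaves this way (an assumption here) is pointing BULK at the folder itself, which reads the files directly in it but not in its sub-folders:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;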
The following query includes all files (except system and hidden files), regardless of file extension, under the test folder and all of its sub-folders, recursively: every file in every folder and sub-folder under the specified path.
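A sketch with the assumed names; appending /** to the folder path traverses test and every sub-folder recursively (hidden items are still excluded by default):

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/**',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;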
You can explicitly specify a hidden folder in the path to query its data.
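A sketch using the hypothetical _temp folder from the layout above; naming the hidden folder explicitly in the path brings its files into scope:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/_temp/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;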
You can explicitly specify a hidden file in the path to query its data.
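Likewise for a single hidden file, here the hypothetical _staging.csv:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/test/_staging.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS src;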
Use this when accessing Azure Data Lake Storage Gen2 with the hierarchical namespace; the dfs endpoint makes it the correct choice for ADLS Gen2.
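A hedged sketch of such a data source definition, using the abfss scheme against the dfs endpoint (account, container, and data source names are hypothetical):

CREATE EXTERNAL DATA SOURCE AdlsGen2Abfss
WITH (
    LOCATION = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net',
    TYPE = HADOOP
    -- a CREDENTIAL clause referencing a database scoped credential is typically added here
);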
In essence, both https and abfss with the dfs endpoint and HADOOP type give you access to the hierarchical namespace capabilities of Azure Data Lake Storage Gen2. The primary difference is which client library or tool you're using to access the storage. Some tools might specifically require the abfss scheme to enable certain Data Lake-specific functionalities, while others can operate with the standard https scheme.
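As a hedged counterpart, in a serverless SQL pool the same storage is often referenced with the standard https scheme against the dfs endpoint, without a TYPE clause (hypothetical names again):

CREATE EXTERNAL DATA SOURCE AdlsGen2Https
WITH (
    LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer'
);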
TYPE = BLOB_STORAGE is set to interact with the flat namespace of blob containers, which is appropriate for workloads that do not require the hierarchical file system.
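A hedged sketch of a BLOB_STORAGE data source (hypothetical names; this form is used for bulk operations such as BULK INSERT and OPENROWSET(BULK ...) in SQL Server and Azure SQL Database):

CREATE EXTERNAL DATA SOURCE MyBlobSource
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://mystorageaccount.blob.core.windows.net/mycontainer',
    CREDENTIAL = MyStorageCredential  -- assumed pre-created database scoped credential
);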
This is correct for Azure Blob Storage with Hadoop ecosystem tools, using the wasb (Windows Azure Storage Blob) protocol.
WASB is a protocol that also provides Hadoop compatibility for Azure Blob Storage but does not support the hierarchical namespace like abfs (Azure Blob File System) does for Azure Data Lake Storage Gen2.
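A hedged sketch of a Hadoop-type data source over the wasbs scheme (hypothetical names):

CREATE EXTERNAL DATA SOURCE BlobViaWasbs
WITH (
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    TYPE = HADOOP
    -- a CREDENTIAL clause referencing a database scoped credential would normally be added here
);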
In Azure Blob Storage, the fundamental unit of data is a blob (Binary Large Object), and it does not natively support a hierarchical file system. However, for interfacing with Hadoop applications, a logical structure is overlaid on top of the flat structure of blobs to simulate directories and files. This is purely a logical view created by the Hadoop-compatible interfaces like wasb and wasbs (Windows Azure Storage Blob Secure, i.e. TLS-encrypted access).
Blob Names and “Virtual Directories”: Although the underlying blob storage is flat, the naming convention of blobs can give the illusion of a directory structure. For example, you might name a blob logs/2020/01/01/log.txt. There is no actual directory, but the slashes (/) in the blob name create a virtual hierarchy that Hadoop applications can interpret as directories and subdirectories, with log.txt being a file within those directories.
This naming convention allows Hadoop and other systems to simulate a file system hierarchy on top of Azure Blob Storage. When accessed via Hadoop interfaces, the wasb driver parses the blob names to present a hierarchical view to the applications, where logs, 2020, and the two 01 segments appear as nested directories leading to the log.txt file.
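As a small sketch of how that virtual hierarchy is consumed (account, container, and file names are hypothetical), a wildcard path over the blob-name "directories" behaves just like a folder pattern:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/mycontainer/logs/2020/01/*/*.txt',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = '\t'  -- assuming tab-delimited log files
) AS log_rows;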
The abfss protocol is used for secure access to the hierarchical namespace feature of ADLS Gen2, enabling file system semantics over encrypted communication. The HADOOP type indicates that this data source is intended for use with Hadoop ecosystem tools, which operate on data in a hierarchical structure similar to a traditional file system.
Sources:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-source
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-sql
https://learn.microsoft.com/en-us/azure/synapse-analytics/data-integration/data-integration-data-lake
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#location
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver16&tabs=dedicated