Polybase should be pointed at a directory containing a collection of files with the same structure. These can live in further subdirectories, since Polybase will recursively descend and glob all the files under the directory.
First off, know that this will only work with Azure SQL Data Warehouse. With SQL Server 2016, Polybase can only access data in blob storage; accessing ADLS requires the Data Warehouse.
Setting this all up requires the creation of a number of objects:
- Security credentials to access the storage.
- This will require a master key and a scoped credential.
- The external data source.
- For this blog, we will be using an ADLS source.
- The external file format, describing the structure of the data files.
- At this time, Polybase supports delimited text (CSV), RCFile, ORC and Parquet. Sadly, no JSON.
- The external table.
- This will be the logical table SQL queries can be run against.
Connect to the Data Warehouse in Azure
A data warehouse should already be running in Azure. Use any SQL GUI to connect to it. I will be using SQL Server Management Studio (SSMS), but anything that works is fine. (Make certain the Azure firewall allows access to the DW.)
Create an Azure Data Lake Store
An Azure Data Lake Store is the second component needed: somewhere to put the data. Simply use the Azure portal to create it.
I used the files in the Ambulance Data folder from the Azure Data Lake Git Repository.
Uploading the files to ADLS can be done in several ways. The easiest is to use the Data Explorer included in the ADLS management pages on the Azure portal: create a new directory and upload the files there.
Click the “Data Explorer” link on the ADLS management page in the Azure portal.
Using the file selector icon, select all the vehicle_*.csv files you want to upload and click the “Add selected files” button. The “Drivers.txt” and “DriverShiftTrips.csv” files have a different format and will cause problems.
The online documentation has more information on using the Data Lake Store through the portal, PowerShell or any of the available SDKs (.NET, Java, REST API, Azure CLI, Node.js, Python).
Azure AD web application for Authentication
To access the ADLS from the Azure Data Warehouse, an Azure AD web application for service-to-service authentication is required (see the online documentation). For those familiar with Kerberos, think of these AD applications as Kerberos service principals.
Go to the Active Directory management section in the Azure portal. In “App registration”, click the “+ New application registration” link at the top of the page.
In the small form, give the app a unique name and make sure that the Application type is set to “Web App / API”. Since the “Sign-on URL” will not be used for this app, set it to any valid URL expression. Click the “Create” button.
The Application ID shown on the registered app’s page is the first part of the credentials. For the second part, a new key must be created: click the “Keys” link, complete the “Key description” and “Duration” fields and click the “Save” button at the top of the page. As the warning explains, make sure to copy and save the generated value before continuing.
Using the tenant ID, an OAuth 2.0 token endpoint can be constructed: https://login.windows.net/<tenantID>/oauth2/token (This endpoint string can also be found in the Azure portal in the AD management pages under the “Endpoints” link.)
From the examples above, the three strings required for authentication are:
App ID: be2245a6-cbc4-4509-8937-9d9d5112689b
Auth key: f3CFxt7JLVlOveHijUoN8I7G/c1q79bgSRBmyMi4sHg=
OAuth endpoint: https://login.windows.net/c178d526-3dd2-471a-b728-65ed3be8c6e0/oauth2/token
Keep these credentials handy somewhere; they will be needed in a bit.
Assigning the AD App to the ADLS folder
As is often the case, securing the environment takes quite a bit of work, but fortunately we are almost done. The next step is to assign the AD application just created to the directory holding the data. (The ability to assign a specific AD application to a directory this way makes securing ADLS very granular.)
Go to the ADLS management pages in the Azure portal, open the Data Explorer and select the folder containing the data files. Click the “Access” link and grant the AD application created earlier Read, Write and Execute permissions on this folder and all its children.
However, even though the adlspolytest credential now has full access to the AmbulanceData subdirectory, it does not have any access to the root of the ADLS store itself. That means that any access to the subdirectory will fail unless at least the “execute” permission is given to the adlspolytest credential on the ADLS root.
So the previous steps need to be repeated, setting an “execute”-only ACL on the root of the adlspolytest store. These permissions should not be handed down to the children of the root directory, since full permissions were already applied to the AmbulanceData subdirectory and its descendants.
Once the data is uploaded and the authentication credentials are created, the different components can be created in the Azure Data Warehouse. In SSMS, open a new Query script window to run the scriptlets that follow.
1. Security Credentials
First, a new database master key is needed if there is not one available yet. (More info) It will not be referenced in the rest of this blog, but the master key is used to encrypt secrets in the database, such as the credential created below.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';
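If you are not sure whether the database already has a master key, a quick check against the catalog will tell you (a sketch; the master key shows up in the standard sys.symmetric_keys view under a fixed name):

```sql
-- Returns one row if a database master key already exists.
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';
```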
Now create a Scoped Database Credential which will be used to access the storage. Here is where the credential strings created earlier are used.
CREATE DATABASE SCOPED CREDENTIAL ADL_User
WITH
    IDENTITY = 'be2245a6-cbc4-4509-8937-9d9d5112689b@https://login.windows.net/c178d526-3dd2-471a-b728-65ed3be8c6e0/oauth2/token',
    SECRET = 'f3CFxt7JLVlOveHijUoN8I7G/c1q79bgSRBmyMi4sHg=';
In the IDENTITY string, the part before the @ sign is the Application ID and the part after it is the OAuth 2.0 endpoint; the SECRET is the authentication key value.
2. External Data Source
Define a connection to the Data Lake Store using the Scoped Credential created above.
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://adlspolytest.azuredatalakestore.net',
    CREDENTIAL = ADL_User
);
The LOCATION is the ADL URI. It can be found on the ADLS management pages in the Azure portal, but it is quite easy to figure out, since it is simply the name given to the ADLS followed by the standard azuredatalakestore.net domain. The CREDENTIAL is of course the name of the scoped credential created in the previous step.
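To verify the data source was registered, the catalog can be queried (a sketch, using the standard sys.external_data_sources view):

```sql
-- List the external data sources defined in the database.
SELECT name, type_desc, location
FROM sys.external_data_sources;
```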
3. External File Format
The external file format definition can be quite complex (online documentation). However, the vehicle datafiles are simple column delimited files so the definition here can remain simple.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = TRUE
    )
);
4. External Table
We are getting close now! Here the external table is defined. This table will be visible in the Data Warehouse and queries can be run against it. But the table will not store any data: all data will be read from the raw data in the Data Lake Store.
CREATE EXTERNAL TABLE [dbo].[Ambulance_Data] (
    [vehicle_id] int,
    [entry_id] bigint,
    [event_date] DateTime,
    [latitude] float,
    [longitude] float,
    [speed] int,
    [direction] char(5),
    [trip_id] int
)
WITH (
    LOCATION = '/AmbulanceData/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat
);
As can be expected, the full table schema is described in the table declaration; in the WITH clause, the objects defined earlier tell Polybase how to get to the data for this external table.
This is also the point where any problems will present themselves. If the credentials are bad, the ACLs are misplaced or the data files have issues, this is where the errors will show up, which makes debugging harder. Fortunately, the Java errors that are returned usually give a reasonable idea of the problem.
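When the errors come from a few malformed rows rather than bad credentials, Polybase's reject options can make the external table tolerate them instead of failing the whole query. A sketch, assuming the same data source and file format as above (the table name and the limit of 100 rejected rows are just examples):

```sql
-- Sketch: allow up to 100 malformed rows before a query against the table fails.
CREATE EXTERNAL TABLE [dbo].[Ambulance_Data_Tolerant] (
    [vehicle_id] int,
    [entry_id] bigint,
    [event_date] DateTime,
    [latitude] float,
    [longitude] float,
    [speed] int,
    [direction] char(5),
    [trip_id] int
)
WITH (
    LOCATION = '/AmbulanceData/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 100
);
```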
Query the data
The table is created, so feel free to use normal T-SQL to query the data. Do note that SSMS will show the table under the “External Tables” heading.
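For example, an ordinary aggregate runs straight against the files in the lake (assuming the Ambulance_Data table created above):

```sql
-- Average speed and number of readings per vehicle,
-- read directly from the CSV files in the Data Lake Store.
SELECT [vehicle_id],
       AVG([speed]) AS avg_speed,
       COUNT(*)     AS readings
FROM [dbo].[Ambulance_Data]
GROUP BY [vehicle_id]
ORDER BY [vehicle_id];
```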
The Full T-SQL Script
--1: Security Credentials
-- Create a master key on the database.
-- Required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';

-- Create a database scoped credential for Azure Data Lake Store.
-- IDENTITY: The AD Application ID with the OAuth 2.0 endpoint.
-- SECRET: The AD application authentication key value.
CREATE DATABASE SCOPED CREDENTIAL ADL_User
WITH
    IDENTITY = 'be2245a6-cbc4-4509-8937-9d9d5112689b@https://login.windows.net/c178d526-3dd2-471a-b728-65ed3be8c6e0/oauth2/token',
    SECRET = 'f3CFxt7JLVlOveHijUoN8I7G/c1q79bgSRBmyMi4sHg=';

--2: Create an external data source.
-- LOCATION: Azure Data Lake name.
-- CREDENTIAL: The database scoped credential created above.
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://adlspolytest.azuredatalakestore.net',
    CREDENTIAL = ADL_User
);

--3: Create an external file format.
-- FORMAT_TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = TRUE
    )
);

--4: Create an external table.
-- LOCATION: Folder under the ADLS root folder.
-- DATA_SOURCE: Specifies which data source object to use.
-- FILE_FORMAT: Specifies which file format object to use.
-- REJECT_TYPE: Specifies how to deal with rejected rows. Either VALUE or PERCENTAGE of the total.
-- REJECT_VALUE: Sets the reject value based on the reject type.
-- REJECT_SAMPLE_VALUE: The number of rows to retrieve before the percentage of rejected rows is calculated.
CREATE EXTERNAL TABLE [dbo].[Ambulance_Data] (
    [vehicle_id] int,
    [entry_id] bigint,
    [event_date] DateTime,
    [latitude] float,
    [longitude] float,
    [speed] int,
    [direction] char(5),
    [trip_id] int
)
WITH (
    LOCATION = '/AmbulanceData/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat
);