Extracting Data from a Data Lake through PowerShell, using a Databricks ODBC Connection

In today’s data-driven world, the ability to extract and analyze data from various sources is crucial. One common scenario is extracting data from a data lake, which serves as a centralized repository for raw and transformed data. In this blog post, we will explore how to use PowerShell to query Databricks tables through an ODBC connection to accomplish this task efficiently.

Get the ODBC connection properties from your Databricks cluster

The first thing you should do is install the Databricks ODBC Driver by opening the download page and downloading the correct driver for your operating system. The installation should be straight-forward.

After that, you need to find the Databricks cluster ODBC properties:

Open Databricks
Go do Compute
Choose the cluster you are using
Navigate to Advanced Options
Chose JDBC/ODBC
Get the Server Hostname, Port and HTTP Path

Now you need an access token to authenticate your ODBC connection:

Open Databricks
On the top right corner, click on your email and select User Settings
Navigate to the Access Tokens tab and generate a new token

Note: You should copy the token generated and keep it somewhere safe, you won’t be able to see it again in Databricks

Now you have the properties needed to assemble your ODBC connection string.

Create a PowerShell script to extract data from the data lake

Create a new PowerShell script and add the following code:

function Query-ODBC {
    param (
        [string] $DBConnectionString,
        [string] $query
    )
    $DBConn = New-Object System.Data.Odbc.OdbcConnection;
    $DBConn.ConnectionString = $DBConnectionString;
    $DBConn.Open();
    $cmd = New-object System.Data.Odbc.OdbcCommand($query,$DBConn)
    $ds = New-Object system.Data.DataSet
    $cmd.CommandTimeout = 600
    (New-Object system.Data.odbc.odbcDataAdapter($cmd)).fill($ds) | out-null
    $DBConn.close()
    return $ds.Tables[0]
}

This will create a generic function to return a query result from ODBC sources. It also uses the Connection string and the query as input parameter.

You can execute it by generating your connection string and invoking the function:

$dbHost = '<Your Databricks Host>'
$dbPort = '<Your Databricks Port>'
$dbHTTPPath = '<Your Databricks HTTP Path>'
$dbToken = '<Your Databricks Access Token>'

$dbConnectionString = "Driver=Simba Spark ODBC Driver;Host=$dbHost;Port=$dbPort;HTTPPath=$dbHTTPPath;ThriftTransport=2;SSL=1;AuthMech=3;UID=token;PWD=$dbToken"

$dbQuery = "select * from your_databricks_table"

$queryResult = Query-ODBC $dbConnectionString $dbQuery

$queryResult | Export-Csv -Path 'result.csv'

By replacing the variables with the Databricks ODBC properties and inserting your query in the $dbQuery variable, you will be set to query databricks and return the data.

I’ve also added one last optional step that can be usefull to save the data in a csv file.

In conclusion, we have demonstrated a straightforward and efficient method for extracting data from a data lake by utilizing PowerShell, Databricks, and ODBC connections. This approach not only streamlines the data extraction process but also opens the door to new possibilities for automating data analysis workflows and integrating with other applications. By leveraging the capabilities of Databricks and the flexibility of PowerShell, you can unlock the full potential of your data lake and make informed decisions based on accurate and up-to-date insights.

Extracting Data from a Data Lake through PowerShell, using a Databricks ODBC Connection

Published by José Calheiros on June 12, 2023June 12, 2023

Get the ODBC connection properties from your Databricks cluster

Create a PowerShell script to extract data from the data lake

Related

0 Comments

Leave a Reply Cancel reply

Business Intelligence

Fill Down / Fill Up in Power Query

Business Intelligence Power BI

Conditional formatting text fields in Power BI using DAX

Business Intelligence Power BI

Color the chart background according to target with Error Bars

Extracting Data from a Data Lake through PowerShell, using a Databricks ODBC Connection

Published by José Calheiros on June 12, 2023June 12, 2023

Get the ODBC connection properties from your Databricks cluster

Create a PowerShell script to extract data from the data lake

Share this:

Related

0 Comments

Leave a Reply Cancel reply

Related Posts

Business Intelligence

Fill Down / Fill Up in Power Query

Business Intelligence Power BI

Conditional formatting text fields in Power BI using DAX

Business Intelligence Power BI

Color the chart background according to target with Error Bars