In today’s data-driven world, the ability to extract and analyze data from various sources is crucial. One common scenario is extracting data from a data lake, which serves as a centralized repository for raw and transformed data. In this blog post, we will explore how to use PowerShell to query Databricks tables through an ODBC connection to accomplish this task efficiently.

Get the ODBC connection properties from your Databricks cluster

The first thing you should do is install the Databricks ODBC Driver by opening the download page and downloading the correct driver for your operating system. The installation should be straight-forward.

After that, you need to find the Databricks cluster ODBC properties:

  • Open Databricks
  • Go do Compute
  • Choose the cluster you are using
  • Navigate to Advanced Options
  • Chose JDBC/ODBC
  • Get the Server Hostname, Port and HTTP Path

Now you need an access token to authenticate your ODBC connection:

  • Open Databricks
  • On the top right corner, click on your email and select User Settings
  • Navigate to the Access Tokens tab and generate a new token

Note: You should copy the token generated and keep it somewhere safe, you won’t be able to see it again in Databricks

Now you have the properties needed to assemble your ODBC connection string.

Create a PowerShell script to extract data from the data lake

Create a new PowerShell script and add the following code:

function Query-ODBC {
    param (
        [string] $DBConnectionString,
        [string] $query
    )
    $DBConn = New-Object System.Data.Odbc.OdbcConnection;
    $DBConn.ConnectionString = $DBConnectionString;
    $DBConn.Open();
    $cmd = New-object System.Data.Odbc.OdbcCommand($query,$DBConn)
    $ds = New-Object system.Data.DataSet
    $cmd.CommandTimeout = 600
    (New-Object system.Data.odbc.odbcDataAdapter($cmd)).fill($ds) | out-null
    $DBConn.close()
    return $ds.Tables[0]
}

This will create a generic function to return a query result from ODBC sources. It also uses the Connection string and the query as input parameter.

You can execute it by generating your connection string and invoking the function:

$dbHost = '<Your Databricks Host>'
$dbPort = '<Your Databricks Port>'
$dbHTTPPath = '<Your Databricks HTTP Path>'
$dbToken = '<Your Databricks Access Token>'

$dbConnectionString = "Driver=Simba Spark ODBC Driver;Host=$dbHost;Port=$dbPort;HTTPPath=$dbHTTPPath;ThriftTransport=2;SSL=1;AuthMech=3;UID=token;PWD=$dbToken"

$dbQuery = "select * from your_databricks_table"

$queryResult = Query-ODBC $dbConnectionString $dbQuery

$queryResult | Export-Csv -Path 'result.csv' 

By replacing the variables with the Databricks ODBC properties and inserting your query in the $dbQuery variable, you will be set to query databricks and return the data.

I’ve also added one last optional step that can be usefull to save the data in a csv file.

In conclusion, we have demonstrated a straightforward and efficient method for extracting data from a data lake by utilizing PowerShell, Databricks, and ODBC connections. This approach not only streamlines the data extraction process but also opens the door to new possibilities for automating data analysis workflows and integrating with other applications. By leveraging the capabilities of Databricks and the flexibility of PowerShell, you can unlock the full potential of your data lake and make informed decisions based on accurate and up-to-date insights.


0 Comments

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.