
Overview

DataLink SQL Interface is an interactive query service that makes it easy to analyze data in DataLink using standard SQL. With DataLink SQL, there is no need for custom scripts to download and manually assemble your data for analysis, so anyone with SQL skills can quickly analyze large-scale datasets.

DataLink SQL is integrated out of the box with the DataLink catalog, allowing you to discover and interact with all your DataTables using your DataLink API Key.

Powered by Trino, runs standard SQL

DataLink SQL uses Trino (formerly PrestoSQL) with ANSI SQL support. It is ideal for interactive querying and can also handle complex analysis, including large joins, window functions, and arrays. You can use any Trino-API-compatible first- or third-party client to connect to DataLink SQL.

Unified Catalog

DataLink SQL exposes a standard interface to query the underlying DataLink data stores, and gives you a consistent experience across regular and large partitioned DataTables. Your existing API Keys can be used to query the same DataTables via SQL as you already do via the DataTables API or existing DataLink clients.

Usage

There are numerous Trino clients available to connect to DataLink SQL. You can also reference the REST API specifications to build your own client/driver.

Endpoints

When using any Trino client, you can configure it to access DataLink by setting the following parameters:

Key Value
HOST data.nasdaq.com or ${tenant}.data.nasdaq.com
PORT 443
USER Your DataLink API Key
CATALOG main
SCHEMA huron
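
With CATALOG set to main and SCHEMA set to huron, unqualified table names resolve within main.huron. As a minimal sketch (ndaq_rtat10 is borrowed from the CLI example below and may not be part of your subscription), the following two queries are equivalent:

SELECT * FROM ndaq_rtat10 LIMIT 5;

SELECT * FROM main.huron.ndaq_rtat10 LIMIT 5;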

Helpful Functions

  1. List all tables: SHOW TABLES
  2. Describe table: DESCRIBE
  3. Describe the plan for a query: EXPLAIN
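
For example, assuming the qdl_cbf DataTable referenced in the INSERT examples below is available to you:

SHOW TABLES;

DESCRIBE main.huron.qdl_cbf;

EXPLAIN SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01';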

Query Performance

  1. Trino uses a cursor-based approach, documented here, to iteratively produce results for your query from the DataLink backend.

  2. Large partitioned DataTables have their underlying data split into partitions composed of a subset of columns. You can find out which columns are partitions in the "Column Definitions" section of the Data Product page on DataLink. Query performance is best when you include these partition columns in your WHERE and ORDER BY clauses. You can also describe the plan for a specific query using EXPLAIN to help you write more performant queries (see the example after this list).

  3. For large partitioned DataTables, we cache partitions to provide better performance for commonly and repeatedly used blocks of data. This means that while a query may be slow the first time, subsequent requests of a similar nature should complete more quickly.
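
For example, if date and ticker were partition columns of a large partitioned DataTable (an illustrative sketch only; check the "Column Definitions" section for the actual partition columns of your DataTable), a query following the guidance above and its plan would look like:

SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 1000;

EXPLAIN SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 1000;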

Writes

New data can be added, and historical data can be updated, by running INSERT, UPDATE, and DELETE statements:

INSERT

From another table with identical schema (implicit inference)

INSERT INTO s3.huron.occ_opc SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 100000;

From another table with different schema and explicit mapping

INSERT INTO s3.huron.new_table (ticker, date, price) SELECT ticker, DATE(some_timestamp), price FROM main.huron.qdl_cbf WHERE date = '2023-01-01';

New values

INSERT INTO s3.huron.eg_bio VALUES (1, 'NDAQ', DATE '2023-01-01'), (2, 'AAPL', DATE '2023-01-01');

NOTE

  1. When performing an insert by selecting from another table, ensure

  2. Inserts should be batched as separate queries as

  3. We recommend you only perform a few large (~10-100MB volume of rows) inserts every day. Lots of tiny inserts or very large bulk inserts may result in poor INSERT and/or subsequent SELECT performance (see the sketch after this list).
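
For example, rather than issuing many single-row INSERT statements, rows for a day can be combined into one statement (a sketch following the eg_bio example above; the third row is illustrative):

INSERT INTO s3.huron.eg_bio VALUES (1, 'NDAQ', DATE '2023-01-01'), (2, 'AAPL', DATE '2023-01-01'), (3, 'MSFT', DATE '2023-01-01');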

UPDATE

UPDATE s3.huron.occ_opc SET ticker = 'NDAQ' WHERE date = DATE '2023-01-01';

DELETE

DELETE FROM s3.huron.occ_opc WHERE date = DATE '2023-01-01';

Refer to Trino SQL Syntax for more documentation.

Audit Trail

The s3.huron.system_trino table contains an audit trail of all queries that have completed execution.

Known Issues

  1. All varchar columns are truncated to the first 1024 characters
  2. Live queries and long-running queries won't appear in this table until they have reached a completed state.
  3. New data for this table is delivered every 10 minutes, which means queries completed within the last 10 minutes won't be available immediately.
  4. This table can only be queried using SQL; i.e., there is no support for API/v3 or bulk-download/export

Schema

Column Type Description
query_id varchar identifier unique to this query
transaction_id varchar transaction that this query belongs to
prepared_query varchar if this query was a part of a Prepared Statement (for ex: EXECUTE), then this will show the query of the prepared statement; otherwise empty for direct queries
query varchar SQL statement of this query
tables array(varchar) table(s) that were accessed in this query
user varchar identity of the user who invoked this query
user_agent varchar client driver that invoked this query, ex: python-requests/2.28.2, Trino JDBC Driver/360, etc.
source varchar client application which invoked this query, ex: trino-python-client, DataGrip 2020.3.2, etc.
failure_info map(varchar, varchar) if this query had failed, this will show the error code and message. ex: {failureType=io.trino.spi.TrinoException, failureMessage=Query was canceled}
create_time timestamp(6) time when the query was submitted to Trino
execution_start_time timestamp(6) time when the query was picked up by Trino worker(s) to begin processing
end_time timestamp(6) time when query completed processing (both success and failure)
statistics map(varchar, bigint) statistics about how much data was read or written. ex: {outputRows=1, outputBytes=9, writtenBytes=137680, writtenRows=410}
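
For example, a sketch of an audit query listing failures from the past day (the one-day window and selected columns are illustrative; rows appear only after a query completes, and with up to a 10-minute delay):

SELECT query_id, "user", query, failure_info, end_time FROM s3.huron.system_trino WHERE end_time > localtimestamp - INTERVAL '1' DAY AND failure_info IS NOT NULL ORDER BY end_time DESC;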

Clients

CLI

  1. Download the latest CLI from trino.io

  2. Save .trino_config to your $HOME folder.

  3. Update .trino_config with your configurations

  4. Run queries as desired

    $ trino --execute 'select * from ndaq_rtat10 limit 5'
    $ trino --execute 'select * from zacks_hdm limit 5'

Python

NOTE: Tested with Python 3

  1. Add requirements.txt in a new project folder, or install the pip libraries it lists into your existing project.

  2. Configure environment variables

  3. Add datalink_sql.py to your project folder

  4. Run datalink_sql.py to invoke sample queries

  5. Additionally, create a new module with your desired queries and reuse helper functions

    import datalink_sql
    import os

    connection = datalink_sql.create_connection(os.environ.copy())
    sql = "..."
    df = datalink_sql.run_sql(connection, sql)
    print(df)

DotNet C#

NOTE: Tested with dotnet cli 6

  1. Add DataLinkSQL.csproj in a new project folder, or install those NuGet packages into your existing project.

  2. Configure environment variables

  3. Add DataLinkSQL.cs to your project folder

  4. Run included Main to invoke sample queries

    $ dotnet run --property DefineConstants=INCLUDE_MAIN
  5. Additionally, create a new file with your desired queries and reuse helper functions

    using System.Data.Common;
    using NReco.PrestoAdo;
    using static DataLinkSQL;

    class MyApp
    {
        static void Main()
        {
            using (PrestoConnection connection = CreateConnection())
            {
                string sql = "...";
                DbDataReader reader = RunSQL(connection, sql);
                DisplayReader(reader);
            }
        }
    }
    $ dotnet run

Tableau

NOTE: Tested with Tableau Desktop 2022.2

  1. Connect "To a Server" and select "Presto"
  2. Enter configurations as explained in Endpoints, and also set
    1. Authentication: Username
    2. Require SSL: True
  3. Click the magnifying glass next to the "Enter table name" text box
  4. You should now see all your tables. Drag and drop them to build your models.

Notebook on Databricks

  1. Install trino-jdbc-406.jar into your workspace

  2. Configure environment variables in your Cluster's Environment Variables

  3. Import and run SparkSQL.ipynb

DataGrip

Install JDBC driver

Setup DataSource