
Overview

DataLink SQL Interface is an interactive query service that makes it easy to analyze data in DataLink using standard SQL. With DataLink SQL, there is no need for custom scripts to download and manually assemble your data for analysis, so anyone with SQL skills can quickly analyze large-scale datasets.

DataLink SQL is integrated out of the box with the DataLink catalog, allowing you to discover and interact with all your DataTables using your DataLink API Key.

Powered by Trino, runs standard SQL

DataLink SQL uses Trino (formerly PrestoSQL) with ANSI SQL support. It is ideal for interactive querying and can also handle complex analysis, including large joins, window functions, and arrays. You can use any Trino-API-compatible first- or third-party client to connect to DataLink SQL.

Unified Catalog

DataLink SQL exposes a standard interface to query the underlying DataLink data stores, and gives you a consistent experience across regular and large partitioned DataTables. Your existing API Keys can be used to query the same DataTables via SQL as you already do via the DataTables API or existing DataLink clients.

Usage

There are numerous Trino clients available to connect to DataLink SQL. You can also reference the REST API specifications to build your own client/driver.

Endpoints

When using any Trino client, you can configure it to access DataLink by setting the following parameters:

Key Value
HOST data.nasdaq.com or ${tenant}.data.nasdaq.com
PORT 443
USER Your DataLink API Key
CATALOG main
SCHEMA huron
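
With CATALOG set to main and SCHEMA set to huron, unqualified table names resolve within main.huron. As a minimal sketch (ndaq_rtat10 is borrowed from the CLI example below and may not be part of your subscription), the following two queries are equivalent:

SELECT * FROM ndaq_rtat10 LIMIT 5;

SELECT * FROM main.huron.ndaq_rtat10 LIMIT 5;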

Helpful Functions

  1. List all tables: SHOW TABLES
  2. Describe table: DESCRIBE
  3. Describe the plan for a query: EXPLAIN
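
For example, assuming the qdl_cbf DataTable referenced in the INSERT examples below is available to you:

SHOW TABLES;

DESCRIBE main.huron.qdl_cbf;

EXPLAIN SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01';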

Query Performance

  1. Trino uses a cursor-based approach, documented here, to iteratively produce results for your query from the DataLink backend.

  2. Large partitioned DataTables have their underlying data split into partitions composed of a subset of columns. You can find out which columns are partitions in the "Column Definitions" section of the Data Product page on DataLink. Query performance is best when you include these partition columns in your WHERE and ORDER BY clauses. You can also describe the plan for a specific query using EXPLAIN to help you write more performant queries (see the example after this list).

  3. For large partitioned DataTables, we cache partitions to provide better performance for commonly and repeatedly used blocks of data. This means that while a query may be slow the first time, subsequent requests of a similar nature should complete more quickly.
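
For example, if date and ticker were partition columns of a large partitioned DataTable (an illustrative sketch only; check the "Column Definitions" section for the actual partition columns of your DataTable), a query following the guidance above and its plan would look like:

SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 1000;

EXPLAIN SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 1000;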

Writes

New data can be added, and historical data can be updated, by running INSERT, UPDATE, and DELETE statements:

INSERT

From another table with identical schema (implicit inference)

INSERT INTO s3.huron.occ_opc SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 100000;

From another table with different schema and explicit mapping

INSERT INTO s3.huron.new_table (ticker, date, price) SELECT ticker, DATE(some_timestamp), price FROM main.huron.qdl_cbf WHERE date = '2023-01-01';

New values

INSERT INTO s3.huron.eg_bio VALUES (1, 'NDAQ', DATE '2023-01-01'), (2, 'AAPL', DATE '2023-01-01');

NOTE

  1. When performing an insert by selecting from another table, ensure

  2. Inserts should be batched as separate queries as

  3. We recommend you only perform a few large (~10-100MB volume of rows) inserts every day. Lots of tiny inserts or very large bulk inserts may result in poor INSERT and/or subsequent SELECT performance (see the sketch after this list).
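
For example, rather than issuing many single-row INSERT statements, rows for a day can be combined into one statement (a sketch following the eg_bio example above; the third row is illustrative):

INSERT INTO s3.huron.eg_bio VALUES (1, 'NDAQ', DATE '2023-01-01'), (2, 'AAPL', DATE '2023-01-01'), (3, 'MSFT', DATE '2023-01-01');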

UPDATE

UPDATE s3.huron.occ_opc SET ticker = 'NDAQ' WHERE date = DATE '2023-01-01';

DELETE

DELETE FROM s3.huron.occ_opc WHERE date = DATE '2023-01-01';

Refer to Trino SQL Syntax for more documentation.

Audit Trail

The s3.huron.system_trino table contains an audit trail of all queries that have completed execution.

Known Issues

  1. All varchar columns are truncated to the first 1024 characters
  2. Live queries and long-running queries won't appear in this table until they have reached a completed state.
  3. New data for this table is delivered every 10 minutes, which means queries completed within the last 10 minutes won't be available immediately.
  4. This table can only be queried using SQL; i.e., there is no support for API/v3 or bulk-download/export

Schema

Column Type Description
query_id varchar identifier unique to this query
transaction_id varchar transaction that this query belongs to
prepared_query varchar if this query was a part of a Prepared Statement (for ex: EXECUTE), then this will show the query of the prepared statement; otherwise empty for direct queries
query varchar SQL statement of this query
tables array(varchar) table(s) that were accessed in this query
user varchar identity of the user who invoked this query
user_agent varchar client driver that invoked this query, ex: python-requests/2.28.2, Trino JDBC Driver/360, etc.
source varchar client application which invoked this query, ex: trino-python-client, DataGrip 2020.3.2, etc.
failure_info map(varchar, varchar) if this query had failed, this will show the error code and message. ex: {failureType=io.trino.spi.TrinoException, failureMessage=Query was canceled}
create_time timestamp(6) time when the query was submitted to Trino
execution_start_time timestamp(6) time when the query was picked up by Trino worker(s) to begin processing
end_time timestamp(6) time when query completed processing (both success and failure)
statistics map(varchar, bigint) statistics about how much data was read or written. ex: {outputRows=1, outputBytes=9, writtenBytes=137680, writtenRows=410}
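
For example, a sketch of an audit query listing failures from the past day (the one-day window and selected columns are illustrative; rows appear only after a query completes, and with up to a 10-minute delay):

SELECT query_id, "user", query, failure_info, end_time FROM s3.huron.system_trino WHERE end_time > localtimestamp - INTERVAL '1' DAY AND failure_info IS NOT NULL ORDER BY end_time DESC;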

Clients

CLI

  1. Download the latest CLI from trino.io

  2. Save .trino_config to your $HOME folder.

  3. Update .trino_config with your configurations

  4. Run queries as desired

    $ trino --execute 'select * from ndaq_rtat10 limit 5'
    $ trino --execute 'select * from zacks_hdm limit 5'

Python

NOTE: Tested with Python 3

  1. Add requirements.txt in a new project folder, or install the pip libraries it lists into your existing project.

  2. Configure environment variables

  3. Add datalink_sql.py to your project folder

  4. Run datalink_sql.py to invoke sample queries

  5. Additionally, create a new module with your desired queries and reuse helper functions

    import datalink_sql
    import os

    connection = datalink_sql.create_connection(os.environ.copy())
    sql = "..."
    df = datalink_sql.run_sql(connection, sql)
    print(df)

DotNet C#

NOTE: Tested with dotnet cli 6

  1. Add DataLinkSQL.csproj in a new project folder, or install those NuGet packages into your existing project.

  2. Configure environment variables

  3. Add DataLinkSQL.cs to your project folder

  4. Run included Main to invoke sample queries

    $ dotnet run --property DefineConstants=INCLUDE_MAIN
  5. Additionally, create a new file with your desired queries and reuse helper functions

    using System.Data.Common;
    using NReco.PrestoAdo;
    using static DataLinkSQL;

    class MyApp
    {
        static void Main()
        {
            using (PrestoConnection connection = CreateConnection())
            {
                string sql = "...";
                DbDataReader reader = RunSQL(connection, sql);
                DisplayReader(reader);
            }
        }
    }
    $ dotnet run

Tableau

NOTE: Tested with Tableau Desktop 2022.2

  1. Connect "To a Server" and select "Presto"
  2. Enter configurations as explained in Endpoints, and also set
    1. Authentication: Username
    2. Require SSL: True
  3. Click the magnifying glass next to the "Enter table name" text box
  4. You should now see all your tables. Drag and drop them to build your models.

Notebook on Databricks

  1. Install trino-jdbc-406.jar into your workspace

  2. Configure environment variables in your Cluster's Environment Variables

  3. Import and run SparkSQL.ipynb

DataGrip

Install JDBC driver

Setup DataSource