DataLink SQL Interface is an interactive query service that makes it easy to analyze data in DataLink using standard SQL. With DataLink SQL, there’s no need for custom scripts to download and manually assemble your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
DataLink SQL is integrated out of the box with the DataLink catalog, allowing you to discover and interact with all your DataTables using your DataLink API Key.
DataLink SQL uses Trino (formerly Presto SQL) with ANSI SQL support. It is ideal for interactive querying and can also handle complex analysis, including large joins, window functions, and arrays. You can use any Trino-API-compatible first- or third-party client to connect to DataLink SQL.
DataLink SQL exposes a standard interface to query the underlying DataLink data stores, and gives you a consistent experience for regular and large partitioned DataTables. Your existing API Keys can be used to query the same DataTables via SQL as you already do via the DataTables API or existing DataLink clients.
There are numerous Trino clients available to connect to DataLink SQL. You can also reference the REST API specifications to build your own client/driver.
When using any Trino client, you can configure it to access DataLink with the following parameters:

| Key | Value |
|---|---|
| HOST | data.nasdaq.com or ${tenant}.data.nasdaq.com |
| PORT | 443 |
| USER | Your DataLink API Key |
| CATALOG | main |
| SCHEMA | huron |
Trino uses a cursor-based approach, documented here, to iteratively produce results for your query from the DataLink backend.
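Below is a minimal sketch of this flow using the trino Python package; the connection settings mirror the parameters table above, and the sample table name is taken from the examples later in this document.

```python
# Minimal sketch using the `trino` Python package (pip install trino).
# Settings mirror the connection-parameters table above; replace
# YOUR_API_KEY with your DataLink API Key.
import trino

conn = trino.dbapi.connect(
    host="data.nasdaq.com",  # or ${tenant}.data.nasdaq.com
    port=443,
    user="YOUR_API_KEY",
    catalog="main",
    schema="huron",
    http_scheme="https",
)

cur = conn.cursor()
cur.execute("SELECT * FROM qdl_cbf LIMIT 5")
for row in cur.fetchall():  # the cursor pages through results iteratively
    print(row)
```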
Large partitioned DataTables have their underlying data split into partitions composed of a subset of columns. You can find out which columns are partitions in the "Column Definitions" section of the Data Product page on DataLink. Query performance is best when you include these partition columns in your `WHERE` and `ORDER BY` clauses. You can also describe the plan for a specific query using `EXPLAIN` to help you tailor more performant queries.
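As an illustration, here is a sketch that asks Trino for the plan of a partition-pruned query; it reuses `conn` from the Python sketch above, and the table and columns follow the INSERT examples below (they may differ in your subscription).

```python
# Sketch: inspect the query plan for a query that filters and orders on
# partition columns. Reuses `conn` from the connection sketch above; the
# table and columns follow the INSERT examples below.
cur = conn.cursor()
cur.execute("""
    EXPLAIN
    SELECT ticker, date, price
    FROM main.huron.qdl_cbf
    WHERE date = DATE '2023-01-01'  -- filter on a partition column
    ORDER BY ticker                 -- order on a partition column
""")
print(cur.fetchall()[0][0])  # EXPLAIN returns the plan as a single text value
```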
For large partitioned DataTables, we cache partitions to provide better performance for commonly and repeatedly used blocks of data. This means that while a query may be slow the first time, subsequent requests of a similar nature should complete more quickly.
New data can be added, and historical data can be updated, by running `INSERT`, `UPDATE`, and `DELETE` statements:
From another table with an identical schema (implicit inference):

```sql
INSERT INTO s3.huron.occ_opc SELECT * FROM main.huron.qdl_cbf WHERE date = '2023-01-01' ORDER BY ticker LIMIT 100000;
```
From another table with a different schema and explicit mapping:

```sql
INSERT INTO
  s3.huron.new_table (ticker, date, price)
SELECT
  ticker, DATE(some_timestamp), price
FROM
  main.huron.qdl_cbf
WHERE
  date = '2023-01-01'
```
New values:

```sql
INSERT INTO s3.huron.eg_bio VALUES (1, 'NDAQ', DATE '2023-01-01'), (2, 'AAPL', DATE '2023-01-01');
```
NOTE

- When performing an insert by selecting from another table, ensure the source and target column types match; otherwise the query fails with an `Insert query has mismatched column types` error.
- Inserts should be batched as separate queries. We recommend you perform only a few large (~10-100 MB volume of rows) inserts every day; lots of tiny inserts or very large bulk inserts may result in poor INSERT and/or subsequent SELECT performance. A sketch of this batching pattern follows.
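As an illustration, here is a hypothetical sketch of combining many small rows into a few large multi-row INSERTs; `batched_insert` and `rows` are invented names, and the target table follows the VALUES example above.

```python
# Hypothetical sketch: issue a few large multi-row INSERTs instead of one
# INSERT per row. `batched_insert` and `rows` are illustrative names; the
# target table follows the VALUES example above.
def batched_insert(cur, rows, batch_size=50_000):
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Build one multi-row VALUES list per batch. With untrusted input,
        # prefer parameterized queries over string interpolation.
        values = ", ".join(
            f"({row_id}, '{ticker}', DATE '{day}')"
            for row_id, ticker, day in chunk
        )
        cur.execute(f"INSERT INTO s3.huron.eg_bio VALUES {values}")
```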
```sql
UPDATE s3.huron.occ_opc SET ticker = 'NDAQ' WHERE date = DATE '2023-01-01';
DELETE FROM s3.huron.occ_opc WHERE date = DATE '2023-01-01';
```
Refer to Trino SQL Syntax for more documentation.
The `s3.huron.system_trino` table contains an audit trail of all queries that completed execution.
Known Issues

- `varchar` columns are truncated to the first 1024 characters.

Schema
| Column | Type | Description |
|---|---|---|
| query_id | varchar | Identifier unique to this query |
| transaction_id | varchar | Transaction that this query belongs to |
| prepared_query | varchar | If this query was part of a prepared statement (e.g. `EXECUTE`), this shows the query of the prepared statement; otherwise empty for direct queries |
| query | varchar | SQL statement of this query |
| tables | array(varchar) | Table(s) that were accessed in this query |
| user | varchar | Identity of the user who invoked this query |
| user_agent | varchar | Client driver that invoked this query, e.g. `python-requests/2.28.2`, `Trino JDBC Driver/360` |
| source | varchar | Client application that invoked this query, e.g. `trino-python-client`, `DataGrip 2020.3.2` |
| failure_info | map(varchar, varchar) | If this query failed, this shows the error code and message, e.g. `{failureType=io.trino.spi.TrinoException, failureMessage=Query was canceled}` |
| create_time | timestamp(6) | Time when the query was submitted to Trino |
| execution_start_time | timestamp(6) | Time when the query was picked up by Trino worker(s) to begin processing |
| end_time | timestamp(6) | Time when the query completed processing (both success and failure) |
| statistics | map(varchar, bigint) | Statistics about how much data was read or written, e.g. `{outputRows=1, outputBytes=9, writtenBytes=137680, writtenRows=410}` |
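For example, here is a sketch (reusing the connection from the Python sketch earlier) that pulls the ten most recent failed queries; it assumes `failure_info` is empty or NULL for successful queries.

```python
# Sketch: list the ten most recent failed queries from the audit trail.
# Reuses `conn` from the connection sketch earlier; assumes failure_info
# is empty or NULL when a query succeeded. "user" is quoted because it is
# a reserved word in Trino.
cur = conn.cursor()
cur.execute("""
    SELECT query_id, "user", create_time, failure_info
    FROM s3.huron.system_trino
    WHERE cardinality(failure_info) > 0
    ORDER BY create_time DESC
    LIMIT 10
""")
for query_id, who, created, failure in cur.fetchall():
    print(query_id, who, created, failure)
```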
1. Download the latest CLI from trino.io.
2. Save `.trino_config` to your `$HOME` folder.
3. Update `.trino_config` with your configuration:
   - `user` as your API Key
   - `server` value for a custom HOST
4. Run queries as desired:

```shell
$ trino --execute 'select * from ndaq_rtat10 limit 5'
$ trino --execute 'select * from zacks_hdm limit 5'
```
NOTE: Tested with Python 3
1. Add requirements.txt in a new project folder, or install those PIP libraries into your existing project.
2. Configure environment variables:
   - `NASDAQ_DATA_LINK_API_KEY` for your API Key
   - `NASDAQ_DATA_LINK_BASE_DOMAIN` for a custom HOST
3. Add datalink_sql.py to your project folder.
4. Run datalink_sql.py to invoke sample queries.

Additionally, create a new module with your desired queries and reuse the helper functions:
```python
import os

import datalink_sql

connection = datalink_sql.create_connection(os.environ.copy())
sql = "..."
df = datalink_sql.run_sql(connection, sql)
print(df)
```
NOTE: Tested with dotnet cli 6
1. Add DataLinkSQL.csproj in a new project folder, or install those NuGet packages into your existing project.
2. Configure environment variables:
   - `NASDAQ_DATA_LINK_API_KEY` for your API Key
   - `NASDAQ_DATA_LINK_BASE_DOMAIN` for a custom HOST
3. Add DataLinkSQL.cs to your project folder.
4. Run the included Main to invoke sample queries:

```shell
$ dotnet run --property DefineConstants=INCLUDE_MAIN
```

Additionally, create a new file with your desired queries and reuse the helper functions:
```csharp
using System.Data.Common;  // for DbDataReader
using NReco.PrestoAdo;
using static DataLinkSQL;

class MyApp
{
    static void Main()
    {
        using (PrestoConnection connection = CreateConnection())
        {
            string sql = "...";
            DbDataReader reader = RunSQL(connection, sql);
            DisplayReader(reader);
        }
    }
}
```

```shell
$ dotnet run
```
NOTE: Tested with Tableau Desktop 2022.2
1. Install trino-jdbc-406.jar into your workspace.
2. Configure environment variables in your cluster's Environment Variables:
   - `NASDAQ_DATA_LINK_API_KEY` for your API Key
   - `NASDAQ_DATA_LINK_BASE_DOMAIN` for a custom HOST
3. Import and run SparkSQL.ipynb.
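For reference, here is a hypothetical PySpark sketch of the kind of read the notebook performs; SparkSQL.ipynb is the canonical version. The URL shape follows the JDBC template in the DataGrip settings below, and `SSL=true` is an assumption for HTTPS on port 443.

```python
# Hypothetical PySpark sketch of reading a DataTable through the Trino JDBC
# driver installed above. SparkSQL.ipynb is canonical; the URL shape follows
# the connection-parameters table, and SSL=true is an assumption for HTTPS.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

host = os.environ.get("NASDAQ_DATA_LINK_BASE_DOMAIN", "data.nasdaq.com")
api_key = os.environ["NASDAQ_DATA_LINK_API_KEY"]

df = (
    spark.read.format("jdbc")
    .option("driver", "io.trino.jdbc.TrinoDriver")
    .option("url", f"jdbc:trino://{host}:443/main/huron?SSL=true")
    .option("user", api_key)
    .option("query", "SELECT * FROM qdl_cbf LIMIT 5")
    .load()
)
df.show()
```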
For DataGrip, configure a custom Trino driver with the following settings:

- Driver files location: `~/Library/Application\ Support/JetBrains/DataGrip${VERSION}/jdbc-drivers/`
- Driver class: `io.trino.jdbc.TrinoDriver`
- URL template: `jdbc:trino://${HOST}:${PORT}/${CATALOG}/${SCHEMA}`

With the default parameters above, the URL resolves to `jdbc:trino://data.nasdaq.com:443/main/huron`.