In D&A Projects, building efficient SQL Queries are critical towards achieving the Extraction/Load Batch cycles complete faster to meet the desired SLAs. Below observations are towards following the approaches to ensure writing SQL queries that meet the Best Practices to facilitate performance improvements.
Tuning Approach
Pre-Requisite Checks :
Before we get into subjecting a SQL Query against Performance Improvements, below steps to be adopted.,
Deep Dive into the current SQL Query.
Complexity of the SQL (# of Tables/Joins/Functions)
Design of the SQL Query (Sub-Query/Correlated Sub-Query/Join/Filter Sequences)
Whether Best Practices followed (Modularized?/Joins contain Functions/Derivations?)
Verify the As-Is Metrics of the SQL
Duration to return 1st record and first 100 records
Extract the Explain Plan Metrics
Cost (Resource Usage)
Cardinality (# of Rows returned per Task Operations)
Access Method (Full Table/ROWID/Index Unique/Full Index/Index Skip Scan)
Join Method (Hash/Nested-Loop/Sort-Merge/Outer Join)
Join Order (Multiple tables join sequence)
Partition
Parallel Processing (Exec on Multiple Nodes)
After ensuring the above Pre-requisites are taken care and possible bottlenecks identified, Tuning practices can be applied on the SQL Query for performance improvements.
Tuning Guidelines :
Basic Guidelines are listed below.,
Query Design Perspective
Extract only the required columns in the code via SELECT (instead of SELECT *)
Use Inner joins well ahead of Outer joins
Filters applied ahead with Inner Joins rather at the end using WHERE clause
Avoid Sub-queries/Correlated Sub-queries as much as possible
Create TEMP tables
to hold Sub-Query logic
to Modularize Complex Logic with related Columns/Derivations
to hold reference list of values (used as Joins instead of IN clause)
to hold Functions/Calculations/Derivations Attributes for later JOIN with Tables
to hold Complex Query Logic and subsequently apply RANK()/ROW_NUMBER()
Create Physical tables (instead of TEMP) if high volume
Drop the TEMP or Physical tables after intermediate processing completes
Complex Query with too many LEFT joins can be broken into parts and then JOINed
Avoid Duplicates as early as possible before subjecting the Derived tables to JOINs
On MPP DBs, do not use DISTRIBUTION for Smaller tables
On MPP DBs, DISTRIBUTION column based joins provide faster results
Functions Perspective
Use EXISTS instead of IN, if presence alone requires to be checked
Instead of MINUS, use LEFT JOIN with IS NULL condition
If DISTINCT causes slowness, try ROW_NUMBER() to select 1 record out of Multiples
Do not use Functions on Joins
DBA Perspective
Collect STATISTICS
Create Indexes (Single/Multiple) (on frequently used Joins/Predicates as required)
Create Partitions (for Optimized Scans)
Space/Computing Perspective
Increase the DB Server storage space
Increase the DB Server Computing Abilities
Multi-Node Processing of Queries
Conclusion
On a high level, below are the inferences.,
Check Explain Plan.
Subject the Query to effective Design.
Focus on DBA, Space, Computing Abilities.
Follow the Best Practices.
Leave A Comment