HANA Predictive Analysis Library (PAL): Regression & Clustering in SQL – Day 7 Expert Hands-On | Part 52 | FreeLearning365

 

HANA Predictive Analysis Library (PAL): Regression & Clustering in SQL – Day 7 Expert Hands-On | Part 52 | FreeLearning365

🔥 DAY07 [SAP HANA Database & Performance] {Part-52}

Welcome to the day where your HANA database becomes a machine learning platform. Over the past six days, you’ve conquered architecture, SQLScript, modeling, tools, tuning, and memory. Now you’ll add predictive analytics to your skillset without leaving the SQL environment. HANA’s Predictive Analysis Library (PAL) is a set of native procedures that implement algorithms like regression, clustering, classification, and time‑series forecasting. They run entirely inside the database engine, leveraging column‑store parallelism and in‑memory speed. No external R or Python servers, no data duplication, no security nightmares. You’ll build real models today – credit risk scoring and customer segmentation – using the same ACDOCA data you already manage.

We’ll demystify PAL’s procedural interface, walk through data preparation with SQL, and provide complete, copy‑ready code for each algorithm. By the end, you’ll be able to present a working predictive model to your business stakeholders, all built in an afternoon.


🧠 What is PAL and Why Should You Care?

The Predictive Analysis Library (PAL) is a set of pre‑built algorithms delivered as SQLScript stored procedures in the _SYS_AFL schema. They are executed using the Application Function Library (AFL) framework, which runs inside the index server with dedicated worker threads. PAL covers:

  • Regression: Linear, multiple, polynomial, logistic.
  • Clustering: K‑Means, DBSCAN, Agglomerate, SOM.
  • Classification: Decision trees, Naive Bayes, SVM.
  • Time Series: ARIMA, exponential smoothing, seasonal decomposition.
  • Association: Apriori, FP‑Growth.
  • Data Prep: Sampling, scaling, imputation, feature selection.

Because PAL is native, model training and scoring can operate directly on column‑store tables, avoiding the latency and security risks of moving data to external systems. For S/4HANA, this means you can embed credit checks, demand forecasts, or maintenance predictions into ABAP applications by simply calling a PAL procedure via EXEC SQL.

💡 Key advantage: PAL models are first‑class database objects. You can schedule training via HANA jobs, version models in Git (Day 4), and control access with analytic privileges. No separate model registry needed.

🏗️ PAL Architecture – How It Actually Works

PAL algorithms are not magic. They are implemented as C++ functions within the AFL framework. To use them, you call a wrapper procedure that passes data tables, parameter tables, and receives result tables. The typical calling pattern is:

  1. Prepare input data tables (e.g., training set).
  2. Prepare a parameter table (key‑value pairs specifying algorithm options).
  3. Call the PAL procedure, passing input and parameter tables, and receive output tables (model, coefficients, predictions).

The wrapper procedures are located in the _SYS_AFL schema. For example, _SYS_AFL.PAL_LINEAR_REGR. You don’t create them; SAP delivers them with HANA. But you must ensure AFL is enabled and the PAL plugin is installed (usually yes for S/4HANA).

🔑 Enabling PAL

Check if PAL is available by querying SYS.AFL_AREAS for the PAL area. If missing, work with your Basis team to install the AFL component. HANA Cloud includes PAL by default.

SELECT * FROM SYS.AFL_AREAS WHERE AREA_NAME = 'PAL';

📊 Regression with PAL – Predicting Continuous Values

📈 Linear Regression (Simple & Multiple)

Linear regression models the relationship between a dependent variable (e.g., sales amount) and one or more independent variables (e.g., marketing spend, season). PAL’s PAL_LINEAR_REGR procedure performs ordinary least squares estimation and returns coefficients, R‑squared, and prediction functionality.

🧪 Hands‑on Lab: Predict Sales from Marketing Spend

Scenario: You have a table Z_SALES_DATA with columns PERIOD, MARKETING_SPEND, SEASON_INDEX, and SALES_AMOUNT. You want to predict next period’s sales.

Step 1 – Prepare training data table:

CREATE COLUMN TABLE Z_TRAIN_SALES ( ID INTEGER, MARKETING_SPEND DOUBLE, SEASON_INDEX DOUBLE, SALES_AMOUNT DOUBLE ); INSERT INTO Z_TRAIN_SALES VALUES (1, 1000, 1.2, 15000); INSERT INTO Z_TRAIN_SALES VALUES (2, 1500, 1.5, 21000); -- ... more rows

Step 2 – Prepare parameter table:

CREATE LOCAL TEMPORARY TABLE #PAL_PARAMS ( PARAM_NAME VARCHAR(100), INT_VALUE INTEGER, DOUBLE_VALUE DOUBLE, STRING_VALUE VARCHAR(100) ); INSERT INTO #PAL_PARAMS VALUES ('THREAD_NUMBER', 2, NULL, NULL); INSERT INTO #PAL_PARAMS VALUES ('ALPHA', NULL, 0.05, NULL); -- confidence level

Step 3 – Call PAL_LINEAR_REGR:

CALL _SYS_AFL.PAL_LINEAR_REGR ( Z_TRAIN_SALES, -- input data '#PAL_PARAMS', -- parameters ?, -- output: coefficients ?, -- output: model pmml ?, -- output: fit summary ?, -- output: residual ? -- output: statistics ) WITH OVERVIEW;

After calling, you’ll get tables with coefficients (intercept, slopes). You can then use the model to predict new values by applying the linear formula in SQL.

📉 Logistic Regression – Classifying Outcomes

Logistic regression predicts binary outcomes (yes/no, default/non‑default). Perfect for credit risk scoring. PAL provides PAL_LOGISTIC_REGR. It returns probabilities and classification thresholds.

🧪 Real‑Life Lab: Credit Risk Scoring on ACDOCA Payment Data

Business goal: Classify customers into high‑risk (likely to default) vs low‑risk based on their payment history, average days overdue, and credit limit utilization.

Data preparation from ACDOCA:

CREATE COLUMN TABLE Z_CREDIT_TRAIN AS ( SELECT KUNNR, AVG(DAYS_BETWEEN(DUE_DATE, AUGDT)) AS AVG_OVERDUE, SUM(WSL) / NULLIF(MAX(CREDIT_LIMIT),0) AS UTILIZATION, CASE WHEN MAX(DELINQUENCY_FLAG) = 1 THEN 1 ELSE 0 END AS DEFAULT_FLAG FROM ACDOCA WHERE KOART = 'D' AND AUGDT IS NOT NULL GROUP BY KUNNR );

Now we have labeled data: DEFAULT_FLAG = 1 for default, 0 for non‑default. We train the logistic model:

CREATE LOCAL TEMPORARY TABLE #LOG_PARAMS ...; INSERT INTO #LOG_PARAMS VALUES ('DEPENDENT_VARIABLE', NULL, NULL, 'DEFAULT_FLAG'); INSERT INTO #LOG_PARAMS VALUES ('CLASS_MAP_BASE', NULL, NULL, '0,1'); CALL _SYS_AFL.PAL_LOGISTIC_REGR ( Z_CREDIT_TRAIN, '#LOG_PARAMS', ?, -- coefficients ?, -- model ?, -- confusion matrix ?, -- statistics ? );

The coefficients tell you the impact of each factor on default probability. You can then score new customers in real time by executing a SQL query that multiplies their attributes by the coefficients and applies the sigmoid function.

-- Scoring using saved coefficients SELECT KUNNR, 1 / (1 + EXP(-(0.5 + 0.03*AVG_OVERDUE + 0.8*UTILIZATION))) AS DEFAULT_PROB FROM Z_CUSTOMER_CURRENT;

This model runs entirely in HANA. No ETL to external Python. For ABAP Fiori apps, wrap the scoring query in a table function and expose it as an OData service.


🧩 Clustering with PAL – Finding Hidden Groups

🔵 K‑Means Clustering

K‑Means partitions data into K groups by minimizing within‑cluster variance. Use it for customer segmentation, product grouping, or anomaly detection. PAL’s PAL_KMEDOIDS or PAL_KMEANS (choose KMeans for large data) handles initialization, iteration, and returns cluster assignments and centroids.

🧪 Lab: Customer Segmentation Based on Purchase Behavior

Data: Aggregate sales from ACDOCA: total revenue, frequency, recency per customer.

CREATE COLUMN TABLE Z_CUST_RFM AS ( SELECT KUNNR, SUM(WSL) AS MONETARY, COUNT(DISTINCT BELNR) AS FREQUENCY, DAYS_BETWEEN(MAX(BUDAT), CURRENT_DATE) AS RECENCY FROM ACDOCA WHERE KOART = 'D' GROUP BY KUNNR );

Scale the data (PAL expects normalized inputs for best results):

-- Use PAL_SCALE procedure CALL _SYS_AFL.PAL_SCALE ( Z_CUST_RFM, '#SCALE_PARAMS', Z_CUST_RFM_SCALED, ? -- scale model );

Run K‑Means with K=4:

CREATE LOCAL TEMPORARY TABLE #KMEANS_PARAMS ...; INSERT INTO #KMEANS_PARAMS VALUES ('GROUP_NUMBER', 4, NULL, NULL); INSERT INTO #KMEANS_PARAMS VALUES ('INIT_TYPE', 1, NULL, NULL); -- random centers CALL _SYS_AFL.PAL_KMEANS ( Z_CUST_RFM_SCALED, '#KMEANS_PARAMS', Z_CLUSTER_RESULT, -- cluster assignment Z_CENTROIDS, -- centers ? -- statistics );

Now Z_CLUSTER_RESULT contains each customer and their assigned cluster. You can profile clusters by averaging attributes per cluster. Marketing can then target each segment differently.

📍 DBSCAN – Outlier Detection Without Predefining K

DBSCAN groups points by density, automatically identifying noise (outliers). PAL’s PAL_DBSCAN is excellent for fraud detection or finding unusual supplier behavior.

-- Example: detect anomalous payment patterns CALL _SYS_AFL.PAL_DBSCAN ( Z_PAYMENT_PATTERNS, '#DBSCAN_PARAMS', -- minPts=5, epsilon=0.4 Z_DBSCAN_RESULT, ? );

Rows assigned to cluster -1 are outliers. Investigate them for potential fraud or errors.


🔄 Data Preparation for PAL – The Secret to Accurate Models

Garbage in, garbage out. PAL includes dedicated procedures for preprocessing:

  • PAL_SCALE – Min‑Max or Z‑score normalization.
  • PAL_ENCODE – One‑hot encoding for categorical variables.
  • PAL_IMPUTE – Missing value treatment (mean, median, constant).
  • PAL_TRAIN_TEST_SPLIT – Random partition into training and test sets.

Combine these in a SQLScript procedure to create a repeatable preprocessing pipeline. Example:

CREATE PROCEDURE PREP_CREDIT_DATA (OUT out_train TABLE(...), OUT out_test TABLE(...)) AS BEGIN -- Step 1: encode categoricals CALL PAL_ENCODE(...); -- Step 2: impute missing CALL PAL_IMPUTE(...); -- Step 3: split 70/30 CALL PAL_TRAIN_TEST_SPLIT(..., 'TRAINING_RATIO', 0.7); -- return train and test tables END;

📈 Evaluating PAL Models – Accuracy, AUC, and Interpretability

PAL outputs evaluation tables directly. For classification, the confusion matrix gives accuracy, precision, recall. For regression, R‑squared and mean squared error are provided. Use these to compare models and choose the best.

-- Example: retrieve confusion matrix from logistic regression output SELECT * FROM :confusion_table;

You can also apply a trained model to the test set and calculate your own metrics using SQL aggregations.

🔍 Real‑Life Success Story: Retail Customer Churn

A fashion retailer used PAL logistic regression on ACDOCA purchase recency and frequency to predict churn. They achieved an AUC of 0.82, and integrated the scoring into a Fiori app that alerts account managers when a high‑value customer’s churn probability crosses 0.7. The entire pipeline (training, scoring) runs inside HANA and refreshes weekly without any external tool.


⚖️ Pros, Cons & Alternatives to PAL

✅ Pros

  • Zero data movement: Data stays in HANA, preserving security and avoiding latency.
  • Column‑store performance: Training on billions of rows is feasible due to parallelized, in‑memory operations.
  • SQL integration: Analysts and ABAP developers can call models with standard SQL.
  • Enterprise support: SAP‑delivered, patchable, and supported – unlike open‑source scripts.

⚠️ Cons

  • Algorithm breadth: PAL covers core methods but not deep learning or cutting‑edge research algorithms. For those, you may need Python/R integration via HANA APL or separate data science platform.
  • Learning curve: The procedural PAL interface (parameter tables, multiple outputs) can be intimidating initially.
  • Model governance: While Git helps, PAL lacks built‑in model tracking (like MLflow). You must build your own registration tables.
  • Limited customization: PAL procedures are closed‑box; you can’t tweak the inner optimization algorithm (but you can choose variants).

🔄 Alternatives

  • SAP HANA Automated Predictive Library (APL): Higher‑level, automatic algorithm selection and pre‑processing. Good for quick wins but less control.
  • Python/R with HANA DataFrame: Bring data into Python via hana-ml package for more advanced algorithms, but you lose the “inside database” advantage.
  • SAP Analytics Cloud Smart Predict: For business users, a no‑code interface built on HANA PAL.
✅ Best Practice: Start with PAL for production scoring that needs low latency and high security. Use Python/R for exploratory analysis where you need many iterations and visualizations. Combine both: prototype in Python, then deploy the chosen model’s coefficients back into HANA for scoring via PAL or manual SQL.

📌 Your Data Science Journey Inside HANA

Today you’ve bridged the gap between database engineering and data science. You can now build regression models to forecast sales, logistic models to score risk, and cluster customers – all within the safe, fast walls of HANA. This capability is a career multiplier: the professional who can deliver predictive insights without complex infrastructure is the one who leads digital transformation.

🔥 Action Item: In your sandbox, enable PAL, then run the K‑Means customer segmentation lab using your own ACDOCA data. Visualize the clusters in a reporting tool of your choice. Next, attempt the credit risk model and evaluate its confusion matrix. You’ll have concrete predictive analytics experience by dinner.

Post a Comment

0 Comments