

Datasources Usage Guide

Overview

Each project ingested into Druid produces three main datasources representing the process data at different levels:

  • Cases: one row per case (process execution)
  • Tasks (_vertex): one row per task execution
  • Links (_edge): one row per transition between tasks

Datasource naming pattern:

  • ${projectId}
  • ${projectId}_vertex
  • ${projectId}_edge

These datasources can be enriched using Druid lookups such as:

  • ${projectId}_variant_name
  • ${projectId}_excluded_cases
  • ${projectId}_cases_is_completed
  • business_rule_<ruleId>
  • conformance_<modelId>

Lookups behave similarly to joins and are applied with the LOOKUP() function.

Example:

SELECT caseid, LOOKUP(graphkey, '${projectId}_variant_name') AS variant_name
FROM ${projectId}

Cases datasource

The Cases datasource contains one row per case and aggregates information computed from all events belonging to that case.

Datasource name pattern: ${projectId}

Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48

This datasource is typically used for:

  • case-level analytics
  • duration and performance metrics
  • filtering cases
  • variant or process analysis
| Name | Type | Description |
| --- | --- | --- |
| __time | Timestamp | start date of the first case event |
| enddate | Long | end date of the last case event |
| duration | Long | case duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | case unique identifier |
| version | String | data version |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| active_duration | Long | total active time in the case |
| concurrency | Double | case concurrency rate |
| rework | Long | case total rework |
| tasks_count | Long | number of tasks in the case (excluding start/end) |
| distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| distinct_vertices_count | Long | number of distinct tasks (including start and end) |
| distinct_edges | Array[String] | distinct transitions in the case |
| distinct_edges_count | Long | number of distinct transitions |
| hll_vertices | Complex<HLLSketch> | HyperLogLog sketch of distinct vertices |
| hll_edges | Complex<HLLSketch> | HyperLogLog sketch of distinct edges |
| last_tasks_names | Array[String] | names of the last tasks before case completion |
| vertices_dates | Complex<json> | start/end timestamps for each task in the case |
| igx_metric-slug_lpkr | Double | aggregated value of the metric at the case level |
| igx_dimension-slug_lpkr | String | aggregated value of the dimension at the case level |
| is_predicted | Boolean | indicates if the row is prediction data |
| is_predicted_finished | Boolean | indicates if the predicted case is finished |
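
As an illustration of these case-level columns, a sketch of a query listing the ten longest valid cases (using only the columns and the excluded-cases lookup documented above; substitute your actual project id):

SELECT
  caseid,
  duration,
  tasks_count,
  distinct_vertices_count
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
ORDER BY duration DESC
LIMIT 10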

Tasks datasource (_vertex)

The Tasks datasource contains one row per task execution within a case.

Datasource name pattern: ${projectId}_vertex

Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48_vertex

This datasource is typically used for:

  • task performance analysis
  • task-level filtering
  • bottleneck detection
  • concurrency analysis
| Name | Type | Description |
| --- | --- | --- |
| ingestion_timestamp | Long | insertion timestamp |
| __time | Timestamp | task start date |
| enddate | Long | task end date |
| duration | Long | task duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | identifier of the task’s case |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| vertex_id | UUID | task non-unique identifier based on the task name (two identical task names have the same vertex_id) |
| task_id | UUID | task unique id |
| vertex_name | String | task name (null for Start and End vertices) |
| igx_metric-slug_lpkr | Double | metric value for the event |
| igx_dimension-slug_lpkr | String | dimension value for the event |
| concurrent_vertices | Multi UUID | non-unique ids of concurrent tasks |
| loop_path | Multi UUID | for a loop starting and finishing with vertex_id, non-unique ids of the links composing the loop |
| occurrences_in_case | Long | number of times the task appears in the case (based on the task name) |
| predicted_occurrences_in_case | Long | number of times the task appears in the predicted case (based on the task name) |
| rework | Long | number of repetitions of the task (without concurrency) in the case |
| case_igx_metric-slug_lpkr | Double | aggregated value of the metric for the case |
| case_igx_dimension-slug_lpkr | Multi String | aggregated value of the dimension for the case |
| case_start_date | Long | case start date |
| case_end_date | Long | case end date |
| case_duration | Long | case duration |
| case_rework | Long | case rework |
| case_concurrency | Double | case concurrency |
| case_active_duration | Long | case active duration |
| case_idle_duration | Long | case idle duration |
| case_tasks_count | Long | number of tasks in the case |
| case_distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| case_distinct_edges | Array[String] | distinct transitions in the case |
| case_is_predicted_finished | Boolean | indicates if the predicted case is finished |
| is_predicted | Boolean | row belongs to prediction data |
| is_in_prediction | Boolean | task is part of a predicted case |
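
As a sketch of how these task-level columns combine for bottleneck or rework analysis (only documented columns and lookups are used; adapt the project id):

SELECT
  vertex_name,
  SUM(rework) AS total_rework,
  AVG(occurrences_in_case) AS avg_occurrences
FROM ${projectId}_vertex
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
GROUP BY vertex_name
ORDER BY total_rework DESC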

Links datasource (_edge)

The Links datasource contains one row per transition between tasks in the process.

Datasource name pattern: ${projectId}_edge

Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48_edge

The edges form a Directly-Follows Graph (DFG), meaning only direct transitions between tasks are represented (no BPMN gateways).

This datasource is typically used for:

  • path analysis
  • transition frequency analysis
  • performance analysis between tasks
  • loop detection

| Name | Type | Description |
| --- | --- | --- |
| __time | Timestamp | edge start date (end of event A) |
| enddate | Long | edge end date (start of event B) |
| duration | Long | edge duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | identifier of the edge’s case |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| edge_id | UUID | edge non-unique identifier based on the edge name (two identical edge names have the same edge_id) |
| edge_name | String | edge name (A->B) |
| start_taskid | UUID | unique identifier of the starting task A |
| start_vertexid | UUID | non-unique identifier of the starting task A |
| start_vertexname | String | name of the starting task A |
| end_taskid | UUID | unique identifier of the end task B |
| end_vertexid | UUID | non-unique identifier of the end task B |
| end_vertexname | String | name of the end task B |
| igx_metric-slug_lpkr_source | Double | task A metric value |
| igx_metric-slug_lpkr_destination | Double | task B metric value |
| igx_dimension-slug_lpkr_source | String | task A dimension value |
| igx_dimension-slug_lpkr_destination | String | task B dimension value |
| concurrentEdges | Multi UUID | non-unique ids of concurrent links |
| occurrences_in_case | Long | number of times the link appears in the case (based on the link name) |
| predicted_occurrences_in_case | Long | number of times the link appears in the predicted case (based on the link name) |
| rework | Long | number of repetitions of the link (without concurrency) in the case |
| case_igx_metric-slug_lpkr | Double | aggregated value of the metric at the case level |
| case_igx_dimension-slug_lpkr | Multi String | aggregated value of the dimension at the case level |
| case_start_date | Long | case start date |
| case_end_date | Long | case end date |
| case_duration | Long | case duration |
| case_rework | Long | case rework |
| case_concurrency | Double | case concurrency |
| case_active_duration | Long | case active duration |
| case_idle_duration | Long | case idle duration |
| case_tasks_count | Long | number of tasks in the case |
| case_distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| case_distinct_edges | Array[String] | distinct transitions in the case |
| case_is_predicted_finished | Boolean | indicates if the predicted case is finished |
| is_predicted | Boolean | row belongs to prediction data |
| is_in_prediction | Boolean | link is part of a predicted case |
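
The loop detection use case mentioned above can be sketched with these edge columns, for example by counting direct self-loops (transitions whose start and end task share the same vertex id):

SELECT
  edge_name,
  COUNT(*) AS occurrences,
  AVG(duration) AS avg_duration
FROM ${projectId}_edge
WHERE start_vertexid = end_vertexid
GROUP BY edge_name
ORDER BY occurrences DESC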

Common Query Patterns

1 Filtering excluded cases

When multiple versions of a case exist during ingestion, some versions must be ignored.

Use the ${projectId}_excluded_cases lookup.

SELECT *
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL

Explanation:

  • excluded cases are stored in the lookup
  • if lookup returns a value → case version is excluded
  • if lookup returns NULL → case version is valid

2 Retrieving the variant name

Variants are stored as graphkey. To get the readable name:

SELECT
  caseid,
  graphkey,
  LOOKUP(graphkey, '${projectId}_variant_name') AS variant_name
FROM ${projectId}

Example use cases:

  • variant analysis
  • filtering specific variants

WHERE LOOKUP(graphkey, '${projectId}_variant_name') = '3'

3 Filtering completed cases

Case completion is computed by business logic and stored in a lookup.

SELECT caseid
FROM ${projectId}
WHERE LOOKUP(caseid, '${projectId}_cases_is_completed') = 'TRUE'

Possible values depend on the end case rule (business rule).

4 Using business rules

Business rules results are stored in dynamic lookups.

Lookup name pattern:

business_rule_<ruleId>

Example:

SELECT caseid
FROM ${projectId}
WHERE LOOKUP(caseid, 'business_rule_12345') = 'TRUE'

5 Filtering by conformance score

Conformance scores are stored in lookups.

Example:

SELECT caseid,
LOOKUP(processkey, 'conformance_model1') AS fitness
FROM ${projectId}
WHERE LOOKUP(processkey, 'conformance_model1') < 0.8

This allows filtering low-conformance cases.

6 Task-level analysis

Example: average task duration

SELECT
  vertex_name,
  AVG(duration) AS avg_duration
FROM ${projectId}_vertex
GROUP BY vertex_name

7 Transition analysis

Example: most frequent transitions

SELECT
  start_vertexname,
  end_vertexname,
  COUNT(*) AS transitions
FROM ${projectId}_edge
GROUP BY start_vertexname, end_vertexname
ORDER BY transitions DESC

8 Variant performance

Example: average case duration per variant

SELECT
  LOOKUP(graphkey, '${projectId}_variant_name') AS variant,
  AVG(duration) AS avg_duration
FROM ${projectId}
GROUP BY graphkey

9 Using prediction fields

Prediction-related fields allow comparing predicted and real data.

Example:

SELECT *
FROM ${projectId}
WHERE is_predicted = true

Example: predicted tasks

SELECT *
FROM ${projectId}_vertex
WHERE is_in_prediction = true

Best Practices

Always:

  • filter excluded cases using the lookup
  • avoid joins to filter on a dimension or metric when a case_* column exists
  • prefer the Cases datasource for aggregated metrics
  • use Tasks datasource for activity analysis
  • use Links datasource for path analysis
  • use lookups instead of joins whenever possible
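
Put together, a query following these practices might look like this sketch (using only the lookups and columns documented above; replace the project id with yours):

SELECT
  LOOKUP(graphkey, '${projectId}_variant_name') AS variant,
  COUNT(*) AS cases,
  AVG(duration) AS avg_duration
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
  AND LOOKUP(caseid, '${projectId}_cases_is_completed') = 'TRUE'
GROUP BY graphkey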

Resources

IGrafx extends Druid SQL with several specific functions:

Sha256
SELECT SHA256('hello world') AS hash;
Md5
SELECT MD5('hello world') AS hash;
NewID
SELECT NEWID() AS uuid;

Tip: You can use NEWID() and MD5 to get a random sample (about 10%) of data.

SELECT "caseid" from "cases" WHERE BITWISE_AND(md5(CONCAT("caseid", NEWID())), 2147483647) / 2147483647.0 < 0.1;

Be careful: when using NEWID() in a query, it returns the same UUID for each row in the result set. To get a random sample of data with distributed SQL, you cannot use NEWID() directly; set a variable beforehand instead.

BUSINESS_DURATION
SELECT BUSINESS_DURATION(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'), 'FR') AS duration;

You can optionally add four parameters to the function:

  • LANGUAGE: the language to use for the calendar. The default is "en"; for example, "fr" for French.
  • HOLIDAY_TYPES: the type of holiday to use, e.g. "public observance". The possible values are public, observance, bank, school, and optional. The default is "public".
  • WEEK_HOURS: the business hours of the week. The default is "08:30-12:30, 13:30-17:30". You can specify a different range for each day of the week, e.g. '09:00-17:00' for Monday to Friday. The format must follow https://github.com/yusuke/businessCalendar4J/blob/main/README.md#specify-holidays-and-business-hours (set business hours).
  • SPECIFIC_HOLIDAYS: custom holidays. The default depends on the calendar you specified (3rd parameter). See https://github.com/commenthol/date-holidays/blob/c968f9a8e3cf734a32645efd605b67dac70ea509/docs/specification.md#grammar-for-day-rules for the format of specific holidays.
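
Assuming the optional parameters are passed positionally in the order LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS (an assumption based on the description above; verify against your deployment), a call overriding the defaults might look like:

SELECT BUSINESS_DURATION(
  TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'),
  TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'),
  'FR', 'fr', 'public', '09:00-17:00') AS duration;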

IS_BUSINESS_DAY
SELECT IS_BUSINESS_DAY(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 'FR') AS businessday;

You can optionally add four parameters to the function (LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS), as described above.

IS_HOLIDAY_DAY
SELECT IS_HOLIDAY_DAY(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 'FR') AS holiday;

You can optionally add four parameters to the function (LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS), as described above.

DURATION_BETWEEN
SELECT DURATION_BETWEEN(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00')], ARRAY[TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')], 'ALL', 'NEXT') AS duration;

You can optionally add five parameters to the function (CALENDAR, LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS), as described above, if you want DURATION_BETWEEN calculated with business duration.

IS_FOLLOWED_BY

SELECT IS_FOLLOWED_BY(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00')], ARRAY[TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')]) AS followed;

LONG_TO_HEX

SELECT LONG_TO_HEX(MD5('hello world')) AS hex;

ARRAY_FLATTEN

SELECT ARRAY_FLATTEN('LONG',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS set_array;

ARRAY_FLATTEN_AND_SORT

SELECT ARRAY_FLATTEN_AND_SORT('LONG',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS sorted_array;

CONCAT_FLATTEN_AND_SORT

SELECT CONCAT_FLATTEN_AND_SORT(',',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS sorted_string;

INTERVAL_UNION

SELECT INTERVAL_UNION(ARRAY[1, 2, 3, 5], ARRAY[4, 6]) AS interval_union;

INTERVAL_UNION_AGG

SELECT INTERVAL_UNION_AGG(field_array) AS interval_union_agg;

DURATION_INTERVAL

SELECT DURATION_INTERVAL(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')], 'FR') AS duration;

You can optionally add four parameters to the function (LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS), as described above.

GET_HOLIDAYS

SELECT GET_HOLIDAYS(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'), 'FR') AS holidays;

You can optionally add four parameters to the function (LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS), as described above.

GET_CALENDARS

SELECT GET_CALENDARS() AS calendars;

JSON_OBJECT_SIZE

SELECT JSON_OBJECT_SIZE("key1", "value1", "key2", "value2") AS json_size;

IN_BITMAP_BASE64

SELECT IN_BITMAP_BASE64("LONG_COL", 'roaringbitmapBase64_01', 'roaringbitmapBase64_02') AS in_bitmap;

ARRAY_SUCCESSIVE_DIFF

SELECT ARRAY_SUCCESSIVE_DIFF(ARRAY[10, 20, 30, 35, 40]) AS successive_diff;

ENCODE_BASE64_UTF8

SELECT ENCODE_BASE64_UTF8("TOENCODESTRING") AS encoded_str;

ARRAY_SUBSET

SELECT ARRAY_SUBSET(ARRAY[1, 2], ARRAY[1, 2, 3]) AS is_subset;

GRAPH_JACCARD_SIMILARITY

SELECT GRAPH_JACCARD_SIMILARITY(ARRAY['v1', 'v2', 'v3'], ARRAY['v1->v2', 'v2->v3'], ARRAY['v1', 'v2', 'v3', 'v4'], ARRAY['v1->v2', 'v2->v3', 'v3->v4']) AS jaccard_similarity;

END_DATE

SELECT END_DATE(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 4233033, 'FR') AS end_date;

You can optionally add four parameters to the function (LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS), as described above.