Datasources Usage Guide¶
Overview¶
Each project ingested into Druid produces three main datasources representing the process data at different levels:
- Cases: one row per case (process execution)
- Tasks (_vertex): one row per task execution
- Links (_edge): one row per transition between tasks
Datasource naming pattern:
- ${projectId}
- ${projectId}_vertex
- ${projectId}_edge
These datasources can be enriched using Druid lookups such as:
- ${projectId}_variant_name
- ${projectId}_excluded_cases
- ${projectId}_cases_is_completed
- business_rule_<ruleId>
- conformance_<modelId>
Lookups behave similarly to joins and can be queried with the LOOKUP() function.
Example:
SELECT caseid, LOOKUP(graphkey, '${projectId}_variant_name') AS variant_name
FROM ${projectId}
Cases datasource¶
The Cases datasource contains one row per case and aggregates information computed from all events belonging to that case.
Datasource name pattern: ${projectId}
Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48
This datasource is typically used for:
- case-level analytics
- duration and performance metrics
- filtering cases
- variant or process analysis
| Name | Type | Description |
|---|---|---|
| __time | Timestamp | start date of the first case event |
| enddate | Long | end date of the last case event |
| duration | Long | case duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | case unique identifier |
| version | String | data version |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| active_duration | Long | total active time in the case |
| concurrency | Double | case concurrency rate |
| rework | Long | case total rework |
| tasks_count | Long | number of tasks in the case (excluding start/end) |
| distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| distinct_vertices_count | Long | number of distinct tasks (including start and end) |
| distinct_edges | Array[String] | distinct transitions in the case |
| distinct_edges_count | Long | number of distinct transitions |
| hll_vertices | Complex<HLLSketch> | HyperLogLog sketch of distinct vertices |
| hll_edges | Complex<HLLSketch> | HyperLogLog sketch of distinct edges |
| last_tasks_names | Array[String] | names of last tasks before case completion |
| vertices_dates | Complex<json> | start/end timestamps for each task in the case |
| igx_metric-slug_lpkr | Double | aggregated value of the metric at the case level |
| igx_dimension-slug_lpkr | String | aggregated value of the dimension at the case level |
| is_predicted | Boolean | indicates if row is prediction data |
| is_predicted_finished | Boolean | indicates if predicted case is finished |
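As an illustration, the case-level columns above can be combined in a single aggregation; the filter on the ${projectId}_excluded_cases lookup discards obsolete case versions. This is a sketch using only the columns and lookups documented in this guide:

```sql
-- Average duration and total rework per variant, keeping only valid case versions
SELECT
  graphkey,
  COUNT(*) AS cases,
  AVG(duration) AS avg_duration_ms,
  SUM(rework) AS total_rework
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
GROUP BY graphkey
```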
Tasks datasource (_vertex)¶
The Tasks datasource contains one row per task execution within a case.
Datasource name pattern: ${projectId}_vertex
Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48_vertex
This datasource is typically used for:
- task performance analysis
- task-level filtering
- bottleneck detection
- concurrency analysis
| Name | Type | Description |
|---|---|---|
| ingestion_timestamp | Long | insertion timestamp |
| __time | Timestamp | task start date |
| enddate | Long | task end date |
| duration | Long | task duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | identifier of the task’s case |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| vertex_id | UUID | task non-unique identifier based on task name (two identical task names have the same vertex_id) |
| task_id | UUID | task unique id |
| vertex_name | String | task name (null for Start and End vertices) |
| igx_metric-slug_lpkr | Double | metric value for the event |
| igx_dimension-slug_lpkr | String | dimension value for event |
| concurrent_vertices | Multi UUID | concurrent tasks non-unique ids |
| loop_path | Multi UUID | for a loop starting and finishing with vertex_id, non-unique ids of links composing the loop |
| occurrences_in_case | Long | number of times the task appears in the case (based on task name) |
| predicted_occurrences_in_case | Long | number of times the task appears in the predicted case (based on task name) |
| rework | Long | number of repetitions of the task (without concurrency) in the case |
| case_igx_metric-slug_lpkr | Double | aggregated value of the metric for the case |
| case_igx_dimension-slug_lpkr | Multi String | aggregated value of the dimension for the case |
| case_start_date | Long | case start date |
| case_end_date | Long | case end date |
| case_duration | Long | case duration |
| case_rework | Long | case rework |
| case_concurrency | Double | case concurrency |
| case_active_duration | Long | case active duration |
| case_idle_duration | Long | case idle duration |
| case_tasks_count | Long | number of tasks in the case |
| case_distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| case_distinct_edges | Array[String] | distinct transitions in the case |
| case_is_predicted_finished | Boolean | indicates if predicted case is finished |
| is_predicted | Boolean | row belongs to prediction data |
| is_in_prediction | Boolean | task is part of predicted case |
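For example, the rework and occurrences_in_case columns can highlight tasks that are frequently repeated. This is a sketch based only on the columns documented above:

```sql
-- Tasks repeated the most across cases
SELECT
  vertex_name,
  SUM(rework) AS total_rework,
  AVG(occurrences_in_case) AS avg_occurrences
FROM ${projectId}_vertex
WHERE vertex_name IS NOT NULL  -- vertex_name is null for Start and End vertices
GROUP BY vertex_name
ORDER BY total_rework DESC
```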
Links datasource (_edge)¶
The Links datasource represents transitions between tasks in the process.
Datasource name pattern: ${projectId}_edge
Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48_edge
The edges form a Directly-Follows Graph (DFG), meaning only direct transitions between tasks are represented (no BPMN gateways).
This datasource is typically used for:
- path analysis
- transition frequency analysis
- performance analysis between tasks
- loop detection
| Name | Type | Description |
|---|---|---|
| __time | Timestamp | edge start date (end of event A) |
| enddate | Long | edge end date (start of event B) |
| duration | Long | edge duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | identifier of the edge’s case |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| edge_id | UUID | edge non-unique identifier based on edge name (two identical edge names have the same edge_id) |
| edge_name | String | edge name (A->B) |
| start_taskid | UUID | unique identifier of the starting task A |
| start_vertexid | UUID | non-unique identifier of the starting task A |
| start_vertexname | String | name of starting task A |
| end_taskid | UUID | unique identifier of the end task B |
| end_vertexid | UUID | non-unique identifier of the end task B |
| end_vertexname | String | name of end task B |
| igx_metric-slug_lpkr__source | Double | task A metric value |
| igx_metric-slug_lpkr_destination | Double | task B metric value |
| igx_dimension-slug_lpkr_source | String | task A dimension value |
| igx_dimension-slug_lpkr_destination | String | task B dimension value |
| concurrentEdges | Multi UUID | non-unique ids of concurrent links |
| occurrences_in_case | Long | number of times the link appears in the case (based on the link name) |
| predicted_occurrences_in_case | Long | number of times the link appears in the predicted case (based on the link name) |
| rework | Long | number of repetitions of the link (without concurrency) in the case |
| case_igx_metric-slug_lpkr | Double | aggregated value of the metric at the case level |
| case_igx_dimension-slug_lpkr | Multi String | aggregated value of the dimension at the case level |
| case_start_date | Long | case start date |
| case_end_date | Long | case end date |
| case_duration | Long | case duration |
| case_rework | Long | case rework |
| case_concurrency | Double | case concurrency |
| case_active_duration | Long | case active duration |
| case_idle_duration | Long | case idle duration |
| case_tasks_count | Long | number of tasks in the case |
| case_distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| case_distinct_edges | Array[String] | distinct transitions in the case |
| case_is_predicted_finished | Boolean | indicates if predicted case is finished |
| is_predicted | Boolean | row belongs to prediction data |
| is_in_prediction | Boolean | link is part of predicted case |
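For instance, the edge duration (the waiting time between the end of task A and the start of task B) can be aggregated per transition to locate slow hand-offs. This is a sketch using only the columns documented above:

```sql
-- Slowest transitions on average
SELECT
  edge_name,
  COUNT(*) AS occurrences,
  AVG(duration) AS avg_wait_ms
FROM ${projectId}_edge
GROUP BY edge_name
ORDER BY avg_wait_ms DESC
LIMIT 10
```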
Common Query Patterns¶
1 Filtering excluded cases¶
When multiple versions of a case exist during ingestion, some versions must be ignored.
Use the ${projectId}_excluded_cases lookup.
SELECT *
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
Explanation:
- excluded cases are stored in the lookup
- if lookup returns a value → case version is excluded
- if lookup returns NULL → case version is valid
2 Retrieving the variant name¶
Variants are stored as graphkey. To get the readable name:
SELECT
caseid,
graphkey,
LOOKUP(graphkey, '${projectId}_variant_name') AS variant_name
FROM ${projectId}
Example use cases:
- variant analysis
- filtering specific variants
WHERE LOOKUP(graphkey, '${projectId}_variant_name') = '3'
3 Filtering completed cases¶
Case completion is computed by business logic and stored in a lookup.
SELECT caseid
FROM ${projectId}
WHERE LOOKUP(caseid, '${projectId}_cases_is_completed') = 'TRUE'
Possible values depend on the end case rule (business rule).
4 Using business rules¶
Business rules results are stored in dynamic lookups.
Lookup name pattern:
business_rule_<ruleId>
Example:
SELECT caseid
FROM ${projectId}
WHERE LOOKUP(caseid, 'business_rule_12345') = 'TRUE'
5 Filtering by conformance score¶
Conformance scores are stored in lookups.
Example:
SELECT caseid,
LOOKUP(processkey, 'conformance_model1') AS fitness
FROM ${projectId}
WHERE LOOKUP(processkey, 'conformance_model1') < 0.8
This allows filtering low-conformance cases.
6 Task-level analysis¶
Example: average task duration
SELECT
vertex_name,
AVG(duration) AS avg_duration
FROM ${projectId}_vertex
GROUP BY vertex_name
7 Transition analysis¶
Example: most frequent transitions
SELECT
start_vertexname,
end_vertexname,
COUNT(*) AS transitions
FROM ${projectId}_edge
GROUP BY start_vertexname, end_vertexname
ORDER BY transitions DESC
8 Variant performance¶
Example: average case duration per variant
SELECT
LOOKUP(graphkey, '${projectId}_variant_name') AS variant,
AVG(duration) AS avg_duration
FROM ${projectId}
GROUP BY graphkey
9 Using prediction fields¶
Prediction-related fields allow comparing predicted and real data.
Example:
SELECT *
FROM ${projectId}
WHERE is_predicted = true
Example: predicted tasks
SELECT *
FROM ${projectId}_vertex
WHERE is_in_prediction = true
Best Practices¶
Always:
- filter excluded cases using the lookup
- avoid joins to filter on a dimension or metric when a case_xx column exists
- prefer the Cases datasource for aggregated metrics
- use Tasks datasource for activity analysis
- use Links datasource for path analysis
- use lookups instead of joins whenever possible
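Putting these practices together, a typical aggregated query reads the Cases datasource, filters excluded case versions through the lookup, and enriches the result with the variant name. This is a sketch based only on the columns and lookups documented above:

```sql
-- Case count and average duration per variant, valid versions only
SELECT
  LOOKUP(graphkey, '${projectId}_variant_name') AS variant,
  COUNT(*) AS cases,
  AVG(duration) AS avg_duration_ms
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
GROUP BY graphkey
```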
Resources¶
IGrafx extends Druid SQL with several specific functions:
Sha256¶
SELECT SHA256('hello world') AS hash;
Md5¶
SELECT MD5('hello world') AS hash;
NewID¶
SELECT NEWID() AS uuid;
Tip: You can use NEWID() and MD5 to get a random sample (about 10%) of data.
SELECT "caseid" from "cases" WHERE BITWISE_AND(md5(CONCAT("caseid", NEWID())), 2147483647) / 2147483647.0 < 0.1;
Be careful: when using NEWID() in a query, it may return the same UUID for each row in the result set.
If you want a random sample of data using distributed SQL, you cannot use NEWID() directly; set a variable beforehand and use that instead.
BUSINESS_DURATION¶
SELECT BUSINESS_DURATION(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'), 'FR') AS duration;
You can optionally add four parameters to the function:
- LANGUAGE: the language to use for the calendar. The default is "en" (e.g. "fr" for French).
- HOLIDAY_TYPES: the type of holiday to use, e.g. "public observance". The possible values are public, observance, bank, school, optional. The default is "public".
- WEEK_HOURS: the business hours of the week. The default is "08:30-12:30, 13:30-17:30". You can specify a different range for each day of the week, e.g., '09:00-17:00' for Monday to Friday. The format must follow https://github.com/yusuke/businessCalendar4J/blob/main/README.md#specify-holidays-and-business-hours (set business hours).
- SPECIFIC_HOLIDAYS: custom holidays. The default depends on the calendar specified in the 3rd parameter. See https://github.com/commenthol/date-holidays/blob/c968f9a8e3cf734a32645efd605b67dac70ea509/docs/specification.md#grammar-for-day-rules for the format of specific holidays.
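Assuming the optional parameters are passed positionally after the calendar code (the exact signature may differ in your IGrafx version, so treat this as a sketch rather than a reference), a call customizing the business week could look like:

```sql
SELECT BUSINESS_DURATION(
  TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'),
  TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'),
  'FR',             -- calendar
  'fr',             -- LANGUAGE
  'public',         -- HOLIDAY_TYPES
  '09:00-17:00'     -- WEEK_HOURS (format per businessCalendar4J)
) AS duration;
```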
IS_BUSINESS_DAY¶
SELECT IS_BUSINESS_DAY(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 'FR') AS businessday;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
IS_HOLIDAY_DAY¶
SELECT IS_HOLIDAY_DAY(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 'FR') AS holiday;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
DURATION_BETWEEN¶
SELECT DURATION_BETWEEN(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00')], ARRAY[TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')], 'ALL', 'NEXT') AS duration;
You can optionally add five parameters to the function: CALENDAR, LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above, if you want DURATION_BETWEEN to be calculated with business durations.
IS_FOLLOWED_BY¶
SELECT IS_FOLLOWED_BY(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00')], ARRAY[TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')]) AS followed;
LONG_TO_HEX¶
SELECT LONG_TO_HEX(MD5('hello world')) AS hex;
ARRAY_FLATTEN¶
SELECT ARRAY_FLATTEN('LONG',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS set_array;
ARRAY_FLATTEN_AND_SORT¶
SELECT ARRAY_FLATTEN_AND_SORT('LONG',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS sorted_array;
CONCAT_FLATTEN_AND_SORT¶
SELECT CONCAT_FLATTEN_AND_SORT(',',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS sorted_string;
INTERVAL_UNION¶
SELECT INTERVAL_UNION(ARRAY[1, 2, 3, 5], ARRAY[4, 6]) AS interval_union;
INTERVAL_UNION_AGG¶
SELECT INTERVAL_UNION_AGG(field_array) AS interval_union_agg;
DURATION_INTERVAL¶
SELECT DURATION_INTERVAL(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')], 'FR') AS duration;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
GET_HOLIDAYS¶
SELECT GET_HOLIDAYS(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'), 'FR') AS holidays;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
GET_CALENDARS¶
SELECT GET_CALENDARS() AS calendars;
JSON_OBJECT_SIZE¶
SELECT JSON_OBJECT_SIZE("key1", "value1", "key2", "value2") AS json_size;
IN_BITMAP_BASE64¶
SELECT IN_BITMAP_BASE64("LONG_COL", 'roaringbitmapBase64_01', 'roaringbitmapBase64_02') AS in_bitmap;
ARRAY_SUCCESSIVE_DIFF¶
SELECT ARRAY_SUCCESSIVE_DIFF(ARRAY[10, 20, 30, 35, 40]) AS successive_diff;
ENCODE_BASE64_UTF8¶
SELECT ENCODE_BASE64_UTF8("TOENCODESTRING") AS encoded_str;
ARRAY_SUBSET¶
SELECT ARRAY_SUBSET(ARRAY[1, 2], ARRAY[1, 2, 3]) AS is_subset;
GRAPH_JACCARD_SIMILARITY¶
SELECT GRAPH_JACCARD_SIMILARITY(ARRAY['v1', 'v2', 'v3'], ARRAY['v1->v2', 'v2->v3'], ARRAY['v1', 'v2', 'v3', 'v4'], ARRAY['v1->v2', 'v2->v3', 'v3->v4']) AS jaccard_similarity;
END_DATE¶
SELECT END_DATE(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 4233033, 'FR') AS end_date;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.