Datasources Usage Guide¶
Overview¶
Each project ingested into Druid produces three main datasources representing the process data at different levels:
- Cases: one row per case (process execution)
- Tasks (_vertex): one row per task execution
- Links (_edge): one row per transition between tasks
Datasource naming pattern:
- ${projectId}
- ${projectId}_vertex
- ${projectId}_edge
These datasources can be enriched using Druid lookups such as:
- ${projectId}_variant_name
- ${projectId}_excluded_cases
- ${projectId}_cases_is_completed
- business_rule_<ruleId>
- conformance_<modelId>
Lookups behave similarly to joins and can be queried with the LOOKUP() function.
Example:
SELECT caseid, LOOKUP(graphkey, '${projectId}_variant_name') AS variant_name
FROM ${projectId}
Cases datasource¶
The Cases datasource contains one row per case and aggregates information computed from all events belonging to that case.
Datasource name pattern: ${projectId}
Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48
This datasource is typically used for:
- case-level analytics
- duration and performance metrics
- filtering cases
- variant or process analysis
| Name | Type | Description |
|---|---|---|
| __time | Timestamp | start date of the first case event |
| enddate | Long | end date of the last case event |
| duration | Long | case duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | case unique identifier |
| version | String | data version |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| active_duration | Long | total active time in the case |
| concurrency | Double | case concurrency rate |
| rework | Long | case total rework |
| tasks_count | Long | number of tasks in the case (excluding start/end) |
| distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| distinct_vertices_count | Long | number of distinct tasks (including start and end) |
| distinct_edges | Array[String] | distinct transitions in the case |
| distinct_edges_count | Long | number of distinct transitions |
| hll_vertices | Complex<HLLSketch> | HyperLogLog sketch of distinct vertices |
| hll_edges | Complex<HLLSketch> | HyperLogLog sketch of distinct edges |
| last_tasks_names | Array[String] | names of last tasks before case completion |
| vertices_dates | Complex<json> | start/end timestamps for each task in the case |
| igx_metric-slug_lpkr | Double | aggregated value of the metric at the case level |
| igx_dimension-slug_lpkr | String | aggregated value of the dimension at the case level |
| is_predicted | Boolean | indicates if row is prediction data |
| is_predicted_finished | Boolean | indicates if predicted case is finished |
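As an illustration, the case-level columns above can be combined in a single aggregation; the filter on the ${projectId}_excluded_cases lookup discards obsolete case versions. This is a sketch using only the columns and lookups documented in this guide:

```sql
-- Average duration and total rework per variant, keeping only valid case versions
SELECT
  graphkey,
  COUNT(*) AS cases,
  AVG(duration) AS avg_duration_ms,
  SUM(rework) AS total_rework
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
GROUP BY graphkey
```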
Tasks datasource (_vertex)¶
The Tasks datasource contains one row per task execution within a case.
Datasource name pattern: ${projectId}_vertex
Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48_vertex
This datasource is typically used for:
- task performance analysis
- task-level filtering
- bottleneck detection
- concurrency analysis
| Name | Type | Description |
|---|---|---|
| ingestion_timestamp | Long | insertion timestamp |
| __time | Timestamp | task start date |
| enddate | Long | task end date |
| duration | Long | task duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | identifier of the task’s case |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| vertex_id | UUID | task non-unique identifier based on task name (two identical task names have the same vertex_id) |
| task_id | UUID | task unique id |
| vertex_name | String | task name (null for Start and End vertices) |
| igx_metric-slug_lpkr | Double | metric value for the event |
| igx_dimension-slug_lpkr | String | dimension value for event |
| concurrent_vertices | Multi UUID | concurrent tasks non-unique ids |
| loop_path | Multi UUID | for a loop starting and finishing with vertex_id, non-unique ids of links composing the loop |
| occurrences_in_case | Long | number of times the task appears in the case (based on task name) |
| predicted_occurrences_in_case | Long | number of times the task appears in the predicted case (based on task name) |
| rework | Long | number of repetitions of the task (without concurrency) in the case |
| case_igx_metric-slug_lpkr | Double | aggregated value of the metric for the case |
| case_igx_dimension-slug_lpkr | Multi String | aggregated value of the dimension for the case |
| case_start_date | Long | case start date |
| case_end_date | Long | case end date |
| case_duration | Long | case duration |
| case_rework | Long | case rework |
| case_concurrency | Double | case concurrency |
| case_active_duration | Long | case active duration |
| case_idle_duration | Long | case idle duration |
| case_tasks_count | Long | number of tasks in the case |
| case_distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| case_distinct_edges | Array[String] | distinct transitions in the case |
| case_is_predicted_finished | Boolean | indicates if predicted case is finished |
| is_predicted | Boolean | row belongs to prediction data |
| is_in_prediction | Boolean | task is part of predicted case |
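For example, the rework and occurrences_in_case columns can highlight tasks that are frequently repeated. This is a sketch based only on the columns documented above:

```sql
-- Tasks repeated the most across cases
SELECT
  vertex_name,
  SUM(rework) AS total_rework,
  AVG(occurrences_in_case) AS avg_occurrences
FROM ${projectId}_vertex
WHERE vertex_name IS NOT NULL  -- vertex_name is null for Start and End vertices
GROUP BY vertex_name
ORDER BY total_rework DESC
```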
Links datasource (_edge)¶
The Links datasource represents transitions between tasks in the process.
Datasource name pattern: ${projectId}_edge
Example: b7bd7acb-6a13-44a9-bebb-26f8e7e96c48_edge
The edges form a Directly-Follows Graph (DFG), meaning only direct transitions between tasks are represented (no BPMN gateways).
This datasource is typically used for:
- path analysis
- transition frequency analysis
- performance analysis between tasks
- loop detection
| Name | Type | Description |
|---|---|---|
| __time | Timestamp | edge start date (end of event A) |
| enddate | Long | edge end date (start of event B) |
| duration | Long | edge duration in milliseconds |
| graphkey | UUID | variant id the case belongs to (model variant) |
| processkey | UUID | process id the case belongs to (process variant) |
| processsequence | UUID | process_sequence id the case belongs to (sequence variant) |
| caseid | String | identifier of the edge’s case |
| case_version | String | <caseid>-<version>, used to filter obsolete case data |
| edge_id | UUID | edge non-unique identifier based on edge name (two identical edge names have the same edge_id) |
| edge_name | String | edge name (A->B) |
| start_taskid | UUID | unique identifier of the starting task A |
| start_vertexid | UUID | non-unique identifier of the starting task A |
| start_vertexname | String | name of starting task A |
| end_taskid | UUID | unique identifier of the end task B |
| end_vertexid | UUID | non-unique identifier of the end task B |
| end_vertexname | String | name of end task B |
| igx_metric-slug_lpkr__source | Double | task A metric value |
| igx_metric-slug_lpkr_destination | Double | task B metric value |
| igx_dimension-slug_lpkr_source | String | task A dimension value |
| igx_dimension-slug_lpkr_destination | String | task B dimension value |
| concurrentEdges | Multi UUID | non-unique ids of concurrent links |
| occurrences_in_case | Long | number of times the link appears in the case (based on the link name) |
| predicted_occurrences_in_case | Long | number of times the link appears in the predicted case (based on the link name) |
| rework | Long | number of repetitions of the link (without concurrency) in the case |
| case_igx_metric-slug_lpkr | Double | aggregated value of the metric at the case level |
| case_igx_dimension-slug_lpkr | Multi String | aggregated value of the dimension at the case level |
| case_start_date | Long | case start date |
| case_end_date | Long | case end date |
| case_duration | Long | case duration |
| case_rework | Long | case rework |
| case_concurrency | Double | case concurrency |
| case_active_duration | Long | case active duration |
| case_idle_duration | Long | case idle duration |
| case_tasks_count | Long | number of tasks in the case |
| case_distinct_vertices | Array[String] | distinct task identifiers in the case (including start and end) |
| case_distinct_edges | Array[String] | distinct transitions in the case |
| case_is_predicted_finished | Boolean | indicates if predicted case is finished |
| is_predicted | Boolean | row belongs to prediction data |
| is_in_prediction | Boolean | link is part of predicted case |
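For instance, the edge duration (the waiting time between the end of task A and the start of task B) can be aggregated per transition to locate slow hand-offs. This is a sketch using only the columns documented above:

```sql
-- Slowest transitions on average
SELECT
  edge_name,
  COUNT(*) AS occurrences,
  AVG(duration) AS avg_wait_ms
FROM ${projectId}_edge
GROUP BY edge_name
ORDER BY avg_wait_ms DESC
LIMIT 10
```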
Common Query Patterns¶
1 Filtering excluded cases¶
When multiple versions of a case exist during ingestion, some versions must be ignored.
Use the ${projectId}_excluded_cases lookup.
SELECT *
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
Explanation:
- excluded cases are stored in the lookup
- if lookup returns a value → case version is excluded
- if lookup returns NULL → case version is valid
2 Retrieving the variant name¶
Variants are stored as graphkey. To get the readable name:
SELECT
caseid,
graphkey,
LOOKUP(graphkey, '${projectId}_variant_name') AS variant_name
FROM ${projectId}
Example use cases:
- variant analysis
- filtering specific variants
WHERE LOOKUP(graphkey, '${projectId}_variant_name') = '3'
3 Filtering completed cases¶
Case completion is computed by business logic and stored in a lookup.
SELECT caseid
FROM ${projectId}
WHERE LOOKUP(caseid, '${projectId}_cases_is_completed') = 'TRUE'
Possible values depend on the end case rule (business rule).
4 Using business rules¶
Business rules results are stored in dynamic lookups.
Lookup name pattern:
business_rule_<ruleId>
Example:
SELECT caseid
FROM ${projectId}
WHERE LOOKUP(caseid, 'business_rule_12345') = 'TRUE'
5 Filtering by conformance score¶
Conformance scores are stored in lookups.
Example:
SELECT caseid,
LOOKUP(processkey, 'conformance_model1') AS fitness
FROM ${projectId}
WHERE LOOKUP(processkey, 'conformance_model1') < 0.8
This allows filtering low-conformance cases.
6 Task-level analysis¶
Example: average task duration
SELECT
vertex_name,
AVG(duration) AS avg_duration
FROM ${projectId}_vertex
GROUP BY vertex_name
7 Transition analysis¶
Example: most frequent transitions
SELECT
start_vertexname,
end_vertexname,
COUNT(*) AS transitions
FROM ${projectId}_edge
GROUP BY start_vertexname, end_vertexname
ORDER BY transitions DESC
8 Variant performance¶
Example: average case duration per variant
SELECT
LOOKUP(graphkey, '${projectId}_variant_name') AS variant,
AVG(duration) AS avg_duration
FROM ${projectId}
GROUP BY graphkey
9 Using prediction fields¶
Prediction-related fields allow comparing predicted and real data.
Example:
SELECT *
FROM ${projectId}
WHERE is_predicted = true
Example: predicted tasks
SELECT *
FROM ${projectId}_vertex
WHERE is_in_prediction = true
Best Practices¶
Always:
- filter excluded cases using the lookup
- avoid joins to filter on a dimension or metric when a case_xx column exists
- prefer the Cases datasource for aggregated metrics
- use Tasks datasource for activity analysis
- use Links datasource for path analysis
- use lookups instead of joins whenever possible
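Putting these practices together, a typical aggregated query reads the Cases datasource, filters excluded case versions through the lookup, and enriches the result with the variant name. This is a sketch based only on the columns and lookups documented above:

```sql
-- Case count and average duration per variant, valid versions only
SELECT
  LOOKUP(graphkey, '${projectId}_variant_name') AS variant,
  COUNT(*) AS cases,
  AVG(duration) AS avg_duration_ms
FROM ${projectId}
WHERE LOOKUP(case_version, '${projectId}_excluded_cases') IS NULL
GROUP BY graphkey
```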
Resources¶
IGrafx extends Druid SQL with several specific functions:
Sha256¶
SELECT SHA256('hello world') AS hash;
Md5¶
SELECT MD5('hello world') AS hash;
NewID¶
SELECT NEWID() AS uuid;
Tip: You can use NEWID() and MD5 to get a random sample (about 10%) of data.
SELECT "caseid" from "cases" WHERE BITWISE_AND(md5(CONCAT("caseid", NEWID())), 2147483647) / 2147483647.0 < 0.1;
Be careful: when using NEWID() in a query, it may return the same UUID for each row in the result set.
If you want a random sample of data using distributed SQL, you cannot use NEWID() directly; set a variable beforehand and use that instead.
BUSINESS_DURATION¶
SELECT BUSINESS_DURATION(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'), 'FR') AS duration;
You can optionally add four parameters to the function:
- LANGUAGE: the language to use for the calendar. The default is "en" (e.g. "fr" for French).
- HOLIDAY_TYPES: the type of holiday to use, e.g. "public observance". The possible values are public, observance, bank, school, optional. The default is "public".
- WEEK_HOURS: the business hours of the week. The default is "08:30-12:30, 13:30-17:30". You can specify a different range for each day of the week, e.g., '09:00-17:00' for Monday to Friday. The format must follow https://github.com/yusuke/businessCalendar4J/blob/main/README.md#specify-holidays-and-business-hours (set business hours).
- SPECIFIC_HOLIDAYS: custom holidays. The default depends on the calendar specified in the 3rd parameter. See https://github.com/commenthol/date-holidays/blob/c968f9a8e3cf734a32645efd605b67dac70ea509/docs/specification.md#grammar-for-day-rules for the format of specific holidays.
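Assuming the optional parameters are passed positionally after the calendar code (the exact signature may differ in your IGrafx version, so treat this as a sketch rather than a reference), a call customizing the business week could look like:

```sql
SELECT BUSINESS_DURATION(
  TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'),
  TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'),
  'FR',             -- calendar
  'fr',             -- LANGUAGE
  'public',         -- HOLIDAY_TYPES
  '09:00-17:00'     -- WEEK_HOURS (format per businessCalendar4J)
) AS duration;
```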
IS_BUSINESS_DAY¶
SELECT IS_BUSINESS_DAY(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 'FR') AS businessday;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
IS_HOLIDAY_DAY¶
SELECT IS_HOLIDAY_DAY(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 'FR') AS holiday;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
DURATION_BETWEEN¶
SELECT DURATION_BETWEEN(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00')], ARRAY[TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')], 'ALL', 'NEXT') AS duration;
You can optionally add five parameters to the function: CALENDAR, LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above, if you want DURATION_BETWEEN to be calculated with business durations.
IS_FOLLOWED_BY¶
SELECT IS_FOLLOWED_BY(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00')], ARRAY[TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')]) AS followed;
LONG_TO_HEX¶
SELECT LONG_TO_HEX(MD5('hello world')) AS hex;
ARRAY_FLATTEN¶
SELECT ARRAY_FLATTEN('LONG',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS set_array;
ARRAY_FLATTEN_AND_SORT¶
SELECT ARRAY_FLATTEN_AND_SORT('LONG',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS sorted_array;
CONCAT_FLATTEN_AND_SORT¶
SELECT CONCAT_FLATTEN_AND_SORT(',',ARRAY[1, 2, 3, ARRAY[4, 5], ARRAY[6, 7]]) AS sorted_string;
INTERVAL_UNION¶
SELECT INTERVAL_UNION(ARRAY[1, 2, 3, 5], ARRAY[4, 6]) AS interval_union;
INTERVAL_UNION_AGG¶
SELECT INTERVAL_UNION_AGG(field_array) AS interval_union_agg;
DURATION_INTERVAL¶
SELECT DURATION_INTERVAL(ARRAY[TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00')], 'FR') AS duration;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
GET_HOLIDAYS¶
SELECT GET_HOLIDAYS(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), TIME_PARSE('2023-10-01', 'YYYY-MM-dd', '+01:00'), 'FR') AS holidays;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.
GET_CALENDARS¶
SELECT GET_CALENDARS() AS calendars;
JSON_OBJECT_SIZE¶
SELECT JSON_OBJECT_SIZE("key1", "value1", "key2", "value2") AS json_size;
IN_BITMAP_BASE64¶
SELECT IN_BITMAP_BASE64("LONG_COL", 'roaringbitmapBase64_01', 'roaringbitmapBase64_02') AS in_bitmap;
ARRAY_SUCCESSIVE_DIFF¶
SELECT ARRAY_SUCCESSIVE_DIFF(ARRAY[10, 20, 30, 35, 40]) AS successive_diff;
ENCODE_BASE64_UTF8¶
SELECT ENCODE_BASE64_UTF8("TOENCODESTRING") AS encoded_str;
ARRAY_SUBSET¶
SELECT ARRAY_SUBSET(ARRAY[1, 2], ARRAY[1, 2, 3]) AS is_subset;
GRAPH_JACCARD_SIMILARITY¶
SELECT GRAPH_JACCARD_SIMILARITY(ARRAY['v1', 'v2', 'v3'], ARRAY['v1->v2', 'v2->v3'], ARRAY['v1', 'v2', 'v3', 'v4'], ARRAY['v1->v2', 'v2->v3', 'v3->v4']) AS jaccard_similarity;
END_DATE¶
SELECT END_DATE(TIME_PARSE('2023-09-21', 'YYYY-MM-dd', '+01:00'), 4233033, 'FR') AS end_date;
You can optionally add four parameters to the function: LANGUAGE, HOLIDAY_TYPES, WEEK_HOURS and SPECIFIC_HOLIDAYS, as described above.