Functionality type | Function | Data profiling tools capable of natively executing function | |||||||

K | DC | O | W | PP | AP | TOS | WR | ||

Single Column-Cardinalities
Refers to the uniqueness of data values contained in a particular column (Attribute) of a table (Entity) | No of rows | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

No of nulls | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |

Percentage of nulls | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||

No of distinct values (cardinality) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |

Percentage of distinct values (No of distinct values divided by the no of rows) | ✓ | ✓ | ✓ | ✓ | |||||

Single Column-Value distribution
Presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values | Frequency histograms (equi-width, equi-depth, etc.) | ✓ | ✓ | ||||||

Minimum and maximum values in a numeric column | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||

Constancy (Frequency of most frequent value divided by number of rows) | ✓ | ✓ | ✓ | ||||||

Quartiles (three points that divide the numeric values into four equal groups) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||

Distribution of ﬁrst digit in numeric values (to check Benford’s law) | ✓ | ✓ | ✓ | ||||||

Single Column-Patterns, datatypes and domains
Refers to the discovery of patterns and data types | Basic types (eg, numeric, alphanumeric, date, time) | ✓ | ✓ | ||||||

Database Management Systems-speciﬁc data type (eg, varchar, timestamp) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||

Measurement of Value length (minimum, maximum, average, median) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||

Maximum number of digits in numeric values | ✓ | ✓ | ✓ | ✓ | |||||

Maximum number of decimals in numeric values | ✓ | ✓ | ✓ | ||||||

Histogram of value patterns (Aa9…) | ✓ | ✓ | ✓ | ✓ | |||||

Generic semantic data type (eg, code, date/time, quantity, identiﬁer) | ✓ | ✓ | ✓ | ✓ | |||||

Semantic domain (eg, credit card, ﬁrst name, city) | ✓ | ✓ | ✓ | ✓ | |||||

Dependencies
Determines the dependent relationships within a data set | Unique column combinations (key discovery) | ✓ | |||||||

Relaxed unique column combinations | ✓ | ||||||||

Inclusion dependencies (foreign key discovery) | ✓ | ||||||||

Relaxed inclusion dependencies | ✓ | ||||||||

Functional dependencies | ✓ | ||||||||

Conditional functional dependencies | ✓ | ||||||||

Advanced Multi Column proﬁling
Determines the similarities and differences in syntax and data types between tables (entities) to determine which data might be redundant and which could be mapped together | Correlation analysis | ✓ | ✓ | ✓ | |||||

Association rule mining | ✓ | ||||||||

Cluster analysis | ✓ | ||||||||

Outlier detection | ✓ | ✓ | ✓ | ||||||

Exact duplicate tuple detection | ✓ | ✓ | ✓ | ||||||

Relaxed duplicate tuple detection | ✓ | ✓ | ✓ | ||||||

Total | 19 | 13 | 8 | 5 | 30 | 10 | 15 | 8 |

AP, Aggregate Profiler; DC, DataCleaner; K, Knime; O, orange; PP, Pandas Profiling (Python); TOS, Talend Open Studio for Data Quality; W, WEKA; WR, WhiteRabbit.