Specific data profiling tool functionalities evaluated
Functionality type | Function | Data profiling tools capable of natively executing function | |||||||
K | DC | O | W | PP | AP | TOS | WR | ||
Single Column-Cardinalities
Refers to the uniqueness of data values contained in a particular column (Attribute) of a table (Entity) | No of rows | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
No of nulls | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Percentage of nulls | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
No of distinct values (cardinality) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Percentage of distinct values (No of distinct values divided by the no of rows) | ✓ | ✓ | ✓ | ✓ | |||||
Single Column-Value distribution
Presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values | Frequency histograms (equi-width, equi-depth, etc.) | ✓ | ✓ | ||||||
Minimum and maximum values in a numeric column | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Constancy (Frequency of most frequent value divided by number of rows) | ✓ | ✓ | ✓ | ||||||
Quartiles (three points that divide the numeric values into four equal groups) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Distribution of first digit in numeric values (to check Benford’s law) | ✓ | ✓ | ✓ | ||||||
Single Column-Patterns, datatypes and domains
Refers to the discovery of patterns and data types | Basic types (eg, numeric, alphanumeric, date, time) | ✓ | ✓ | ||||||
Database Management Systems-specific data type (eg, varchar, timestamp) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Measurement of Value length (minimum, maximum, average, median) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Maximum number of digits in numeric values | ✓ | ✓ | ✓ | ✓ | |||||
Maximum number of decimals in numeric values | ✓ | ✓ | ✓ | ||||||
Histogram of value patterns (Aa9…) | ✓ | ✓ | ✓ | ✓ | |||||
Generic semantic data type (eg, code, date/time, quantity, identifier) | ✓ | ✓ | ✓ | ✓ | |||||
Semantic domain (eg, credit card, first name, city) | ✓ | ✓ | ✓ | ✓ | |||||
Dependencies
Determines the dependent relationships within a data set | Unique column combinations (key discovery) | ✓ | |||||||
Relaxed unique column combinations | ✓ | ||||||||
Inclusion dependencies (foreign key discovery) | ✓ | ||||||||
Relaxed inclusion dependencies | ✓ | ||||||||
Functional dependencies | ✓ | ||||||||
Conditional functional dependencies | ✓ | ||||||||
Advanced Multi Column profiling
Determines the similarities and differences in syntax and data types between tables (entities) to determine which data might be redundant and which could be mapped together | Correlation analysis | ✓ | ✓ | ✓ | |||||
Association rule mining | ✓ | ||||||||
Cluster analysis | ✓ | ||||||||
Outlier detection | ✓ | ✓ | ✓ | ||||||
Exact duplicate tuple detection | ✓ | ✓ | ✓ | ||||||
Relaxed duplicate tuple detection | ✓ | ✓ | ✓ | ||||||
Total | 19 | 13 | 8 | 5 | 30 | 10 | 15 | 8 |
AP, Aggregate Profiler; DC, DataCleaner; K, Knime; O, orange; PP, Pandas Profiling (Python); TOS, Talend Open Studio for Data Quality; W, WEKA; WR, WhiteRabbit.