Table 2

Specific data profiling tool functionalities evaluated

Functionality typeFunctionData profiling tools capable of natively executing function
KDCOWPPAPTOSWR
Single Column-Cardinalities
Refers to the uniqueness of data values contained in a particular column (Attribute) of a table (Entity)
No of rows
No of nulls
Percentage of nulls
No of distinct values (cardinality)
Percentage of distinct values (No of distinct values divided by the no of rows)
Single Column-Value distribution
Presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values
Frequency histograms (equi-width, equi-depth, etc.)
Minimum and maximum values in a numeric column
Constancy (Frequency of most frequent value divided by number of rows)
Quartiles (three points that divide the numeric values into four equal groups)
Distribution of first digit in numeric values (to check Benford’s law)
Single Column-Patterns, datatypes and domains
Refers to the discovery of patterns and data types
Basic types (eg, numeric, alphanumeric, date, time)
Database Management Systems-specific data type (eg, varchar, timestamp)
Measurement of Value length (minimum, maximum, average, median)
Maximum number of digits in numeric values
Maximum number of decimals in numeric values
Histogram of value patterns (Aa9…)
Generic semantic data type (eg, code, date/time, quantity, identifier)
Semantic domain (eg, credit card, first name, city)
Dependencies
Determines the dependent relationships within a data set
Unique column combinations (key discovery)
Relaxed unique column combinations
Inclusion dependencies (foreign key discovery)
Relaxed inclusion dependencies
Functional dependencies
Conditional functional dependencies
Advanced Multi Column profiling
Determines the similarities and differences in syntax and data types between tables (entities) to determine which data might be redundant and which could be mapped together
Correlation analysis
Association rule mining
Cluster analysis
Outlier detection
Exact duplicate tuple detection
Relaxed duplicate tuple detection
Total1913853010158
  • AP, Aggregate Profiler; DC, DataCleaner; K, Knime; O, orange; PP, Pandas Profiling (Python); TOS, Talend Open Studio for Data Quality; W, WEKA; WR, WhiteRabbit.