InsightIQ Blog
The Necessity of Data Profiling
Sep 21 2009
Data profiling is the process of analyzing data sets to gather information on data types, statistics, patterns, relationships and other attributes. This process is a critical task often scheduled just prior to the start of a design phase for a new database initiative. Typically, data profiling is seen primarily as an input for data modeling and ETL design to identify data types and entity relationships. The value of thorough data profiling actually extends far beyond these tasks and may help steer the direction of the project and ensure its success.
Below are some of the ways data profiling can add value during a new database initiative:
- Reduce project risk - uncover data questions, issues and hidden (or even unknown) business rules prior to system design.
- Estimate tasks more accurately- reduce delays or impediments in project delivery through better estimation and project planning. By understanding the source data and its complexities, teams can work together to accurately assess the effort required to complete tasks and meet requirements.
- Plan for growth - accurately estimate database storage requirements, system growth potential, and data archive strategies.
- Identify data relationships and hierarchies - the traditional output of the data profile and a critical input for comprehensive data modeling.
- Uncover data issues - discover issues or gaps within source data and address these design problems prior to starting any development tasks. This will significantly reduce rework and keep project milestones on track.
- Define business rules - help uncover business rule complexities and define these rules with the team based on the available source data. The data profile can also identify areas where new business rules need to be defined or old business rules may be retired.
- Anticipate ETL requirements - helps to architect the database solution to handle business requirements and associated data intricacies. Knowing the data well upfront will provide the architect with the information needed to address issues, cleanse or transform the data to conform to the requirements, and to reduce the potential for downstream system issues. Additionally, anticipating the complexities of the data and implementing an optimized solution will improve database load efficiency and decrease system overhead.
- Define data validations -data profiles may help to engage the team in opportunities for data validations to reduce the amount of incomplete, inaccurate or bad data entering the system.
- Exception handling and reporting - similar to the data validations, the system design can be architected to process and allow reporting on source data issues to better identify the origin of the problems.
- Drive system testing requirements - design additional system tests cases to anticipate potential data issues or intricacies inspired by the data profiling sessions.
Data profiling is more than just a quick analysis for data modeling purposes; it should be used as a fundamental building block to the overall database initiative. Think, discuss and share the data profile findings with all members of the project team to address any issues or design considerations as soon as possible. The process may seem arduous at times and less than glamorous, but it is much better to uncover any surprises prior to development to keep the project on track.



From MJ: You're right on target. You have to use control groups to measure effectiveness.…
From Anthony Famularo: Similarly, realtors in my area are just as guilty of untargeted marketing as the airlines. Every day…
From Bill Connolly: I should, for the sake of disclosure purposes, mention that I also work for CSG Systems.