Data quality is always an aspect of business intelligence (BI) projects that seems to be deprioritized. It is easy to look at the beautiful visualizations and drill-through reports that are key selling features of a BI project. However, this article is about the value of cleansing your data so that these tools will work seamlessly with the data model that you establish. Everyone knows the IT saying, “Garbage in. Garbage Out.” That holds entirely true with BI projects. If the incoming data is dirty, it is going to be very difficult to efficiently process the data and make it available for a reporting platform. This isn’t an easy problem to solve either. When working across multiple functional areas, you may also have different sets of users that are entering data into the system in DIFFERENT ways. So, in this instance, you may not have a data quality issue, but a business process issue.
As I have worked through my BI projects, here are 10 steps that I have followed to work with teams to create a data-centric culture and to improve data integrity. I hope that these are of use to you…and please feel free to share any additional best practice in the comments of this blog! We can all learn from one another.
- Step #1: Build data profiling and inspection into the design of your project
Don’t wait until you are about to go-live to start looking at the quality of your data. From the very beginning of your project, you should start to profile the data that you are loading into your BI platform. Depending on your technology stack, there are multiple tools that will aid you in data profiling and inspection. You might consider tools such as Informatica Analyst, or Microsoft SSIS Data Profiler. A quick Google search will provide many alternatives such as Talend. Regardless of the tool, make sure that you incorporate this activity into your BI project as soon as possible. You’ll want to do a fair amount of inspection on each system that you intend to load into your BI platform.
- Step #2: Don’t be afraid to discuss these data quality issues at an executive level (PRIOR TO THE EXECUTION PHASE OF THE PROJECT)
Awareness is always a key factor for your executive team. Executives and executive sponsors need to know about the data quality issues as soon as possible. Why? You will need their support not only to address the data quality issues, but sometimes these issues stem from poor business process and training. Their support will be critical to address either issue.
- Step #3: Assign ownership and establish accountability
Assign ownership for data as soon as possible. This will assist you to not only to resolve the data quality issues, but these key data stewards may be able to help identify additional data quality issues as they may be more familiar with their data than you. In most cases, they have inherited this bad data too, and will likely want to partner with you to fix it. However, you must consider that it will also place a burden on them from a bandwidth perspective. Unless dedicated to your project, they will also have a day job. Keep this in mind during your planning and see if you can augment and support these data cleansing efforts with your team.
- Step #4: Define rules for the data
One of the biggest challenges that I continue to see is when data stewards do not want to cleanse their data, they want the ETL scripts to handle the 1,001 permutations of how the data should be interpreted. While the ETLs can handle some of this logic, the business owners need to ensure that the data is being entered into the transactional system via a single set of business processes and that it is being done consistently and completely. Usually, the transactional systems can have business rules defined and field requirements put in place that can help to enforce these processes. In some cases, the transaction systems are sophisticated enough to handle workflow too. Utilize these features to your advantage and do not over-engineer the ETL processes. Not only will this be time consuming to initially develop, but it will be a management nightmare moving forward.
- Step #5: Modify business process as needed
If you are working cross-functionally, you may run into the need to revise business processes to support consistent data entry into the transactional systems. Recently, I was working on a project across 6 HR departments. The net of their hiring process was the same, but they had 6 different processes and unfortunately, they were utilizing the same transactional system. We had to get their executives together and do some business process alignment work before we could proceed. Once the business process is unified, you then have to consider the historical data. Does it need to be cleansed or transformed? In our case it did. Don’t underestimate this effort!
- Step #6: Prioritize and make trade-offs. Data will rarely be perfect.
Once you have revised business process and defined data cleansing activities, you will need to prioritize them. Rarely are you in a position where data is perfect or resources are unlimited. If you have done your design work correctly, you will have a catalog of the most critical reports and key pieces of data. Focus on these areas first and then expand. Don’t try to boil the ocean. Keep your data cleansing activities as condensed as possible and make an honest effort to try to support the business units as much as possible. In my experience, the BI developers can generally augment the full time staff to get data cleansing and data corrections done more efficiently. However, make sure that the business unit maintains responsibility and accountability. You don’t want the data to become IT’s problem. It is a shared problem and one that you will have to work very hard to maintain moving forward.
- Step #7: Test and make qualitative data updates
As you prioritize and move through your data cleansing checklist, ensure that you have prioritized the efforts that will reap the largest reward. You might be able to prioritize a few smaller wins at first to show the value of the cleansing activities. You should then align your efforts with the primary requirements of your project. You may be able to defer some of the data cleansing to later stages of the project, or handle it in a more gradual way.
- Step #8: Setup alerts and notifications for future discrepancies
After data cleansing has occurred and you feel that you have the data in a good state, your job is not over! Data quality is an ongoing activity. You almost always run into future data quality issues and governance needs to be setup in order to address these. Exception reports should be setup and made available “on-demand” to support data cleansing. Also, one of my favorite tools is data-driven subscriptions, or report bursts. Microsoft uses the “data-driven subscription” terminology. IBM Cognos uses the term “report burst.” Once you have defined the type of data integrity issues that are likely to occur (missing data, incomplete data, inaccurate data, etc.), you can setup data-driven subscriptions, or report bursts, that will prompt the data stewards when these issues occur. Of course, at the end of the day, you still have the issue of accountability. We’ll take a look at that in the next step. Depending on the tool that you using, you may have the capability of sending the user an exception report with the data issue(s) listed. In other systems, you may simply alert the user of a particular data issue and then they must take action. These subscriptions should augment the exception reports that are available “on-demand” in your reporting portal.
- Step #9: Consider a workflow to keep data stewards accountable
So, what now? The user now has an inbox full of exception reports, or a portal inbox full of alerts, and they still haven’t run the manual, on-demand exception report. Data integrity issues are causing reporting problems as the data is starting to slip in its quality. You have a few options here. In previous projects, I have setup a bit of workflow around the data-driven subscriptions. The first port of call is the data steward. They are alerted of an issue with the data and a standard SLA is set to allow them an adequate amount of time to address the issue. After that SLA period expires, the data issue is then escalated to their line manager. This can also be setup as a data-driven subscription. If both steps fail (i.e. both the data steward and the line manager are ignoring the data issue), then it is time to re-engage with your executive committee. Make the data issues visible and help the executives understand the impact of the data being inaccurate. Paint a picture for the executive about why data is important. To further illustrate your point, if you have an executive dashboard that is using this data…it may be worthwhile to point out how the data integrity issue may impact that dashboard. Not many executives want to be in a position where they are making decisions on inaccurate data.
- Step #10: Wash, rinse, and repeat
By the time that you have gotten to this point, it will likely be time to fold in another transactional system into your BI platform. Remember this process and use it again!
- Weed Out Bad Data: The Importance of Data Quality (infotrellis.com)
- No matter how big it gets, data still demands management and quality checks (blogs.techworld.com)
- What is Dirty Data Costing You? (the-decisionfactor.com)