Management via Email? Think Again!

Even with the proliferation of online tools, SaaS software providers, and free Google apps, I constantly see teams that attempt to manage projects via email.  When joining a team or project midstream, below are some of the questions that I often ask:

  • How many requests have been made for certain types of data?  Are you seeing any patterns?
  • How many requests are you receiving per day, per week, per month, or per quarter?
  • On average, how long does it take the team to respond to these requests?  Do you have an SLA and are you meeting it?
  • How long does it take the end-users to test once the change has been loaded in the UAT environment?
  • How many requests have we seen that are really “training issues” (i.e. where the report or data already exists, but the user was unaware)?
  • Are we seeing more custom report-related requests, or requests to modify the underlying data model?

Are you able to answer these questions from email alone?  The answer is maybe – but it will take a long time to sift through your emails and arrive at the appropriate answer.

Why not track this information in a system and manage these types of statistics daily via a dashboard?  Oh – and by the way – if you setup your tracking system in the right way, it can become intelligent.  You’re probably wondering what I mean by that statement.  Well, let’s look at an example:

  1. Request is made for a new custom report (System notifies the team that a new request has been received)
  2. Team member is assigned to “own” the creation of this new report (System notifies team member which is assigned)
  3. Team member works with the end-user to create a functional requirements document (System has the ability to store this document inline with the original request)
  4. Team member creates the report and writes an accompanying technical specification (System has the ability to store the technical document inline with the original request and functional requirements document)
  5. Team member places report in the UAT environment (System notifies the end-user that the report is available in UAT and provides a link to accept or deny)
  6. Report is approved and placed into production environment (System notifies the user community that the report is available)

This is a very basic example, but this is what I mean by having an “intelligent” system.  Your system should be designed not only to capture the right pieces of data for your project, but also the associated workflow.  Many systems have email notification and basic reporting functions built-in as well.  Throughout my career, I have used many systems to manage a process through technology.  Depending on your budget and how quickly you need the environment available, you may prefer different vendors.  Here are a few of my favorites with a bit of my personal analysis:

PM Tool Comparison

PM Tool Comparison

If you are looking for a recommended tool, I will provide the following analysis based upon my own experience.  You should always carefully consider your requirements and procure the right tool for your organization.

Robert’s Quick Recommendations:

  • Intuit Quickbase:  I recommend Quickbase for projects that require a solution to be in place quickly – and one that requires a great deal of flexibility around workflow and notification.  I have used Quickbase very successfully to manage three separate business intelligence projects.  It is a great tool that is flexible and easy to setup.  It also doesn’t require programmers like a sophisticated application on the platform might.  My favorite Quickbase feature:  The ability to control the data model and write your own Excel-like functions.  Built-in report features are nice, too!  Please see my previous post, “Technology isn’t everything, there is PROCESS too!” for additional screen shots of a pre-configured instance of Quickbase.

Function in Intuit Quickbase

Standard Report in Intuit Quickbase

  • Basecamp:  I recommend Basecamp for projects that require a high degree of collaboration.  I’ve used Basecamp successfully to manage iterative web design projects.  It’s a great tool to keep communication centrally located, store documents for the team to share, and to track critical milestones.  My favorite Basecamp feature:  the mobile app!  I like the “latest updates” feature, too.
Basecamp iPhone App

Basecamp iPhone App

Basecamp Latest Updates Feature

Basecamp Latest Updates Feature

  • Microsoft Sharepoint 2013:  I recommend Microsoft Sharepoint 2013 for more mature shops.  I have used Sharepoint successfully to manage the development of software projects.  Combined with the functionality of Microsoft’s Team Foundation Server, Sharepoint can become an invaluable tool.  If you are working with a business intelligence project, Sharepoint also has components such as PerformancePoint and PowerView that may interest your data guys.   Sharepoint requires the right hardware and technical expertise to setup and maintain, but it is feature rich and integrates seamlessly with the Office suite.  Microsoft now offers Sharepoint Online if you’d like to offload the technical headache and put it in the cloud.   My favorite feature of Sharepoint:  As I BI guy, I love PowerView and PowerPivot.  Very cool!


  • Smartsheet:  I recommend Smartsheet as a replacement for Microsoft Project/Project Server in start-up shops.  Smartsheet is an excellent collaborative project management tool.  It’s not fancy, but it is an effective way to track work, collaborate, and ensure that critical projects stay on time and on budget.  My favorite Smartsheet feature:  The ability to setup automatic alerts when items in the project plan are changed.
Smartsheet Alerts

Smartsheet Alerts

A Compilation of My Favorite DW Resources

Recently, I received an email as part of a listserv from a colleague at  HEDW, or Higher Education Data Warehousing Forum, is a network of higher education colleagues dedicated to promoting the sharing of knowledge and best practices regarding knowledge management in colleges and universities, including building data warehouses, developing institutional reporting strategies, and providing decision support.

In the email that I referenced above, my colleague sent a link to an IBM Redbooks publication titled, “Dimensional Modeling: In a Business Intelligence Environment.”  This is a good read for someone that wants the basics of data warehousing.  It also may be a good refresher for others.  Here’s a short description of the book:

In this IBM Redbooks publication we describe and demonstrate dimensional data modeling techniques and technology, specifically focused on business intelligence and data warehousing. It is to help the reader understand how to design, maintain, and use a dimensional model for data warehousing that can provide the data access and performance required for business intelligence.

Business intelligence is comprised of a data warehousing infrastructure, and a query, analysis, and reporting environment. Here we focus on the data warehousing infrastructure. But only a specific element of it, the data model – which we consider the base building block of the data warehouse. Or, more precisely, the topic of data modeling and its impact on the business and business applications. The objective is not to provide a treatise on dimensional modeling techniques, but to focus at a more practical level.

There is technical content for designing and maintaining such an environment, but also business content.

Dimensional Modeling: In a Business Intelligence Environment

Dimensional Modeling: In a Business Intelligence Environment

In reading through a few responses on the listserv, it compelled me to produce a list of some of my favorite BI books.  I’ll publish a part II to this post in the future, but here is an initial list that I would recommend to any BI professional.  It is also worth signing up for the Kimball Group’s Design Tips.  They are tremendously useful.

Related Articles:

Contain DW/BI Scope Creep and Avoid Scope Theft by Bob Becker @ Kimball Group

I just read a great article by Bob Becker at the Kimball Group.  It addresses some of the key issues of scope creep in BI projects.  I find the Kimball Group a great resource for anyone implementing or maintaining a BI system.

Read the entire article titled, “Design Tip #154 Contain DW/BI Scope Creep and Avoid Scope Theft” below:

10 Steps to Data Quality Delight!

Data quality is always an aspect of business intelligence (BI) projects that seems to be deprioritized.  It is easy to look at the beautiful visualizations and drill-through reports that are key selling features of a BI project.  However, this article is about the value of cleansing your data so that these tools will work seamlessly with the data model that you establish.  Everyone knows the IT saying, “Garbage in.  Garbage Out.”  That holds entirely true with BI projects.  If the incoming data is dirty, it is going to be very difficult to efficiently process the data and make it available for a reporting platform.  This isn’t an easy problem to solve either.  When working across multiple functional areas, you may also have different sets of users that are entering data into the system in DIFFERENT ways.  So, in this instance, you may not have a data quality issue, but a business process issue.

As I have worked through my BI projects, here are 10 steps that I have followed to work with teams to create a data-centric culture and to improve data integrity.   I hope that these are of use to you…and please feel free to share any additional best practice in the comments of this blog!  We can all learn from one another.

Data Quality Workflow

Data Quality Workflow

  • Step #1:  Build data profiling and inspection into the design of your project
    Don’t wait until you are about to go-live to start looking at the quality of your data.  From the very beginning of your project, you should start to profile the data that you are loading into your BI platform.  Depending on your technology stack, there are multiple tools that will aid you in data profiling and inspection.  You might consider tools such as Informatica Analyst, or Microsoft SSIS Data Profiler.  A quick Google search will provide many alternatives such as Talend.  Regardless of the tool, make sure that you incorporate this activity into your BI project as soon as possible.  You’ll want to do a fair amount of inspection on each system that you intend to load into your BI platform.

    Informatica Analyst

    Informatica Analyst

    Microsoft SSIS Data Profiler

    Microsoft SSIS Data Profiler

    Talend Data Quality Tool

    Talend Data Quality Tool

  • Step #2:  Don’t be afraid to discuss these data quality issues at an executive level (PRIOR TO THE EXECUTION PHASE OF THE PROJECT)
    Awareness is always a key factor for your executive team.  Executives and executive sponsors need to know about the data quality issues as soon as possible.  Why?  You will need their support not only to address the data quality issues, but sometimes these issues stem from poor business process and training.  Their support will be critical to address either issue.

  • Step #3:  Assign ownership and establish accountability
    Assign ownership for data as soon as possible.  This will assist you to not only to resolve the data quality issues, but these key data stewards may be able to help identify additional data quality issues as they may be more familiar with their data than you.  In most cases, they have inherited this bad data too, and will likely want to partner with you to fix it.  However, you must consider that it will also place a burden on them from a bandwidth perspective.  Unless dedicated to your project, they will also have a day job.  Keep this in mind during your planning and see if you can augment and support these data cleansing efforts with your team.
  • Step #4:  Define rules for the data
    One of the biggest challenges that I continue to see is when data stewards do not want to cleanse their data, they want the ETL scripts to handle the 1,001 permutations of how the data should be interpreted.  While the ETLs can handle some of this logic, the business owners need to ensure that the data is being entered into the transactional system via a single set of business processes and that it is being done consistently and completely.  Usually, the transactional systems can have business rules defined and field requirements put in place that can help to enforce these processes.  In some cases, the transaction systems are sophisticated enough to handle workflow too.  Utilize these features to your advantage and do not over-engineer the ETL processes.  Not only will this be time consuming to initially develop, but it will be a management nightmare moving forward.
  • Step #5:  Modify business process as needed
    If you are working cross-functionally, you may run into the need to revise business processes to support consistent data entry into the transactional systems.  Recently, I was working on a project across 6 HR departments.  The net of their hiring process was the same, but they had 6 different processes and unfortunately, they were utilizing the same transactional system.  We had to get their executives together and do some business process alignment work before we could proceed.  Once the business process is unified, you then have to consider the historical data.  Does it need to be cleansed or transformed?  In our case it did.  Don’t underestimate this effort!
  • Step #6:  Prioritize and make trade-offs.  Data will rarely be perfect.
    Once you have revised business process and defined data cleansing activities, you will need to prioritize them.  Rarely are you in a position where data is perfect or resources are unlimited.  If you have done your design work correctly, you will have a catalog of the most critical reports and key pieces of data.  Focus on these areas first and then expand.  Don’t try to boil the ocean.  Keep your data cleansing activities as condensed as possible and make an honest effort to try to support the business units as much as possible.  In my experience, the BI developers can generally augment the full time staff to get data cleansing and data corrections done more efficiently.  However, make sure that the business unit maintains responsibility and accountability.  You don’t want the data to become IT’s problem.  It is a shared problem and one that you will have to work very hard to maintain moving forward.
  • Step #7:  Test and make qualitative data updates
    As you prioritize and move through your data cleansing checklist, ensure that you have prioritized the efforts that will reap the largest reward.  You might be able to prioritize a few smaller wins at first to show the value of the cleansing activities.  You should then align your efforts with the primary requirements of your project.  You may be able to defer some of the data cleansing to later stages of the project, or handle it in a more gradual way.
  • Step #8:  Setup alerts and notifications for future discrepancies
    After data cleansing has occurred and you feel that you have the data in a good state, your job is not over!  Data quality is an ongoing activity.  You almost always run into future data quality issues and governance needs to be setup in order to address these.  Exception reports should be setup and made available “on-demand” to support data cleansing.  Also, one of my favorite tools is data-driven subscriptions, or report bursts.  Microsoft uses the “data-driven subscription” terminology.  IBM Cognos uses the term “report burst.”  Once you have defined the type of data integrity issues that are likely to occur (missing data, incomplete data, inaccurate data, etc.), you can setup data-driven subscriptions, or report bursts, that will prompt the data stewards when these issues occur.  Of course, at the end of the day, you still have the issue of accountability.  We’ll take a look at that in the next step.  Depending on the tool that you using, you may have the capability of sending the user an exception report with the data issue(s) listed.  In other systems, you may simply alert the user of a particular data issue and then they must take action.  These subscriptions should augment the exception reports that are available “on-demand” in your reporting portal.

    Microsoft SSRS Data-Driven Subscription

    Microsoft SSRS Data-Driven Subscription

    IBM Cognos Report Burst

    IBM Cognos Report Burst

  • Step #9:  Consider a workflow to keep data stewards accountable
    So, what now?  The user now has an inbox full of exception reports, or a portal inbox full of alerts, and they still haven’t run the manual, on-demand exception report.  Data integrity issues are causing reporting problems as the data is starting to slip in its quality.  You have a few options here.  In previous projects, I have setup a bit of workflow around the data-driven subscriptions.  The first port of call is the data steward.  They are alerted of an issue with the data and a standard SLA is set to allow them an adequate amount of time to address the issue.  After that SLA period expires, the data issue is then escalated to their line manager.  This can also be setup as a data-driven subscription.  If both steps fail (i.e. both the data steward and the line manager are ignoring the data issue), then it is time to re-engage with your executive committee.  Make the data issues visible and help the executives understand the impact of the data being inaccurate.  Paint a picture for the executive about why data is important.  To further illustrate your point, if you have an executive dashboard that is using this data…it may be worthwhile to point out how the data integrity issue may impact that dashboard.  Not many executives want to be in a position where they are making decisions on inaccurate data.
  • Step #10:  Wash, rinse, and repeat
    By the time that you have gotten to this point, it will likely be time to fold in another transactional system into your BI platform.  Remember this process and use it again!WashRinseRepeat

Busting 10 Myths about Hadoop by Philip Russom @ TDWI

Recently, I read a very informative white paper which was published by TDWI’s Philip Russom.  The research in this report was sponsored by some of the key BI players below – so it had significant backing.

Integrating Hadoop into Business Intelligence and Data Warehousing:  Research Sponsors

Integrating Hadoop into Business Intelligence and Data Warehousing: TDWI Research Sponsors

I wanted to share the top 10 myths about Hadoop from TDWI’s report.  I found them insightful and you may as well:

Credit:  Integrating Hadoop into Business Intelligence and Data Warehousing by Philip Russom @ TDWI

  • Fact #1:  Hadoop consists of multiple products
    We talk about Hadoop as if it’s one monolithic thing, but it’s actually a family of open source  products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop  products are also available via vendor distributions; more on that later.)  The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System  (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can  combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL  engine for low-latency data access to HDFS and Hive data.
  • Fact #2:  Hadoop is open source but available from vendors, too
    Apache Hadoop’s open source software library is available from ASF at For users  desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include  additional administrative tools, maintenance, and technical support. A handful of vendors offer their  own non-Hadoop-based implementations of MapReduce.
  • Fact #3: Hadoop is an ecosystem, not a single product
    In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of  vendor products (e.g., database management systems and tools for analytics, reporting, and DI)  that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
  • Fact #4: HDFS is a file system, not a database management system (DBMS)
    Hadoop is primarily a distributed file system and therefore lacks capabilities we associate with a  DBMS, such as indexing, random access to data, support for standard SQL, and query optimization.  That’s okay, because HDFS does things DBMSs do not do as well, such as managing and processing  massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can  layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.
  • Fact #5: Hive resembles SQL but is not standard SQL
    Many of us are handcuffed to SQL because we know it well and our tools demand it. People who  know SQL can quickly learn to hand code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI believes that over time, Hadoop products will support standard SQL and  SQL-based vendor tools will support Hadoop, so this issue will eventually be moot.
  • Fact #6: Hadoop and MapReduce are related but don’t require each other
    Some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some relational DBMSs. Some users deploy HDFS with Hive or HBase, but not MapReduce.
  • Fact #7: MapReduce provides control for analytics, not analytics per se
    MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for a wide variety of hand-coded logic  and other applications—not just analytics.
  • Fact #8: Hadoop is about data diversity, not just data volume
    Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true,  and it’s exactly what brings many users to Apache HDFS and related Hadoop products. After all,  many types of big data that require analysis are inherently file based, such as Web logs, XML files,  and personal productivity documents.
  • Fact #9: Hadoop complements a DW; it’s rarely a replacement
    Most organizations have designed their DWs for structured, relational data, which makes it difficult  to wring BI value from unstructured and semistructured data. Hadoop promises to complement
    DWs by handling the multi-structured data types most DWs simply weren’t designed for.  Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adoptors offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads with which they excel.
  • Fact #10: Hadoop enables many types of analytics, not just Web analytics
    Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other  Web data, but other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer base segmentation, fraud detection, and  risk analysis—can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view of customers, financials, partners, and other business entities.

Philip also did a nice job in this white paper in clarifying the status of current HDFS implementations.  It is represented well in the graphic below.

Status of HDFS Implementations

Status of HDFS Implementations

To sort out which Hadoop products are in use today (and will be in the near future), this report’s
survey asked: Which of the following Hadoop and related technologies are in production in your
organization today? Which will go into production within three years? These
questions were answered by a subset of 48 survey respondents who claim they’ve deployed or used
HDFS. Hence, their responses are quite credible, being based on direct, hands-on experience.

HDFS and a few add-ons are the most commonly used Hadoop products today.  HDFS is near the top of
the list (67%) because most Hadoop-based applications demand HDFS as the base
platform. Certain add-on Hadoop tools are regularly layered atop HDFS today:

  • MapReduce (69%) for the distributed processing of hand-coded logic, whether for analytics or for fast data loading and ingestion
  • Hive (60%) for projecting structure onto Hadoop data so it can be queried using a SQL-like language called HiveQL
  • HBase (54%) for simple, record-store database functions against HDFS’s data

If this information has been helpful to you, check out the full report from TDWI below.  How is your organization using Hadoop?

Related Articles:

Project Management in Technical Projects: Isn’t this a line item that I can remove?

Project Management

In numerous engagements, and working both in the position of the client and the consultant, I have been continually asked about the value of project management.  I often hear questions like this:

  • My budget has been reduced.  Do we really need project management services on this project?
  • We have an FTE that will be responsible for this project.  Why do I need to hire a vendor project manager also?
  • Project management is expensive.  Is it necessary?
  • I don’t understand project management.  Seems like just another “fee” to me.  Does the project manager just provide status reports?  Surely, we can pull together our own status reports.

Let’s start to address these questions by reviewing a few of the statistics provided by a PMSolutions report titled, “The State of the PMO 2010“.  This study found the following value benchmarks:

PMO Value Benchmarks

Another study was completed in 2007 by PricewaterhouseCoopers, titled, “Insights and Trends: Current Program and Project Management Practices.”  This study found the following:

  • 80 percent of higher performing projects used a certified project manager
  • 50 percent of project failure was traceable to poor (or no) project management (Bad estimates/deadlines, Scope changes, Poor resource planning).
Certification Versus Project Performance

Certification Versus Project Performance

Reasons for Project Failure

Reasons for Project Failure

Outside of these facts, I have personally found project management to be of extreme value on technical projects.  You commonly hear the value proposition explained as:

  • Better expectation-setting through up-front estimating, planning, and project definition.
  • Faster execution through the reuse of common processes and templates.
  • Fewer project problems encountered when utilizing proactive project management processes.
  • Better organizational decision making through more effective project communication.
  • Higher client satisfaction and less rework by building a higher quality product the first time.

For technical projects, project managers become even more valuable.  They generally grow within an organization into a project management role.  This infuses a lot of great technical knowledge into the project and can help tremendously to ensure that the right architecture is being put in place and the minimal rework is being done.  This equates to cost savings and efficiency.

If you are considering the proposition of hiring a vendor project manager, the vendor project manager will not only have this technical knowledge, but they will also be invaluable when securing vendor resources for the project and managing through their own organization.  Many BI organizations are a combination of M&As that have taken place.  You will need help in navigating their corporate structure and avoiding pitfalls in their technology!

I’ll leave you with what I felt was a good representation of the project management process.  My question to the skeptics is:  why wouldn’t you want this structure and value as an integral part of your project?  I would.

Project Management Cycle

Project Management Cycle

Related Articles:

Benefit of Column-store Indexes

With the release of Microsoft SQL Server 2012, Microsoft has bought into the concept of column-store indexes with its VertiPaq technology.  This is similar to the approach taken by Vertica.  In a very simplistic example, below is a standard data set:

LastName FirstName BusinessUnit JobTitle
Johnson Ben Finance Director
Stevens George HR Recruiter
Evans Bryce Advancement Major Gift Officer

In a common row-store, this might be stored on the disk as follows:

  • Johnson, Ben, Finance, Director
  • Stevens, George, HR, Recruiter
  • Evans, Bryce, Advancement, Major Gift Officer

Column-store indexes store each column’s data together.  In a column-store, you may find the information stored on the disk in this format:

  • Johnson, Stevens, Evans
  • Ben, George, Bryce
  • Finance, HR, Advancement
  • Director, Recruiter, Major Gift Officer

The important thing to notice is that the columns are stored individually.  If your queries are just doing SELECT LastName, FirstName then you don’t need to read the business unit or job title.  Effectively, you read less off the disk.  This is fantastic for data warehouses where query performance is paramount.   In their white paper, Microsoft uses the illustration below to better explain how columnstore indexes work:

Microsoft's Columnstore Index Illustration

Microsoft’s Columnstore Index Illustration

The benefits of using a non-clustered columnstore index are:

  • Only the columns needed to solve a query are fetched from disk (this is often fewer than 15% of the columns in a typical fact table)
  • It is easier to compress the data due to the redundancy of data within a column
  • Buffer hit rates are improved because data is highly compressed, and frequently accessed parts of commonly used columns remain in memory, while infrequently used parts are paged out.

Microsoft recommends that columnstore indexes are used for fact tables in datawarehouses, for large dimensions (say with more than 10 millions of records), and any large tables designated to be used as read-only.  When introducing a columnstore index, tables do become read only.  Some may see this as a disadvantage.  To learn more about Microsoft’s columnstore indexes, read their article titled, “Columstore Indexes Described.”

So let’s take a quick look at how to do this in the 2012 SQL Server Management Studio.  If you need to add a new non-clustered column store index, you can do so as seen below:

Columnstore index in MS SQL Server 2012

Creating a new non-clustered columnstore index in Microsoft SQL Server 2012

To ensure that the non-clustered columnstore index is achieving the right result for you, test it.  Under the Query menu, ensure that you enable the “Display Estimated Execution Plan.”

Displaying the Query Execution Plan in SQL Server 2012

Displaying the Query Execution Plan in SQL Server 2012

Analyzing the Query Execution Plan in SQL Server 2012

Analyzing the Query Execution Plan in SQL Server 2012

You should now be in a position to analyze your old and new queries to determine which is optimal from a performance perspective.  You’ll likely see a great reduction in query time by using the non-clustered columnstore index.

What are your thoughts?  Are you using columnstore indexes in your environment?

Related Articles: