Top 10 Gotchas of BI in a SaaS World

As we work through a BI implementation against a cloud-based ERP, I thought it would be worth sharing some of our insights.  As you know from previous posts, Georgetown has been undergoing a large ERP implementation on Workday, and we are now live on both HCM and Financials.

The Business Intelligence (BI) project has been running in parallel to this effort, and I wanted to share some initial findings in the event that they are helpful to others.

Here are our top 10 gotchas:

#10: Be cognizant of time zones.  When you are setting up incremental nightly data pulls, ensure that the user account you are utilizing is in the same time zone as the underlying tenant.  Otherwise, you might spend hours validating data that is simply hours out-of-sync (e.g. the extract user account is in EST while the tenant is in PST).
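
As a minimal sketch of the fix, anchor the incremental window to the tenant's time zone explicitly instead of trusting the server default (the zone names and parameter names below are assumptions, not anything vendor-specific):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

# Assume the tenant runs on US Pacific time while the ETL server and
# service account sit in US Eastern. Compute "the last 24 hours" in
# the tenant's zone so both sides agree on the window.
TENANT_TZ = ZoneInfo("America/Los_Angeles")

window_end = datetime.now(TENANT_TZ).replace(minute=0, second=0, microsecond=0)
window_start = window_end - timedelta(hours=24)

# Pass these bounds to the extract; the parameter names are made up.
params = {
    "updated_from": window_start.isoformat(),
    "updated_through": window_end.isoformat(),
}
print(params)
```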

#9: Chop the data into sufficiently sized extracts to prevent timeout issues.  This may be less of an issue on a nightly basis, but it is important if you are loading historical data or preparing for a time in which there is a large integration into the ERP.  Large integrations (e.g. from your Student Information System into your Financial System) can cause a nightly extract file to balloon and subsequently fail.
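
Here is a rough sketch of the chunking idea, assuming a paged report or API call; the fetch_page stub stands in for whatever vendor call you actually use:

```python
import csv

PAGE_SIZE = 50_000  # tune so each request finishes well inside the timeout

def fetch_page(offset, limit):
    """Placeholder for the vendor report/API call; returns a list of dicts."""
    raise NotImplementedError

def extract_all(path):
    """Pull a large extract in fixed-size pages and land it as one CSV."""
    offset, writer = 0, None
    with open(path, "w", newline="") as f:
        while True:
            rows = fetch_page(offset, PAGE_SIZE)
            if not rows:
                break
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
            writer.writerows(rows)
            offset += len(rows)
```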

#8: Send your team to deep ERP training – immediately!  Be prepared to know the ERP as well as the team administering it.

#7: Ensure that your BI team receives the appropriate security access in the SaaS ERP.  If this raises policy-related issues, work them through with the respective owners in advance.  You may have to work through operational protocols and procedures that you did not expect.

#6: Plan a strategy for how you will create BI-related content in your SaaS-based ERP.  Typically, this development needs to occur in a lower tenant and then be migrated to production.  How will you access these environments?  What is the path to migrate content?  Be prepared that some sandbox tenants may refresh on a weekly basis (i.e. BI content may not reside there for longer than a week).  Consider developing in a dev tenant, migrating to sandbox (which may refresh weekly), and then promoting to production.  The timing of this process should feed into your planning.

#5: Set up BI extract notifications.  We set up a specific email address which routes to our on-call team.  These alerts notify the team if any of the extracts fail so that we can quickly re-extract and reprocess the file as needed.
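
As a sketch of the alerting piece, the ETL wrapper can send the email itself when an extract fails; the addresses and SMTP relay below are placeholders:

```python
import smtplib
from email.message import EmailMessage

# Placeholders: your internal relay host and the shared on-call address.
SMTP_HOST = "smtp.example.edu"
ONCALL_ADDRESS = "bi-oncall@example.edu"

def notify_failure(extract_name: str, detail: str) -> None:
    """Email the on-call team when a nightly extract fails validation."""
    msg = EmailMessage()
    msg["Subject"] = f"[BI extract FAILED] {extract_name}"
    msg["From"] = "etl-alerts@example.edu"
    msg["To"] = ONCALL_ADDRESS
    msg.set_content(detail)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)
```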

#4: Prepare to either learn an API inside and out, or work with custom reports in the SaaS ERP.  Remember, there will be no direct connection to the database, so you’ll need to find alternate ways of efficiently extracting large amounts of data.  In Georgetown’s case, we wrote custom versions of reports in our SaaS-based ERP and then scheduled their output to an XML or CSV format.  These files are then processed by the nightly ETL routine.
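
For the custom-report route, the nightly pull can be as simple as an authenticated HTTPS download.  A minimal sketch, assuming the scheduled report is exposed at a URL that returns CSV (the endpoint, credentials, and file names below are all illustrative):

```python
import os
import requests

# Illustrative only: the endpoint path, tenant name, and auth mechanism
# depend entirely on your vendor and report definition.
REPORT_URL = (
    "https://erp.example.com/service/customreport/TENANT/"
    "BI_GL_Detail?format=csv"
)

resp = requests.get(
    REPORT_URL,
    auth=("extract_user", os.environ["EXTRACT_PASSWORD"]),
    timeout=600,  # generous timeout for a large nightly file
)
resp.raise_for_status()

# Land the file where the nightly ETL routine picks it up.
with open("BI_GL_Detail.csv", "wb") as f:
    f.write(resp.content)
```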

#3: Create a portal for consolidation and collaboration of reporting deliverables.  We set up a portal which provides instructions, documentation, links to metadata, the ability to ask for help, report guides, etc.  Make this easy for your end users to consume.

#2: Test, test, and test.  SaaS-based ERP tenants typically go down one night a week for patches and maintenance.  This can be problematic for nightly BI extracts.  Make sure that you are aware of the outage schedules and can plan around them.  ETL routines may need to start a bit later on the day on which maintenance occurs.
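
A trivial guard for this, assuming a weekly window you know in advance (the day and hours below are hypothetical):

```python
from datetime import datetime

# Hypothetical window: the tenant is down Saturdays 00:00-04:00 tenant time.
MAINT_WEEKDAY = 5      # Monday=0, so 5 = Saturday
MAINT_END_HOUR = 4     # 04:00

def etl_start_time(now: datetime) -> datetime:
    """Push the ETL kickoff past the maintenance window if we are inside it."""
    if now.weekday() == MAINT_WEEKDAY and now.hour < MAINT_END_HOUR:
        return now.replace(hour=MAINT_END_HOUR, minute=0, second=0, microsecond=0)
    return now

print(etl_start_time(datetime.now()))
```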

#1: Create an operational structure for the ongoing needs of the BI function in advance.  Georgetown is embarking upon a ‘BI Center of Excellence’ to handle this sort of cross-functional need; it will be comprised of subject matter experts who represent the different business units.  This will help the business units continue to drive the development of the platform and allow IT to play a supporting role.

 

Higher Education Data Warehouse Conference (HEDW) @ Cornell University

I just returned from an excellent conference (HEDW) which was administered by the IT staff at Cornell University. Kudos to the 2013 Conference Chair, Jeff Christen, and the staff of Cornell University for hosting this year! Below is a little bit more information about the conference:

The Higher Education Data Warehousing Forum (HEDW) is a network of higher education colleagues dedicated to promoting the sharing of knowledge and best practices regarding knowledge management in colleges and universities, including building data warehouses, developing institutional reporting strategies, and providing decision support.

The Forum meets once a year and sponsors a series of listservs to promote communication among technical developers and administrators of data access and reporting systems, data custodians, institutional researchers, and consumers of data representing a variety of internal university audiences.

There are more than 2,000 active members in HEDW, representing professionals from 700+ institutions across 38 different countries and 48 different U.S. states.

This conference has proven helpful to Georgetown University over the last 5 years.  It is a great opportunity to network with peers and share best practices around the latest technology tools.  And, sorry vendors, you are kept at bay.  This is important, as the focus of the conference is less on technology sales – and more about relationships and sharing successes.

Cornell University Outside of Statler Conference Center

Personally, this was my first year in attendance.  I gained a lot of industry insight, but it was also helpful to find peer organizations that are using the same technology tools.  We are about to embark upon an Oracle PeopleSoft finance-to-Workday conversion, so it was useful to connect with others going through similar projects.  For me specifically, it was valuable to learn how folks are starting to extract data from Workday for business intelligence purposes.

Higher Education Data Warehouse Conference

2013 HEDW Attendee List

My key take-aways from the conference were:

  • Business intelligence is happening with MANY tools.  We saw A LOT of technology.  Industry leaders in the higher education space still seem to be Oracle and Microsoft.  Oracle seemed to be embedded in more universities; however, many are starting projects on the Microsoft stack – particularly with the Blackboard Analytics team standardizing on the Microsoft platform.  IBM Cognos still seemed to be the market leader in terms of operational reporting; however, Microsoft’s SSRS is gaining momentum.  From an OLAP and dashboard perspective, it seemed like a mixed bag.  Some were using IBM BI Dashboards, while others were using tools such as OBIEE Dashboards, Microsoft SharePoint’s Dashboard Designer, and an emerging product – Pyramid Analytics.  Microsoft’s PowerPivot was also demonstrated heavily, and users like it!  Power View was mentioned, but no one seemed to have it up and running…yet.  Tableau was also a very popular choice and highly recommended.  Several people mentioned how responsive both Microsoft and Tableau had been to their needs pre-sale.
  • Business intelligence requires a SIGNIFICANT amount of governance to be successful.  We saw presentation after presentation about the governance structures that should have been set up, or about projects that had to be restarted in order to be governed in the appropriate way.  This includes changing business processes and ensuring that common data definitions are put in place across university silos.  A stove-piped approach does not work when you are trying to analyze data cross-functionally.
  • Standardizing on one tool is difficult.  We spoke to many universities that had multiple tools in play, largely due to the difficulty of change management and training.  It is worth making the investment in change management in order to standardize on the appropriate tool set.
  • Technology is expensive.  There is no one-size-fits-all.  Depending on the licensing agreements in place at your university, there may be a clear technology choice.  Oracle is expensive, but it may already be in use to support critical ERP systems.  We also heard many universities discuss their use of Microsoft due to the educational and statewide discounts available.
  • Predictive analytics are still future state.  We had brief discussions about statistical tools like SAS and IBM’s SPSS; however, these tools were not the focus of many discussions.  It seems that most universities are still trying to get fundamental ODS and EDW projects right.  Predictive analytics and sophisticated statistical tools are in use – but seem to be taking a back seat while IT departments get the more fundamental data models in place.  Most had a great deal of interest in these types of predictive analytics, but felt, “we just aren’t there yet.”  GIS data also came up in a small number of presentations and drew interest as well.  In fact, one presentation displayed a dashboard with student enrollment by county.  People like to see data overlaid on a map.  I can see more universities taking advantage of this soon.
  • Business intelligence technologists are in high demand and hard to find.  It was apparent throughout the conference that many universities are challenged to find the right technology talent.  Many are in need of employees that possess business intelligence and reporting skills.
  • Hadoop remains on the shelf.  John Rome from Arizona State gave an excellent presentation about Hadoop and its functional use.  He clarified how Hadoop got its name: its creator, Doug Cutting, named the project after his son’s stuffed yellow elephant!  John also presented a few experiments that ASU has been doing to evaluate the value that Hadoop may be able to bring to the university.  In ASU’s experiments, they used Amazon’s EC2 service to quickly spin up supporting servers and configure the services necessary to support Hadoop.  This presentation was entertaining, but it was almost the only mention of Hadoop during the entire conference.  Hadoop may have more use in research functions, but it does not seem widely adopted in key university business intelligence efforts as of yet.  I wonder if this will change by next year?
Doug Cutting with Son’s Stuffed Elephant


A Compilation of My Favorite DW Resources

Recently, I received an email as part of a listserv from a colleague at HEDW.org.  HEDW, or Higher Education Data Warehousing Forum, is a network of higher education colleagues dedicated to promoting the sharing of knowledge and best practices regarding knowledge management in colleges and universities, including building data warehouses, developing institutional reporting strategies, and providing decision support.

In the email referenced above, my colleague sent a link to an IBM Redbooks publication titled “Dimensional Modeling: In a Business Intelligence Environment.”  This is a good read for someone who wants the basics of data warehousing, and it may be a good refresher for others.  Here’s a short description of the book:

In this IBM Redbooks publication we describe and demonstrate dimensional data modeling techniques and technology, specifically focused on business intelligence and data warehousing. It is to help the reader understand how to design, maintain, and use a dimensional model for data warehousing that can provide the data access and performance required for business intelligence.

Business intelligence is comprised of a data warehousing infrastructure, and a query, analysis, and reporting environment. Here we focus on the data warehousing infrastructure. But only a specific element of it, the data model – which we consider the base building block of the data warehouse. Or, more precisely, the topic of data modeling and its impact on the business and business applications. The objective is not to provide a treatise on dimensional modeling techniques, but to focus at a more practical level.

There is technical content for designing and maintaining such an environment, but also business content.

Dimensional Modeling: In a Business Intelligence Environment

Dimensional Modeling: In a Business Intelligence Environment

Reading through a few responses on the listserv compelled me to produce a list of some of my favorite BI books.  I’ll publish a part II to this post in the future, but here is an initial list that I would recommend to any BI professional.  It is also worth signing up for the Kimball Group’s Design Tips – they are tremendously useful.

Benefit of Column-store Indexes

With the release of Microsoft SQL Server 2012, Microsoft has bought into the concept of column-store indexes with its VertiPaq technology.  This is similar to the approach taken by Vertica.  As a very simplistic example, below is a standard data set:

LastName | FirstName | BusinessUnit | JobTitle
-------- | --------- | ------------ | -------------------
Johnson  | Ben       | Finance      | Director
Stevens  | George    | HR           | Recruiter
Evans    | Bryce     | Advancement  | Major Gift Officer

In a common row-store, this might be stored on the disk as follows:

  • Johnson, Ben, Finance, Director
  • Stevens, George, HR, Recruiter
  • Evans, Bryce, Advancement, Major Gift Officer

Column-store indexes store each column’s data together.  In a column-store, you may find the information stored on the disk in this format:

  • Johnson, Stevens, Evans
  • Ben, George, Bryce
  • Finance, HR, Advancement
  • Director, Recruiter, Major Gift Officer

The important thing to notice is that the columns are stored individually.  If your queries only need SELECT LastName, FirstName, then you don’t have to read the business unit or job title at all.  Effectively, you read less off the disk.  This is fantastic for data warehouses, where query performance is paramount.  In their white paper, Microsoft uses the illustration below to better explain how columnstore indexes work:

Microsoft’s Columnstore Index Illustration
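
To make the access pattern concrete, here is a toy sketch in Python – not how SQL Server physically lays out pages, just the difference in what a query has to touch:

```python
# Toy illustration only: the same data in row and column form.
row_store = [
    ("Johnson", "Ben", "Finance", "Director"),
    ("Stevens", "George", "HR", "Recruiter"),
    ("Evans", "Bryce", "Advancement", "Major Gift Officer"),
]

column_store = {
    "LastName": ["Johnson", "Stevens", "Evans"],
    "FirstName": ["Ben", "George", "Bryce"],
    "BusinessUnit": ["Finance", "HR", "Advancement"],
    "JobTitle": ["Director", "Recruiter", "Major Gift Officer"],
}

# SELECT LastName, FirstName:
# the row store scans every field of every row...
from_rows = [(r[0], r[1]) for r in row_store]
# ...while the column store reads exactly two columns and nothing else.
from_columns = list(zip(column_store["LastName"], column_store["FirstName"]))

assert from_rows == from_columns
```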

The benefits of using a non-clustered columnstore index are:

  • Only the columns needed to solve a query are fetched from disk (this is often fewer than 15% of the columns in a typical fact table)
  • It is easier to compress the data due to the redundancy of data within a column
  • Buffer hit rates are improved because data is highly compressed, and frequently accessed parts of commonly used columns remain in memory, while infrequently used parts are paged out.

Microsoft recommends that columnstore indexes be used for fact tables in data warehouses, for large dimensions (say, more than 10 million records), and for any large tables designated to be used as read-only.  Note that once a columnstore index is introduced, the table becomes read-only.  Some may see this as a disadvantage.  To learn more about Microsoft’s columnstore indexes, read their article titled, “Columnstore Indexes Described.”

So let’s take a quick look at how to do this in the 2012 SQL Server Management Studio.  If you need to add a new non-clustered columnstore index, you can do so as seen below:

Creating a new non-clustered columnstore index in Microsoft SQL Server 2012
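
If you prefer to script it, the equivalent T-SQL can be run from any client.  Here is a sketch via Python and pyodbc, using the toy table from above (the DSN, table, and index names are placeholders):

```python
import pyodbc

# Placeholder DSN; point this at your warehouse database.
conn = pyodbc.connect("DSN=warehouse", autocommit=True)

# Same operation as the SSMS dialog above: a non-clustered columnstore
# index covering the columns we query most.
conn.execute("""
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Employee_ColumnStore
        ON dbo.Employee (LastName, FirstName, BusinessUnit, JobTitle);
""")
conn.close()
```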

To ensure that the non-clustered columnstore index is achieving the right result for you, test it.  Under the Query menu, ensure that you enable the “Display Estimated Execution Plan.”

Displaying the Query Execution Plan in SQL Server 2012

Analyzing the Query Execution Plan in SQL Server 2012

You should now be in a position to compare your old and new queries to determine which is optimal from a performance perspective.  You’ll likely see a great reduction in query time with the non-clustered columnstore index.
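
For a quick back-of-the-envelope timing comparison outside of SSMS, something like this sketch works (the DSN and query are placeholders; the execution plan and SET STATISTICS IO/TIME in SSMS give far better detail):

```python
import time
import pyodbc

# A typical fact-table aggregate that benefits from column elimination.
QUERY = "SELECT BusinessUnit, COUNT(*) FROM dbo.Employee GROUP BY BusinessUnit;"

conn = pyodbc.connect("DSN=warehouse")
start = time.perf_counter()
rows = conn.execute(QUERY).fetchall()
elapsed = time.perf_counter() - start
print(f"{len(rows)} rows in {elapsed:.3f}s")  # run before and after indexing
conn.close()
```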

What are your thoughts?  Are you using columnstore indexes in your environment?
