Higher Education Data Warehouse Conference (HEDW) @ Cornell University

I just returned from an excellent conference (HEDW), which was hosted by the IT staff at Cornell University. Kudos to the 2013 Conference Chair, Jeff Christen, and the staff of Cornell University for hosting this year! Below is a little more information about the conference:

The Higher Education Data Warehousing Forum (HEDW) is a network of higher education colleagues dedicated to promoting the sharing of knowledge and best practices regarding knowledge management in colleges and universities, including building data warehouses, developing institutional reporting strategies, and providing decision support.

The Forum meets once a year and sponsors a series of listservs to promote communication among technical developers and administrators of data access and reporting systems, data custodians, institutional researchers, and consumers of data representing a variety of internal university audiences.

There are more than 2000 active members in HEDW, representing professionals from 700+ institutions found in 38 different countries and 48 different states.

This conference has proven to be helpful to Georgetown University over the last 5 years. It is a great opportunity to network with peers and share best practices around the latest technology tools. And, sorry vendors, you are kept at bay. This is important because the focus of the conference is less on technology sales and more on relationships and sharing successes.

Cornell University Outside of Statler Conference Center


Personally, this was my first year in attendance. I gained a lot of industry insight, and it was also helpful to find peer organizations that are using the same technology tools. We are about to embark upon an Oracle PeopleSoft Finance to Workday conversion. It was helpful to connect with others who are going through similar projects. And for me specifically, it was helpful to learn how folks are starting to extract data from Workday for business intelligence purposes.

Higher Education Data Warehouse Conference


2013 HEDW Attendee List


My key take-aways from the conference were:

  • Business intelligence is happening with MANY tools. We saw A LOT of technology. Industry leaders in the higher education space still seem to be Oracle and Microsoft. Oracle seemed to be embedded in more universities; however, many are starting projects on the Microsoft stack – particularly with the Blackboard Analytics team standardizing on the Microsoft platform. IBM Cognos still seemed to be the market leader in terms of operational reporting; however, Microsoft’s SSRS is gaining momentum. From an OLAP and dashboard perspective, it seemed like a mixed bag. Some were using IBM BI Dashboards, while others were using tools such as OBIEE Dashboards, Microsoft SharePoint’s Dashboard Designer, and an emerging product – Pyramid Analytics. Microsoft’s PowerPivot was also demonstrated frequently, and users like it! Power View was mentioned, but no one seemed to have it up and running…yet. Tableau was also a very popular choice and highly recommended. Several people mentioned how responsive both Microsoft and Tableau had been to their needs pre-sale.
  • Business intelligence requires a SIGNIFICANT amount of governance to be successful. We saw presentation after presentation about the governance structures that should have been set up, or about projects that had to be restarted in order to be governed in the appropriate way. This includes changing business processes and ensuring that common data definitions are put in place across university silos. A stove-piped approach does not work when you are trying to analyze data cross-functionally.
  • Standardizing on one tool is difficult.  We spoke to many universities that had multiple tools in play.  This is due to the difficulty of change management and training.  It is worth making the investment for change management in order to standardize on the appropriate tool set.
  • Technology is expensive. There is no one-size-fits-all answer. Depending on the licensing agreements that are in place at your university, there may be a clear technology choice. Oracle is expensive, but it may already be in use to support critical ERP systems. We also heard many universities discuss their use of Microsoft due to the educational and statewide discounts available.
  • Predictive analytics are still future state. We had brief discussions about statistical tools like SAS and IBM’s SPSS; however, these tools were not the focus of many discussions. It seems that most universities are still trying to figure out simple ODS and EDW projects. Predictive analytics and sophisticated statistical tools are in use, but seem to be taking a back seat while IT departments get the more fundamental data models in place. Most had an extreme amount of interest in these types of predictive analytics, but felt, “we just aren’t there yet.” GIS data came up in only a small number of presentations, but there is interest in it as well. In fact, one presentation displayed a dashboard with student enrollment by county. People like to see data overlaid on a map. I can see more universities taking advantage of this soon.
  • Business intelligence technologists are in high demand and hard to find.  It was apparent throughout the conference that many universities are challenged to find the right technology talent.  Many are in need of employees that possess business intelligence and reporting skills.
  • Hadoop remains on the shelf. John Rome from Arizona State gave an excellent presentation about Hadoop and its functional use. He clarified how Hadoop got its name: its creator, Doug Cutting, named the project after his son’s stuffed yellow elephant! John also presented a few experiments that ASU has been doing to evaluate the value that Hadoop may be able to bring to the university. In ASU’s experiments, they used Amazon’s EC2 service to quickly spin up supporting servers and configure the services necessary to support Hadoop. This presentation was entertaining, but it was almost the only mention of Hadoop during the entire conference. Hadoop may have more use in research functions, but it does not seem widely adopted in key university business intelligence efforts as of yet. I wonder if this will change by next year?

Doug Cutting with Son’s Stuffed Elephant


Busting 10 Myths about Hadoop by Philip Russom @ TDWI

Recently, I read a very informative white paper which was published by TDWI’s Philip Russom.  The research in this report was sponsored by some of the key BI players below – so it had significant backing.


Integrating Hadoop into Business Intelligence and Data Warehousing: TDWI Research Sponsors

I wanted to share the 10 facts that TDWI’s report uses to bust common myths about Hadoop. I found them insightful, and you may too:

Credit:  Integrating Hadoop into Business Intelligence and Data Warehousing by Philip Russom @ TDWI

  • Fact #1:  Hadoop consists of multiple products
    We talk about Hadoop as if it’s one monolithic thing, but it’s actually a family of open source  products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop  products are also available via vendor distributions; more on that later.)  The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System  (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can  combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL  engine for low-latency data access to HDFS and Hive data.
  • Fact #2:  Hadoop is open source but available from vendors, too
    Apache Hadoop’s open source software library is available from ASF at http://www.apache.org. For users  desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include  additional administrative tools, maintenance, and technical support. A handful of vendors offer their  own non-Hadoop-based implementations of MapReduce.
  • Fact #3: Hadoop is an ecosystem, not a single product
    In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of  vendor products (e.g., database management systems and tools for analytics, reporting, and DI)  that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
  • Fact #4: HDFS is a file system, not a database management system (DBMS)
    Hadoop is primarily a distributed file system and therefore lacks capabilities we associate with a  DBMS, such as indexing, random access to data, support for standard SQL, and query optimization.  That’s okay, because HDFS does things DBMSs do not do as well, such as managing and processing  massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can  layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.
  • Fact #5: Hive resembles SQL but is not standard SQL
    Many of us are handcuffed to SQL because we know it well and our tools demand it. People who  know SQL can quickly learn to hand code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI believes that over time, Hadoop products will support standard SQL and  SQL-based vendor tools will support Hadoop, so this issue will eventually be moot.
  • Fact #6: Hadoop and MapReduce are related but don’t require each other
    Some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some relational DBMSs. Some users deploy HDFS with Hive or HBase, but not MapReduce.
  • Fact #7: MapReduce provides control for analytics, not analytics per se
    MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for a wide variety of hand-coded logic  and other applications—not just analytics.
  • Fact #8: Hadoop is about data diversity, not just data volume
    Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true,  and it’s exactly what brings many users to Apache HDFS and related Hadoop products. After all,  many types of big data that require analysis are inherently file based, such as Web logs, XML files,  and personal productivity documents.
  • Fact #9: Hadoop complements a DW; it’s rarely a replacement
    Most organizations have designed their DWs for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs simply weren’t designed for. Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adopters offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads with which they excel.
  • Fact #10: Hadoop enables many types of analytics, not just Web analytics
    Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other  Web data, but other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer base segmentation, fraud detection, and  risk analysis—can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view of customers, financials, partners, and other business entities.
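
To make Fact #7 a little more concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It runs in a single process and does not use Hadoop at all; the sample log lines and the function names (`map_status`, `reduce_count`, `run_map_reduce`) are my own hypothetical stand-ins, chosen only to show that the "analytics" is the hand-coded logic while the engine just moves and groups the data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample data: web-log lines in "ip method path status" form.
LOG_LINES: List[str] = [
    "10.0.0.1 GET /admissions 200",
    "10.0.0.2 GET /registrar 404",
    "10.0.0.1 GET /admissions 200",
    "10.0.0.3 POST /bursar 500",
]


def map_status(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (hand-coded logic): emit (status_code, 1) for one log line."""
    status = line.split()[-1]
    yield status, 1


def reduce_count(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step (hand-coded logic): sum the counts emitted for one key."""
    return key, sum(values)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy engine: apply the mapper, group values by key, apply the reducer.

    A real engine would distribute these steps across many machines and
    handle failures; this sketch only shows the shape of the computation.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)  # the "shuffle": group by key
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_map_reduce(LOG_LINES, map_status, reduce_count))
```

Swapping in a different mapper and reducer changes what gets computed without touching `run_map_reduce`, which is roughly the point the report makes: the engine provides control, not the analytics.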

The white paper also does a nice job of clarifying the status of current HDFS implementations, which is summarized in the graphic below.

Status of HDFS Implementations


To sort out which Hadoop products are in use today (and will be in the near future), this report’s survey asked: Which of the following Hadoop and related technologies are in production in your organization today? Which will go into production within three years? These questions were answered by a subset of 48 survey respondents who claim they’ve deployed or used HDFS. Hence, their responses are quite credible, being based on direct, hands-on experience.

HDFS and a few add-ons are the most commonly used Hadoop products today. HDFS is near the top of the list (67%) because most Hadoop-based applications demand HDFS as the base platform. Certain add-on Hadoop tools are regularly layered atop HDFS today:

  • MapReduce (69%) for the distributed processing of hand-coded logic, whether for analytics or for fast data loading and ingestion
  • Hive (60%) for projecting structure onto Hadoop data so it can be queried using a SQL-like language called HiveQL
  • HBase (54%) for simple, record-store database functions against HDFS’s data
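
The idea of "projecting structure onto" file-based data can be sketched in a few lines of plain Python. This is not Hive or HiveQL; the file contents, the `SCHEMA` list, and the `read_with_schema` helper are hypothetical and only illustrate the general pattern of storing raw delimited text and applying column names and types when the data is read.

```python
import csv
import io

# Hypothetical raw file contents: plain delimited text with no schema attached.
RAW_FILE = "1001,Biology,2013\n1002,History,2012\n"

# Hypothetical schema: (column name, type) pairs applied only at read time.
SCHEMA = [("student_id", int), ("major", str), ("entry_year", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per row, naming and casting each field per the schema."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# Once structure is projected onto the rows, they can be filtered by column.
rows = list(read_with_schema(RAW_FILE, SCHEMA))
recent = [r["major"] for r in rows if r["entry_year"] >= 2013]
```

The stored text never changes; a different schema could be applied to the same file later, which is one reason this approach suits the file-based data described above.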

If this information has been helpful to you, check out the full report from TDWI below.  How is your organization using Hadoop?
