My.ADVISOR.com Sign-In
ID
Password

Member Center / Sign-Up
   
SUBSCRIPTION STATUS
If you are a subscriber to this publication, sign-in to access locked articles. To subscribe or renew go to www.AdvisorStore.com.
Go to Article

WHITE PAPER

Data Quality: A Problem and an Approach

By Javed Beg and Shadab Hussain, Data Warehouse Consultants, Wipro Technologies


As a company's survival increasingly depends on its ability to extract the right information from data, the quality of the data itself is becoming more important. Data quality isn't restricted to data correctness and consistency. It covers the extent to which a company integrates and uses good data enterprise-wide. Data quality also involves scrutiny at the initial level of data entrance in an organization, and geographically and financially valid data with reference to other databases across the enterprise.
Figure 1: CCDB architecture.
Figure 2: This application provides data quality for this customer registry system using two Trillium utilities.

This article deals with evolving data quality concepts, and assessment and methodologies to achieve enterprise-wide implementation and integration. Also, we'll evaluate the business value and impact of data quality. It's necessary to understand the best-practice-based approach and architecture of a data quality initiative in an organization, as well as its complexity and intricacy, to achieve friendly and useful critical information.

Why data quality is important

With the arrival of the Information Age, a huge amount of data has become readily accessible through large databases, the Internet, and a sophisticated communications network. The use of the Internet in supply chains, data mining, knowledge management, and many other related concepts is developing dramatically.

There has been a considerable increase in independent, distributed database servers that provide direct, online information-retrieval services to users. This has made it easier to manipulate multi-source data to a great extent. The resulting integration of data from multiple sources for effective information sharing gives rise to certain incompatibilities. Most users neglect the importance of data, considering it mere "stuff" present in computers, but data acts as the real "fuel" in the IT engine. Companies have to remove incorrect and meaningless data, as it results in faulty analysis and consequent loss.

Without accurate data, users lose confidence in the database and make improper decisions. Everyone feels the impact of data errors at one time or another, no matter the area in the company to which the user belongs -- management, Internet applications, financial applications, marketing, information services, etc. Mostly, poor data quality results in loss of time, money, and customer confidence, and causes embarrassment.

Some studies have determined that data quality problems cost 10 percent of the total revenue. If data quality is poor, the staff of any organization can spend 25 percent of its time handling customer complaints caused by erratic data -- including checking data that should be correct when it comes to a particular department, fixing incorrect data, finding missing data, and clarifying data that doesn't make sense.

Over the past several years, a number of disasters have resulted from poor data quality, as highlighted in news headlines:

  • "Mail from INS stuns flight school" (USA Today, March 13, 2002). As a result of mismanaged documents, a Florida flight school received notice from the Immigration and Naturalization Service of approval of student visas for two of the September 11th terrorists -- six months to the day after the attack on America.
  • "W Hotel's Room-Rate Mistake Benefits Some New York Guests" (The Wall Street Journal On Line, January 3, 2002). Poor data quality impacted the profits of a hotel chain when the room rate was misquoted on its Web site.
  • "Right Answer, Wrong Score: Test Flaws Take Toll" (The New York Times, May 20, 2001). In recent years, educational testing companies have experienced serious breakdowns in quality control. Testing industry errors in the last three years have affected millions of students who took standardized proficiency tests in at least 20 states. The company that scored tests in Minnesota gave 47,000 students lower scores than they deserved. Nearly 9,000 students in New York City were mistakenly assigned to summer school in 1999 because of an error by another big company.

A few examples specifically pointing to the importance of customers' names and addresses include the following (all in the United States):
  • In an estimate, more than 175,000 IRS and state tax refund checks were marked as "undeliverable" by the postal department in a year.
  • In an audit comparing voter registration lists to data gathered from post office relocation data, it was estimated that 15 to 20 percent of voters on voter registration lists have either moved or are deceased.
  • An acquiring company learned later that a deal closed for a new consumer business with only half of the anticipated customers. The problem? Data duplication.
  • A fiber-optics company lost US$500,000 after a mislabeled shipment caused the wrong cable to be laid at the bottom of a lake.
  • The U.S. government estimates that billions of dollars are lost annually due to poor data quality.

These examples illustrate the severe problems caused by data quality in general, but we'd like to concentrate on the corporate problems relating to customers' names and addresses.

Customer name and address data

Significance of cleanliness
Customer data is important to every organization, but it's especially significant for companies pursuing customer intimacy, product leadership, or price leadership. These companies need to develop a common definition of the customer data to various business units. Removal of errors from this "common view" of customer data is a necessary activity for proper functioning of organizations.

To establish more effective relationships with customers, an organization should develop company-wide customer relationship management (CRM) strategies, representing a combination of business processes, information management tools, and customer data. The foundation of effective customer relationships is ensuring customer data has the highest standard of quality and integrity.

Managing customer name and address data isn't a simple task, as such data is often volatile. The names and addresses of customers can change; the information is also easily accessible for users to edit on a data entry screens.

On the Internet, customer acquisition costs have soared up to US$65 to $250 per customer. More than 50 percent of Internet companies can't respond to their customers and can't relate to them, due to lack of high-quality customer data [Members]. Acquiring a new customer costs six to seven times more than retaining an existing one. Retaining 5 percent more of the best customers of an organization can boost profits by as much as 100 percent.

A Data Warehousing Institute survey shows that only one out of four U.S. companies has implemented data quality initiatives. Therefore, 15 percent of the data in the typical U.S. customer database is inaccurate, as confirmed by national data audits. According to interviews with industry experts and customer and data surveys, data quality problems cost $600 billion/year to U.S. businesses.

To analyze the cost of bad data, William Weil of Innovative Systems carried out an analysis. Assuming that a given customer list for a company is 90 percent accurate, of the 10 percent inaccurate customer details, 5 percent (i.e., 0.5 percent of the entire records) had unusable addresses that could have been rectified. The cost of retaining each customer was estimated as $100 to $1,000. In a huge business organization with around a million customers, 0.5 percent (i.e., 5,000 customers) would be lost if they aren't identified correctly in the organization's database. Finally, losing 5,000 customers would result in a direct cost of $500,000 to $5,000,000 (i.e., 5,000 x {100-1,000}). Thus, bad data is considerably expensive, and data cleaning is an activity companies can't neglect at any cost.

To communicate effectively with its customers, via phone or e-mail, an organization has to maintain an extraordinarily clean customer list. Companies lose credibility by using nonsensical or misspelled addresses and by sending multiple letters to the same person. In banking and healthcare, customer matching is an important issue for companies that want to know about a customer who buys a product repeatedly. A business can't pursue the activity with poor-quality data. Cross-selling -- i.e., identifying the overall needs of a household (comprising identified individuals) and suggesting an effective consolidation or expansion of products -- is also a customer data-based activity. It's also good for an organization to know if it's dealing with multiple commercial organizations that turn out to be a part of a larger parent organization. In some applications, a company might have to merge internal and external data -- e.g., after acquisitions. Even if external data is in a quality format, it may not coincide with the internal data format.

Many other customer-based applications need a firm foundation of a high-quality customer database, such as the following:
  • CRM and customer systems
  • E-business and Web
  • Call centers
  • Marketing systems
  • Data matching and compliance
  • Justice, intelligence, and anti-fraud
  • House holding and customer matching

Some of the major industries that might benefit from customer data cleaning are:
  • Banking and finance
  • Government
  • Health and community services
  • Insurance
  • Tax systems
  • Law enforcement
  • Telecommunications

Concept of clean data
The data quality of one particular customer depends on the quality of each field (name, address, etc.):

[Customer Data Quality]i = g([field quality] 1, i, [field quality] 2, i, ........., [field quality] n, i)
g _ function
n _ number of fields for customer 1

The data quality of the entire information system depends on the quality of each customer record:

[IS Data Quality] = h(g([field quality] 1, 1, ........., [field quality] n, 1), ........, g([field quality] 1, m,
.........., [field quality] n, m))
h _ function
m _ number of customers

Thus:

[IS Data Quality] = h([Customer DQ] 1, [Customer DQ] 2, ........, [Customer DQ] m)
h _ function
m _ number of customers

You can conclude that the overall business IS quality ultimately depends on customer field quality.

High-quality customer name and address data must incorporate the following aspects:

Standardization or data consistency
A customer database may contain multiple variations of the same state, city, company, customer, or address, including multiple abbreviations and types.

Table 1: Multiple variations for similar values.
CountryUnited States
US
StateUttar Pradesh
UP
CityBombay
Mumbai
CompanyGEIS International Incorporated
GEIS INTL Ltd
NamePrem Shankar Patel
P S Patel
Street AddressB 44, Gautam Budh Nagar, UP
B-44, G B Nagar, UP

Such inconsistency leads to the problem of record duplication. Data should be properly standardized according to business rules.

De-duplication
Considering the last row of table 1, and assuming this address value belongs to Wipro Technologies, duplicate records will be added for this company for the same location (i.e., B 44, Gautam Budh Nagar, UP). Such duplication can lead to data integration problems. Consequently, if you try to join two tables on a particular attribute value (as given in the current example), "Gautam Budh Nagar" won't come out to be equal to "G B Nagar," although they're same.

Conformance to postal rules or data verification
A set of address records may include the following anomalies:
  • Invalid/missing house number for a street address.
  • Invalid/missing street name for a city.
  • Invalid/missing city with respect to a state.
  • Invalid/missing postal/ZIP code with respect to a city.

Less than 40 countries on the entire planet have formal postal strategies (rules) under proper implementation. Few Western/European countries have a postal/national address reference file to perform address validation and verification [OASIS]. Companies should implement an application-independent name and address standard, as address errors can contribute to high mail return rates, hindering the success of marketing programs.

Completeness
In many cases, the values of particular attributes may not be present in the customer database. This "missing data" problem occurs more frequently (about 65 percent) than other quality problems (only 60 percent).

Isolation of personal and business names
Organizations dealing with a customer database that blends personal and commercial records should maintain a coding technique to keep record types isolated. Absence of any such provision leads to incorrect construction of search keys and makes it extremely difficult to locate the customer record [CDQ Toolkit].

Data integration or linking
When a company has two records of the same customer it can't join or link -- due to different formats -- it can't get an overall view of the customer.

Cross-column-based correctness analysis
This is also an important activity in some applications -- e.g., determination of gender from name.

Business value

Business executives should view data quality as a direct means to reduce costs and impact revenue. Data quality should be a business issue in addition to a technological concept. Companies' survival depends on the information hidden under the piles of data. If the data is bad, the system is bound to fail. Companies often overlook data quality in a trade-off between time, cost, and performance. Businesses have to make sure they don't ignore it in the race to lower costs and deliver projects and products before deadline: There's no point in being quick if information is poor. Reports from META Group indicate that 75 percent of companies in the United States have yet to implement any data quality initiative. Business persons ignore data quality due to several factors:
  • Technical people paying little attention to the issue
  • The cost and difficulty in implementing a data quality initiative
  • An inability to measure return on investment (ROI)

To view the returns from a data quality initiative, one can consider these two examples:
  1. A company saves on postage by de-duplicating customer records that have a ratio of 20 percent duplication.
  2. Accurate business analysis of good customer data reduces anomalies by just 2 percent among a million customers in banks, where a customer in a five-year period keeps about US$15,000.

It's now easy for business executives to understand the effect data quality can have on business. Estimates are that 15 to 20 percent of the data in an organization can be erroneous. As external data collection increases, the quality of data and companies' control over it is reduced. As customers can view a large amount of data via the Internet, the need for data accuracy is strong.

These factors can help you achieve high data quality across your enterprise:
  • Setting quantifiable business goals
  • Finding the soft points or channels that need more data quality implementation and can contribute directly to business goals
  • Making data quality an enterprise-wide issue and involving "functional line" workers as well as IT personnel
  • Committing personnel and resources to data quality as investment
  • Finding the costs, causes, and fallouts of poor data quality
  • Employing a standard, best-practices-based model and methodology
  • Observing ROI

Enterprise integration and need

Enterprises have many lines or functions. Furthermore, they're divided into small manageable units. Things only worsen if the common process or data really doesn't match. The customer is the chief concern in each unit. Imagine the scenario in which a company mismatches customers because of a discrepancy in the data across many departments. If organizations want a unified view of their customers, they have to consolidate this data.

To survive and progress in a competitive environment, organizations are adopting CRM, data warehouse, ERP, SCM, Web channels, and e-commerce. These expensive methods don't deliver the desired results if the associated data isn't compatible, the data is conflicting, or the company can't interlink it properly. Rather than providing data and expecting gains, organizations should ensure its correctness and then expect the results. Ensuring the correctness of data is known as data quality awareness. For proper functioning of a data warehouse, CRM, SCM, ERP, Web channels, e-commerce, etc., correct data is necessary.

We've personally witnessed some data warehouse projects where all data items are well-defined, developed, executed, managed, delivered, and received, but the projects didn't deliver the expected results because of quality problems inherent in the data. Teams were working with low-quality data to get good results.

Minor differences in customer data can even distort a unified view of subjects such as customer and product. When a number of systems exist in isolation, seamless integration of the data becomes difficult. There has to be a system that can ensure data integration. To cite an example, let's assume the expensive data warehouse, sensitive CRM, widespread ERP, a long SCM process, and integrated e-commerce deal with the same set of customers, but digitally they can't map two identical customers because of some difference in a few bits. How are we going to get a unified view? The overall revenues and other figures as a total will match in the organization, but today's strategies focus on segmentation down to as granular a level as possible -- segmentation based on each company across different databases won't match unless you can link all views of a single company. We lose granularity, and we lose details to chalk out strategies. There's no point in incorporating data quality components in each system, separately; this strategy not only increases the cost but also enhances the risk of data disparity. Why can't we establish a universal source for common data? After realizing a strong need for this type of data quality implementation, we've proposed a central customer data bus (CCDB) based on existing data quality architecture available in the market.

Proposed architecture

The CCDB architecture (figure 1) deals with issues pertaining to data quality: e.g., selection, correction, modification, enhancement, and integration of data. CCDB is an architecture employing various tools and technologies to achieve the desired data quality. Although data cleansing tools such as Trillium Software System and DataFlux dfPower Studio are useful for data quality, you can employ Java and Microsoft Active Server Pages (ASP) to integrate the components with Web and other existing systems. Our CCDB has the following components:
  • Collection: It involves identification and profiling of data channels/sources as good, bad, accurate, reliable, confidential, erroneous, etc. This phase focuses on treating each data channel separately, providing data quality specific to the channel's error vulnerability. It also comprises standardization of data at each channel. A data quality initiative that claims to correct data at runtime has to identify sources of data and assess the intensity and style of data correction.
  • Standardization: After "collection," we apply standard or specific common formats and rules to the incoming data.
  • Correction: In this phase, the data is corrected according to its definition, value domains, range, and missing value algorithms defined.
  • Unification: We profile and match new data to existing data and establish a unique identity across the enterprise. The identity will determine similar data within and across data sources. It will append new data or provide missing data. Enhancement can be in the form of demography, geography, credit, and revenue earned.
  • Integration: In this step, we integrate data to all the child or peer systems that will use it for a data quality initiative. This step provides links and connections, integrates the data, and ensures data consistency across systems.

What's different from ETL?
Professionals who have worked on data warehouses might assume they can ensure data quality using extract, transform, and load (ETL) techniques. Although ETL involves data manipulation in the transformation part, it doesn't have capabilities for parsing (i.e., identifying various name and address fields), enhancing, and unifying the data in general and customer data in particular. In data quality terminology, ETL can address conversion or standardization, but it can't correct and enhance data. Also, ETL's focus is more on data flow, with some data transformation occurring prior to it entering a data warehouse. It doesn't focus on subject-specific data such as customer or product data.

A data quality initiative, on the other hand, focuses on subject-specific data, and data is enhanced and fixed. Data quality can be a specialized subset in ETL. Already, data quality tool companies have partnered with ETL tool companies, such as ACTA and Informatica, to provide data quality plug-ins for ETL tools. However, we strongly suggest you view data quality in a much larger context and address it widely in an enterprise.

Data quality solutions available

Data cleaning commercial tools
These tools correct, standardize, and enhance names and addresses:
  • Trillium Software System, Trillium Software
  • NADIS, MasterSoft International
  • Ultra Address Management, The Computing Group
  • DataSight, Group1 Software

De-duplication
  • QualityStage, Ascential
  • ETI Solution 5, Evolutionary Technologies International
  • Centrus Merge/Purge library, Qualitative Marketing Software
  • SSA-NAME3, Search Software America
  • dfPower, DataFlux Corporation
  • i.d.Centric, firstLogic
  • reUnion and MasterMerge, PitneyBowes
  • DoubleTake, StyleList, Personator, Peoplesmith
  • TwinFinder, Omikron
  • DeDupe, International Software Publishing
  • Merge/Purge Plus, Group1 Software

Data analysis (only analysis prior to cleansing)
  • Migration Architect, Evoke Software
  • Conversion Package, Gladstone Computer Services
  • Avelliono Discovery, Avellino    

Trillium and its capabilities

Trillium Software System is a recognized tool in the data quality domain. Trillium's software is a recognized standard of excellence in data quality solutions to Global 1000 companies. It provides the following capabilities:
  • Quick and easy integration into current CRM, e-business, and data warehouse applications.
  • A solution that'll conform to business rules, not business rules to which one has to conform.
  • Fraudulent customers and transactions detection.
  • The ability to identify and understand the difference between a ship-to and bill-to address.

To achieve data quality, the software:
  • Intelligently identifies elements in free-form data such as first and last names, business names, address lines, cities, serial numbers, dates, part numbers, and other data "shapes."
  • Recodes data patterns for consistency with support for international data.
  • Enhances missing data in customer addresses.
  • Identifies and matches households, business contacts, and other relationships.

Trillium has process flow of four modules to process data:
  • Converter: Data discovery and transformation -- Profile, investigate, analyze, format, scan/clean, and recode.
  • Parser: Customer ID and enhancement -- Elementize, classify, standardize, correct, and transform name and address.
  • Geocoder: Global customer validation -- Verify addresses, global postal codes, address correction, and append census data.
  • Matcher: Customer relationship matching -- Identify duplicates at personal and household level and present the output in the form of Pass(exact match) / Fail (Fresh record) / Suspect (Not exactly a match).

Trillium 6.0 features include the following:
  • Customer Data Parser Tuner: Interactive word pattern analysis, automatic business rules generation.
  • Matcher Tuner: Interactive matcher results viewing, graphical display of match groups, looks at all combinations of matches and suspects.
  • Business Data Parser: Generic (non-name and -address) parsing, user-assigned attributes and patterns, standardizes product data & comments fields, cleans other data (product, HR, finance).
  • Country Router: Multi-national data assignment, automatically disseminates country-specific records.

Implementation
You can implement the four platform-independent modules in systems in several ways:
  • Batch-Driver Programs: Provides processes for initial creation and incremental updates.
  • Callable Functions: Library functions accessible from virtually any calling program and/or communications manager.
  • Distributed Objects: CORBA, COM+, Enterprise JavaBeans (EJB), Java/Java Native Interface (JNI).
  • Plug-Ins: Fully integrated with DataStage, PowerCenter, Siebel, Ab Initio, and Microsoft's Data Transformation Services (DTS) for SQL Server.

Control Center
This utility of Trillium is a powerful GUI used to help set up and edit business rules and define unique data quality processes. Within the Control Center, you can establish new parsing standards, specific business rules for matching and linking records, identify industry terminology, or develop user-defined conditional logic.

A case study

Trillium provides dynamic utilities and proper functionality for data quality solutions. Therefore, Wipro identified it as the most appropriate tool for providing data quality solutions to a customer database of a well-known IT giant.

At the initial level, the client was maintaining a standalone company-registry system that contained its top 600 customers per time zone. The client identified certain basic business requirements:
  • To estimate the revenue that was earned from its best customers across all lines of business.
  • To gather information about all the products and services purchased by customers.
  • To get a clear picture of the number of quotes provided to a customer, across all lines of business, that turned into a sale.

For capturing the exact answers to these queries, the company had to understand every aspect of customers across all business units and geographies.

Effective cleansing was a mandatory activity for better management of core customer data and perfect alignment with the overall strategy of focusing on the customers. Wipro devised a complete application for providing data quality solutions to this customer registry system. The two Trillium utilities that were incorporated in the application were:
  • Batch Driver Programs (command lines that run on UNIX)
  • Callable Functions (APIs called through Java)

As figure 2 shows, in passage 1 (i.e., online process), a single customer-and-address or only customer record that would have been entered through the online screen passes through Trillium APIs available for the Converter, Parser, Geocoder, and Matcher modules, called by Java. The final cleansed output is shown on a confirmation screen and could be added to the customer database, according to user's willingness.

Passage 2 (i.e., "batch" process) takes XML files, consisting of customer-and-address or only customer records, as an input. This file is converted to a form that's acceptable to the command line utility of Trillium. While processing through command lines, the matching process occurs with respect to a flat file extracted from the customer database, at a particular instant. If there's a considerable time lag between the execution time of the application and the time when the extract is taken, the results obtained at the end of matching process can't be considered reliable, as some records would have been entered through the online screen during the lag period. Thus, the command line output -- i.e., a flat file -- is passed through only callable "matcher" module APIs. The output of this matching process is more reliable, as it occurs with respect to the latest state of the customer database. Finally, an output XML file is also created that carries information about the cleansed data and the database interactions that happened during the entire process.

Worth your efforts

Data quality is an opportunity for organizations to stay competitive. In the ever-evolving competitive scenario, organizations strive to be the best. The final theme is about information. As companies try to use information to have an edge over competitors, the quality of the data from which this information is extracted is the next area that will differentiate companies and determine their survival. Data quality brings organizations to a level where data is minutely scrutinized for its reliability and consistency. The advantages of data quality will multiply many times as it passes through analysis systems such as a data warehouse or business intelligence software that strive to extract the hidden information from the available data. A data quality initiative will form a core part of IT and business strategies and will make it to boardrooms. Conclusively, data quality is worth your efforts.


Javed Beg is a data warehouse consultant with Wipro Technologies, a global software services firm located in Bangalore, India. He holds an MS degree from IIT-Madras and a degree in mechanical engineering. He has contributed in the development of an application for providing data quality solutions for a customer database. His interest areas are data quality, database auditing, and designing data warehousing applications.

Shadab Hussain is a data warehouse consultant with Wipro Technologies. He holds an MBA degree from IIITM-Gwalior and a degree in electrical engineering. He has worked on data quality initiative projects. His interest areas are data quality, data warehouse design and architecture, and data modeling.

Wipro is the first PCMM Level 5 and SEI CMMi Level 5 certified IT services company globally. Wipro provides comprehensive IT solutions and services (including systems integration, IS outsourcing, package implementation, and software application aevelopment and maintenance) and research and development services (hardware and software design, development, and implementation) to corporations globally. Wipro's unique value proposition is further delivered through its pioneering Offshore Outsourcing Model and stringent quality processes of SEI and Six Sigma.

Printer-friendly
page layout

What do YOU think about this topic? Share your advice and thoughts using this form.

Your Name

REQUIRED : PUBLIC

Your E-Mail

REQUIRED : PRIVATE

Job, Company

OPTIONAL : PUBLIC

City, State, Country

OPTIONAL : PUBLIC

Your Web Site

OPTIONAL : PUBLIC

Your Comment

Please help everyone by keeping your comments on-topic, using clean language, and not defaming or making personal attacks.


Your e-mail address is required, but it will not be displayed to the public or given to anyone. See our Privacy Policy. Comments become visible after they pass our spam filter, and spammers and abusers are permanently blocked. Please report spam or abuse.

ARTICLE INFO

Web Edition: 2003.10.01, Doc #13060

FREE ACCESS FREE ACCESS

Keyword Tags: active server pages (asp), business software, Business Technology, collaboration, compliance, crm, database, database development, database management, Data Analysis, Data Quality, Database, Database Development, e-business, E-Mail, it networking, it strategy, java, messaging, microsoft sql server, Quality Assurance, security, Strategy, xml

Use of this or any other site, content, product or service of Advisor Media constitutes acceptance of Terms of Use.
Portions copyright ©1983-2010 Advisor Media, LLC. All Rights Reserved.
Reuse or reproduction of any portion or quantity of Advisor Media's copyrighted content, in any form, for any purpose, requires written permission.
ADVISOR®, the ADVISOR logo, and other names and logos that incorporate ADVISOR are registered trademarks, trademarks or service marks of Advisor Media, LLC in the United States and/or other countries.
Other trademarks are used for identification, editorial or descriptive purposes and are the property of their owners.
Hosted by Prominic.NET Website powered by
LOTUS SOFTWARE
oa BEGJA001 posted 2003-10-1 mod 03/17/2010 03:09:50 AM ztdbms/ztdbms
domino-144.advisor.com my.advisor.com 03/18/2010 10:45:22 AM