Data Architecture

An organization's data architecture consists of:

-Data stores
-ETL processes
-Metadata
-Data access

Data stores

operational and analytical

ETL processes

to move and transform data from one data store to another

Metadata

describing data stores and relationships between them

Data access

analysis software and middleware for controlling and providing access to data for users

Issues with Integrating Data

-Inconsistent key structures in different systems
-Synonyms: different systems use different names for the same data
-Free-form vs. structured fields
-Inconsistent data values across systems
-Missing data

Techniques for Data Integration

-Consolidation
-Data federation
-Data propagation

Consolidation

consolidating all data into a centralized database

Data federation

provides a virtual view of data without actually creating one centralized database

Data propagation

duplicating data across databases, with near-real-time delay--replication

Static extract

capturing a snapshot of the source data at a point in time--take everything from desired columns

Incremental extract

capturing changes that have occurred since the last static extract--select rows based on update/insert dates
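
The two extract types can be sketched in a few lines. This is a minimal illustration, not a production extractor: the `SOURCE` rows, the `updated_at` field (standing in for an update/insert date), and both function names are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" stands in for an update/insert date.
SOURCE = [
    {"id": 1, "name": "Ann", "ssn": "x", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "name": "Bob", "ssn": "y", "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "name": "Cy", "ssn": "z", "updated_at": datetime(2024, 3, 1)},
]


def static_extract(rows, columns):
    """Snapshot: every row, restricted to the desired columns."""
    return [{c: r[c] for c in columns} for r in rows]


def incremental_extract(rows, columns, last_extract_time):
    """Only rows whose update/insert date is later than the last extract."""
    return [
        {c: r[c] for c in columns}
        for r in rows
        if r["updated_at"] > last_extract_time
    ]


snapshot = static_extract(SOURCE, ["id", "name"])
changes = incremental_extract(SOURCE, ["id", "name"], datetime(2024, 2, 1))
```

Here `snapshot` holds all three rows without the unwanted `ssn` column, while `changes` holds only the rows touched after the assumed last extract time.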

Record-level

-selection: data partitioning
-joining: data combining
-aggregation: data summarization
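
A small sketch of the three record-level functions on in-memory rows. The `orders` and `customers` data and all field names are made up for illustration; a real ETL job would typically do this in SQL or a dataframe library.

```python
from collections import defaultdict

# Hypothetical source data for the example.
orders = [
    {"order_id": 1, "cust_id": 10, "region": "East", "amount": 50.0},
    {"order_id": 2, "cust_id": 11, "region": "West", "amount": 20.0},
    {"order_id": 3, "cust_id": 10, "region": "East", "amount": 30.0},
]
customers = {10: "Ann", 11: "Bob"}

# Selection (data partitioning): keep only the rows that meet a criterion.
east_orders = [o for o in orders if o["region"] == "East"]

# Joining (data combining): attach the customer name from a second source.
joined = [{**o, "cust_name": customers[o["cust_id"]]} for o in orders]

# Aggregation (data summarization): total order amount per customer.
totals = defaultdict(float)
for o in orders:
    totals[o["cust_id"]] += o["amount"]
totals = dict(totals)
```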

Field-level

-single-field: from one field to one field
-multi-field: from many fields to one, or one field to many
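
The field-level cases can be illustrated with three tiny functions. The function names and the "BRAND-MODEL" code format are assumptions chosen for the example, not part of any standard.

```python
def to_state_code(state):
    """Single-field: one source field becomes one target field."""
    return state.strip().upper()


def full_name(first, last):
    """Multi-field, many to one: two source fields become one target field."""
    return f"{last}, {first}"


def split_product_code(code):
    """Multi-field, one to many: 'BRAND-MODEL' becomes two target fields."""
    brand, model = code.split("-", 1)
    return {"brand": brand, "model": model}
```

For instance, `split_product_code("ACME-X100")` yields separate `brand` and `model` fields from one source field.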

Refresh mode

bulk rewriting of target data at periodic intervals--drop indexes; load; reindex

Update mode

only changes in source data are written to data warehouse--maintain indexes
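
A rough sketch of the two load modes using Python's built-in `sqlite3` module. The `sales` table, the `idx_sales_cust` index, and the function names are hypothetical; a real warehouse load would use the bulk-load tools of its own DBMS.

```python
import sqlite3


def refresh_load(conn, rows):
    """Refresh mode: drop the index, rewrite the whole target, reindex."""
    conn.execute("DROP INDEX IF EXISTS idx_sales_cust")
    conn.execute("DELETE FROM sales")
    conn.executemany(
        "INSERT INTO sales (id, cust, amount) VALUES (?, ?, ?)", rows
    )
    conn.execute("CREATE INDEX idx_sales_cust ON sales (cust)")
    conn.commit()


def update_load(conn, changed_rows):
    """Update mode: write only changed rows; the index is maintained as-is."""
    conn.executemany(
        "INSERT OR REPLACE INTO sales (id, cust, amount) VALUES (?, ?, ?)",
        changed_rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, cust TEXT, amount REAL)"
)
refresh_load(conn, [(1, "Ann", 50.0), (2, "Bob", 20.0)])
update_load(conn, [(2, "Bob", 25.0), (3, "Cy", 10.0)])
result = conn.execute(
    "SELECT id, cust, amount FROM sales ORDER BY id"
).fetchall()
```

The refresh pass replaces everything in the target, while the update pass touches only row 2 (changed) and row 3 (new) and leaves the index in place.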

Accuracy

degree to which value matches reality

Uniqueness

degree to which an entity is represented only once in the system

Consistency

data representing same entity should have same value or a consistent value

Completeness

all values are represented

Timeliness

data is available when needed

Currency

data represents the current state of the entity

Conformance

data conforms to metadata rules

Referential integrity

data complies with referential integrity constraints
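
Some of these characteristics can be measured directly. Below is a minimal sketch of checks for uniqueness, completeness, and referential integrity; the sample rows and the function names are assumptions for illustration only, and treating `None` as "missing" is one possible convention.

```python
from collections import Counter

# Hypothetical parent and child tables.
customers = [
    {"cust_id": 1, "email": "a@example.com"},
    {"cust_id": 2, "email": None},
    {"cust_id": 2, "email": "b@example.com"},
]
orders = [{"order_id": 100, "cust_id": 1}, {"order_id": 101, "cust_id": 9}]


def duplicate_keys(rows, key):
    """Uniqueness: key values that appear more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def completeness(rows, field):
    """Completeness: fraction of rows with a non-missing value in the field."""
    if not rows:
        return 1.0
    return sum(r[field] is not None for r in rows) / len(rows)


def orphans(child_rows, fk, parent_rows, pk):
    """Referential integrity: child rows whose foreign key has no parent."""
    parent_keys = {r[pk] for r in parent_rows}
    return [r for r in child_rows if r[fk] not in parent_keys]
```

With the sample data, `cust_id` 2 is flagged as a duplicate, one of three emails is missing, and order 101 points at a customer that does not exist.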

Causes of poor data quality

-External data sources
-Redundant data storage and inconsistent metadata
-Data entry
-Lack of organizational commitment

Who is responsible for data quality

-Data governance
-Data steward

Data governance

high-level organizational groups and processes overseeing data stewardship across the organization

Data steward

A person responsible for ensuring that organizational applications properly support the organization's data quality goals

TQM Principles

-Defect prevention
-Continuous improvement
-Use of enterprise data standards
-Strong foundation of measurement

Master Data Management (MDM)

Disciplines, technologies, and methods to ensure the currency, meaning, and quality of reference data within and across various subject areas

Identity registry

master data remains in source systems; registry provides applications with location

Integration hub

data changes broadcast through central service to subscribing databases

Persistent

central "golden record" maintained; all applications have access
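
To contrast with the persistent approach, here is a small sketch of the identity registry idea: the registry stores only where each entity's data lives, and the record is assembled from the source systems on request. The system names, keys, and the `G-1` global identifier are all hypothetical.

```python
# Hypothetical source systems; master data stays here, not in the registry.
crm = {"C-17": {"name": "Ann Lee", "phone": "555-0100"}}
billing = {"B-902": {"name": "Ann Lee", "balance": 42.0}}
SYSTEMS = {"crm": crm, "billing": billing}

# Identity registry: global ID -> (system, local key) locations only.
registry = {"G-1": [("crm", "C-17"), ("billing", "B-902")]}


def locate(global_id):
    """Tell an application where the master data for an entity lives."""
    return registry.get(global_id, [])


def assemble(global_id):
    """Build a combined view by reading each source system at request time."""
    record = {}
    for system, key in locate(global_id):
        record.update(SYSTEMS[system][key])
    return record


view = assemble("G-1")
```

A persistent design would instead store the merged record centrally as the "golden record" rather than rebuilding it on each request.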