Data Integration – Techniques for Extracting, Transforming and Loading Data
Internal and Confidential
Agenda
• Module 1 – Source Data Analysis & Modeling
• Module 2 – Data Capture Analysis & Design
• Module 3 – Data Transformation Analysis & Design
• Module 4 – Data Transport & Load Design
• Module 5 – Implementation Guidelines
Module 1 – Source Data Analysis and Modeling
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Data Acquisition Concepts
• What is Data Acquisition?
– Data acquisition is the process of moving data to and within the warehousing environment
• Goal oriented
– Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source Driven / Target Driven
– Source Driven – the activities needed to get data into the warehousing environment from various sources
– Target Driven – activities for acquisition within the warehousing environment
• Data Acquisition activities
– Identify the set of data needed
– Choose the extract approach
– Extract data from the source
– Apply transformations
– Load the data warehouse
[Figure: the warehousing data flow – Data sources (operational and external data) feed Data Intake (ETL into Staging), Data Distribution (ETL into the Data warehouse), and Information Delivery (ETL into Data marts for Access)]
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Source Data Analysis – Scope of Data Sources
• Data acquisition issues
– Identifying and understanding data sources
– Mapping source data to target data
– Deciding which data to capture
– Effectively and efficiently capturing the data
– Determining how and when to transform the data
• Acquisition process design must pay attention to:
– Ensuring data quality
– Efficiently loading warehouse databases
– Using tools effectively
– Planning for error recovery
[Figure: dimensions of source scope – Entities (Customer, Order, Product…); Subjects (Customer, Product, Finance, HR, Organization, Process…); Business Events (Receive order, Ship order, Cancel order…); Enterprise Events (Merger with…, Acquisition of…, Termination of…); History (2003, 2004, 2005); Kinds of data?]
Source Data Analysis – Identification of Sources
• Types of sources
– Operational Systems
– Secondary Systems
– Backups, Logs, and Archives
– Shadow Systems
– DSS/EIS Systems
– External Data
• On-Going versus Single Load Sources
Source Data Analysis – Evaluation of Sources
Qualifying criteria and assessment questions:
• Availability – How available and accessible is the data? Are there technical obstacles to access? Or ownership and access authority issues?
• Understandability – How easily understood is the data? Is it well documented? Does someone in the organization have depth of knowledge? Who works regularly with this data?
• Stability – How frequently do data structures change? What is the history of change for the data? What is the expected life span of the potential data source?
• Accuracy – How reliable is the data? Do the business people who work with the data trust it?
• Timeliness – When and how often is the data updated? How current is the data? How much history is available? How available is it for extraction?
• Completeness – Does the scope of the data correspond to the scope of the data warehouse? Is any data missing?
• Granularity – Is the source the lowest available grain (most detailed level) for this data?
Evaluation of Sources – Origin of Data
• Original Point of Entry
– A best practice is to evaluate the original point of entry. Ask: “Is this the very first place that the data is recorded anywhere within the business?”
– If “yes”, you have found the original point of entry. If “no”, the source may not be the original point of entry; ask the follow-up question: “Can the element be updated in this file/table?”
– If it cannot, this is not the original point of entry. If it can, the data element may still be useful as data warehousing source data
[Figure: candidate files CLAIM, CUSTOMER, POLICY, and DRIVER each carry fields such as CUSTOMER-NUMBER, CUSTOMER-NAME, and GENDER – which is the point of origin?]
Evaluation of Sources – Origin of Data
• Original Point of Entry
– This practice has many benefits: data timeliness and accuracy are improved, and it simplifies the set of extracts from the source system
• Business System of Record
– To which system do the business people go when they are validating results?
– If the business identifies a system as the “System of Record”, it must be considered a probable warehousing data source
• Data Stewardship
– In organizations that have a data stewardship program, involve the data stewards
Evaluation of Sources – An Example
[Figure: a Source Data Store Matrix and a Source Data Element Matrix, documenting fields such as CLAIM-NUMBER by source file]
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Source Data Modeling – Overview of Warehouse Data Modeling
The modeling layers span both the source and target sides of the warehousing environment; triage connects the two:
• Contextual Models – business goals and drivers; information needs
• Configuration – warehousing subjects; business questions; facts and qualifiers; targets
• Conceptual Models – source composition; source subjects
• Logical Models – Integrated Source Data Model (ERM); staging, warehouse, and mart ER models; data mart dimensional models
• Structural Models – Source Data Structure Model; staging area structure; warehouse structure; relational mart structures; dimensional mart structures
• Physical Models – source data file descriptions; staging physical design; warehouse physical design; data mart physical designs
• Functional Databases – source data files; implemented warehousing databases
Source Data Modeling
Source Data Modeling Objectives
• A single logical model representing a design view of all source data within scope
• An entity relationship model in 3rd normal form (a business model without implementation redundancies)
• Traceability from logical entities to the specific data sources that implement those entities
• Traceability from logical relationships to the specific data sources that implement those relationships
• Verification that each logical attribute is identifiable in implemented data sources
Source Data Modeling Challenges
• Many data sources do not have data models
• Where data models exist, they are probably outdated and almost certainly not integrated
• Many source structures are only documented in code (e.g. COBOL definitions of VSAM files)
• Sometimes multiple and conflicting file descriptions exist for a single data structure
Source Data Modeling – The Activities
[Figure: the source data modeling flow across model layers –
• Contextual (scope): business goals, business drivers, and information needs determine what kinds of data stores to consider
• Conceptual (analyze): build the source composition model and source subject model for each source; feed the results to target warehousing data modeling
• Logical (design): if a source data model exists, validate it; if not, choose a modeling approach (top-down or bottom-up); integrate the results into the Source Logical Model (ERM)
• Structural (specify): document the structure of each data store (matrix)
• Physical (optimize): use existing file descriptions
• Functional (implement): locate and extract from the existing data stores]
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Understand the Scope of Data Sources
[Figure: the contextual and conceptual steps of the modeling flow – business goals, business drivers, and information needs scope the kinds of data stores; each source is analyzed into the source composition model and source subject model (identify and name subjects, associate subjects), feeding target warehousing data modeling]
• The source composition model uses set notation to develop a subject area model
• Classifies each source by the business subjects that it supports
• Helps to understand
– which subjects have a robust set of sources
– which sources address a broad range of business subjects
• Helpful to plan, size, sequence and schedule development of the DW increments
Composition Subject Model – Example
[Figure: business subjects drawn as sets containing their sources – CUSTOMER (MIS customer table), REVENUE (APMS premium file), POLICY (LIS policy file, RPS policy file, APMS policy master), CLAIM (CPS claim master, LIS claim file, CPS claim action file, CPS claim detail file), EXPENSE, INCIDENT, ORGANIZATION, PARTY (CPS party file), MARKETPLACE (MIS auto marketplace table, MIS residential marketplace table), with the MIS product table also shown]
Composition Subject Matrix – Example
[Figure: the same classification presented as a source-by-subject matrix]
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Understanding Source Content – An Integrated View of Non-Integrated Sources
Source data modeling… “It is not within the charter of the warehousing program to redesign data sources”
• Understand the existing source data designs
• Merge all of the designs into one representative model
• The source logical data modeling process is not one of design but of integration
• To develop a logical source data model, you will need to integrate design information from multiple inputs, including:
– Merging and integrating existing data models
– Extending the subject model that represents any source
– Extracting design structures from source data stores with reverse engineering techniques
Understanding Source Content – Using Existing Models
[Figure: the logical design step – where a source model exists, check it for currency and accuracy, then combine the validated models into a single Source Logical Model (ERM)]
• This modeling activity begins with collection of existing data models
• Models must be validated to ensure accuracy and currency
• Existing models “jump start” the process
• Merging models means identifying and resolving redundancy and conflict across source models
Understanding Source Content – Working Top-Down
[Figure: when no source model exists and the top-down approach is chosen –
• Identify, name & describe entities
• Identify, name & describe relationships
• Identify, name & describe attributes
• Map to data stores
– then integrate the results into the Source Logical Model (ERM)]
Understanding Source Content – Working Bottom-Up
• Derive the data model from the file descriptions
• The source data element matrix serves as the tool to perform source data modeling
• Source modeling and source assessment work well together and share the same set of documentation techniques

File | Field | Attribute (what fact?) | ID (key?) | Entity (what subject?) | Relationship (foreign key?)
APMS Premium | POLICY-NUMBER | Unique Policy ID | Yes | POLICY |
APMS Premium | NAME | Policy Holder Name | | CUSTOMER |
APMS Premium | ADDRESS | Policy Holder Address | | CUSTOMER |
APMS Premium | PREMIUM-AMOUNT | Cost of Policy Premium | | POLICY |
APMS Premium | POLICY-TERM | Coverage Duration | | POLICY |
APMS Premium | BEGIN-DATE | Start date of coverage | | POLICY |
APMS Premium | END-DATE | End date of coverage | | POLICY |
APMS Premium | DISCOUNT-CD | Identify kind of discount | Partial | DISCOUNT |
APMS Premium | SCHEDULE | Basis of discount amt | | DISCOUNT |
APMS Policy | POLICY-NUMBER | Unique Policy ID | Yes | POLICY |
APMS Policy | CUSTOMER-NUMBER | Unique customer ID | Yes | CUSTOMER | POLICY->CUSTOMER
APMS Policy | VIN | Vehicle ID number | Yes | VEHICLE | POLICY->VEHICLE
APMS Policy | MAKE | Vehicle Manufacturer | | VEHICLE |
Integrating Multiple Views – Resolving Redundancy & Conflict
Model states: resolve redundancy and conflict → normalize → verify model
• Examine sets of entities and relationships:
– “customer places order” and “person places order” – customer and person are redundant
– “customer places order” and “customer sends order” – places is redundant with sends
• Examine sets of entities and attributes:
– when differently named entities have a high degree of similarity in their sets of attributes
Understanding Source Content – Data Profiling: Looking at the Data
Look at the data to:
• discover patterns
• know how it is used
• understand data quality
• identify all data values
• classify and organize
Three types of profiling:
• Column profiling
• Dependency profiling
• Redundancy profiling
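As an illustration of column profiling, here is a minimal Python sketch that counts rows, distinct values, nulls, and the most frequent values per column; the extract file name is a placeholder:

import csv
from collections import Counter

def profile_column(rows, column):
    # Basic column statistics: volume, cardinality, nulls, common values.
    values = [r.get(column, "") for r in rows]
    counts = Counter(values)
    return {
        "rows": len(values),
        "distinct": len(counts),
        "nulls": sum(1 for v in values if v in ("", None)),
        "top_values": counts.most_common(5),
    }

# Usage against a hypothetical extract file:
with open("policy_extract.csv", newline="") as f:
    rows = list(csv.DictReader(f))
for column in rows[0]:
    print(column, profile_column(rows, column))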
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Logical to Physical Mapping – Tracing Business Data to Physical Implementation
• Two-way connection
– What attribute is implemented by this field/column?
– Which fields/columns implement this attribute?
• Documenting the mapping
– Extend the source data element matrix to include all data sources
– Provides comprehensive documentation of source data, and detailed mapping from the business view to implemented data elements
Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills
Understanding the Source Systems – A Team Effort
• Understanding the source systems feeding the warehousing environment is a critical success factor
• All members of the warehousing team have a role in this effort
• The acquisition (ETL) team is generally responsible for
– documenting the source layouts
– performing reverse engineering as needed
– determining the point of entry
– identifying the system of record
– looking at actual data values
• The data modelers are likely to
– create the single source logical model (using the inputs gathered by the acquisition team)
• Business analysts/representatives are involved to
– look at the data and help understand the values
– help to identify the point of entry
– help to determine the system of record
Module 2 – Data Capture Analysis and Design
Module 2 – Data Capture Analysis & Design
• Data Capture Concepts
• Source/Target Mapping
• Source Data Triage
• Data Capture Design Considerations
• Time and Data Capture
Data Capture Concepts – An Overview
• Data capture – the activities involved in getting data out of sources; a synonym for data extraction
[Figure: source → Extract → Transform → Load → target]
• Data capture analysis – What to extract?
– Performed to understand requirements for data capture
• Which data entities and elements are required by target data stores?
• Which data sources are needed to meet target requirements?
• What data elements need to be extracted from each data source?
• Data capture design – When to extract? How to extract?
– Performed to understand and specify methods of data capture
• Timing of extracts from each data source
• Kinds of data to be extracted
• Occurrences of data (all or changes only) to be captured
• Change detection methods
• Extract technique (snapshot or audit trail)
• Data capture method (push from source or pull from warehouse)
Module 2 – Data Capture Analysis & Design
• Data Capture Concepts
• Source/Target Mapping
• Source Data Triage
• Data Capture Design Considerations
• Time and Data Capture
Source/Target Mapping – Mapping Objectives
• The primary technique used to perform data capture analysis
• Maps source data to target data
• Three levels of detail
– Entities
– Data stores
– Data elements
• The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store
[Figure: source → Extract → Transform → Load → target; source/target mapping balances data availability on the source side against data requirements on the target side]
Source and Target as Roles
The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store:
• Data Intake – source: operational/external data; target: staging data
• Data Distribution – source: staging data; target: data warehouse
• Information Delivery – source: data warehouse; target: data marts
Mapping Techniques
• Map source entities to target entities – using the source and target data models (e.g. customer, product, service)
• Map source data stores to target data stores – using the logical data models
• Map source data elements to target data elements – using the structural and physical models, and design the transformations
Source/Target Mapping: Examples
[Figure: worked examples of entity mapping, data store mapping, and data element mapping]
Source/Target Mapping: The Full Set of Data Elements
[Figure: the source/target map is built from business questions, the data element matrix, file/table descriptions, the logical model, and the physical design; the full set of target elements combines elements added by business questions, elements added by triage, and elements added by transform logic]
Module 2 – Data Capture Analysis & Design
• Data Capture Concepts
• Source/Target Mapping
• Source Data Triage
• Data Capture Design Considerations
• Time and Data Capture
Source Data Triage – What to Extract: Opportunities
• Source/target mapping analyzes need; triage analyzes opportunity
• Triage is about extracting all data with potential value
What is triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion
Why triage?
• Ensures that a complete set of attributes is captured in the warehousing environment
• Minimizes rework
Triage and acquisition
• Performing triage is a joint effort between the acquisition team and the warehousing data modelers
The Triage Technique
Starting from the source systems needed for the increment:
1. Select needed files
2. Identify elements addressing known business questions
3. Eliminate operational and redundant elements
4. Take all other business elements
The result is the first draft of the element mapping for the staging area or atomic DW
Module 2 – Data Capture Analysis & Design
• Data Capture Concepts
• Source/Target Mapping
• Source Data Triage
• Data Capture Design Considerations
• Time and Data Capture
Kinds of Data
• Event data
• Reference data
• Source system metadata
• Source system keys
[Figure: example source record fragments showing fields such as member number and membership type]
Data Capture Methods – a 2×2 matrix of options:
• Push to warehouse, all data – replicate source files/tables
• Push to warehouse, changed data – replicate source changes or transactions
• Pull from source, all data – extract source files/tables
• Pull from source, changed data – extract source changes or transactions
Detecting Data Changes – how to know which data has changed? (see the capture-methods matrix above)
• Detecting changes at the source
– source date/time stamps
– source transaction files/logs
– replicate source data changes
– DBMS logs
– compare back-up files
• Detecting changes after extract
– compare warehouse extract generations
– compare warehouse extract to source system
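For the “compare warehouse extract generations” option, a minimal Python sketch; the business key name is an assumption:

def diff_generations(previous, current, key="POLICY-NUMBER"):
    # Index each extract generation by its business key.
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserts = [curr[k] for k in curr.keys() - prev.keys()]
    deletes = [prev[k] for k in prev.keys() - curr.keys()]
    updates = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserts, updates, deletes

gen1 = [{"POLICY-NUMBER": "P1", "PREMIUM": 100}, {"POLICY-NUMBER": "P2", "PREMIUM": 200}]
gen2 = [{"POLICY-NUMBER": "P1", "PREMIUM": 150}, {"POLICY-NUMBER": "P3", "PREMIUM": 300}]
ins, upd, dels = diff_generations(gen1, gen2)  # P3 inserted, P1 updated, P2 deleted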
Module 2 – Data Capture Analysis & Design
• Data Capture Concepts
• Source/Target Mapping
• Source Data Triage
• Data Capture Design Considerations
• Time and Data Capture
Timing Issues
[Figure: OLTP sources → data extraction → data transformation (work tables) → warehouse loading → intake layer → data marts, raising three timing questions: frequency of acquisition, latency of load, and periodicity of data marts]
Source System Considerations
[Figure: OLTP sources → data extraction → work tables]
• When is the data ready in each source system? How will I know when it’s ready?
• How long will it remain in a steady state?
• How will I know when source systems fail?
• How will I respond to source system failures?
• How will I recover from a failure?
Handling Time Variance – Techniques and Methods
• SNAPSHOT
– Periodically posts records as of a specific point in time
– Records all data of interest without regard to changes
– Acquisition techniques to create snapshots: DBMS replication; full file unload or copy
• AUDIT TRAIL
– Records details of each change to data of interest
– Details may include date and time of change, how the change was detected, reason for change, before and after data values, etc.
– Acquisition techniques: DBMS triggers; DBMS replication; incremental selection; full file unload/copy
• The important distinction between snapshot and audit trail: with audit trail techniques, only changed data is extracted and loaded; with a snapshot, all data is extracted and loaded, whether changed or not
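To contrast the two techniques, a minimal Python sketch; the record structure and change-record fields are illustrative assumptions:

from datetime import datetime, timezone

def snapshot(records):
    # Snapshot: post all records as of a point in time, changed or not.
    as_of = datetime.now(timezone.utc)
    return [dict(r, SNAPSHOT_TS=as_of) for r in records]

def audit_trail(before, after, detected_by="extract-compare"):
    # Audit trail: record only the detail of each change, with before/after values.
    return [
        {"FIELD": f, "BEFORE": before.get(f), "AFTER": after.get(f),
         "CHANGE_TS": datetime.now(timezone.utc), "DETECTED_BY": detected_by}
        for f in after if after.get(f) != before.get(f)
    ]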
Module 3 – Data Transformation Analysis & Design
Module 3 – Data Transformation Analysis & Design
• Data Transformation Concepts
• Transformation Analysis
• Transformation Design
• Transformation Rules and Logic
• Transformation Sequences and Processes
Transformation Concepts – An Overview
• Data transformation – the changes that occur to the data after it is extracted
• Transformation processing removes
– complexities of operational environments
– conflicts and redundancies of multiple databases
– details of daily operations
– obscurity of highly encoded data
• Transformation analysis determines the requirements to
– integrate disparate data
– change the granularity of data
– assure data quality
• Transformation design
– Specifies the processing needed to meet the requirements determined by transformation analysis
– Determines the kinds of transformations: selection, filtering, conversion, translation, derivation, summarization
– Organizes them into programs, scripts, modules, jobs, etc. that are compatible with the chosen tools and technology
Module 3 – Data Transformation Analysis & Design
• Data Transformation Concepts
• Transformation Analysis
• Transformation Design
• Transformation Rules and Logic
• Transformation Sequences and Processes
Data Integration Requirements
• Integration – create a single view of the data
• Integration & staging data
– organize data by business subjects
– ensure integrated identity through use of common, shared business keys
• Integration & warehouse data
– implement data standards, including derivation of conformed facts and structuring of conformed dimensions
– ensure integration of internal identifiers: where staging integrates real-world keys, the warehouse needs to do the same for surrogate keys
• Integration & data marts
– intended to satisfy business-specific/department-specific requirements
Data Granularity Requirements
• Granularity – each change of data grain, from atomic data to progressively higher levels of summary, is achieved through transformation
• Granularity & staging data
– staging data is kept at the atomic level
• Granularity & warehouse data
– in a 3-tier environment, the warehouse should contain all common and standard summaries
• Granularity & data marts
– derivation of summaries specific to individual needs
Data Quality Requirements
• Data cleansing – the process by which data quality needs are met
– ranges from filtering bad data to replacing data values with some alternative default or derived values
• Cleansing & staging data
– the earlier the data is cleansed, the better the result
– sometimes it is important for staging data to reflect what was contained in the source systems
– delay data cleansing transformation until data is moved from staging to warehouse
– keep both cleansed and un-cleansed data in the staging area
• Cleansing & warehouse data
– data not cleansed in staging is cleansed before being loaded into the warehouse
• Cleansing & data marts
– cleansing at data marts is not necessarily desirable; as a practical matter, however, it may be necessary
Module 3 – Data Transformation Analysis & Design
• Data Transformation Concepts
• Transformation Analysis
• Transformation Design
• Transformation Rules and Logic
• Transformation Sequences and Processes
Transformation Design – Approach
transformation requirements → identify transformation rules & logic → determine transformation sequences → specify transformation processes → transformation specifications
Module 3 – Data Transformation Analysis & Design
• Data Transformation Concepts
• Transformation Analysis
• Transformation Design
• Transformation Rules and Logic
• Transformation Sequences and Processes
Kinds of Transformations
This transformation type… is used to…
• Selection – choose one source to be used among multiple possibilities
• Filtering – choose a subset of rows from a source data table, or a subset of records from a source data file
• Conversion and translation – change the format of data elements
• Derivation – create new data values, which can be inferred from the values of existing data elements
• Summarization – create new data values at a coarser grain by aggregating existing data values
Selection – choose among alternative sources based upon selection rules
[Figure: Extracted Source #1 and Extracted Source #2 feed a Select step; the transformed target data is taken sometimes from source 1, sometimes from source 2]
Example rule: “If membership type is individual, use member name from the membership master file; otherwise use member name from the business contact table”
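A minimal Python sketch of this selection rule; record layouts and key names are assumptions:

def select_member_name(membership, membership_master, business_contacts):
    # Choose one source among multiple possibilities, per the selection rule.
    member_id = membership["MEMBER-NUMBER"]
    if membership["MEMBERSHIP-TYPE"] == "individual":
        return membership_master[member_id]["MEMBER-NAME"]
    return business_contacts[member_id]["MEMBER-NAME"]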
Filtering – eliminate some data from the target set of data based on filtering rules
[Figure: extracted source data passes through a Filter step; some rows or values are discarded before reaching the transformed target data]
Example rule: “If the last 2 digits of the policy number are 04, 27, 46, or 89, extract the data for the data mart; otherwise exclude the policy and all associated data”
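A minimal Python sketch of this filtering rule:

def keep_for_data_mart(policy_number):
    # Keep only policies whose last two digits match the sample set.
    return policy_number[-2:] in {"04", "27", "46", "89"}

policies = ["100204", "553317", "987646"]
extracted = [p for p in policies if keep_for_data_mart(p)]  # ['100204', '987646']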
Conversion – change data content and/or format based on conversion rules
[Figure: extracted source data passes through a Convert step; the value/format in is different from the value/format out]
Example rule: “For policy history prior to 1994, reformat from Julian date to YYYYMMDD format. Default the century to 19”
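A minimal Python sketch of this conversion rule, assuming the source “Julian” format is YYDDD (two-digit year plus day of year):

from datetime import date, timedelta

def julian_to_yyyymmdd(julian):
    yy, ddd = int(julian[:2]), int(julian[2:])
    year = 1900 + yy  # default the century to 19
    return (date(year, 1, 1) + timedelta(days=ddd - 1)).strftime("%Y%m%d")

julian_to_yyyymmdd("93032")  # '19930201' (day 32 of 1993)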
Translation – decode data whose values are encoded, based on rules for translation
[Figure: encoded values come in; both the encoded and the decoded values go out to the transformed target data]
Example rule: “If membership-type-code is ‘C’, translate to ‘Business’; if membership-type-code is ‘P’, blank, or null, translate to ‘Individual’; otherwise translate to ‘Unknown’”
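A minimal Python sketch of this translation rule:

def translate_membership_type(code):
    if code == "C":
        return "Business"
    if code in ("P", "", " ", None):  # 'P', blank, or null
        return "Individual"
    return "Unknown"

[translate_membership_type(c) for c in ("C", "P", None, "X")]
# ['Business', 'Individual', 'Individual', 'Unknown']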
Derivation – use existing data values to create new data based on derivation rules
[Figure: extracted source data passes through a Derive step; new data values are created, so more values come out than went in]
Example rule: “Total Premium Cost = base-premium-amount + (sum of all additional coverage amounts) − (sum of all discount amounts)”
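A minimal Python sketch of this derivation rule; the field names are assumptions:

def total_premium_cost(policy):
    return (policy["base-premium-amount"]
            + sum(policy.get("additional-coverage-amounts", []))
            - sum(policy.get("discount-amounts", [])))

total_premium_cost({"base-premium-amount": 500.0,
                    "additional-coverage-amounts": [75.0, 40.0],
                    "discount-amounts": [25.0]})  # 590.0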
Summarization – change data granularity based on rules of summarization
[Figure: atomic or base data goes into a Summarize step; summary data comes out to the transformed target data]
Example rules: “For each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))”; “for each week (sum daily transaction count, sum daily dollar total)”
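A minimal Python sketch of the store / product line / day summarization:

from collections import defaultdict

def summarize(transactions):
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for t in transactions:
        key = (t["store"], t["product_line"], t["day"])  # the summary grain
        summary[key]["count"] += 1
        summary[key]["total"] += t["amount"]
    return dict(summary)

txns = [{"store": "S1", "product_line": "auto", "day": "2005-03-01", "amount": 9.95},
        {"store": "S1", "product_line": "auto", "day": "2005-03-01", "amount": 4.50}]
summarize(txns)  # {('S1', 'auto', '2005-03-01'): {'count': 2, 'total': 14.45}}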
Identifying Transformation Rules
For any source-to-target data element association, what needs exist for:
• selection?
• filtering?
• conversion?
• translation?
• derivation?
• summarization?
Specifying Transformation Rules
Cells of the source/target map expand to identify transformations by type and name – e.g. cleansing DTR027 (default value), derivation DTR008 (derive name). The transformation logic itself is usually documented separately:
• DTR027 (Default Membership Type): if membership-type is null or invalid, assume “family” membership
• DTR008 (Derive Name): if membership-type is “family”, separate the name using the comma – insert the characters prior to the comma in customer-last-name and the characters after the comma in customer-first-name; else move the name to customer-biz-name
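A minimal Python sketch of these two rules, which also shows a rule dependency: DTR027 must default the membership type before DTR008 can use it (the set of valid types is an assumption):

def dtr027_default_membership_type(rec, valid=("family", "business")):
    # If membership-type is null or invalid, assume "family" membership.
    if rec.get("membership-type") not in valid:
        rec["membership-type"] = "family"
    return rec

def dtr008_derive_name(rec):
    # If membership-type is "family", split the name at the comma;
    # otherwise move the name to customer-biz-name.
    if rec["membership-type"] == "family":
        last, _, first = rec["name"].partition(",")
        rec["customer-last-name"] = last.strip()
        rec["customer-first-name"] = first.strip()
    else:
        rec["customer-biz-name"] = rec["name"]
    return rec

rec = dtr008_derive_name(dtr027_default_membership_type({"name": "Smith, Pat"}))
# -> customer-last-name 'Smith', customer-first-name 'Pat', membership-type 'family'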
Module 3 – Data Transformation Analysis & Design
• Data Transformation Concepts
• Transformation Analysis
• Transformation Design
• Transformation Rules and Logic
• Transformation Sequences and Processes
Dependencies and Sequences
• Time dependency – when one transformation rule must execute before another
– example: summarization of derived data cannot occur before the derivation
• Rule dependency – when execution of a transformation rule is based upon the result of another rule
– example: different translations occur depending on the source chosen by a selection rule
• Grain dependency – when developing one level of summary is based on the results of a previous summarization
– example: quarters can’t be summarized annually before months are summarized on a quarterly basis
Dependencies and Sequences
Specify the selection, filtering, conversion & translation, derivation, and summarization rules, then:
1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan
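Packaging by dependency amounts to ordering the rules so that every rule runs after the rules it depends on; a minimal Python sketch using the standard library (rule names and dependencies are illustrative):

from graphlib import TopologicalSorter  # Python 3.9+

dependencies = {
    "DTR008": {"DTR027"},  # derive name only after membership type is defaulted
    "DTR100": {"DTR008"},  # a downstream rule that uses the derived name
}
order = list(TopologicalSorter(dependencies).static_order())
# ['DTR027', 'DTR008', 'DTR100'] – a valid execution sequence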
Modules and Programs
The transformation rules (e.g. DTR027, Default Membership Type; DTR008, Derive Name), together with the dependencies among rules, determine the structures of modules, programs, scripts, etc.
Job Streams & Manual Procedures – Completing the ETL Design
The transformation rules and their implementation, together with the extract & load dependencies, drive the automated and manual procedures for scheduling, execution, restarts, verification, and communication
Module 4 – Data Transportation & Loading Design
Module 4 – Data Transport & Load Design
• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design
Overview
[Figure: Source Data → Extract → Transform → Load → Target Data; data transport carries data where platform changes occur, and the database load populates the target]
Module 4 – Data Transport & Load Design
• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design
Data Transport Issues
[Figure: the data transport step between Extract/Transform and Load]
• Which platforms?
• Data volumes?
• Transport frequency?
• Network capacity?
• ASCII vs EBCDIC?
• Data security?
• Transport methods?
Data Transport Techniques
[Figure: the data transport step between Extract/Transform and Load]
• Open FTP
• Secure FTP
• Alternatives to FTP
• Data compression
• Data encryption
• ETL tools
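A minimal Python sketch of compress-then-transport over secure FTP; host, credentials, and file names are placeholders:

import gzip
import shutil
from ftplib import FTP_TLS

def compress(path):
    # Data compression before transport reduces volume on the network.
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path

def transport(path, host, user, password):
    # Secure FTP: authenticate, then encrypt the data channel.
    with FTP_TLS(host) as ftp:
        ftp.login(user, password)
        ftp.prot_p()
        with open(path, "rb") as f:
            ftp.storbinary("STOR " + path, f)

# transport(compress("policy_extract.dat"), "warehouse.example.com", "etl", "secret")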
Module 4 – Data Transport & Load Design
• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design
Database Load Issues
[Figure: the database load step from Transform into the Target Data]
• Which DBMS?
• Relational vs dimensional?
• Tables & indices?
• Load frequency?
• Load timing?
• Data volumes?
• Exception handling?
• Restart & recovery?
• Load methods?
• Referential integrity?
Populating Tables – three common strategies (see the sketch below):
• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows
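A minimal sketch of the three strategies in Python (sqlite3 keeps the example self-contained; table and column names are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")

# 1. Drop and rebuild the table
con.execute("DROP TABLE IF EXISTS policy")
con.execute("CREATE TABLE policy (policy_number TEXT PRIMARY KEY, premium REAL)")
con.executemany("INSERT INTO policy VALUES (?, ?)", [("P1", 150.0), ("P3", 300.0)])

# 2. Insert (only) rows into the table
con.executemany("INSERT INTO policy VALUES (?, ?)", [("P4", 75.0)])

# 3. Delete old rows and insert the changed rows
con.execute("DELETE FROM policy WHERE policy_number = ?", ("P1",))
con.execute("INSERT INTO policy VALUES (?, ?)", ("P1", 175.0))
con.commit()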
Indexing
[Figure: the load step maintains both tables and their indices]
• Update indices at load?
• Drop & rebuild?
• Index segmentation?
Updating – allow updating of rows in tables?
• Isn’t the warehouse read-only?
• Updating business data
• Updating row-level metadata
Referential Integrity
• RI is the condition where every reference to another table has a foreign key/primary key match
• Three common options for RI
– DBMS checking
– Test load files before the load using a tool or custom application
– Test the database(s) after the load using a tool or custom application
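A minimal Python sketch of the second option, testing a load file before the load; key and field names are assumptions:

def check_referential_integrity(fact_rows, dimension_keys, fk="customer_number"):
    # Return rows whose foreign key has no primary-key match.
    return [r for r in fact_rows if r[fk] not in dimension_keys]

orphans = check_referential_integrity(
    fact_rows=[{"customer_number": "C1"}, {"customer_number": "C9"}],
    dimension_keys={"C1", "C2"},
)
# [{'customer_number': 'C9'}] – suspend or reject these before loading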
Timing Considerations
• User expectations
• Data readiness
• Database synchronization
Exception Processing
[Figure: between Transform and Load, exceptions are suspended; records that pass go to the target data, while exceptions are reported, logged, or discarded]
Integrating with ETL Processes
[Figure: EXTRACT → TRANSFORM → LOAD job streams, each with its own scheduling, execution, verification, and restarts]
Loading must respect the scheduling, restart/recovery, and communication dependencies of the extract and transform job streams, along with process metadata, execution verification, parallel processing, and tool capabilities:
• Loading as a part of a single transform & load job stream
• Loads triggered by completion of the transform job stream
• Loads triggered by verification of transforms
• Parallel ETL processing
• Loading partitions
• Updating summary tables
Module 5 – Implementation Guidelines
Module 5 – Implementation Guidelines
• Data Acquisition Technology
• ETL Summary
Technology in Data Acquisition
ETL technology spans:
• Data mapping
• Data transformation
• Data conversion
• Data movement
• Data cleansing
• Storage management
• Metadata management
[Figure: the ETL technology stack sits between the source systems (data access) and the warehouse (database loading and database management)]
ETL – Critical Success Factors
1. Design for the future, not for the present
2. Capture and store only changed data
3. Fully understand source systems and data
4. Allow enough time to do the job right
5. Use the right sources, not the easy ones
6. Pay attention to data quality
7. Capture comprehensive ETL metadata
8. Test thoroughly and according to a test plan
9. Distinguish between one-time and ongoing loads
10. Use the right technology for the right reasons
11. Triage source attributes
12. Capture atomic level detail
13. Strive for subject orientation and integration
14. Capture history of changes in audit trail form
15. Modularize ETL processing
16. Ensure that business data is non-volatile
17. Use bulk loads and/or insert-only processing
18. Complete subject orientation and integration
19. Use the right data structures (relational vs. dimensional)
20. Use shared transformation rules and logic
21. Design for distribution first, then for access
22. Fully understand each unique access need
23. Use DBMS update capabilities
24. Design for access before other purposes
25. Design for access tool capabilities
26. Capture quality metadata and report data quality
[Figure: a matrix mapping each factor to the data store roles (Intake, Distribution, Information Delivery) and data transformation roles (Integration, Granularity, Cleansing) it applies to]
Exercises
• Exercise 1: Source Data Options
• Exercise 2: Source Data Modeling
• Exercise 3: Data Capture
• Exercise 4: Data Transformation
• Exercise 5: Data Acquisition Decision