02 - ETL Design Strategy

  • Uploaded by: Rohit Singh, January 2021 (PDF)


Data Integration – Techniques for Extracting, Transforming and Loading Data

Internal and Confidential

Agenda
• Module 1 – Source Data Analysis & Modeling
• Module 2 – Data Capture Analysis & Design
• Module 3 – Data Transformation Analysis & Design
• Module 4 – Data Transport & Load Design
• Module 5 – Implementation Guidelines

December 14, 2010

Module 1 – Source Data Analysis and Modeling


Module 1 – Source Data Analysis & Modeling
• Data Acquisition Concepts
• Source Data Analysis
• Source Data Modeling
• Understanding the Scope of Data Sources
• Understanding Source Content
• Logical to Physical Mapping
• Data Acquisition Roles and Skills


Data Acquisition Concepts
• What is data acquisition?
– Data acquisition is the process of moving data to and within the warehousing environment
• Goal oriented
– Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source driven / target driven
– Source driven – the activities needed for getting data into the warehousing environment from the various sources
– Target driven – the acquisition activities within the warehousing environment
• Data acquisition activities
– Identify the set of data needed
– Choose the extract approach
– Extract data from the source
– Apply transformations
– Load the data warehouse

(Diagram: operational and external data sources feed a Data Intake ETL layer into Staging; Data Distribution ETL gets the data into the warehouse; Information Delivery ETL populates the data marts that users access.)


Source Data Analysis – Scope of Data Sources

• Data acquisition issues
– Identifying and understanding data sources
– Mapping source data to target data
– Deciding which data to capture
– Effectively and efficiently capturing the data
– Determining how and when to transform the data

• Acquisition process design must pay attention to:
– Ensuring data quality
– Efficiently loading warehouse databases
– Using tools effectively
– Planning for error recovery

(Diagram: scoping questions around the sources – Which entities (Customer, Order, Product…)? Which subjects (Product, Customer, Finance, Process, HR, Organization…)? Which business events (receive order, ship order, cancel order…)? Which enterprise events (merger with…, acquisition of…, termination of…)? How much history (2003, 2004, 2005…)? What kinds of data?)

Source Data Analysis – Identification of Sources
• Types of sources
– Operational Systems
– Secondary Systems
– Backups, Logs, and Archives
– Shadow Systems
– DSS/EIS Systems
– External Data
• On-going versus single-load sources


Source Data Analysis – Evaluation of Sources

Qualifying criteria and their assessment questions:
• Availability – How available and accessible is the data? Are there technical obstacles to access, or ownership and access authority issues?
• Understandability – How easily understood is the data? Is it well documented? Does someone in the organization have depth of knowledge? Who works regularly with this data?
• Stability – How frequently do data structures change? What is the history of change for the data? What is the expected life span of the potential data source?
• Accuracy – How reliable is the data? Do the business people who work with the data trust it?
• Timeliness – When and how often is the data updated? How current is the data? How much history is available? How available is it for extraction?
• Completeness – Does the scope of the data correspond to the scope of the data warehouse? Is any data missing?
• Granularity – Is the source the lowest available grain (most detailed level) for this data?

Evaluation of Sources – Origin of Data
• Original point of entry
– The best-practice technique is to evaluate the original point of entry: "Is this the very first place that the data is recorded anywhere within the business?"
– If "yes", then you have found the original point of entry. If "no", then the source may not be the original point of entry. Ask the follow-up question: "Can the element be updated in this file/table?"
– If not, then this is not the original point of entry. If "yes", then the data element may be useful as data warehousing source data

(Diagram: tracing elements such as CUSTOMER-NUMBER, CUSTOMER-NAME, and GENDER across the CLAIM, CUSTOMER, POLICY, and DRIVER files to ask "Point of Origin?")

Evaluation of Sources – Origin of Data
• Original point of entry – this practice has many benefits:
– Data timeliness and accuracy are improved
– It simplifies the set of extracts from the source system
• Business system of record
– To what system do the business people go when they are validating results?
– If the business identifies a system as the "System of Record", then it must be considered a probable warehousing data source
• Data stewardship
– In organizations that have a data stewardship program, involve the data stewards


Evaluation of Sources – An Example

(Example slides: a Source Data Store Matrix and a Source Data Element Matrix, with fields such as CLAIM-NUMBER mapped back to their source files.)


Source Data Modeling – Overview of Warehouse Data Modeling

Modeling proceeds layer by layer, with a source side and a target side at each layer, and triage connecting the two:
• Contextual models – business goals and drivers; information needs
• Conceptual models – source composition and source subjects; warehousing subjects, business questions, facts and qualifiers, target configuration
• Logical models – Integrated Source Data Model (ERM); staging, warehouse, and mart ER models; data mart dimensional models
• Structural models – Source Data Structure Model; staging area structure, warehouse structure, relational mart structures, dimensional mart structures
• Physical models – Source Data File Descriptions; staging physical design, warehouse physical design, data mart physical designs
• Functional databases – source data files; implemented warehousing databases

Source Data Modeling

Source data modeling objectives
• A single logical model representing a design view of all source data within scope
• An entity-relationship model in 3rd normal form (a business model without implementation redundancies)
• Traceability from logical entities to the specific data sources that implement those entities
• Traceability from logical relationships to the specific data sources that implement those relationships
• Verification that each logical attribute is identifiable in implemented data sources

Source data modeling challenges
• Many data sources do not have data models
• Where data models exist, they are probably outdated and almost certainly not integrated
• Many source structures are only documented in code (e.g. COBOL definitions of VSAM files)
• Sometimes multiple and conflicting file descriptions exist for a single data structure

Source Data Modeling – The Activities

(Activity flow, from the contextual to the functional level:)
• Contextual (scope) – business goals, business drivers, and information needs determine what kinds of data stores are in scope
• Conceptual (analyze) – build the source composition model and, for each source, a source subject model; warehousing data passes to target modeling
• Logical (design) – if a source data model exists, validate it; if not, choose a modeling approach and work top-down (from the subject model) or bottom-up (from file descriptions); integrate the results into the Source Logical Model (ERM)
• Structural (specify) – document the structure of each data store (matrix), drawing on existing file descriptions
• Physical (optimize) – locate the data in the existing data stores
• Functional (implement) – extract from the existing data stores


Understand the Scope of Data Sources

(The contextual and conceptual steps of the activity flow: business goals, drivers, and information needs scope the kinds of data stores; each source is analyzed into the source composition model and a source subject model – identify and name subjects, then associate subjects.)

• The source composition model uses set notation to develop a subject area model
• It classifies each source by the business subjects that it supports
• It helps to understand:
– which subjects have a robust set of sources
– which sources address a broad range of business subjects
• It is helpful to plan, size, sequence, and schedule development of the DW increments

Composition Subject Model – Example

(Diagram: business subjects and the sources that support them; approximate groupings recovered from the diagram –)
• CUSTOMER – MIS customer table
• REVENUE – APMS premium file
• POLICY – LIS policy file, RPS policy file, APMS policy master
• CLAIM / EXPENSE / INCIDENT – CPS claim master, LIS claim file, CPS claim action file, CPS party file, CPS claim detail file, MIS product table
• ORGANIZATION / PARTY / MARKETPLACE – MIS auto Marketplace table, MIS residential Marketplace table

Composition Subject Matrix – Example
(Matrix slide not reproduced.)


Understanding Source Content – An Integrated View of Non-integrated Sources

Source data modeling… "It is not within the charter of the warehousing program to redesign data sources."
• Understand the existing source data designs
• Merge all of the designs into one representative model
• The source logical data modeling process is not one of design but of integration
• To develop a logical source data model, you will need to integrate design information from multiple inputs, including:
– merging and integrating existing data models
– extending the subject model that represents each source
– extracting design structures from source data stores with reverse-engineering techniques

Understanding Source Content – Using Existing Models

(Activity flow: collect each existing data model, check it for currency and accuracy, validate it against the source subject model, and combine the models into a single Source Logical Model (ERM).)

• This modeling activity begins with the collection of existing data models
• Models must be validated to ensure accuracy and currency
• Existing models "jump start" the process
• Merging models means identifying and resolving redundancy and conflict across source models

Understanding Source Content – Working Top-Down

(When no source model exists, work top-down from the source subject model:)
1. Identify, name, and describe entities
2. Identify, name, and describe relationships
3. Identify, name, and describe attributes
4. Map to data stores
The result is integrated into the Source Logical Model (ERM) and validated.

Understanding Source Content – Working Bottom-Up
• Derive the data model from the file descriptions
• The source data element matrix serves as the tool to perform source data modeling
• Source modeling and source assessment work well together and share the same set of documentation techniques

Example (source data element matrix):

File | Field | Attribute (what fact?) | ID (key?) | Entity (what subject?) | Relationship (foreign key?)
APMS Premium | POLICY-NUMBER | Unique policy ID | Yes | POLICY |
APMS Premium | NAME | Policy holder name | | CUSTOMER |
APMS Premium | ADDRESS | Policy holder address | | CUSTOMER |
APMS Premium | PREMIUM-AMOUNT | Cost of policy premium | | POLICY |
APMS Premium | POLICY-TERM | Coverage duration | | POLICY |
APMS Premium | BEGIN-DATE | Start date of coverage | | POLICY |
APMS Premium | END-DATE | End date of coverage | | POLICY |
APMS Premium | DISCOUNT-CD | Identify kind of discount | Partial | DISCOUNT |
APMS Premium | SCHEDULE | Basis of discount amt | | DISCOUNT |
APMS Policy | POLICY-NUMBER | Unique policy ID | Yes | POLICY |
APMS Policy | CUSTOMER-NUMBER | Unique customer ID | Yes | CUSTOMER | POLICY->CUSTOMER
APMS Policy | VIN | Vehicle ID number | Yes | VEHICLE | POLICY->VEHICLE
APMS Policy | MAKE | Vehicle manufacturer | | VEHICLE |

Integrating Multiple Views – Resolving Redundancy & Conflict

(Model states: resolve redundancy and conflict → normalize → verify model.)

Examine sets of entities and relationships:
• "customer places order" and "person places order" → customer and person are redundant
• "customer places order" and "customer sends order" → places is redundant with sends
Examine sets of entities and attributes:
• When differently named entities have a high degree of similarity in their sets of attributes

Understanding Source Content – Data Profiling: Looking at the Data

Look at the data to:
• discover patterns
• know how it is used
• understand data quality
• identify all data values
• classify and organize

Three types of profiling:
• Column profiling
• Dependency profiling
• Redundancy profiling
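Column profiling, the first of the three types above, can be sketched as a small routine. This is a minimal illustration only; the rows, field names, and type-inference rule are all hypothetical.

```python
# A minimal column-profiling sketch (rows and field names are hypothetical):
# for each column, report the null count, the distinct values, and a crude
# inferred type -- the kinds of facts column profiling is meant to surface.
def profile_column(rows, column):
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "null_count": len(values) - len(non_null),
        "distinct_values": sorted({str(v) for v in non_null}),
        "inferred_type": "numeric"
        if non_null and all(str(v).isdigit() for v in non_null)
        else "text",
    }

rows = [
    {"gender": "M", "age": "34"},
    {"gender": "F", "age": "41"},
    {"gender": "", "age": "unknown"},
]
gender_profile = profile_column(rows, "gender")
age_profile = profile_column(rows, "age")
```

A real profiling tool would add pattern discovery and frequency distributions, but even this level of inspection reveals unexpected values (such as "unknown" in a numeric-looking column).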


Logical to Physical Mapping – Tracing Business Data to Physical Implementation
• Two-way connection
– What attribute is implemented by this field/column?
– Which fields/columns implement this attribute?
• Documenting the mapping
– Extend the source data element matrix to include all data sources
– This provides comprehensive documentation of source data, and a detailed mapping from the business view to implemented data elements


Understanding the Source Systems – A Team Effort
• Understanding the source systems feeding the warehousing environment is a critical success factor
• All members of the warehousing team have a role in this effort
• The acquisition (ETL) team is generally responsible to:
– document the source layouts
– perform reverse engineering as needed
– determine point of entry
– identify the system of record
– look at actual data values
• The data modelers are likely to:
– create the single source logical model (using the inputs gathered by the acquisition team)
• Business analysts/representatives are involved to:
– look at the data and help understand the values
– help to identify point of entry
– help to determine the system of record

Module 2 – Data Capture Analysis and Design


Module 2 – Data Capture Analysis & Design
• Data Capture Concepts
• Source/Target Mapping
• Source Data Triage
• Data Capture Design Considerations
• Time and Data Capture

Data Capture Concepts – An Overview
• Data capture – the activities involved in getting data out of sources; a synonym for data extraction

(Diagram: source → Extract → Transform → Load → target.)

• Data capture analysis – what to extract?
– Performed to understand the requirements for data capture:
• Which data entities and elements are required by target data stores?
• Which data sources are needed to meet target requirements?
• What data elements need to be extracted from each data source?
• Data capture design – when to extract? how to extract?
– Performed to understand and specify the methods of data capture:
• Timing of extracts from each data source
• Kinds of data to be extracted
• Occurrences of data (all or changes only) to be captured
• Change detection methods
• Extract technique (snapshot or audit trail)
• Data capture method (push from source or pull from warehouse)


Source/Target Mapping – Mapping Objectives
• The primary technique used to perform data capture analysis
• Maps source data to target data
• Three levels of detail:
– Entities
– Data stores
– Data elements
• The terms "source" and "target" describe roles that a data store may play, not fixed characteristics of a data store

(Diagram: data availability on the source side and data requirements on the target side meet in the source/target mapping, spanning extract, transform, and load.)

Source and Target as Roles

The terms "source" and "target" describe roles that a data store may play, not fixed characteristics of a data store:
• Data intake ETL – source: operational/external data; target: staging data
• Data distribution ETL – source: staging data; target: data warehouse
• Information delivery ETL – source: data warehouse; target: data marts

Mapping Techniques
• From the source and target data models (e.g. customer, product, service), map source entities to target entities
• From the logical data models, map source data stores to target data stores
• From the structural & physical models, map source data elements to target data elements and design the transformations

Source/Target Mapping: Examples

(Example slides showing an entity mapping, a data store mapping, and a data element mapping.)

Source/Target Mapping: The Full Set of Data Elements

(Diagram: the source/target map is built from several inputs – business questions, the source data element matrix, file/table descriptions, and the logical model on the source side; the logical model and physical design on the target side. Elements are added by business questions, by triage, and by transform logic.)


Source Data Triage – What to Extract: Opportunities
• Source/target mapping analyzes need; triage analyzes opportunity
• Triage is about extracting all data with potential value

What is triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion

Why triage?
• Ensure that a complete set of attributes is captured in the warehousing environment
• Rework is minimized

Triage and acquisition
• Performing triage is a joint effort between the acquisition team and the warehousing data modelers

The Triage Technique

From the source systems needed for the increment:
1. Select needed files
2. Identify elements addressing known business questions
3. Eliminate operational and redundant elements
4. Take all other business elements
The result is the first draft of the element mapping for the staging area or atomic DW.


Kinds of Data

(Diagram: a membership record illustrating the kinds of data captured –)
• Event data
• Reference data (e.g. member number, membership)
• Source system metadata
• Source system keys

Data Capture Methods

• Push to warehouse, all data → replicate source files/tables
• Push to warehouse, changed data → replicate source changes or transactions
• Pull from source, all data → extract source files/tables
• Pull from source, changed data → extract source changes or transactions

Detecting Data Changes – how to know which data has changed?

• Detecting changes at the source:
– source date/time stamps
– source transaction files/logs
– replicate source data changes
– DBMS logs
– compare back-up files
• Detecting changes after extract:
– compare warehouse extract generations
– compare the warehouse extract to the source system
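The "compare warehouse extract generations" option above can be sketched as a diff of two keyed extracts. The keys and record values here are hypothetical; real extracts would be files, not in-memory dicts.

```python
# Hedged sketch of change detection by comparing extract generations:
# diff the previous and current extracts (keyed by business key) to
# classify inserts, updates, and deletes.
def diff_generations(previous, current):
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: (previous[k], v) for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Two generations of a policy extract (illustrative values).
prev_gen = {"P001": ("SMITH", 500), "P002": ("JONES", 750)}
curr_gen = {"P001": ("SMITH", 525), "P003": ("BROWN", 600)}
inserts, updates, deletes = diff_generations(prev_gen, curr_gen)
```

The same diff shape underlies the "compare back-up files" technique at the source; only the inputs differ.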


Timing Issues

(Diagram: OLTP sources → data extraction → data transformation (work tables) → warehouse loading → intake layer → data marts, with three timing concerns –)
• Frequency of acquisition from the sources
• Latency of load into the warehouse
• Periodicity of the data marts

Source System Considerations

For each OLTP source feeding data extraction:
• When is the data ready in each source system? How will I know when it's ready?
• How long will it remain in the steady state?
• How will I know when source systems fail?
• How will I respond to source system failures?
• How will I recover from a failure?

Handling Time Variance – Techniques and Methods
• SNAPSHOT
– Periodically posts records as of a specific point in time
– Records all data of interest without regard to changes
– Acquisition techniques to create snapshots:
• DBMS replication
• Full file unload or copy
• AUDIT TRAIL
– Records details of each change to data of interest
– Details may include date and time of change, how the change was detected, reason for change, before and after data values, etc.
– Acquisition techniques:
• DBMS triggers
• DBMS replication
• Incremental selection
• Full file unload/copy

The important distinction between snapshot and audit trail: with audit trail techniques, only changed data is extracted and loaded; with snapshot techniques, all data is extracted and loaded, whether changed or not.
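The snapshot/audit-trail distinction can be made concrete with a small sketch. All record layouts and field names here are hypothetical: a snapshot copies every record as of a date, while an audit trail records only the changes, with before and after values.

```python
# Snapshot: post every record as of a point in time, changed or not.
def take_snapshot(source_rows, as_of):
    return [dict(row, snapshot_date=as_of) for row in source_rows]

# Audit trail: record only changes, with before/after values and a
# change date (here detected by comparing against the prior rows).
def audit_trail(old_rows, new_rows, key, as_of):
    old_by_key = {r[key]: r for r in old_rows}
    trail = []
    for row in new_rows:
        before = old_by_key.get(row[key])
        if before != row:
            trail.append({"key": row[key], "before": before,
                          "after": row, "changed_on": as_of})
    return trail

old = [{"policy": "P1", "premium": 500}]
new = [{"policy": "P1", "premium": 525}, {"policy": "P2", "premium": 300}]
snap = take_snapshot(new, "2010-12-14")
trail = audit_trail(old, new, "policy", "2010-12-14")
```

Note that the snapshot contains both policies regardless of change, while the audit trail contains only entries for data that changed (a new policy appears with `before` of `None`).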

Module 3 – Data Transformation Analysis & Design


Module 3 – Data Transformation Analysis & Design
• Data Transformation Concepts
• Transformation Analysis
• Transformation Design
• Transformation Rules and Logic
• Transformation Sequences and Processes

Transformation Concepts – An Overview
• Data transformation – the changes that occur to the data after it is extracted
• Transformation processing removes:
– Complexities of operational environments
– Conflicts and redundancies of multiple databases
– Details of daily operations
– Obscurity of highly encoded data
• Transformation analysis:
– Integrate disparate data
– Change granularity of data
– Assure data quality
• Transformation design:
– Specifies the processing needed to meet the requirements determined by transformation analysis
– Determines the kinds of transformations:
• Selection
• Filtering
• Conversion
• Translation
• Derivation
• Summarization
– These are organized into programs, scripts, modules, jobs, etc. that are compatible with the chosen tools and technology


Data Integration Requirements
• Integration – create a single view of the data
• Integration & staging data:
– organize data by business subjects
– ensure integrated identity through use of common, shared business keys
• Integration & warehouse data:
– implement data standards, including derivation of conformed facts and structuring of conformed dimensions
– ensure integration of internal identifiers
– where staging integrates real-world keys, the warehouse needs to do the same for surrogate keys
• Integration & data marts:
– intended to satisfy business-specific / department-specific requirements

Data Granularity Requirements
• Granularity – each change of data grain, from atomic data to progressively higher levels of summary, is achieved through transformation
• Granularity & staging data – staging data is kept at the atomic level
• Granularity & warehouse data – in a 3-tier environment, the warehouse should contain all common and standard summaries
• Granularity & data marts – derivation of summaries specific to individual needs

Data Quality Requirements
• Data cleansing – the process by which data quality needs are met
– techniques range from filtering bad data to replacing data values with some alternative default or derived values
• Cleansing & staging data:
– the earlier the data is cleansed, the better the result
– it is sometimes important for staging data to reflect what was contained in the source systems
– delay data cleansing transformation until data is moved from staging to warehouse
– keep both cleansed and un-cleansed data in the staging area
• Cleansing & warehouse data:
– data not cleansed in staging is cleansed before being loaded into the warehouse
• Cleansing & data marts:
– cleansing at the data marts is not necessarily desirable; however, as a practical matter it may be necessary


Transformation Design – Approach

transformation requirements → identify transformation rules & logic → determine transformation sequences → specify transformation processes → transformation specifications


Kinds of Transformations

This transformation type… is used to…
• Selection – choose one source to be used among multiple possibilities
• Filtering – choose a subset of rows from a source data table, or a subset of records from a source data file
• Conversion and translation – change the format of data elements
• Derivation – create new data values, which can be inferred from the values of existing data elements
• Summarization – create new data values by aggregating existing detail data to a coarser grain

Selection – choose among alternative sources based upon selection rules

(Diagram: extracted source #1 and extracted source #2 feed a Select step; the transformed target data sometimes comes from source 1, sometimes from source 2.)

Example rule: 'If membership type is individual, use member name from the membership master file; otherwise use member name from the business contact table.'
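The selection rule quoted above can be sketched directly as a function (the record layout and argument names are illustrative, not from the deck):

```python
# Selection: choose one source among multiple possibilities per the rule --
# individual memberships take the name from the membership master file,
# all others take it from the business contact table.
def select_member_name(membership_type, master_name, contact_name):
    if membership_type == "individual":
        return master_name
    return contact_name

individual_name = select_member_name("individual", "A. SMITH", "ACME CORP")
business_name = select_member_name("business", "A. SMITH", "ACME CORP")
```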

Filtering – eliminate some data from the target set of data based on filtering rules

(Diagram: extracted source data feeds a Filter step; some rows or values are discarded before the transformed target data.)

Example rule: 'If the last 2 digits of the policy number are 04, 27, 46, or 89, extract the data for the data mart; otherwise exclude the policy and all associated data.'
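The filtering rule quoted above is a simple suffix test (this particular set of suffixes yields roughly a 4% sample); the policy numbers below are made up:

```python
# Filtering: keep a policy only if the last two digits of its policy
# number fall in the chosen sample set; otherwise exclude it.
SAMPLE_SUFFIXES = {"04", "27", "46", "89"}

def keep_for_mart(policy_number):
    return policy_number[-2:] in SAMPLE_SUFFIXES

policies = ["100004", "100013", "550127", "990989"]
sampled = [p for p in policies if keep_for_mart(p)]
```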

Conversion – change data content and/or format based on conversion rules

(Diagram: extracted source data feeds a Convert step; the value/format in is different than the value/format out.)

Example rule: 'For policy history prior to 1994, reformat from Julian date to YYYYMMDD format. Default the century to 19.'
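The conversion rule quoted above can be sketched as follows, assuming the Julian dates arrive as a 5-character YYDDD string (the slide does not specify the exact layout):

```python
from datetime import date, timedelta

# Conversion: reformat a pre-1994 Julian date (assumed YYDDD layout) to
# YYYYMMDD, defaulting the century to 19 per the rule.
def julian_to_yyyymmdd(julian):
    yy, ddd = int(julian[:2]), int(julian[2:])
    d = date(1900 + yy, 1, 1) + timedelta(days=ddd - 1)
    return d.strftime("%Y%m%d")

converted = julian_to_yyyymmdd("93032")  # day 32 of 1993
```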

Translation – decode data whose values are encoded, based on rules for translation

(Diagram: encoded values go in; both the encoded and the decoded value come out.)

Example rule: 'If membership-type-code is 'C', translate to 'Business'; if membership-type-code is 'P', blank, or null, translate to 'Individual'; otherwise translate to 'Unknown'.'
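The translation rule quoted above maps directly to a decode function (treating Python `None` as null):

```python
# Translation: decode membership-type-code per the rule --
# 'C' -> Business; 'P', blank, or null -> Individual; else Unknown.
def translate_membership_type(code):
    if code == "C":
        return "Business"
    if code is None or code.strip() in ("P", ""):
        return "Individual"
    return "Unknown"
```

Note the slide's diagram emits both the encoded and the decoded value; a real ETL job would carry the original code alongside the translated label.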

Derivation – use existing data values to create new data based on derivation rules

(Diagram: new data values are created; more values come out than went in.)

Example rule: 'Total Premium Cost = base-premium-amount + (sum of all additional coverage amounts) − (sum of all discount amounts)'
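The derivation rule quoted above, written out (the argument names and sample amounts are illustrative):

```python
# Derivation: create a new value, Total Premium Cost, from existing
# values -- base premium plus coverages minus discounts.
def total_premium_cost(base_premium, coverage_amounts, discount_amounts):
    return base_premium + sum(coverage_amounts) - sum(discount_amounts)

total = total_premium_cost(500.0, [120.0, 80.0], [50.0])
```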

Summarization – change data granularity based on rules of summarization

(Diagram: atomic or base data goes in; summary data comes out.)

Example rules: 'for each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))'; 'for each week (sum daily transaction count, sum daily dollar total)'
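The first (daily) summarization rule quoted above can be sketched as a grouped aggregation; the transaction records are hypothetical:

```python
from collections import defaultdict

# Summarization: per store / product line / day, count transactions and
# accumulate their total dollar value (atomic data in, summary data out).
def summarize_daily(transactions):
    summary = defaultdict(lambda: [0, 0.0])  # key -> [count, dollar total]
    for t in transactions:
        key = (t["store"], t["line"], t["day"])
        summary[key][0] += 1
        summary[key][1] += t["amount"]
    return dict(summary)

txns = [
    {"store": "S1", "line": "auto", "day": "2010-12-14", "amount": 20.0},
    {"store": "S1", "line": "auto", "day": "2010-12-14", "amount": 5.0},
    {"store": "S2", "line": "home", "day": "2010-12-14", "amount": 40.0},
]
daily = summarize_daily(txns)
```

The weekly rule would then consume this daily summary rather than the atomic transactions, which is exactly the grain dependency discussed later in this module.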

Identifying Transformation Rules

For any source-to-target data element association, what needs exist for:
• selection?
• filtering?
• conversion?
• translation?
• derivation?
• summarization?

Specifying Transformation Rules

In the source/target mapping, cells expand to identify transformations by type and name – e.g. cleansing DTR027 (default value) and derivation DTR008 (derive name). The transformation logic itself is documented separately:

DTR027 (Default Membership Type): If membership-type is null or invalid, assume "family" membership.

DTR008 (Derive Name): If membership-type is "family", separate the name using the comma: insert the characters prior to the comma in customer-last-name and the characters after the comma in customer-first-name; else move the name to customer-biz-name.
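The two rules DTR027 and DTR008 can be sketched in execution order (the field names follow the slide; the Python record structure and the set of valid types are assumptions):

```python
# DTR027 (Default Membership Type): if membership-type is null or
# invalid, assume "family" membership.
def apply_dtr027(record, valid_types=("family", "business", "individual")):
    if record.get("membership_type") not in valid_types:
        record["membership_type"] = "family"
    return record

# DTR008 (Derive Name): if membership-type is "family", separate the
# name at the comma into last/first name; else move it to biz-name.
def apply_dtr008(record):
    if record["membership_type"] == "family":
        last, _, first = record["name"].partition(",")
        record["customer_last_name"] = last.strip()
        record["customer_first_name"] = first.strip()
    else:
        record["customer_biz_name"] = record["name"]
    return record

rec = apply_dtr008(apply_dtr027({"membership_type": None, "name": "Smith, Jane"}))
biz = apply_dtr008(apply_dtr027({"membership_type": "business", "name": "ACME CORP"}))
```

Note the ordering: DTR008 depends on the result of DTR027, an example of the rule dependency discussed next.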


Dependencies and Sequences
• Time dependency – when one transformation rule must execute before another
– example: summarization of derived data cannot occur before the derivation
• Rule dependency – when execution of a transformation rule is based upon the result of another rule
– example: different translations occur depending on the source chosen by a selection rule
• Grain dependency – when developing one level of summary is based on the results of a previous summarization
– example: quarters can't be summarized annually before months are summarized on a quarterly basis
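Sequencing rules by their declared dependencies is a small topological-ordering problem. The rule names and dependency map below are hypothetical, and the sketch assumes the dependency graph is acyclic:

```python
# Sequence transformation rules so every rule runs after the rules it
# depends on (assumes an acyclic dependency graph).
def sequence_rules(rules, depends_on):
    ordered, placed = [], set()
    while len(ordered) < len(rules):
        for r in rules:
            if r not in placed and all(d in placed for d in depends_on.get(r, [])):
                ordered.append(r)
                placed.add(r)
    return ordered

order = sequence_rules(
    ["summarize_premiums", "derive_total_premium", "select_source"],
    {"summarize_premiums": ["derive_total_premium"],
     "derive_total_premium": ["select_source"]},
)
```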


Dependencies and Sequences

(Diagram: specify selection and filtering, then conversion & translation, derivation, and summarization.)

1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan

Modules and Programs

Transformation rules (e.g. DTR027 Default Membership Type, DTR008 Derive Name), together with the dependencies among rules, yield the structures of modules, programs, scripts, etc.

Job Streams & Manual Procedures – Completing the ETL Design

Transformation rules and their implementation, together with the extract & load dependencies, drive the automated and manual procedures: scheduling, execution, restarts, verification, and communication.

Module 4 – Data Transportation & Loading Design


Module 4 – Data Transport & Load Design
• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design


Overview

(Diagram: Source Data → Extract → Transform → Load → Target Data, with data transport moving the data between platforms and a database load at the target.)

Module 4 – Data Transport & Load Design

• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design


Data Transport Issues

Source Data → Extract → (data transport) → Transform → Load → Target Data

• Which platforms?
• Data volumes?
• Transport frequency?
• Network capacity?
• ASCII vs. EBCDIC?
• Data security?
• Transport methods?

Data Transport Techniques

• Open FTP
• Secure FTP
• Alternatives to FTP
• Data compression
• Data encryption
• ETL tools
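Data compression pairs naturally with transfer verification: compress the extract before shipping it, and send a checksum so the receiving side can confirm the file arrived intact. A minimal sketch using Python's standard library; file names are illustrative, and the actual transfer (FTP, SFTP, etc.) is assumed to happen out of band:

```python
import gzip
import hashlib
import os
import tempfile


def prepare_for_transport(path_in, path_out):
    """Compress an extract file before transport and return a checksum
    the receiving side can use to verify the transfer."""
    with open(path_in, "rb") as src:
        data = src.read()
    with gzip.open(path_out, "wb") as dst:
        dst.write(data)
    return hashlib.sha256(data).hexdigest()


# Demo with a small temporary extract file
tmpdir = tempfile.mkdtemp()
extract = os.path.join(tmpdir, "extract.csv")
packed = extract + ".gz"
with open(extract, "wb") as f:
    f.write(b"id,amount\n1,10.0\n2,-5.0\n")

checksum = prepare_for_transport(extract, packed)
```

The receiver decompresses, recomputes the SHA-256 over the contents, and compares it to the sent checksum before handing the file to the transform stage.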

Module 4 – Data Transport & Load Design

• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design


Database Load Issues

Source Data → Extract → Transform → Load (database load) → Target Data

• Which DBMS?
• Relational vs. dimensional?
• Tables & indices?
• Load frequency?
• Load timing?
• Data volumes?
• Exception handling?
• Restart & recovery?
• Load methods?
• Referential integrity?

Populating Tables

• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows

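The third option – delete old rows, then insert the changed rows – can be sketched against an in-memory SQLite table; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Old Name"), (2, "Unchanged")])

# Changed and new rows arriving from the transform stage
changed = [(1, "New Name"), (3, "Brand New")]

# Delete the old versions of the changed rows, then insert the new versions
conn.executemany("DELETE FROM customer WHERE id = ?",
                 [(key,) for key, _ in changed])
conn.executemany("INSERT INTO customer VALUES (?, ?)", changed)

rows = dict(conn.execute("SELECT id, name FROM customer ORDER BY id"))
print(rows)  # → {1: 'New Name', 2: 'Unchanged', 3: 'Brand New'}
```

Deleting before inserting avoids per-row update logic, at the cost of needing the delete and insert to run in one unit of work so readers never see the row missing.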

Indexing

• Update indices at load?
• Drop & rebuild?
• Index segmentation?
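The drop-and-rebuild choice can be sketched with SQLite; in a real warehouse the trade-off against updating indices at load depends on data volumes and the DBMS's bulk-load facilities, and the table and index names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_sale_amount ON sale(amount)")

# Drop the index, bulk load, then rebuild: often faster than
# maintaining the index row by row during a large load.
conn.execute("DROP INDEX ix_sale_amount")
conn.executemany("INSERT INTO sale VALUES (?, ?)",
                 [(i, float(i)) for i in range(1000)])
conn.execute("CREATE INDEX ix_sale_amount ON sale(amount)")

count = conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]
print(count)  # → 1000
```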

Updating

• Isn’t the warehouse read-only?
• Allow updating of rows in tables?
• Updating business data
• Updating row-level metadata

Referential Integrity

• RI is the condition where every reference to another table has a foreign key/primary key match.
• Three common options for RI:
  – DBMS checking
  – Test load files before load using a tool/custom application
  – Test database(s) after load using a tool/custom application

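The second option – testing load files before the load – can be sketched as a custom check that flags fact rows whose foreign key has no matching dimension key. The table and key names are hypothetical:

```python
def check_referential_integrity(fact_rows, dim_keys, fk_index):
    """Return the fact rows whose foreign key (at position fk_index)
    has no matching primary key in the dimension key set."""
    valid = set(dim_keys)
    return [row for row in fact_rows if row[fk_index] not in valid]


# Hypothetical dimension keys and fact rows from a load file
dim_customer = [101, 102, 103]
fact_orders = [("ord-1", 101), ("ord-2", 999), ("ord-3", 103)]

orphans = check_referential_integrity(fact_orders, dim_customer, fk_index=1)
print(orphans)  # → [('ord-2', 999)]
```

Orphan rows found this way would typically be routed to exception processing rather than loaded.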

Timing Considerations

• User expectations
• Data readiness
• Database synchronization


Exception Processing

As transformed rows are loaded, each row is checked:
• ok – the row is loaded into the target data
• suspend – the exception is held for repair
• discard – the row is rejected
Exception reports and logs record what was suspended and discarded.
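The ok/suspend/discard routing can be sketched as follows; the validation rule and field names are hypothetical:

```python
def route_rows(rows, validate):
    """Route each transformed row to the target, the suspense area,
    or the discard pile, and log counts for the exception report."""
    target, suspend, discard = [], [], []
    for row in rows:
        status = validate(row)
        if status == "ok":
            target.append(row)
        elif status == "suspend":
            suspend.append(row)
        else:
            discard.append(row)
    log = {"loaded": len(target), "suspended": len(suspend),
           "discarded": len(discard)}
    return target, suspend, discard, log


def validate(row):
    # Hypothetical rule: missing amounts are discarded,
    # negative amounts are suspended for review
    if row["amount"] is None:
        return "discard"
    if row["amount"] < 0:
        return "suspend"
    return "ok"


rows = [{"id": 1, "amount": 10.0},
        {"id": 2, "amount": -5.0},
        {"id": 3, "amount": None}]
target, suspend, discard, log = route_rows(rows, validate)
```

The log feeds the exception reports; the suspense area is reprocessed once the underlying data problems are repaired.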

Integrating with ETL Processes

Extract and transform/load jobs each have their own scheduling, execution, verification, and restart/recovery, linked by scheduling, restart/recovery, and communication dependencies, process metadata, and tool capabilities.

• Loading as part of a single transform & load job stream
• Loads triggered by completion of the transform job stream
• Loads triggered by verification of transforms
• Parallel ETL processing
• Loading partitions
• Updating summary tables

Module 5 – Implementation Guidelines



Module 5 – Implementation Guidelines

• Data Acquisition Technology
• ETL Summary


Technology in Data Acquisition

ETL technology spans:
• Data Mapping
• Data Transformation
• Data Conversion
• Storage Management
• Metadata Management
• Data Movement
• Data Cleansing
• Database Management
• Database Loading
• Data Access
• Source Systems

ETL – Critical Success Factors

1. Design for the future, not for the present
2. Capture and store only changed data
3. Fully understand source systems and data
4. Allow enough time to do the job right
5. Use the right sources, not the easy ones
6. Pay attention to data quality
7. Capture comprehensive ETL metadata
8. Test thoroughly and according to a test plan
9. Distinguish between one-time and ongoing loads
10. Use the right technology for the right reasons
11. Triage source attributes
12. Capture atomic level detail
13. Strive for subject orientation and integration
14. Capture history of changes in audit trail form
15. Modularize ETL processing
16. Ensure that business data is non-volatile
17. Use bulk loads and/or insert-only processing
18. Complete subject orientation and integration
19. Use the right data structures (relational vs. dimensional)
20. Use shared transformation rules and logic
21. Design for distribution first, then for access
22. Fully understand each unique access need
23. Use DBMS update capabilities
24. Design for access before other purposes
25. Design for access tool capabilities
26. Capture quality metadata and report data quality

(The original slide mapped each factor, via a checkmark matrix not recoverable here, to Data Transformation roles – Cleansing, Granularity, Integration – and Data Store roles – Intake, Distribution, Information Delivery.)

Exercises

• Exercise 1: Source Data Options
• Exercise 2: Source Data Modeling
• Exercise 3: Data Capture
• Exercise 4: Data Transformation
• Exercise 5: Data Acquisition Decision

