Data Visualization in Data Science
Maloy Manna
biguru.wordpress.com
linkedin.com/in/maloy
twitter.com/itsmaloy
Synopsis
Having data is not enough. Adding context to data is essential to understand the data, find patterns and engage audiences. Data visualization is a key element of data science, the interdisciplinary field which deals with finding insights from data.
• In this webinar, we explore the roles of data visualization at different stages of the data science process, and why it is essential.
• We also look at how data is encoded visually with shape, size, color and other variables, and how the basic principles of visual encoding can be applied to build better visualizations.
• We cover narratives, types of bias and maps.
• Finally, we look at the various tools, both open source and off-the-shelf software, that are used in data science to build effective data visualizations.
Speaker profile
Maloy Manna
Project Manager - Engineering, AXA Data Innovation Lab
• Over 14 years of experience building data-driven products and services
• Previous organizations: Thomson Reuters, Saama, Infosys, TCS
Contents
Defining Data visualization
Data science process
Data visualization
Visual encoding of data
Narrative structures
Dataviz Technology & Tools
Defining Data visualization
• Visual display of quantitative information
• Mapping data to visual elements
• Encoding data with size, shape, color...
• Storytelling / narrative elements
Defining Data Visualization
Exploratory
• Find insights
• Conversation between data and “you”
Explanatory
• Present insights
Data science project life-cycle
• Acquire data
• Prepare data
• Analysis & Modeling
• Evaluation & Interpretation
• Deployment
• Operations & Optimization
Data science process
[Figure: the data science process. Labels: Data Wrangling, EDA (Exploratory Data Analysis), Exploratory, Explanatory, Data Visualization]
Source: Computational Information Design | Ben Fry
Exploratory data visualization
Data analysis approaches:
Classical: Problem > Data > Model > Analysis > Conclusions
EDA (Exploratory Data Analysis): Problem > Data > Analysis > Model > Conclusions
Bayesian: Problem > Data > Model > Prior distribution > Analysis > Conclusions
EDA = approach, not a set of techniques
Exploratory data visualization
Statistical approaches:
• Quantitative
  • Hypothesis testing
  • Analysis of variance (ANOVA)
  • Point estimates and confidence intervals
  • Least squares regression
• Graphical
  • Scatter plots
  • Histograms
  • Probability plots
  • Residual plots
  • Box plots
  • Block plots
Exploratory data visualization
Graphical analysis procedures:
• Testing assumptions
• Model selection
• Model validation
• Estimator selection
• Relationship identification
• Factor effect determination
• Outlier detection
MUST USE for deriving insights from data
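As one illustration of outlier detection, the sketch below computes the numbers a box plot typically draws: the quartiles and the 1.5 × IQR fences beyond which points are flagged. The function name `box_plot_summary` and the sample values are hypothetical, and the quartiles use the default ("exclusive") method of Python's `statistics.quantiles`; other tools may compute quartiles slightly differently.

```python
import statistics


def box_plot_summary(values, k=1.5):
    """Return quartiles, fences and flagged points for a box plot.

    k is the fence multiplier; 1.5 is a common convention, assumed here.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low_fence": low_fence,
        "high_fence": high_fence,
        "outliers": outliers,
    }


# Hypothetical measurements with one suspicious value.
sample = [4.3, 4.8, 5.1, 5.2, 5.5, 5.7, 6.0, 6.1, 6.4, 15.0]
summary = box_plot_summary(sample)
print(summary["outliers"])
```

A flagged point is only a candidate outlier; whether it is an error or a genuine extreme still needs a look at the data in context.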
Exploratory data analysis
Anscombe's quartet
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
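The summary statistics above can be checked with a short sketch. It recomputes them for the four published Anscombe datasets using plain Python; the function name `summarize` is hypothetical, and the residual standard deviation is assumed to use n − 2 degrees of freedom, as is usual for a simple least squares fit.

```python
from math import sqrt

# Anscombe's quartet (datasets I to III share the same x values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, least squares line, residual SD and correlation for one dataset."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": sqrt(sse / (n - 2)),
        "r": sxy / sqrt(sxx * syy),
    }


results = {name: summarize(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, stats in results.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four datasets print nearly identical rows, yet their scatter plots look completely different, which is why the numbers alone are not enough.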
Explanatory data visualization
Design Engineering Journalism
Explanatory data visualization
Visualization is both an art and a science
• Harry Beck's subway map of London
Visual encoding of data
Data Types
• Quantitative
  • Continuous, Discrete
• Categorical
  • Nominal, Ordered, Interval
Visual encoding of data
Categorical scales and graph design
Visual encoding of data
Bandwidth of our senses [Tor Norretranders]
Visual encoding of data
Data → visual display elements
• Position x
• Position y
• Retinal variables
  • Size, Orientation (ordered data)
  • Color Hue, Shape (nominal data)
• Animation
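The mapping from data to display elements can be sketched as a set of scale functions, one per visual channel. Everything named here (the records, field names, pixel ranges and colors) is hypothetical and chosen only for illustration: two quantitative fields go to position x and y, an ordered quantity goes to size, and a nominal field goes to color hue.

```python
def linear_scale(domain, out_range):
    """Map a quantitative value from a data domain onto a visual range."""
    d0, d1 = domain
    r0, r1 = out_range

    def scale(value):
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

    return scale


def nominal_scale(categories, symbols):
    """Map each category to a distinct symbol (for example a color hue)."""
    lookup = {c: symbols[i % len(symbols)] for i, c in enumerate(categories)}
    return lookup.__getitem__


# Hypothetical records: two quantitative fields, one size field, one nominal field.
records = [
    {"income": 25, "life_exp": 65, "population": 500, "region": "north"},
    {"income": 100, "life_exp": 90, "population": 1000, "region": "south"},
    {"income": 0, "life_exp": 40, "population": 0, "region": "north"},
]

x = linear_scale((0, 100), (0, 400))      # position x, in pixels
y = linear_scale((40, 90), (300, 0))      # position y (screen y grows downward)
size = linear_scale((0, 1000), (4, 20))   # retinal variable for ordered data
hue = nominal_scale(["north", "south"], ["blue", "orange"])  # nominal data

marks = [
    {
        "x": x(rec["income"]),
        "y": y(rec["life_exp"]),
        "size": size(rec["population"]),
        "hue": hue(rec["region"]),
    }
    for rec in records
]
for mark in marks:
    print(mark)
```

Keeping each channel as its own scale makes it easy to swap an encoding (say, hue for shape) without touching the data.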
Visual encoding of data
Ranking visual display elements (framework):
1. Position along a common scale, e.g. scatter plots
2. Position on identical but non-aligned scales, e.g. multiple scatter plots
3. Length, e.g. bar chart
4. Angle & slope, e.g. pie chart
5. Area, e.g. bubbles
6. Volume, density & color saturation, e.g. heat map
7. Color hue, e.g. highlights
Ref. Graphical Perception and Graphical Methods for Analyzing Scientific Data, William Cleveland & Robert McGill (1985)
Design principles
Choose the right type of chart
• Trends / Change over time → Line charts
• Distributions → Histograms
• Summary Information → Table
• Relationships → Scatter Plots
Get it right in black & white (before adding color)
Prefer 2D to 3D for statistical charts
Use color to highlight
Avoid rainbow palette
Avoid chartjunk: “less is more”
Try to have a high data-ink ratio
Design principles
Choose the right type of chart
Ranking
Time-series
Correlation
Nominal comparison
Deviation
Narrative structures
Data Journalism
Traditional journalism
• Data around narrative
• Linear flow
• Physical static media
Data journalism
• Narrative around data
• Complex, often non-linear flow
• Online interactive media
Narrative structures
Bias (and ethics: Don’t lie with data)
Bar charts must have a zero baseline
Present data in its context
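A small calculation shows why the zero baseline matters. Readers judge bars by their drawn length, so the sketch below compares the ratio of two drawn bar lengths under an honest and a truncated baseline. The function name `drawn_bar_ratio` and the two values are hypothetical.

```python
def drawn_bar_ratio(value_a, value_b, baseline=0.0):
    """Ratio of the drawn lengths of two bars measured from the axis baseline."""
    if value_a <= baseline or value_b <= baseline:
        raise ValueError("both values must lie above the baseline")
    return (value_b - baseline) / (value_a - baseline)


# Hypothetical values that differ by 10%.
value_a, value_b = 50.0, 55.0

honest = drawn_bar_ratio(value_a, value_b, baseline=0.0)
truncated = drawn_bar_ratio(value_a, value_b, baseline=45.0)

print(f"zero baseline: bar B is drawn {honest:.2f}x as long as bar A")
print(f"baseline at 45: bar B is drawn {truncated:.2f}x as long as bar A")
```

With the axis starting at 45, a 10% difference is drawn as a bar twice as long, which is the kind of distortion the zero-baseline rule prevents.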
Narrative structures
Bias: Misleading with data
Selective presentation with line charts
• Author Bias
• Data Bias
• Reader Bias
Narrative structures
Bias and Errors (statistics):
• Selection bias, e.g. in sampling
• Omitted-variable bias
Errors:
• Hypothesis testing
• Null Hypothesis = default/no-effect state

Null Hypothesis H0    Valid                                Invalid
Reject                Type I error (false positive)        Correct inference (true positive)
Accept                Correct inference (true negative)    Type II error (false negative)
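The Type I cell of the table can be illustrated with a small simulation: when the null hypothesis is actually valid, a test run at significance level alpha should still reject it in roughly a fraction alpha of experiments. This is a sketch under stated assumptions (a two-sided z-test with known standard deviation 1, samples of size 30, a fixed random seed); the function name `type_i_error_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(trials=2000, n=30, alpha=0.05, seed=0):
    """Fraction of experiments that reject H0 (mean = 0) although H0 is valid."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        # Data are drawn under H0: true mean 0, known standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)
        if abs(z) > z_crit:
            rejections += 1  # a false positive, since H0 is valid here
    return rejections / trials


rate = type_i_error_rate()
print(f"observed Type I error rate: {rate:.3f} (alpha = 0.05)")
```

The observed rate should land near alpha, so a handful of "significant" findings is expected even when there is no real effect.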
Narrative structures
Storytelling:
Visual narratives have moved from author-driven to viewer-driven with the use of highly interactive media for data visualization.
Author-driven
• Strong ordering
• Heavy messaging
• Need for clarity and speed
Viewer-driven
• Exploratory
• Ability to ask questions
• Build own story
DataViz Technologies & Tools
Off-the-shelf: Tableau, Qlikview
Tools:
• Predefined charts: Raw, Chartio, Plotly, Google Fusion Tables, Excel, Gephi
Code & JavaScript libraries:
• R: ggplot2, ggvis, rCharts + shiny (interactive apps)
• Python: matplotlib
• JavaScript: D3.js, Dimple.js, Leaflet, Rickshaw (use JSON data)
• Linux: gnuplot
DataViz Technologies & Tools
[Figure: Tableau data viz]
DataViz Technologies & Tools
[Figure: Chart in R ggplot2]
References
Visual Display of Quantitative Information: Edward Tufte http://goo.gl/qb5ej
Exploratory Data Analysis: John Tukey http://goo.gl/tV57HP
Data Science Life Cycle: Maloy Manna http://www.datasciencecentral.com/profiles/blogs/the-data-science-project-lifecycle
Selecting the Right Graph for Your Message: Stephen Few www.perceptualedge.com/articles/ie/the_right_graph.pdf
Practical Rules for Using Color in Charts: Stephen Few www.perceptualedge.com/articles/visual.../rules_for_using_color.pdf
OpenIntro Statistics: https://www.openintro.org/stat/
Misleading with Statistics: Eric Portelance https://medium.com/i-data/misleading-with-statistics-c63780efa928
Computational Information Design: Ben Fry http://benfry.com/phd/dissertation-050312b-acrobat.pdf