Specific: Clearly define the exact goal or objective without ambiguity
Measurable: Establish concrete criteria to track progress and quantify outcomes
Achievable: Set realistic and attainable targets within existing resources and constraints
Relevant: Ensure the goal aligns with broader organizational or personal objectives
Time-bound: Set a precise deadline or timeframe for completing the goal
Data Ethics
Consent: Obtaining explicit permission from individuals before collecting or using their personal data
Transaction transparency: Clearly communicating how data is collected, used, and shared
Openness: Providing clear and accessible information about data practices and processes
Privacy: Protecting individual data from unauthorized access or misuse
Ownership: Recognizing and respecting individuals' rights to their personal data
Currency: Making individuals aware of financial transactions that result from the use of their personal data, and of the scale of those transactions
Data Credibility
Reliability: Consistency of measurements across different collection instances
Originality: Direct sourcing from primary information sources without intermediary alterations
Comprehensiveness: Complete inclusion of all essential data elements required for accurate analysis
Currency: Timeliness and relevance of data to the current research or operational context
Citation: Formal acknowledgment and linkage to the original data production source
Data Solutions Questionnaire
Practical: Is it easy to implement?
Unintended Consequences: What will happen if we do this?
Logical: Does this make sense logically?
Precedent Backed: What happened when we tried this before?
Ethical: Is this the right thing to do?
Difference with Alternative: Is it better than other ideas?
Metadata Types
Structural: Describes the internal organization and relationships within a data set
Administrative: Provides technical information about data management, creation, and preservation
Descriptive: Identifies and explains the content, context, and characteristics of data
Data Collection Best Practices
Systematic Data Management
Primary Recording: Capture raw data on paper as initial documentation
Digital Migration: Systematically transfer data to electronic format
Worksheet Optimization: Consolidate data in a single, structured worksheet
Structural Integrity
Implement a unique identifier (ID) column for precise record tracking
Allocate one column per distinct variable
Reserve first row for variable/column names
Quality Control
Completeness: Ensure every cell contains meaningful information
Meticulous Documentation: Maintain comprehensive and clear research notes
Standardization: Maintain consistent data entry protocols
Analytical Rigor
Precision: Avoid data speculation or unsubstantiated entries
Numerical Representation: Record zero (0) as a valid numerical value, distinct from a missing entry
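The structure and quality rules above can be checked mechanically. The following is a minimal Python sketch, assuming the worksheet has been exported as CSV with the variable names in the first row. The sample data, the `id` column name, and the `check_sheet` helper are hypothetical and only illustrate the idea.

```python
import csv
import io

# Hypothetical sample sheet: first row holds the column names,
# "id" is assumed to be the unique identifier column.
SAMPLE = """id,site,count
1,north,4
2,south,0
3,north,
3,east,7
"""

def check_sheet(text, id_column="id"):
    """Return a list of human-readable problems found in a CSV worksheet."""
    problems = []
    seen_ids = set()
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        record_id = row[id_column]
        if record_id in seen_ids:
            problems.append(f"row {row_number}: duplicate {id_column} {record_id!r}")
        seen_ids.add(record_id)
        for column, value in row.items():
            # An empty string is a missing entry; the string "0" is a real value.
            if value is None or value.strip() == "":
                problems.append(f"row {row_number}: empty cell in {column!r}")
    return problems

problems = check_sheet(SAMPLE)
for problem in problems:
    print(problem)
```

Note that the row with a count of 0 is not flagged: only the empty cell and the repeated identifier are reported.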
Data Cleaning Protocol
Preservation
Create an unaltered backup of original dataset
Perform cleaning in a separate working table
Error Management
Systematic Error Tracking: Document and report all data anomalies
Utilize database functions for efficient and reliable data cleaning
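As a rough illustration of the protocol above, the sketch below keeps an untouched backup, cleans a separate working copy, and logs every anomaly instead of silently fixing it. It uses plain Python rather than database functions, and the records, the `age` field, and the validity rule (digits only) are assumptions made up for the example.

```python
import copy

# Hypothetical raw records; in practice these would be loaded from the original file.
raw_records = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": "abc"},
    {"id": 3, "age": "-5"},
]

# Preservation: keep an unaltered backup and clean a separate working copy.
backup = copy.deepcopy(raw_records)
working = copy.deepcopy(raw_records)

# Error tracking: record every anomaly rather than guessing a replacement.
error_log = []
for record in working:
    text = record["age"]
    if text.isdigit():
        record["age"] = int(text)
    else:
        error_log.append({"id": record["id"], "field": "age", "value": text})
        record["age"] = None  # mark as missing; the original value stays in the backup

print(working)
print(error_log)
```

Because the log stores the record ID and the offending value, each anomaly can be reported and traced back to the original dataset.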
Exploratory Data Analysis (EDA): Non-Sequential Iterative Approach
Discovery Phase
Rapidly scan dataset to understand fundamental characteristics and potential insights
Structural Assessment
Map data architecture, identifying variable types and potential inter-feature relationships
Validation Processes
Rigorously verify data integrity, statistical assumptions, and distribution consistency
Cleaning Techniques
Handle missing values and outliers, and standardize data formats
Joining and Integration
Merge datasets on shared keys, checking referential integrity so that no records are silently dropped or duplicated
Presentation Preparation
Transform complex data into compelling visualizations and actionable insights
Iterative Refinement
Dynamically cycle through analysis stages, continuously challenging and improving initial assumptions
Key Principles
Maintain analytical flexibility, prioritizing deep data understanding over rigid methodological constraints
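For the joining and integration stage, one simple check is to look for rows whose key has no match in the other table. The sketch below is a minimal example in plain Python; the `customers` and `orders` tables and their field names are hypothetical.

```python
# Hypothetical tables: each order is expected to reference an existing customer id.
customers = {1: "Ada", 2: "Grace"}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 3, "total": 12.5},
    {"order_id": 102, "customer_id": 2, "total": 8.0},
]

joined = []
orphans = []
for order in orders:
    name = customers.get(order["customer_id"])
    if name is None:
        # No matching customer: a referential-integrity problem to investigate.
        orphans.append(order["order_id"])
    else:
        joined.append({**order, "customer_name": name})

print(len(joined), "orders joined; orphaned order ids:", orphans)
```

Collecting the orphaned rows separately, instead of dropping them during the join, makes it possible to revisit the validation and cleaning stages before presenting results.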
Normalization vs Standardization
Normalization scales data to a fixed range (typically 0-1) using min-max scaling, useful for algorithms that require bounded input features.
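Standardization (z-score scaling) rescales data to zero mean and unit variance. It changes only the location and scale of the data, not the shape of its distribution, so it does not make the data normal. It is less distorted by outliers than min-max scaling and works well when the data is roughly normally distributed.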
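The two transformations can be compared side by side. This is a minimal sketch using only the Python standard library; the function names and the sample values are chosen for illustration, and the standardization here assumes the population standard deviation.

```python
import statistics

def min_max_normalize(values):
    """Scale values to the range [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit variance: z = (x - mean) / sd."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return [(v - mean) / sd for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = min_max_normalize(data)
standardized = standardize(data)

print(normalized)
print(standardized)
```

The normalized values are bounded by 0 and 1, while the standardized values are unbounded and centered on 0. In both cases the relative spacing of the points is unchanged.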
Imputing vs Weight of Evidence
Imputation replaces missing data with estimated values based on other available information to preserve data completeness and reduce bias.
Weight of Evidence (WoE) converts each variable into categories (bins) and gives missing values a category of their own, so records with missing data are kept as information rather than dropped or filled with estimates.
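The sketch below contrasts the two approaches on a small made-up dataset. It assumes a binary target and a common WoE convention, the natural log of a bin's share of non-events over its share of events; the sign convention varies between sources. The bin cut-off of 50, the additive smoothing term, and all names are illustrative assumptions.

```python
import math
import statistics

# Hypothetical records: (income, target). Income may be missing (None);
# target is 1 for an event and 0 otherwise.
records = [
    (20, 1), (25, 1), (30, 0), (None, 1),
    (60, 0), (70, 0), (80, 0), (None, 0),
]

# Option 1: imputation. Replace each missing income with the median of the observed ones.
observed = [income for income, _ in records if income is not None]
median_income = statistics.median(observed)
imputed = [(median_income if income is None else income, target)
           for income, target in records]

# Option 2: Weight of Evidence. Bin the variable and give missing values their own bin.
def bin_label(income):
    if income is None:
        return "missing"
    return "low" if income < 50 else "high"  # the cut-off 50 is an arbitrary assumption

def weight_of_evidence(rows, smoothing=0.5):
    counts = {}  # bin label -> counts of events and non-events
    for income, target in rows:
        bucket = counts.setdefault(bin_label(income), {"event": 0, "non_event": 0})
        bucket["event" if target == 1 else "non_event"] += 1
    total_events = sum(b["event"] for b in counts.values())
    total_non_events = sum(b["non_event"] for b in counts.values())
    woe = {}
    for label, b in counts.items():
        # Additive smoothing (an assumption) avoids log(0) for bins with a zero count.
        event_share = (b["event"] + smoothing) / (total_events + smoothing * len(counts))
        non_event_share = (b["non_event"] + smoothing) / (total_non_events + smoothing * len(counts))
        woe[label] = math.log(non_event_share / event_share)
    return woe

woe = weight_of_evidence(records)
print(imputed)
print(woe)
```

With imputation the two incomplete records become indistinguishable from records with a median income, whereas the WoE encoding keeps "missing" as a separate category with its own value.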
Long Table vs Wide Table
A long table stores each data point as a separate row, with an identifier column to distinguish different series, allowing more flexible data representation.
A wide table condenses data so that each row represents an entire series, with multiple columns representing different data points, which can improve query performance for certain aggregations.
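To make the two layouts concrete, the sketch below reshapes a small long table into a wide one and back again. The data, the column names, and both helper functions are hypothetical, and the reverse conversion assumes the varying column holds integers.

```python
# Long format: one row per (series, period) observation.
long_rows = [
    {"city": "Oslo", "year": 2020, "sales": 10},
    {"city": "Oslo", "year": 2021, "sales": 12},
    {"city": "Lima", "year": 2020, "sales": 7},
    {"city": "Lima", "year": 2021, "sales": 9},
]

def long_to_wide(rows, key, column, value):
    """Collapse long rows into one row per key, with one column per distinct `column` value."""
    wide = {}
    for row in rows:
        target = wide.setdefault(row[key], {key: row[key]})
        target[f"{value}_{row[column]}"] = row[value]
    return list(wide.values())

def wide_to_long(rows, key, column, value):
    """Expand wide rows back into one row per observation (assumes integer column values)."""
    prefix = f"{value}_"
    long = []
    for row in rows:
        for name, cell in row.items():
            if name.startswith(prefix):
                long.append({key: row[key], column: int(name[len(prefix):]), value: cell})
    return long

wide_rows = long_to_wide(long_rows, key="city", column="year", value="sales")
round_trip = wide_to_long(wide_rows, key="city", column="year", value="sales")

print(wide_rows)
print(round_trip)
```

Adding a new year to the long table means adding rows, while the wide table needs a new column, which is one reason the long layout is considered more flexible.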