My Data Protocols

Interactive Blog on Data Management

SMART Questions

  • Specific: Clearly define the exact goal or objective without ambiguity
  • Measurable: Establish concrete criteria to track progress and quantify outcomes
  • Achievable: Set realistic and attainable targets within existing resources and constraints
  • Relevant: Ensure the goal aligns with broader organizational or personal objectives
  • Time-bound: Set a precise deadline or timeframe for completing the goal

Data Ethics

  • Consent: Obtaining explicit permission from individuals before collecting or using their personal data
  • Transaction transparency: Clearly communicating how data is collected, used, and shared
  • Openness: Providing clear and accessible information about data practices and processes
  • Privacy: Protecting individual data from unauthorized access or misuse
  • Ownership: Recognizing and respecting individuals' rights to their personal data
  • Currency: Ensuring data remains up-to-date, accurate, and relevant

Data Credibility

  • Reliability: Consistency of measurements across different collection instances
  • Originality: Direct sourcing from primary information sources without intermediary alterations
  • Comprehensiveness: Complete inclusion of all essential data elements required for accurate analysis
  • Currency: Timeliness and relevance of data to the current research or operational context
  • Citation: Formal acknowledgment and linkage to the original data production source

Data Solutions Questionnaire

  • Practical: Is it easy to implement?
  • Unintended Consequences: What side effects or downstream risks could this cause?
  • Logical: Does this make sense logically?
  • Precedent Backed: What happened when we tried this before?
  • Ethical: Is this the right thing to do?
  • Difference from Alternatives: Is it clearly better than the other options?

Metadata Types

  • Structural: Describes the internal organization and relationships within a data set
  • Administrative: Provides technical information about data management, creation, and preservation
  • Descriptive: Identifies and explains the content, context, and characteristics of data

Data Collection Best Practices

Systematic Data Management

  • Primary Recording: Capture raw data on paper as initial documentation
  • Digital Migration: Systematically transfer data to electronic format
  • Worksheet Optimization: Consolidate data in a single, structured worksheet

Structural Integrity

  • Implement a unique identifier (ID) column for precise record tracking
  • Allocate one column per distinct variable
  • Reserve first row for variable/column names
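The layout rules above can be sketched as a small table in pandas; the column names and values here are hypothetical, chosen only to illustrate one ID column, one column per variable, and variable names in the first row:

```python
import pandas as pd

# Hypothetical tidy table: unique "id" column, one column per variable,
# variable names as the header row.
records = pd.DataFrame(
    {
        "id": [1, 2, 3],                  # unique identifier per record
        "species": ["oak", "pine", "oak"],  # one variable per column
        "height_cm": [310, 275, 298],
    }
)

# The identifier column should uniquely distinguish every record.
assert records["id"].is_unique
```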

Quality Control

  • Completeness: Ensure every cell contains meaningful information
  • Meticulous Documentation: Maintain comprehensive and clear research notes
  • Standardization: Maintain consistent data entry protocols

Analytical Rigor

  • Precision: Avoid data speculation or unsubstantiated entries
  • Numerical Representation: Recognize zero (0) as a valid numerical value

Data Cleaning Protocol

Preservation

  • Create an unaltered backup of original dataset
  • Perform cleaning in a separate working table
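A minimal sketch of the preservation rule, using a small stand-in frame rather than a real file: the original dataset is never modified, and all cleaning happens on a separate working copy.

```python
import pandas as pd

# Stand-in for the original dataset (would normally be read from disk).
raw = pd.DataFrame({"id": [1, 2, 2], "value": [10.0, None, 20.0]})

# Clean in a separate working table; deep copy leaves the original intact.
working = raw.copy(deep=True)
working = working.drop_duplicates("id")          # example cleaning step
working["value"] = working["value"].fillna(0.0)  # example cleaning step

# The unaltered original remains available for audit.
assert raw.shape == (3, 2)
```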

Error Management

  • Systematic Error Tracking: Document and report all data anomalies
  • Utilize database functions for efficient and reliable data cleaning

Exploratory Data Analysis (EDA): Non-Sequential Iterative Approach

Discovery Phase

  • Rapidly scan dataset to understand fundamental characteristics and potential insights

Structural Assessment

  • Map data architecture, identifying variable types and potential inter-feature relationships

Validation Processes

  • Rigorously verify data integrity, statistical assumptions, and distribution consistency

Cleaning Techniques

  • Strategically handle missing values, outliers, and standardize data formats

Joining and Integration

  • Seamlessly merge datasets, ensuring referential integrity and comprehensive data unification

Presentation Preparation

  • Transform complex data into compelling visualizations and actionable insights

Iterative Refinement

  • Dynamically cycle through analysis stages, continuously challenging and improving initial assumptions

Key Principles

  • Maintain analytical flexibility, prioritizing deep data understanding over rigid methodological constraints
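A quick first pass over a dataset touches several of the phases above at once; this sketch, on a small hypothetical frame, covers discovery (shape), structural assessment (dtypes), and validation (missing-value check):

```python
import pandas as pd

# Hypothetical dataset for a first-pass EDA scan.
df = pd.DataFrame(
    {"age": [23, 35, 41, 29], "city": ["Pune", "Delhi", "Pune", "Agra"]}
)

print(df.shape)         # discovery: rows x columns
print(df.dtypes)        # structural assessment: variable types
print(df.describe())    # distribution summary for numeric columns
print(df.isna().sum())  # validation: missing values per column
```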

Normalization vs Standardization

  • Normalization scales data to a fixed range (typically 0-1) using min-max scaling, useful for algorithms that require bounded input features.
  • Standardization transforms data to have zero mean and unit variance, creating a standard normal distribution that helps handle outliers and works well with normally distributed data.
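The two scalings can be written in a few lines of NumPy; the input array here is hypothetical:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Normalization (min-max scaling): rescale to the [0, 1] range.
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance.
x_std = (x - x.mean()) / x.std()

print(x_norm)  # values now lie between 0 and 1
print(x_std)   # mean ~0, standard deviation ~1
```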

Imputing vs Weight of Evidence

  • Imputation replaces missing data with estimated values based on other available information to preserve data completeness and reduce bias.
  • Weight of Evidence (WoE): Bin continuous values into categories and treat missing data as a category of its own, so the absence of a value is preserved as information rather than masked by an estimate.
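The contrast can be sketched in pandas; the income values and bin edges here are hypothetical, and mean imputation stands in for whatever estimate is appropriate:

```python
import pandas as pd

income = pd.Series([50_000, None, 62_000, None, 48_000])

# Imputation: replace missing values with an estimate (here, the mean).
imputed = income.fillna(income.mean())

# WoE-style binning: cut values into categories and keep "missing" as
# its own category, so the absence of data still carries information.
bins = pd.cut(income, bins=[0, 55_000, 100_000], labels=["low", "high"])
bins = bins.cat.add_categories("missing").fillna("missing")

print(imputed.tolist())
print(bins.tolist())
```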

Long Table vs Wide Table

  • A long table stores each data point as a separate row, with an identifier column to distinguish different series, allowing more flexible data representation.
  • A wide table condenses data so that each line represents an entire series, with multiple columns representing different data points, which can improve query performance for certain aggregations.
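pandas converts between the two layouts with `melt` and `pivot`; the sales table below is hypothetical:

```python
import pandas as pd

# Wide layout: one row per store, one column per month.
wide = pd.DataFrame(
    {"store": ["A", "B"], "jan": [100, 80], "feb": [120, 90]}
)

# Wide -> long: each (store, month) pair becomes its own row,
# with "store" as the identifier column.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Long -> wide: condense back to one row per store.
back = long.pivot(index="store", columns="month", values="sales")

print(long)
print(back)
```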

© Kirtiman Gopanayak's Blog. All rights reserved.
