
About Me

Seasoned data engineering professional with 10+ years of experience building cloud-native platforms, modern lakehouse architectures, and large-scale data pipelines across the healthcare, finance, and enterprise domains. I specialize in designing and optimizing end-to-end ETL processes, real-time streaming solutions, and analytics-ready data models using Databricks, Snowflake, Spark, Kafka, and advanced transformation frameworks. Deeply experienced across Azure, AWS, and GCP, I focus on performance tuning, compliance, and data reliability, delivering ML-ready datasets that enable advanced analytics and driving automation through governance, CI/CD, and infrastructure-as-code practices.

Technical Skills

Programming & Scripting

  • Python (ETL, automation, validation)
  • Advanced SQL (CTEs, window functions, query tuning)
  • Scala, Java, Bash, Shell scripting

Data Engineering & Processing

  • Apache Spark, PySpark, Databricks
  • Apache Flink, Kafka, Spark Streaming
  • Structured Streaming, Apache Beam, Airflow, Luigi

ETL & Data Transformation

  • Talend, Apache NiFi, Informatica, SSIS
  • Azure Data Factory, dbt, Matillion, DataStage

Data Warehousing & Lakehouse

  • Snowflake, Amazon Redshift, Google BigQuery
  • Azure Synapse, Databricks Delta Lake
  • Hive, Presto, Trino

Data Modeling & Architecture

  • Star Schema, Snowflake Schema, Data Vault
  • Kimball & Inmon methodologies
  • Bronze-Silver-Gold layering, lakehouse architecture

Data Governance & Security

  • Apache Atlas, Alation, Collibra
  • Data lineage, PII masking, RBAC/ABAC access control
  • HIPAA, GDPR, SOC2 compliance

Cloud Platforms & Services

  • Azure: ADF, Synapse, HDInsight, Blob Storage, Databricks
  • AWS: S3, Glue, EMR, Lambda, Redshift, Athena, Kinesis
  • GCP: BigQuery, Dataflow, Dataproc, Pub/Sub

Databases & Storage

  • PostgreSQL, Oracle, SQL Server, MySQL, DB2
  • MongoDB, Cassandra, DynamoDB, Neo4j
  • InfluxDB, HBase, Elasticsearch, Parquet, ORC, Avro

Analytics & ML Enablement

  • BI-ready data marts, feature engineering pipelines
  • Scikit-learn, TensorFlow, PyTorch, MLflow
  • Power BI, Tableau, Looker, Qlik Sense

Key Projects

Healthcare Data Lakehouse Modernization

Company: Contour Software

Description: Migrated on-premises ETL workflows to a Databricks- and Snowflake-based lakehouse built on Delta Lake and dbt, structured in Bronze-Silver-Gold layers. Enforced governance with Apache Atlas and role-based security to ensure HIPAA compliance while enabling real-time patient data insights and predictive analytics.

Technologies: Databricks, Snowflake, Delta Lake, dbt, Apache Atlas, Azure
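
Below is a minimal PySpark sketch of the Bronze-Silver-Gold flow described above, assuming hypothetical mount paths, event fields, and a hash-based masking rule; it illustrates the layering pattern rather than the production code.

    # Minimal sketch of the medallion flow; paths and columns are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lakehouse-medallion").getOrCreate()

    # Bronze: land raw extracts as-is, stamped with ingestion metadata.
    raw = (
        spark.read.json("/mnt/raw/patient_events/")
        .withColumn("_ingested_at", F.current_timestamp())
    )
    raw.write.format("delta").mode("append").save("/mnt/bronze/patient_events")

    # Silver: deduplicate, conform types, and mask direct identifiers (PII).
    bronze = spark.read.format("delta").load("/mnt/bronze/patient_events")
    silver = (
        bronze.dropDuplicates(["patient_id", "event_id"])
        .withColumn("event_date", F.to_date("event_ts"))
        .withColumn("patient_id", F.sha2(F.col("patient_id").cast("string"), 256))
    )
    silver.write.format("delta").mode("overwrite").save("/mnt/silver/patient_events")

    # Gold: analytics-ready aggregate consumed by BI and ML workloads.
    gold = silver.groupBy("event_date", "event_type").agg(
        F.countDistinct("patient_id").alias("patients"),
        F.count("event_id").alias("events"),
    )
    gold.write.format("delta").mode("overwrite").save("/mnt/gold/daily_patient_events")

Each layer is written as a Delta table, so downstream consumers get ACID guarantees and time travel without extra tooling.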

Enterprise BI & Analytics Platform

Company: Contour Software

Description: Developed centralized data marts and semantic layers in Snowflake using dbt and automated ingestion pipelines with Airflow. Delivered KPIs through Power BI and Looker dashboards, improving self-service analytics and reducing reporting cycles by 40%.

Technologies: Snowflake, dbt, Airflow, Power BI, Looker
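
An illustrative Airflow DAG for the ingest-then-transform pattern behind these marts; the DAG id, script path, and dbt selectors are assumptions for the sketch, not the actual pipeline definitions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="snowflake_marts_daily",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Stage source extracts into Snowflake.
        ingest = BashOperator(
            task_id="ingest_raw",
            bash_command="python /opt/pipelines/load_raw_to_snowflake.py",
        )

        # Rebuild the dbt models that feed the Power BI / Looker semantic layer.
        dbt_run = BashOperator(
            task_id="dbt_run_marts",
            bash_command="cd /opt/dbt && dbt run --select marts",
        )

        # Validate outputs before dashboards refresh.
        dbt_test = BashOperator(
            task_id="dbt_test_marts",
            bash_command="cd /opt/dbt && dbt test --select marts",
        )

        ingest >> dbt_run >> dbt_test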

Real-Time Financial Fraud Detection

Company: VentureDive

Description: Built streaming pipelines with Kafka and Spark Streaming, integrated with AWS Lambda and S3-based alerting. Reduced detection latency by 60% and automated fraud risk workflows, empowering compliance teams with near real-time monitoring and proactive interventions.

Technologies: Kafka, Spark Streaming, AWS Lambda, S3
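
A sketch of the detection flow, shown here with Spark Structured Streaming; the broker address, topic name, and the 5-minute spend threshold are stand-ins for the real scoring logic, and the Lambda notification step is assumed to trigger off the S3 output.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("fraud-stream").getOrCreate()

    schema = (
        StructType()
        .add("txn_id", StringType())
        .add("account_id", StringType())
        .add("amount", DoubleType())
        .add("event_ts", TimestampType())
    )

    # Consume transaction events from Kafka and parse the JSON payload.
    txns = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
        .select("t.*")
    )

    # Flag accounts whose 5-minute spend exceeds a threshold (placeholder rule).
    alerts = (
        txns.withWatermark("event_ts", "10 minutes")
        .groupBy(F.window("event_ts", "5 minutes"), "account_id")
        .agg(F.sum("amount").alias("spend"))
        .where(F.col("spend") > 10000)
    )

    # Land alerts in S3, where the Lambda-based alerting picks them up.
    query = (
        alerts.writeStream.outputMode("append")
        .format("parquet")
        .option("path", "s3a://fraud-alerts/current/")
        .option("checkpointLocation", "s3a://fraud-alerts/_checkpoints/")
        .start()
    )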

Predictive Patient Readmission Modeling

Company: NorthBay Solutions

Description: Engineered ML-ready time-series datasets using Spark, Databricks, and Delta Lake for patient outcome prediction. Collaborated with data scientists leveraging Scikit-learn and MLflow to deploy models that improved readmission risk identification and early intervention planning.

Technologies: Spark, Databricks, Delta Lake, Scikit-learn, MLflow
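
A hypothetical sketch of that feature engineering: rolling 90-day utilization features per patient, computed with Spark window functions and written to Delta for the MLflow-tracked training jobs. Table and column names are placeholders.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("readmission-features").getOrCreate()

    admissions = spark.read.format("delta").load("/mnt/silver/admissions")

    # Rolling 90-day window per patient, ordered by admission timestamp.
    w_90d = (
        Window.partitionBy("patient_id")
        .orderBy(F.col("admit_ts").cast("long"))
        .rangeBetween(-90 * 24 * 3600, -1)
    )
    w_seq = Window.partitionBy("patient_id").orderBy("admit_ts")

    features = (
        admissions
        .withColumn("prior_admits_90d", F.count("admission_id").over(w_90d))
        .withColumn("avg_los_90d", F.avg("length_of_stay").over(w_90d))
        .withColumn(
            "days_since_last_discharge",
            F.datediff("admit_ts", F.lag("discharge_ts").over(w_seq)),
        )
    )

    features.write.format("delta").mode("overwrite").save("/mnt/gold/readmission_features")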


Professional Experience

Data Engineer Lead
Contour Software
03/2022 – Present
  • Architected enterprise-grade multi-cloud platforms leveraging Azure, GCP, and Databricks for healthcare clients
  • Orchestrated full-scale lakehouse ecosystem integrating Delta Lake with Snowflake to support BI, AI, and application workloads
  • Operationalized dbt for transformation lifecycle management and standardized Airflow-based orchestration across 100+ production pipelines
  • Enforced governance frameworks with Apache Atlas to ensure data lineage, regulatory compliance, and controlled access for HIPAA-driven use cases
  • Optimized compute and storage infrastructure through dynamic scaling, achieving cost reductions exceeding 35%
  • Mentored a team of 8 engineers across data engineering, DevOps, and MLOps
Cloud Data Engineer
VentureDive
08/2018 – 02/2022
  • Transitioned legacy ETL into Azure Data Factory pipelines, enabling scalable, event-driven workflows across multiple domains
  • Engineered Azure Data Lake Gen2 and Databricks pipelines supporting hybrid batch and streaming for patient data
  • Structured Delta Lake layers (Bronze, Silver, Gold) to deliver curated, governed, and reusable datasets
  • Automated deployments with Terraform and Azure DevOps, embedding CI/CD and validated Spark applications
  • Accelerated BI adoption by integrating Snowflake, reducing query latency by 55% and enhancing analytical efficiency
Big Data Specialist
NorthBay Solutions
01/2016 – 07/2018
  • Built scalable batch pipelines with Apache Spark and Hive to process structured and unstructured healthcare data
  • Implemented real-time ingestion using Kafka for patient monitoring sensors and medical device streams
  • Improved job runtimes by 45% through optimized joins, partitioning, and caching strategies
  • Consolidated datasets from EHR, lab systems, and insurance claims into a centralized Hadoop-based platform
  • Produced engineered time-series datasets enabling predictive models for clinical readmission risk
Data Engineer
CodeNinja
07/2014 – 12/2015
  • Automated ingestion pipelines using Talend, SQL Server, and Python to consolidate patient data across hospital systems
  • Standardized clinical records and diagnosis datasets into unified models supporting compliance and reporting
  • Reduced manual refresh effort by 60% through scheduled ETL workflows and automated validations
  • Implemented rigorous quality checks ensuring the integrity and reliability of downstream analytics (a sketch follows this list)
  • Designed dimensional data marts supporting KPIs and executive dashboards for hospital management
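
A simple sketch of that style of validation, assuming a hypothetical staged extract and column names; the real checks ran inside the scheduled Talend/Python ETL workflows.

    import pandas as pd

    # Load a staged patient extract (path and columns are placeholders).
    patients = pd.read_csv("/data/staging/patients.csv", parse_dates=["date_of_birth"])

    checks = {
        "null_patient_id": patients["patient_id"].isna().sum(),
        "duplicate_patient_id": patients["patient_id"].duplicated().sum(),
        "future_dob": (patients["date_of_birth"] > pd.Timestamp.today()).sum(),
    }

    failures = {name: int(n) for name, n in checks.items() if n > 0}
    if failures:
        # Halt the scheduled run so bad records never reach reporting marts.
        raise ValueError(f"Data quality checks failed: {failures}")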