Repository: git@github.com:nngidi/Call-Data-Usage-Project.git
This project focuses on integrating concepts and tools learned during the Data Engineering training to address real-world challenges. The emphasis is on combining technical skills, problem-solving, and teamwork to implement a resilient, cost-effective, and scalable data pipeline architecture. The tools and systems to be used include Redpanda, Kafka, Debezium, and ScyllaDB, among others, to manage, process, and analyze data streams effectively.
The primary goals are to:
- Solve Complex Data Engineering Problems: Utilize technical skills creatively to process, analyze, and store large volumes of data.
- Collaborate: Work in a team to design, implement, and present the solution.
- Demonstrate Skills: Build a fully operational data pipeline that integrates multiple systems and tools.
Tools and Technologies:
- Redpanda: A high-performance Kafka alternative for streaming and buffering data.
- Debezium: For change data capture (CDC) to track updates from the CRM database.
- ScyllaDB: A high-velocity store for real-time querying and storage of processed data.
- PostgreSQL: For raw and transformed data storage.
- Docker: To containerize and manage infrastructure components.
Data Ingestion and Streaming:
- Use Redpanda to stream raw CDR files from the provided SFTP server, and capture changes from the CRM database with Debezium (a minimal ingestion sketch follows this list).
- Ensure seamless, scalable data flow with minimal latency.
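As a concrete starting point, here is a minimal Python sketch of the SFTP-to-Redpanda leg. The host, credentials, directory, topic name, and broker address are illustrative assumptions, not values from the project brief; Redpanda is addressed through the standard Kafka protocol (here via kafka-python).

```python
import paramiko                      # SFTP client
from kafka import KafkaProducer      # Redpanda speaks the Kafka wire protocol

SFTP_HOST, SFTP_PORT = "sftp.example.com", 22   # assumed SFTP endpoint
REMOTE_DIR = "/outgoing/cdr"                    # assumed CDR drop directory
TOPIC = "cdr-raw"                               # assumed topic name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",         # assumed Redpanda broker
    value_serializer=lambda v: v.encode("utf-8"),
)

transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
transport.connect(username="cdr_user", password="secret")  # assumed credentials
sftp = paramiko.SFTPClient.from_transport(transport)

# Publish every line of every CDR file as one record on the topic.
for name in sftp.listdir(REMOTE_DIR):
    with sftp.open(f"{REMOTE_DIR}/{name}") as fh:
        for line in fh.read().decode("utf-8", errors="replace").splitlines():
            producer.send(TOPIC, value=line)

producer.flush()
sftp.close()
transport.close()
```

A production version would also track which files have already been ingested (for example by archiving or renaming them) so that restarts do not replay data.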
Stream Processing:
- Summarize Call Data Record (CDR) information into daily usage statistics per MSISDN.
- Compute real-time metrics and store them in ScyllaDB to meet high-velocity read/write requirements (see the aggregation sketch below).
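A minimal aggregation sketch, assuming CDR records arrive as semicolon-separated lines (msisdn;timestamp;duration_seconds) and that a ScyllaDB counter table daily_usage(msisdn, day, calls, duration) exists; all of these names and layouts are assumptions:

```python
from datetime import date

from cassandra.cluster import Cluster   # ScyllaDB is CQL-compatible
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "cdr-raw",                           # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed Redpanda broker
    group_id="daily-usage-aggregator",
    value_deserializer=lambda v: v.decode("utf-8"),
)

session = Cluster(["localhost"]).connect("usage")  # assumed node and keyspace

# Assumed counter table:
# CREATE TABLE daily_usage (msisdn text, day date, calls counter,
#                           duration counter, PRIMARY KEY (msisdn, day));
upsert = session.prepare(
    "UPDATE daily_usage SET calls = calls + 1, duration = duration + ? "
    "WHERE msisdn = ? AND day = ?"
)

for record in consumer:
    msisdn, ts, duration = record.value.split(";")   # assumed CDR layout
    day = date.fromisoformat(ts[:10])                # 'YYYY-MM-DD' prefix
    session.execute(upsert, (int(duration), msisdn, day))
```

Modeling the summary as counter columns lets ScyllaDB absorb high-velocity increments without a read-before-write, which is the point of using it for this layer.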
Prepared Layers and Analytics:
- Create intermediate data layers in PostgreSQL to enable efficient querying and reporting (a sketch of one prepared layer follows).
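One way to materialize a prepared layer, sketched with psycopg2; the raw.cdr source table, its columns, and the prepared schema are assumptions about how the raw data lands in PostgreSQL:

```python
import psycopg2

DDL = """
CREATE SCHEMA IF NOT EXISTS prepared;
CREATE TABLE IF NOT EXISTS prepared.daily_usage (
    msisdn        text   NOT NULL,
    day           date   NOT NULL,
    calls         bigint NOT NULL,
    total_seconds bigint NOT NULL,
    PRIMARY KEY (msisdn, day)
);
"""

# Rebuild the day-level summary from the (assumed) raw.cdr table.
REFRESH = """
INSERT INTO prepared.daily_usage (msisdn, day, calls, total_seconds)
SELECT msisdn, call_start::date, count(*), sum(duration_seconds)
FROM raw.cdr
GROUP BY msisdn, call_start::date
ON CONFLICT (msisdn, day)
DO UPDATE SET calls = EXCLUDED.calls,
              total_seconds = EXCLUDED.total_seconds;
"""

with psycopg2.connect(  # assumed connection settings
    "dbname=calls user=postgres password=secret host=localhost"
) as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(REFRESH)
```

The ON CONFLICT upsert keeps the refresh idempotent, so it can run on a schedule without duplicating rows.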
Real-Time API:
- Expose summarized usage data through a secure REST API using basic authentication (see the API sketch below).
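A minimal Flask sketch of the usage endpoint with HTTP basic authentication; the route shape, static credential, and ScyllaDB lookup mirror the assumptions above and are illustrative only:

```python
from datetime import date

from cassandra.cluster import Cluster
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
session = Cluster(["localhost"]).connect("usage")  # assumed ScyllaDB keyspace
lookup = session.prepare(
    "SELECT calls, duration FROM daily_usage WHERE msisdn = ? AND day = ?"
)

def require_basic_auth():
    # Single static credential purely for illustration; a real deployment
    # should verify against a user store and serve over HTTPS.
    auth = request.authorization
    if not auth or (auth.username, auth.password) != ("api_user", "secret"):
        abort(401)

@app.route("/usage/<msisdn>/<day>")
def daily_usage(msisdn, day):
    require_basic_auth()
    try:
        day = date.fromisoformat(day)
    except ValueError:
        abort(400)
    row = session.execute(lookup, (msisdn, day)).one()
    if row is None:
        abort(404)
    return jsonify(msisdn=msisdn, day=day.isoformat(),
                   calls=row.calls, duration=row.duration)

if __name__ == "__main__":
    app.run(port=8000)
```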
Pipeline Components:
- SFTP to Redpanda: Automate the ingestion of raw CDR files from SFTP into Redpanda.
- CDC with Debezium: Stream changes from the CRM PostgreSQL database to Redpanda (a connector-registration sketch follows this list).
- Real-Time Processing: Use a stream-processing framework to compute daily summaries and store them in ScyllaDB.
- High-Velocity Storage: Implement ScyllaDB to manage low-latency read/write operations for processed data.
- Usage API: Provide RESTful endpoints for external users to query daily statistics securely.
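Debezium connectors are typically registered through the Kafka Connect REST API. In the hedged sketch below, the connector name, database credentials, table list, and Connect endpoint are assumptions; the config keys themselves follow Debezium's PostgreSQL connector documentation.

```python
import requests

connector = {
    "name": "crm-cdc",  # assumed connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",            # assumed CRM database host
        "database.port": "5432",
        "database.user": "debezium",                # assumed credentials
        "database.password": "secret",
        "database.dbname": "crm",
        "topic.prefix": "crm",                      # topics become crm.<schema>.<table>
        "table.include.list": "public.customers",   # assumed table to capture
        "plugin.name": "pgoutput",                  # built-in PostgreSQL output plugin
    },
}

# Assumed Kafka Connect endpoint; adjust to the compose service address.
resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```

Once registered, Debezium emits one change event per committed row change, which Redpanda buffers for the downstream processors.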
Setup Environment:
- Install Docker, Git, and other dependencies.
- Clone the repository: `git clone git@github.com:nngidi/Call-Data-Usage-Project.git`
Initialize Docker Environment:
- Run `docker compose up -d` (or an equivalent command) to start all components.
- Run