Repository: git@github.com:nngidi/Call-Data-Usage-Project.git
This project focuses on integrating concepts and tools learned during the Data Engineering training to address real-world challenges. The emphasis is on combining technical skills, problem-solving, and teamwork to implement a resilient, cost-effective, and scalable data pipeline architecture. The tools and systems to be used include Redpanda, Kafka, Debezium, and ScyllaDB, among others, to manage, process, and analyze data streams effectively.
The primary goals are to:
- Solve Complex Data Engineering Problems: Utilize technical skills creatively to process, analyze, and store large volumes of data.
- Collaborate: Work in a team to design, implement, and present the solution.
- Demonstrate Skills: Build a fully operational data pipeline that integrates multiple systems and tools.
Tools and Technologies:
- Redpanda: A high-performance Kafka alternative for streaming and buffering data.
- Debezium: For change data capture (CDC) to track updates from the CRM database.
- ScyllaDB: A high-velocity store for real-time querying and storage of processed data.
- PostgreSQL: For raw and transformed data storage.
- Docker: To containerize and manage infrastructure components.
Data Ingestion and Streaming:
- Use Redpanda to stream raw CDR files from the provided SFTP server, and capture changes from the CRM database with Debezium (a minimal ingestion sketch follows this list).
- Ensure seamless, scalable data flow with minimal latency.
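As a concrete starting point, here is a minimal Python sketch of the SFTP-to-Redpanda leg. The host, credentials, directory, topic name, and broker address are illustrative assumptions, not values from the project brief; Redpanda is addressed through the standard Kafka protocol (here via kafka-python).

```python
import paramiko                      # SFTP client
from kafka import KafkaProducer      # Redpanda speaks the Kafka wire protocol

SFTP_HOST, SFTP_PORT = "sftp.example.com", 22   # assumed SFTP endpoint
REMOTE_DIR = "/outgoing/cdr"                    # assumed CDR drop directory
TOPIC = "cdr-raw"                               # assumed topic name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",         # assumed Redpanda broker
    value_serializer=lambda v: v.encode("utf-8"),
)

transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
transport.connect(username="cdr_user", password="secret")  # assumed credentials
sftp = paramiko.SFTPClient.from_transport(transport)

# Publish every line of every CDR file as one record on the topic.
for name in sftp.listdir(REMOTE_DIR):
    with sftp.open(f"{REMOTE_DIR}/{name}") as fh:
        for line in fh.read().decode("utf-8", errors="replace").splitlines():
            producer.send(TOPIC, value=line)

producer.flush()
sftp.close()
transport.close()
```

A production version would also track which files have already been ingested (for example by archiving or renaming them) so that restarts do not replay data.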
Stream Processing:
- Summarize Call Data Record (CDR) information into daily usage statistics per MSISDN.
- Compute real-time metrics and store them in ScyllaDB to meet high-velocity read/write requirements (see the aggregation sketch below).
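A minimal aggregation sketch, assuming CDR records arrive as semicolon-separated lines (msisdn;timestamp;duration_seconds) and that a ScyllaDB counter table daily_usage(msisdn, day, calls, duration) exists; all of these names and layouts are assumptions:

```python
from datetime import date

from cassandra.cluster import Cluster   # ScyllaDB is CQL-compatible
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "cdr-raw",                           # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed Redpanda broker
    group_id="daily-usage-aggregator",
    value_deserializer=lambda v: v.decode("utf-8"),
)

session = Cluster(["localhost"]).connect("usage")  # assumed node and keyspace

# Assumed counter table:
# CREATE TABLE daily_usage (msisdn text, day date, calls counter,
#                           duration counter, PRIMARY KEY (msisdn, day));
upsert = session.prepare(
    "UPDATE daily_usage SET calls = calls + 1, duration = duration + ? "
    "WHERE msisdn = ? AND day = ?"
)

for record in consumer:
    msisdn, ts, duration = record.value.split(";")   # assumed CDR layout
    day = date.fromisoformat(ts[:10])                # 'YYYY-MM-DD' prefix
    session.execute(upsert, (int(duration), msisdn, day))
```

Modeling the summary as counter columns lets ScyllaDB absorb high-velocity increments without a read-before-write, which is the point of using it for this layer.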
Prepared Layers and Analytics:
- Create intermediate data layers in PostgreSQL to enable efficient querying and reporting (a sketch of one prepared layer follows).
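One way to materialize a prepared layer, sketched with psycopg2; the raw.cdr source table, its columns, and the prepared schema are assumptions about how the raw data lands in PostgreSQL:

```python
import psycopg2

DDL = """
CREATE SCHEMA IF NOT EXISTS prepared;
CREATE TABLE IF NOT EXISTS prepared.daily_usage (
    msisdn        text   NOT NULL,
    day           date   NOT NULL,
    calls         bigint NOT NULL,
    total_seconds bigint NOT NULL,
    PRIMARY KEY (msisdn, day)
);
"""

# Rebuild the day-level summary from the (assumed) raw.cdr table.
REFRESH = """
INSERT INTO prepared.daily_usage (msisdn, day, calls, total_seconds)
SELECT msisdn, call_start::date, count(*), sum(duration_seconds)
FROM raw.cdr
GROUP BY msisdn, call_start::date
ON CONFLICT (msisdn, day)
DO UPDATE SET calls = EXCLUDED.calls,
              total_seconds = EXCLUDED.total_seconds;
"""

with psycopg2.connect(  # assumed connection settings
    "dbname=calls user=postgres password=secret host=localhost"
) as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(REFRESH)
```

The ON CONFLICT upsert keeps the refresh idempotent, so it can run on a schedule without duplicating rows.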
Real-Time API:
- Expose summarized usage data through a secure REST API using basic authentication (see the API sketch below).
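A minimal Flask sketch of the usage endpoint with HTTP basic authentication; the route shape, static credential, and ScyllaDB lookup mirror the assumptions above and are illustrative only:

```python
from datetime import date

from cassandra.cluster import Cluster
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
session = Cluster(["localhost"]).connect("usage")  # assumed ScyllaDB keyspace
lookup = session.prepare(
    "SELECT calls, duration FROM daily_usage WHERE msisdn = ? AND day = ?"
)

def require_basic_auth():
    # Single static credential purely for illustration; a real deployment
    # should verify against a user store and serve over HTTPS.
    auth = request.authorization
    if not auth or (auth.username, auth.password) != ("api_user", "secret"):
        abort(401)

@app.route("/usage/<msisdn>/<day>")
def daily_usage(msisdn, day):
    require_basic_auth()
    try:
        day = date.fromisoformat(day)
    except ValueError:
        abort(400)
    row = session.execute(lookup, (msisdn, day)).one()
    if row is None:
        abort(404)
    return jsonify(msisdn=msisdn, day=day.isoformat(),
                   calls=row.calls, duration=row.duration)

if __name__ == "__main__":
    app.run(port=8000)
```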
Pipeline Components:
- SFTP to Redpanda: Automate the ingestion of raw CDR files from SFTP into Redpanda.
- CDC with Debezium: Stream changes from the CRM PostgreSQL database to Redpanda (a connector-registration sketch follows this list).
- Real-Time Processing: Use a stream-processing framework to compute daily summaries and store them in ScyllaDB.
- High-Velocity Storage: Implement ScyllaDB to manage low-latency read/write operations for processed data.
- Usage API: Provide RESTful endpoints for external users to query daily statistics securely.
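Debezium connectors are typically registered through the Kafka Connect REST API. In the hedged sketch below, the connector name, database credentials, table list, and Connect endpoint are assumptions; the config keys themselves follow Debezium's PostgreSQL connector documentation.

```python
import requests

connector = {
    "name": "crm-cdc",  # assumed connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",            # assumed CRM database host
        "database.port": "5432",
        "database.user": "debezium",                # assumed credentials
        "database.password": "secret",
        "database.dbname": "crm",
        "topic.prefix": "crm",                      # topics become crm.<schema>.<table>
        "table.include.list": "public.customers",   # assumed table to capture
        "plugin.name": "pgoutput",                  # built-in PostgreSQL output plugin
    },
}

# Assumed Kafka Connect endpoint; adjust to the compose service address.
resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```

Once registered, Debezium emits one change event per committed row change, which Redpanda buffers for the downstream processors.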
Setup Environment:
- Install Docker, Git, and other dependencies.
- Clone the repository: `git clone git@github.com:nngidi/Call-Data-Usage-Project.git`
Initialize Docker Environment:
- Run `docker compose up -d` (or an equivalent command) to start all components.
- Run