Real-time Data Analytics Pipeline

Develop a streaming data pipeline that retrieves English-language data from an API. Produce this data in real time to an Apache Kafka topic. Finally, build a Spark Streaming application that consumes the records from Kafka and counts the number of words in each record in real time.

Kafka Project

Built with Apache Kafka, Confluent, Python, and AWS S3.

This repository contains a real-time data analytics pipeline built using Apache Kafka.

Technologies Used

  • Apache Kafka: A distributed event streaming platform used for building real-time data pipelines and streaming applications.

  • Virtualenv: A tool to create isolated Python environments.

  • Confluent Cloud: Confluent's fully managed cloud service for Apache Kafka.

Confluent Setup

Before running the Kafka producers and consumers, you need to set up your Confluent environment.

  1. Create a virtual environment and activate it:

    virtualenv env
    source env/bin/activate
  2. Install the Confluent Kafka Python client:

    pip install confluent-kafka
  3. Configure your file.ini with your Confluent Cloud API keys and cluster settings:

    [default]
    bootstrap.servers=<BOOTSTRAP SERVER>
    security.protocol=SASL_SSL
    sasl.mechanisms=PLAIN
    sasl.username=<CLUSTER API KEY>
    sasl.password=<CLUSTER API SECRET>
    
    [consumer]
    group.id=python_example_group_1
    auto.offset.reset=earliest
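
For reference, here is a minimal sketch of how a script might load file.ini with Python's configparser and split it into producer and consumer configurations (the helper name read_config is illustrative, not part of this repository):

    import configparser

    def read_config(path):
        # [default] holds the shared connection settings; [consumer]
        # layers consumer-specific options (group id, offset reset) on top.
        parser = configparser.ConfigParser()
        parser.read(path)
        producer_conf = dict(parser["default"])
        consumer_conf = {**producer_conf, **dict(parser["consumer"])}
        return producer_conf, consumer_conf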

How to Run

To run the Kafka producers and consumers, follow these steps:

For Producer

  1. Make the producer script executable:

    chmod u+x producer.py
  2. Run the producer script with your file.ini:

    ./producer.py file.ini
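
A minimal sketch of what producer.py might contain, assuming a placeholder topic name example_topic and sample payloads in place of the real API data:

    #!/usr/bin/env python
    import sys
    import configparser
    from confluent_kafka import Producer

    def delivery_report(err, msg):
        # Report per-message delivery success or failure.
        if err is not None:
            print(f"Delivery failed: {err}")
        else:
            print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

    if __name__ == "__main__":
        # Read connection settings from the INI file given on the command line.
        parser = configparser.ConfigParser()
        parser.read(sys.argv[1])
        producer = Producer(dict(parser["default"]))

        # Placeholder records; the real script would stream English-language
        # text fetched from the API.
        for record in ["hello kafka", "streaming pipelines in real time"]:
            producer.produce("example_topic", value=record, callback=delivery_report)

        # Block until all queued messages are delivered.
        producer.flush()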

For Consumer

  1. Make the consumer script executable:

    chmod u+x consumer.py
  2. Run the consumer script with your file.ini:

    ./consumer.py file.ini
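
And a matching sketch for consumer.py, again with example_topic as a placeholder topic name:

    #!/usr/bin/env python
    import sys
    import configparser
    from confluent_kafka import Consumer

    if __name__ == "__main__":
        # Merge the [default] connection settings with the [consumer] options.
        parser = configparser.ConfigParser()
        parser.read(sys.argv[1])
        conf = {**dict(parser["default"]), **dict(parser["consumer"])}

        consumer = Consumer(conf)
        consumer.subscribe(["example_topic"])
        try:
            while True:
                msg = consumer.poll(1.0)  # wait up to one second for a record
                if msg is None:
                    continue
                if msg.error():
                    print(f"Consumer error: {msg.error()}")
                    continue
                print(msg.value().decode("utf-8"))
        except KeyboardInterrupt:
            pass
        finally:
            # Commit final offsets and leave the consumer group cleanly.
            consumer.close()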

Make sure to run the producer and consumer in separate terminal (e.g. Git Bash) sessions.
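
The final stage described above counts the words in each record with Spark Streaming. A minimal Structured Streaming sketch, assuming the spark-sql-kafka connector is on the classpath and reusing the placeholder topic name example_topic (connecting to Confluent Cloud would additionally require passing the SASL settings from file.ini as kafka.* options):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, size, split

    spark = SparkSession.builder.appName("WordCountPerRecord").getOrCreate()

    # Subscribe to the Kafka topic; the bootstrap server is a placeholder.
    records = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "<BOOTSTRAP SERVER>")
               .option("subscribe", "example_topic")
               .load())

    # Kafka values arrive as bytes: cast to string, then count the
    # whitespace-separated words in each record.
    counts = (records
              .selectExpr("CAST(value AS STRING) AS text")
              .withColumn("word_count", size(split(col("text"), r"\s+"))))

    # Emit each record and its word count to the console as it arrives.
    query = (counts.writeStream
             .outputMode("append")
             .format("console")
             .start())

    query.awaitTermination()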

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Thanks to the Apache Kafka and Confluent communities for their excellent tools and documentation.
