Educational Apache Spark Project
This Java code is part of an educational project using Apache Spark to process and analyze educational data. The code is structured to read data from CSV files, perform transformations, and execute both batch and streaming queries. Here’s a detailed explanation of the code:
- **Package and Imports**: The code begins by declaring the package and importing the necessary classes from the Apache Spark library. These imports include classes for Spark SQL, structured streaming, and data types.
- **Class Declaration**: The main class `SparkGroup38` is declared, which contains the `main` method and the other components described below.
- **Constants**: A constant `numCourses` is defined, which is used later in the streaming query.
- **Main Method**: The `main` method is the entry point of the application. It accepts command-line arguments for the Spark master URL and the file path.
- **Spark Session Initialization**: A Spark session is created with the specified master URL and application name. The log level is set to “ERROR” to reduce verbosity.
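A minimal sketch of this setup is shown below; the application name, the constant's value, and the argument handling are assumptions, since the original source is not reproduced here:

```java
import org.apache.spark.sql.SparkSession;

public class SparkGroup38 {
    // Used later by the streaming query (the value here is an assumption).
    private static final int numCourses = 10;

    public static void main(String[] args) {
        // Command-line arguments: Spark master URL and input file path.
        final String master = args.length > 0 ? args[0] : "local[4]";
        final String filePath = args.length > 1 ? args[1] : "./files/";

        // Create the session against the given master URL.
        SparkSession spark = SparkSession.builder()
                .master(master)
                .appName("SparkGroup38")
                .getOrCreate();

        // Reduce log verbosity to errors only.
        spark.sparkContext().setLogLevel("ERROR");
    }
}
```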
- **Schema Definition**: Schemas for the three datasets (`profs`, `courses`, and `videos`) are defined using `StructType` and `StructField`. These schemas specify the structure of the CSV files to be read.
- **Reading CSV Files**: The code reads three CSV files (`profs.csv`, `courses.csv`, and `videos.csv`) into Spark DataFrames using the defined schemas. The `option` method is used to specify that the CSV files do not have headers and use commas as delimiters.
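The schema definition and CSV reading for one of the files might look like the sketch below; the column names and types are assumptions inferred from the description, and `spark` and `filePath` are assumed to be in scope:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema for profs.csv (assumed columns: professor name, course taught).
StructType profsSchema = new StructType(new StructField[]{
        new StructField("prof_name", DataTypes.StringType, false, Metadata.empty()),
        new StructField("course_name", DataTypes.StringType, false, Metadata.empty())
});

// Headerless, comma-delimited CSV parsed with the explicit schema.
Dataset<Row> profs = spark.read()
        .option("header", "false")
        .option("delimiter", ",")
        .schema(profsSchema)
        .csv(filePath + "profs.csv");

// courses.csv and videos.csv are read the same way with their own schemas.
```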
- **Reading Streaming Data**: A streaming DataFrame `visualizations` is created using the `rate` source, which generates rows at a specified rate (`rowsPerSecond`). The `value` column is used to simulate video IDs.
- **Data Transformation and Caching**:
  - `videosWithCourse`: created by joining the `videos` and `courses` DataFrames on the `course_name` column. It is cached to avoid recomputation in subsequent queries.
  - `profWithCourse`: created by joining the `profs` and `courses` DataFrames on the `course_name` column. It is used only once, so it is not cached.
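The rate source and the two joins could be sketched as follows; the modulo used to fold the ever-increasing `value` counter into a video-ID range is an assumption, as is the range itself:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Synthetic stream: `rate` emits (timestamp, value) rows at a fixed pace.
Dataset<Row> visualizations = spark.readStream()
        .format("rate")
        .option("rowsPerSecond", 10)
        .load()
        // Fold the unbounded counter into a plausible video ID range (assumed).
        .withColumn("video_id", col("value").mod(100));

// Reused by both streaming queries, so cache it.
Dataset<Row> videosWithCourse = videos.join(courses, "course_name").cache();

// Used only once (by query Q1), so no cache.
Dataset<Row> profWithCourse = profs.join(courses, "course_name");
```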
- **Query Q1**: This batch query computes the total number of lecture hours per professor. It groups the `profWithCourse` DataFrame by `prof_name` and sums `course_hours`. The result is displayed using the `show` method.
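Assuming the column names above, Q1 reduces to a single grouped aggregation; the output alias is a hypothetical choice:

```java
import static org.apache.spark.sql.functions.sum;

// Q1 (batch): total lecture hours per professor.
profWithCourse
        .groupBy("prof_name")
        .agg(sum("course_hours").as("total_hours"))
        .show();
```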
- **Query Q2**: This streaming query computes the total duration of all visualizations of videos for each course over a one-minute window, updated every 10 seconds. It joins the `visualizations` DataFrame with `videosWithCourse`, groups by a time window and `course_name`, and sums `video_duration`. The result is written to the console in “update” mode.
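A sketch of Q2 under the assumptions above (the `timestamp` column comes from the `rate` source; the join key and output alias are assumed):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.streaming.StreamingQuery;

// Q2 (streaming): total watched duration per course over a 1-minute
// sliding window, recomputed every 10 seconds.
StreamingQuery q2 = visualizations
        .join(videosWithCourse, "video_id")
        .groupBy(window(col("timestamp"), "1 minute", "10 seconds"),
                 col("course_name"))
        .agg(sum("video_duration").as("total_duration"))
        .writeStream()
        .outputMode("update")
        .format("console")
        .start();
```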
- **Query Q3**: This streaming query computes the total number of visualizations of each video relative to the number of students in the course. It joins the `visualizations` DataFrame with `videosWithCourse`, groups by `video_id` and `course_students`, counts the visualizations, and computes the ratio of visualizations to students. The result is written to the console in “update” mode.
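Q3 follows the same stream-static join pattern; the ratio column name is a hypothetical choice, and `count` is the default name Spark gives the aggregate column:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.streaming.StreamingQuery;

// Q3 (streaming): visualizations per video, normalized by class size.
StreamingQuery q3 = visualizations
        .join(videosWithCourse, "video_id")
        .groupBy(col("video_id"), col("course_students"))
        .count()
        .withColumn("views_per_student",
                    col("count").divide(col("course_students")))
        .writeStream()
        .outputMode("update")
        .format("console")
        .start();
```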
- **Await Termination**: The code waits for the streaming queries (`q2` and `q3`) to terminate. If an exception occurs, it is caught and printed.
- **Spark Session Close**: Finally, the Spark session is closed to release resources.
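The shutdown sequence described above might look like this sketch, assuming `q2`, `q3`, and `spark` are in scope:

```java
import org.apache.spark.sql.streaming.StreamingQueryException;

try {
    // Block until each streaming query stops (or fails).
    q2.awaitTermination();
    q3.awaitTermination();
} catch (StreamingQueryException e) {
    e.printStackTrace();
}

// Release cluster resources.
spark.close();
```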
Overall, this code demonstrates how to use Apache Spark for both batch and streaming data processing in an educational context, providing insights into lecture hours, video visualizations, and student engagement.