Educational Apache Spark Project
This Java code is part of an educational project using Apache Spark to process and analyze educational data. The code is structured to read data from CSV files, perform transformations, and execute both batch and streaming queries. Here’s a detailed explanation of the code:
- **Package and Imports:** The code begins by declaring the package and importing the necessary classes from the Apache Spark library, including classes for Spark SQL, structured streaming, and data types.
- **Class Declaration:** The main class `SparkGroup38` is declared; it contains the `main` method and the other components of the application.
- **Constants:** A constant `numCourses` is defined; it is used later in the streaming queries.
- **Main Method:** The `main` method is the entry point of the application. It accepts command-line arguments for the Spark master URL and the input file path.
- **Spark Session Initialization:** A Spark session is created with the specified master URL and application name, and the log level is set to "ERROR" to reduce verbosity.
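A minimal sketch of this step, assuming the application name `SparkGroup38` and that the master URL is the first command-line argument (the argument order is an assumption):

```java
import org.apache.spark.sql.SparkSession;

public class SessionSketch {
    // Builds the session with the given master URL and the app name "SparkGroup38"
    static SparkSession build(String master) {
        SparkSession spark = SparkSession.builder()
                .master(master)
                .appName("SparkGroup38")
                .getOrCreate();
        // Reduce log verbosity, as described above
        spark.sparkContext().setLogLevel("ERROR");
        return spark;
    }

    public static void main(String[] args) {
        // Assumed argument order: args[0] = master URL (fall back to local mode)
        String master = args.length > 0 ? args[0] : "local[*]";
        SparkSession spark = build(master);
        System.out.println(spark.sparkContext().appName());
        spark.close();
    }
}
```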
- **Schema Definition:** Schemas for the three datasets (`profs`, `courses`, and `videos`) are defined using `StructType` and `StructField`. These schemas describe the structure of the CSV files to be read.
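As a concrete illustration, the `profs` schema might look like the following; the exact field names and types are assumptions inferred from the columns mentioned in this write-up:

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SchemaSketch {
    // Hypothetical layout of profs.csv: professor name, then the course taught
    static StructType profsSchema() {
        return new StructType(new StructField[]{
                DataTypes.createStructField("prof_name", DataTypes.StringType, false),
                DataTypes.createStructField("course_name", DataTypes.StringType, false),
        });
    }

    public static void main(String[] args) {
        // Each StructField carries a name, a data type, and a nullability flag
        System.out.println(profsSchema().treeString());
    }
}
```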
- **Reading CSV Files:** The code reads three CSV files (`profs.csv`, `courses.csv`, and `videos.csv`) into Spark DataFrames using the defined schemas. The `option` method specifies that the files have no header row and use commas as delimiters.
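A sketch of the CSV reading step, using an assumed three-column layout for `courses.csv` and a temporary sample file so the example is self-contained:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvReadSketch {
    // Hypothetical layout of courses.csv: name, lecture hours, enrolled students
    static final StructType COURSES_SCHEMA = new StructType(new StructField[]{
            DataTypes.createStructField("course_name", DataTypes.StringType, false),
            DataTypes.createStructField("course_hours", DataTypes.IntegerType, false),
            DataTypes.createStructField("course_students", DataTypes.IntegerType, false),
    });

    static Dataset<Row> readCourses(SparkSession spark, String path) {
        return spark.read()
                .option("header", "false")   // the CSV files have no header row
                .option("delimiter", ",")    // comma-separated, as described above
                .schema(COURSES_SCHEMA)
                .csv(path);
    }

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("CsvReadSketch").getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        // Write a tiny sample file so the example runs anywhere
        Path tmp = Files.createTempFile("courses", ".csv");
        Files.write(tmp, Arrays.asList("SE2,40,30", "ML,60,90"));

        readCourses(spark, tmp.toString()).show();
        spark.close();
    }
}
```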
- **Reading Streaming Data:** A streaming DataFrame `visualizations` is created using the `rate` source, which generates rows at a specified rate (`rowsPerSecond`). The `value` column is used to simulate video IDs.
- **Data Transformation and Caching:**
  - `videosWithCourse`: created by joining the `videos` and `courses` DataFrames on the `course_name` column. It is cached to avoid recomputation in the queries that follow.
  - `profWithCourse`: created by joining the `profs` and `courses` DataFrames on the `course_name` column. It is used only once, so it is not cached.
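The rate source and the join-and-cache step can be sketched as follows. The in-memory stand-in data and the modulo used to fold `value` onto a bounded set of video IDs are assumptions, since the original column layouts are not shown here:

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JoinSketch {
    static SparkSession local() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("JoinSketch").getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");
        return spark;
    }

    // The rate source emits a monotonically increasing "value" column
    static boolean streaming() {
        Dataset<Row> visualizations = local().readStream()
                .format("rate")
                .option("rowsPerSecond", 10)
                .load()
                // Assumed mapping from the unbounded counter to a video id
                .withColumn("video_id", col("value").mod(100));
        return visualizations.isStreaming();
    }

    // videosWithCourse: join videos and courses on course_name, then cache
    static long joinedCount() {
        SparkSession spark = local();
        StructType videosSchema = new StructType(new StructField[]{
                DataTypes.createStructField("video_id", DataTypes.IntegerType, false),
                DataTypes.createStructField("video_duration", DataTypes.IntegerType, false),
                DataTypes.createStructField("course_name", DataTypes.StringType, false),
        });
        StructType coursesSchema = new StructType(new StructField[]{
                DataTypes.createStructField("course_name", DataTypes.StringType, false),
                DataTypes.createStructField("course_students", DataTypes.IntegerType, false),
        });
        Dataset<Row> videos = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, 300, "SE2"),
                RowFactory.create(2, 600, "SE2"),
                RowFactory.create(3, 450, "ML")), videosSchema);
        Dataset<Row> courses = spark.createDataFrame(Arrays.asList(
                RowFactory.create("SE2", 30),
                RowFactory.create("ML", 90)), coursesSchema);

        // Cached because both streaming queries reuse it
        Dataset<Row> videosWithCourse = videos.join(courses, "course_name").cache();
        return videosWithCourse.count();
    }

    public static void main(String[] args) {
        System.out.println(streaming());
        System.out.println(joinedCount());
    }
}
```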
- **Query Q1:** This batch query computes the total number of lecture hours per professor. It groups the `profWithCourse` DataFrame by `prof_name` and sums `course_hours`. The result is displayed with the `show` method.
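The aggregation behind Q1 can be sketched on a small in-memory stand-in for `profWithCourse`; the sample rows are invented for illustration:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Q1Sketch {
    // Q1: total lecture hours per professor
    static long hoursFor(String prof) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("Q1Sketch").getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("prof_name", DataTypes.StringType, false),
                DataTypes.createStructField("course_hours", DataTypes.IntegerType, false),
        });
        // Invented stand-in for profWithCourse: Alice teaches two courses
        Dataset<Row> profWithCourse = spark.createDataFrame(Arrays.asList(
                RowFactory.create("Alice", 40),
                RowFactory.create("Alice", 20),
                RowFactory.create("Bob", 30)), schema);

        Dataset<Row> q1 = profWithCourse
                .groupBy(col("prof_name"))
                .agg(sum(col("course_hours")).as("total_hours"));

        return q1.filter(col("prof_name").equalTo(prof)).first().getLong(1);
    }

    public static void main(String[] args) {
        System.out.println("Alice: " + hoursFor("Alice")); // 40 + 20 = 60
    }
}
```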
- **Query Q2:** This streaming query computes the total duration of all visualizations of videos for each course over a one-minute window, updated every 10 seconds. It joins the `visualizations` DataFrame with `videosWithCourse`, groups by a time window and `course_name`, and sums `video_duration`. The result is written to the console in "update" mode.
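The windowed aggregation at the core of Q2 can be shown on a static DataFrame; the streaming version differs only in using `readStream`/`writeStream` with `outputMode("update")`. The timestamps and durations below are invented:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;
import static org.apache.spark.sql.functions.window;

import java.sql.Timestamp;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Q2Sketch {
    // One-minute window sliding every 10 seconds, total duration per course
    static Dataset<Row> perCourseDuration(Dataset<Row> joined) {
        return joined
                .groupBy(window(col("timestamp"), "1 minute", "10 seconds"),
                         col("course_name"))
                .agg(sum(col("video_duration")).as("total_duration"));
    }

    static long windowCount() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("Q2Sketch").getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("timestamp", DataTypes.TimestampType, false),
                DataTypes.createStructField("course_name", DataTypes.StringType, false),
                DataTypes.createStructField("video_duration", DataTypes.IntegerType, false),
        });
        // A single visualization event; with a 60 s window sliding every 10 s,
        // one event falls into 60 / 10 = 6 overlapping windows
        Dataset<Row> joined = spark.createDataFrame(Arrays.asList(
                RowFactory.create(Timestamp.valueOf("2024-01-01 10:00:15"), "SE2", 300)),
                schema);

        return perCourseDuration(joined).count();
    }

    public static void main(String[] args) {
        System.out.println(windowCount());
    }
}
```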
- **Query Q3:** This streaming query computes the total number of visualizations of each video relative to the number of students in the course. It joins the `visualizations` DataFrame with `videosWithCourse`, groups by `video_id` and `course_students`, counts the visualizations, and computes the ratio of visualizations to students. The result is written to the console in "update" mode.
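The count-and-ratio logic of Q3, again sketched on a static stand-in for the joined stream (the event rows are invented):

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Q3Sketch {
    // Q3: visualizations per video, divided by the number of students in the course
    static double ratioForVideo(int videoId) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("Q3Sketch").getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("video_id", DataTypes.IntegerType, false),
                DataTypes.createStructField("course_students", DataTypes.IntegerType, false),
        });
        // Invented events: video 1 (30 students) watched three times, video 2 (90) once
        Dataset<Row> joined = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, 30),
                RowFactory.create(1, 30),
                RowFactory.create(1, 30),
                RowFactory.create(2, 90)), schema);

        Dataset<Row> q3 = joined
                .groupBy(col("video_id"), col("course_students"))
                .count()
                .withColumn("ratio", col("count").divide(col("course_students")));

        // Columns: video_id, course_students, count, ratio
        return q3.filter(col("video_id").equalTo(videoId)).first().getDouble(3);
    }

    public static void main(String[] args) {
        System.out.println(ratioForVideo(1)); // 3 views / 30 students = 0.1
    }
}
```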
- **Await Termination:** The code waits for the streaming queries (`q2` and `q3`) to terminate; if an exception occurs, it is caught and printed.
- **Spark Session Close:** Finally, the Spark session is closed to release resources.
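A sketch of this shutdown handling, using a throwaway `rate`-to-console query so the example is self-contained. The original code calls the blocking `awaitTermination()` on `q2` and `q3`; the timeout variant is used here only so the sketch returns instead of blocking forever:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class AwaitSketch {
    static boolean runBriefly() throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("AwaitSketch").getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        StreamingQuery q = spark.readStream()
                .format("rate").option("rowsPerSecond", 1).load()
                .writeStream().format("console").outputMode("append")
                .start();
        try {
            // Wait up to 2 seconds; returns false if the query is still running
            return q.awaitTermination(2000);
        } catch (Exception e) {
            // The original code catches and prints streaming exceptions like this
            e.printStackTrace();
            return false;
        } finally {
            q.stop();
            spark.close(); // release resources, as in the original code
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runBriefly());
    }
}
```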
Overall, this code demonstrates how to use Apache Spark for both batch and streaming data processing in an educational context, providing insights into lecture hours, video visualizations, and student engagement.