
Updated 4/30/2019

Install and Configure Spark on My Windows 10

I’ve installed Anaconda Python with the PySpark and findspark modules. I always use my “main” virtual env to do everything and leave my “base” env alone. See my other page for details.

I’ve read many Spark installation guides for Windows, and I always wondered why winutils (the Windows port of the Hadoop utilities) is needed for a standalone Spark, until I saw the following post:

Working on apache spark on windows

.... Even if you don't use Hadoop, Windows needs Hadoop to initialize the "hive" context.  You get the following error if Hadoop is not installed. ....

1. Install Java

I installed 64-bit jdk1.8.0_181 in C:\Java\jdk1.8.0_181, and I noticed there were a few other Java versions already on my box.

In the system environment, I set JAVA_HOME to “C:\Java\jdk1.8.0_181” and added C:\Java\jdk1.8.0_181\bin to PATH. I also set _JAVA_OPTIONS.

Below is in the system environment:

JAVA_HOME=C:\Java\jdk1.8.0_181
_JAVA_OPTIONS=-Xmx16G -Xms512M

I noticed that when I launched java I received the following error:

C:\Users\xxxx>java -version
Picked up _JAVA_OPTIONS: -Xmx16G -Xms512M
Invalid maximum heap size: -Xmx16G
The specified size exceeds the maximum representable size.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

This is because the java in PATH (set up before this version) was 32-bit Java, which can’t use a 16 GB heap.
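A quick way to check which kind of JVM the PATH is actually picking up (a sketch, assuming java is on the PATH of the shell that launches Python):

import subprocess

# java -version writes to stderr; a 64-bit HotSpot JVM reports "64-Bit Server VM"
result = subprocess.run(["java", "-version"], stderr=subprocess.PIPE, universal_newlines=True)
print(result.stderr)
print("64-bit JVM on PATH:", "64-Bit" in result.stderr)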

Note: setting -Xmx16G via _JAVA_OPTIONS in the system environment applies to every JVM, so it broke some of my applications that use 32-bit Java. I need to find another place to set this just for Spark.
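One option (a minimal sketch of such a place, assuming the session is started from a fresh Python process via findspark) is to drop the global _JAVA_OPTIONS and request the large heap per Spark session instead:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Give only the Spark driver a large heap instead of a global _JAVA_OPTIONS.
# This must be set before the first SparkSession is created, because the
# driver JVM is launched at that point; it has no effect afterwards.
spark = (SparkSession.builder
         .config("spark.driver.memory", "16g")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.driver.memory"))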

2. Install Spark

I downloaded the Spark 2.4.2 “Pre-built for Apache Hadoop 2.7 and later” tgz file and extracted it (tar xvf) to ~\Programs\spark-2.4.2-bin-hadoop2.7.

Below is in the system environment:

SPARK_HOME=C:\Users\xxxx\Programs\spark-2.4.2-bin-hadoop2.7

3. Install winutils.exe

I downloaded winutils.exe and put it in %SPARK_HOME%\hadoop\bin

Below is in the system environment:

HADOOP_HOME=%SPARK_HOME%\hadoop

and added to PATH:

%HADOOP_HOME%\bin
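A quick sanity check (a sketch, assuming the variables above are set and the shell has been restarted) that Python sees SPARK_HOME, HADOOP_HOME, and winutils.exe:

import os

# Confirm the environment variables are visible and winutils.exe is where Spark expects it
spark_home = os.environ.get("SPARK_HOME")
hadoop_home = os.environ.get("HADOOP_HOME")
print("SPARK_HOME  =", spark_home)
print("HADOOP_HOME =", hadoop_home)
print("winutils.exe found:",
      os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")))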

4. Test

I tested the following code in VS Code:

#%%
import findspark
findspark.init()  # locate SPARK_HOME and put pyspark on sys.path


#%%
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'spark' as hello")
df.show()


#%%
df = spark.read.csv("a.csv", header=True, inferSchema=True)
df.show()
df.count()
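If a.csv isn’t handy, a self-contained variant (a sketch with made-up rows, just to confirm the session works end to end):

#%%
# Build a tiny DataFrame in memory instead of reading a.csv
data = [("alice", 1), ("bob", 2)]
df = spark.createDataFrame(data, ["name", "id"])
df.show()
print(df.count())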

5. Spark environment

I can launch the pyspark shell (from the git shell, so the Python interpreter is already set up in PATH) as

~/Programs/spark-2.4.2-bin-hadoop2.7/bin/pyspark2.cmd

then I can access Spark job status at

http://127.0.0.1:4040/jobs/

I ran the following Python code in the pyspark shell:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # the shell already defines spark; this just returns it
df = spark.sql("select 'spark' as hello")
df.show()

The job shows up in the web GUI.

6. How to monitor jobs submitted by Jupyter/VS Code?

I tried http://127.0.0.1:4041/jobs/, which worked: it displays all the jobs I submitted via VS Code. Each SparkSession starts its own web UI, and since port 4040 was already taken by the pyspark shell, the session started from VS Code bound to the next free port, 4041.
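Rather than guessing the port, the running session can also be asked for its UI address (a one-liner run from the same VS Code/Jupyter session):

# Ask the active SparkContext where its web UI is listening
print(spark.sparkContext.uiWebUrl)  # e.g. http://hostname:4041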

7. Oracle JDBC Test

Edit the spark-defaults.conf file (copied from spark-defaults.conf.template) in %SPARK_HOME%\conf.

Add the following line:

spark.driver.extraClassPath c:\\Users\\xxxx\\Programs\\sqldeveloper\\jdbc\\lib\\ojdbc8.jar

Here I used the JDBC driver that ships with SQL Developer. The configuration was tested with the following code:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# hostname:port/sid and the credentials below are placeholders
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:port/sid") \
    .option("dbtable", "TEST_TABLE") \
    .option("user", "secret_uid") \
    .option("password", "secret_pwd") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

empDF.count()
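An alternative to editing spark-defaults.conf (a sketch, assuming the same ojdbc8.jar path; it only takes effect if no SparkSession has been started yet in the Python process, since the driver JVM is launched on first use) is to pass the driver classpath in the builder:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Point the driver JVM at the Oracle JDBC jar without touching spark-defaults.conf.
# This has to happen before the driver JVM starts, i.e. before any other
# SparkSession is created in this process.
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath",
                 r"c:\Users\xxxx\Programs\sqldeveloper\jdbc\lib\ojdbc8.jar")
         .getOrCreate())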

Learning Materials

[The Internals of Spark SQL](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/)

Learn PySpark

Data Flow