Updated 4/30/2019
Working on Apache Spark on Windows
.... Even if you don't use Hadoop, Spark on Windows needs Hadoop to initialize the "hive" context. You get the following error if Hadoop is not installed. ....
I installed the 64-bit jdk1.8.0_181 in C:\Java\jdk1.8.0_181, and I noticed a few other Java versions already existed on my box.
In the system environment, I set JAVA_HOME to C:\Java\jdk1.8.0_181 and added C:\Java\jdk1.8.0_181\bin to PATH. I also set the following in the system environment:
JAVA_HOME=C:\Java\jdk1.8.0_181
_JAVA_OPTIONS=-Xmx16G -Xms512M
I noticed that when I launched java I received the following error:
C:\Users\xxxx>java -version
Picked up _JAVA_OPTIONS: -Xmx16G -Xms512M
Invalid maximum heap size: -Xmx16G
The specified size exceeds the maximum representable size.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
This is because the java found first in PATH (set before this version) was 32-bit Java, which can't use a 16 GB heap.
Note: setting -Xmx16G through _JAVA_OPTIONS in the system environment applies to every JVM on the machine, so it broke some of my applications that use 32-bit Java. I need to find another place to set the heap size for Spark; see the sketch below.
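A less intrusive option (a sketch I have not fully verified on this box) is to scope the heap to Spark only, either with a spark.driver.memory line in %SPARK_HOME%\conf\spark-defaults.conf or per session:
# Minimal sketch: give only the Spark driver a 16 GB heap, instead of the
# machine-wide _JAVA_OPTIONS; "16g" mirrors the -Xmx16G above.
# This must be set before the driver JVM starts, so do it on the builder
# of the very first session in the process.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.driver.memory", "16g") \
    .getOrCreate()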
I downloaded the Spark 2.4.2 "Pre-built for Apache Hadoop 2.7 and later" tgz file and extracted it with tar xvf to ~\Programs\spark-2.4.2-bin-hadoop2.7
Below is in system environment
SPARK_HOME=C:\Users\xxxx\Programs\spark-2.4.2-bin-hadoop2.7
I downloaded winutils.exe and put it in %SPARK_HOME%\hadoop\bin
Below is in system environment
HADOOP_HOME=%SPARK_HOME%\hadoop
and added to PATH
%HADOOP_HOME%\bin
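Before launching anything, a quick sanity check that the pieces are where Spark expects them doesn't hurt (a throwaway script; the paths are the ones assumed above):
# Sanity check: confirm HADOOP_HOME resolves and winutils.exe exists.
import os
hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
print("HADOOP_HOME =", hadoop_home)
print("winutils.exe present:", os.path.isfile(winutils))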
I tested the following code in VS Code (as #%% cells):
#%%
# Locate the Spark installation via SPARK_HOME and add it to sys.path.
import findspark
findspark.init()
#%%
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Smoke test: run a trivial SQL query through the new session.
df = spark.sql("select 'spark' as hello")
df.show()
#%%
# Read a local CSV, treating the first row as a header and inferring types.
df = spark.read.csv("a.csv", header=True, inferSchema=True)
df.show()
df.count()
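The CSV cell expects an a.csv in the working directory; a throwaway file like this (the contents are just made up for the test) is enough:
# Create a tiny a.csv so the spark.read.csv cell has something to load.
with open("a.csv", "w") as f:
    f.write("id,name\n1,alice\n2,bob\n")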
I can launch the pyspark shell (in Git Bash, where I have a Python interpreter set up in PATH) as
~/Programs/spark-2.4.2-bin-hadoop2.7/bin/pyspark2.cmd
then I can access the Spark job status at
http://127.0.0.1:4040/jobs/
I ran the following Python code in the pyspark shell:
from pyspark.sql import SparkSession
# In the pyspark shell a SparkSession named spark already exists;
# getOrCreate() simply returns it.
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'spark' as hello")
df.show()
The job shows up in the web GUI.
I tried http://127.0.0.1:4041/jobs/, which also worked; it displays all the jobs I submitted via VS Code. When one SparkContext already holds port 4040, the next one binds to 4041, and so on.
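The port guessing can be avoided by pinning the UI port per session with the standard spark.ui.port setting (a sketch; 4050 is just an arbitrary free port):
from pyspark.sql import SparkSession
# Pin the web UI to a known port instead of letting Spark probe 4040, 4041, ...
spark = SparkSession.builder \
    .config("spark.ui.port", "4050") \
    .getOrCreate()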
Edit the spark-defaults.conf file (copied from spark-defaults.conf.template) in %SPARK_HOME%\conf and add the following line:
spark.driver.extraClassPath c:\\Users\\xxxx\\Programs\\sqldeveloper\\jdbc\\lib\\ojdbc8.jar
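As far as I know, spark.jars is an alternative that ships the jar to executors as well, not only onto the driver classpath; for this local setup either line should do:
spark.jars c:\\Users\\xxxx\\Programs\\sqldeveloper\\jdbc\\lib\\ojdbc8.jar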
Here I used the JDBC driver that ships with SQL Developer. The configuration was tested with the following code:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read an Oracle table over JDBC using the ojdbc8 driver configured above.
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:port/sid") \
    .option("dbtable", "TEST_TABLE") \
    .option("user", "secret_uid") \
    .option("password", "secret_pwd") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
empDF.count()
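Instead of pulling the whole table, the dbtable option also accepts a parenthesized subquery, which pushes the work down to Oracle (a sketch; the rownum filter is just an example):
# Push a query down to Oracle rather than reading all of TEST_TABLE.
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:port/sid") \
    .option("dbtable", "(select * from TEST_TABLE where rownum <= 100) t") \
    .option("user", "secret_uid") \
    .option("password", "secret_pwd") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
empDF.count()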
[The Internals of Spark SQL](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/)
