Apache Spark create RDD from an array in Java | convert an array into RDD

  • RDD is Spark's core abstraction.
  • RDD stands for resilient distributed dataset.
  • Each RDD is split into multiple partitions, which may be computed on different machines of the cluster.
  • We can create an Apache Spark RDD in two ways:
    1. Parallelizing a collection
    2. Loading an external dataset (a sketch of this way follows the list below).
  • Creating an RDD from an array falls under parallelizing a collection.
  • Let us see an Apache Spark example program that converts an array into an RDD.
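This article demonstrates the first way. For contrast, here is a minimal sketch of the second way, loading an external dataset with the textFile() method of JavaSparkContext; the file path "input.txt" and the class name are only illustrative assumptions, not part of the original program.

  package com.instanceofjava.sparkInterview;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SparkTextFileTest {

      public static void main(String[] args) {

          SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
          JavaSparkContext sc = new JavaSparkContext(conf);

          // textFile() creates an RDD with one element per line of the file.
          // "input.txt" is a hypothetical path; point it at any local text file.
          JavaRDD<String> lines = sc.textFile("input.txt");

          System.out.println("first line: " + lines.first());

          sc.stop();
      }
  }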



Program #1: Write an Apache Spark Java example program to create a simple RDD using the parallelize() method of JavaSparkContext: convert an array into an RDD.


  1.  package com.instanceofjava.sparkInterview;
  2.  
  3. import java.util.Arrays;
  4.  
  5. import org.apache.spark.SparkConf;
  6. import org.apache.spark.api.java.JavaRDD;
  7. import org.apache.spark.api.java.JavaSparkContext;
  8.  
  9. /**
  10.  * Apache spark examples: RDD in spark example program
  11.  * converting an array to RDD
  12.  * @author www.instanceofjava.com
  13.  */
  14. public class SparkTest {
  15.     
  16.     public static void main(String[] args) {
  17.         
  18.         SparkConf conf = new
  19.                 SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
  20.         JavaSparkContext sc = new JavaSparkContext(conf);
  21.         
  22.         String[] arrayStr={"convert array to rdd","convert array into rdd"};
  23.         
  24.         JavaRDD<String> strRdd=sc.parallelize(Arrays.asList(arrayStr));
  25.         System.out.println("apache spark rdd created: "+strRdd);
  26.         
  27.         /**
  28.          * Return the first element in this RDD.
  29.          */
  30.         System.out.println(strRdd.first());
  31.         
  32.     }
  33.  
  34. }

Output:

  1. apache spark rdd created: ParallelCollectionRDD[0] at parallelize at SparkTest.java:24
  2. convert array to rdd
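The bullets above say that each RDD is split into multiple partitions. parallelize() also accepts the number of slices as an optional second argument, so the partition count can be set explicitly. Below is a minimal sketch of that overload; the sample strings and the slice count of 4 are illustrative assumptions.

  import java.util.Arrays;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SparkPartitionsTest {

      public static void main(String[] args) {

          SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
          JavaSparkContext sc = new JavaSparkContext(conf);

          String[] arrayStr = {"one", "two", "three", "four"};

          // The second argument asks Spark to split the data into 4 partitions.
          JavaRDD<String> strRdd = sc.parallelize(Arrays.asList(arrayStr), 4);

          // getNumPartitions() (Spark 1.6+) reports the actual partition count.
          System.out.println("number of partitions: " + strRdd.getNumPartitions());

          sc.stop();
      }
  }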




How to create an RDD in Apache Spark using Java

  • RDD is Spark's core abstraction.
  • RDD stands for resilient distributed dataset.
  • That means it is an immutable collection of objects (see the sketch after this list).
  • Each RDD is split into multiple partitions, which may be computed on different machines of the cluster.
  • We can create an Apache Spark RDD in two ways:
    1. Parallelizing a collection
    2. Loading an external dataset.
  • Now we will see an example program on creating an RDD by parallelizing a collection.
  • In Apache Spark, the JavaSparkContext class provides the parallelize() method.
  • Let us see a simple example program to create an Apache Spark RDD in Java.
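Immutability means transformations such as map() never change an existing RDD; they return a new one. Here is a minimal sketch of that behavior (the sample strings are illustrative assumptions; the lambda syntax needs Java 8 or later):

  import java.util.Arrays;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class RddImmutabilityTest {

      public static void main(String[] args) {

          SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
          JavaSparkContext sc = new JavaSparkContext(conf);

          JavaRDD<String> original = sc.parallelize(Arrays.asList("spark", "rdd"));

          // map() returns a brand new RDD; "original" is left untouched.
          JavaRDD<String> upper = original.map(s -> s.toUpperCase());

          System.out.println(original.first());  // prints: spark
          System.out.println(upper.first());     // prints: SPARK

          sc.stop();
      }
  }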



Program #1: Write an Apache Spark Java example program to create a simple RDD using the parallelize() method of JavaSparkContext.

  1. package com.instanceofjava.sparkInterview;
  2. import java.util.Arrays;
  3.  
  4. import org.apache.spark.SparkConf;
  5. import org.apache.spark.api.java.JavaRDD;
  6. import org.apache.spark.api.java.JavaSparkContext;
  7.  
  8. /**
  9.  *  Apache spark examples:RDD in spark example program
  10.  * @author www.instanceofjava.com
  11.  */
  12. public class SparkTest {
  13.     
  14.  public static void main(String[] args) {
  15.         
  16.  SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
  17.  JavaSparkContext sc = new JavaSparkContext(conf);
  18.         
  19.  JavaRDD<String> strRdd=sc.parallelize(Arrays.asList("apache spark element1",
  20.          "apache spark element2"));
  21.  System.out.println("apache spark rdd created: "+strRdd);
  22.         
  23. /**
  24.  * Return the first element in this RDD.
  25.  */
  26. System.out.println(strRdd.first());
  27.  }
  28.  
  29. }

Output:

  1. apache spark rdd created: ParallelCollectionRDD[0] at parallelize at SparkTest.java:21
  2. apache spark element1
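first() returns only one element. To print every element of a small RDD, collect() can be used to bring the whole dataset back to the driver as a java.util.List. A short follow-up snippet, assuming it is placed inside main() after the strRdd above (collect() loads everything into driver memory, so it is safe only for small RDDs):

  // Continues the program above; strRdd is the RDD created by parallelize().
  java.util.List<String> elements = strRdd.collect();
  for (String element : elements) {
      System.out.println(element);
  }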

