- RDD (Resilient Distributed Dataset) is Spark's core abstraction.
- Each RDD is split into multiple partitions, which may be computed on different machines in the cluster.
- We can create an Apache Spark RDD in two ways:
- Parallelizing a collection
- Loading an external dataset
- Creating an RDD from an array falls under parallelizing a collection.
- Let us see an Apache Spark example program that converts an array into an RDD.
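Since each RDD is split into partitions, the parallelize method also accepts an optional second argument that controls how many partitions the collection is split into. A minimal sketch (the class and app names here are hypothetical):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCountTest {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("PartitionCountApp");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // the second argument asks Spark to split the data into 4 partitions
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

        // getNumPartitions() reports how many partitions the RDD has
        System.out.println("partitions: " + numbers.getNumPartitions());

        sc.stop();
    }
}
```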
Program #1: Write an Apache Spark Java example program that creates a simple RDD using the parallelize method of JavaSparkContext and converts an array into an RDD.
package com.instanceofjava.sparkInterview;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * Apache Spark examples: RDD in Spark example program
 * converting an array to RDD
 * @author www.instanceofjava.com
 */
public class SparkTest {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
        JavaSparkContext sc = new JavaSparkContext(conf);

        String[] arrayStr = {"convert array to rdd", "convert array into rdd"};

        // parallelize distributes the local collection to form an RDD
        JavaRDD<String> strRdd = sc.parallelize(Arrays.asList(arrayStr));
        System.out.println("apache spark rdd created: " + strRdd);

        /**
         * Return the first element in this RDD.
         */
        System.out.println(strRdd.first());
    }
}
Output:
- apache spark rdd created: ParallelCollectionRDD[0] at parallelize at SparkTest.java:24
- convert array to rdd
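The second way mentioned above, loading an external dataset, uses the textFile method of JavaSparkContext, which creates an RDD with one element per line of the file. A minimal sketch (input.txt is a hypothetical local file; replace it with a real path):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkTextFileTest {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TextFileApp");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile reads a local or HDFS text file as an RDD of lines;
        // "input.txt" is a placeholder path for this sketch
        JavaRDD<String> lines = sc.textFile("input.txt");

        System.out.println("number of lines: " + lines.count());
        System.out.println("first line: " + lines.first());

        sc.stop();
    }
}
```

textFile also accepts HDFS, S3, and other Hadoop-supported URIs, so the same program works unchanged against distributed storage.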