Apache Spark create RDD from an array in Java | convert an array into RDD

  • RDD is Spark's core abstraction.
  • RDD stands for resilient distributed dataset.
  • Each RDD is split into multiple partitions, which may be computed on different machines of the cluster.
  • We can create an Apache Spark RDD in two ways:
    1. Parallelizing a collection
    2. Loading an external dataset (a sketch of this way follows the list below).
  • Creating an RDD from an array falls under parallelizing a collection.
  • Let us see an Apache Spark example program that converts an array into an RDD.
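This article demonstrates the first way. For contrast, here is a minimal sketch of the second way, loading an external dataset with the textFile() method of JavaSparkContext; the file path "input.txt" and the class name are only illustrative assumptions, not part of the original program.

  package com.instanceofjava.sparkInterview;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SparkTextFileTest {

      public static void main(String[] args) {

          SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
          JavaSparkContext sc = new JavaSparkContext(conf);

          // textFile() creates an RDD with one element per line of the file.
          // "input.txt" is a hypothetical path; point it at any local text file.
          JavaRDD<String> lines = sc.textFile("input.txt");

          System.out.println("first line: " + lines.first());

          sc.stop();
      }
  }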



Program #1: Write an Apache Spark Java example program to create a simple RDD using the parallelize() method of JavaSparkContext: convert an array into an RDD.


  1.  package com.instanceofjava.sparkInterview;
  2.  
  3. import java.util.Arrays;
  4.  
  5. import org.apache.spark.SparkConf;
  6. import org.apache.spark.api.java.JavaRDD;
  7. import org.apache.spark.api.java.JavaSparkContext;
  8.  
  9. /**
  10.  * Apache spark examples: RDD in spark example program
  11.  * converting an array to RDD
  12.  * @author www.instanceofjava.com
  13.  */
  14. public class SparkTest {
  15.     
  16.     public static void main(String[] args) {
  17.         
  18.         SparkConf conf = new
  19.                 SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
  20.         JavaSparkContext sc = new JavaSparkContext(conf);
  21.         
  22.         String[] arrayStr={"convert array to rdd","convert array into rdd"};
  23.         
  24.         JavaRDD<String> strRdd=sc.parallelize(Arrays.asList(arrayStr));
  25.         System.out.println("apache spark rdd created: "+strRdd);
  26.         
  27.         /**
  28.          * Return the first element in this RDD.
  29.          */
  30.         System.out.println(strRdd.first());
  31.         
  32.     }
  33.  
  34. }

Output:

  1. apache spark rdd created: ParallelCollectionRDD[0] at parallelize at SparkTest.java:24
  2. convert array to rdd
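The bullets above say that each RDD is split into multiple partitions. parallelize() also accepts the number of slices as an optional second argument, so the partition count can be set explicitly. Below is a minimal sketch of that overload; the sample strings and the slice count of 4 are illustrative assumptions.

  import java.util.Arrays;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SparkPartitionsTest {

      public static void main(String[] args) {

          SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
          JavaSparkContext sc = new JavaSparkContext(conf);

          String[] arrayStr = {"one", "two", "three", "four"};

          // The second argument asks Spark to split the data into 4 partitions.
          JavaRDD<String> strRdd = sc.parallelize(Arrays.asList(arrayStr), 4);

          // getNumPartitions() (Spark 1.6+) reports the actual partition count.
          System.out.println("number of partitions: " + strRdd.getNumPartitions());

          sc.stop();
      }
  }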




How to create an RDD in Apache Spark using Java

  • RDD is Spark's core abstraction.
  • RDD stands for resilient distributed dataset.
  • That means it is an immutable collection of objects (see the sketch after this list).
  • Each RDD is split into multiple partitions, which may be computed on different machines of the cluster.
  • We can create an Apache Spark RDD in two ways:
    1. Parallelizing a collection
    2. Loading an external dataset.
  • Now we will see an example program on creating an RDD by parallelizing a collection.
  • In Apache Spark, the JavaSparkContext class provides the parallelize() method.
  • Let us see a simple example program to create an Apache Spark RDD in Java.
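Immutability means transformations such as map() never change an existing RDD; they return a new one. Here is a minimal sketch of that behavior (the sample strings are illustrative assumptions; the lambda syntax needs Java 8 or later):

  import java.util.Arrays;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class RddImmutabilityTest {

      public static void main(String[] args) {

          SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
          JavaSparkContext sc = new JavaSparkContext(conf);

          JavaRDD<String> original = sc.parallelize(Arrays.asList("spark", "rdd"));

          // map() returns a brand new RDD; "original" is left untouched.
          JavaRDD<String> upper = original.map(s -> s.toUpperCase());

          System.out.println(original.first());  // prints: spark
          System.out.println(upper.first());     // prints: SPARK

          sc.stop();
      }
  }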



Program #1: Write an Apache Spark Java example program to create a simple RDD using the parallelize() method of JavaSparkContext.

  1. package com.instanceofjava.sparkInterview;
  2. import java.util.Arrays;
  3.  
  4. import org.apache.spark.SparkConf;
  5. import org.apache.spark.api.java.JavaRDD;
  6. import org.apache.spark.api.java.JavaSparkContext;
  7.  
  8. /**
  9.  *  Apache spark examples:RDD in spark example program
  10.  * @author www.instanceofjava.com
  11.  */
  12. public class SparkTest {
  13.     
  14.  public static void main(String[] args) {
  15.         
  16.  SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("InstanceofjavaAPP");
  17.  JavaSparkContext sc = new JavaSparkContext(conf);
  18.         
  19.  JavaRDD<String> strRdd=sc.parallelize(Arrays.asList("apache spark element1",
  20.          "apache spark element2"));
  21.  System.out.println("apache spark rdd created: "+strRdd);
  22.         
  23. /**
  24.  * Return the first element in this RDD.
  25.  */
  26. System.out.println(strRdd.first());
  27.  }
  28.  
  29. }

Output:

  1. apache spark rdd created: ParallelCollectionRDD[0] at parallelize at SparkTest.java:21
  2. apache spark element1
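first() returns only one element. To print every element of a small RDD, collect() can be used to bring the whole dataset back to the driver as a java.util.List. A short follow-up snippet, assuming it is placed inside main() after the strRdd above (collect() loads everything into driver memory, so it is safe only for small RDDs):

  // Continues the program above; strRdd is the RDD created by parallelize().
  java.util.List<String> elements = strRdd.collect();
  for (String element : elements) {
      System.out.println(element);
  }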

