Submitting Python Programs with External Modules to Spark

It is a common scenario that a PySpark program needs external modules. Three alternatives can be employed here:

  1. Distribute the third-party modules across your Spark cluster (for example, by installing them on every node). This is the easiest way, but it requires administrative rights on the cluster.
  2. Write your own functions in a single module and add it to the SparkContext search path. Two utility methods are available for this: sc.addFile and sc.addPyFile (see the first sketch after this list).
  3. Package a module spanning multiple Python files into a single .zip or .egg file (see the second sketch after this list). Refer to the answers elsewhere for details.
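As a minimal sketch of the second alternative, the snippet below ships a single helper file (a hypothetical myutils.py exposing a transform function) to the executors with sc.addPyFile and imports it inside the task function:

    from pyspark import SparkContext

    sc = SparkContext(appName="addPyFileDemo")

    # Distribute myutils.py to every executor and put it on their Python path.
    sc.addPyFile("/path/to/myutils.py")

    def apply_helper(x):
        # Import inside the function so the lookup happens on the executor,
        # after the file has been shipped.
        import myutils
        return myutils.transform(x)  # transform() is assumed to exist in myutils.py

    print(sc.parallelize(range(10)).map(apply_helper).collect())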
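For the third alternative, a zipped package can be shipped at submit time with spark-submit --py-files mypkg.zip main.py, or at runtime with sc.addPyFile, which also accepts .zip and .egg archives. The sketch below assumes a hypothetical package mypkg (zipped as mypkg.zip) that contains a helpers module:

    from pyspark import SparkContext

    sc = SparkContext(appName="zipPackageDemo")

    # Ship the whole archive; its contents become importable on the executors.
    sc.addPyFile("/path/to/mypkg.zip")  # e.g. built with: zip -r mypkg.zip mypkg

    def use_package(x):
        from mypkg import helpers      # module assumed to live inside the archive
        return helpers.process(x)      # process() is assumed, for illustration only

    print(sc.parallelize([1, 2, 3]).map(use_package).collect())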

