It is common for a PySpark program to depend on external modules. Three alternatives can be employed here:
- Distribute the third-party modules across your Spark cluster. This is the easiest way, but it requires administrative rights on the cluster;
- Write your own functions in a single module and add it to the search path of the SparkContext. Two utility functions are available for this: sc.addFile and sc.addPyFile (a short sketch follows this list);
- Package a module that spans multiple Python files into a single .zip or .egg file. Refer to these answers elsewhere:
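As a minimal sketch of the second and third options, the snippet below ships a single module, a bundled archive, and a plain data file to the executors. The file names my_utils.py, deps.zip, and settings.conf are placeholders, not files from this post.

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="ship-dependencies")

# Option 2: ship a single self-written module; every executor can then import it.
# "my_utils.py" is a placeholder path, not a file from this post.
sc.addPyFile("my_utils.py")

# Option 3: ship a multi-file package bundled as a .zip (or .egg); the archive is
# added to the executors' PYTHONPATH, so its top-level packages become importable.
sc.addPyFile("deps.zip")

# Plain data/config files go through addFile and are located with SparkFiles.get.
sc.addFile("settings.conf")

def process(x):
    import my_utils                                     # resolved on the executor after shipping
    with open(SparkFiles.get("settings.conf")) as f:    # shipped data file
        _ = f.read()
    return my_utils.transform(x)                        # hypothetical helper inside my_utils.py

rdd = sc.parallelize(range(4)).map(process)   # lazy; nothing runs until an action is called
```

The import is placed inside the function so that it is resolved on the executors, after the shipped files are available there.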