Big Data Analytics using Artificial Intelligence: Apache Spark for Scalable Batch Processing

Himanshu Gupta

Publication Date: 2024/09/10

Abstract: The rapid proliferation of data in the digital age has made big data analytics a critical tool for deriving insights and making informed decisions. However, processing and analyzing large datasets, often reaching hundreds of terabytes, presents significant challenges. This paper explores the use of Apache Spark, a powerful distributed computing framework, for batch processing in big data analytics using artificial intelligence (AI) techniques. We evaluate the scalability, efficiency, and accuracy of AI models when applied to massive datasets processed in Spark. Our experiments demonstrate that Apache Spark, coupled with machine learning and deep learning techniques, offers a robust solution for handling large-scale data analytics tasks. We also discuss the challenges associated with such large-scale processing and propose strategies for optimizing performance and resource utilization.

DOI: https://doi.org/10.38124/ijisrt/IJISRT24AUG1656

PDF: https://ijirst.demo4.arinfotech.co/assets/upload/files/IJISRT24AUG1656.pdf
