Performing In-Depth Data Analysis and Interpretation

I was unclear about the practicality of GPU devices outside deep learning training, but articles on speeding up data processing with Pandas piqued my interest. Intrigued, I fired up my Data Science workstation, prepared to test a large dataset with...


Accelerating Exploratory Data Analysis with GPUs

Exploratory Data Analysis (EDA) is no longer confined to traditional CPU-based systems, as modern GPU-accelerated libraries are making significant strides in speeding up EDA tasks. A recent experiment showcases this potential, utilising a powerful GPU setup to process a large dataset.

The Experiment Setup

The system at the heart of the experiment features a Z270N-WIFI motherboard, an Intel Core i5-7600K quad-core processor, a Quadro P2000 GPU with 5 GB of GDDR5 and 1024 CUDA cores, 32 GB of Corsair VENGEANCE Red memory, a Corsair Integrator 500 W power supply, and storage support for SSD, FDD, and mSATA drives.

The Data

The dataset used in the experiment, the 7+ Million Company Dataset, is licensed under Creative Commons CC0 1.0, making it freely usable for a wide range of purposes. The large text file, weighing in at 1.1 GB, contains 7+ million rows and 11 columns.

The EDA Process

The experiment employed both cuDF (imported as cu) and pandas (imported as pd) for the EDA process. The data file was loaded onto the GPU with cuDF's read_csv method, a process that took approximately 2 seconds. After loading, the DataFrame occupied just over 1.3 GB of memory on the GPU.
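
As a rough sketch of that loading step (assuming the file is saved locally as companies.csv, a name not given in the original write-up), the load time and device memory footprint can be checked like this:

```python
import time
import cudf

start = time.perf_counter()
gdf = cudf.read_csv("companies.csv")  # parsed and stored directly in GPU memory
print(f"GPU load time: {time.perf_counter() - start:.1f} s")

# Approximate footprint of the DataFrame on the device.
print(f"Device memory used: {gdf.memory_usage().sum() / 1e9:.2f} GB")
print(gdf.shape)  # roughly (7_000_000+, 11) for this dataset
```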

Accelerating EDA

NVIDIA cuDF, a GPU-accelerated DataFrame library, played a crucial role in the experiment. It offers API compatibility with pandas and Polars, enabling fast execution of core DataFrame operations directly on the GPU. This accelerates typical data processing workflows used in EDA without requiring code changes.
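
The practical upshot is that familiar pandas-style code runs unchanged against either library. A minimal sketch (the country and name column names are assumptions about the dataset, not taken from the article):

```python
import pandas as pd
import cudf

def top_countries(df):
    # Identical call chain for pandas and cuDF thanks to the shared API.
    return (df.groupby("country")["name"]
              .count()
              .sort_values(ascending=False)
              .head(10))

cpu_df = pd.read_csv("companies.csv")    # executes on the CPU
gpu_df = cudf.read_csv("companies.csv")  # executes on the GPU

print(top_countries(cpu_df))
print(top_countries(gpu_df))
```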

cuDF handles large datasets efficiently using unified virtual memory, allowing it to process datasets larger than GPU memory. It also integrates with the broader Python data science ecosystem for end-to-end workflows that include data exploration and modeling.
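
One way to opt into that behaviour is to route allocations through the RAPIDS Memory Manager (RMM) with managed (unified) memory, so the CUDA driver pages data between device and host as needed. A minimal sketch, which must run before any GPU allocations are made:

```python
import rmm
import cudf

# Use CUDA managed (unified) memory so DataFrames can spill beyond
# the card's physical memory; call this before the first allocation.
rmm.reinitialize(managed_memory=True)

# The read can now succeed even if the parsed data exceeds the 5 GB card.
gdf = cudf.read_csv("companies.csv")
print(len(gdf))
```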

Moreover, GPU-accelerated graph analytics through NVIDIA cuGraph supports advanced EDA tasks in network or relational data, with huge speedups for community detection, centrality measures, and PageRank computations compared to CPU implementations like NetworkX.
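
As an illustration only (the edge list below is synthetic, not derived from the company dataset), a GPU PageRank computation with cuGraph over a cuDF edge list looks like this:

```python
import cudf
import cugraph

# Toy edge list; in a real EDA task the edges might encode
# relationships mined from the dataset under study.
edges = cudf.DataFrame({
    "src": [0, 1, 2, 2, 3],
    "dst": [1, 2, 0, 3, 0],
})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Runs entirely on the GPU and returns a cuDF DataFrame
# with 'vertex' and 'pagerank' columns.
pr = cugraph.pagerank(G)
print(pr.sort_values("pagerank", ascending=False).head())
```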

While GPUs are predominantly known for speeding up deep learning model training, their parallel processing strengths extend well beyond to many data manipulation, transformation, and exploratory steps essential to EDA.

The Results

Loading the DataFrame into host RAM using pandas took 16 seconds, roughly eight times slower than the GPU load. Deleting a DataFrame from the GPU releases its allocation on the device and frees room for further DataFrames. Cleaning operations such as handling missing values, aggregation, and groupby run extremely fast on the GPU.
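
A sketch of those cleaning and release steps, again using assumed column names such as country and industry:

```python
import cudf

gdf = cudf.read_csv("companies.csv")

# Typical cleaning and aggregation, all executed on the GPU.
gdf = gdf.dropna(subset=["country"])                 # drop rows missing a key field
gdf["industry"] = gdf["industry"].fillna("unknown")  # impute missing categories
by_industry = gdf.groupby("industry").size().sort_values(ascending=False)
print(by_industry.head())

# Free the device allocation so another DataFrame can be loaded.
del gdf
```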

The Future of EDA

The experiment demonstrates that GPUs are not limited to deep learning training and can accelerate most of the EDA pipeline, including data loading, cleaning, transformation, statistical analysis, and, when coupled with visualization tools, visualization itself. Libraries like cuDF and cuGraph make GPU acceleration accessible for traditional data science and exploratory analysis workloads, resulting in faster, more scalable, and more interactive investigations.

The data for the experiment was obtained directly from People Data Labs. The experiment was conducted in the rapids-22.08 environment on Ubuntu 22.04.1 LTS; the environment must be activated before data can be loaded onto the GPU. Copying data from the GPU back to host RAM is necessary for visualization, as cuDF does not wrap matplotlib directly.
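
A minimal sketch of that host round trip, assuming the RAPIDS environment is already active and using a hypothetical country column:

```python
# Run inside the activated environment, e.g. `conda activate rapids-22.08`.
import cudf
import matplotlib.pyplot as plt

gdf = cudf.read_csv("companies.csv")

# Aggregate on the GPU, then copy the small result to host memory
# with .to_pandas() so matplotlib can plot it.
top10 = gdf["country"].value_counts().head(10).to_pandas()
top10.plot(kind="bar", title="Companies per country (top 10)")
plt.tight_layout()
plt.show()
```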

The code for the experiment can be found on GitHub at ReadingList/cudf.ipynb. The NVIDIA GPU used has CUDA 11.4 available, and 1 GB of the card's RAM was already allocated to existing workloads. A data file with 7 million records was chosen for the GPU experiment; the dataset is described on Kaggle and has been used in other articles such as "Exploratory Data Analysis of 7 Million Companies using Python."

[1] NVIDIA, "NVIDIA cuDF: GPU-Accelerated DataFrame Library," 2022, https://developer.nvidia.com/cudf

[2] NVIDIA, "NVIDIA cuGraph: GPU-Accelerated Graph Analytics Library," 2022, https://developer.nvidia.com/cugraph

[3] RAPIDS, "RAPIDS: Accelerate Data Science with GPUs," 2022, https://rapids.ai/

[4] People Data Labs, "7 Million Company Dataset," 2022, https://people.co/datasets/7-million-company-dataset/

[5] Creative Commons, "CC0 1.0 Universal," 2012, https://creativecommons.org/publicdomain/zero/1.0/legalcode

