Data Manipulation using Pandas Python Library: Analysis on Stack Overflow Data set — Part 1

Hirushi Ekanayake
4 min read · Sep 7, 2019

Introduction

Pandas is a Python library for data manipulation and analysis. This article shows how to perform the following basic tasks in Python using Pandas.

  1. Creating a Data Frame
  2. Creating a random sample from a Data Frame
  3. Merging two Data Frames
  4. Grouping the Data Frame based on an attribute

To perform these tasks I used a Stack Overflow data set, which can be downloaded from Kaggle: https://www.kaggle.com/stackoverflow/stacksample

Now let’s try out some basic data manipulations using this data set.

  1. Creating a Data Frame

Before moving on to create a data frame, let’s look at what a Data Frame is.

A Pandas Data Frame is a two-dimensional, heterogeneous, tabular data structure made up of rows, columns and data. This article covers two ways to create one.

  1. Using a data set

The simplest way to create a data frame is to convert a csv (Comma Separated Values) file into a data frame. Here I have used the above-mentioned Stack Overflow data set to create this data frame. Pandas does this with the read_csv function.
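A minimal sketch of this step is shown below. So that it runs without the download, it reads a small in-memory CSV; the column names (Id, Score, Title), the rows and the helper name load_questions are made up for illustration and are not taken from the data set.

```python
import io

import pandas as pd

# Stand-in for Questions.csv so the sketch runs without the download.
# The column names and rows are made up for illustration.
sample_csv = io.StringIO(
    "Id,Score,Title\n"
    "1,12,How do I read a CSV file?\n"
    "2,3,What is a data frame?\n"
    "3,0,How do I merge two tables?\n"
)


def load_questions(source, **read_csv_kwargs):
    """Load a CSV (file path or file-like object) into a data frame."""
    return pd.read_csv(source, **read_csv_kwargs)


questions = load_questions(sample_csv)
print(questions.shape)  # (number of rows, number of columns)
```

With the downloaded file you would pass its path instead, for example load_questions("Questions.csv"). If pandas raises a UnicodeDecodeError, passing an explicit encoding such as encoding="latin-1" is worth trying; whether the Kaggle files need it is an assumption here.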

Once the data set is loaded into a Pandas data frame, you can view the entire data frame, its summary statistics, and its first or last n rows.
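These views could look roughly like the following sketch. The small frame stands in for the Stack Overflow data, and its columns and values are invented for the example.

```python
import pandas as pd

# Small made-up frame standing in for the Stack Overflow questions.
df = pd.DataFrame(
    {
        "Id": [10, 20, 30, 40, 50, 60],
        "Score": [5, 0, 12, 3, 7, 3],
        "Title": ["q1", "q2", "q3", "q4", "q5", "q6"],
    }
)

print(df)                # the whole frame (pandas truncates very large ones)
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)
first_rows = df.head(3)  # first n rows, here n = 3
last_rows = df.tail(2)   # last n rows, here n = 2

print(summary)
print(first_rows)
print(last_rows)
```

By default describe() summarises only the numeric columns of a mixed frame, so the Title column does not appear in the statistics.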

2. Creating an empty data frame and filling in data

You can also create an empty data frame, then add new columns and fill them with data.
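One way to do this is sketched below. The column names and values are placeholders chosen for the example.

```python
import pandas as pd

# Start with a frame that has no rows and no columns.
df = pd.DataFrame()
was_empty = df.empty  # True at this point

# Assigning a list creates the column; later columns must have the same length.
df["Id"] = [1, 2, 3]
df["Tag"] = ["java", "python", "sql"]

print(df)
```

When all the values are known up front, passing a dictionary to pd.DataFrame builds the same frame in one step.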

2. Creating a random sample from the Data Frame

When you have a large data frame, manipulating the entire data set is difficult and time-consuming. As a solution, you can create a sample from the data frame. Here I have created a sample of 100 rows from the Stack Overflow data frame created above, using sampling with replacement, which means the same row can be drawn more than once.
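A sketch of drawing such a sample with DataFrame.sample follows. The 1,000-row frame is synthetic, and the random_state value is an assumption added only so the example is reproducible.

```python
import pandas as pd

# Synthetic frame standing in for the full Stack Overflow data frame.
df = pd.DataFrame(
    {
        "Id": list(range(1, 1001)),
        "Score": [i % 7 for i in range(1000)],
    }
)

# n=100 rows, replace=True means a row may be drawn more than once.
# random_state fixes the draw so repeated runs give the same sample.
sample = df.sample(n=100, replace=True, random_state=42)

print(len(sample))
print(sample.head())
```

The sample keeps the original index labels, so repeated labels in the index show which rows were drawn more than once.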

3. Merging two Data Frames

When manipulating data, we sometimes have to merge two separate data sets. In the Stack Overflow data analysis scenario, I had two separate csv files, one for Questions and one for their Tags, and I wanted to merge them. First I created two separate data frames for Questions.csv and Tags.csv as explained in the previous section. Next I merged these two data frames.
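The merge could be sketched as follows. The two small frames stand in for Questions.csv and Tags.csv; the shared key column Id and the choice of an inner join are assumptions made for this example.

```python
import pandas as pd

# Made-up stand-ins for the Questions and Tags data frames.
questions = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "Title": ["Question one", "Question two", "Question three"],
    }
)
tags = pd.DataFrame(
    {
        "Id": [1, 1, 2, 4],
        "Tag": ["java", "spring", "python", "sql"],
    }
)

# Inner join on the assumed shared key: keeps only Ids present in both frames.
# A question with several tags appears once per tag in the result.
merged = pd.merge(questions, tags, on="Id", how="inner")

print(merged)
```

Here question 3 has no tag and tag row 4 has no question, so neither appears in the result. Passing how="left" instead would keep every question, with a missing Tag value for untagged ones.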

4. Grouping the Data Frame based on an attribute

Grouping data is a common early step in data manipulation, and Pandas makes it simple. In the Stack Overflow data analysis, I wanted to group the questions based on their tags.
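A sketch of the grouping step with DataFrame.groupby is shown below. The merged frame and its Tag column name are made up for the example.

```python
import pandas as pd

# Made-up merged frame: one row per (question, tag) pair.
merged = pd.DataFrame(
    {
        "Id": [1, 1, 2, 3],
        "Title": ["Question one", "Question one", "Question two", "Question three"],
        "Tag": ["java", "spring", "python", "java"],
    }
)

# Group the rows by tag; nothing is computed until the groups are used.
grouped = merged.groupby("Tag")

# Number of questions per tag.
counts = grouped.size()
print(counts)
```

The groupby object is lazy, so it is cheap to create. Calls such as size(), or iterating over the groups, trigger the actual work.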

After you have grouped the data, you can see the rows that belong to a particular group using the get_group method.
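For example, the rows for a single tag could be retrieved as in this sketch. The frame is invented, and the lowercase spelling "java" is an assumption about how the tag is stored in the data.

```python
import pandas as pd

# Made-up merged frame: one row per (question, tag) pair.
merged = pd.DataFrame(
    {
        "Id": [1, 1, 2, 3],
        "Title": ["Question one", "Question one", "Question two", "Question three"],
        "Tag": ["java", "spring", "python", "java"],
    }
)

grouped = merged.groupby("Tag")

# get_group returns the rows whose Tag equals the given key.
# The key must match the stored value exactly, including case.
java_questions = grouped.get_group("java")

print(java_questions)
```

For a single tag, the boolean filter merged[merged["Tag"] == "java"] selects the same rows; get_group is convenient when the grouped object already exists and several groups will be inspected.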

Now we have filtered the questions with the ‘Java’ tag from the Stack Overflow data set. This subset can be used to build a knowledge base on a specific technical area.

Conclusion

In this article I explained some basic data manipulation tasks using a Stack Overflow data set. The next part of this tutorial will follow in the near future.

Thank you for reading!
