Web Scraping and Regular Expressions: Doing It in Python

source: shorturl.at/ahjAX

With the largest online selection of leading brands in categories such as electronics, fashion, health & beauty, fragrances, grocery, baby products, and homeware, noon is the one-stop shopping destination for everyone. For this article, I chose the noon website and collected data from the perfume section for the female, male, and kids departments.

This article shows how web scraping and regular expressions (regex) can be used to extract the required data from this website.

What is Web Scraping?

If you wonder what web scraping is, “Web scraping is a technique of extracting information from websites. It focuses on the transformation of unstructured data on the web, into structured data that can be stored and analyzed”.
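To make "unstructured to structured" concrete, here is a minimal sketch. The dataset in this article was built with Selenium and Beautiful Soup; this example swaps in Python's built-in html.parser so it runs without extra packages. The markup, class names, and product names are hypothetical and do not reflect the real noon pages.

```python
from html.parser import HTMLParser

# Hypothetical markup; the real noon pages use different tags and classes.
SAMPLE_HTML = """
<div class="product"><span class="name">Example Perfume EDP 100ml</span><span class="price">150</span></div>
<div class="product"><span class="name">Sample Scent EDT 50ml</span><span class="price">90</span></div>
"""


class ProductParser(HTMLParser):
    """Collect the text of the name and price spans, one dict per product."""

    def __init__(self):
        super().__init__()
        self.records = []   # structured output: one dict per product
        self._field = None  # which span is currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "product":
            self.records.append({})
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = ProductParser()
parser.feed(SAMPLE_HTML)
records = parser.records  # a list of dicts, ready to load into a table
```

Each dict in records is one row of structured data, which is the shape a DataFrame expects.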

web scraping

What are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern.

Usually this pattern is then used by string-searching algorithms for “find” or “find and replace” operations on strings.

The Most Famous Quote in regex-dom

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” — Jamie Zawinski (Netscape engineer)

There are many places where regex can be run — from your text editor to the bash shell to Python, and even SQL. It is typically baked into the standard library of programming languages.

In Python, it can be imported like so:

import re
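With the module imported, the “find” and “find and replace” operations can be sketched as follows. The product name is a made-up example, not a row from the dataset.

```python
import re

text = "Example Perfume EDP 100ml"  # hypothetical product name

# "Find": search for one or more digits followed by "ml".
match = re.search(r"\d+ml", text)
found = match.group(0) if match else None

# "Find and replace": remove the size (and the space before it).
replaced = re.sub(r"\s*\d+ml", "", text).strip()
```

re.search returns None when nothing matches, so checking the result before calling group() avoids an AttributeError.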

As a data scientist, when should you use regular expressions?

Regular expressions help in manipulating textual data, which is often a prerequisite for data science projects that involve text analytics.

The dataset

After using Selenium and Beautiful Soup to create the dataset, it looks like this:

dataset before cleaning

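A lot of data can be extracted from the features above. Let's see:

To extract the volume in ml from the perfume name:

# Capture the first "<digits>ml" token in the name (NaN when there is none).
df['ml'] = df['name'].str.extract(r'(\d+ml)', expand=False)
# Remove the volume from the name; names without a volume are left unchanged.
df['name'] = df['name'].str.replace(r'\d+ml', '', regex=True)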
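To see what the pattern does on individual strings, here is a small sketch with the plain re module. The helper split_volume and the sample names are hypothetical and only serve as illustration.

```python
import re

ML_PATTERN = re.compile(r"(\d+ml)")  # same pattern as in the pandas call


def split_volume(name):
    """Return (name_without_volume, volume); volume is None if absent."""
    match = ML_PATTERN.search(name)
    if match is None:
        return name, None
    volume = match.group(1)
    cleaned = name.replace(volume, "").strip()
    return cleaned, volume


examples = ["Example Perfume EDP 100ml", "Sample Body Mist"]
results = [split_volume(n) for n in examples]
```

Handling the no-match case explicitly is deliberate: in pandas a missing match becomes NaN, whose string form is 'nan', so any code that converts it with str() before replacing would search for that literal text.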

To remove the brand from the perfume name (the brand already has its own feature):

# Delete the brand text from each name, row by row.
df['name'] = df.apply(lambda row: row['name'].replace(str(row['brand']), ''), axis=1)
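A plain string replace is case-sensitive, so a brand stored as "Acme" would not be removed from a name that spells it "ACME". The sketch below is one possible variation, not the method used for this dataset: the helper remove_brand and the sample values are hypothetical.

```python
import re


def remove_brand(name, brand):
    """Remove the brand from the product name, ignoring case."""
    # re.escape makes sure characters such as "." in a brand are taken literally.
    without_brand = re.sub(re.escape(brand), "", name, flags=re.IGNORECASE)
    # Collapse the extra spaces the removal can leave behind.
    return re.sub(r"\s+", " ", without_brand).strip()


cleaned = remove_brand("ACME Example Perfume EDP", "Acme")
```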

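To extract the concentration from the perfume name:

# Capture the last three to five characters of the name as the concentration.
df['concentration'] = df['name'].str.extract(r'(.{3,5})$', expand=False)
# Remove those trailing characters from the name.
df['name'] = df['name'].str.replace(r'(.{3,5})$', '', regex=True)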
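The pattern (.{3,5})$ is positional: it takes up to the last five characters whatever they contain, so the result depends on how the name ends. The sketch below compares it with a keyword-based alternative. The abbreviation list (EDP, EDT, EDC) is an assumption about the data, and split_concentration is a hypothetical helper.

```python
import re

# Assumed abbreviations; the real dataset may contain others.
KNOWN = ("EDP", "EDT", "EDC")
CONCENTRATION = re.compile(r"\b(" + "|".join(KNOWN) + r")\s*$")


def split_concentration(name):
    """Return (name_without_concentration, concentration or None)."""
    match = CONCENTRATION.search(name)
    if match is None:
        return name.strip(), None
    return name[:match.start()].strip(), match.group(1)


# Positional pattern from the article: always the last (up to) five characters.
fixed_width = re.search(r"(.{3,5})$", "Example Perfume EDP").group(1)

# Keyword pattern: only a known abbreviation at the end of the name.
by_keyword = split_concentration("Example Perfume EDP ")
```

Here fixed_width is 'e EDP', because the greedy quantifier takes the full five characters, while by_keyword isolates 'EDP'. The keyword version only works for abbreviations on the list, so it trades generality for cleaner output.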

To extract notes, scents, dispenser_type, and aromatherapy_type from the details feature:

# Each pattern captures the text between a label and the next ' &' separator.
df['Base Note'] = df['details'].str.extract('((?<=Base Note).+?(?= &))', expand=False)
df['fragrance_notes'] = df['details'].str.extract('((?<=Fragrance Notes).+?(?= &))', expand=False)
df['middle_note'] = df['details'].str.extract('((?<=Heart/Middle Note).+?(?= &))', expand=False)
df['department'] = df['details'].str.extract('((?<=Department).+?(?= &))', expand=False)
df['scents'] = df['details'].str.extract('((?<=Scents/Notes).+?(?= &))', expand=False)
df['dispenser_type'] = df['details'].str.extract('((?<=Dispenser Type).+?(?= &))', expand=False)
df['top_note'] = df['details'].str.extract('((?<=Top Note).+?(?= &))', expand=False)
df['aromatherapy_type'] = df['details'].str.extract('((?<=Aromatherapy Type).+?(?= &))', expand=False)
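These patterns combine a lookbehind, (?<=Label), with a lazy match and a lookahead, (?= &), so only the value between the label and the next separator is captured. The sketch below shows the idea on a single string. The details text is a made-up example of the assumed "Label value & Label value" layout, and extract_field is a hypothetical helper that also accepts the end of the string as a terminator.

```python
import re

# Hypothetical details string in the assumed "Label value & Label value" layout.
details = "Base Note Musk & Top Note Bergamot & Department Women"


def extract_field(text, label):
    """Return the value that follows `label`, up to the next ' &' or the end."""
    pattern = r"(?<=" + re.escape(label) + r")(.+?)(?= &|$)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


base_note = extract_field(details, "Base Note")
top_note = extract_field(details, "Top Note")
department = extract_field(details, "Department")
missing = extract_field(details, "Dispenser Type")

# The stricter pattern requires a trailing ' &', so the last field is not found.
strict_last = re.search(r"((?<=Department).+?(?= &))", details)
```

With the strict lookahead, a field that sits at the very end of the text is missed unless the string ends with the separator. Whether that matters depends on how the scraped details are terminated.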

To split the seller rating from the number of seller ratings in the num_seller_ratings feature:

# Remove the parentheses around the number of ratings.
df['num_seller_ratings'] = df['num_seller_ratings'].str.replace(r'[()]', '', regex=True)
# The first (up to) three characters hold the seller rating.
df['seller_rating'] = df['num_seller_ratings'].str.extract(r'^(.{0,3})', expand=False)
# Drop those characters; regex=True is needed because pandas 2.0+ matches literally by default.
df['num_seller_ratings'] = df['num_seller_ratings'].str.replace(r'^(.{0,3})', '', regex=True)
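Slicing the first three characters works when every rating has the same width, such as "4.5". As a hedged alternative, the sketch below parses both numbers by pattern instead of by position. The raw format "rating(count)" is an assumption about the scraped text, and split_rating is a hypothetical helper.

```python
import re

# Assumed raw format: "<rating>(<count>)", for example "4.5(120)".
RATING = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\((\d+)\)\s*$")


def split_rating(raw):
    """Return (rating, count) as numbers, or (None, None) if the text differs."""
    match = RATING.match(raw)
    if match is None:
        return None, None
    return float(match.group(1)), int(match.group(2))


parsed = [split_rating(v) for v in ["4.5(120)", "5 (8)", "no ratings yet"]]
```

Values that do not fit the assumed format come back as (None, None) rather than being cut at an arbitrary position, which makes unexpected rows easy to spot.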

After using regex techniques:


noon_perfume Data Dictionary:

Here is the data dictionary after cleaning and feature engineering.

data dictionary

So from 8 features we extracted 18 features. Great :)
