Web Scraping and Regular Expressions: Doing It with Python

3 min read · Mar 1, 2021

With the largest online selection of leading brands in categories such as electronics, fashion, health & beauty, fragrances, grocery, baby products, and homeware, noon is the one-stop-shopping destination for everyone. For this article, I chose the noon website and collected data from the perfume section for the female, male, and kids departments.

This article shows how web scraping and regular expressions (regex) can be used to extract the required data from this website.

What is Web Scraping?

If you wonder what web scraping is, “Web scraping is a technique of extracting information from websites. It focuses on the transformation of unstructured data on the web, into structured data that can be stored and analyzed”.
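To make "unstructured to structured" concrete, here is a minimal sketch. The HTML snippet, the class names, and the `ProductParser` helper are all made up for illustration and are not noon's real markup. To keep the sketch dependency-free, it uses the standard library's `html.parser` rather than the Selenium and Beautiful Soup combination used for the actual dataset.

```python
from html.parser import HTMLParser

# Hypothetical product markup; real pages are larger and laid out differently.
HTML = """
<div class="product"><span class="name">Sample Perfume EDP 100ml</span><span class="price">120</span></div>
<div class="product"><span class="name">Another Scent EDT 50ml</span><span class="price">85</span></div>
"""


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per product
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "product":
            self.rows.append({})  # a new product starts here
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = ProductParser()
parser.feed(HTML)
rows = parser.rows  # a list of dicts, ready to become a table
print(rows)
```

Each dictionary in `rows` is one structured record, which is the shape a DataFrame can be built from.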

What are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern.
Usually this pattern is then used by string-searching algorithms for “find” or “find and replace” operations on strings.
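As a small illustration of those two operations, the sketch below uses Python's built-in `re` module on a made-up sentence: `re.search` performs the "find" and `re.sub` performs the "find and replace".

```python
import re

text = "Order 66 shipped, order 67 pending"  # made-up sample text

# "Find": locate the first run of one or more digits.
match = re.search(r"\d+", text)
first_number = match.group(0) if match else None

# "Find all": collect every run of digits.
all_numbers = re.findall(r"\d+", text)

# "Find and replace": swap every run of digits for a placeholder.
masked = re.sub(r"\d+", "#", text)

print(first_number)
print(all_numbers)
print(masked)
```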

The Most Famous Quote in `regex-dom`

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” — Jamie Zawinski (Netscape engineer)

There are many places where regex can be run — from your text editor to the bash shell to Python, and even SQL. It is typically baked into the standard library of programming languages.

In Python, it can be imported like so:

import re

When Should a Data Scientist Use Regular Expressions?

Regular expressions help in manipulating textual data, which is often a prerequisite for data science projects that involve text analytics.

The dataset

After using Selenium and Beautiful Soup to create our dataset, it looks like this:

Many more features can be extracted from the features above. Let's see:

To extract ml from perfume name:

# Capture the volume (e.g. "100ml"), then strip it from the name.
df['ml'] = df['name'].str.extract(r'(\d+ml)', expand=False)
df['name'] = df['name'].str.replace(r'\d+ml', '', regex=True)
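To see what the `(\d+ml)` pattern does and does not catch, here is a sketch that applies it to a few made-up product names with plain `re`, outside of pandas. The `split_volume` helper is hypothetical and only mirrors the extract-then-remove idea.

```python
import re

# Made-up product names; the real dataset may be formatted differently.
names = [
    "Sample Perfume EDP 100ml",
    "Another Scent EDT 50ml",
    "Gift Set",
    "Big Bottle 200 ML",
]


def split_volume(name):
    """Return (volume, name_without_volume) for one product name."""
    match = re.search(r"(\d+ml)", name)
    volume = match.group(1) if match else None  # pandas would give NaN here
    rest = re.sub(r"\d+ml", "", name)
    return volume, rest


results = [split_volume(name) for name in names]
for name, (volume, rest) in zip(names, results):
    print(repr(name), "->", repr(volume), repr(rest))
```

Two things stand out: the pattern is case-sensitive and expects no space before `ml`, so `"200 ML"` is not matched, and removing the volume leaves a trailing space in the name.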

To remove the brand from the perfume name:

df['name'] = df.apply(lambda row: row['name'].replace(str(row['brand']), ''), axis=1)
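One thing to watch with `str(row['brand'])`: if a brand is missing, `str()` turns the missing value into the literal text `nan`, and that text is then removed from the name. The sketch below shows the effect on a made-up name, along with a hypothetical `remove_brand` helper that skips missing brands. It is one possible guard, not part of the original pipeline.

```python
brand = float("nan")         # a missing value, as pandas usually stores it
name = "Banana Blossom EDP"  # made-up product name

# str(nan) is the text "nan", which happens to occur inside "Banana".
damaged = name.replace(str(brand), "")


def remove_brand(name, brand):
    """Remove brand from name; leave name untouched if brand is missing."""
    if not isinstance(brand, str):
        return name
    return name.replace(brand, "")


safe = remove_brand(name, brand)
normal = remove_brand("Acme Rose EDP", "Acme")  # "Acme" is a made-up brand

print(repr(damaged))
print(repr(safe))
print(repr(normal))
```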

To extract concentration from perfume name:

df['concentration'] = df['name'].str.extract(r'(.{3,5})$', expand=False)
df['name'] = df['name'].str.replace(r'(.{3,5})$', '', regex=True)
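The pattern `(.{3,5})$` simply grabs the last three to five characters of the name, so it works when the concentration is a short code at the very end. The sketch below runs it on two made-up names with plain `re`. The first assumes the volume and brand were already removed, which leaves spaces around the code.

```python
import re


def split_tail(name):
    """Return (tail, rest) using the same end-anchored pattern."""
    match = re.search(r"(.{3,5})$", name)
    tail = match.group(1) if match else None
    rest = re.sub(r"(.{3,5})$", "", name)
    return tail, rest


# Made-up names, not taken from the real dataset.
tail_a, rest_a = split_tail(" Rose EDP ")  # short code at the end
tail_b, rest_b = split_tail("Oud Parfum")  # longer word at the end

print(repr(tail_a), repr(rest_a))
print(repr(tail_b), repr(rest_b))
print(repr(tail_a.strip()))
```

For names at least five characters long, the match is always the final five characters, so a longer word such as "Parfum" gets cut in the middle. Calling `.strip()` on the result removes the surrounding spaces.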

To extract notes, scents, dispenser_type, and aromatherapy_type from the details feature:

df['Base Note'] = df['details'].str.extract(r'((?<=Base Note).+?(?= &))', expand=False)
df['fragrance_notes'] = df['details'].str.extract(r'((?<=Fragrance Notes).+?(?= &))', expand=False)
df['middle_note'] = df['details'].str.extract(r'((?<=Heart/Middle Note).+?(?= &))', expand=False)
df['department'] = df['details'].str.extract(r'((?<=Department).+?(?= &))', expand=False)
df['scents'] = df['details'].str.extract(r'((?<=Scents/Notes).+?(?= &))', expand=False)
df['dispenser_type'] = df['details'].str.extract(r'((?<=Dispenser Type).+?(?= &))', expand=False)
df['top_note'] = df['details'].str.extract(r'((?<=Top Note).+?(?= &))', expand=False)
df['aromatherapy_type'] = df['details'].str.extract(r'((?<=Aromatherapy Type).+?(?= &))', expand=False)
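All of these patterns share one shape: a lookbehind `(?<=Label)` that requires the label just before the match, a lazy `.+?` that takes as little text as possible, and a lookahead `(?= &)` that stops right before the next ` &` separator. The sketch below wraps that shape in a hypothetical `extract_field` helper and runs it on a made-up `details` string. The exact layout of the real `details` column is an assumption here.

```python
import re


def extract_field(details, label):
    """Return the text between `label` and the next ' &', or None."""
    # re.escape keeps characters such as '/' or spaces in the label literal.
    pattern = rf"(?<={re.escape(label)})(.+?)(?= &)"
    match = re.search(pattern, details)
    return match.group(1) if match else None


# Made-up details string: fields separated by " & ".
details = "Department Women & Base Note Musk & Top Note Bergamot & "

department = extract_field(details, "Department")
base_note = extract_field(details, "Base Note")
top_note = extract_field(details, "Top Note")
missing = extract_field(details, "Dispenser Type")
no_separator = extract_field("Base Note Musk", "Base Note")

print(repr(department), repr(base_note), repr(top_note))
print(missing, no_separator)
```

The captured values keep the space that follows the label, so a `.strip()` afterwards is useful. A field that is not followed by ` &` is not matched at all.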

To split the seller rating from the number of seller ratings:

df['num_seller_ratings'] = df['num_seller_ratings'].str.replace(r'[()]', '', regex=True)  # drop the parentheses
df['seller_rating'] = df['num_seller_ratings'].str.extract(r'^(.{0,3})', expand=False)  # the first three characters hold the rating
df['num_seller_ratings'] = df['num_seller_ratings'].str.replace(r'^.{0,3}', '', regex=True)  # what remains is the count
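Here is the same three-step idea on single strings with plain `re`. The raw format `"4.5(120)"` is only a guess at what the scraped value looks like, and the `split_rating` helper is hypothetical.

```python
import re


def split_rating(raw):
    """Return (seller_rating, num_seller_ratings) from one raw string."""
    no_parens = re.sub(r"[()]", "", raw)                  # drop "(" and ")"
    rating = re.search(r"^(.{0,3})", no_parens).group(1)  # first 3 characters
    count = re.sub(r"^(.{0,3})", "", no_parens)           # everything after them
    return rating, count


# Hypothetical raw values.
typical = split_rating("4.5(120)")
short = split_rating("5(120)")  # rating written with a single character

print(typical)
print(short)
```

Because `^(.{0,3})` always takes up to three characters, it relies on the rating being exactly three characters wide, as in `4.5`. A rating written as `5` would pull digits from the count into the rating.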

After using regex techniques:

noon_perfume Data Dictionary:

Here is the data dictionary after cleaning and feature engineering.

So from 8 features we extracted 18 features. GREAT :)

Written by Monirah abdulaziz
