Regular expressions in a nutshell

Definitions:
Regular Expression: It is a sequence of characters used as a search pattern.

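Regular expression module APIs:
re.match(regExp, string, flags): - It tries to match the pattern at the beginning of the string only. On success it returns a match object that holds:
                                  - The span of the match (starting and ending index)
                                  - The matched string
                                  - flags is optional and changes how the pattern is applied (for example, re.IGNORECASE for case-insensitive matching).
                                  - If the regExp doesn't match at the start of the string, it returns None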
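
A minimal sketch of re.match, using made-up sample strings that are not from the original notes. It illustrates that the pattern has to match at the very start of the string and that a flag such as re.IGNORECASE changes how the pattern is applied:

```python
import re

# re.match only tries to match at the beginning of the string.
m = re.match(r"\d+", "123 apples")
print(m.group())   # matched text
print(m.span())    # (start index, end index) of the match

# The digits are not at the start here, so re.match returns None.
no_match = re.match(r"\d+", "apples 123")
print(no_match)

# Flags change matching behaviour, for example case-insensitive matching.
ci = re.match(r"hello", "HELLO world", flags=re.IGNORECASE)
print(ci.group())
```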

re.findall(regExp, string, flags): - Returns a list of all non-overlapping matches of the pattern in the string
                                   - Returns an empty list if the pattern is not found in the string
                                   Ex: regExp = "\d+" string = 'hello 12 hi 89. Howdy 34', output = ['12', '89', '34']
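
The findall example above can be run as follows. The last call is an extra illustration, not part of the original notes: when the pattern contains groups, findall returns the groups instead of the whole match.

```python
import re

numbers = re.findall(r"\d+", "hello 12 hi 89. Howdy 34")
print(numbers)   # ['12', '89', '34']

# No digits in the string, so the result is an empty list.
nothing = re.findall(r"\d+", "no digits here")
print(nothing)   # []

# With two groups in the pattern, each result is a tuple of the groups.
pairs = re.findall(r"(\w+):(\d+)", "Twelve:12 Nine:9")
print(pairs)     # [('Twelve', '12'), ('Nine', '9')]
```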

re.split(regExp, string, maxsplit): - Splits the string at every match of regExp and returns the pieces as a list.
                                   - If the pattern is not found, it returns a list containing the original string.
                                   Ex: regExp = "\d+" string = 'Twelve:12 Eighty nine:89.', output = ['Twelve:', ' Eighty nine:', '.']
                                       - If maxsplit is set, at most that many splits are made and the rest of the string is returned as the last element. For example, with maxsplit = 1:
                                       regExp = "\d+" string = 'Twelve:12 Eighty nine:89 Nine:9.', Output: ['Twelve:', ' Eighty nine:89 Nine:9.']
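
A short runnable check of re.split with the same inputs. The third call uses a made-up string to show what comes back when the pattern never matches:

```python
import re

parts = re.split(r"\d+", "Twelve:12 Eighty nine:89.")
print(parts)     # ['Twelve:', ' Eighty nine:', '.']

# maxsplit=1 stops after the first split; the rest stays in one piece.
limited = re.split(r"\d+", "Twelve:12 Eighty nine:89 Nine:9.", maxsplit=1)
print(limited)   # ['Twelve:', ' Eighty nine:89 Nine:9.']

# No match: the result is a one-element list holding the original string.
unsplit = re.split(r"\d+", "no digits here")
print(unsplit)   # ['no digits here']
```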

re.sub(regExp, replace, string, count): - Replaces every match of the pattern in the string with "replace" and returns the new string.
                                                  Ex: regExp = "\s+" string = "ab cd ef", replace = '', output = "abcdef"
                                                - If count is specified, at most that many replacements are made

re.subn(regExp, replace, string): This method is similar to re.sub, except that it returns a tuple containing the new string and the number of replacements made.
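
re.sub and re.subn can be compared side by side on the string from the example above. This is only a sketch; the variable names are chosen for illustration:

```python
import re

text = "ab cd ef"

# Remove every run of whitespace.
all_removed = re.sub(r"\s+", "", text)
print(all_removed)     # 'abcdef'

# count=1 limits the work to the first replacement.
first_only = re.sub(r"\s+", "", text, count=1)
print(first_only)      # 'abcd ef'

# subn also reports how many replacements were made.
result, replacements = re.subn(r"\s+", "", text)
print(result, replacements)   # abcdef 2
```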

re.search(regExp, string): This method scans the whole string and stops at the first location where regExp matches.
                                                - It returns a match object which contains information like:
                                                - The span of the match
                                                - The matched string
                                                - If there is no match, it returns None
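
The difference between re.search and re.match is easiest to see on one string. The sample text below is invented for illustration:

```python
import re

text = "order 66 shipped, order 77 pending"

# search scans the string and reports only the first occurrence.
found = re.search(r"\d+", text)
print(found.group())   # '66'
print(found.span())    # (6, 8)

# match looks only at the start of the string, where there are no digits.
at_start = re.match(r"\d+", text)
print(at_start)        # None

# search also returns None when the pattern occurs nowhere.
missing = re.search(r"\d+", "no digits here")
print(missing)         # None
```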

Match Object: It is the object returned by re.match and re.search when the pattern matches.

A match object looks like: <_sre.SRE_Match object; span=(1, 7), match='234 23'> (newer Python 3 releases print it as <re.Match object; span=(1, 7), match='234 23'>)
The information it holds can be read through the following methods and attributes:

match.group(): - It returns the part of the string where the regular expression matched.
               - Sub groups can also be extracted from group:
                    - For the above example of match object: match.group(1) = '234', match.group(2) = '23'
                                                             match.group(1,2) = ('234', '23'), match.groups() = ('234', '23')
                    - For extracting the start, end and span of the match:
                                                            match.start() = 1, match.end() = 7 and match.span() = (1, 7)
                    - For extracting the regular expression used to match:
                                                            match.re = re.compile('(\d{3}) (\d{2})')
                    - For extracting the original string: match.string = "1234 234 56 343"
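
The match object described above can be reproduced with re.search. The choice of re.search here is an assumption, since the notes do not say which call produced the object, but it gives the same span and groups:

```python
import re

m = re.search(r"(\d{3}) (\d{2})", "1234 234 56 343")

whole = m.group()          # the full matched text
first = m.group(1)         # first parenthesised group
second = m.group(2)        # second parenthesised group
both = m.group(1, 2)       # several groups at once, as a tuple
all_groups = m.groups()    # every group, as a tuple

start, end, span = m.start(), m.end(), m.span()
pattern_used = m.re.pattern   # source text of the compiled pattern
original = m.string           # the string that was searched

print(whole, first, second, span)   # 234 23 234 23 (1, 7)
```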

Raw strings as Regular expression:
                    - When 'r' or 'R' is placed in front of a string literal, the literal is a raw string, and Python leaves the backslashes in it untouched.
                    - For example, '\n' is a single newline character in a regular string, but the raw string r'\n' is two characters: '\' and 'n'.
                    - Regular expressions use '\' both to escape MetaCharacters and to start special sequences such as \d. Writing patterns as raw strings passes these backslashes to the re module unchanged, so they do not have to be doubled.
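
A small sketch of why raw strings matter for patterns. The "\b" case is an added illustration: in a regular Python string it is the backspace character, so the regular expression never sees a word boundary.

```python
import re

newline_len = len("\n")    # 1: a single newline character
raw_len = len(r"\n")       # 2: a backslash followed by 'n'

# Regular string: "\b" becomes a backspace before re ever sees it.
plain = re.search("\bfoo", "foo")
# Raw string: the backslash reaches re, which reads \b as a word boundary.
raw = re.search(r"\bfoo", "foo")
print(plain, raw is not None)   # None True

# Matching one literal backslash needs an escaped backslash in the pattern.
backslash = re.search(r"\\", "C:\\temp").group()
# Without the raw prefix the same pattern needs four backslashes.
same_pattern = (r"\\" == "\\\\")
```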

MetaCharacters:
'^' - It anchors the pattern to the start of the string. Ex: "^abc"
'$' - It anchors the pattern to the end of the string.   Ex: "abc$"
'?' - It matches zero or one occurrence of the pattern to its left. Ex: "ab?"
'*' - It matches zero or more occurrences of the pattern to its left. Ex: "ab*"
'+' - It matches one or more occurrences of the pattern to its left. Ex: "ab+"
'{n,m}'- It matches at least n and at most m repetitions of the pattern to its left. Ex: "a{2,3}"
'[]'- It specifies a set of characters, any one of which may match.           Ex: "[abc]", "[a-e]", "[1-8]"
'.' - It matches any single character except newline (to include newline, pass re.DOTALL as a flag while searching). Ex: "ab."
'|' - It matches either of two patterns (it works like a logical or). Ex: "a|b"
'()'- It groups sub-patterns to build complex patterns. Ex: "(a|b|c)de"
'\' - It is used to escape characters, including MetaCharacters, so that they are matched literally.
      Ex: "\$a" matches if the string contains "$a".
      '\' also introduces special sequences. Some of them are:
      \A - It matches if the specified characters are at the start of the string
           Ex: "\Athe" matches "the sun" but not "I'm the sun"
      \b - It matches if the specified characters are at the start or end of a word.
           Ex: "\bfoo" matches if foo is at the start of any word. It matches "foo is foul" but not "afool"
               "foo\b" matches if foo is at the end of any word. It matches "the foo", "the afoo" but not "a foodtest"
      \B - It is the opposite of \b. It matches if the specified characters are NOT at the start or end of a word.
           Ex: "\Bfoo" matches "afoo" but not "foo is foul"
      \d - It matches any decimal digit in the string.
           Ex: "\d" matches "1234" but not "ab"
      \D - It is the opposite of \d. It matches any character that is not a decimal digit.
      \s - It matches any whitespace character. For ASCII text it is equivalent to [ \t\n\r\f\v]
           Ex: "\s" matches " ab " but not "ab"
      \S - It matches any non-whitespace character. It is the opposite of "\s"
      \w - It matches any alphanumeric character or underscore. For ASCII text it is equivalent to [A-Za-z0-9_]
           Ex: "\w" matches "abc89" but not "+%"
      \W - It is the opposite of "\w". It matches any character that is not alphanumeric or an underscore.
      \Z - It matches if the specified characters are at the end of the string.
           Ex: "Python\Z"  matches "Python", "My favorite language is Python" but not "Python is fun"
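
The MetaCharacters and special sequences listed above can be tried out with short calls like the ones below. This is only a sketch: apart from the patterns quoted in the list, the sample strings and variable names are assumptions chosen for illustration.

```python
import re

sample = "a ab abb"

# Quantifiers apply to the item on their left (here, the 'b' or the 'a').
zero_or_one = re.findall(r"ab?", sample)                # ['a', 'ab', 'ab']
zero_or_more = re.findall(r"ab*", sample)               # ['a', 'ab', 'abb']
one_or_more = re.findall(r"ab+", sample)                # ['ab', 'abb']
two_to_three = re.findall(r"a{2,3}", "a aa aaa aaaa")   # ['aa', 'aaa', 'aaa']

# Anchors, sets, alternation, grouping and escaping.
anchored_start = bool(re.search(r"^abc", "abcdef"))     # True
anchored_end = bool(re.search(r"abc$", "abcdef"))       # False
in_set = re.findall(r"[a-e]", "hello bed")              # ['e', 'b', 'e', 'd']
either = re.findall(r"a|b", "cab")                      # ['a', 'b']
grouped = re.search(r"(a|b|c)de", "xbde").group()       # 'bde'
escaped = re.search(r"\$a", "cost: $a").group()         # '$a'

# '.' skips newlines unless re.DOTALL is passed as a flag.
dot_default = re.findall(r"ab.", "abc ab\nabd")
dot_all = re.findall(r"ab.", "abc ab\nabd", re.DOTALL)

# Special sequences.
start_only = bool(re.search(r"\Athe", "I'm the sun"))   # False
word_start = bool(re.search(r"\bfoo", "afool"))         # False
word_end = bool(re.search(r"foo\b", "the afoo"))        # True
not_boundary = re.findall(r"\Bfoo", "foo afoo")         # only the 'foo' inside 'afoo'
digits = re.findall(r"\d", "a1b2")                      # ['1', '2']
non_digits = re.findall(r"\D", "a1b2")                  # ['a', 'b']
spaces = re.findall(r"\s", " ab ")                      # [' ', ' ']
non_spaces = re.findall(r"\S", " ab ")                  # ['a', 'b']
word_chars = re.findall(r"\w", "a_9+%")                 # ['a', '_', '9']
non_word = re.findall(r"\W", "a_9+%")                   # ['+', '%']
end_only = bool(re.search(r"Python\Z", "Python is fun"))  # False

print(zero_or_one, zero_or_more, one_or_more)
```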
Note:
1) If '^' is the first character inside '[]', the set is negated: it matches any character that is NOT listed. Ex: "[^abc]" matches any character except 'a', 'b' and 'c'.
2) The regular expression "ma+n" matches "man", "maan", "maaan" (one or more occurrences of 'a'). To match
   "man", "maman", "mamaman" (one or more occurrences of "ma"), use "(ma)+n": the characters have to be wrapped in "()" for the quantifier to apply to the whole group.
3) If the regular expression is ""
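
Notes 1 and 2 can be checked with a few calls. This sketch uses re.fullmatch so that the whole word has to fit the pattern; the variable names and the "abcdef" sample are illustrative assumptions:

```python
import re

# Note 1: '^' as the first character inside '[]' negates the set.
not_abc = re.findall(r"[^abc]", "abcdef")
print(not_abc)   # ['d', 'e', 'f']

# Note 2: '+' applies only to the 'a' unless "ma" is grouped.
single_a = bool(re.fullmatch(r"ma+n", "maaan"))           # True
no_repeat = bool(re.fullmatch(r"ma+n", "maman"))          # False
grouped_two = bool(re.fullmatch(r"(ma)+n", "maman"))      # True
grouped_three = bool(re.fullmatch(r"(ma)+n", "mamaman"))  # True
grouped_aa = bool(re.fullmatch(r"(ma)+n", "maan"))        # False
```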
Examples:
Search pattern: '^a...s$'
                - This regular expression matches a string that starts with 'a', is followed by any three characters,
                  and ends with 's'.
                - For example, it matches strings like "aisus" and "ab cs", but not "Assus" (the pattern is case-sensitive).
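
The pattern can be tested against a handful of words. The last two words are extra cases, not from the example above, added to show that the string must be exactly five characters long:

```python
import re

pattern = r"^a...s$"
words = ["aisus", "ab cs", "Assus", "abs", "abcdefs"]

# Map each word to whether the pattern matches it.
results = {word: bool(re.search(pattern, word)) for word in words}

for word, matched in results.items():
    print(repr(word), matched)
```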
