check if the data is correct in database


I have a database of emails, like the one below, and I want to filter out the emails that are not correct.
For example:

  1. if the email does not contain “.”
  2. if the email has more than one “@”
  3. if the email has more than one “.” before and after the “@”
  4. if the email has spaces inside it or around it
  5. if the email has a domain other than “gmail.com” (such as hotmail.com or live.com)

Please help me with an approach that lets me add more conditions if I find anything else to amend in the future.

df <- data.frame(email=c("[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]@live.com","[email protected]","[email protected]","[email protected]",
                   "[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"))

For example, the output should look like:

Parsing emails in a way that accounts for almost all possible cases requires a complex regular expression, such as

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

See RFC 5322; see also this Stack Overflow thread.

Starting at condition 5 in the original post (keep only gmail.com addresses) reduces the complexity, however, and makes the other tests in the original post unnecessary:

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

df <- data.frame(email=c("[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]@live.com","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"))

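# Pattern identifying Gmail addresses (unanchored, and "." matches any character)
is_gmail <- "gmail.com"

# Keep only the rows whose email matches the pattern
df %>% filter(str_detect(email, is_gmail))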
#>            email
#> 1  [email protected]
#> 2  [email protected]
#> 3  [email protected]
#> 4  [email protected]
#> 5  [email protected]
#> 6  [email protected]
#> 7  [email protected]
#> 8  [email protected]
#> 9  [email protected]
#> 10 [email protected]
#> 11 [email protected]
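
One caveat about the pattern above: it is unanchored, and its "." is a regex wildcard, so it also matches strings that merely contain something resembling gmail.com. A stricter variant might look like the sketch below. The `addresses` vector and the name `is_gmail_strict` are made up for illustration and are not part of the original data.

```r
library(stringr)

# Hypothetical sample addresses, for illustration only
addresses <- c("alice@gmail.com",
               "bob@gmail.com@live.com",
               "carol@gmailxcom",
               "dave@hotmail.com")

# Original pattern: matches "gmail", any one character, then "com", anywhere
is_gmail <- "gmail.com"

# Stricter pattern: literal dot, and the domain must end the string
is_gmail_strict <- "@gmail\\.com$"

loose  <- str_detect(addresses, is_gmail)
strict <- str_detect(addresses, is_gmail_strict)

data.frame(addresses, loose, strict)
```

With these sample values the loose pattern accepts the first three addresses, while the strict one accepts only the first.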

Created on 2020-08-27 by the reprex package (v0.3.0)
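
If the individual conditions from the original post should stay visible and easy to extend, one possible approach is to keep each check as a named function in a list and require every check to pass. This is only a sketch: the `emails` vector, the rule names, and the reading of condition 3 (at most one "." before the "@") are assumptions, not something the original post specifies.

```r
library(stringr)

# Hypothetical sample addresses, for illustration only
emails <- c("alice@gmail.com",
            "bob@gmail.com@live.com",
            "carol @gmail.com",
            "dave@hotmail.com",
            "eve@gmailcom",
            "f.r.ank@gmail.com")

# Each rule returns TRUE where an address passes; add new rules to this list
rules <- list(
  has_dot       = function(x) str_detect(x, fixed(".")),
  one_at        = function(x) str_count(x, fixed("@")) == 1,
  # Assumed reading of condition 3: at most one "." before the first "@"
  one_dot_local = function(x) str_count(str_extract(x, "^[^@]*"), fixed(".")) <= 1,
  no_spaces     = function(x) !str_detect(x, "\\s"),
  gmail_domain  = function(x) str_detect(x, "@gmail\\.com$")
)

# Logical matrix: one row per address, one named column per rule
checks <- do.call(cbind, lapply(rules, function(rule) rule(emails)))

# An address is kept only if it passes every rule
valid <- apply(checks, 1, all)

emails[valid]
```

Inspecting `checks` shows which rule rejected which address, and a new condition is added by appending another named function to `rules`.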
