API Reference

This chapter is for developers of demeuk and documents the API functions.

Demeuk-api

Demeuk - a simple tool to clean up corpora

Usage:
    demeuk [options]

Examples:
    demeuk -i inputfile.tmp -o outputfile.dict -l logfile.txt
    demeuk -i "inputfile*.txt" -o outputfile.dict -l logfile.txt
    demeuk -i "inputdir/*" -o outputfile.dict -l logfile.txt
    demeuk -i inputfile -o outputfile -j 24
    demeuk -i inputfile -o outputfile -c -e
    demeuk -i inputfile -o outputfile --threads all
    cat inputfile | demeuk --leak -j all | sort -u > outputfile

Standard Options:
    -i --input <path to file>       Specify the input file to be cleaned, or provide a glob pattern.
                                    (default: stdin)
    -o --output <path to file>      Specify the output file name. (default: stdout)
    -l --log <path to file>         Optional, specify where the log file needs to be written to (default: stderr)
    -j --threads <threads>          Optional, specify the number of threads to spawn. Specify the string 'all' to
                                    make demeuk auto-detect the number of threads to start based on the CPUs
                                    (default: all threads).
                                    Note: threading will cost some setup time. Only speeds up for larger files.
    --input-encoding <encoding>     Forces demeuk to decode the input using this encoding (default: en_US.UTF-8).
    --output-encoding <encoding>    Forces demeuk to encode the output using this encoding (default: en_US.UTF-8).
    -v --verbose                    When set, prints some extra information to stderr and prints the
                                    lines containing errors to the logfile.
    --debug                         When set, the logfile will not only contain lines which caused an error, but
                                    also lines which were modified.
    --progress                      Prints out the progress of the demeuk process.
    -n --limit <int>                Limit the number of lines per thread.
    -s --skip <int>                 Skip <int> amount of lines per thread.
    --punctuation <punctuation>     Use to set the punctuation that is used by options. Defaults to:
                                    ! "#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    --version                       Prints the version of demeuk.

Separating Options:
    -c --cut                        Specify if demeuk should split (default splits on ':'). Returns everything
                                    after the delimiter.
    --cut-before                    Specify if demeuk should return the string before the delimiter.
                                    When cutting, demeuk by default returns the string after the delimiter.
    -f --cut-fields <field>         Specifies the field to be returned, using the 'cut' syntax:
                                    N: N-th field; N-: from N-th field to end of line; N-M: from N-th field
                                    to M-th field; -M: from start to M-th field.
    -d --delimiter <delimiter>      Specify which delimiter will be used for cutting. Multiple delimiters can be
                                    specified using ','. If the ',' is required for cutting, escape it with a
                                    backslash. Only one delimiter can be used per line.

Check modules (check if a line matches a specific condition):
    --check-min-length <length>     Requires that entries contain at least <length> unicode chars
    --check-max-length <length>     Requires that entries contain at most <length> unicode chars
    --check-case                    Drop lines where the uppercase line is not equal to the lowercase line
    --check-controlchar             Drop lines containing control chars.
    --check-email                   Drop lines containing e-mail addresses.
    --check-hash                    Drop lines which are hashes.
    --check-mac-address             Drop lines which are MAC-addresses.
    --check-uuid                    Drop lines which are UUID.
    --check-non-ascii               If a line contains a non-ASCII char, e.g. ü or ç (or anything outside the
                                    ASCII range), the line is dropped.
    --check-replacement-character   Drop lines containing replacement characters '�'.
    --check-starting-with <string>  Drop lines starting with string, can be multiple strings. Specify multiple
                                    strings as a comma-separated list.
    --check-ending-with <string>    Drop lines ending with string, can be multiple strings. Specify multiple
                                    strings as a comma-separated list.
    --check-contains <string>       Drop lines containing string, can be multiple strings. Specify multiple
                                    strings as a comma-separated list.
    --check-empty-line              Drop lines that are empty or only contain whitespace characters
    --check-regex <string>          Drop lines that do not match the regex. Regex is a comma-separated list of
                                    regexes. Example: [a-z]{1,8},[0-9]{1,8}
    --check-min-digits <count>      Require that entries contain at least <count> digits
                                    (following the Python definition of a digit,
                                    see https://docs.python.org/3/library/stdtypes.html#str.isdigit)
    --check-max-digits <count>      Require that entries contain at most <count> digits
                                    (following the Python definition of a digit,
                                    see https://docs.python.org/3/library/stdtypes.html#str.isdigit)
    --check-min-uppercase <count>   Require that entries contain at least <count> uppercase letters
                                    (following the Python definition of uppercase,
                                    see https://docs.python.org/3/library/stdtypes.html#str.isupper)
    --check-max-uppercase <count>   Require that entries contain at most <count> uppercase letters
                                    (following the Python definition of uppercase,
                                    see https://docs.python.org/3/library/stdtypes.html#str.isupper)
    --check-min-specials <count>    Require that entries contain at least <count> specials
                                    (a special is defined as a non whitespace character which is not alphanumeric,
                                    following the Python definitions of both,
                                    see https://docs.python.org/3/library/stdtypes.html#str.isspace
                                    and https://docs.python.org/3/library/stdtypes.html#str.isalnum)
    --check-max-specials <count>    Require that entries contain at most <count> specials
                                    (a special is defined as a non whitespace character which is not alphanumeric,
                                    following the Python definitions of both,
                                    see https://docs.python.org/3/library/stdtypes.html#str.isspace
                                    and https://docs.python.org/3/library/stdtypes.html#str.isalnum)


Modify modules (modify a line in place):
    --hex                           Replace lines like: $HEX[41424344] with ABCD.
    --html                          Replace lines like: &#351;ifreyok with şifreyok.
    --html-named                    Replace named entities like &alpha;. Such structures may also occur in real
                                    passwords, so be careful when enabling this option.
    --lowercase                     Replace lines like 'This Test String' with 'this test string'
    --title-case                    Replace lines like 'this test string' with 'This Test String'
    --umlaut                        Replace lines like ko"ffie with an o with an umlaut.
    --mojibake                      Fixes mojibakes, which means lines like SmˆrgÂs will be fixed to Smörgås.
    --encode                        Enables guessing of encoding, based on chardet and custom implementation.
    --tab                           Enables replacing tab char with ':', sometimes leaks contain both ':' and '\t'.
    --newline                       Enables removing newline characters (\r\n) from end and beginning of lines.
    --non-ascii                     Replace non ascii char with their replacement letters. For example ü
                                    becomes u, ç becomes c.
    --trim                          Enables removing newline representations from end and beginning. Newline
                                    representations detected are '\\n', '\\r', '\n', '\r', '<br>', and '<br />'.
    --transliterate <language>      Transliterate a string, for example "ipsum" becomes "իպսում". The following
                                    languages are supported: ka, sr, l1, ru, mn, uk, mk, el, hy and bg.

Add modules (Modify a line, but keep the original as well):
    --add-lower                     If a line contains a capital letter this will add the lower case variant
    --add-first-upper               If a line does not contain a capital letter this will add the capital variant
    --add-title-case                Add a line like 'this test string' also as a 'This Test String'
    --add-latin-ligatures           If a line contains a single ligature of a latin letter (such as ij), the line
                                    is corrected but the original line containing the ligature is also added to
                                    the output.
    --add-split                     Split on known chars like - and . and add those to the final dictionary.
    --add-umlaut                    In some spelling dicts, umlaut are sometimes written as: o" or i" and not as
                                    one char.
    --add-without-punctuation       If a line contains punctuations, a variant will be added without the
                                    punctuations

Remove modules (remove specific parts of a line):
    --remove-strip-punctuation      Remove starting and trailing punctuation
    --remove-punctuation            Remove all punctuation in a line
    --remove-email                  Enable email filter, this will catch strings like
                                    1238661:test@example.com:password

Macro modules:
    -g --googlengram                When set, demeuk will strip universal pos tags: like _NOUN_ or _ADJ
    --leak                          When set, demeuk will run the following modules:
                                        mojibake, encode, newline, check-controlchar
                                    This is recommended when working with leaks and was the default behavior in
                                    demeuk version 3.11.0 and below.
    --leak-full                     When set, demeuk will run the following modules:
                                        mojibake, encode, newline, check-controlchar,
                                        hex, html, html-named,
                                        check-hash, check-mac-address, check-uuid, check-email,
                                        check-replacement-character, check-empty-line

bin.demeuk.add_first_upper(line)

Returns the line with the first letter capitalized and all the others in lowercase.

Param:

line (unicode)

Returns:

False if they are the same, the capitalized string if they are not
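
The return convention above (False when nothing changes, otherwise the new string) can be illustrated with a minimal sketch. The name add_first_upper_sketch is hypothetical, and the use of str.capitalize is an assumption; this is not taken from the demeuk source.

```python
def add_first_upper_sketch(line):
    """Return the capitalized variant of line, or False if it is unchanged."""
    # str.capitalize uppercases the first character and lowercases the rest.
    capitalized = line.capitalize()
    if capitalized == line:
        return False
    return capitalized


print(add_first_upper_sketch("password"))  # Password
print(add_first_upper_sketch("Password"))  # False
```

Callers should compare the result with "is False" rather than rely on truthiness, because an empty string is also falsy.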

bin.demeuk.add_latin_ligatures(line)

Returns the line cleaned of latin ligatures if there are any.

Param:

line (unicode)

Returns:

False if there are no latin ligatures, otherwise the corrected line

bin.demeuk.add_lower(line)

Returns the lowercase variant of the line if it differs from the original line.

Param:

line (unicode)

Returns:

False if they are the same, the lowercased string if they are not

bin.demeuk.add_split(line, punctuation=(' ', '-', '\\.'))

Split the line on the punctuation and return elements longer than 1 char.

Param:

line (unicode)
punctuation (tuple of strings)

Returns:

split line
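
As a rough illustration of the splitting described above, the sketch below treats the entries of the punctuation tuple as regular-expression fragments, which is an assumption suggested by the escaped dot in the default value. The name add_split_sketch is hypothetical.

```python
import re


def add_split_sketch(line, punctuation=(' ', '-', r'\.')):
    """Split line on any punctuation fragment and keep parts longer than 1 char."""
    # Join the fragments into one alternation, e.g. " |-|\."
    pattern = '|'.join(punctuation)
    return [part for part in re.split(pattern, line) if len(part) > 1]


print(add_split_sketch("sea-horse a.b"))  # ['sea', 'horse']
```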

bin.demeuk.add_title_case(line)

Returns title case string where all the first letters are capitals and all others in lowercase.

Param:

line (unicode)

Returns:

False if they are the same, the title case string if they are not

bin.demeuk.add_without_punctuation(line, punctuation)

Returns the line cleaned of punctuation.

Param:

line (unicode)
punctuation (unicode)

Returns:

False if there is no punctuation, otherwise the corrected line

bin.demeuk.check_case(line, ignored_chars=(' ', "'", '-'))

Checks if an uppercase line is equal to a lowercase line.

Param:

line (unicode)
ignored_chars (list of strings)

Returns:

true if uppercase line is equal to lowercase line

bin.demeuk.check_character(line, character)

Checks if a line contains a specific character

Params:

line (unicode)
character (unicode)

Returns:

true if line does contain the specific character

bin.demeuk.check_contains(line, strings)

Checks if a line contains any of the specific strings

Params:

line (unicode)
strings (list of str)

Returns:

true if line does contain any one of the strings

bin.demeuk.check_controlchar(line)

Detects control chars, returns True when detected

Params:

line (Unicode)

Returns:

Status, String

bin.demeuk.check_email(line)

Check if lines contain e-mail addresses with a simple regex

Params:

line (unicode)

Returns:

true if line does not contain email

bin.demeuk.check_empty_line(line)

Checks if a line is empty or only contains whitespace chars

Params:

line (unicode)

Returns:

true if line is empty or only contains whitespace chars

bin.demeuk.check_ending_with(line, strings)

Checks if a line ends with specific strings

Params:

line (unicode)
strings (list of str)

Returns:

true if line does end with one of the strings

bin.demeuk.check_hash(line)

Check if a line contains a hash

Params:

line (unicode)

Returns:

true if line does not contain hash

bin.demeuk.check_length(line, min=0, max=0)

Does a length check on the line

Params:

line (unicode)
min (int)
max (int)

Returns:

true if length is ok
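
A minimal sketch of such a length check follows. It assumes that a bound of 0 means "no limit", which matches the default values in the signature but is not confirmed by the text above. The name check_length_sketch and the parameter names min_len and max_len are hypothetical (chosen to avoid shadowing the Python builtins).

```python
def check_length_sketch(line, min_len=0, max_len=0):
    """Return True if len(line) lies within the bounds; 0 disables a bound (assumption)."""
    # len() on a str counts Unicode code points, not bytes.
    if min_len and len(line) < min_len:
        return False
    if max_len and len(line) > max_len:
        return False
    return True


print(check_length_sketch("abc", min_len=4))          # False
print(check_length_sketch("caf\u00e9", max_len=4))    # True
```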

bin.demeuk.check_mac_address(line)

Check if a line contains a MAC-address

Params:

line (unicode)

Returns:

true if line does not contain a MAC-address

bin.demeuk.check_non_ascii(line)

Checks if a line contains non-ASCII chars

Params:

line (unicode)

Returns:

true if line does not contain non ascii chars

bin.demeuk.check_regex(line, regex)

Checks if a line matches a list of regexes

Params:

line (unicode)
regex (list)

Returns:

true if all regexes match, false if the line does not match a regex
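
The "all regexes must match" rule can be sketched as below. Whether demeuk searches anywhere in the line or anchors the match is not stated here, so the use of re.search is an assumption, and the name check_regex_sketch is hypothetical.

```python
import re


def check_regex_sketch(line, regexes):
    """Return True only if every regex matches somewhere in line."""
    return all(re.search(regex, line) is not None for regex in regexes)


# The two patterns from the --check-regex example in the usage text.
patterns = ["[a-z]{1,8}", "[0-9]{1,8}"]
print(check_regex_sketch("abc123", patterns))  # True
print(check_regex_sketch("abcdef", patterns))  # False
```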

bin.demeuk.check_starting_with(line, strings)

Checks if a line starts with specific strings

Params:

line (unicode)
strings (list of str)

Returns:

true if line does start with one of the strings

bin.demeuk.check_uuid(line)

Check if a line contains a UUID

Params:

line (unicode)

Returns:

true if line does not contain a UUID

bin.demeuk.chunkify(filename, size=1048576)

bin.demeuk.clean_add_umlaut(line)

Returns the line cleaned of incorrect umlauting

Param:

line (unicode)

Returns:

Corrected line

bin.demeuk.clean_cut(line, delimiters, fields)

Finds the first delimiter and returns the remaining string either after or before the delimiter.

Params:

line (unicode)
delimiters (list of unicode)
fields (unicode)

Returns:

line (unicode)
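
The field syntax used by --cut-fields (N, N-, N-M and -M) can be illustrated with a simplified sketch. It handles a single delimiter only and is not the demeuk implementation; the name select_fields and its defaults are hypothetical.

```python
def select_fields(line, delimiter=":", fields="2-"):
    """Return the fields of line selected by a cut-style specification."""
    parts = line.split(delimiter)
    if "-" in fields:
        start, _, end = fields.partition("-")
        first = int(start) if start else 1          # "-M" starts at field 1
        last = int(end) if end else len(parts)      # "N-" runs to the last field
        selected = parts[first - 1:last]
    else:
        index = int(fields)                         # "N" selects a single field
        selected = parts[index - 1:index]
    return delimiter.join(selected)


print(select_fields("user:pass:extra", fields="2-"))  # pass:extra
print(select_fields("user:pass:extra", fields="-2"))  # user:pass
```

Fields are 1-indexed, as in the cut utility, so the sketch subtracts one before slicing.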

bin.demeuk.clean_encode(line, input_encoding)

Detects and tries encoding

Params:

line (bytes)
input_encoding (str)

Returns:

Decoded UTF-8 string

bin.demeuk.clean_googlengram(line)

Removes part-of-speech tags from the line, specific to the googlengram module

Param:

line (unicode)

Returns:

line (unicode)

bin.demeuk.clean_hex(line)

Converts strings like ‘$HEX[]’ to proper binary

Params:

line (bytes)

Returns:

line (bytes)
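
A sketch of the $HEX[...] conversion, operating on bytes as the parameter type above suggests. The regular expression and the handling of malformed input (left untouched) are assumptions, and the name clean_hex_sketch is hypothetical.

```python
import re

# Matches $HEX[...] with any number of hexadecimal digits inside the brackets.
HEX_PATTERN = re.compile(rb"\$HEX\[([0-9a-fA-F]*)\]")


def clean_hex_sketch(line):
    """Replace every $HEX[...] marker in a bytes line with the decoded bytes."""
    def decode(match):
        digits = match.group(1)
        if len(digits) % 2:
            # An odd number of hex digits cannot be decoded; keep the marker.
            return match.group(0)
        return bytes.fromhex(digits.decode("ascii"))

    return HEX_PATTERN.sub(decode, line)


print(clean_hex_sketch(b"$HEX[41424344]"))  # b'ABCD'
```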

bin.demeuk.clean_html(line)

Detects html encode chars and decodes them

Params:

line (Unicode)

Returns:

line (Unicode)
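
Decoding numeric character references such as &#351; can be sketched as follows. The sketch deliberately leaves named entities alone, on the assumption that those are handled separately by clean_html_named; the pattern and the name clean_html_sketch are hypothetical.

```python
import re

# Decimal (&#351;) or hexadecimal (&#x15F;) numeric character references.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def clean_html_sketch(line):
    """Replace numeric HTML character references with the characters they denote."""
    def decode(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Not a valid code point; keep the original text.
            return match.group(0)

    return NUMERIC_ENTITY.sub(decode, line)


print(clean_html_sketch("&#351;ifreyok"))
```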

bin.demeuk.clean_html_named(line)

Detects named html encode chars and decodes them

Params:

line (Unicode)

Returns:

line (Unicode)
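
For named entities, the Python standard library already provides html.unescape. The sketch below simply wraps it; whether demeuk restricts which entities are decoded is not stated here, and the name clean_html_named_sketch is hypothetical. Note that html.unescape decodes numeric references as well.

```python
import html


def clean_html_named_sketch(line):
    """Decode HTML character references such as &alpha; via the standard library."""
    return html.unescape(line)


print(clean_html_named_sketch("R&amp;D"))  # R&D
```

The example also shows why this option needs care: a genuine password that happens to contain an entity-like sequence would be altered.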

bin.demeuk.clean_lowercase(line)

Replace all capitals to lowercase

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_mojibake(line)

Detects mojibake and tries to correct it. Mojibake are strings that have been decoded with the wrong encoding. This results in strings like: Ãºnico which should be único.

Param:

line (str)

Returns:

Cleaned string
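
The simplest form of mojibake, UTF-8 bytes that were decoded as Latin-1, can be reversed with a round trip through the standard codecs. This sketch covers only that one case and is an illustration, not the detection logic demeuk uses; the name fix_latin1_mojibake is hypothetical.

```python
def fix_latin1_mojibake(line):
    """Undo 'UTF-8 decoded as Latin-1'; return line unchanged if that does not apply."""
    try:
        return line.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so this kind of repair is not applicable.
        return line


print(fix_latin1_mojibake("\u00c3\u00banico"))  # único
```

A string that is already correct, such as único itself, fails the UTF-8 decode step and is returned unchanged.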

bin.demeuk.clean_newline(line)

Delete newline characters at start and end of line

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_non_ascii(line)

Replace non-ASCII chars with their ASCII representation.

Params:

line (Unicode)

Returns:

line (Unicode)
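
One standard-library way to approximate this replacement is to decompose accented letters and drop the combining marks. This is only a sketch under that assumption; demeuk may rely on a dedicated transliteration library instead, and the name clean_non_ascii_sketch is hypothetical.

```python
import unicodedata


def clean_non_ascii_sketch(line):
    """Replace accented letters with their base letter and drop other non-ASCII chars."""
    # NFKD splits e.g. "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", line)
    # Encoding with errors="ignore" drops everything outside ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(clean_non_ascii_sketch("\u00fcber gar\u00e7on"))  # uber garcon
```

Characters without a decomposition are removed entirely by this approach rather than replaced.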

bin.demeuk.clean_tab(line)

Replace tab characters with ‘:’ greedily

Params:

line (bytes)

Returns:

line (bytes)

bin.demeuk.clean_title_case(line)

Replace words to title word (uppercasing first letter)

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_transliterate(line, language)

Transliterate a string

Params:

line (Unicode)
language (str)

Returns:

line (Unicode)

bin.demeuk.clean_trim(line)

Delete leading and trailing character sequences representing a newline from the beginning and end of the line.

Params:

line (Unicode)

Returns:

line (Unicode)
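
The trimming can be sketched as a loop that keeps stripping until none of the newline representations listed under --trim remain at either end. The loop structure and the names TRIM_BLOCKS and clean_trim_sketch are hypothetical.

```python
# Newline representations listed for the --trim option.
TRIM_BLOCKS = ("\\n", "\\r", "\n", "\r", "<br>", "<br />")


def clean_trim_sketch(line):
    """Repeatedly strip newline representations from both ends of line."""
    changed = True
    while changed:
        changed = False
        for block in TRIM_BLOCKS:
            if line.startswith(block):
                line = line[len(block):]
                changed = True
            if line.endswith(block):
                line = line[:-len(block)]
                changed = True
    return line


print(clean_trim_sketch("<br>password\\n\r\n"))  # password
```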

bin.demeuk.clean_up(lines)

Main clean loop, this calls all the other clean functions.

Parameters:

line (bytes) – Line to be cleaned up

Returns:

(str(Decoded line), str(Failed line))

bin.demeuk.contains_at_least(line, bound, char_property)

Check if the line contains at least bound characters with given property.

Params:

line (unicode)
bound (int)
char_property (str -> bool)

Returns:

true if at least bound characters match, false otherwise
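
Because char_property is any function from a character to a bool, the check reduces to counting matching characters. A minimal sketch, with the hypothetical name contains_at_least_sketch and str.isdigit as the property (the definition of a digit that --check-min-digits refers to):

```python
def contains_at_least_sketch(line, bound, char_property):
    """Return True if at least bound characters of line satisfy char_property."""
    return sum(1 for char in line if char_property(char)) >= bound


print(contains_at_least_sketch("abc123", 3, str.isdigit))  # True
print(contains_at_least_sketch("abc12", 3, str.isdigit))   # False
```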

bin.demeuk.contains_at_most(line, bound, char_property)

Check if the line contains at most bound characters with given property.

Params:

line (unicode)
bound (int)
char_property (str -> bool)

Returns:

true if at most bound characters match, false otherwise
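
The upper-bound variant can be sketched the same way. The example uses the definition of a "special" given for --check-max-specials (a non-whitespace character that is not alphanumeric); the helper is_special and the name contains_at_most_sketch are hypothetical.

```python
def is_special(char):
    """A special is a non-whitespace character that is not alphanumeric."""
    return not char.isspace() and not char.isalnum()


def contains_at_most_sketch(line, bound, char_property):
    """Return True if at most bound characters of line satisfy char_property."""
    return sum(1 for char in line if char_property(char)) <= bound


print(contains_at_most_sketch("p@ss w0rd!", 2, is_special))  # True
print(contains_at_most_sketch("p@ss w0rd!", 1, is_special))  # False
```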

bin.demeuk.init_worker(config_data)

bin.demeuk.main()

bin.demeuk.remove_email(line)

Removes e-mail addresses from a line.

Params:

line (unicode)

Returns:

line (unicode)

bin.demeuk.remove_punctuation(line, punctuation)

Returns the line without punctuation

Param:

line (unicode)
punctuation (unicode)

Returns:

line without punctuation
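
Removing every punctuation character can be sketched with str.translate. The use of string.punctuation as the punctuation set in the example is an assumption, and the name remove_punctuation_sketch is hypothetical.

```python
import string


def remove_punctuation_sketch(line, punctuation):
    """Return line with every character listed in punctuation removed."""
    # The third argument of str.maketrans lists characters to delete.
    return line.translate(str.maketrans("", "", punctuation))


print(remove_punctuation_sketch("he-llo, wo.rld!", string.punctuation))  # hello world
```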

bin.demeuk.remove_strip_punctuation(line, punctuation)

Returns the line without start and end punctuation

Param:

line (unicode)
punctuation (unicode)

Returns:

line without start and end punctuation
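
Stripping only leading and trailing punctuation maps directly onto str.strip, which accepts a set of characters to remove from both ends. A minimal sketch under that assumption, with the hypothetical name remove_strip_punctuation_sketch:

```python
import string


def remove_strip_punctuation_sketch(line, punctuation):
    """Return line without leading and trailing punctuation characters."""
    # Punctuation inside the line is left untouched.
    return line.strip(punctuation)


print(remove_strip_punctuation_sketch("!!hello, world??", string.punctuation))  # hello, world
```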

bin.demeuk.stderr_print(*args, **kwargs)

bin.demeuk.try_encoding(line, encoding)

Tries to decode a line using supplied encoding

Params:

line (Byte): byte variable that will be decoded
encoding (string): the encoding to be tried

Returns:

False if decoding failed, the decoded string if decoding worked
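
The "False or decoded string" convention can be sketched as follows. Only UnicodeDecodeError is treated as a failed decode here, which is an assumption, and the name try_encoding_sketch is hypothetical.

```python
def try_encoding_sketch(line, encoding):
    """Return the decoded string, or False if line is not valid in encoding."""
    try:
        return line.decode(encoding)
    except UnicodeDecodeError:
        return False


print(try_encoding_sketch(b"caf\xc3\xa9", "utf-8"))  # café
print(try_encoding_sketch(b"caf\xe9", "utf-8"))      # False
```

Because an empty byte string decodes to an empty (falsy) string, callers should test the result with "is False" rather than with a plain truth check.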