API Reference

This chapter is for developers of demeuk. It documents the API functions.

Demeuk-api

Demeuk - a simple tool to clean up corpora

Usage:
    demeuk [options]

Examples:
    demeuk -i inputfile.tmp -o outputfile.dict -l logfile.txt
    demeuk -i "inputfile*.txt" -o outputfile.dict -l logfile.txt
    demeuk -i "inputdir/*" -o outputfile.dict -l logfile.txt
    demeuk -i inputfile -o outputfile -j 24
    demeuk -i inputfile -o outputfile -c -e
    demeuk -i inputfile -o outputfile --threads all

Standard Options:
    -i --input <path to file>       Specify the input file to be cleaned, or provide a glob pattern
    -o --output <path to file>      Specify the output file name.
    -l --log <path to file>         Optional, specify where the log file needs to be written to
    -j --threads <threads>          Optional, demeuk doesn't use threads by default. Specify the number of threads to
                                    spawn. Specify the string 'all' to make demeuk auto-detect the number of threads
                                    to start based on the number of CPUs.
                                    Note: threading costs some setup time and only speeds up larger files.
    --input-encoding <encoding>     Forces demeuk to decode the input using this encoding (default: en_US.UTF-8).
    --output-encoding <encoding>    Forces demeuk to encode the output using this encoding (default: en_US.UTF-8).
    -v --verbose                    When set, the logfile will not only contain lines which caused an error, but
                                    also lines which were modified.
    --progress                      Prints out the progress of the demeuk process.
    -n --limit <int>                Limit the number of lines per thread.
    -s --skip <int>                 Skip <int> lines per thread.
    --punctuation <punctuation>     Use to set the punctuation that is used by options. Defaults to:
                                    ! "#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    --version                       Prints the version of demeuk.

Separating Options:
    -c --cut                        Specify if demeuk should split (default splits on ':'). Returns everything
                                    after the delimiter.
    --cut-before                    Specify if demeuk should return the string before the delimiter.
                                    When cutting, demeuk by default returns the string after the delimiter.
    -f --cut-fields <field>         Specifies the field to be returned, using the syntax of the 'cut' utility:
                                    N is the N-th field, N- is from the N-th field to the end of the line, N-M is
                                    from the N-th to the M-th field, and -M is from the start to the M-th field.
    -d --delimiter <delimiter>      Specify which delimiter will be used for cutting. Multiple delimiters can be
                                    specified using ','. If the ',' is required for cutting, escape it with a
                                    backslash. Only one delimiter can be used per line.

Check modules (check if a line matches a specific condition):
    --check-min-length <length>     Requires that entries contain at least <length> unicode chars
    --check-max-length <length>     Requires that entries contain at most <length> unicode chars
    --check-case                    Drop lines where the uppercase line is not equal to the lowercase line
    --check-controlchar             Drop lines containing control chars.
    --check-email                   Drop lines containing e-mail addresses.
    --check-hash                    Drop lines which are hashes.
    --check-mac-address             Drop lines which are MAC-addresses.
    --check-uuid                    Drop lines which are UUIDs.
    --check-non-ascii               If a line contains a non-ascii char, e.g. ü or ç (or anything outside the ascii
                                    range), the line is dropped.
    --check-replacement-character   Drop lines containing replacement characters '�'.
    --check-starting-with <string>  Drop lines starting with string, can be multiple strings. Specify multiple
                                    strings as a comma-separated list.
    --check-ending-with <string>    Drop lines ending with string, can be multiple strings. Specify multiple
                                    strings as a comma-separated list.
    --check-empty-line              Drop lines that are empty or only contain whitespace characters
    --check-regex <string>          Drop lines that do not match the regex. Regex is a comma-separated list of
                                    regexes. Example: [a-z]{1,8},[0-9]{1,8}

Modify modules (modify a line in place):
    --hex                           Replace lines like: $HEX[41424344] with ABCD.
    --html                          Replace lines like: &#351;ifreyok with şifreyok.
    --html-named                    Replace lines like: &alpha; Such structures also resemble passwords, so
                                    be careful when enabling this option.
    --lowercase                     Replace lines like 'This Test String' with 'this test string'
    --title-case                    Replace lines like 'this test string' with 'This Test String'
    --umlaut                        Replace lines like ko"ffie with an o with an umlaut.
    --mojibake                      Fixes mojibake, which means lines like SmˆrgÂs will be fixed to Smörgås.
    --encode                        Enables guessing of encoding, based on chardet and custom implementation.
    --tab                           Enables replacing tab char with ':', sometimes leaks contain both ':' and '\t'.
    --newline                       Enables removing newline characters (\r\n) from end and beginning of lines.
    --non-ascii                     Replace non-ascii chars with their replacement letters. For example ü
                                    becomes u, ç becomes c.
    --trim                          Enables removing newline representations from end and beginning. Newline
                                    representations detected are '\\n', '\\r', '\n', '\r', '<br>', and '<br />'.

Add modules (Modify a line, but keep the original as well):
    --add-lower                     If a line contains a capital letter, this will add the lower case variant
    --add-latin-ligatures           If a line contains a single ligature of a latin letter (such as ij), the line
                                    is corrected, but the original line containing the ligature is also added to
                                    the output.
    --add-split                     Split on known chars like - and . and add the parts to the final dictionary.
    --add-umlaut                    In some spelling dicts, umlauts are sometimes written as o" or i" and not as
                                    one char.
    --add-without-punctuation       If a line contains punctuation, a variant will be added without the
                                    punctuation

Remove modules (remove specific parts of a line):
    --remove-strip-punctuation      Remove starting and trailing punctuation
    --remove-punctuation            Remove all punctuation in a line
    --remove-email                  Enable email filter, this will catch strings like
                                    1238661:test@example.com:password

Macro modules:
    -g --googlengram                When set, demeuk will strip universal POS tags, like _NOUN_ or _ADJ
    --leak                          When set, demeuk will run the following modules:
                                        mojibake, encode, newline, check-controlchar
                                    This is recommended when working with leaks and was the default behavior in
                                    demeuk version 3.11.0 and below.
    --leak-full                     When set, demeuk will run the following modules:
                                        mojibake, encode, newline, check-controlchar,
                                        hex, html, html-named,
                                        check-hash, check-mac-address, check-uuid, check-email,
                                        check-replacement-character, check-empty-line

bin.demeuk.add_latin_ligatures(line)

Returns the line cleaned of latin ligatures if there are any.

Param:

line (unicode)

Returns:

False if there are not any latin ligatures, otherwise the corrected line

bin.demeuk.add_lower(line)

Returns the lowercase variant of the line if it differs from the line itself.

Param:

line (unicode)

Returns:

False if they are the same, otherwise the lowered string
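
The return convention above (False when nothing changes, a string otherwise) can be illustrated with a minimal sketch. It is not demeuk's code, and the name `lower_variant` is hypothetical.

```python
def lower_variant(line):
    """Hypothetical helper mirroring the documented return values."""
    lowered = line.lower()
    # Return False when lowering changes nothing, else the lowered string.
    return False if lowered == line else lowered


print(lower_variant('Password'))
print(lower_variant('password'))
```

Callers of such a function should compare the result with `is False` rather than relying on truthiness, since an empty string is also falsy.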

bin.demeuk.add_split(line, punctuation=(' ', '-', '\\.'))

Split the line on the punctuation and return elements longer than 1 char.

Param:

line (unicode)

Returns:

split line
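
The splitting behaviour can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_on_punctuation` is hypothetical, and the separators from the signature are assumed to be regular-expression fragments (which would explain the escaped dot). What demeuk returns when nothing can be split is not specified here.

```python
import re


def split_on_punctuation(line, punctuation=(' ', '-', r'\.')):
    """Hypothetical helper: split on any separator, keep parts longer than 1 char."""
    # Assumption: each separator is a regex fragment, so they can be OR-ed together.
    pattern = '|'.join(punctuation)
    return [part for part in re.split(pattern, line) if len(part) > 1]


print(split_on_punctuation('pass-word.a 12'))
```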

bin.demeuk.add_without_punctuation(line, punctuation)

Returns the line cleaned of punctuation.

Param:

line (unicode)

Returns:

False if there is no punctuation, otherwise the corrected line

bin.demeuk.check_case(line, ignored_chars=(' ', "'", '-'))

Checks if an uppercase line is equal to a lowercase line.

Param:

line (unicode)
ignored_chars list(string)

Returns:

true if uppercase line is equal to lowercase line

bin.demeuk.check_character(line, character)

Checks if a line contains a specific character

Params:

line (unicode)

Returns:

true if line does contain the specific character

bin.demeuk.check_controlchar(line)

Detects control chars, returns True when detected

Params:

line (Unicode)

Returns:

Status, String

bin.demeuk.check_email(line)

Check if lines contain e-mail addresses with a simple regex

Params:

line (unicode)

Returns:

true if line does not contain email

bin.demeuk.check_empty_line(line)

Checks if a line is empty or only contains whitespace chars

Params:

line (unicode)

Returns:

true if line is empty or only contains whitespace chars

bin.demeuk.check_ending_with(line, strings)

Checks if a line ends with specific strings

Params:

line (unicode)
strings[str]

Returns:

true if line does end with one of the strings

bin.demeuk.check_hash(line)

Check if a line contains a hash

Params:

line (unicode)

Returns:

true if line does not contain hash

bin.demeuk.check_length(line, min=0, max=0)

Does a length check on the line

Params:

line (unicode)
min (int)
max (int)

Returns:

true if length is ok
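
A length check of this kind might look like the sketch below. The name `length_ok` is hypothetical, and treating 0 as "no bound" is an assumption suggested by the default values in the signature, not something this reference states.

```python
def length_ok(line, min_length=0, max_length=0):
    """Hypothetical helper: check the number of unicode chars in a line."""
    # Assumption: a bound of 0 disables that side of the check.
    if min_length and len(line) < min_length:
        return False
    if max_length and len(line) > max_length:
        return False
    return True


print(length_ok('abc', min_length=4))
print(length_ok('ünï', max_length=3))
```

On a Python 3 `str`, `len()` counts code points rather than bytes, which matches the "unicode chars" wording of the length options.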

bin.demeuk.check_mac_address(line)

Check if a line contains a MAC-address

Params:

line (unicode)

Returns:

true if line does not contain a MAC-address

bin.demeuk.check_non_ascii(line)

Checks if a line contains non-ascii chars

Params:

line (unicode)

Returns:

true if line does not contain non ascii chars
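
For illustration, the same true/false convention can be reproduced with the standard library. The name `only_ascii` is hypothetical and this is a sketch, not necessarily how demeuk performs the check.

```python
def only_ascii(line):
    """Hypothetical helper: True when the line has no non-ascii chars."""
    # str.isascii() is available from Python 3.7 onwards.
    return line.isascii()


print(only_ascii('password'))
print(only_ascii('pässword'))
```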

bin.demeuk.check_regex(line, regex)

Checks if a line matches a list of regexes

Params:

line (unicode)
regex (list)

Returns:

true if all regexes match, false if the line does not match a regex
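
The "all regexes must match" rule can be sketched as below. The name `matches_all` is hypothetical, and using `re.search` (a match anywhere in the line) is an assumption; demeuk may anchor its patterns differently. The list is passed in already split, because patterns such as `[a-z]{1,8}` contain commas themselves.

```python
import re


def matches_all(line, regexes):
    """Hypothetical helper: True only when every regex is found in the line."""
    # Assumption: a regex "matches" when re.search finds it anywhere in the line.
    return all(re.search(regex, line) is not None for regex in regexes)


patterns = ['[a-z]{1,8}', '[0-9]{1,8}']
print(matches_all('abc123', patterns))
print(matches_all('abc', patterns))
```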

bin.demeuk.check_starting_with(line, strings)

Checks if a line starts with specific strings

Params:

line (unicode)
strings[str]

Returns:

true if line does start with one of the strings

bin.demeuk.check_uuid(line)

Check if a line contains a UUID

Params:

line (unicode)

Returns:

true if line does not contain a UUID
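
Note the inverted convention: the function returns true when no UUID is present. A minimal sketch of such a check follows; the pattern and the name `has_no_uuid` are assumptions for illustration, not demeuk's own regex.

```python
import re

# Assumption: the canonical 8-4-4-4-12 hexadecimal layout, case-insensitive.
UUID_PATTERN = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE,
)


def has_no_uuid(line):
    """Hypothetical helper: True when the line does not contain a UUID."""
    return UUID_PATTERN.search(line) is None


print(has_no_uuid('123e4567-e89b-12d3-a456-426614174000'))
print(has_no_uuid('password'))
```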

bin.demeuk.chunkify(fname, config, size=1048576)

bin.demeuk.clean_add_umlaut(line)

Returns the line cleaned of incorrect umlauting

Param:

line (unicode)

Returns:

Corrected line

bin.demeuk.clean_cut(line, delimiters, fields)

Finds the first delimiter and returns the remaining string either after or before the delimiter.

Params:

line (unicode)
delimiters list(unicode)
fields (unicode)

Returns:

line (unicode)
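
To make the 'cut'-style field specification concrete, here is a minimal sketch of field selection. It is an illustration only: the name `select_fields` is hypothetical, it handles a single delimiter, and it does not reproduce demeuk's handling of multiple delimiters.

```python
def select_fields(line, delimiter=':', fields='2-'):
    """Hypothetical helper: pick fields using the spec N, N-, N-M or -M."""
    parts = line.split(delimiter)
    if '-' in fields:
        start, end = fields.split('-', 1)
        first = int(start) if start else 1
        last = int(end) if end else len(parts)
    else:
        first = last = int(fields)
    # Fields are 1-based, as in the Unix cut utility.
    return delimiter.join(parts[first - 1:last])


print(select_fields('user:pass:extra', ':', '2-'))
print(select_fields('user:pass:extra', ':', '-2'))
```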

bin.demeuk.clean_encode(line, input_encoding)

Detects the encoding of a line and tries to decode it

Params:

line (bytes)

Returns:

Decoded UTF-8 string

bin.demeuk.clean_googlengram(line)

Removes part-of-speech tags from the line, specific to the googlengram module

Param:

line (unicode)

Returns:

line (unicode)

bin.demeuk.clean_hex(line)

Converts strings like ‘$HEX[]’ to proper binary

Params:

line (bytes)

Returns:

line (bytes)
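
The `$HEX[...]` conversion can be sketched with the standard library as shown below. The helper name `decode_hex_notation` and the choice to leave undecodable input untouched are assumptions made for this example.

```python
import re
from binascii import unhexlify

HEX_PATTERN = re.compile(rb'\$HEX\[([0-9a-fA-F]*)\]')


def decode_hex_notation(line):
    """Hypothetical helper: replace $HEX[...] in a bytes line with decoded bytes."""
    def _decode(match):
        digits = match.group(1)
        # Assumption: an odd number of digits cannot be decoded, so keep it as-is.
        return unhexlify(digits) if len(digits) % 2 == 0 else match.group(0)

    return HEX_PATTERN.sub(_decode, line)


print(decode_hex_notation(b'$HEX[41424344]'))
```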

bin.demeuk.clean_html(line)

Detects html encode chars and decodes them

Params:

line (Unicode)

Returns:

line (Unicode)
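
As an illustration of decoding numeric character references while leaving named ones alone, consider the sketch below. The name `decode_numeric_entities` is hypothetical and the restriction to decimal references is an assumption; demeuk's own matching rules may differ.

```python
import html
import re

# Assumption: only decimal references such as &#351; are handled here.
NUMERIC_ENTITY = re.compile(r'&#\d+;')


def decode_numeric_entities(line):
    """Hypothetical helper: decode &#NNN; references, leave named entities alone."""
    return NUMERIC_ENTITY.sub(lambda match: html.unescape(match.group(0)), line)


print(decode_numeric_entities('&#351;ifreyok'))
```

Python's `html.unescape` applied to the whole line would also decode named entities such as `&alpha;`, which is why the sketch restricts it to the matched references.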

bin.demeuk.clean_html_named(line)

Detects named html encode chars and decodes them

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_lowercase(line)

Replace all capitals with lowercase letters

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_mojibake(line)

Detects mojibake and tries to correct it. Mojibake are strings that were decoded with the wrong encoding. This results in strings like: Ãºnico which should be único.

Param:

line (str)

Returns:

Cleaned string
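
To show how such a string arises, the sketch below produces mojibake from a correct string and then reverses it. It only covers one fixed codec pair (UTF-8 read as Latin-1) and is not how demeuk detects or repairs mojibake in general.

```python
original = 'único'

# Encode as UTF-8, then decode with the wrong codec to produce mojibake.
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the two steps restores the text.
repaired = garbled.encode('latin-1').decode('utf-8')

print(garbled)
print(repaired)
```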

bin.demeuk.clean_newline(line)

Delete newline characters at start and end of line

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_non_ascii(line)

Replace non-ascii chars with their ascii representation.

Params:

line (Unicode)

Returns:

line (Unicode)
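
One standard-library way to approximate this replacement is Unicode decomposition, sketched below. This is a swapped-in technique for illustration, with a hypothetical name `strip_accents`; characters without a decomposition (for example ø) are simply dropped, which a transliteration table would handle differently.

```python
import unicodedata


def strip_accents(line):
    """Hypothetical helper: map accented letters to their ascii base letter."""
    # Decompose characters (ü becomes u plus a combining mark),
    # then drop everything that is not ascii.
    decomposed = unicodedata.normalize('NFKD', line)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents('über façade'))
```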

bin.demeuk.clean_tab(line)

Replace tab character with ‘:’ greedy

Params:

line (bytes)

Returns:

line (bytes)

bin.demeuk.clean_title_case(line)

Convert each word to title case (uppercasing the first letter)

Params:

line (Unicode)

Returns:

line (Unicode)

bin.demeuk.clean_trim(line)

Delete leading and trailing character sequences representing a newline from the beginning and end of the line.

Params:

line (Unicode)

Returns:

line (Unicode)
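
A trimming loop over the newline representations listed for the --trim option could look like the sketch below. The name `trim_newline_tokens` is hypothetical, and repeating until nothing changes is an assumption about how stacked representations should be treated.

```python
# The representations listed for --trim: escaped and literal newlines plus <br> tags.
NEWLINE_TOKENS = ('\\n', '\\r', '\n', '\r', '<br>', '<br />')


def trim_newline_tokens(line):
    """Hypothetical helper: strip newline representations from both ends."""
    changed = True
    while changed:
        changed = False
        for token in NEWLINE_TOKENS:
            if line.startswith(token):
                line = line[len(token):]
                changed = True
            if line.endswith(token):
                line = line[:-len(token)]
                changed = True
    return line


print(trim_newline_tokens('<br>password\\n\n'))
```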

bin.demeuk.clean_up(filename, chunk_start, chunk_size, config)

Main clean loop, this calls all the other clean functions.

Parameters:

line (bytes) – Line to be cleaned up

Returns:

(str(Decoded line), str(Failed line))

bin.demeuk.main()

bin.demeuk.remove_email(line)

Removes e-mail addresses from a line.

Params:

line (unicode)

Returns:

line (unicode)

bin.demeuk.remove_punctuation(line, punctuation)

Returns the line without punctuation

Param:

line (unicode)
punctuation (unicode)

Returns:

line without punctuation

bin.demeuk.remove_strip_punctuation(line, punctuation)

Returns the line without start and end punctuation

Param:

line (unicode)
punctuation (unicode)

Returns:

line without start and end punctuation
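
The difference between stripping punctuation at the ends and removing it everywhere can be sketched as follows. Both helper names are hypothetical, and the default character set is an assumption that mirrors the default listed for the --punctuation option (ASCII punctuation plus a space).

```python
import string

# Assumption: ASCII punctuation plus a space, as in the --punctuation default.
PUNCTUATION = string.punctuation + ' '


def strip_punctuation(line, punctuation=PUNCTUATION):
    """Hypothetical helper: remove punctuation from the start and end only."""
    return line.strip(punctuation)


def remove_all_punctuation(line, punctuation=PUNCTUATION):
    """Hypothetical helper: remove punctuation everywhere in the line."""
    return line.translate(str.maketrans('', '', punctuation))


print(strip_punctuation('!!pass-word?.'))
print(remove_all_punctuation('!!pass-word?.'))
```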

bin.demeuk.try_encoding(line, encoding)

Tries to decode a line using supplied encoding

Params:

line (Byte): byte variable that will be decoded
encoding (string): the encoding to be tried

Returns:

False if decoding failed, otherwise the decoded string
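
The decode-or-False convention can be illustrated with the short sketch below. The name `try_decode` is hypothetical, and catching `LookupError` for unknown codec names is an assumption added for robustness.

```python
def try_decode(line, encoding):
    """Hypothetical helper: return the decoded string, or False on failure."""
    try:
        return line.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers unknown codec names.
        return False


print(try_decode(b'caf\xc3\xa9', 'utf-8'))
print(try_decode(b'\xff', 'utf-8'))
```

Because an empty byte string decodes to an empty (falsy) string, callers of a function like this should test the result with `is False` rather than `not result`.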