API Reference
This chapter is for developers of demeuk. It documents the API functions.
Demeuk-api
Demeuk - a simple tool to clean up corpora
Usage:
demeuk [options]
Examples:
demeuk -i inputfile.tmp -o outputfile.dict -l logfile.txt
demeuk -i "inputfile*.txt" -o outputfile.dict -l logfile.txt
demeuk -i "inputdir/*" -o outputfile.dict -l logfile.txt
demeuk -i inputfile -o outputfile -j 24
demeuk -i inputfile -o outputfile -c -e
demeuk -i inputfile -o outputfile --threads all
cat inputfile | demeuk --leak -j all | sort -u > outputfile
Standard Options:
-i --input <path to file> Specify the input file to be cleaned, or provide a glob pattern.
(default: stdin)
-o --output <path to file> Specify the output file name. (default: stdout)
-l --log <path to file> Optional, specify where the log file needs to be written to (default: stderr)
-j --threads <threads> Optional, specify the number of threads to spawn. Specify the string 'all' to make
demeuk auto-detect the number of threads to start based on the number of CPUs
(default: all threads).
Note: threading costs some setup time, so it only speeds up larger files.
--input-encoding <encoding> Forces demeuk to decode the input using this encoding (default: en_US.UTF-8).
--output-encoding <encoding> Forces demeuk to encode the output using this encoding (default: en_US.UTF-8).
-v --verbose When set, prints some extra information to stderr and prints the
lines containing errors to the logfile.
--debug When set, the logfile will contain not only the lines which caused an error, but
also the lines which were modified.
--progress Prints out the progress of the demeuk process.
-n --limit <int> Limit the number of lines per thread.
-s --skip <int> Skip <int> lines per thread.
--punctuation <punctuation> Sets the punctuation that is used by options. Defaults to:
! "#$%&'()*+,-./:;<=>?@[\]^_`{|}~
--version Prints the version of demeuk.
Separating Options:
-c --cut Specify if demeuk should split (default splits on ':'). Returns everything
after the delimiter.
--cut-before Specify if demeuk should return the string before the delimiter.
When cutting, demeuk by default returns the string after the delimiter.
-f --cut-fields <field> Specifies the field to be returned, using the syntax of the 'cut' command:
N: the N-th field; N-: from the N-th field to the end of the line; N-M: from the N-th field to the
M-th field; -M: from the start to the M-th field.
-d --delimiter <delimiter> Specify which delimiter will be used for cutting. Multiple delimiters can be
specified using ','. If the ',' is required for cutting, escape it with a
backslash. Only one delimiter can be used per line.
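The field syntax described for --cut-fields can be illustrated with a small sketch. The helper name select_fields is hypothetical and not part of demeuk. The sketch assumes 1-based, inclusive field numbers as in the cut command, and demeuk's own parsing may differ in edge cases.

```python
def select_fields(line, delimiter, spec):
    """Return the fields of `line` chosen by a cut-style `spec`.

    Supported forms: 'N', 'N-', 'N-M' and '-M' (1-based, inclusive).
    Hypothetical helper for illustration, not demeuk's implementation.
    """
    fields = line.split(delimiter)
    if '-' not in spec:
        index = int(spec)
        # A single field; an out-of-range index yields an empty string here.
        return fields[index - 1] if index <= len(fields) else ''
    start, end = spec.split('-', 1)
    first = int(start) if start else 1          # '-M' starts at field 1
    last = int(end) if end else len(fields)     # 'N-' runs to the last field
    return delimiter.join(fields[first - 1:last])


print(select_fields('a:b:c:d', ':', '2'))    # b
print(select_fields('a:b:c:d', ':', '2-'))   # b:c:d
print(select_fields('a:b:c:d', ':', '2-3'))  # b:c
print(select_fields('a:b:c:d', ':', '-2'))   # a:b
```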
Check modules (check if a line matches a specific condition):
--check-min-length <length> Requires that entries have a minimum length of <length> unicode chars
--check-max-length <length> Requires that entries have a maximum length of <length> unicode chars
--check-case Drop lines where the uppercase line is not equal to the lowercase line
--check-controlchar Drop lines containing control chars.
--check-email Drop lines containing e-mail addresses.
--check-hash Drop lines which are hashes.
--check-mac-address Drop lines which are MAC-addresses.
--check-uuid Drop lines which are UUIDs.
--check-non-ascii If a line contains a non-ASCII char, e.g. ü or ç (or anything outside the ASCII
range), the line is dropped.
--check-replacement-character Drop lines containing replacement characters '�'.
--check-starting-with <string> Drop lines starting with string, can be multiple strings. Specify multiple
strings as a comma-separated list.
--check-ending-with <string> Drop lines ending with string, can be multiple strings. Specify multiple
strings as a comma-separated list.
--check-contains <string> Drop lines containing string, can be multiple strings. Specify multiple
strings as a comma-separated list.
--check-empty-line Drop lines that are empty or only contain whitespace characters
--check-regex <string> Drop lines that do not match the regex. Regex is a comma-separated list of
regexes. Example: [a-z]{1,8},[0-9]{1,8}
--check-min-digits <count> Require that entries contain at least <count> digits
(following the Python definition of a digit,
see https://docs.python.org/3/library/stdtypes.html#str.isdigit)
--check-max-digits <count> Require that entries contain at most <count> digits
(following the Python definition of a digit,
see https://docs.python.org/3/library/stdtypes.html#str.isdigit)
--check-min-uppercase <count> Require that entries contain at least <count> uppercase letters
(following the Python definition of uppercase,
see https://docs.python.org/3/library/stdtypes.html#str.isupper)
--check-max-uppercase <count> Require that entries contain at most <count> uppercase letters
(following the Python definition of uppercase,
see https://docs.python.org/3/library/stdtypes.html#str.isupper)
--check-min-specials <count> Require that entries contain at least <count> specials
(a special is defined as a non whitespace character which is not alphanumeric,
following the Python definitions of both,
see https://docs.python.org/3/library/stdtypes.html#str.isspace
and https://docs.python.org/3/library/stdtypes.html#str.isalnum)
--check-max-specials <count> Require that entries contain at most <count> specials
(a special is defined as a non whitespace character which is not alphanumeric,
following the Python definitions of both,
see https://docs.python.org/3/library/stdtypes.html#str.isspace
and https://docs.python.org/3/library/stdtypes.html#str.isalnum)
Modify modules (modify a line in place):
--hex Replace lines like: $HEX[41424344] with ABCD.
--html Replace lines like: &#351;ifreyok with şifreyok.
--html-named Replace lines like: &alpha; with α. Such strings can also occur in real passwords, so
be careful when enabling this option.
--lowercase Replace lines like 'This Test String' with 'this test string'
--title-case Replace lines like 'this test string' with 'This Test String'
--umlaut Replace lines like ko"ffie with köffie.
--mojibake Fixes mojibake, which means lines like SmˆrgÂs will be fixed to Smörgås.
--encode Enables guessing of encoding, based on chardet and custom implementation.
--tab Enables replacing tab chars with ':'. Sometimes leaks contain both ':' and '\t'.
--newline Enables removing newline characters (\r\n) from end and beginning of lines.
--non-ascii Replace non-ASCII chars with their replacement letters. For example ü
becomes u, ç becomes c.
--trim Enables removing newline representations from end and beginning. Newline
representations detected are '\\n', '\\r', '\n', '\r', '<br>', and '<br />'.
--transliterate <language> Transliterate a string, for example "ipsum" becomes "իպսում". The following
languages are supported: ka, sr, l1, ru, mn, uk, mk, el, hy and bg.
Add modules (Modify a line, but keep the original as well):
--add-lower If a line contains a capital letter, this will add the lowercase variant
--add-first-upper If a line does not contain a capital letter, this will add the capitalized variant
--add-title-case Add a line like 'this test string' also as 'This Test String'
--add-latin-ligatures If a line contains a ligature of Latin letters (such as ij), the line
is corrected, but the original line containing the ligature is also added to the output.
--add-split Split on known chars like - and . and add the parts to the final dictionary.
--add-umlaut In some spelling dicts, umlauts are sometimes written as o" or i" and not as
one char.
--add-without-punctuation If a line contains punctuation, a variant will be added without the
punctuation
Remove modules (remove specific parts of a line):
--remove-strip-punctuation Remove starting and trailing punctuation
--remove-punctuation Remove all punctuation in a line
--remove-email Enable email filter, this will catch strings like
1238661:test@example.com:password
Macro modules:
-g --googlengram When set, demeuk will strip universal POS tags, like _NOUN_ or _ADJ
--leak When set, demeuk will run the following modules:
mojibake, encode, newline, check-controlchar
This is recommended when working with leaks and was the default behavior in
demeuk version 3.11.0 and below.
--leak-full When set, demeuk will run the following modules:
mojibake, encode, newline, check-controlchar,
hex, html, html-named,
check-hash, check-mac-address, check-uuid, check-email,
check-replacement-character, check-empty-line
- bin.demeuk.add_first_upper(line)
Returns the line with the first letter capitalized and all the others in lowercase.
- Param:
line (unicode)
- Returns:
False if they are the same; the capitalized string if they are not
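The add functions share one convention: they return False when the variant would equal the input, and the variant otherwise, so the caller can keep the original and add the variant only when it differs. A minimal sketch of that convention, assuming str.capitalize matches the described behavior; add_first_upper_sketch is a stand-in, not the demeuk function itself.

```python
def add_first_upper_sketch(line):
    """Return the capitalized variant of `line`, or False if nothing changes."""
    capitalized = line.capitalize()  # first char upper, all others lower
    return capitalized if capitalized != line else False


output = []
for entry in ['password', 'Password']:
    output.append(entry)                      # the original is always kept
    variant = add_first_upper_sketch(entry)
    if variant is not False:                  # only add a variant that differs
        output.append(variant)

print(output)  # ['password', 'Password', 'Password']
```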
- bin.demeuk.add_latin_ligatures(line)
Returns the line cleaned of latin ligatures if there are any.
- Param:
line (unicode)
- Returns:
False if there are no latin ligatures; the corrected line otherwise
- bin.demeuk.add_lower(line)
Returns the lowercased line if it differs from the original line.
- Param:
line (unicode)
- Returns:
False if they are the same; the lowercased string if they are not
- bin.demeuk.add_split(line, punctuation=(' ', '-', '\\.'))
Split the line on the punctuation and return elements longer than 1 char.
- Param:
line (unicode)
- Returns:
split line
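The escaped dot in the default punctuation tuple suggests that the entries are treated as regular expression fragments. Under that assumption, the splitting could look like the sketch below. The name add_split_sketch is a stand-in for illustration, not the demeuk function.

```python
import re


def add_split_sketch(line, punctuation=(' ', '-', r'\.')):
    """Split `line` on any punctuation pattern and keep parts longer than 1 char."""
    pattern = '|'.join(punctuation)  # assumption: entries are regex fragments
    return [part for part in re.split(pattern, line) if len(part) > 1]


print(add_split_sketch('foo-bar.a baz'))  # ['foo', 'bar', 'baz']
```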
- bin.demeuk.add_title_case(line)
Returns title case string where all the first letters are capitals and all others in lowercase.
- Param:
line (unicode)
- Returns:
False if they are the same; the title-cased string if they are not
- bin.demeuk.add_without_punctuation(line, punctuation)
Returns the line cleaned of punctuation.
- Param:
line (unicode)
- Returns:
False if there is no punctuation; the corrected line otherwise
- bin.demeuk.check_case(line, ignored_chars=(' ', "'", '-'))
Checks if an uppercase line is equal to a lowercase line.
- Param:
line (unicode) ignored_chars list(string)
- Returns:
true if uppercase line is equal to lowercase line
- bin.demeuk.check_character(line, character)
Checks if a line contains a specific character
- Params:
line (unicode)
- Returns:
true if line does contain the specific character
- bin.demeuk.check_contains(line, strings)
Checks if a line contains any of the specified strings
- Params:
line (unicode) strings[str]
- Returns:
true if line does contain any one of the strings
- bin.demeuk.check_controlchar(line)
Detects control chars, returns True when detected
- Params:
line (Unicode)
- Returns:
Status, String
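The entry above returns a status together with a string. One way to produce such a pair is sketched below with the standard unicodedata module. Which Unicode categories demeuk treats as control characters is not stated here, so the sketch assumes only category Cc, and the name find_control_char is hypothetical.

```python
import unicodedata


def find_control_char(line):
    """Return (True, char) for the first control char found, else (False, None)."""
    for char in line:
        # Assumption: only general category 'Cc' counts as a control char.
        if unicodedata.category(char) == 'Cc':
            return True, char
    return False, None


print(find_control_char('pass\x00word'))  # (True, '\x00')
print(find_control_char('password'))      # (False, None)
```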
- bin.demeuk.check_email(line)
Check if lines contain e-mail addresses with a simple regex
- Params:
line (unicode)
- Returns:
true if line does not contain email
- bin.demeuk.check_empty_line(line)
Checks if a line is empty or only contains whitespace chars
- Params:
line (unicode)
- Returns:
true if line is empty or only contains whitespace chars
- bin.demeuk.check_ending_with(line, strings)
Checks if a line ends with specific strings
- Params:
line (unicode) strings[str]
- Returns:
true if line does end with one of the strings
- bin.demeuk.check_hash(line)
Check if a line contains a hash
- Params:
line (unicode)
- Returns:
true if line does not contain hash
- bin.demeuk.check_length(line, min=0, max=0)
Does a length check on the line
- Params:
line (unicode) min (int) max (int)
- Returns:
true if length is ok
- bin.demeuk.check_mac_address(line)
Check if a line contains a MAC-address
- Params:
line (unicode)
- Returns:
true if line does not contain a MAC-address
- bin.demeuk.check_non_ascii(line)
Checks if a line contains non-ASCII chars
- Params:
line (unicode)
- Returns:
true if line does not contain non ascii chars
- bin.demeuk.check_regex(line, regex)
Checks if a line matches a list of regexes
- Params:
line (unicode) regex (list)
- Returns:
true if all regexes match; false if line does not match regex
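A sketch of the all-patterns-must-match rule described above. It assumes that a pattern counts as matching when it is found anywhere in the line (re.search). Whether demeuk anchors its patterns is not stated here, and check_regex_sketch is a stand-in name.

```python
import re


def check_regex_sketch(line, regexes):
    """Return True only if every pattern in `regexes` is found in `line`."""
    # Assumption: re.search semantics (match anywhere in the line).
    return all(re.search(pattern, line) for pattern in regexes)


patterns = ['[a-z]{1,8}', '[0-9]{1,8}']
print(check_regex_sketch('abc123', patterns))  # True
print(check_regex_sketch('abcdef', patterns))  # False
```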
- bin.demeuk.check_starting_with(line, strings)
Checks if a line starts with one of the specified strings
- Params:
line (unicode) strings[str]
- Returns:
true if line does start with one of the strings
- bin.demeuk.check_uuid(line)
Check if a line contains a UUID
- Params:
line (unicode)
- Returns:
true if line does not contain a UUID
- bin.demeuk.chunkify(filename, size=1048576)
- bin.demeuk.clean_add_umlaut(line)
Returns the line cleaned of incorrect umlauting
- Param:
line (unicode)
- Returns:
Corrected line
- bin.demeuk.clean_cut(line, delimiters, fields)
Finds the first delimiter and returns the remaining string either after or before the delimiter.
- Params:
line (unicode) delimiters list(unicode) fields (unicode)
- Returns:
line (unicode)
- bin.demeuk.clean_encode(line, input_encoding)
Detects the encoding of the line and tries to decode it
- Params:
line (bytes)
- Returns:
Decoded UTF-8 string
- bin.demeuk.clean_googlengram(line)
Removes part-of-speech tags from the line; specific to the googlengram module
- Param:
line (unicode)
- Returns:
line (unicode)
- bin.demeuk.clean_hex(line)
Converts strings like ‘$HEX[]’ to proper binary
- Params:
line (bytes)
- Returns:
line (bytes)
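The $HEX[...] notation wraps hex digits that stand for raw bytes. A minimal sketch of the conversion on a bytes line follows. The regular expression and the choice to leave malformed groups untouched are assumptions of this sketch, and clean_hex_sketch is a stand-in name.

```python
import re

# Assumption: a group is '$HEX[' + hex digits + ']'.
HEX_PATTERN = re.compile(rb'\$HEX\[([0-9a-fA-F]*)\]')


def clean_hex_sketch(line):
    """Replace every $HEX[...] group in a bytes line with the bytes it encodes."""
    def decode(match):
        digits = match.group(1)
        if len(digits) % 2:          # odd number of digits: leave untouched
            return match.group(0)
        return bytes.fromhex(digits.decode('ascii'))
    return HEX_PATTERN.sub(decode, line)


print(clean_hex_sketch(b'$HEX[41424344]'))  # b'ABCD'
```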
- bin.demeuk.clean_html(line)
Detects html encode chars and decodes them
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_html_named(line)
Detects named html encode chars and decodes them
- Params:
line (Unicode)
- Returns:
line (Unicode)
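Python's standard html.unescape decodes both numeric references and named entities in one call. Since the two cases are documented as separate functions here, the sketch below also shows one way to decode numeric references only. The helper unescape_numeric_only and its regular expression are assumptions for illustration, not demeuk code.

```python
import html
import re

# Decimal (&#351;) or hexadecimal (&#x15f;) character references.
NUMERIC_ENTITY = re.compile(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+);')


def unescape_numeric_only(line):
    """Decode numeric character references and leave named entities untouched."""
    return NUMERIC_ENTITY.sub(lambda m: html.unescape(m.group(0)), line)


numeric_only = unescape_numeric_only('&#351;ifreyok &alpha;')
everything = html.unescape('&#351;ifreyok &alpha;')
print(numeric_only)  # şifreyok &alpha;
print(everything)    # şifreyok α
```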
- bin.demeuk.clean_lowercase(line)
Replace all capitals to lowercase
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_mojibake(line)
Detects mojibake and tries to correct it. Mojibake is text that was decoded with the wrong encoding. This results in strings like: Ãºnico which should be único.
- Param:
line (str)
- Returns:
Cleaned string
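The example above is the most common kind of mojibake: UTF-8 bytes that were read as Latin-1. The sketch below reverses exactly that one case with the standard library. It is not how demeuk detects mojibake in general, and fix_latin1_mojibake is a hypothetical name.

```python
def fix_latin1_mojibake(text):
    """Undo one round of 'UTF-8 bytes read as Latin-1'.

    Returns the text unchanged when it cannot be reinterpreted this way.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = 'único'.encode('utf-8').decode('latin-1')
print(broken)                        # Ãºnico
print(fix_latin1_mojibake(broken))   # único
print(fix_latin1_mojibake('único'))  # único (already correct, left alone)
```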
- bin.demeuk.clean_newline(line)
Delete newline characters at start and end of line
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_non_ascii(line)
Replace non-ASCII chars with their ASCII representation.
- Params:
line (Unicode)
- Returns:
line (Unicode)
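For accented letters, an ASCII representation can be approximated with the standard unicodedata module: decompose each character and drop the combining marks. This is a sketch under that assumption. It does not cover letters without a decomposition (for example ß or ø), and strip_accents_sketch is not the demeuk function.

```python
import unicodedata


def strip_accents_sketch(line):
    """Drop combining marks after NFKD decomposition, e.g. ü -> u, ç -> c."""
    decomposed = unicodedata.normalize('NFKD', line)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents_sketch('üç'))       # uc
print(strip_accents_sketch('Smörgås'))  # Smorgas
```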
- bin.demeuk.clean_tab(line)
Replace tab character with ‘:’ greedy
- Params:
line (bytes)
- Returns:
line (bytes)
- bin.demeuk.clean_title_case(line)
Replace words to title word (uppercasing first letter)
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_transliterate(line, language)
Transliterate a string
- Params:
line (Unicode) language (str)
- Returns:
line (Unicode)
- bin.demeuk.clean_trim(line)
Delete leading and trailing character sequences representing a newline from the beginning and end of the line.
- Params:
line (Unicode)
- Returns:
line (Unicode)
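The newline representations listed for the --trim option are '\\n', '\\r', '\n', '\r', '<br>' and '<br />'. A sketch of trimming them from both ends is given below. Repeating until nothing changes is an assumption of this sketch, and clean_trim_sketch is a stand-in name.

```python
# Literal backslash sequences, real newline characters, and HTML line breaks.
NEWLINE_REPRESENTATIONS = ('\\n', '\\r', '\n', '\r', '<br>', '<br />')


def clean_trim_sketch(line):
    """Strip newline representations from both ends until none remain."""
    changed = True
    while changed:
        changed = False
        for token in NEWLINE_REPRESENTATIONS:
            if line.startswith(token):
                line = line[len(token):]
                changed = True
            if line.endswith(token):
                line = line[:-len(token)]
                changed = True
    return line


print(clean_trim_sketch('<br>password\\n\r\n'))  # password
```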
- bin.demeuk.clean_up(lines)
Main clean loop, this calls all the other clean functions.
- Parameters:
line (bytes) – Line to be cleaned up
- Returns:
(str(Decoded line), str(Failed line))
- bin.demeuk.contains_at_least(line, bound, char_property)
Check if the line contains at least bound characters with given property.
- Params:
line (unicode) bound (int) char_property (str -> bool)
- Returns:
true if at least bound characters match; false otherwise
- bin.demeuk.contains_at_most(line, bound, char_property)
Check if the line contains at most bound characters with given property.
- Params:
line (unicode) bound (int) char_property (str -> bool)
- Returns:
true if at most bound characters match; false otherwise
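The char_property argument is a predicate on a single character, which is how one pair of functions can serve the digit, uppercase and specials checks. A minimal sketch follows, with stand-in function names and a hypothetical is_special predicate based on the definition given for --check-min-specials.

```python
def contains_at_least_sketch(line, bound, char_property):
    """True if at least `bound` characters of `line` satisfy `char_property`."""
    return sum(1 for c in line if char_property(c)) >= bound


def contains_at_most_sketch(line, bound, char_property):
    """True if at most `bound` characters of `line` satisfy `char_property`."""
    return sum(1 for c in line if char_property(c)) <= bound


def is_special(c):
    """A special: a non-whitespace character that is not alphanumeric."""
    return not c.isspace() and not c.isalnum()


print(contains_at_least_sketch('abc123', 3, str.isdigit))    # True
print(contains_at_most_sketch('Passw0rd!', 1, str.isupper))  # True
print(contains_at_least_sketch('Passw0rd!', 2, is_special))  # False
```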
- bin.demeuk.init_worker(config_data)
- bin.demeuk.main()
- bin.demeuk.remove_email(line)
Removes e-mail addresses from a line.
- Params:
line (unicode)
- Returns:
line (unicode)
- bin.demeuk.remove_punctuation(line, punctuation)
Returns the line without punctuation
- Param:
line (unicode) punctuation (unicode)
- Returns:
line without punctuation
- bin.demeuk.remove_strip_punctuation(line, punctuation)
Returns the line without start and end punctuation
- Param:
line (unicode)
- Returns:
line without start and end punctuation
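The difference between the two remove functions is whether only the ends or the whole line is affected. A sketch with standard string methods follows. Using string.punctuation as the default set is an assumption here, and the function names are stand-ins.

```python
import string


def remove_strip_punctuation_sketch(line, punctuation=string.punctuation):
    """Remove punctuation from the start and end of the line only."""
    return line.strip(punctuation)


def remove_punctuation_sketch(line, punctuation=string.punctuation):
    """Remove every punctuation character, wherever it occurs."""
    return line.translate(str.maketrans('', '', punctuation))


print(remove_strip_punctuation_sketch('!!pass-word!!'))  # pass-word
print(remove_punctuation_sketch('!!pass-word!!'))        # password
```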
- bin.demeuk.stderr_print(*args, **kwargs)
- bin.demeuk.try_encoding(line, encoding)
Tries to decode a line using supplied encoding
- Params:
line (Byte): byte variable that will be decoded; encoding (string): the encoding to be tried
- Returns:
False if decoding failed; the decoded string if decoding worked
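A sketch of the return-False-on-failure contract described above, with try_encoding_sketch as a stand-in name. Because an empty decoded string is also falsy, a caller following this contract would compare the result with `is False` rather than test its truthiness.

```python
def try_encoding_sketch(line, encoding):
    """Decode `line` (bytes) with `encoding`; return False if decoding fails."""
    try:
        return line.decode(encoding)
    except UnicodeDecodeError:
        return False


print(try_encoding_sketch(b'\xc3\xba', 'utf-8'))  # ú
print(try_encoding_sketch(b'\xfa', 'utf-8'))      # False
print(try_encoding_sketch(b'\xfa', 'latin-1'))    # ú
```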